# Durham Performance Workshop 2023
- [Workshop website](https://tobiasweinzierl.webspace.durham.ac.uk/research/workshops/performance-analysis-workshop-series-2023/)
- [Performance tools docs](https://docs.nersc.gov/tools/performance/)
## 18 May Wrap-up
- The Fortran `CONTIGUOUS` attribute on array segments (assumed-shape arguments, pointers) tells the compiler the data is contiguous in memory, so it can vectorize
- `mpiexec --bind-to` to control pinning
- The VTune ITT API is useful for annotating code regions
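As a sketch of the pinning control mentioned above (flag names as in Open MPI's `mpiexec`; Intel MPI's hydra launcher has an equivalent `-bind-to`, and `I_MPI_DEBUG=4` prints the chosen pinning), this is a cluster command fragment, not something to run locally:

```shell
# Pin one MPI rank per core and print the resulting bindings
# (Open MPI flags; adjust for your launcher). Binary path is illustrative.
mpiexec --bind-to core --report-bindings -np 8 ./Conquest_default
```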
## CONQUEST
### Compiling on COSMA
- They recommend building your code with the Intel toolchain for the workshop, presumably because we are going to use Intel performance tools :heart:
- I ran into compatibility issues with the latest oneAPI module: there was no compatible `fftw3` module for it. As a work-around, Alastair recommended loading the modules you want separately instead of the oneAPI bundle. For building CONQUEST I loaded these modules on top of the defaults:
```bash
# Modules for building CONQUEST in performance analysis workshop 2023
module load intel_comp/2022.3.0 compiler mpi mkl
```
UPDATE: we don't need ~~`module load fftw/3.3.10cosma8`~~; it seems that `mkl` provides an `fftw3` interface. That gets rid of the `fftw3` version clashes, so `module load oneAPI/2022.3.0` should work instead of the list above.
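To sanity-check the `fftw3` interface claim: MKL ships FFTW3 wrapper headers under `$MKLROOT` (the `include/fftw` path is the usual MKL layout, so verify it on COSMA after loading `mkl`):

```shell
# List MKL's FFTW3 interface headers (fftw3.f / fftw3.h wrappers).
# Requires the mkl module to be loaded so $MKLROOT is set.
ls ${MKLROOT}/include/fftw/
```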
- The `blas`, `lapack`, and `scalapack` dependencies of CONQUEST are all included in `mkl`. However, linking against `mkl` is a bit painful.
- There is an [online tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html#gs.v7mi02) for generating the linker commands
- After fiddling with the options, the following worked:
- Link line: `${MKLROOT}/lib/intel64/libmkl_blas95_lp64.a ${MKLROOT}/lib/intel64/libmkl_lapack95_lp64.a -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm -ldl`
- Compiler options: `-I${MKLROOT}/include/intel64/lp64 -I"${MKLROOT}/include"`
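For reference, this is roughly how the advisor's output can land in CONQUEST's `system.make` (a sketch only; the variable names here are assumptions, so match them against your own `system.make`):

```
# Sketch only: variable names may differ in your system.make
COMPFLAGS = -O3 -I${MKLROOT}/include/intel64/lp64 -I"${MKLROOT}/include"
LIBS = ${MKLROOT}/lib/intel64/libmkl_blas95_lp64.a \
       ${MKLROOT}/lib/intel64/libmkl_lapack95_lp64.a \
       -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 \
       -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 \
       -liomp5 -lpthread -lm -ldl
```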
- I noticed CONQUEST mixes explicitly kinded integers with the default integer kind. The default integer kind has to be 32-bit (`i4`), otherwise you get compile errors. That is usually the compiler default, so you shouldn't need to worry about it, but don't change it!
- We get MPI errors like this when running on the `bluefield` partition with the Intel toolchain (compiler, mpi):
```
Abort(337232527) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Recv: Other MPI error, error stack:
PMPI_Recv(171).................: MPI_Recv(buf=0x7ffc90f04190, count=928, MPI_INTEGER, src=1, tag=1, MPI_COMM_WORLD, status=0xcabdf0) failed
MPID_Recv(650).................:
MPIDI_recv_unsafe(164).........:
MPIDI_OFI_handle_cq_error(1106): OFI poll failed (ofi_events.c:1106:MPIDI_OFI_handle_cq_error:Truncation error)
```
- The error does not happen on `cosma8` login nodes
- Running on the `cosma7` partition works without errors. Jobs get demoted to a low-priority queue but seem to start right away.
- Building with the GNU compiler is difficult because of a lack of compatible modules: there is `openblas`, but no `lapack` or `scalapack`, and the choice of `openmpi` and `fftw` versions is limited.
- There is a version of `openmpi` compatible with the Intel 2018 compiler; I haven't tried it yet.
### Profiling
#### VTune
:::spoiler Application Performance Snapshot

:::
- Not a huge amount of info here. Note that I ran this with 4 threads, but we don't expect the code to use them.
:::spoiler Job script to run vtune collections as an array job on cosma7
```bash
#!/bin/bash -l
#SBATCH --ntasks 8 # Number of MPI ranks
#SBATCH --cpus-per-task 1 # Cores per rank
#SBATCH -J conquest_test #Give it something meaningful.
#SBATCH -o standard_output_file.%J.out
#SBATCH -e standard_error_file.%J.err
#SBATCH -p cosma7 #or some other partition, e.g. cosma, cosma6, etc.
#SBATCH -A do009 #e.g. dp004
#SBATCH --exclusive
#SBATCH -t 0:30:00
#SBATCH --array=1-3
module purge
module load intel_comp/2022.3.0 compiler mpi mkl vtune
module load fftw/3.3.10cosma8
module load ucx/1.13.0rc2
APP=/cosma/home/do008/dc-kosk1/CONQUEST-release/bin/Conquest_default
export OMP_NUM_THREADS=1
# Add collections you want to run to this array
collections=("hotspots" "hpc-performance" "memory-access")
CMD=${collections[$(($SLURM_ARRAY_TASK_ID-1))]}
# I_MPI_GTOOL sets what vtune will collect and what MPI rank to target
export I_MPI_GTOOL="vtune -c $CMD -r $CMD:0"
mpiexec -np $SLURM_NTASKS $APP
```
:::
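The `--array`/`SLURM_ARRAY_TASK_ID` indexing in the script above can be checked in isolation (the task-ID value below is a stand-in; Slurm exports the real one inside an array job):

```shell
# Map the 1-based SLURM_ARRAY_TASK_ID onto the 0-based bash array of collections.
collections=("hotspots" "hpc-performance" "memory-access")
SLURM_ARRAY_TASK_ID=2   # stand-in; Slurm sets this in a real array job
CMD=${collections[$((SLURM_ARRAY_TASK_ID - 1))]}
echo "$CMD"   # prints the collection chosen for this array task
```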
- This runs on 8 ranks, but VTune only profiles rank 0, via `I_MPI_GTOOL`. We would expect the MPI ranks to have identical performance, although this is also worth checking.
- I tried to run the `hotspots`, `hpc-performance`, and `memory-access` collections; however, I got no output from `memory-access`.
- There is also a `threading` collection that may be useful later.
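If we do want to profile more than rank 0, `I_MPI_GTOOL` accepts a rank set after the colon (per Intel MPI's documented `"<tool command>:<ranks>"` format; the rank numbers and result-directory name here are just examples):

```shell
# Collect hotspots on ranks 0 and 4 only; other ranks run uninstrumented.
export I_MPI_GTOOL="vtune -collect hotspots -r hotspots_r0_r4:0,4"
mpiexec -np $SLURM_NTASKS $APP
```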
:::spoiler hotspots summary

:::
:::spoiler hotspots bottom-up

:::
- The bottom-up view shows most time being spent in `calc_matrix_elements_module` and `angular_coeff_routines`, both in memory access rather than compute.
- Time spent in fft routines is very small.
:::spoiler hotspots Caller/Callee for FFTs

:::
- Shows very little time spent in FFTs: 1.6% of total time in the `fft3` subroutine
- Of all the time spent in `fft_module`, 74% goes to `rearrange_data`
- Only 26% is spent actually calling `fftw`!
- From the Callees view, we can see the call tree going down to `MKL FFT`
:::spoiler hotspots Callers tree from MKL FFT.

:::
:::spoiler hpc-performance summary

:::
- Vectorization seems pretty good: 61% of FP operations are vectorized. The remaining 39% scalar operations are flagged; it may be worth looking at where they occur. It is not clear how much of this is in libraries and how much in CONQUEST.
- The FP-read arithmetic intensity of 0.45 is flagged, but again does not sound terrible. Worth having a look with Advisor?
:::spoiler memory-access
No data yet :unamused:
:::
#### MAQAO
:::spoiler Global

:::
:::spoiler Application Categorization

:::
- 37% in math libraries
:::spoiler Scalability

:::
- Extra threads are doing nothing
:::spoiler Hot Functions (including libraries)

:::
:::spoiler Hot loops (not including libraries?)

:::
#### SCALASCA/SCOREP
- Scalasca has not been compiled on COSMA/DINE for the 2022 Intel compiler we were using previously.
- The build works with the same `system.make` file as before by loading `oneAPI/2021.3.0` instead of `intel_comp/2022.3.0 compiler mpi mkl`
- Then `module load scalasca/2.6.1 scorep/8.1`
- In `system.make` I changed
```
# Set compilers
FC=scorep --user mpif90
F77=scorep --user mpif77
```
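With the instrumented build, the usual Scalasca workflow should then be roughly as follows (Scalasca 2.x commands; the experiment directory name is an example, since Score-P generates it from the binary name and rank count):

```shell
# Run the instrumented binary under Scalasca's measurement collection,
# then post-process and open the experiment in Cube.
scalasca -analyze mpiexec -np 8 ./Conquest_default
scalasca -examine scorep_Conquest_default_8_sum   # name will differ per run
```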
- Building Cube locally was a huge pain. I finally managed with Spack:
- Use a pre-installed Qt. I installed it on my Ubuntu machine using [these instructions](https://askubuntu.com/questions/1404263/how-do-you-install-qt-on-ubuntu22-04)
- Note that I did not have to set the Qt version manually. However, there was a `qmake` in my `/usr/bin` that was in reality a link to another executable; the `qmake` Spack needs was installed in `/usr/lib/qt5/bin/`. Once I removed the link from `/usr/bin` and added `/usr/lib/qt5/bin` to `$PATH`, `spack install` ran successfully with the spec `cube%gcc@11.3.0 ^dbus@1.12.20 ^qt@5.15.3`
- I also noticed that on Ubuntu, `apt` installs many packages to `/usr/lib/x86_64-linux-gnu`, and `spack external find` does a poor job of finding them, even when `/usr/lib/x86_64-linux-gnu` is in `$PATH`.
:::spoiler Score-p profiling

:::
- Seems to point to the same issues as the previous profiling with VTune and MAQAO. TODO: investigate further