# Durham Performance Workshop 2023
- [Workshop website](https://tobiasweinzierl.webspace.durham.ac.uk/research/workshops/performance-analysis-workshop-series-2023/)
- [Performance tools docs](https://docs.nersc.gov/tools/performance/)
## 18 May Wrap-up
- The Fortran `CONTIGUOUS` attribute on array segments (assumed-shape arguments, pointers) tells the compiler the data is contiguous in memory, so it can vectorize
- `mpiexec --bind-to` to control pinning
- The VTune ITT API is useful for annotating code regions
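As a sketch of the pinning control mentioned above (flag names as in Open MPI's `mpiexec`; Intel MPI's hydra launcher has an equivalent `-bind-to`, and `I_MPI_DEBUG=4` prints the chosen pinning), this is a cluster command fragment, not something to run locally:

```shell
# Pin one MPI rank per core and print the resulting bindings
# (Open MPI flags; adjust for your launcher). Binary path is illustrative.
mpiexec --bind-to core --report-bindings -np 8 ./Conquest_default
```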
## CONQUEST
### Compiling on COSMA
- They recommend building your code with the Intel toolchain for the workshop, presumably because we are going to use Intel performance tools :heart:
- I ran into compatibility issues with the latest oneAPI module: there was no compatible `fftw3` module for it. As a work-around, Alastair recommended loading the modules you want separately instead of the oneAPI bundle. For building CONQUEST I loaded these modules on top of the defaults:
```bash
# Modules for building CONQUEST in performance analysis workshop 2023
module load intel_comp/2022.3.0 compiler mpi mkl
```
UPDATE: we don't need ~~`module load fftw/3.3.10cosma8`~~; it seems that `mkl` provides an `fftw3` interface. That gets rid of the `fftw3` version clashes, so `module load oneAPI/2022.3.0` should work instead of the list above.
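To sanity-check the `fftw3` interface claim: MKL ships FFTW3 wrapper headers under `$MKLROOT` (the `include/fftw` path is the usual MKL layout, so verify it on COSMA after loading `mkl`):

```shell
# List MKL's FFTW3 interface headers (fftw3.f / fftw3.h wrappers).
# Requires the mkl module to be loaded so $MKLROOT is set.
ls ${MKLROOT}/include/fftw/
```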
- The `blas`, `lapack`, and `scalapack` dependencies of CONQUEST are all included in `mkl`. However, linking against `mkl` is a bit painful.
- There is an [online tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html#gs.v7mi02) for generating the linker commands
- After fiddling with the options, the following worked:
- Link line: `${MKLROOT}/lib/intel64/libmkl_blas95_lp64.a ${MKLROOT}/lib/intel64/libmkl_lapack95_lp64.a -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm -ldl`
- Compiler options: `-I${MKLROOT}/include/intel64/lp64 -I"${MKLROOT}/include"`
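For reference, this is roughly how the advisor's output can land in CONQUEST's `system.make` (a sketch only; the variable names here are assumptions, so match them against your own `system.make`):

```
# Sketch only: variable names may differ in your system.make
COMPFLAGS = -O3 -I${MKLROOT}/include/intel64/lp64 -I"${MKLROOT}/include"
LIBS = ${MKLROOT}/lib/intel64/libmkl_blas95_lp64.a \
       ${MKLROOT}/lib/intel64/libmkl_lapack95_lp64.a \
       -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 \
       -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 \
       -liomp5 -lpthread -lm -ldl
```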
- I noticed CONQUEST mixes explicitly kinded integers with the default integer kind. The default integer kind has to be 32-bit (`i4`), otherwise you get compile errors. That is usually the compiler default, so you shouldn't need to worry about it, but don't change it!
- We get MPI errors like this when running on the `bluefield` partition with the Intel toolchain (compiler, mpi):
```
Abort(337232527) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Recv: Other MPI error, error stack:
PMPI_Recv(171).................: MPI_Recv(buf=0x7ffc90f04190, count=928, MPI_INTEGER, src=1, tag=1, MPI_COMM_WORLD, status=0xcabdf0) failed
MPID_Recv(650).................:
MPIDI_recv_unsafe(164).........:
MPIDI_OFI_handle_cq_error(1106): OFI poll failed (ofi_events.c:1106:MPIDI_OFI_handle_cq_error:Truncation error)
```
- The error does not happen on `cosma8` login nodes
- Running on the `cosma7` partition works without errors. Jobs get demoted to a low-priority queue but seem to start right away.
- Building with the GNU compiler is difficult because of a lack of compatible modules: there is `openblas`, but no `lapack` or `scalapack`, and the choice of `openmpi` and `fftw` versions is limited.
- There is a version of `openmpi` compatible with the Intel 2018 compiler; I haven't tried it yet.
### Profiling
#### VTune
:::spoiler Application Performance Snapshot

:::
- Not a huge amount of info here. Note that I ran this with 4 threads, but we don't expect the code to use them.
:::spoiler Job script to run vtune collections as an array job on cosma7
```bash
#!/bin/bash -l
#SBATCH --ntasks 8 # Number of MPI ranks
#SBATCH --cpus-per-task 1 # Cores per rank
#SBATCH -J conquest_test #Give it something meaningful.
#SBATCH -o standard_output_file.%J.out
#SBATCH -e standard_error_file.%J.err
#SBATCH -p cosma7 #or some other partition, e.g. cosma, cosma6, etc.
#SBATCH -A do009 #e.g. dp004
#SBATCH --exclusive
#SBATCH -t 0:30:00
#SBATCH --array=1-3
module purge
module load intel_comp/2022.3.0 compiler mpi mkl vtune
module load fftw/3.3.10cosma8
module load ucx/1.13.0rc2
APP=/cosma/home/do008/dc-kosk1/CONQUEST-release/bin/Conquest_default
export OMP_NUM_THREADS=1
# Add collections you want to run to this array
collections=("hotspots" "hpc-performance" "memory-access")
CMD=${collections[$(($SLURM_ARRAY_TASK_ID-1))]}
# I_MPI_GTOOL sets what vtune will collect and what MPI rank to target
export I_MPI_GTOOL="vtune -c $CMD -r $CMD:0"
mpiexec -np $SLURM_NTASKS $APP
```
:::
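The `--array`/`SLURM_ARRAY_TASK_ID` indexing in the script above can be checked in isolation (the task-ID value below is a stand-in; Slurm exports the real one inside an array job):

```shell
# Map the 1-based SLURM_ARRAY_TASK_ID onto the 0-based bash array of collections.
collections=("hotspots" "hpc-performance" "memory-access")
SLURM_ARRAY_TASK_ID=2   # stand-in; Slurm sets this in a real array job
CMD=${collections[$((SLURM_ARRAY_TASK_ID - 1))]}
echo "$CMD"   # prints the collection chosen for this array task
```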
- This runs on 8 ranks, but VTune only profiles rank 0, via `I_MPI_GTOOL`. We would expect the MPI ranks to have identical performance, although this is also worth checking.
- I tried to run the `hotspots`, `hpc-performance`, and `memory-access` collections; however, I got no output from `memory-access`.
- There is also a `threading` collection that may be useful later.
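If we do want to profile more than rank 0, `I_MPI_GTOOL` accepts a rank set after the colon (per Intel MPI's documented `"<tool command>:<ranks>"` format; the rank numbers and result-directory name here are just examples):

```shell
# Collect hotspots on ranks 0 and 4 only; other ranks run uninstrumented.
export I_MPI_GTOOL="vtune -collect hotspots -r hotspots_r0_r4:0,4"
mpiexec -np $SLURM_NTASKS $APP
```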
:::spoiler hotspots summary

:::
:::spoiler hotspots bottom-up

:::
- The bottom-up view shows most time being spent in `calc_matrix_elements_module` and `angular_coeff_routines`, both in memory access rather than compute.
- Time spent in fft routines is very small.
:::spoiler hotspots Caller/Callee for FFTs

:::
- Shows very little time spent in FFTs: 1.6% of total time in the `fft3` subroutine
- Of all the time spent in `fft_module`, 74% goes to `rearrange_data`
- Only 26% is spent actually calling `fftw`!
- From the Callees view, we can see the call tree going down to `MKL FFT`
:::spoiler hotspots Callers tree from MKL FFT.

:::
:::spoiler hpc-performance summary

:::
- Vectorization seems pretty good: 61% of FP operations are vectorized. The remaining 39% scalar operations are flagged; it may be worth looking at where they occur. It is not clear how much of this is in libraries and how much in CONQUEST.
- The FP-read arithmetic intensity of 0.45 is flagged, but again does not sound terrible. Worth having a look with Advisor?
:::spoiler memory-access
No data yet :unamused:
:::
#### MAQAO
:::spoiler Global

:::
:::spoiler Application Categorization

:::
- 37% in math libraries
:::spoiler Scalability

:::
- Extra threads are doing nothing
:::spoiler Hot Functions (including libraries)

:::
:::spoiler Hot loops (not including libraries?)

:::
#### SCALASCA/SCOREP
- Scalasca has not been compiled on COSMA/DINE for the 2022 Intel compiler we were using previously.
- The build works with the same `system.make` file as before by loading `oneAPI/2021.3.0` instead of `intel_comp/2022.3.0 compiler mpi mkl`
- Then `module load scalasca/2.6.1 scorep/8.1`
- In `system.make` I changed
```
# Set compilers
FC=scorep --user mpif90
F77=scorep --user mpif77
```
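With the instrumented build, the usual Scalasca workflow should then be roughly as follows (Scalasca 2.x commands; the experiment directory name is an example, since Score-P generates it from the binary name and rank count):

```shell
# Run the instrumented binary under Scalasca's measurement collection,
# then post-process and open the experiment in Cube.
scalasca -analyze mpiexec -np 8 ./Conquest_default
scalasca -examine scorep_Conquest_default_8_sum   # name will differ per run
```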
- Building Cube locally was a huge pain. I finally managed with Spack:
- Use a pre-installed Qt. I installed it on my Ubuntu machine using [these instructions](https://askubuntu.com/questions/1404263/how-do-you-install-qt-on-ubuntu22-04)
- Note that I did not have to set the Qt version manually. However, there was a `qmake` in my `/usr/bin` that was in reality a link to another executable; the `qmake` Spack needs was installed in `/usr/lib/qt5/bin/`. Once I removed the link from `/usr/bin` and added `/usr/lib/qt5/bin` to `$PATH`, `spack install` ran successfully with the spec `cube%gcc@11.3.0 ^dbus@1.12.20 ^qt@5.15.3`
- I also noticed that on Ubuntu, `apt` installs many packages to `/usr/lib/x86_64-linux-gnu`, and `spack external find` does a poor job of finding them, even when `/usr/lib/x86_64-linux-gnu` is in `$PATH`.
:::spoiler Score-p profiling

:::
- Seems to point to the same issues as the previous profiling with VTune and MAQAO. TODO: investigate further