# Performance Profiling Notes
## Outline
1. Running a benchmark with a profiler
2. Sampling profilers (Nsight & VTune)
3. Collecting roofline data (advisor-roofline)
## Profilers in excalibur-tests
The `excalibur-tests` framework allows you to run a profiler together with a benchmark application. To do this, set the `profiler` attribute on the command line using the `-S profiler=...` syntax.
We support profilers that can be installed with Spack without a license and that don't require modifying the source code or the build system. Currently supported values for the `profiler` attribute are:
- `advisor-roofline`: it produces a roofline model of your program using Intel Advisor;
- `nsight`: it runs the code with the NVIDIA Nsight Systems profiler;
- `vtune`: it runs the code with the Intel VTune profiler.
For more details, see the [User documentation](https://ukri-excalibur.github.io/excalibur-tests/use/).
Disclaimer: The authors are not affiliated with Intel or NVIDIA (but have found these tools reasonably useful in their day jobs).
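As a minimal sketch of the `-S profiler=...` syntax described above, the helper below assembles the ReFrame command line and rejects values that are not in the supported list. The function name `reframe_command` and the constant `SUPPORTED_PROFILERS` are hypothetical and are not part of `excalibur-tests`; a real invocation may need further options (for example, a system or configuration flag).

```python
import shlex

# Profiler values listed in this section; this tuple is an assumption of the
# sketch and has to be kept in sync with the framework by hand.
SUPPORTED_PROFILERS = ("advisor-roofline", "nsight", "vtune")


def reframe_command(benchmark_path, profiler):
    """Build the argument list for a ReFrame run with a profiler attached."""
    if profiler not in SUPPORTED_PROFILERS:
        raise ValueError(
            f"unsupported profiler {profiler!r}; expected one of {SUPPORTED_PROFILERS}"
        )
    # -c selects the benchmark, -r runs it, -S sets the profiler attribute.
    return ["reframe", "-c", benchmark_path, "-r", "-S", f"profiler={profiler}"]


if __name__ == "__main__":
    # Print the command in a form that can be pasted into a shell.
    print(shlex.join(reframe_command("path/to/stream", "vtune")))
```

Returning a list rather than a string means the result can be passed directly to `subprocess.run` without shell quoting issues.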
## Profiling with Nsight
- NVIDIA Nsight Systems is a low-overhead sampling profiler that supports both CPU and GPU applications.
- Supports both x86 and ARM architectures
- We collect the profile with `nsys profile --trace=cuda,mpi,nvtx,openmp,osrt,opengl,syscall`
```bash
reframe -c path/to/stream -r -S profiler=nsight
```
- Spack installs the `nvidia-nsight-systems` package in the background, including the GUI.
- The paths to the collected profile data and to the GUI launcher are written into `rfm_job.out`.
- To run the GUI remotely, you need to log in with `ssh -X`. It will be slow.
- Alternatively, you can install the GUI locally (with Spack) to view the data.
## Roofline analysis with Advisor
- Intel Advisor is a tool for on-node performance optimisation. It analyses vectorization, threading, memory usage, and accelerator offloading.
- Since ~2018 it has had support for automated roofline analysis.
- It is restricted to x86 CPU architectures.
- It won't run under the MPI launcher (because it does on-node analysis), so in our benchmarks we have to override the launcher. It can run inside an MPI job on a single rank, but we don't currently support that; we hope to add it in the future.
```python
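from reframe.core.backends import getlauncher

# Inside the benchmark class: run the executable directly, not through the MPI launcher
@run_before('run')
def replace_launcher(self):
    self.job.launcher = getlauncher('local')()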
```
- We collect the data with `advisor -collect roofline`
```bash
reframe -c path/to/stream -r -S profiler=advisor-roofline
```
- Similar to Nsight, the GUI is installed by Spack but is slow to run remotely.
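
For background on reading the plot Advisor produces: in the standard roofline model, the attainable performance of a kernel is bounded by the smaller of the compute peak and the arithmetic intensity times the memory bandwidth. The sketch below expresses that bound. The machine numbers in it (1000 GFLOP/s peak, 100 GB/s bandwidth) are made-up placeholders, not measurements of any system mentioned here, and the function names are hypothetical.

```python
def attainable_gflops(ai, peak_gflops, bandwidth_gbs):
    """Roofline bound in GFLOP/s for a kernel with arithmetic intensity `ai` (FLOP/byte)."""
    # Memory-bound kernels are limited by ai * bandwidth,
    # compute-bound kernels by the peak floating-point rate.
    return min(peak_gflops, ai * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_gflops / bandwidth_gbs


if __name__ == "__main__":
    # Hypothetical machine: placeholder values for illustration only.
    peak, bandwidth = 1000.0, 100.0
    print("ridge point:", ridge_point(peak, bandwidth), "FLOP/byte")
    for ai in (0.0625, 20.0):
        print(f"AI = {ai}: at most {attainable_gflops(ai, peak, bandwidth)} GFLOP/s")
```

Kernels whose arithmetic intensity lies below the ridge point sit under the sloped (memory) part of the roof, which is where low-intensity kernels such as those in STREAM are expected to appear.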
### Analysis
- We get 3 kernels (add, scale, triad). The copy kernel performs no FLOPs, so it doesn't show up.
- The second scale kernel entry is a check that STREAM runs before launching the kernels.
- We don't seem to be able to show source code on ARCHER2.
- As a backup, show the profile on a local (Intel) machine.
- Arithmetic intensity (AI) values for the kernels, in FLOP/byte with 8-byte elements, are roughly:
    - Add (2 reads, 1 write, 1 FLOP): 1/24 = 0.0417
    - Scale (1 read, 1 write, 1 FLOP): 1/16 = 0.0625
    - Triad (2 reads, 1 write, 2 FLOPs): 2/24 = 0.0833
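
The arithmetic behind these values can be reproduced with a short sketch. It assumes 8-byte (double-precision) array elements and counts one memory access per array operand per iteration, which is what the 24-byte and 16-byte denominators imply. The names `arithmetic_intensity` and `KERNELS` are hypothetical, and the copy entry is added only to show why that kernel has zero intensity.

```python
BYTES_PER_ELEMENT = 8  # assumption: double-precision arrays


def arithmetic_intensity(reads, writes, flops, bytes_per_element=BYTES_PER_ELEMENT):
    """FLOPs per byte moved for one loop iteration of a streaming kernel."""
    bytes_moved = (reads + writes) * bytes_per_element
    return flops / bytes_moved


# (reads, writes, FLOPs) per iteration for each STREAM kernel.
KERNELS = {
    "copy": (1, 1, 0),
    "add": (2, 1, 1),
    "scale": (1, 1, 1),
    "triad": (2, 1, 2),
}

if __name__ == "__main__":
    for name, (reads, writes, flops) in KERNELS.items():
        ai = arithmetic_intensity(reads, writes, flops)
        print(f"{name:>5}: {ai:.4f} FLOP/byte")
```

This simple count ignores cache effects, so intensities measured by a profiler may differ somewhat from these estimates.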