CIUK 2024 Profiling

# Performance Profiling Notes

## Outline

1. Running a benchmark with a profiler
2. Sampling profilers (Nsight & VTune)
3. Collecting roofline data (advisor-roofline)

## Profilers in excalibur-tests

The `excalibur-tests` framework allows you to run a profiler together with a benchmark application. To do this, set the `profiler` attribute on the command line using the `-S profiler=...` syntax.

We support profilers that can be installed with Spack without a license and that don't require modifying the source code or the build system. Currently supported values for the `profiler` attribute are:

- `advisor-roofline`: produces a roofline model of your program using Intel Advisor;
- `nsight`: runs the code with the NVIDIA Nsight Systems profiler;
- `vtune`: runs the code with the Intel VTune profiler.

For more details, see the [User documentation](https://ukri-excalibur.github.io/excalibur-tests/use/).

Disclaimer: The authors are not affiliated with Intel or NVIDIA (but have found these tools reasonably useful in their day jobs).

## Profiling with Nsight

- NVIDIA Nsight Systems is a low-overhead sampling profiler that supports both CPU and GPU applications.
- It supports both x86 and ARM architectures.
- We collect `nsys profile --trace=cuda,mpi,nvtx,openmp,osrt,opengl,syscall`

```bash
reframe -c path/to/stream -r -S profiler=nsight
```

- Spack installs the `nvidia-nsight-systems` package in the background, including the GUI.
- The paths to the collected profile data and to the GUI launcher are written into `rfm_job.out`.
- To run the GUI remotely, you need to log in with `ssh -X`. It will be slow.
- You can install the GUI locally (for example with Spack) to view the data.

## Roofline analysis with Advisor

- Intel Advisor is a tool for on-node performance optimisation.
It analyses vectorisation, threading, memory usage, and accelerator offloading.
- Since ~2018 it has supported automated roofline analysis.
- It is restricted to the x86 CPU architecture.
- It won't run under the MPI launcher (because it does on-node analysis), so in our benchmarks we have to override the launcher. Advisor can run inside an MPI job on a single rank, but we don't currently support this; we hope to make it available in the future.

```python
from reframe.core.backends import getlauncher

# Inside the benchmark class: use the local launcher instead of the MPI one
@run_before('run')
def replace_launcher(self):
    self.job.launcher = getlauncher('local')()
```

- We collect `advisor -collect roofline`

```bash
reframe -c path/to/stream -r -S profiler=advisor-roofline
```

- As with Nsight, the GUI is installed by Spack but is slow to run remotely.

### Analysis

- We get 3 kernels (add, scale, triad). The copy kernel performs no FLOPs, so it doesn't show up.
- The second scale kernel entry is a check that STREAM runs before launching the kernels.
- We don't seem to be able to show source code on ARCHER2.
- As a backup, show the profile on a local (Intel) machine.
- Arithmetic intensity (AI) values for the kernels, in FLOPs per byte with 8-byte elements, are roughly:
  - Add (2 reads, 1 write, 1 FLOP): 1/24 = 0.0417
  - Scale (1 read, 1 write, 1 FLOP): 1/16 = 0.0625
  - Triad (2 reads, 1 write, 2 FLOPs): 2/24 = 0.0833
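
The AI arithmetic above can be reproduced with a short script. This is a minimal sketch, not part of `excalibur-tests`: the names `KERNELS`, `arithmetic_intensity` and `roofline_bound` are hypothetical, the 8-byte element size is an assumption implied by the 24- and 16-byte denominators, and the roofline bound uses the basic model (the minimum of peak compute and AI times memory bandwidth).

```python
BYTES_PER_DOUBLE = 8  # assumption: STREAM arrays hold 8-byte doubles

# (reads, writes, flops) per loop iteration, as listed in the notes
KERNELS = {
    "add": (2, 1, 1),
    "scale": (1, 1, 1),
    "triad": (2, 1, 2),
}


def arithmetic_intensity(reads, writes, flops, word_bytes=BYTES_PER_DOUBLE):
    """Return FLOPs per byte moved for one loop iteration."""
    return flops / ((reads + writes) * word_bytes)


def roofline_bound(ai, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s in the basic roofline model: min(peak, AI * bandwidth)."""
    return min(peak_gflops, ai * bandwidth_gbs)


# Arithmetic intensity of each kernel, keyed by kernel name
AI = {name: arithmetic_intensity(*counts) for name, counts in KERNELS.items()}

if __name__ == "__main__":
    for name, value in AI.items():
        print(f"{name:>5}: {value:.4f} FLOP/byte")
```

With purely illustrative machine numbers (not measurements), `roofline_bound(AI["scale"], peak_gflops=1000.0, bandwidth_gbs=100.0)` gives 6.25 GFLOP/s. Kernels with such low AI sit well inside the memory-bound region of the roofline plot.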