# Omnitools: Performance Analysis Tools for AMD GPUs, Adastra training
## Omnitrace
* Load Omnitrace
Load the appropriate modules on Adastra, along with your project environment (if any); the instructions below are generic
* Allocate resources with `salloc`
`salloc -N 1 --ntasks=1 --partition=small-g --gpus=1 -A project_465000532 --time=00:35:00`
* Check the various options and their values; the second command additionally prints a brief description of each option
`srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace`
`srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief --description`
* Create an Omnitrace configuration file with a description for each option
`srun -n 1 omnitrace-avail -G omnitrace.cfg --all`
* Declare the configuration file to be used:
`export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg`
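For orientation, the configuration steps can be combined into a small shell sketch; the file location `$HOME/omnitrace.cfg` is an assumption, any writable path works:
```
# Generate a fully documented config file once, then point Omnitrace to it
# ($HOME/omnitrace.cfg is an assumed location, not a required one).
srun -n 1 omnitrace-avail -G $HOME/omnitrace.cfg --all
export OMNITRACE_CONFIG_FILE=$HOME/omnitrace.cfg

# Sanity check: the listed settings should now reflect your config file.
srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief
```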
* Get the training examples:
`cp -r /project/project_465000532/exercises/HPCTrainingExamples/ .`
<!--
* Compile and execute saxpy
* `cd HPCTrainingExamples/HIP/saxpy`
* `hipcc --offload-arch=gfx90a -O3 -o saxpy saxpy.cpp`
* `time srun -n 1 ./saxpy`
* Check the duration
-->
* Compile and execute Jacobi
* `cd HPCTrainingExamples/HIP/jacobi`
* `make clean;make`
  * The produced binary is `Jacobi_hip`
<!--
* Need to make some changes to the makefile
* ``MPICC=$(PREP) `which CC` ``
* `MPICFLAGS+=$(CFLAGS) -I${CRAY_MPICH_PREFIX}/include`
* `MPILDFLAGS+=$(LDFLAGS) -L${CRAY_MPICH_PREFIX}/lib -lmpich`
* comment out
* ``# $(error Unknown MPI version! Currently can detect mpich or open-mpi)``
-->
* Now execute the binary
  * `time srun -n 1 --gpus 1 ./Jacobi_hip -g 1 1`
* Check the duration
### Binary rewriting (recommended for MPI codes; reduces overhead)
* Allocation: `salloc -N 1 --ntasks=2 --partition=small-g --gpus=2 -A project_465000532 --time=00:35:00`
* Binary rewriting: `srun -n 1 --gpus 1 omnitrace-instrument -v -1 --print-available functions -o jacobi.inst -- ./Jacobi_hip`
or
`srun -n 1 --gpus 1 omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip`
* This creates a new instrumented binary called `jacobi.inst`
* Execute the new instrumented binary: `time srun -n 2 --gpus 2 omnitrace-run -- ./jacobi.inst -g 2 1` and check the duration (see the end-to-end sketch after this list)
* See the list of the instrumented GPU calls: `cat omnitrace-jacobi.inst-output/TIMESTAMP/roctracer-0.txt`
* See the list of the instrumented CPU calls: `cat omnitrace-jacobi.inst-output/TIMESTAMP/wallclock-0.txt` (or `wallclock-1.txt`)
* Check the MPI calls
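A minimal end-to-end sketch of the rewriting workflow, assuming the two-task allocation above and `Jacobi_hip` in the current directory:
```
# 1. Instrument the binary (this step only rewrites it, nothing is executed).
srun -n 1 --gpus 1 omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip

# 2. Run the instrumented binary with 2 MPI ranks under omnitrace-run.
time srun -n 2 --gpus 2 omnitrace-run -- ./jacobi.inst -g 2 1

# 3. Find the run's output directory (the timestamped folder varies per run).
ls omnitrace-jacobi.inst-output/
```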
### Visualization
* Copy the Perfetto trace files to your laptop (see the copy sketch after this list), open the web page https://ui.perfetto.dev/, click to open the trace, and select the file `perfetto-trace-0.proto` or `perfetto-trace-1.proto`.
* Where are all the MPI calls?
* If an MPI call is not made from the main call stack, you need to profile the call stack (default depth: 64 levels)
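To copy a trace to your laptop, something like the following `scp` call works; the username, login node address, and remote path are placeholders to adapt:
```
# Run from your laptop; TIMESTAMP is the run's output folder on Adastra.
scp user@adastra-login:/path/to/omnitrace-jacobi.inst-output/TIMESTAMP/perfetto-trace-0.proto .
```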
### Call-stack
Edit your omnitrace.cfg:
```
OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100
```
Execute the instrumented binary again; now you can see the call stack when you visualize with Perfetto.
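One way to apply the change and rerun, assuming `OMNITRACE_CONFIG_FILE` points at the file you are editing (shown as an append for brevity; make sure each key appears only once in the file):
```
# Enable call-stack sampling in the active config file.
cat >> "$OMNITRACE_CONFIG_FILE" <<'EOF'
OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100
EOF

# Rerun the instrumented binary; the new trace includes sampled call stacks.
time srun -n 2 --gpus 2 omnitrace-run -- ./jacobi.inst -g 2 1
```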
### Hardware counters
* See a list of all the counters: `srun -n 1 --gpus 1 omnitrace-avail --all`
* Declare in your configuration file: `OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy`
* Execute: `srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1`, then copy the Perfetto file and visualize it (the full sequence is sketched below)
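The counter workflow as one sketch; whether every event in the list is available depends on the GPU and the ROCm release:
```
# Add the ROCm hardware events to the config file (keep the key unique).
echo 'OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy' >> "$OMNITRACE_CONFIG_FILE"

# Rerun, then copy and visualize the new Perfetto trace as before.
srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1
```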
### Kernel timings
* Open the file `omnitrace-binary-output/timestamp/wall_clock.txt` (replace `binary` and `timestamp` with your binary name and run timestamp)
* To see the kernels aggregated as configured, make sure that `OMNITRACE_USE_TIMEMORY = true` and `OMNITRACE_FLAT_PROFILE = true` are set in your configuration file, execute the code, and open the file `omnitrace-binary-output/timestamp/wall_clock.txt` again (a sketch follows)
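A sketch of the flat-profile run, under the same config-file assumption as above:
```
# Enable timemory output and flat profiling (keep each key unique).
cat >> "$OMNITRACE_CONFIG_FILE" <<'EOF'
OMNITRACE_USE_TIMEMORY = true
OMNITRACE_FLAT_PROFILE = true
EOF

# Rerun, then inspect the per-kernel wall-clock tables (one per run directory).
srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1
cat omnitrace-jacobi.inst-output/*/wall_clock.txt
```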