# Omnitools: Performance Analysis Tools for AMD GPUs, Adastra training
## Omnitrace
* Load Omnitrace
Load the appropriate modules on Adastra, along with your project environment (if any); the instructions below are generic
* Allocate resources with `salloc`
`salloc -N 1 --ntasks=1 --partition=small-g --gpus=1 -A project_465000532 --time=00:35:00`
* Check the various options and their values; the second command additionally prints a brief description of each option
`srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace`
`srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief --description`
* Create an Omnitrace configuration file with a description for each option
`srun -n 1 omnitrace-avail -G omnitrace.cfg --all`
* Declare the configuration file to be used:
`export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg`
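For orientation, the configuration steps can be combined into a small shell sketch; the file location `$HOME/omnitrace.cfg` is an assumption, any writable path works:
```
# Generate a fully documented config file once, then point Omnitrace to it
# ($HOME/omnitrace.cfg is an assumed location, not a required one).
srun -n 1 omnitrace-avail -G $HOME/omnitrace.cfg --all
export OMNITRACE_CONFIG_FILE=$HOME/omnitrace.cfg

# Sanity check: the listed settings should now reflect your config file.
srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief
```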
* Get the training examples:
`cp -r /project/project_465000532/exercises/HPCTrainingExamples/ .`
<!--
* Compile and execute saxpy
* `cd HPCTrainingExamples/HIP/saxpy`
* `hipcc --offload-arch=gfx90a -O3 -o saxpy saxpy.cpp`
* `time srun -n 1 ./saxpy`
* Check the duration
-->
* Compile and execute Jacobi
* `cd HPCTrainingExamples/HIP/jacobi`
* `make clean;make`
  * The produced binary is `Jacobi_hip`
<!--
* Need to make some changes to the makefile
* ``MPICC=$(PREP) `which CC` ``
* `MPICFLAGS+=$(CFLAGS) -I${CRAY_MPICH_PREFIX}/include`
* `MPILDFLAGS+=$(LDFLAGS) -L${CRAY_MPICH_PREFIX}/lib -lmpich`
* comment out
* ``# $(error Unknown MPI version! Currently can detect mpich or open-mpi)``
-->
* Now execute the binary
  * `time srun -n 1 --gpus 1 ./Jacobi_hip -g 1 1`
* Check the duration
### Binary rewriting (recommended for MPI codes; reduces overhead)
* Allocation: `salloc -N 1 --ntasks=2 --partition=small-g --gpus=2 -A project_465000532 --time=00:35:00`
* Binary rewriting: `srun -n 1 --gpus 1 omnitrace-instrument -v -1 --print-available functions -o jacobi.inst -- ./Jacobi_hip`
or
`srun -n 1 --gpus 1 omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip`
* This creates a new instrumented binary called `jacobi.inst`
* Execute the new instrumented binary: `time srun -n 2 --gpus 2 omnitrace-run -- ./jacobi.inst -g 2 1` and check the duration (see the end-to-end sketch after this list)
* See the list of the instrumented GPU calls: `cat omnitrace-jacobi.inst-output/TIMESTAMP/roctracer-0.txt`
* See the list of the instrumented CPU calls: `cat omnitrace-jacobi.inst-output/TIMESTAMP/wallclock-0.txt` (or `wallclock-1.txt`)
* Check the MPI calls
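A minimal end-to-end sketch of the rewriting workflow, assuming the two-task allocation above and `Jacobi_hip` in the current directory:
```
# 1. Instrument the binary (this step only rewrites it, nothing is executed).
srun -n 1 --gpus 1 omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip

# 2. Run the instrumented binary with 2 MPI ranks under omnitrace-run.
time srun -n 2 --gpus 2 omnitrace-run -- ./jacobi.inst -g 2 1

# 3. Find the run's output directory (the timestamped folder varies per run).
ls omnitrace-jacobi.inst-output/
```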
### Visualization
* Copy the Perfetto trace files to your laptop (see the copy sketch after this list), open the web page https://ui.perfetto.dev/, click to open the trace, and select the file `perfetto-trace-0.proto` or `perfetto-trace-1.proto`.
* Where are all the MPI calls?
* If an MPI call is not made from the main call stack, you need to profile the call stack (default depth: 64 levels)
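To copy a trace to your laptop, something like the following `scp` call works; the username, login node address, and remote path are placeholders to adapt:
```
# Run from your laptop; TIMESTAMP is the run's output folder on Adastra.
scp user@adastra-login:/path/to/omnitrace-jacobi.inst-output/TIMESTAMP/perfetto-trace-0.proto .
```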
### Call-stack
Edit your omnitrace.cfg:
```
OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100
```
Execute the instrumented binary again; now you can see the call stack when you visualize with Perfetto.
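One way to apply the change and rerun, assuming `OMNITRACE_CONFIG_FILE` points at the file you are editing (shown as an append for brevity; make sure each key appears only once in the file):
```
# Enable call-stack sampling in the active config file.
cat >> "$OMNITRACE_CONFIG_FILE" <<'EOF'
OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100
EOF

# Rerun the instrumented binary; the new trace includes sampled call stacks.
time srun -n 2 --gpus 2 omnitrace-run -- ./jacobi.inst -g 2 1
```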
### Hardware counters
* See a list of all the counters: `srun -n 1 --gpus 1 omnitrace-avail --all`
* Declare in your configuration file: `OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy`
* Execute: `srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1`, then copy the Perfetto file and visualize it (the full sequence is sketched below)
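The counter workflow as one sketch; whether every event in the list is available depends on the GPU and the ROCm release:
```
# Add the ROCm hardware events to the config file (keep the key unique).
echo 'OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy' >> "$OMNITRACE_CONFIG_FILE"

# Rerun, then copy and visualize the new Perfetto trace as before.
srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1
```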
### Kernel timings
* Open the file `omnitrace-binary-output/timestamp/wall_clock.txt` (replace `binary` and `timestamp` with your binary name and run timestamp)
* To see the kernels aggregated as configured, make sure that `OMNITRACE_USE_TIMEMORY = true` and `OMNITRACE_FLAT_PROFILE = true` are set in your configuration file, execute the code, and open the file `omnitrace-binary-output/timestamp/wall_clock.txt` again (a sketch follows)
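A sketch of the flat-profile run, under the same config-file assumption as above:
```
# Enable timemory output and flat profiling (keep each key unique).
cat >> "$OMNITRACE_CONFIG_FILE" <<'EOF'
OMNITRACE_USE_TIMEMORY = true
OMNITRACE_FLAT_PROFILE = true
EOF

# Rerun, then inspect the per-kernel wall-clock tables (one per run directory).
srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1
cat omnitrace-jacobi.inst-output/*/wall_clock.txt
```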