# Omnitools: Performance Analysis Tools for AMD GPUs, CRAY User Group Tutorial 2024
<!--## Logistics
* Access to LUMI: ssh username@lumi.csc.fi
* Project account: project_465000532
* Slides: /project/project_465000532/slides/
* Working space: /scratch/project_465000532/$USER
-->
## Omnitrace
* Load Omnitrace
<!--
```
module load LUMI/22.08 partition/G rocm/5.3.3
module use /project/project_465000532/software/omnitrace_rocm533/share/modulefiles/
module load omnitrace/1.10.0
```
-->
* Allocate resources with `salloc`
`salloc -N 1 --ntasks=1 --partition=gpu-dev --gpus=1 -A XXX --time=00:35:00`
* Check the various options and their values; the second command also prints a brief description of each option
`srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace`
`srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief --description`
* Create an Omnitrace configuration file with a description of every option
`srun -n 1 omnitrace-avail -G omnitrace.cfg --all`
* Declare that this configuration file should be used:
`export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg`
* Get the training examples:
`cp -r /your_path/exercises/HPCTrainingExamples/ .`
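For orientation, the configuration options used later in this tutorial can be set directly in that file. A minimal sketch (the option names appear elsewhere in this tutorial; the values here are only illustrative):

```
OMNITRACE_PROFILE       = false
OMNITRACE_USE_SAMPLING  = false
OMNITRACE_SAMPLING_FREQ = 100
```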
<!--
* Compile and execute saxpy
* `cd HPCTrainingExamples/HIP/saxpy`
* `hipcc --offload-arch=gfx90a -O3 -o saxpy saxpy.cpp`
* `time srun -n 1 ./saxpy`
* Check the duration
-->
* Compile and execute Jacobi
* `cd HPCTrainingExamples/HIP/jacobi`
* `make clean;make -f Makefile.cray`
* Binary `Jacobi_hip`
<!--
* Need to make some changes to the makefile
* ``MPICC=$(PREP) `which CC` ``
* `MPICFLAGS+=$(CFLAGS) -I${CRAY_MPICH_PREFIX}/include`
* `MPILDFLAGS+=$(LDFLAGS) -L${CRAY_MPICH_PREFIX}/lib -lmpich`
* comment out
* ``# $(error Unknown MPI version! Currently can detect mpich or open-mpi)``
-->
* Now execute the binary
* `time srun -n 1 --gpus 1 Jacobi_hip -g 1 1`
* Check the duration
### Dynamic instrumentation (it can take a long time or fail)
* Execute dynamic instrumentation: `time srun -n 1 --gpus 1 omnitrace-instrument -- ./Jacobi_hip -g 1 1` and check the duration
<!-- * Execute dynamic instrumentation: `time srun -n 1 --gpus 1 omnitrace-instrument -- ./Jacobi_hip -g 1 1` and check the duration (may fail?) -->
* Since dynamic instrumentation of the Jacobi example would take a long time, first check which symbols the binary defines or calls, and hence what would be instrumented: `nm --demangle Jacobi_hip | egrep -i ' (t|u) '`
* List the functions available for instrumentation (**it can take a long time**): `srun -n 1 --gpus 1 omnitrace-instrument -v 1 --simulate --print-available functions -- ./Jacobi_hip -g 1 1`
* the `--simulate` option means the binary is not actually executed
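To see what the `nm | egrep` filter above selects, here is a small stand-alone demo on mock `nm` output: `T`/`t` are defined text (code) symbols and `U`/`u` are undefined (external) symbols, while data symbols such as `D` are filtered out. The symbol names below are made up for illustration.

```shell
# Mock 'nm --demangle' output; only the T/t/U/u lines survive the filter.
cat > /tmp/nm_sample.txt <<'EOF'
0000000000001130 T Jacobi(grid*)
0000000000001a40 t LocalLaplacian(grid*)
                 U hipMemcpy
0000000000002000 D lookup_table
EOF
egrep -i ' (t|u) ' /tmp/nm_sample.txt
```

The first three lines are printed; the `D` data symbol is excluded.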
### Binary rewriting (recommended for MPI codes; it reduces overhead)
* Allocation: `salloc -N 1 --ntasks=2 --partition=small-g --gpus=2 -A project_465000532 --time=00:35:00`
* Binary rewriting: `srun -n 1 --gpus 1 omnitrace-instrument -v -1 --print-available functions -o jacobi.inst -- ./Jacobi_hip`
or
`srun -n 1 --gpus 1 omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip`
* This creates a new instrumented binary called `jacobi.inst`
* Activate profiling:
* Edit the omnitrace.cfg file and set the parameter to true: `OMNITRACE_PROFILE = true`
* Execute the new instrumented binary: `time srun -n 2 --gpus 2 omnitrace-run -- ./jacobi.inst -g 2 1` and check the duration
* See the list of the instrumented GPU calls: `cat omnitrace-jacobi.inst-output/TIMESTAMP/roctracer-0.txt`
* See the list of the instrumented CPU calls: `cat omnitrace-jacobi.inst-output/TIMESTAMP/wallclock-0.txt` or wallclock-1.txt
* Check the MPI calls
### Visualization
* Copy the `perfetto-trace-*.proto` files to your laptop, open the web page https://ui.perfetto.dev/, click to open a trace, and select the file `perfetto-trace-0.proto` or `perfetto-trace-1.proto`.
* Where are all the MPI calls?
* If an MPI call is not made from the main call-stack, then you need to profile the call-stack (64 layers by default)
### Call-stack
Edit your omnitrace.cfg:
```
OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100
```
Execute the instrumented binary again; now you can see the call-stack when you visualize with Perfetto.
### Hardware counters
* See a list of all the counters: `srun -n 1 --gpus 1 omnitrace-avail --all`
* Declare in your configuration file: `OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy`
* Execute: `srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1` and copy the perfetto file and visualize
### Kernel timings
* Open the file `omnitrace-binary-output/timestamp/wall_clock.txt` (replace binary and timestamp with your information)
* To see the kernels gathered, make sure your configuration file sets `OMNITRACE_USE_TIMEMORY = true` and `OMNITRACE_FLAT_PROFILE = true`, execute the code, and open the file `omnitrace-binary-output/timestamp/wall_clock.txt` again
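Since every run writes into a fresh timestamped subdirectory, a small helper can pick the latest one. A sketch under assumed conventions: the helper name is ours, and the directory names below are mock data created just for the demo (substitute your real `omnitrace-binary-output` directory).

```shell
# Pick the lexicographically newest timestamped subdirectory.
newest_run() {
  ls -d "$1"/*/ | sort | tail -1
}

# Demo with a mock output layout:
mkdir -p /tmp/omnitrace-demo-output/2024-05-06_10.00 \
         /tmp/omnitrace-demo-output/2024-05-07_09.30
newest_run /tmp/omnitrace-demo-output
```

Something like `less "$(newest_run omnitrace-jacobi.inst-output)wall_clock.txt"` would then open the newest report (the trailing slash is kept by the helper so paths concatenate).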
### Causal Profiling
`cp -r /project/project_465000532/exercises/causal .`
We use an example; you can see the source code in `./causal/source`.
This tool, `./causal/causal-cpu-omni`, creates two threads that run a fast function and a slow one. We can define the duration of the fast routine as a fraction of the slow one; by default the fast routine takes 70% of the slow one's time.
Execute:
```
./causal-cpu-omni
./causal-cpu-omni 50
```
Read the script `./causal/run-causal-demo.sh`
* We create a new configuration file called `causal.cfg` to activate causal profiling and declare it in `OMNITRACE_CONFIG_FILE`
* We define virtual speedups such as `SPEEDUPS="0,0,10,20-40:5,50,60-90:15"`. This means the profiling will investigate speedups of 0%, 0%, 10%, 20%, 25%, 30%, 35%, 40%, 50%, 60%, 75%, and 90%; the syntax `20-40:5` means the range 20-40 in steps of 5%.
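The range syntax can be checked with a small helper that expands a speedup specification into the individual percentages (the function name is ours, not part of Omnitrace):

```shell
# Expand an omnitrace-causal speedup spec, e.g. "20-40:5" -> 20 25 30 35 40.
expand_speedups() {
  local item lo rest
  for item in $(echo "$1" | tr ',' ' '); do
    case $item in
      *-*) lo=${item%%-*}; rest=${item#*-}
           # "lo-hi:step" -> seq lo step hi
           seq "$lo" "${rest#*:}" "${rest%:*}" ;;
      *)   echo "$item" ;;
    esac
  done
}
expand_speedups "0,0,10,20-40:5,50,60-90:15"
```

This prints 0 0 10 20 25 30 35 40 50 60 75 90, one value per line.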
* First call:
```
omnitrace-causal \
${RESET} \
-n 5 \
-s ${SPEEDUPS} \
-m func \
-- \
./causal-cpu-omni "${@}"
```
We perform 5 executions, testing the various speedups from `-s` at function granularity (`-m func`).
* Second call:
```
omnitrace-causal \
${RESET} \
-n 10 \
-s ${SPEEDUPS} \
-m func \
-S "causal.cpp" \
-o experiment.func \
-- \
./causal-cpu-omni "${@}"
```
Because we want to focus on specific files and not others, we use the `-S` option (source scope); the data are written to the output file `experiment.func`.
* Third call:
```
omnitrace-causal \
${RESET} \
-n 10 \
-s ${SPEEDUPS} \
-m line \
-S "causal.cpp" \
-F "cpu_(slow|fast)_func" \
-o experiment.line \
-- \
./causal-cpu-omni "${@}"
```
In this case we have 10 executions, again with speedups, but at line granularity, in the same file, restricted to the functions `cpu_slow_func` and `cpu_fast_func`; the experiment data are written to the `experiment.line` file.
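The `-F` argument is an extended regular expression over function names; a quick way to check what it matches is to grep a candidate list with the same pattern (the list below is just the two functions of the example plus `main`):

```shell
# Which function names does the -F regex select?
printf 'cpu_slow_func\ncpu_fast_func\nmain\n' | grep -E 'cpu_(slow|fast)_func'
```

Only `cpu_slow_func` and `cpu_fast_func` are printed; `main` is excluded.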
* Fourth call:
```
omnitrace-causal \
${RESET} \
-n 2 \
-s ${SPEEDUPS} \
-m line \
-S "causal.cpp" \
-F "cpu_slow_func" "cpu_fast_func" \
-o experiment.line.e2e \
-e \
-- \
./causal-cpu-omni "${@}"
```
The major difference here is the option `-e`, which means end-to-end: a single virtual speedup is applied across the whole execution, so evaluating all the virtual speedups can require many executions.
Run `run-causal-demo.sh` (it can take some time); a directory `omnitrace-output/` is then created with subdirectories.
Install the viewer somewhere, for example on your laptop: `pip install omnitrace-causal-viewer`
Download the output to your laptop (or use SSH port forwarding from LUMI) and execute the command:
`omnitrace-causal-plot -w omnitrace-output/causal-cpu-omni/causal/`
The results are already located in `./causal/results`
You can also visualize the results on the web page https://plasma-umass.org/coz/ by loading the JSON files.
## Omniperf
* Load Omniperf:
```
module load cray-python
module load LUMI/22.08 partition/G rocm/5.3.3
module use /project/project_465000532/software/omniperf_rocm533/modulefiles/
module load omniperf/1.0.8
```
* Reserve a GPU, compile the exercise, and execute it under Omniperf; observe how many times the code is executed
```
salloc -N 1 --ntasks=1 --partition=small-g --gpus=1 -A project_465000532 --time=00:30:00
cp -r /project/project_465000532/exercises/HPCTrainingExamples/ .
cd HPCTrainingExamples/HIP/dgemm/
mkdir build
cd build
cmake ..
make
cd bin
srun -n 1 --gpus 1 omniperf profile -n dgemm -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv
```
### Execution with all the analysis
* Run `srun -n 1 --gpus 1 omniperf profile -h` to see all the options
* A workload is now created in the directory `workloads` with the name `dgemm` (the argument of `-n`), so we can analyze it:
```
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ &> dgemm_analyze.txt
```
### Roofline only execution
* If you want only the roofline analysis, execute: `srun -n 1 omniperf profile -n dgemm --roof-only -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv`
* When you use the `--roof-only` option, two PDF files are created for the roofline: one for FP32/FP64 and one for INT8/FP16
* To create an additional PDF file with the kernel names, add `--kernel-names` to the command above; this produces another PDF with the markers and their names.
```
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ &> dgemm_analyze.txt
```
There is no need for `srun` to analyze, but we use it to avoid everyone working on the login node. Explore the file `dgemm_analyze.txt`
* We can select specific IP Blocks, like:
```
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ -b 7.1.2
```
But you need to know the identifier of the IP block
### Visualize
* If you have installed Omniperf on your laptop (no ROCm required for analysis) then you can download the data and execute:
```
omniperf analyze -p workloads/dgemm/mi200/ --gui
```
* Open the web page http://IP:8050/; the IP address is displayed in the output
### Too large profiling data
* Use a tool such as rocprof to identify the top-10 kernels
* Then add the option `-k <id>` to the commands above, where `<id>` is the id of one of the top-10 kernels (you can also pass the kernel name instead). This way you analyze only a specific kernel; repeat 10 times to analyze all 10 kernels and visualize each of them separately.
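The per-kernel workflow above can be sketched as a dry run that only prints the commands (kernel ids 0-9 are assumed for illustration; substitute the ids reported by rocprof, and drop the `echo` to actually run them):

```shell
# Dry run: print one omniperf analyze command per top-10 kernel id.
for k in $(seq 0 9); do
  echo "omniperf analyze -p workloads/dgemm/mi200/ -k $k &> dgemm_analyze_k${k}.txt"
done
```

Each generated report can then be explored or visualized separately.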