
Omnitools: Performance Analysis Tools for AMD GPUs, Cray User Group Tutorial 2024

Omnitrace

  • Load Omnitrace
  • Allocate resources with salloc

salloc -N 1 --ntasks=1 --partition=gpu-dev --gpus=1 -A XXX --time=00:35:00

  • Check the available options and their values; the second command also prints a brief description of each option

srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace
srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief --description

  • Create an Omnitrace configuration file with a description for each option

srun -n 1 omnitrace-avail -G omnitrace.cfg --all

  • Declare that Omnitrace should use this configuration file:

export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg
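
For reference, the generated omnitrace.cfg contains entries like the following (the values shown are illustrative; all of these options appear later in this tutorial):

OMNITRACE_PROFILE       = false
OMNITRACE_USE_SAMPLING  = false
OMNITRACE_SAMPLING_FREQ = 100
OMNITRACE_FLAT_PROFILE  = false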

  • Get the training examples:

cp -r /your_path/exercises/HPCTrainingExamples/ .

  • Compile and execute Jacobi:
    • cd HPCTrainingExamples/HIP/jacobi
    • make clean; make -f Makefile.cray
    • This produces the binary Jacobi_hip
  • Now execute the binary and check the duration:
    • time srun -n 1 --gpus 1 Jacobi_hip -g 1 1

Dynamic instrumentation (it can take a long time or fail)

  • Execute dynamic instrumentation and check the duration: time srun -n 1 --gpus 1 omnitrace-instrument -- Jacobi_hip -g 1 1
  • For the Jacobi example, as dynamic instrumentation would take a long time, check which symbols the binary defines and calls, and thus what would get instrumented: nm --demangle Jacobi_hip | egrep -i ' (t|u) '
  • List the functions available to instrument (this can take a long time): srun -n 1 --gpus 1 omnitrace-instrument -v 1 --simulate --print-available functions -- ./Jacobi_hip -g 1 1
    • The --simulate option means that the binary is not actually executed
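
If you want to restrict what gets instrumented, omnitrace-instrument accepts regex-based include/exclude options (-I / --function-include and -E / --function-exclude in the Omnitrace documentation; check omnitrace-instrument --help on your installation). A sketch, assuming the functions of interest match the pattern 'Jacobi':

srun -n 1 --gpus 1 omnitrace-instrument -v 1 --simulate -I 'Jacobi' -- ./Jacobi_hip -g 1 1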

Binary rewriting (recommended for MPI codes; it reduces overhead)

  • Allocation: salloc -N 1 --ntasks=2 --partition=small-g --gpus=2 -A project_465000532 --time=00:35:00

  • Binary rewriting: srun -n 1 --gpus 1 omnitrace-instrument -v -1 --print-available functions -o jacobi.inst -- ./Jacobi_hip
    or
    srun -n 1 --gpus 1 omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip

  • This creates a new instrumented binary called jacobi.inst

  • Activate profiling:

    • Edit the omnitrace.cfg file and set the parameter: OMNITRACE_PROFILE = true
  • Execute the new instrumented binary and check the duration: time srun -n 2 --gpus 2 omnitrace-run -- ./jacobi.inst -g 2 1

  • See the list of the instrumented GPU calls: cat omnitrace-jacobi.inst-output/TIMESTAMP/roctracer-0.txt

  • See the list of the instrumented CPU calls: cat omnitrace-jacobi.inst-output/TIMESTAMP/wallclock-0.txt or wallclock-1.txt

  • Check the MPI calls
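
For example, a quick way to look for MPI calls in the CPU-side output (a simple grep sketch; TIMESTAMP is the placeholder from above):

grep -i 'mpi' omnitrace-jacobi.inst-output/TIMESTAMP/wallclock-0.txt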

Visualization

  • Copy the perfetto-trace-*.proto files to your laptop, open the web page https://ui.perfetto.dev/, click to open the trace, and select the file perfetto-trace-0.proto or perfetto-trace-1.proto.
  • Where are all the MPI calls?
  • If an MPI call is not made from the main call-stack, then you need to profile the call-stack via sampling (default depth: 64 levels)

Call-stack

Edit your omnitrace.cfg:

OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100

Execute the instrumented binary again, and now you can see the call-stack when you visualize the trace with Perfetto.
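
For example, repeating the earlier run with sampling enabled:

time srun -n 2 --gpus 2 omnitrace-run -- ./jacobi.inst -g 2 1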

Hardware counters

  • See a list of all the counters: srun -n 1 --gpus 1 omnitrace-avail --all
  • Declare in your configuration file: OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy
  • Execute srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1, then copy the Perfetto file and visualize it
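
To verify that a counter name is valid on your system before adding it to the configuration file, you can filter the omnitrace-avail output (a simple grep sketch):

srun -n 1 --gpus 1 omnitrace-avail --all | grep -i VALUBusy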

Kernel timings

  • Open the file omnitrace-binary-output/timestamp/wall_clock.txt (replace binary and timestamp with your information)
  • In order to see the kernels gathered according to your configuration file, make sure that OMNITRACE_USE_TIMEMORY = true and OMNITRACE_FLAT_PROFILE = true (see below), execute the code, and open the file omnitrace-binary-output/timestamp/wall_clock.txt again
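
That is, your omnitrace.cfg should contain:

OMNITRACE_USE_TIMEMORY = true
OMNITRACE_FLAT_PROFILE = true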

Causal Profiling

cp -r /project/project_465000532/exercises/causal .

We use an example; you can see the source code in ./causal/source.
This tool, ./causal/causal-cpu-omni, creates two threads and can run a fast function and a slow one. We can define the runtime fraction of the fast routine relative to the slow one; by default the fast routine takes 70% of the slow one's time.

Execute:

./causal-cpu-omni
./causal-cpu-omni 50

Read the script ./causal/run-causal-demo.sh

  • We create a new cfg file called causal.cfg to activate causal profiling, and we declare it in OMNITRACE_CONFIG_FILE

  • We define virtual speedups like SPEEDUPS="0,0,10,20-40:5,50,60-90:15". This means the profiling will investigate speedups of 0%, 0%, 10%, 20%, 25%, 30%, 35%, 40%, 50%, 60%, 75%, and 90%: the syntax 20-40:5 means the range 20-40 in steps of 5%.

  • First call:

omnitrace-causal        \
        ${RESET}        \
        -n 5            \
        -s ${SPEEDUPS}  \
        -m func         \
        --              \
        ./causal-cpu-omni "${@}"

We do 5 executions (-n 5), testing the various virtual speedups (-s) at function granularity (-m func).

  • Second call:
omnitrace-causal        \
        ${RESET}        \
        -n 10           \
        -s ${SPEEDUPS}  \
        -m func         \
        -S "causal.cpp" \
        -o experiment.func \
        --              \
        ./causal-cpu-omni "${@}"

As we want to focus on specific source files and not others, we use the -S option (source scope); the data will be written to the output file experiment.func.

  • Third call:
omnitrace-causal        \
        ${RESET}        \
        -n 10           \
        -s ${SPEEDUPS}  \
        -m line         \
        -S "causal.cpp" \
        -F "cpu_(slow|fast)_func" \
        -o experiment.line \
        --                 \
        ./causal-cpu-omni "${@}"

In this case we have 10 executions, again with virtual speedups, but in line mode, in the same file, restricted to the functions cpu_slow_func and cpu_fast_func, and the experiment data are written to the experiment.line file.

  • Fourth call:
omnitrace-causal        \
        ${RESET}        \
        -n 2           \
        -s ${SPEEDUPS}  \
        -m line         \
        -S "causal.cpp" \
        -F "cpu_slow_func" "cpu_fast_func" \
        -o experiment.line.e2e \
        -e                 \
        --                 \
        ./causal-cpu-omni "${@}"

The major difference here is the option -e, which means end-to-end: there is a single virtual speedup across the whole execution, and thus it can take a lot of executions to evaluate all the virtual speedups.

Run run-causal-demo.sh (it can take some time); a directory omnitrace-output/ with subdirectories is then created.

You need to install the viewer in some location: pip install omnitrace-causal-viewer
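
For example, installing into a Python virtual environment to keep it isolated (a minimal sketch; assumes any recent Python 3):

python3 -m venv viewer-env
source viewer-env/bin/activate
pip install omnitrace-causal-viewer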

You can download the data to your laptop (or use SSH forwarding to LUMI) and execute the command:

omnitrace-causal-plot -w omnitrace-output/causal-cpu-omni/causal/

The results are already located in ./causal/results

You can also visualize the results from the web page https://plasma-umass.org/coz/ by loading the JSON files.

Omniperf

  • Load Omniperf:
module load cray-python
module load LUMI/22.08 partition/G rocm/5.3.3 
module use /project/project_465000532/software/omniperf_rocm533/modulefiles/ 
module load omniperf/1.0.8
  • Reserve a GPU, compile the exercise, and execute Omniperf; observe how many times the code is executed
salloc -N 1 --ntasks=1 --partition=small-g --gpus=1 -A project_465000532 --time=00:30:00
cp -r /project/project_465000532/exercises/HPCTrainingExamples/ .
cd HPCTrainingExamples/HIP/dgemm/
mkdir build
cd build
cmake ..
make
cd bin
srun -n 1 --gpus 1 omniperf profile -n dgemm -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv

Execution with all the analysis

  • Run srun -n 1 --gpus 1 omniperf profile -h to see all the options

  • A workload is now created in the directory workloads with the name dgemm (the argument of -n), so we can analyze it:

srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ &> dgemm_analyze.txt

Roofline only execution

  • If you want only the roofline analysis, execute: srun -n 1 omniperf profile -n dgemm --roof-only -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv

  • When you use the --roof-only option, two PDF files will be created for the roofline, one for FP32/FP64 and one for INT8/FP16

  • If you want to create another PDF file with the kernel names, add the --kernel-names option to the above command; this will create an additional PDF file with the markers and their names (see the example below).
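
For example, combining the options above:

srun -n 1 omniperf profile -n dgemm --roof-only --kernel-names -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv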

srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ &> dgemm_analyze.txt

There is no need for srun for the analyze step, but we use it so that not everybody works on the login node. Explore the file dgemm_analyze.txt.

  • We can select specific IP Blocks, like:
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ -b 7.1.2

But you need to know the ID of the IP block.
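
To find those IDs, you can list the available metrics; recent Omniperf versions provide a --list-metrics option for this (an assumption for the version installed here; check omniperf analyze -h):

srun -n 1 --gpus 1 omniperf analyze --list-metrics gfx90a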

Visualize

  • If you have installed Omniperf on your laptop (no ROCm required for analysis) then you can download the data and execute:
omniperf analyze -p workloads/dgemm/mi200/ --gui
  • Open the web page http://IP:8050/; the IP will be displayed in the output

Handling large profiling data

  • Use a tool like rocprof to get the top 10 kernels
  • Then add the option -k 0 to the above commands, where 0 is the ID of one of the top kernels (you can pass the kernel name instead). This way you analyze only a specific kernel; to cover the top 10 kernels, repeat this 10 times and visualize each of them separately.
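
For example, to analyze only the kernel with ID 0:

srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ -k 0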