Replace the project accounts in these examples with the ones for your system.

HIP Exercises

git clone https://github.com/amd/HPCTrainingExamples.git

We assume that you have already allocated resources with salloc

cp -r /project/project_462000125/exercises/AMD/HPCTrainingExamples/ .

salloc -N 1 -p small-g --gpus=1 -t 10:00 -A project_462000125

module load craype-accel-amd-gfx90a
module load PrgEnv-amd
module load rocm

Hipify

We’ll use the same HPCTrainingExamples that were downloaded for the first exercise.

Get a node allocation.

salloc -N 1 --ntasks=1 --gpus=1 -p small-g -A project_462000125 -t 00:10:00

A batch version of the example is also shown.

Hipify Examples

Exercise 1: Manual code conversion from CUDA to HIP (10 min)

Choose one or more of the CUDA samples in

HPCTrainingExamples/HIPIFY/mini-nbody/cuda

directory. Manually convert it to HIP. Tip: for example, cudaMalloc becomes hipMalloc.
Suggested codes to start with: nbody-block.cu, nbody-orig.cu, nbody-soa.cu
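Most of these renames are mechanical, so a few sed substitutions can jump-start the manual conversion. The sketch below works on a tiny hypothetical stand-in file (not the real nbody source); the include must also change from cuda_runtime.h to hip/hip_runtime.h, and the result still needs a manual review.

```shell
# Hypothetical stand-in for a few CUDA runtime calls (illustration only).
cat > demo.cu <<'EOF'
cudaMalloc(&d_buf, bytes);
cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
cudaFree(d_buf);
EOF

# First pass of the manual rename: cuda* runtime calls become hip* calls.
# Note cudaMemcpyHostToDevice is also caught by the cudaMemcpy substitution.
sed -e 's/cudaMalloc/hipMalloc/g' \
    -e 's/cudaMemcpy/hipMemcpy/g' \
    -e 's/cudaFree/hipFree/g' \
    demo.cu > demo.cpp

cat demo.cpp
```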

You’ll want to compile on the node you’ve been allocated so that hipcc will choose the correct GPU architecture.

hipify-perl nbody-block.cu > nbody-block.cpp
hipify-perl -inplace nbody-block.cu
With the -inplace option, nbody-block.cu is converted in place and the original is saved as nbody-block.cu.prehip.

Exercise 2: Code conversion from CUDA to HIP using HIPify tools (10 min)

Use the hipify-perl script to “hipify” the CUDA samples you used to manually convert to HIP in Exercise 1. hipify-perl is in $ROCM_PATH/hip/bin directory and should be in your path.

First, test the conversion to see what will be converted:

hipify-perl -no-output -print-stats nbody-orig.cu

You'll see the statistics of HIP APIs that will be generated.

[HIPIFY] info: file 'nbody-orig.cu' statistics:
  CONVERTED refs count: 10
  TOTAL lines of code: 91
  WARNINGS: 0
[HIPIFY] info: CONVERTED refs by names:
  cudaFree => hipFree: 1
  cudaMalloc => hipMalloc: 1
  cudaMemcpy => hipMemcpy: 2
  cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1
  cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 1


Now let's actually do the conversion.

hipify-perl nbody-orig.cu > nbody-orig.cpp

Compile the HIP programs.

hipcc --offload-arch=gfx90a -DSHMOO -I ../ nbody-orig.cpp -o nbody-orig

The #define SHMOO fixes some timer printouts. Add --offload-arch=<gpu_type> to specify the GPU type and avoid the autodetection issues when running on a single GPU on a node.

  • Fix any compiler issues, for example, if there was something that didn’t hipify correctly.
  • Be on the lookout for hard-coded Nvidia specific things like warp sizes and PTX.
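A quick grep over the converted file helps catch those Nvidia-specific leftovers. The sketch below uses a tiny hypothetical file, and the pattern list is only a starting point:

```shell
# Tiny hypothetical converted file with a hard-coded warp-size assumption
# (NVIDIA warps are 32 threads; MI200 wavefronts are 64).
cat > leftovers.cpp <<'EOF'
int lane = threadIdx.x % 32;   // hard-coded NVIDIA warp size
EOF

# Flag common NVIDIA-specific patterns worth reviewing by hand.
grep -nE 'warpSize|__shfl|% *32|/ *32' leftovers.cpp
```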

Run the program

srun ./nbody-orig

A batch version of Exercise 2 is:

#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --gpus=1
#SBATCH -p small-g
#SBATCH -A project_462000125
#SBATCH -t 00:10:00

module load craype-accel-amd-gfx90a
module load rocm

cd HPCTrainingExamples/HIPIFY/mini-nbody/cuda
hipify-perl -print-stats nbody-orig.cu > nbody-orig.cpp
hipcc --offload-arch=gfx90a -DSHMOO -I ../ nbody-orig.cpp -o nbody-orig
srun ./nbody-orig
cd ../../..

Notes:

  • Hipify tools do not check correctness
  • hipconvertinplace-perl is a convenience script that runs the hipify-perl -inplace -print-stats command on a whole directory

Omnitools: Performance Analysis Tools for AMD GPUs

Omnitrace

Note: Omnitrace was just installed in /scratch/omnitrace but has not been tested yet.

source /scratch/omnitrace/share/omnitrace/setup-env.sh

  • Load Omnitrace
module load LUMI/22.08 partition/G rocm/5.3.3 
module use /project/project_465000532/software/omnitrace_rocm533/share/modulefiles/
module load omnitrace/1.10.0
  • Allocate resources with salloc

salloc -N 1 --ntasks=1 --partition=small-g --gpus=1 -A project_462000125 --time=00:35:00

  • Check the various options and their values; the second command also prints a description for each option

srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace
srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief --description

  • Create an Omnitrace configuration file with a description for each option

srun -n 1 omnitrace-avail -G omnitrace.cfg --all

  • Declare that this configuration file should be used:

export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg

  • Get the training examples:

cp -r /project/project_462000125/exercises/HPCTrainingExamples/ .

  • Compile and execute Jacobi
    • cd HPCTrainingExamples/HIP/jacobi
    • make clean;make
    • The binary Jacobi_hip is built
  • Now execute the binary

    • time srun -n 1 --gpus 1 ./Jacobi_hip -g 1 1
  • Check the duration

Dynamic instrumentation (it may take a long time or fail)

  • Execute dynamic instrumentation: time srun -n 1 --gpus 1 omnitrace-instrument -- ./Jacobi_hip -g 1 1 and check the duration
  • Because dynamic instrumentation of the Jacobi example would take a long time, check which symbols the binary calls and which would get instrumented: nm --demangle Jacobi_hip | egrep -i ' (t|u) '
  • List the functions available to instrument (this can take a long time): srun -n 1 --gpus 1 omnitrace-instrument -v 1 --simulate --print-available functions -- ./Jacobi_hip -g 1 1
    • The --simulate option means the binary is not actually executed

Binary rewriting (to be used with MPI codes and decreases overhead)

  • Allocation: salloc -N 1 --ntasks=2 --partition=small-g --gpus=2 -A project_462000125 --time=00:35:00

  • Binary rewriting: srun -n 1 --gpus 1 omnitrace-instrument -v -1 --print-available functions -o jacobi.inst -- ./Jacobi_hip
    or
    srun -n 1 --gpus 1 omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip

  • This creates a new instrumented binary called jacobi.inst

  • Execute the new instrumented binary with time srun -n 2 --gpus 2 omnitrace-run -- ./jacobi.inst -g 2 1 and check the duration

  • See the list of the instrumented GPU calls: cat omnitrace-jacobi.inst-output/TIMESTAMP/roctracer-0.txt

  • See the list of the instrumented CPU calls: cat omnitrace-jacobi.inst-output/TIMESTAMP/wallclock-0.txt or wallclock-1.txt

  • Check the MPI calls

Visualization

  • Copy the perfetto trace files to your laptop, open the web page https://ui.perfetto.dev/, click to open the trace, and select the file perfetto-trace-0.proto or perfetto-trace-1.proto.
  • Where are all the MPI calls?
  • If an MPI call is not made from the main call-stack, then you need to profile the call-stack (default depth: 64 levels)

Call-stack

Edit your omnitrace.cfg:

OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100

Execute again the instrumented binary and now you can see the call-stack when you visualize with perfetto.

Hardware counters

  • See a list of all the counters: srun -n 1 --gpus 1 omnitrace-avail --all
  • Declare in your configuration file: OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy
  • Execute: srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1 and copy the perfetto file and visualize

Kernel timings

  • Open the file omnitrace-binary-output/timestamp/wall_clock.txt (replace binary and timestamp with your information)
  • To see the kernels gathered, make sure your configuration file sets OMNITRACE_USE_TIMEMORY = true and OMNITRACE_FLAT_PROFILE = true, execute the code, and open the file omnitrace-binary-output/timestamp/wall_clock.txt again
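Pulling together the settings used across these Omnitrace exercises, an omnitrace.cfg might contain (the values are just the ones suggested above; adjust to your needs):

```
OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100
OMNITRACE_USE_TIMEMORY = true
OMNITRACE_FLAT_PROFILE = true
OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy
```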

Causal Profiling

cp -r /project/project_462000125/exercises/causal .

We use an example; you can see the source code in ./causal/source.
This tool, ./causal/causal-cpu-omni, creates two threads and runs a fast function and a slow one. We can define the duration of the fast routine as a fraction of the slow one; by default the fast routine takes 70% of the slow one's time.

Execute:

./causal-cpu-omni
./causal-cpu-omni 50

Read the script ./causal/run-causal-demo.sh

  • We create a new configuration file called causal.cfg to activate causal profiling and declare it in OMNITRACE_CONFIG_FILE

  • We define virtual speedups like SPEEDUPS="0,0,10,20-40:5,50,60-90:15". This means the profiling will investigate no speedup (0%, twice), then 10%, 20%, 25%, 30%, 35%, 40%, 50%, 60%, 75%, and 90%. The syntax 20-40:5 means the range 20-40 in steps of 5%.
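The range syntax can be sanity-checked with a small bash sketch (the helper name is ours, not part of omnitrace-causal) that expands a speedup spec into the individual values:

```shell
# Expand an omnitrace-causal speedup spec such as "20-40:5" (lo-hi:step)
# into one value per line; plain numbers pass through unchanged.
expand_speedups() {
  local IFS=','
  for part in $1; do
    case "$part" in
      *-*:*)
        local lo=${part%%-*} rest=${part#*-}
        seq "$lo" "${rest#*:}" "${rest%%:*}" ;;
      *) echo "$part" ;;
    esac
  done
}

# prints, one per line: 0 0 10 20 25 30 35 40 50 60 75 90
expand_speedups "0,0,10,20-40:5,50,60-90:15"
```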

  • First call:

omnitrace-causal        \
        ${RESET}        \
        -n 5            \
        -s ${SPEEDUPS}  \
        -m func         \
        --              \
        ./causal-cpu-omni "${@}"

We do 5 executions, testing the various speedups (-s) at function granularity (-m func).

  • Second call:
omnitrace-causal        \
        ${RESET}        \
        -n 10           \
        -s ${SPEEDUPS}  \
        -m func         \
        -S "causal.cpp" \
        -o experiment.func \
         --              \
        ./causal-cpu-omni "${@}"

As we want to focus on specific files and not others, we use the -S (source scope) option, and the data will be written to the output file experiment.func.

  • Third call:
omnitrace-causal        \
        ${RESET}        \
        -n 10           \
        -s ${SPEEDUPS}  \
        -m line         \
        -S "causal.cpp" \
        -F "cpu_(slow|fast)_func" \
        -o experiment.line \
        --                 \
        ./causal-cpu-omni "${@}"

In this case we have 10 executions, again with speedups, but in line mode, in the same file, restricted to cpu_slow_func or cpu_fast_func; the experiment data are written to the experiment.line file.

  • Fourth call:
omnitrace-causal        \
        ${RESET}        \
        -n 2           \
        -s ${SPEEDUPS}  \
        -m line         \
        -S "causal.cpp" \
        -F "cpu_slow_func" "cpu_fast_func" \
        -o experiment.line.e2e \
        -e                 \
        --                 \
        ./causal-cpu-omni "${@}"

The major difference here is the -e option, which means end to end: there is a single virtual speedup across the whole execution, so it can take many executions to evaluate all the virtual speedups.

Run run-causal-demo.sh (it can take some time); a directory omnitrace-output/ with subdirectories is then created.

The viewer needs to be installed somewhere: pip install omnitrace-causal-viewer

You can download the data to your laptop (or use SSH forwarding to LUMI) and execute the command:

omnitrace-causal-plot -w omnitrace-output/causal-cpu-omni/causal/

The results are already located in ./causal/results

You can also visualize the results from the web page https://plasma-umass.org/coz/ by loading the JSON files.

Omniperf

  • On uan1:
module load cray-python
export PATH=/scratch/omniperf/1.0.8/bin/:$PATH
export PYTHONPATH=/scratch/omniperf/python:$PYTHONPATH
  • Load Omniperf:
module load cray-python
module load LUMI/22.08 partition/G rocm/5.3.3 
module use /project/project_465000532/software/omniperf_rocm533/modulefiles/ 
module load omniperf/1.0.8
  • Reserve a GPU, compile the exercise, and execute Omniperf; observe how many times the code is executed:
salloc -N 1 --ntasks=1 --partition=small-g --gpus=1 -A project_462000125 --time=00:30:00
cp -r /project/project_465000532/exercises/HPCTrainingExamples/ .
cd HPCTrainingExamples/HIP/dgemm/
mkdir build
cd build
cmake ..
make
cd bin
srun -n 1 --gpus 1 omniperf profile -n dgemm -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv

Execution with all the analysis

  • Run srun -n 1 --gpus 1 omniperf profile -h to see all the options

  • A workload is now created in the directory workloads with the name dgemm (the argument of -n), so we can analyze it:

srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ &> dgemm_analyze.txt

Roofline only execution

  • If you want only the roofline analysis, then execute: srun -n 1 omniperf profile -n dgemm --roof-only -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv

  • When you use the --roof-only option, two PDF files will be created for the roofline, one for FP32/64 and one for FP8

  • If you want another PDF file with the kernel names, add the --kernel-names option to the above command; this will create another PDF file with the markers and their names.

srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ &> dgemm_analyze.txt

There is no need for srun to analyze, but we want to avoid everybody using the login node. Explore the file dgemm_analyze.txt.

  • We can select specific IP Blocks, like:
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ -b 7.1.2

But you need to know the code of the IP Block

Visualize

  • If you have installed Omniperf on your laptop (no ROCm is required for the analysis), you can download the data and execute:
omniperf analyze -p workloads/dgemm/mi200/ --gui
  • Open the web page http://IP:8050/ (the IP will be displayed in the output)

Too large profiling data

  • Use a tool like rocprof to get the top 10 kernels
  • Then add the option -k 0 to the above commands, where 0 is the id of one of the top-10 kernels; you can also give only the kernel name. This way you analyze just a specific kernel. Repeat 10 times, once per kernel, to analyze all 10 and visualize each of them separately.
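The per-kernel analysis above can be scripted. This dry-run sketch only prints the ten commands (via echo), with ids 0-9 standing in for whatever kernel ids rocprof reports; drop the echo to actually run them on a compute node:

```shell
# Dry run: print one omniperf analyze command per top-10 kernel id.
for k in $(seq 0 9); do
  echo srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ -k "$k"
done
```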