Replace the project accounts in these examples with the ones for your system.

HIP Exercises

git clone https://github.com/amd/HPCTrainingExamples.git

We assume that you have already allocated resources with salloc

cp -r /project/project_462000125/exercises/AMD/HPCTrainingExamples/ .

salloc -N 1 -p small-g --gpus=1 -t 10:00 -A project_462000125

module load craype-accel-amd-gfx90a
module load PrgEnv-amd
module load rocm

Hipify

We’ll use the same HPCTrainingExamples that were downloaded for the first exercise.

Get a node allocation.

salloc -N 1 --ntasks=1 --gpus=1 -p small-g -A project_462000125 -t 00:10:00

A batch version of the example is also shown.

Hipify Examples

Exercise 1: Manual code conversion from CUDA to HIP (10 min)

Choose one or more of the CUDA samples in

HPCTrainingExamples/HIPIFY/mini-nbody/cuda

directory. Manually convert it to HIP. Tip: for example, cudaMalloc becomes hipMalloc.
Suggested codes to start with: nbody-block.cu, nbody-orig.cu, nbody-soa.cu
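Most of these renames are mechanical, so a few sed substitutions can jump-start the manual conversion. The sketch below works on a tiny hypothetical stand-in file (not the real nbody source); the include must also change from cuda_runtime.h to hip/hip_runtime.h, and the result still needs a manual review.

```shell
# Hypothetical stand-in for a few CUDA runtime calls (illustration only).
cat > demo.cu <<'EOF'
cudaMalloc(&d_buf, bytes);
cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
cudaFree(d_buf);
EOF

# First pass of the manual rename: cuda* runtime calls become hip* calls.
# Note cudaMemcpyHostToDevice is also caught by the cudaMemcpy substitution.
sed -e 's/cudaMalloc/hipMalloc/g' \
    -e 's/cudaMemcpy/hipMemcpy/g' \
    -e 's/cudaFree/hipFree/g' \
    demo.cu > demo.cpp

cat demo.cpp
```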

You’ll want to compile on the node you’ve been allocated so that hipcc will choose the correct GPU architecture.

hipify-perl nbody-block.cu > nbody-block.cpp
hipify-perl -inplace nbody-block.cu
With the -inplace option, nbody-block.cu is converted in place and the original is saved as nbody-block.cu.prehip.

Exercise 2: Code conversion from CUDA to HIP using HIPify tools (10 min)

Use the hipify-perl script to “hipify” the CUDA samples you used to manually convert to HIP in Exercise 1. hipify-perl is in $ROCM_PATH/hip/bin directory and should be in your path.

First, test the conversion to see what will be converted:

hipify-perl -no-output -print-stats nbody-orig.cu

You'll see the statistics of HIP APIs that will be generated.

[HIPIFY] info: file 'nbody-orig.cu' statistics:
  CONVERTED refs count: 10
  TOTAL lines of code: 91
  WARNINGS: 0
[HIPIFY] info: CONVERTED refs by names:
  cudaFree => hipFree: 1
  cudaMalloc => hipMalloc: 1
  cudaMemcpy => hipMemcpy: 2
  cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1
  cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 1


Now let's actually do the conversion.

hipify-perl nbody-orig.cu > nbody-orig.cpp

Compile the HIP programs.

hipcc --offload-arch=gfx90a -DSHMOO -I ../ nbody-orig.cpp -o nbody-orig

The #define SHMOO fixes some timer printouts. Add --offload-arch=<gpu_type> to specify the GPU type and avoid the autodetection issues when running on a single GPU on a node.

  • Fix any compiler issues, for example, if there was something that didn’t hipify correctly.
  • Be on the lookout for hard-coded Nvidia specific things like warp sizes and PTX.
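A quick grep over the converted file helps catch those Nvidia-specific leftovers. The sketch below uses a tiny hypothetical file, and the pattern list is only a starting point:

```shell
# Tiny hypothetical converted file with a hard-coded warp-size assumption
# (NVIDIA warps are 32 threads; MI200 wavefronts are 64).
cat > leftovers.cpp <<'EOF'
int lane = threadIdx.x % 32;   // hard-coded NVIDIA warp size
EOF

# Flag common NVIDIA-specific patterns worth reviewing by hand.
grep -nE 'warpSize|__shfl|% *32|/ *32' leftovers.cpp
```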

Run the program

srun ./nbody-orig

A batch version of Exercise 2 is:

#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --gpus=1
#SBATCH -p small-g
#SBATCH -A project_462000125
#SBATCH -t 00:10:00

module load craype-accel-amd-gfx90a
module load rocm

cd HPCTrainingExamples/HIPIFY/mini-nbody/cuda
hipify-perl -print-stats nbody-orig.cu > nbody-orig.cpp
hipcc --offload-arch=gfx90a -DSHMOO -I ../ nbody-orig.cpp -o nbody-orig
srun ./nbody-orig
cd ../../..

Notes:

  • Hipify tools do not check correctness
  • hipconvertinplace-perl is a convenience script that runs the hipify-perl -inplace -print-stats command on a whole directory

Omnitools: Performance Analysis Tools for AMD GPUs

Omnitrace

Note: Omnitrace was just installed in /scratch/omnitrace but has not been tested yet.

source /scratch/omnitrace/share/omnitrace/setup-env.sh

  • Load Omnitrace
module load LUMI/22.08 partition/G rocm/5.3.3 
module use /project/project_465000532/software/omnitrace_rocm533/share/modulefiles/
module load omnitrace/1.10.0
  • Allocate resources with salloc

salloc -N 1 --ntasks=1 --partition=small-g --gpus=1 -A project_462000125 --time=00:35:00

  • Check the various options and their values; the second command also prints a description for each option

srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace
srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief --description

  • Create an Omnitrace configuration file with a description for each option

srun -n 1 omnitrace-avail -G omnitrace.cfg --all

  • Declare that this configuration file should be used:

export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg

  • Get the training examples:

cp -r /project/project_462000125/exercises/HPCTrainingExamples/ .

  • Compile and execute Jacobi
    • cd HPCTrainingExamples/HIP/jacobi
    • make clean;make
    • The binary Jacobi_hip is built
  • Now execute the binary

    • time srun -n 1 --gpus 1 ./Jacobi_hip -g 1 1
  • Check the duration

Dynamic instrumentation (it may take a long time or fail)

  • Execute dynamic instrumentation: time srun -n 1 --gpus 1 omnitrace-instrument -- ./Jacobi_hip -g 1 1 and check the duration
  • Because dynamic instrumentation of the Jacobi example would take a long time, check which symbols the binary calls and which would get instrumented: nm --demangle Jacobi_hip | egrep -i ' (t|u) '
  • List the functions available to instrument (this can take a long time): srun -n 1 --gpus 1 omnitrace-instrument -v 1 --simulate --print-available functions -- ./Jacobi_hip -g 1 1
    • The --simulate option means the binary is not actually executed

Binary rewriting (to be used with MPI codes and decreases overhead)

  • Allocation: salloc -N 1 --ntasks=2 --partition=small-g --gpus=2 -A project_462000125 --time=00:35:00

  • Binary rewriting: srun -n 1 --gpus 1 omnitrace-instrument -v -1 --print-available functions -o jacobi.inst -- ./Jacobi_hip
    or
    srun -n 1 --gpus 1 omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip

  • This creates a new instrumented binary called jacobi.inst

  • Execute the new instrumented binary with time srun -n 2 --gpus 2 omnitrace-run -- ./jacobi.inst -g 2 1 and check the duration

  • See the list of the instrumented GPU calls: cat omnitrace-jacobi.inst-output/TIMESTAMP/roctracer-0.txt

  • See the list of the instrumented CPU calls: cat omnitrace-jacobi.inst-output/TIMESTAMP/wallclock-0.txt or wallclock-1.txt

  • Check the MPI calls

Visualization

  • Copy the perfetto trace files to your laptop, open the web page https://ui.perfetto.dev/, click to open the trace, and select the file perfetto-trace-0.proto or perfetto-trace-1.proto.
  • Where are all the MPI calls?
  • If an MPI call is not made from the main call-stack, then you need to profile the call-stack (default depth: 64 levels)

Call-stack

Edit your omnitrace.cfg:

OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100

Execute again the instrumented binary and now you can see the call-stack when you visualize with perfetto.

Hardware counters

  • See a list of all the counters: srun -n 1 --gpus 1 omnitrace-avail --all
  • Declare in your configuration file: OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy
  • Execute: srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1 and copy the perfetto file and visualize

Kernel timings

  • Open the file omnitrace-binary-output/timestamp/wall_clock.txt (replace binary and timestamp with your information)
  • To see the kernels gathered, make sure your configuration file sets OMNITRACE_USE_TIMEMORY = true and OMNITRACE_FLAT_PROFILE = true, execute the code, and open the file omnitrace-binary-output/timestamp/wall_clock.txt again
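Pulling together the settings used across these Omnitrace exercises, an omnitrace.cfg might contain (the values are just the ones suggested above; adjust to your needs):

```
OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100
OMNITRACE_USE_TIMEMORY = true
OMNITRACE_FLAT_PROFILE = true
OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy
```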

Causal Profiling

cp -r /project/project_462000125/exercises/causal .

We use an example; you can see the source code in ./causal/source.
This tool, ./causal/causal-cpu-omni, creates two threads and runs a fast function and a slow one. We can define the duration of the fast routine as a fraction of the slow one; by default the fast routine takes 70% of the slow one's time.

Execute:

./causal-cpu-omni
./causal-cpu-omni 50

Read the script ./causal/run-causal-demo.sh

  • We create a new configuration file called causal.cfg to activate causal profiling and declare it in OMNITRACE_CONFIG_FILE

  • We define virtual speedups like SPEEDUPS="0,0,10,20-40:5,50,60-90:15". This means the profiling will investigate no speedup (0%, twice), then 10%, 20%, 25%, 30%, 35%, 40%, 50%, 60%, 75%, and 90%. The syntax 20-40:5 means the range 20-40 in steps of 5%.
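The range syntax can be sanity-checked with a small bash sketch (the helper name is ours, not part of omnitrace-causal) that expands a speedup spec into the individual values:

```shell
# Expand an omnitrace-causal speedup spec such as "20-40:5" (lo-hi:step)
# into one value per line; plain numbers pass through unchanged.
expand_speedups() {
  local IFS=','
  for part in $1; do
    case "$part" in
      *-*:*)
        local lo=${part%%-*} rest=${part#*-}
        seq "$lo" "${rest#*:}" "${rest%%:*}" ;;
      *) echo "$part" ;;
    esac
  done
}

# prints, one per line: 0 0 10 20 25 30 35 40 50 60 75 90
expand_speedups "0,0,10,20-40:5,50,60-90:15"
```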

  • First call:

omnitrace-causal        \
        ${RESET}        \
        -n 5            \
        -s ${SPEEDUPS}  \
        -m func         \
        --              \
        ./causal-cpu-omni "${@}"

We do 5 executions, testing the various speedups (-s) at function granularity (-m func).

  • Second call:
omnitrace-causal        \
        ${RESET}        \
        -n 10           \
        -s ${SPEEDUPS}  \
        -m func         \
        -S "causal.cpp" \
        -o experiment.func \
         --              \
        ./causal-cpu-omni "${@}"

As we want to focus on specific files and not others, we use the -S (source scope) option, and the data will be written to the output file experiment.func.

  • Third call:
omnitrace-causal        \
        ${RESET}        \
        -n 10           \
        -s ${SPEEDUPS}  \
        -m line         \
        -S "causal.cpp" \
        -F "cpu_(slow|fast)_func" \
        -o experiment.line \
        --                 \
        ./causal-cpu-omni "${@}"

In this case we have 10 executions, again with speedups, but in line mode, in the same file, restricted to cpu_slow_func or cpu_fast_func; the experiment data are written to the experiment.line file.

  • Fourth call:
omnitrace-causal        \
        ${RESET}        \
        -n 2           \
        -s ${SPEEDUPS}  \
        -m line         \
        -S "causal.cpp" \
        -F "cpu_slow_func" "cpu_fast_func" \
        -o experiment.line.e2e \
        -e                 \
        --                 \
        ./causal-cpu-omni "${@}"

The major difference here is the -e option, which means end to end: there is a single virtual speedup across the whole execution, so it can take many executions to evaluate all the virtual speedups.

Run run-causal-demo.sh (it can take some time); a directory omnitrace-output/ with subdirectories is then created.

The viewer needs to be installed somewhere: pip install omnitrace-causal-viewer

You can download the data to your laptop (or use SSH forwarding to LUMI) and execute the command:

omnitrace-causal-plot -w omnitrace-output/causal-cpu-omni/causal/

The results are already located in ./causal/results

You can also visualize the results from the web page https://plasma-umass.org/coz/ by loading the JSON files.

Omniperf

  • On uan1:
module load cray-python
export PATH=/scratch/omniperf/1.0.8/bin/:$PATH
export PYTHONPATH=/scratch/omniperf/python:$PYTHONPATH
  • Load Omniperf:
module load cray-python
module load LUMI/22.08 partition/G rocm/5.3.3 
module use /project/project_465000532/software/omniperf_rocm533/modulefiles/ 
module load omniperf/1.0.8
  • Reserve a GPU, compile the exercise, and execute Omniperf; observe how many times the code is executed:
salloc -N 1 --ntasks=1 --partition=small-g --gpus=1 -A project_462000125 --time=00:30:00
cp -r /project/project_465000532/exercises/HPCTrainingExamples/ .
cd HPCTrainingExamples/HIP/dgemm/
mkdir build
cd build
cmake ..
make
cd bin
srun -n 1 --gpus 1 omniperf profile -n dgemm -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv

Execution with all the analysis

  • Run srun -n 1 --gpus 1 omniperf profile -h to see all the options

  • A workload is now created in the directory workloads with the name dgemm (the argument of -n), so we can analyze it:

srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ &> dgemm_analyze.txt

Roofline only execution

  • If you want only the roofline analysis, then execute: srun -n 1 omniperf profile -n dgemm --roof-only -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv

  • When you use the --roof-only option, two PDF files will be created for the roofline, one for FP32/64 and one for FP8

  • If you want another PDF file with the kernel names, add the --kernel-names option to the above command; this will create another PDF file with the markers and their names.

srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ &> dgemm_analyze.txt

There is no need for srun to analyze, but we want to avoid everybody using the login node. Explore the file dgemm_analyze.txt.

  • We can select specific IP Blocks, like:
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ -b 7.1.2

But you need to know the code of the IP Block

Visualize

  • If you have installed Omniperf on your laptop (no ROCm is required for the analysis), you can download the data and execute:
omniperf analyze -p workloads/dgemm/mi200/ --gui
  • Open the web page http://IP:8050/ (the IP will be displayed in the output)

Too large profiling data

  • Use a tool like rocprof to get the top 10 kernels
  • Then add the option -k 0 to the above commands, where 0 is the id of one of the top-10 kernels; you can also give only the kernel name. This way you analyze just a specific kernel. Repeat 10 times, once per kernel, to analyze all 10 and visualize each of them separately.
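The per-kernel analysis above can be scripted. This dry-run sketch only prints the ten commands (via echo), with ids 0-9 standing in for whatever kernel ids rocprof reports; drop the echo to actually run them on a compute node:

```shell
# Dry run: print one omniperf analyze command per top-10 kernel id.
for k in $(seq 0 9); do
  echo srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ -k "$k"
done
```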