LUMI-G training (16/02/2023)

We assume that you have already allocated resources with salloc

cp -r /projappl/project_465000388/exercises/AMD/HIP-Examples/ .

salloc -N 1 -p small-g --gpus=1 -t 10:00 -A project_465000388

module rm rocm
module load craype-accel-amd-gfx90a
module load PrgEnv-amd
module load rocm

Basic examples

cd HIP-Examples/vectorAdd

Examine files here – README, Makefile and vectoradd_hip.cpp Notice that Makefile requires HIP_PATH to be set. Check with module show rocm or echo $HIP_PATH Also, the Makefile builds and runs the code. We’ll do the steps separately. Check also the HIPFLAGS in the Makefile.

make vectoradd_hip.exe
srun -n 1 ./vectoradd_hip.exe

We can use SLURM submission script, let's call it hip_batch.sh:

#!/bin/bash
#SBATCH -p small-g
#SBATCH -N 1
#SBATCH --gpus=1
#SBATCH -t 10:00
#SBATCH -A project_465000388

module load craype-accel-amd-gfx90a
module load rocm
cd $HOME/HPCTrainingExamples/HIP/vectorAdd 

make vectoradd_hip.exe
srun -n 1 --gpus 1 ./vectoradd_hip.exe

Submit the script
sbatch hip_batch.sh

Check for output in slurm-<job-id>.out or error in slurm-<job-id>.err

Compile and run with Cray compiler

CC -x hip vectoradd_hip.cpp -o vectoradd_hip.exe
srun -n 1 --gpus 1 ./vectoradd_hip.exe

Now let’s try the cuda-stream example. This example is from the original McCalpin code as ported to CUDA by Nvidia. This version has been ported to use HIP. See add4 for another similar stream example.

cd HIP-Examples/cuda-stream
make
srun -n 1 ./stream

Note that it builds with the hipcc compiler. You should get a report of the Copy, Scale, Add, and Triad cases.
Check that we need to declare target GPU for MI250x, is the --offload-arch=gfx90a

Hipify example

We’ll use the same HIP-Examples that were downloaded for the first exercise Get a node allocation.

Exercise 1: Manual code conversion from CUDA to HIP (10 min)

Choose one or more of the CUDA samples in HIP-Examples/mini-nbody/cuda repository and manually convert them to HIP. Tip: for example, the cudaMalloc will be called hipMalloc.
Some code suggestions include mini-nbody/cuda/<nbody-block.cu,nbody-orig.cu,nbody-soa.cu>

The CUDA samples are located in HIP-Examples/mini-nbody/cuda

Manually convert the source code of your choice to HIP

You’ll want to compile on the node you’ve been allocated so that hipcc will choose the correct GPU architecture.

Exercise 2: Code conversion from CUDA to HIP using HIPify tools (10 min)

Use the hipify-perl -inplace -print-stats to “hipify” the CUDA samples you used to manually convert to HIP in Exercise 1. hipify-perl.sh is in $ROCM_PATH/hip/bin directory and should be in your path.

For example, if helloworld.cu is a CUDA program, run hipify-perl.sh -inplace –print-stats helloworld.cu. You’ll see a helloworld.cu.prehip file that is the original and the helloworld.cu file now has HIP calls. b. You’ll also see statistics of HIP APIs that were converted.
For example, for hipify-perl -inplace -print-stats nbody-orig.cu:

[HIPIFY] info: file 'nbody-orig.cu' statistics:
  CONVERTED refs count: 7
  TOTAL lines of code: 91
  WARNINGS: 0
[HIPIFY] info: CONVERTED refs by names:
  cudaFree => hipFree: 1
  cudaMalloc => hipMalloc: 1
  cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1
  cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 1

Compile the HIP programs
Fix any compiler issues, for example, if there was something that didn’t hipify correctly.
Be on the lookout for hard-coded Nvidia specific things like warp sizes and PTX.
For the nbody-orig.cu code, compile with hipcc -DSHMOO -I ../ nbody-orig.cu -o nbody-orig. The #define SHMOO fixes some timer printouts. Add –offload-arch=<gpu_type> to specify the GPU type and avoid the autodetection issues when running on a single GPU on a node.
Run the programs.

HIPFort

HIPFort is not installed by default, if you want to install it, follow the instructions:

git clone https://github.com/ROCmSoftwarePlatform/hipfort hipfort-source
mkdir hipfort-build; cd hipfort-build

export HIPFORT_INSTALL_DIR=$PWD/hipfort

cmake -DHIPFORT_INSTALL_DIR=${HIPFORT_INSTALL_DIR} ../hipfort-source
make install

export PATH=${HIPFORT_INSTALL_DIR}/bin:$PATH

Compile and execute HIPFort example:

cd hipfort-source/test/f2003/vecadd
hipfc -v --offload-arch=gfx90a hip_implementation.cpp main.f03
srun -n 1 --gpus 1 ./a.out

Debugging

The first exercise will be the same as the one covered in the presentation so that we
can focus on the mechanics. Then there will be additional exercises to explore further
or you can start debugging your own applications.

Get the exercise: git clone https://github.com/AMD/HPCTrainingExamples.git

Go to HPCTrainingExamples/HIP/saxpy

Edit the saxpy.cpp file and comment out the two hipMalloc lines.

41 //hipMalloc(&d_x, size);
42 //hipMalloc(&d_y, size);

Add a synchronization after the kernel call

47 hipDeviceSynchronize();

Now let's try using rocgdb to find the error.

Compile the code with

hipcc --offload-arch=gfx90a -o saxpy saxpy.cpp

Allocate a compute node.
Run the code

srun ./saxpy

Output

:0:rocdevice.cpp            :2614: 283728925446 us: 24406: [tid:0x148c691ca700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b                                    srun: error: nid007303: task 0: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=2831785.0

How do we find the error? Let's start up the debugger. First, we’ll recompile the code to help the debugging process. We also set the number of CPU OpenMP threads to reduce the number of threads seen by the debugger.

hipcc -ggdb -O0 --offload-arch=gfx90a -o saxpy saxpy.cpp
export OMP_NUM_THREADS=1

We have two options for running the debugger. We can use an interactive session, or we can just simply use a regular srun command.

srun rocgdb saxpy

The interactive approach uses:

srun --interactive --pty [--jobid=<jobid>] bash 
rocgdb ./saxpy

We need to supply the jobid if we have more than one job so that it knows which to use.

We can also choose to use one of the Text User Interfaces (TUI) or Graphics User Interfaces (GUI). We look to see what is available.

which cgdb
   -- not found
   -- run with cgdb -d rocgdb <executable>
which ddd
    -- not found
    -- run with ddd --debugger rocgdb
which gdbgui
   -- not found
   -- run with gdbgui --gdb-cmd /opt/rocm/bin/rocgdb
rocgdb –tui
 -- found

We have the TUI interface for rocgdb. We need an interactive session on the compute node to run with this interface. We do this by using the following command.

srun --interactive --pty [-jobid=<jobid>] bash 
rocgdb -tui ./saxpy

The following is based on using the standard gdb interface. Using the TUI or GUI interfaces should be similar.
You should see some output like the following once the debugger starts.

[output]
GNU gdb (rocm-rel-5.1-36) 11.2
Copyright (C) 2022 Free Software Foundation, Inc. 
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.                                                                          
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./saxpy...

Now it is waiting for us to tell it what to do. We'll go for broke and just type run

(gdb) run
 
[output] 
Thread 3 "saxpy" received signal SIGSEGV, Segmentation fault.[Switching to thread 3, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x000015554a001094 in saxpy (n=<optimized out>, x=<optimized out>, incx=<optimized out>, y=<optimized out>, incy=<optimized out>) at saxpy.cpp:31
31              y[i] += a*x[i];

The line number 31 is a clue. Now let’s dive a little deeper by getting the GPU thread trace

(gdb) info threads [ shorthand - i th ]

 [output]
 Id   Target Id                                 Frame
  1    Thread 0x15555552d300 (LWP 40477) "saxpy" 0x000015554b67ebc9 in ?? ()
   from /opt/rocm/lib/libhsa-runtime64.so.1
  2    Thread 0x15554a9ac700 (LWP 40485) "saxpy" 0x00001555533e1c47 in ioctl () 
   from /lib64/libc.so.6
* 3    AMDGPU Wave 1:2:1:1 (0,0,0)/0 "saxpy"     0x000015554a001094 in saxpy ( 
    n=<optimized out>, x=<optimized out>, incx=<optimized out>,
    y=<optimized out>, incy=<optimized out>) at saxpy.cpp:31
  4    AMDGPU Wave 1:2:1:2 (0,0,0)/1 "saxpy"     0x000015554a001094 in saxpy ( 
    n=<optimized out>, x=<optimized out>, incx=<optimized out>, 
    y=<optimized out>, incy=<optimized out>) at saxpy.cpp:31 
  5    AMDGPU Wave 1:2:1:3 (1,0,0)/0 "saxpy"     0x000015554a001094 in saxpy (
    n=<optimized out>, x=<optimized out>, incx=<optimized out>, 
    y=<optimized out>, incy=<optimized out>) at saxpy.cpp:31
  6    AMDGPU Wave 1:2:1:4 (1,0,0)/1 "saxpy"     0x000015554a001094 in saxpy ( 
    n=<optimized out>, x=<optimized out>, incx=<optimized out>,
    y=<optimized out>, incy=<optimized out>) at saxpy.cpp:31

Note that the GPU threads are also shown! Switch to thread 1 (CPU)

(gdb) thread 1 [ shorthand - t 1]
[output] 
[Switching to thread 1 (Thread 0x15555552d300 (LWP 47136))]
#0  0x000015554b67ebc9 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1

where

…

#12 0x0000155553b5b419 in hipDeviceSynchronize ()
   from /opt/rocm/lib/libamdhip64.so.5
#13 0x000000000020d6fd in main () at saxpy.cpp:47

(gdb) break saxpy.cpp:44 [ shorthand – b saxpy.cpp:46]

[output] 
Breakpoint 2 at 0x21a830: file saxpy.cpp, line 46

(gdb) run [ shorthand – r ]

Breakpoint 1, main () at saxpy.cpp:46
46          saxpy<<<num_groups, group_size>>>(n, d_x, 1, d_y, 1);

From here we can investigate the input to the kernel and see that the memory has not been allocated.
Restart the program in the debugger.

srun --interactive --pty [-jobid=<jobid>] rocgdb ./saxpy
(gdb) list 22,42

(gdb) b 39

[output] 

Breakpoint 1 at 0x219ea2: file saxpy.cpp, line 44.

Must have optimized out some lines. We want to stop at the start of the routine before the allocations.

(gdb) b main
Breakpoint 2 at 0x219ea2: file saxpy.cpp, line 36.

Better!

(gdb) run
[output] 
Starting program ...
...
Breakpoint 2, main() at saxpy.cpp:36
36          int n=256;

(gdb) p d_x
[output] 
$1 = (float *) 0x219cd0 <_start>

Should have intialized the pointer to NULL!

(gdb) n
[output] 
37         std::size_t size = sizeof(float)*n;

(gdb) n
[output] 
Breakpoint 1, main () at saxpy.cpp:44
44         int num_groups = 2;

(gdb) p d_x
[output] 
$2 = (float *) 0x219cd0 <_start>
(gdb) p *x@5

Prints out the next 5 values pointed to by x

[output] 
$3 = {-2.43e-33, 2.4e-33, -1.93e22, 556, 2.163e-36}

Random values printed out – not initialized!

(gdb) b 30

(gdb) c

[output] 
Thread 5 “saxpy” hit Breakpoint 3 ….
30      if (i < n)

(gdb) info threads

Shows both CPU and GPU threads
(gdb) p x

[output] 
$4 = (const float *) 0x219cd0 <_start>

(gdb) p *x@5

[output] 
$5 = {-2.43e-33, 2.4e-33, -1.93e22, 556, 2.163e-36}

(gdb) n

(gdb) n

(gdb) n

Until reach line 31

(gdb) p I

[output] 
$6 = 0

(gdb) p y[0]

[output] 
$7 = -2.12e14

(gdb) p x[0]

[output] 
$8 = -2.43e-33

(gdb) p a
[output] 
$9 = 1

We can see that there are multiple problems with this kernel. X and Y are not initialized. Each value of X is multiplied by 1.0 and then added to the existing value of Y.
Additional exercises:

Fix this code and step through it with rocgdb to verify that it is working correctly.
Pick one of the OpenMP (or OpenACC) examples in HPCTrainingExamples/Pragma_Examples/ [OpenMP|OpenACC]/Make/C/[reduction|saxpy|vecadd] and run through it to verify correctness. Change the code to cause an error and find it with the debugger. (See Stubbs video for an introduction to debugging OpenMP or OpenACC code. Hint: it is not that much different).

Profiling Tools

Rocprof

Get the exercise:

git clone https://github.com/AMD/HPCTrainingExamples.git

We assume you have reserved resources with salloc

Compile and run the code

cd HPCTrainingExamples/HIPIFY/mini-nbody/hip

Can compile and run all with

./HIP-nbody-orig.sh

Or just run compile and run one case

 hipcc --offload-arch=gfx90a -I../ -DSHMOO nbody-orig.cpp -o nbody-orig
srun -n 1 ./nbody-orig 65536
65536, 161.871

Note that we can find the compile line and the executable name by checking the file HIP-nbody-orig.sh or by watching the output from running the script.

cat HIP-nbody-orig.sh

...

EXE=nbody-orig
...
./$EXE 65536
...

The binary is called nbody-orig
Use rocprof with --stats

srun -n 1 rocprof --stats nbody-orig 65536

RPL: on '221130_200946' from '/global/software/rocm/rocm-5.3.0' in '/global/home/gmarko/HIP-Examples/mini-nbody/hip'
RPL: profiling '"nbody-orig" "65536"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_221130_200946_3670592'
RPL: result dir '/tmp/rpl_data_221130_200946_3670592/input_results_221130_200946'
ROCProfiler: input from "/tmp/rpl_data_221130_200946_3670592/input.xml"
  0 metrics
65536, 159.960

ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_200946_3670592/input_results_221130_200946
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.csv' is generating
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.stats.csv' is generating

Files with the prefix results are created

Check the files results.csv

You can see information for each kernel call with their duration

 cat results.csv
 
"Index","KernelName","gpu-id","queue-id","queue-index","pid","tid","grd","wgr","lds","scr","arch_vgpr","accum_vgpr","sgpr","wave_size","sig","obj","DispatchNs","BeginNs","EndNs","CompleteNs","DurationNs"
0,"bodyForce(Body*, float, int) [clone .kd]",0,0,0,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372809346673,1591372809872935,1591372836189584,1591372836215944,26316649
1,"bodyForce(Body*, float, int) [clone .kd]",0,0,2,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372836774261,1591372837000949,1591372863116796,1591372863132315,26115847
2,"bodyForce(Body*, float, int) [clone .kd]",0,0,4,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372863652552,1591372863877281,1591372889980009,1591372889994436,26102728
3,"bodyForce(Body*, float, int) [clone .kd]",0,0,6,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372890512133,1591372890735562,1591372916796147,1591372916817087,26060585
4,"bodyForce(Body*, float, int) [clone .kd]",0,0,8,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372917332974,1591372917556629,1591372943652575,1591372943667909,26095946
5,"bodyForce(Body*, float, int) [clone .kd]",0,0,10,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372944181896,1591372944405378,1591372970475883,1591372970491020,26070505
6,"bodyForce(Body*, float, int) [clone .kd]",0,0,12,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372971009527,1591372971233309,1591372997318181,1591372997339821,26084872
7,"bodyForce(Body*, float, int) [clone .kd]",0,0,14,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372997856209,1591372998080743,1591373024164495,1591373024180993,26083752
8,"bodyForce(Body*, float, int) [clone .kd]",0,0,16,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591373024701060,1591373024924818,1591373051023611,1591373051040364,26098793
9,"bodyForce(Body*, float, int) [clone .kd]",0,0,18,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591373051559851,1591373051782878,1591373077878145,1591373077902255,26095267

Check the statistics result file, one line per kernel

cat results.stats.csv
 
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"bodyForce(Body*, float, int) [clone .kd]",10,261124944,26112494,100.0

Profile the HIP calls with --hip-trace

srun -n 1 rocprof --stats --hip-trace nbody-orig 65536
RPL: on '221130_201416' from '/global/software/rocm/rocm-5.3.0' in '/global/home/gmarko/HIP-Examples/mini-nbody/hip'
RPL: profiling '"nbody-orig" "65536"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_221130_201416_3670892'
RPL: result dir '/tmp/rpl_data_221130_201416_3670892/input_results_221130_201416'
ROCTracer (pid=3670915):
    HIP-trace()
65536, 161.051
hsa_copy_deps: 0
scan ops data 29:30                                                                                                    File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.copy_stats.csv' is generating
dump json 19:20
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.hip_stats.csv' is generating
dump json 51:52
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.stats.csv' is generating
dump json 9:10
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating

Now we have new files with the hip in their name like below, check the file results.hip_stats.csv

 cat results.hip_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"hipMemcpy",20,486845521,24342276,99.89375113830629
"hipLaunchKernel",10,467008,46700,0.09582337501179998
"hipMalloc",1,30570,30570,0.006272527610042495
"hipFree",1,14210,14210,0.0029156891507590398
"__hipPushCallConfiguration",10,3510,351,0.0007202018943817191
"__hipPopCallConfiguration",10,2520,252,0.0005170680267355932

Profile also the HSA API with the --hsa-trace

srun -n 1 rocprof --stats --hip-trace --hsa-trace nbody-orig 65536
RPL: on '221130_201737' from '/global/software/rocm/rocm-5.3.0' in '/global/home/gmarko/HIP-Examples/mini-nbody/hip'
RPL: profiling '"nbody-orig" "65536"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_221130_201737_3671219'
RPL: result dir '/tmp/rpl_data_221130_201737_3671219/input_results_221130_201737'
ROCProfiler: input from "/tmp/rpl_data_221130_201737_3671219/input.xml"
  0 metrics
ROCTracer (pid=3671242):
    HSA-trace()
    HSA-activity-trace()
    HIP-trace()
65536, 155.978

ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_201737_3671219/input_results_221130_201737
hsa_copy_deps: 1
scan hsa API data 5953:5954                                                                                                    hsa_copy_deps: 0
scan hip API data 51:52                                                                                                    File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.csv' is generating
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.stats.csv' is generating
dump json 9:10
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.hsa_stats.csv' is generating
dump json 5963:5964
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.copy_stats.csv' is generating
dump json 19:20
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.hip_stats.csv' is generating
dump json 51:52
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating

See the content of the file results.hsa_stats.csv

cat results.hsa_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"hsa_signal_wait_scacquire",50,264955082,5299101,82.69977799679005
"hsa_queue_create",1,39868279,39868279,12.443987854568068
"hsa_amd_memory_async_copy",20,4917141,245857,1.5347751249357586
"hsa_amd_signal_async_handler",20,4262555,213127,1.3304608069344652
"hsa_signal_store_screlease",40,1945998,48649,0.60739956889069
"hsa_amd_memory_lock_to_pool",20,1418202,70910,0.44265990170591873
"hsa_amd_memory_unlock",20,723957,36197,0.22596691758953363
"hsa_agent_get_info",80,671007,8387,0.20943976433821374
"hsa_amd_memory_pool_allocate",5,597926,119585,0.18662917157599068
"hsa_system_get_info",5005,367139,73,0.11459419296574767
"hsa_executable_load_agent_code_object",2,216629,108314,0.0676158768966984
"hsa_executable_freeze",2,156380,78190,0.048810504729771616
"hsa_amd_agents_allow_access",4,89470,22367,0.02792605101785821
"hsa_signal_create",57,51500,903,0.016074568318092074
"hsa_code_object_reader_create_from_memory",2,20940,10470,0.006535950690890253
"hsa_isa_get_info_alt",2,18180,9090,0.005674478680056581
"hsa_signal_load_relaxed",236,17880,75,0.005580840418009442
"hsa_system_get_major_extension_table",3,14920,4973,0.004656942899144344
"hsa_amd_profiling_get_async_copy_time",40,13430,335,0.004191872864310224
"hsa_executable_create_alt",2,9030,4515,0.0028185116876188626
"hsa_amd_memory_pool_get_info",106,7570,71,0.002362805478989456
"hsa_amd_agent_iterate_memory_pools",27,7440,275,0.0023222288987690297
"hsa_amd_memory_pool_free",1,4710,4710,0.0014701207141400712
"hsa_executable_get_symbol_by_name",15,3599,239,0.0011233470170255023
"hsa_queue_add_write_index_screlease",20,3500,175,0.0010924463905499467
"hsa_amd_profiling_get_dispatch_time",20,3110,155,0.0009707166498886669
"hsa_signal_silent_store_relaxed",40,2800,70,0.0008739571124399574
"hsa_amd_agent_memory_pool_get_info",19,2490,131,0.0007771975749912477
"hsa_iterate_agents",2,2250,1125,0.0007022869653535371
"hsa_queue_load_read_index_relaxed",20,2240,112,0.0006991656899519659
"hsa_signal_destroy",20,2190,109,0.0006835593129441095
"hsa_queue_load_read_index_scacquire",20,1790,89,0.0005587082968812585
"hsa_executable_symbol_get_info",30,1540,51,0.00048067641184197657
"hsa_amd_profiling_async_copy_enable",1,440,440,0.00013733611766913615
"hsa_agent_iterate_isas",1,400,400,0.00012485101606285104
"hsa_amd_profiling_set_profiler_enabled",1,140,140,4.3697855621997866e-05
"hsa_dispatch",10,0,0,0.0

Download the results.json file on your laptop

From your laptop:
scp username@lumi.csc.fi:/path/results.json .

Visit the web page:

https://ui.perfetto.dev/

Click on the top left menu, "Open Trace File on the left top"
Select the file results.json

Zoom in/out: W/S
Move left/right: A/D

Read about the counters: vim /opt/rocm/rocprofiler/lib/gfx_metrics.xml

Create a file with the contents:

cat rocprof_counters.txt
pmc : Wavefronts VALUInsts VFetchInsts VWriteInsts VALUUtilization VALUBusy WriteSize
pmc : SALUInsts SFetchInsts LDSInsts FlatLDSInsts GDSInsts SALUBusy FetchSize
pmc : L2CacheHit MemUnitBusy MemUnitStalled WriteUnitStalled ALUStalledByLDS LDSBankConflict

Execute with using the counters

srun -n 1 rocprof --timestamp on -i rocprof_counters.txt  nbody-orig 65536
RPL: on '221130_205737' from '/global/software/rocm/rocm-5.3.0' in '/global/home/gmarko/HIP-Examples/mini-nbody/hip'
RPL: profiling '"nbody-orig" "65536"'
RPL: input file 'rocprof_counters.txt'
RPL: output dir '/tmp/rpl_data_221130_205737_3673574'
RPL: result dir '/tmp/rpl_data_221130_205737_3673574/input0_results_221130_205737'
ROCProfiler: input from "/tmp/rpl_data_221130_205737_3673574/input0.xml"
  gpu_index =
  kernel =
  range =
  7 metrics
    Wavefronts, VALUInsts, VFetchInsts, VWriteInsts, VALUUtilization, VALUBusy, WriteSize
65536, 155.389

ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_205737_3673574/input0_results_221130_205737
RPL: result dir '/tmp/rpl_data_221130_205737_3673574/input1_results_221130_205737'
ROCProfiler: input from "/tmp/rpl_data_221130_205737_3673574/input1.xml"
  gpu_index =
  kernel =
  range =
  7 metrics
    SALUInsts, SFetchInsts, LDSInsts, FlatLDSInsts, GDSInsts, SALUBusy, FetchSize
65536, 156.996

ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_205737_3673574/input1_results_221130_205737
RPL: result dir '/tmp/rpl_data_221130_205737_3673574/input2_results_221130_205737'
ROCProfiler: input from "/tmp/rpl_data_221130_205737_3673574/input2.xml"
  gpu_index =
  kernel =
  range =
  6 metrics
    L2CacheHit, MemUnitBusy, MemUnitStalled, WriteUnitStalled, ALUStalledByLDS, LDSBankConflict
65536, 155.264

ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_205737_3673574/input2_results_221130_205737
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/rocprof_counters.csv' is generating

Contents of the rocprof_counters.csv file

cat rocprof_counters.csv
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,Wavefronts,VALUInsts,VFetchInsts,VWriteInsts,VALUUtilization,VALUBusy,WriteSize,SALUInsts,SFetchInsts,LDSInsts,FlatLDSInsts,GDSInsts,SALUBusy,FetchSize,L2CacheHit,MemUnitBusy,MemUnitStalled,WriteUnitStalled,ALUStalledByLDS,LDSBankConflict,DispatchNs,BeginNs,EndNs,CompleteNs
0,"bodyForce(Body*, float, int) [clone .kd]",0,0,0,3673711,3673711,65536,256,0,0,20,4,16,64,0x0,0x7f2b4d282500,2048.0000000000,1212443.0000000000,12.0000000000,12.0000000000,100.0000000000,68.1476813493,7872.0000000000,131228.5000000000,65553.0000000000,0.0000000000,0.0000000000,0.0000000000,6.3483148000,9429.1875000000,96.5684331443,0.0250344612,0.0044357832,0.0102024550,0.0000000000,0.0000000000,1594244102859719,1594244111978746,1594244138305243,1594244138330792
...

Omnitrace

We have made special builds of the Omnitools, omnitrace and omniperf for use in the exercises

Reserve a GPU
Load Omnitrace

Declare PATH and LD_LIBRARY_PATH
It is temporarily installed here: /project/project_465000388/software/omnitrace/1.7.3/
Execute:

export PATH=/project/project_465000388/software/omnitrace/1.7.3/bin:$PATH
export LD_LIBRARY_PATH=/project/project_465000388/software/omnitrace/1.7.3/lib:$LD_LIBRARY_PATH
Allocate resources with salloc
Check the various options and their values and also a second command for description

srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace
srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief --description

Create an Omnitrace configuration file with description per option

srun -n 1 omnitrace-avail -G omnitrace_all.cfg --all

Declare to use this configuration file: export OMNITRACE_CONFIG_FILE=/path/omnitrace_all.cfg
Get the file https://github.com/ROCm-Developer-Tools/HIP/tree/develop/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp

or cp /project/project_465000388/exercises/AMD/MatrixTranspose.cpp .

Compile hipcc --offload-arch=gfx90a -o MatrixTranspose MatrixTranspose.cpp
Execute the binary: time srun -n 1 --gpus 1 ./MatrixTranspose and check the duration

Dynamic instrumentation

Execute dynamic instrumentation: time srun –n 1 –-gpus 1 omnitrace -- ./MatrixTranspose and check the duration
Check what the binary calls and gets instrumented: nm --demangle MatrixTranspose | egrep -i ' (t|u) '
Available functions to instrument: srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -- ./MatrixTranspose
- the simulate option means that it will not execute the binary

Binary rewriting

Binary rewriting: srun -n 1 --gpus 1 omnitrace -v -1 --print-available functions -o matrix.inst -- ./MatrixTranspose
- We created a new instrumented binary called matrix.inst
Executing the new instrumented binary: time srun -n 1 --gpus 1 ./matrix.inst and check the duration
See the list of the instrumented GPU calls: cat omnitrace-matrix.inst-output/TIMESTAMP/roctracer.txt

Visualization

Copy the perfetto-trace.proto to your laptop, open the web page https://ui.perfetto.dev/ click to open the trace and select the file

Hardware counters

See a list of all the counters: srun -n 1 --gpus 1 omnitrace-avail --all
Declare in your configuration file: OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy
Execute: srun -n 1 --gpus 1 ./matrix.inst and copy the perfetto file and visualize

Sampling

Activate in your configuration file OMNITRACE_USE_SAMPLING = true and OMNITRACE_SAMPLING_FREQ = 100, execute and visualize

Kernel timings

Open the file omnitrace-binary-output/timestamp/wall_clock.txt (replace binary and timestamp with your information)
In order to see the kernels gathered in your configuration file, make sure that OMNITRACE_USE_TIMEMORY = true and OMNITRACE_FLAT_PROFILE = true, execute the code and open again the file omnitrace-binary-output/timestamp/wall_clock.txt

Omniperf

We have made built the Omniperf without GUI support for use in the exercises

Load Omniperf:
export PATH=/project/project_465000388/software/omniperf/bin/:$PATH
export PYTHONPATH=/project/project_465000388/software/omniperf/python-libs:$PYTHONPATH
module load cray-python
module load rocm
Reserve a GPU, compile the exercise and execute Omniperf, observe how many times the code is executed

salloc -p standard-g --gpus 1 -t 00:40:00 -A project_465000388
git clone https://github.com/AMD/HPCTrainingExamples.git
cd HPCTrainingExamples/HIP/dgemm/
mkdir build
cd build
cmake ..
make
cd bin
srun -n 1 omniperf profile -n dgemm -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv

Run srun -n 1 --gpus 1 omniperf profile -h to see all the options
Now is created a workload in the directory workloads with the name dgemmoh I mean for (the argument of the -n). So, we can analyze it

 srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ &> dgemm_analyze.txt

If you want to only roofline analysis, then execute: srun -n 1 --gpus 1 omniperf profile -n dgemm --roof-only -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv

There is no need for srun to analyze but we want to avoid everybody to use the login node. Explore the file dgemm_analyze.txt

We can select specific IP Blocks, like:

srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ -b 7.1.2

But you need to know the code of the IP Block

If you have installed Omniperf on your laptop (no ROCm required for analysis) then you can download the data and execute:

omniperf analyze -p workloads/dgemm/mi200/ --gui

Open the web page: http://IP:8050/ The IP will be displayed in the output
Use another cod, for example: https://github.com/amd/HPCTrainingExamples/blob/main/HIP/saxpy/saxpy.cpp