# Offsite training

* Get the initial material: `cp -r /Shared/HPCTrainingExamples/ .`

## ROCgdb

* Save the following code in a file called, say, `saxpy.hip`, or use the file from `HPCTrainingExamples/HIP/saxpy/`

```
#include <hip/hip_runtime.h>
#include <cstddef>

__constant__ float a = 1.0f;
const float a_h = 1.0f;

// Initialize the host arrays
void init(int n, float *x, float *y)
{
  for (std::size_t i = 0; i < n; ++i) {
    x[i] = 1.0;
    y[i] = 0.0;
  }
}

// y = a*x + y, one element per thread
__global__ void saxpy(int n, float const *x, int incx, float *y, int incy)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n)
    y[i] += a * x[i];
}

int main()
{
  int n = 256;
  std::size_t size = sizeof(float) * n;

  float *h_x = new float[n];
  float *h_y = new float[n];
  init(n, h_x, h_y);

  float *d_x;
  float *d_y;
  hipMalloc(&d_x, size);
  hipMalloc(&d_y, size);
  hipMemcpy(d_x, h_x, size, hipMemcpyHostToDevice);
  hipMemcpy(d_y, h_y, size, hipMemcpyHostToDevice);

  int num_groups = 2;
  int group_size = 128;
  saxpy<<<num_groups, group_size>>>(n, d_x, 1, d_y, 1);
  hipDeviceSynchronize();

  hipMemcpy(h_y, d_y, size, hipMemcpyDeviceToHost);

  delete [] h_x;
  delete [] h_y;
  hipFree(d_x);
  hipFree(d_y);

  return 0;
}
```

* Compile: `hipcc --offload-arch=gfx90a -o saxpy saxpy.hip`
* Select a specific GPU: `export HIP_VISIBLE_DEVICES=X`
* Execute: `./saxpy`
* Execute `saxpy` with rocgdb: `rocgdb saxpy`
* Inside rocgdb:

```
(gdb) run
```

* Compile with debugging symbols: `hipcc -ggdb --offload-arch=gfx90a -o saxpy saxpy.hip`
* Start rocgdb with the saxpy binary: `rocgdb saxpy`
* Inside rocgdb:

```
(gdb) run
(gdb) i th
(gdb) t 1
(gdb) where
(gdb) b saxpy.hip:22
```

* Compile with `-O0`: `hipcc -ggdb -O0 --offload-arch=gfx90a -o saxpy saxpy.hip`
* Inside rocgdb:

```
(gdb) r
(gdb) i th
(gdb) show architecture
(gdb) b saxpy
(gdb) c
(gdb) show architecture
(gdb) info agents
(gdb) info queues
(gdb) info dispatches
```

## Omnitrace

* Load Omnitrace:

```
module load rocm
module load omnitrace
```

* Before you execute any Omnitrace call, select a specific GPU: `export HIP_VISIBLE_DEVICES=X`
* Check the various options and their values; the second flag adds a description for each option: `omnitrace-avail --categories omnitrace --brief --description`
* Create an Omnitrace configuration file with a description per option: `omnitrace-avail -G omnitrace.cfg --all`
* Declare that this configuration file should be used: `export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg`
* Activate the TIMEMORY option: edit `omnitrace.cfg`, find `OMNITRACE_USE_TIMEMORY`, and set it to true: `OMNITRACE_USE_TIMEMORY = true`
* Get the training examples: `git clone https://github.com/amd/HPCTrainingExamples.git`
* Compile and execute Jacobi (a consolidated setup sketch follows this list):
  * `cd HPCTrainingExamples/HIP/jacobi`
  * `make clean; make`
  * Binary: `Jacobi_hip`
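To see the setup in one place, here is a minimal consolidated sketch of the steps above, assuming the configuration file is kept in the current working directory and using `0` as an example GPU index:

```
# Load the tools and pin one GPU (0 is just an example index)
module load rocm
module load omnitrace
export HIP_VISIBLE_DEVICES=0

# Generate a default Omnitrace configuration and point Omnitrace at it
omnitrace-avail -G $PWD/omnitrace.cfg --all
export OMNITRACE_CONFIG_FILE=$PWD/omnitrace.cfg
# (edit omnitrace.cfg and set OMNITRACE_USE_TIMEMORY = true)

# Build the Jacobi example
git clone https://github.com/amd/HPCTrainingExamples.git
cd HPCTrainingExamples/HIP/jacobi
make clean; make   # produces the Jacobi_hip binary
```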
<!--
* Need to make some changes to the makefile
  * ``MPICC=$(PREP) `which CC` ``
  * `MPICFLAGS+=$(CFLAGS) -I${CRAY_MPICH_PREFIX}/include`
  * `MPILDFLAGS+=$(LDFLAGS) -L${CRAY_MPICH_PREFIX}/lib -lmpich`
  * comment out ``# $(error Unknown MPI version! Currently can detect mpich or open-mpi)``
-->

* Now execute the binary: `time mpirun -np 1 Jacobi_hip -g 1 1`
* Check the duration

<!--
### Dynamic instrumentation (it can take a long time or fail)

* Execute dynamic instrumentation: `time srun -n 1 --gpus 1 omnitrace-instrument -- Jacobi_hip -g 1 1` and check the duration
* Because dynamic instrumentation of the Jacobi example would take a long time, check what the binary calls and what gets instrumented: `nm --demangle Jacobi_hip | egrep -i ' (t|u) '`
* List the functions available to instrument (**it can take a long time**): `srun -n 1 --gpus 1 omnitrace-instrument -v 1 --simulate --print-available functions -- ./Jacobi_hip -g 1 1`
* The `--simulate` option means that it will not execute the binary
-->

### Binary rewriting (to be used with MPI codes; it decreases overhead)

* Binary rewriting: `omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip`
* This creates a new instrumented binary called `jacobi.inst`
* Execute the new instrumented binary: `time mpirun -n 1 omnitrace-run -- ./jacobi.inst -g 1 1` and check the duration
* See the list of the instrumented GPU calls: `cat omnitrace-jacobi.inst-output/TIMESTAMP/roctracer-0.txt`
* See the list of the instrumented CPU calls: `cat omnitrace-jacobi.inst-output/TIMESTAMP/wallclock-0.txt` or `wallclock-1.txt`
* Check the MPI calls

### Visualization

* Copy the Perfetto trace to your laptop, open https://ui.perfetto.dev/, click to open a trace, and select the file `perfetto-trace-0.proto` or `perfetto-trace-1.proto`.
* Where are all the MPI calls?
* If an MPI call is not issued from the main call stack, then you need to profile the call stack (default depth: 64 levels)

### Call-stack

Edit your `omnitrace.cfg`:

```
OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100
```

Execute the instrumented binary again; now you can see the call stack when you visualize with Perfetto.

### Hardware counters

* See a list of all the counters: `omnitrace-avail --all`
* Declare in your configuration file: `OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy`
* Execute `omnitrace-run -- ./jacobi.inst -g 1 1`, then copy the Perfetto file and visualize it

### Kernel timings

* Open the file `omnitrace-binary-output/timestamp/wall_clock.txt` (replace binary and timestamp with your information)
* To see the kernels gathered according to your configuration file, make sure that `OMNITRACE_USE_TIMEMORY = true` and `OMNITRACE_FLAT_PROFILE = true`, execute the code, and open the file `omnitrace-binary-output/timestamp/wall_clock.txt` again

## Omniperf

* Add this to your PATH: `export PATH=/opt/conda/bin/:$PATH`
* Load Omniperf:

```
module load rocm
module load omniperf
```

* Go to the Omniperf examples:

```
cd HPCTrainingExamples/OmniperfExamples/
```

* Enter each directory and read the instructions, which are also available online: https://github.com/amd/HPCTrainingExamples/tree/main/OmniperfExamples (a rough sketch of the overall flow follows this list)
* Before you execute any Omniperf call, select a specific GPU: `export HIP_VISIBLE_DEVICES=X`
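The README in each exercise directory gives the exact commands. As a rough sketch of the overall flow (the exercise directory and binary names below are placeholders, and the `omniperf profile` / `omniperf analyze` invocations are the generic Omniperf CLI rather than anything specific to these exercises):

```
export PATH=/opt/conda/bin/:$PATH
module load rocm
module load omniperf
export HIP_VISIBLE_DEVICES=0                            # pick one GPU; 0 is just an example

cd HPCTrainingExamples/OmniperfExamples/<exercise-dir>  # placeholder: one of the exercise directories
make

# Collect a profile; -n names the workload directory created under ./workloads/
omniperf profile -n exercise1 -- ./<binary>             # placeholder binary name

# Summarize the collected counters on the command line
# (the last path component matches your GPU family, e.g. mi200)
omniperf analyze -p workloads/exercise1/mi200
```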