Offsite training

  • Get the initial material

cp -r /Shared/HPCTrainingExamples/ .

ROCgdb

  • Save the following code in a file called, say, saxpy.hip, or use the file from HPCTrainingExamples/HIP/saxpy/
#include <cstddef>
#include <hip/hip_runtime.h>

__constant__ float a = 1.0f;
const float a_h = 1.0f;   // host-side copy of the constant (unused in this example)

void init (int n, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
    {
        x[i] = 1.0f;
        y[i] = 0.0f;
    }
}

__global__
void saxpy (int n, float const* x, int incx, float* y, int incy)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a*x[i];
}

int main()
{
    int n = 256;
    std::size_t size = sizeof(float)*n;

    float *h_x = new float [n];
    float *h_y = new float [n];
    init(n, h_x, h_y);

    float *d_x;
    float *d_y;
    hipMalloc(&d_x, size);
    hipMalloc(&d_y, size);
    hipMemcpy(d_x, h_x, size, hipMemcpyHostToDevice);
    hipMemcpy(d_y, h_y, size, hipMemcpyHostToDevice);

    // two workgroups of 128 work-items each: 256 threads, one per element
    int num_groups = 2;
    int group_size = 128;
    saxpy<<<num_groups, group_size>>>(n, d_x, 1, d_y, 1);
    hipDeviceSynchronize();
    hipMemcpy(h_y, d_y, size, hipMemcpyDeviceToHost);

    delete [] h_x;
    delete [] h_y;
    hipFree(d_x);
    hipFree(d_y);

    return 0;
}
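
The listing above omits error checking for brevity. In real code you would typically wrap each HIP call in a small helper; a minimal sketch (the HIP_CHECK name is chosen here for illustration, it is not part of the training example):

#include <cstdio>
#include <cstdlib>

// Abort with a message if a HIP API call returns an error
#define HIP_CHECK(call)                                                  \
    do {                                                                 \
        hipError_t err_ = (call);                                        \
        if (err_ != hipSuccess) {                                        \
            std::fprintf(stderr, "HIP error '%s' at %s:%d\n",            \
                         hipGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)

// Usage: HIP_CHECK(hipMalloc(&d_x, size));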
  • Compile

hipcc --offload-arch=gfx90a -o saxpy saxpy.hip
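
The --offload-arch value above targets MI200-series GPUs (gfx90a). If you are on a different system, you can query your GPU architecture first; a quick check, assuming rocminfo is in your PATH:

rocminfo | grep gfx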

  • Select the GPU to use (replace X with the device index)

export HIP_VISIBLE_DEVICES=X
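
For example, to use the first GPU on the node:

export HIP_VISIBLE_DEVICES=0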

  • Execute

./saxpy

  • Execute saxpy with rocgdb

rocgdb saxpy

  • Inside rocgdb
(gdb) run

  • Compile with debugging

hipcc -ggdb --offload-arch=gfx90a -o saxpy saxpy.hip

  • Start rocgdb with the saxpy binary

rocgdb saxpy

  • Inside rocgdb
(gdb) run
(gdb) i th
(gdb) t 1
(gdb) where
(gdb) b saxpy.hip:22
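
Here i th is short for info threads, t 1 switches to thread 1, where prints the current backtrace, and b sets a breakpoint on the given file and line; adjust the line number so it points at a line inside the saxpy kernel in your copy of the source.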
  • Compile with -O0

hipcc -ggdb -O0 --offload-arch=gfx90a -o saxpy saxpy.hip

  • Inside rocgdb
(gdb) r
(gdb) i th
(gdb) show architecture
(gdb) b saxpy
(gdb) c
(gdb) show architecture
(gdb) info agents
(gdb) info queues
(gdb) info dispatches
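
In ROCgdb, info agents lists the available GPU devices, info queues the HSA queues on those devices, and info dispatches the active kernel dispatches; show architecture reveals whether the current thread is a host thread or a wavefront on the AMD GPU.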

Omnitrace

  • Load Omnitrace

module load rocm
module load omnitrace
  • Before you execute any Omnitrace command, select a specific GPU:

export HIP_VISIBLE_DEVICES=X

  • List the Omnitrace options, their current values, and a brief description of each:

omnitrace-avail --categories omnitrace --brief --description

  • Create an Omnitrace configuration file with a description of every option

omnitrace-avail -G omnitrace.cfg --all

  • Point Omnitrace to this configuration file:

export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg
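
For example, if you generated the file in your home directory (the path here is illustrative):

export OMNITRACE_CONFIG_FILE=$HOME/omnitrace.cfg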

  • Activate the TIMEMORY output: edit omnitrace.cfg, find the OMNITRACE_USE_TIMEMORY option, and set it to true

OMNITRACE_USE_TIMEMORY = true

  • Get the training examples:

git clone https://github.com/amd/HPCTrainingExamples.git

  • Compile and execute Jacobi
    • cd HPCTrainingExamples/HIP/jacobi
    • make clean; make
    • The resulting binary is called Jacobi_hip
  • Now execute the binary

    • time mpirun -np 1 ./Jacobi_hip -g 1 1
  • Check the duration; the real time reported by time is the baseline to compare against the instrumented runs later

Binary rewriting (recommended for MPI codes; it reduces the instrumentation overhead)

  • Binary rewriting: omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip

  • This creates a new instrumented binary called jacobi.inst

  • Execute the new instrumented binary and check the duration: time mpirun -n 1 omnitrace-run -- ./jacobi.inst -g 1 1

  • See the list of instrumented GPU calls: cat omnitrace-jacobi.inst-output/TIMESTAMP/roctracer-0.txt

  • See the list of instrumented CPU calls: cat omnitrace-jacobi.inst-output/TIMESTAMP/wallclock-0.txt or wallclock-1.txt
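
The numeric suffix in these file names (-0, -1, ...) identifies the process or rank that produced the file.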

  • Check the MPI calls

Visualization

  • Copy perfetto-trace-0.proto (or perfetto-trace-1.proto) to your laptop, open https://ui.perfetto.dev/, click to open the trace, and select the file.
  • Where are all the MPI calls?
  • If an MPI call is not made from the main call stack, then you need to profile the call stack (default depth: 64 layers)

Call-stack

Edit your omnitrace.cfg:

OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100

Execute the instrumented binary again; now you can see the call stack when you visualize the trace with Perfetto.

Hardware counters

  • See a list of all the counters: omnitrace-avail --all
  • Declare in your configuration file: OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy
  • Execute omnitrace-run -- ./jacobi.inst -g 1 1, then copy the Perfetto file and visualize it as before

Kernel timings

  • Open the file omnitrace-binary-output/timestamp/wall_clock.txt (replace binary and timestamp with your information)
  • In order to see the kernels gathered, make sure that OMNITRACE_USE_TIMEMORY = true and OMNITRACE_FLAT_PROFILE = true are set in your configuration file, execute the code, and open the file omnitrace-binary-output/timestamp/wall_clock.txt again
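
In omnitrace.cfg the two options look like this:

OMNITRACE_USE_TIMEMORY = true
OMNITRACE_FLAT_PROFILE = true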

Omniperf

  • Add this to your PATH:
    export PATH=/opt/conda/bin/:$PATH

  • Load Omniperf

module load rocm
module load omniperf
  • Go to the Omniperf examples:

cd HPCTrainingExamples/OmniperfExamples/
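
From one of the exercise directories you can then profile a run and analyze the result. A typical invocation looks like the sketch below (the workload name and binary are placeholders; check omniperf --help for the exact options on your system):

omniperf profile -n workload_name -- ./your_app
omniperf analyze -p workloads/workload_name/mi200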