# Offsite training
* Get the initial material
`cp -r /Shared/HPCTrainingExamples/ .`
## ROCgdb
* Save the following code in a file called, for example, `saxpy.hip`, or use the file from `HPCTrainingExamples/HIP/saxpy/`
```
#include <hip/hip_runtime.h>
#include <cstddef> // std::size_t
__constant__ float a = 1.0f;
const float a_h = 1.0f;
void init (int n, float *x, float *y)
{
for (std::size_t i = 0; i < n; ++i)
{
x[i] = 1.0;
y[i] = 0.0;
}
}
__global__
void saxpy (int n, float const* x, int incx, float* y, int incy)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < n)
y[i] += a*x[i];
}
int main()
{
int n = 256;
std::size_t size = sizeof(float)*n;
float *h_x = new float [n];
float *h_y = new float [n];
init(n, h_x, h_y);
float* d_x;
float *d_y;
hipMalloc(&d_x, size);
hipMalloc(&d_y, size);
hipMemcpy(d_x, h_x, size, hipMemcpyHostToDevice);
hipMemcpy(d_y, h_y, size, hipMemcpyHostToDevice);
int num_groups = 2;
int group_size = 128;
saxpy<<<num_groups, group_size>>>(n, d_x, 1, d_y, 1);
hipDeviceSynchronize();
hipMemcpy(h_y, d_y, size, hipMemcpyDeviceToHost);
delete [] h_x;
delete [] h_y;
hipFree(d_x);
hipFree(d_y);
return 0;
}
```
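* Note that the example above ignores the return codes of the HIP API calls; a small checking macro such as the sketch below (not part of the original example) makes failures visible before you reach for the debugger
```
#include <cstdio>
#include <cstdlib>
#include <hip/hip_runtime.h>

// Abort with a readable message if a HIP API call fails.
#define HIP_CHECK(call)                                                \
  do {                                                                 \
    hipError_t err_ = (call);                                          \
    if (err_ != hipSuccess) {                                          \
      std::fprintf(stderr, "HIP error '%s' at %s:%d\n",                \
                   hipGetErrorString(err_), __FILE__, __LINE__);       \
      std::exit(EXIT_FAILURE);                                         \
    }                                                                  \
  } while (0)

// Example usage: HIP_CHECK(hipMalloc(&d_x, size));
```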
* Compile
`hipcc --offload-arch=gfx90a -o saxpy saxpy.hip`
* Declare your GPU
`export HIP_VISIBLE_DEVICES=X`
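* For example, to make only the first GPU on the node visible (device 0 here is just an example; use the device you were assigned)
`export HIP_VISIBLE_DEVICES=0`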
* Execute
`./saxpy`
* Execute `saxpy` with rocgdb
`rocgdb saxpy`
* Inside rocgdb
```
(gdb) run
```
* Compile with debug information
`hipcc -ggdb --offload-arch=gfx90a -o saxpy saxpy.hip`
* Start rocgdb with the saxpy binary
`rocgdb saxpy`
* Inside rocgdb
```
(gdb) run
(gdb) i th
(gdb) t 1
(gdb) where
(gdb) b saxpy.hip:22
```
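* A possible continuation of the session, as a sketch: run again so the new breakpoint on the kernel line is hit, then inspect the waves and the kernel variables `i` and `n` from the saxpy source above
```
(gdb) r
(gdb) i th
(gdb) p n
(gdb) p i
(gdb) c
```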
* Compile with `-O0`
`hipcc -ggdb -O0 --offload-arch=gfx90a -o saxpy saxpy.hip`
* Inside rocgdb
```
(gdb) r
(gdb) i th
(gdb) show architecture
(gdb) b saxpy
(gdb) c
(gdb) show architecture
(gdb) info agents
(gdb) info queues
(gdb) info dispatches
```
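* Once stopped inside the kernel, you can also switch to one of the GPU waves reported by `i th` and look at where it is executing (the thread number below is just an example)
```
(gdb) t 5
(gdb) where
(gdb) info registers pc
(gdb) x/4i $pc
```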
## Omnitrace
* Load Omnitrace
```
module load rocm
module load omnitrace
```
* Before you execute any Omnitrace call, select a specific GPU:
`export HIP_VISIBLE_DEVICES=X`
* Check the various Omnitrace options and their values; the `--description` flag adds a short description of each option
`omnitrace-avail --categories omnitrace --brief --description`
* Create an Omnitrace configuration file with a description for each option
`omnitrace-avail -G omnitrace.cfg --all`
* Point Omnitrace to this configuration file:
`export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg`
* Activate the TIMEMORY option: edit `omnitrace.cfg`, find `OMNITRACE_USE_TIMEMORY`, and set it to true
`OMNITRACE_USE_TIMEMORY = true`
* Get the training examples:
`git clone https://github.com/amd/HPCTrainingExamples.git`
* Compile and execute Jacobi
* `cd HPCTrainingExamples/HIP/jacobi`
* `make clean; make`
* The resulting binary is `Jacobi_hip`
<!--
* Need to make some changes to the makefile
* ``MPICC=$(PREP) `which CC` ``
* `MPICFLAGS+=$(CFLAGS) -I${CRAY_MPICH_PREFIX}/include`
* `MPILDFLAGS+=$(LDFLAGS) -L${CRAY_MPICH_PREFIX}/lib -lmpich`
* comment out
* ``# $(error Unknown MPI version! Currently can detect mpich or open-mpi)``
-->
* Now execute the binary
* `time mpirun -np 1 ./Jacobi_hip -g 1 1`
* Check the duration
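* Putting the build-and-run steps above together (assuming the ROCm and MPI environment from the module loads above)
```
cd HPCTrainingExamples/HIP/jacobi
make clean; make
time mpirun -np 1 ./Jacobi_hip -g 1 1
```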
<!--
### Dynamic instrumentation (it may take a long time or fail)
* Execute dynamic instrumentation: `time srun -n 1 --gpus 1 omnitrace-instrument -- Jacobi_hip -g 1 1` and check the duration
* For the Jacobi example, since dynamic instrumentation would take a long time, check which functions the binary calls and which would get instrumented: `nm --demangle Jacobi_hip | egrep -i ' (t|u) '`
* Available functions to instrument (**it can take a long time**): `srun -n 1 --gpus 1 omnitrace-instrument -v 1 --simulate --print-available functions -- ./Jacobi_hip -g 1 1`
* the `--simulate` option means that the binary will not be executed
-->
### Binary rewriting (recommended for MPI codes; it reduces overhead)
* Binary rewriting: `omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip`
* This creates a new instrumented binary called `jacobi.inst`
* Execute the new instrumented binary: `time mpirun -n 1 omnitrace-run -- ./jacobi.inst -g 1 1` and check the duration
* See the list of the instrumented GPU calls: `cat omnitrace-jacobi.inst-output/TIMESTAMP/roctracer-0.txt`
* See the list of the instrumented CPU calls: `cat omnitrace-jacobi.inst-output/TIMESTAMP/wallclock-0.txt` (or `wallclock-1.txt`)
* Check the MPI calls
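* For reference, the binary-rewriting workflow above in one place (`TIMESTAMP` stands for the run's timestamped output directory)
```
# create the instrumented binary
omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip
# run it under omnitrace-run (1 MPI rank)
time mpirun -n 1 omnitrace-run -- ./jacobi.inst -g 1 1
# inspect the instrumented GPU and CPU calls
cat omnitrace-jacobi.inst-output/TIMESTAMP/roctracer-0.txt
cat omnitrace-jacobi.inst-output/TIMESTAMP/wallclock-0.txt
```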
### Visualization
* Copy the `perfetto-trace-0.proto` (or `perfetto-trace-1.proto`) file to your laptop, open https://ui.perfetto.dev/, click to open a trace, and select the copied file.
* Where are all the MPI calls?
* If an MPI call is not made from the main call stack, you need to profile the call stack (default depth: 64 levels)
### Call-stack
Edit your omnitrace.cfg:
```
OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100
```
Execute the instrumented binary again; now you can see the call stack when you visualize the trace with Perfetto.
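For example, reusing the instrumented binary from the binary-rewriting step:
```
omnitrace-run -- ./jacobi.inst -g 1 1
```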
### Hardware counters
* See a list of all the counters: `omnitrace-avail --all`
* Declare in your configuration file: `OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy`
* Execute `omnitrace-run -- ./jacobi.inst -g 1 1`, then copy the new Perfetto file and visualize it
### Kernel timings
* Open the file `omnitrace-binary-output/timestamp/wall_clock.txt` (replace `binary` and `timestamp` with your information)
* To see the kernels gathered according to your configuration file, make sure that `OMNITRACE_USE_TIMEMORY = true` and `OMNITRACE_FLAT_PROFILE = true` are set (see the sketch below), execute the code again, and reopen `omnitrace-binary-output/timestamp/wall_clock.txt`
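* A minimal sketch of the relevant `omnitrace.cfg` entries for flat kernel timings (only options already mentioned above)
```
OMNITRACE_USE_TIMEMORY = true
OMNITRACE_FLAT_PROFILE = true
```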
## Omniperf
* Add this to your `PATH`:
`export PATH=/opt/conda/bin/:$PATH`
* Load Omniperf
```
module load rocm
module load omniperf
```
* Go to the Omniperf examples:
```
cd HPCTrainingExamples/OmniperfExamples/
```
* Enter each directory and read the instructions even from a web page: https://github.com/amd/HPCTrainingExamples/tree/main/OmniperfExamples
* Before you execute any Omniperf call, select a specific GPU:
`export HIP_VISIBLE_DEVICES=X`
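* As a rough sketch of what the exercises run (the workload name is a placeholder, `./saxpy` just stands in for each example's binary, and flags may differ between Omniperf versions; follow each example's README for the exact commands)
```
omniperf profile -n my_workload -- ./saxpy
omniperf analyze -p workloads/my_workload/<arch>
```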