
Omnitrace - ENCCS Hackathon

  • Reserve a GPU

  • Load Omnitrace

ml rocm/5.3.3
source /cfs/klemming/home/g/gmarkoma/Public/omnitrace/1.7.4/share/omnitrace/setup-env.sh
  • Allocate resources with salloc

  • List the available options and their values; the second command shows a short description of each option

srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace
srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief --description

  • Create an Omnitrace configuration file that includes a description of each option

srun -n 1 omnitrace-avail -G omnitrace_all.cfg --all
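For the generated file to take effect in the later steps, Omnitrace has to be told where it is, which is done through the OMNITRACE_CONFIG_FILE environment variable. A minimal sketch, assuming omnitrace_all.cfg was written to the current directory (adjust the path if you generated it elsewhere):

```bash
# Assumption: omnitrace_all.cfg sits in the current working directory.
export OMNITRACE_CONFIG_FILE="$PWD/omnitrace_all.cfg"

# Print the path so it is easy to confirm before launching a run.
echo "Omnitrace will read: $OMNITRACE_CONFIG_FILE"
```

Run this in the same shell from which you call srun, so that the variable is part of the job environment.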

or cp /project/project_465000388/exercises/AMD/MatrixTranspose.cpp .

  • Compile: hipcc --offload-arch=gfx90a -o MatrixTranspose MatrixTranspose.cpp

  • Execute the binary: time srun -n 1 --gpus 1 ./MatrixTranspose and check the duration

Dynamic instrumentation

  • Execute dynamic instrumentation: time srun -n 1 --gpus 1 omnitrace -- ./MatrixTranspose and check the duration
  • Check which functions the binary defines and calls, i.e. what can be instrumented: nm --demangle MatrixTranspose | egrep -i ' (t|u) '
  • Available functions to instrument: srun -n 1 --gpus 1 omnitrace -v 1 --simulate --print-available functions -- ./MatrixTranspose
    • The --simulate option means that the binary is not executed

Binary rewriting (recommended for MPI codes; it also decreases overhead)

  • Binary rewriting: srun -n 1 --gpus 1 omnitrace -v -1 --print-available functions -o matrix.inst -- ./MatrixTranspose

    • We created a new instrumented binary called matrix.inst
  • Executing the new instrumented binary: time srun -n 1 --gpus 1 ./matrix.inst and check the duration

  • See the list of the instrumented GPU calls: cat omnitrace-matrix.inst-output/TIMESTAMP/roctracer.txt
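The steps above ask you to check the duration of the uninstrumented run, the dynamically instrumented run, and the rewritten binary. One way to compare them is to express the slowdown as a percentage of the baseline. The helper below is a hypothetical convenience function, not part of Omnitrace, and the numbers are made-up placeholders for the "real" times reported by time:

```bash
# overhead_pct BASE INSTRUMENTED
# Prints the relative slowdown of INSTRUMENTED over BASE in percent.
overhead_pct() {
    awk -v base="$1" -v inst="$2" \
        'BEGIN { printf "%.1f\n", (inst - base) / base * 100 }'
}

# Placeholder timings in seconds: 2.0 s baseline, 3.0 s instrumented.
overhead_pct 2.0 3.0   # prints 50.0
```

Substitute your own measurements; a lower percentage for matrix.inst than for the dynamic run is what the note about decreased overhead refers to.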

Visualization

  • Copy the perfetto-trace.proto file to your laptop, open https://ui.perfetto.dev/ in a browser, click to open a trace file, and select the copied file
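Each run writes into a new timestamped directory, so it is easy to copy an old trace by mistake. A small sketch for finding the newest trace on the cluster; latest_trace is a hypothetical helper, and the output directory name assumes the matrix.inst binary from this exercise:

```bash
# latest_trace DIR
# Prints the most recently modified perfetto trace file found under DIR.
latest_trace() {
    find "$1" -type f -name 'perfetto-trace*.proto' -exec ls -t {} + | head -n 1
}

# Only look if the output directory of the instrumented binary exists.
if [ -d omnitrace-matrix.inst-output ]; then
    latest_trace omnitrace-matrix.inst-output
fi

# Then, on your laptop (user, host and path are placeholders):
#   scp user@cluster:/path/printed/above/perfetto-trace.proto .
```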

Hardware counters

  • See a list of all the counters: srun -n 1 --gpus 1 omnitrace-avail --all
  • Declare in your configuration file: OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy
  • Execute: srun -n 1 --gpus 1 ./matrix.inst and copy the perfetto file and visualize
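Instead of editing the configuration file by hand, the counter list can be set from the shell. The set_cfg function below is a hypothetical helper written for this exercise, not an Omnitrace tool; it assumes the "KEY = value" line format of the generated file and that the file is named omnitrace_all.cfg:

```bash
# set_cfg KEY VALUE FILE
# Replaces an existing uncommented "KEY = ..." line in FILE, or appends one.
# Limitation: VALUE must not contain the characters | or &.
set_cfg() {
    if grep -qE "^[[:space:]]*$1[[:space:]]*=" "$3" 2>/dev/null; then
        sed -i -E "s|^[[:space:]]*$1[[:space:]]*=.*|$1 = $2|" "$3"
    else
        printf '%s = %s\n' "$1" "$2" >> "$3"
    fi
}

# Counter list taken from the step above; the file name is an assumption.
set_cfg OMNITRACE_ROCM_EVENTS \
    "GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy" omnitrace_all.cfg
```

The same helper works for any other option in the file, for example switching a boolean setting between false and true.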

Sampling

  • Activate sampling in your configuration file with OMNITRACE_USE_SAMPLING = true and OMNITRACE_SAMPLING_FREQ = 100, then execute and visualize

Kernel timings

  • Open the file omnitrace-binary-output/timestamp/wall_clock.txt (replace binary and timestamp with your information)
  • To see the kernel timings gathered in a flat profile, make sure that your configuration file sets OMNITRACE_USE_TIMEMORY = true and OMNITRACE_FLAT_PROFILE = true, execute the code, and open the file omnitrace-binary-output/timestamp/wall_clock.txt again
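Before rerunning, it can help to confirm that both options really are enabled in the file Omnitrace reads. A minimal sketch: cfg_is_true is a hypothetical helper, it only recognizes the literal value true (in any letter case), and the fallback file name omnitrace_all.cfg is an assumption:

```bash
# cfg_is_true KEY FILE
# Succeeds if FILE contains an uncommented "KEY = true" line.
cfg_is_true() {
    grep -qiE "^[[:space:]]*$1[[:space:]]*=[[:space:]]*true[[:space:]]*$" "$2" 2>/dev/null
}

# Use the file Omnitrace is pointed at, or fall back to the assumed name.
cfg="${OMNITRACE_CONFIG_FILE:-omnitrace_all.cfg}"

for key in OMNITRACE_USE_TIMEMORY OMNITRACE_FLAT_PROFILE; do
    if cfg_is_true "$key" "$cfg"; then
        echo "$key is enabled"
    else
        echo "$key is NOT set to true in $cfg"
    fi
done
```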