https://hackmd.io/@gmarkoma/eviden_training

Eviden training

  • Get the initial material

cp -r /tmp/HPCTrainingExamples/ .

Logistics

Reserve resources and log in to the node

Omnitrace

  • Load Omnitrace
export PATH=/opt/rocm-5.6.0/bin/:$PATH
export LD_LIBRARY_PATH=/opt/rocm-5.6.0/lib:$LD_LIBRARY_PATH
module use  /software/modulefiles/amd/
module load omnitrace

  • Before you execute any Omnitrace call, select a specific GPU:
    export HIP_VISIBLE_DEVICES=X where X is one of 0,1,...,7
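
For example, to make only GPU 0 visible:

export HIP_VISIBLE_DEVICES=0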

  • Check the various options and their values; the --description flag adds a short description for each option

omnitrace-avail --categories omnitrace --brief --description

  • Create an Omnitrace configuration file with a description for each option

omnitrace-avail -G omnitrace.cfg --all

  • Point Omnitrace to this configuration file:

export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg
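
For example, assuming the file was generated in the current working directory:

export OMNITRACE_CONFIG_FILE=$PWD/omnitrace.cfg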

  • Activate the Timemory output: edit omnitrace.cfg, find the OMNITRACE_USE_TIMEMORY option, and set it to true

OMNITRACE_USE_TIMEMORY = true

  • Compile and execute Jacobi
    • cd HPCTrainingExamples/HIP/jacobi
    • make clean;make
    • The binary produced is Jacobi_hip
  • Now execute the binary

    • time mpirun -np 1 ./Jacobi_hip -g 1 1
  • Check the duration

Dynamic instrumentation (it will take a long time or may fail)

  • Execute dynamic instrumentation and check the duration: time mpirun -np 1 omnitrace-instrument -- ./Jacobi_hip -g 1 1
  • For the Jacobi example, since dynamic instrumentation would take a long time, check which functions the binary calls and which would get instrumented: nm --demangle Jacobi_hip | egrep -i ' (t|u) '
  • List the available functions to instrument (this can take a long time): mpirun -np 1 omnitrace-instrument -v 1 --simulate --print-available functions -- ./Jacobi_hip -g 1 1
    • the --simulate option means the binary is not actually executed

Binary rewriting (recommended for MPI codes; it reduces overhead)

  • Binary rewriting: omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip

  • This creates a new instrumented binary called jacobi.inst

  • Execute the new instrumented binary and check the duration: time mpirun -n 1 omnitrace-run -- ./jacobi.inst -g 1 1

  • See the list of the instrumented GPU calls: cat omnitrace-jacobi.inst-output/TIMESTAMP/roctracer-0.txt

  • See the list of the instrumented CPU calls: cat omnitrace-jacobi.inst-output/TIMESTAMP/wallclock-0.txt

  • Check the MPI calls
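
For example, a quick way to spot them in the instrumented CPU output (TIMESTAMP as above) is:

grep -i mpi omnitrace-jacobi.inst-output/TIMESTAMP/wallclock-0.txt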

Visualization

  • Copy the perfetto-trace proto files to your laptop, open https://ui.perfetto.dev/, click "Open trace file", and select perfetto-trace-0.proto or perfetto-trace-1.proto.
  • Where are all the MPI calls?
  • If an MPI call is not made from the main call-stack, you need to profile the call-stack (default depth: 64 levels)

Call-stack

Edit your omnitrace.cfg:

OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100

Execute the instrumented binary again; now you can see the call-stack when you visualize the trace with Perfetto.
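
That is, rerun the same command as before and open the new trace:

time mpirun -n 1 omnitrace-run -- ./jacobi.inst -g 1 1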

Hardware counters

  • See a list of all the counters: omnitrace-avail --all
  • Declare in your configuration file: OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy
  • Execute: omnitrace-run -- ./jacobi.inst -g 1 1, then copy the Perfetto file to your laptop and visualize it

Kernel timings

  • Open the file omnitrace-binary-output/timestamp/wall_clock.txt (replace binary and timestamp with your information)
  • To see the kernels gathered, make sure that OMNITRACE_USE_TIMEMORY = true and OMNITRACE_FLAT_PROFILE = true are set in your configuration file, execute the code, and open omnitrace-binary-output/timestamp/wall_clock.txt again
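
For example, the relevant lines in omnitrace.cfg would look like this:

OMNITRACE_USE_TIMEMORY = true
OMNITRACE_FLAT_PROFILE = true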

GROMACS

  • Go to the GROMACS bin directory and create an instrumented binary:
    omnitrace-instrument -o gmx_mpi.inst -- gmx_mpi
  • Set up the environment:
export LD_LIBRARY_PATH=/software/mpi/openmpi/4.1.6/ucx/1.15.0/rocm/5.7.0/lib/:$LD_LIBRARY_PATH
export PATH=/software/mpi/openmpi/4.1.6/ucx/1.15.0/rocm/5.7.0/bin/:$PATH
export ROCM_DIR=/opt/rocm-5.7.0/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm-5.7.0/lib/
export PATH=/opt/rocm-5.7.0/bin/:$PATH
export PATH=$PATH:$ROCM_DIR/llvm/bin:$ROCM_DIR/bin
export OMPI_DIR=/software/mpi/openmpi/4.1.6/ucx/1.15.0/rocm/5.7.0/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$OMPI_DIR/lib
export C_INCLUDE_PATH=$C_INCLUDE_PATH:$OMPI_DIR/include
export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:$OMPI_DIR/include
export INCLUDE=$INCLUDE:$OMPI_DIR/include
export LD_LIBRARY_PATH=/software/mpi/ucx/1.15.0/rocm/5.7.0/lib/ucx/:$LD_LIBRARY_PATH

export OMP_NUM_THREADS=16
export PME=gpu
export TUNEPME=no
export GPUIDS=45236701

export MPICH_GPU_SUPPORT_ENABLED=1

export GMX_ENABLE_DIRECT_GPU_COMM=1
export AMD_DIRECT_DISPATCH=1
export GMX_GPU_PME_DECOMPOSITION=1
export GMX_FORCE_CUDA_AWARE_MPI=1
export GMX_GPU_DD_COMMS=true
export GMX_GPU_PME_PP_COMMS=true
export GMX_FORCE_GPU_AWARE_MPI=true
export GMX_FORCE_UPDATE_DEFAULT_GPU=true

export GMX_DIR=/home_nfs/xmarkomanolisg/Gromacs_new/build
  • Execute:
    srun -n 8 -c 8 omnitrace-run -- $GMX_DIR/bin/gmx_mpi.inst mdrun -resetstep 8000 -nsteps 300 -noconfout -nstlist 300 -nb gpu -bonded gpu -pme gpu -ntomp $OMP_NUM_THREADS -pin on -npme 1 -s benchPEP.tpr -gpu_id $GPUIDS -notunepme -maxh 0.32 -resethway -cpt 999999

Omniperf

  • Load Omniperf
export PATH=/opt/rocm-5.6.0/bin/:$PATH
export LD_LIBRARY_PATH=/opt/rocm-5.6.0/lib:$LD_LIBRARY_PATH
module use  /software/modulefiles/amd/
module load omniperf
  • Go to the Omniperf examples:
cd HPCTrainingExamples/OmniperfExamples/

  • As with Omnitrace, select a specific GPU before profiling with export HIP_VISIBLE_DEVICES=X, or add --device X to the omniperf call
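
A minimal sketch of a profiling run (the workload name jacobi is arbitrary, and Jacobi_hip is the binary built earlier, assuming it can run as a single process):

omniperf profile -n jacobi --device 0 -- ./Jacobi_hip -g 1 1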