https://hackmd.io/@gmarkoma/eviden_training

# Eviden training

* Get the initial material: `cp -r /tmp/HPCTrainingExamples/ .`

## Logistics

Reserve resources and log in to the node.

## Omnitrace

* Load Omnitrace:

```
export PATH=/opt/rocm-5.6.0/bin/:$PATH
export LD_LIBRARY_PATH=/opt/rocm-5.6.0/lib:$LD_LIBRARY_PATH
module use /software/modulefiles/amd/
module load omnitrace
```

* Before you execute any Omnitrace call, select a specific GPU: `export HIP_VISIBLE_DEVICES=X`, where X is 0,1,...,7
* Check the various options and their values; the `--description` flag adds a description for each one: `omnitrace-avail --categories omnitrace --brief --description`
* Create an Omnitrace configuration file with a description per option: `omnitrace-avail -G omnitrace.cfg --all`
* Tell Omnitrace to use this configuration file: `export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg`
* Activate the timemory output: edit `omnitrace.cfg`, find `OMNITRACE_USE_TIMEMORY`, and set it to true: `OMNITRACE_USE_TIMEMORY = true`
* Compile and execute Jacobi
  * `cd HPCTrainingExamples/HIP/jacobi`
  * `make clean; make`
  * The binary is `Jacobi_hip`

<!--
* Need to make some changes to the makefile
  * ``MPICC=$(PREP) `which CC` ``
  * `MPICFLAGS+=$(CFLAGS) -I${CRAY_MPICH_PREFIX}/include`
  * `MPILDFLAGS+=$(LDFLAGS) -L${CRAY_MPICH_PREFIX}/lib -lmpich`
  * comment out
    * ``# $(error Unknown MPI version! Currently can detect mpich or open-mpi)``
-->

* Now execute the binary
  * `time mpirun -np 1 Jacobi_hip -g 1 1`
  * Check the duration

### Dynamic instrumentation (it can take a long time or fail)

* Execute dynamic instrumentation: `time mpirun -np 1 omnitrace-instrument -- Jacobi_hip -g 1 1` and check the duration
* For the Jacobi example, since dynamic instrumentation would take a long time, check which symbols the binary calls and would therefore get instrumented: `nm --demangle Jacobi_hip | egrep -i ' (t|u) '`
* List the functions available for instrumentation (**it can take a long time**): `mpirun -np 1 omnitrace-instrument -v 1 --simulate --print-available functions -- ./Jacobi_hip -g 1 1`
  * The `--simulate` option means that the binary will not actually be executed

### Binary rewriting (recommended for MPI codes, lower overhead)

* Binary rewriting: `omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip`
  * This creates a new instrumented binary called `jacobi.inst`
* Execute the new instrumented binary: `time mpirun -n 1 omnitrace-run -- ./jacobi.inst -g 1 1` and check the duration
* See the list of the instrumented GPU calls: `cat omnitrace-jacobi.inst-output/TIMESTAMP/roctracer-0.txt`
* See the list of the instrumented CPU calls: `cat omnitrace-jacobi.inst-output/TIMESTAMP/wallclock-0.txt`
* Check the MPI calls

### Visualization

* Copy the `perfetto-trace-0.proto` (or `perfetto-trace-1.proto`) file to your laptop, open the web page https://ui.perfetto.dev/, click to open the trace, and select the file.
* Where are all the MPI calls?
* If an MPI call is not made from the main call-stack, you need to profile the call-stack (the default depth is 64 layers)

### Call-stack

Edit your `omnitrace.cfg`:

```
OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100
```

Execute the instrumented binary again; now you can see the call-stack when you visualize the trace with Perfetto.
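As a recap, the sketch below strings together the binary-rewriting steps from the sections above into one shell session. All commands and paths come from this document; the `*/` glob simply stands in for the per-run TIMESTAMP directory, and the chosen GPU index is arbitrary.

```bash
# Sketch: Omnitrace binary-rewriting workflow for the Jacobi example
# (modules/ROCm paths as loaded earlier in this training; adjust for your system)
export HIP_VISIBLE_DEVICES=0                       # profile on a single GPU
export OMNITRACE_CONFIG_FILE=$PWD/omnitrace.cfg    # configuration created with omnitrace-avail -G

cd HPCTrainingExamples/HIP/jacobi
make clean; make

# Rewrite the binary once, then run the instrumented version under omnitrace-run
omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip
time mpirun -np 1 omnitrace-run -- ./jacobi.inst -g 1 1

# Inspect the results (each run writes into a TIMESTAMP subdirectory)
cat omnitrace-jacobi.inst-output/*/roctracer-0.txt   # instrumented GPU calls
cat omnitrace-jacobi.inst-output/*/wallclock-0.txt   # instrumented CPU calls
```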
### Hardware counters

* See a list of all the counters: `omnitrace-avail --all`
* Declare in your configuration file: `OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy`
* Execute `omnitrace-run -- ./jacobi.inst -g 1 1`, then copy the Perfetto file and visualize it

### Kernel timings

* Open the file `omnitrace-binary-output/timestamp/wall_clock.txt` (replace binary and timestamp with your information)
* In order to see the kernel timings gathered, make sure your configuration file has `OMNITRACE_USE_TIMEMORY = true` and `OMNITRACE_FLAT_PROFILE = true`, execute the code, and open the file `omnitrace-binary-output/timestamp/wall_clock.txt` again

### GROMACS

* Go to the GROMACS bin directory and create the instrumented binary: `omnitrace-instrument -o gmx_mpi.inst -- gmx_mpi`
* Set up the environment:

```
export LD_LIBRARY_PATH=/software/mpi/openmpi/4.1.6/ucx/1.15.0/rocm/5.7.0/lib/:$LD_LIBRARY_PATH
export PATH=/software/mpi/openmpi/4.1.6/ucx/1.15.0/rocm/5.7.0/bin/:$PATH
export ROCM_DIR=/opt/rocm-5.7.0/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm-5.7.0/lib/
export PATH=/opt/rocm-5.7.0/bin/:$PATH
export PATH=$PATH:$ROCM_DIR/llvm/bin:$ROCM_DIR/bin
export OMPI_DIR=/software/mpi/openmpi/4.1.6/ucx/1.15.0/rocm/5.7.0/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$OMPI_DIR/lib
export C_INCLUDE_PATH=$C_INCLUDE_PATH:$OMPI_DIR/include
export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:$OMPI_DIR/include
export INCLUDE=$INCLUDE:$OMPI_DIR/include
export LD_LIBRARY_PATH=/software/mpi/ucx/1.15.0/rocm/5.7.0/lib/ucx/:$LD_LIBRARY_PATH
export OMP_NUM_THREADS=16
export PME=gpu
export TUNEPME=no
export GPUIDS=45236701
export MPICH_GPU_SUPPORT_ENABLED=1
export GMX_ENABLE_DIRECT_GPU_COMM=1
export AMD_DIRECT_DISPATCH=1
export GMX_GPU_PME_DECOMPOSITION=1
export GMX_FORCE_CUDA_AWARE_MPI=1
export GMX_GPU_DD_COMMS=true
export GMX_GPU_PME_PP_COMMS=true
export GMX_FORCE_GPU_AWARE_MPI=true
export GMX_FORCE_UPDATE_DEFAULT_GPU=true
export GMX_DIR=/home_nfs/xmarkomanolisg/Gromacs_new/build
```

* Execute: `srun -n 8 -c 8 omnitrace-run -- $GMX_DIR/bin/gmx_mpi.inst mdrun -resetstep 8000 -nsteps 300 -noconfout -nstlist 300 -nb gpu -bonded gpu -pme gpu -ntomp $OMP_NUM_THREADS -pin on -npme 1 -s benchPEP.tpr -gpu_id $GPUIDS -notunepme -maxh 0.32 -resethway -cpt 999999`

## Omniperf

* Load Omniperf:

```
export PATH=/opt/rocm-5.6.0/bin/:$PATH
export LD_LIBRARY_PATH=/opt/rocm-5.6.0/lib:$LD_LIBRARY_PATH
module use /software/modulefiles/amd/
module load omniperf
```

* Go to the Omniperf examples:

```
cd HPCTrainingExamples/OmniperfExamples/
```

* Enter each directory and read the instructions, which are also available on the web page: https://github.com/amd/HPCTrainingExamples/tree/main/OmniperfExamples
* Before you execute any Omniperf call, select a specific GPU: `export ROCR_VISIBLE_DEVICES=X`, or add `--device X` to the omniperf call
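For orientation, here is a minimal sketch of a typical Omniperf profile/analyze cycle. The workload name, application binary, and output subdirectory are illustrative assumptions; each exercise directory's instructions give the exact commands for that example.

```bash
# Sketch of an Omniperf profile/analyze cycle (illustrative workload name and
# binary; follow the per-exercise instructions for the exact invocation)
export ROCR_VISIBLE_DEVICES=0              # or add --device 0 to the omniperf call

# Collect hardware counters for the application into workloads/<name>/
omniperf profile -n my_workload --device 0 -- ./problem.exe

# Analyze the collected profile (the subdirectory depends on the GPU, e.g. mi200)
omniperf analyze -p workloads/my_workload/mi200
```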