# Performance Analysis on AMD GPUs, CRAY User Group Tutorial 2025

<!--
## Logistics
* Access to LUMI: ssh username@lumi.csc.fi
* Project account: project_465000532
* Slides: /project/project_465000532/slides/
* Working space: /scratch/project_465000532/$USER
-->

These exercises are based on the AMD training project at https://github.com/amd/HPCTrainingExamples. You can clone it to your home folder with: `git clone https://github.com/amd/HPCTrainingExamples`

## Rocprofiler-Systems (Rocprofsys)

* Load rocprof-sys:
```
module load rocm/6.4.0
module load rocprofiler-systems/6.4.0
```
<!--
```
module load LUMI/22.08 partition/G rocm/5.3.3
module use /project/project_465000532/software/ROCPROFSYS_rocm533/share/modulefiles/
module load rocprof-sys/1.10.0
```
-->

Use `export HIP_VISIBLE_DEVICES=1,2,3` to avoid everyone running on GPU 0 only. If you reserve an APU, do not forget to release it.

* Allocate resources with `salloc`:
  `salloc -N 1 --ntasks=1 --partition=gpu-dev --gpus=1 -A XXX --time=00:35:00`
* Check the various options and their values; the second command also prints a description for each option:
  * `srun -n 1 --gpus 1 rocprof-sys-avail --categories rocprof-sys`
  * `srun -n 1 --gpus 1 rocprof-sys-avail --categories rocprof-sys --brief --description`
* Create a rocprof-sys configuration file with a description for each option:
  `srun -n 1 rocprof-sys-avail -G rocprof-sys.cfg --all`
* Declare that this configuration file should be used:
  `export ROCPROFSYS_CONFIG_FILE=/path/rocprof-sys.cfg`
* Get the training examples: `cp -r /your_path/exercises/HPCTrainingExamples/ .`
<!--
* Compile and execute saxpy
  * `cd HPCTrainingExamples/HIP/saxpy`
  * `hipcc --offload-arch=gfx90a -O3 -o saxpy saxpy.cpp`
  * `time srun -n 1 ./saxpy`
  * Check the duration
-->
* Compile and execute Jacobi
  * `cd HPCTrainingExamples/HIP/jacobi`
  * `make clean; make -f Makefile`
  * The resulting binary is `Jacobi_hip`
<!--
* Need to make some changes to the makefile
  * ``MPICC=$(PREP) `which CC` ``
  * `MPICFLAGS+=$(CFLAGS) -I${CRAY_MPICH_PREFIX}/include`
  * `MPILDFLAGS+=$(LDFLAGS) -L${CRAY_MPICH_PREFIX}/lib -lmpich`
  * comment out ``# $(error Unknown MPI version! Currently can detect mpich or open-mpi)``
-->
* Now execute the binary
  * `time srun -n 1 --gpus 1 Jacobi_hip -g 1 1`
  * Check the duration
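The setup and build steps above can be collected into one helper script run inside the `salloc` session. This is a minimal sketch using only commands from this tutorial; the configuration-file location (`$PWD`) is an assumption, adjust it to your working directory:

```
# Minimal sketch (assumption: run inside the salloc session obtained above).
# Load the profiling stack.
module load rocm/6.4.0
module load rocprofiler-systems/6.4.0

# Generate a default rocprof-sys configuration once and point the tools to it
# (the $PWD location is an assumption; any writable path works).
srun -n 1 rocprof-sys-avail -G rocprof-sys.cfg --all
export ROCPROFSYS_CONFIG_FILE=$PWD/rocprof-sys.cfg

# Build the Jacobi example and take a baseline (non-instrumented) timing
# to compare against the instrumented runs later.
cd HPCTrainingExamples/HIP/jacobi
make clean; make -f Makefile
time srun -n 1 --gpus 1 ./Jacobi_hip -g 1 1
```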
### Dynamic instrumentation (it will take a long time)
* Execute dynamic instrumentation: `time srun -n 1 --gpus 1 rocprof-sys-instrument -- Jacobi_hip -g 1 1` and check the duration
<!-- * Execute dynamic instrumentation: `time srun -n 1 --gpus 1 rocprof-sys-instrument -- ./Jacobi_hip -g 1 1` and check the duration (may fail?) -->
* For the Jacobi example, since dynamic instrumentation would take a long time, first check which functions the binary defines or calls and would therefore get instrumented: `nm --demangle Jacobi_hip | egrep -i ' (t|u) '`
* List the functions available for instrumentation (**it can take a long time**): `srun -n 1 --gpus 1 rocprof-sys-instrument -v 1 --simulate --print-available functions -- ./Jacobi_hip -g 1 1`
  * The `--simulate` option means that the binary is not executed.

### Binary rewriting (to be used with MPI codes; it decreases overhead)
* Allocation: `salloc -N 1 --ntasks=2 --partition=small-g --gpus=2 -A project_465000532 --time=00:35:00`
* Binary rewriting: `srun -n 1 --gpus 1 rocprof-sys-instrument -v -1 --print-available functions -o jacobi.inst -- ./Jacobi_hip` or `srun -n 1 --gpus 1 rocprof-sys-instrument -o jacobi.inst -- ./Jacobi_hip`
* This creates a new instrumented binary called `jacobi.inst`.
* Activate profiling:
  * Edit the rocprof-sys.cfg file and set the parameter: `ROCPROFSYS_PROFILE = true`
* Execute the new instrumented binary: `time srun -n 2 --gpus 2 rocprof-sys-run -- ./jacobi.inst -g 2 1` and check the duration
* See the list of the instrumented GPU calls: `cat rocprof-sys-jacobi.inst-output/TIMESTAMP/roctracer-0.txt`
* See the list of the instrumented CPU calls: `cat rocprof-sys-jacobi.inst-output/TIMESTAMP/wallclock-0.txt` or `wallclock-1.txt`
* Check the MPI calls

### Visualization
* Copy the Perfetto trace files to your laptop, open the web page https://ui.perfetto.dev/, click to open a trace, and select the file `perfetto-trace-0.proto` or `perfetto-trace-1.proto`.
* Where are all the MPI calls?
* If an MPI call is not made from the main call stack, then you need to profile the call stack (default depth: 64 levels)

### Call-stack
Edit your rocprof-sys.cfg:
```
ROCPROFSYS_USE_SAMPLING = true
ROCPROFSYS_SAMPLING_FREQ = 100
```
Execute the instrumented binary again; now you can see the call stack when you visualize the trace with Perfetto.

### Hardware counters
* See a list of all the counters: `srun -n 1 --gpus 1 rocprof-sys-avail --all`
* Declare in your configuration file: `ROCPROFSYS_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy`
* Execute: `srun -n 1 --gpus 1 rocprof-sys-run -- ./jacobi.inst -g 1 1`, then copy the Perfetto file and visualize it

### Kernel timings
* Open the file `rocprof-sys-binary-output/timestamp/wall_clock.txt` (replace binary and timestamp with your information)
* To see the gathered kernels, make sure that your configuration file has `ROCPROFSYS_USE_TIMEMORY = true` and `ROCPROFSYS_FLAT_PROFILE = true`, execute the code, and open the file `rocprof-sys-binary-output/timestamp/wall_clock.txt` again
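The `rocprof-sys.cfg` settings used across the sections above (profiling, call-stack sampling, hardware counters, kernel timings) can also be collected in one place. A minimal sketch, assuming you prefer a small dedicated file over editing the generated `rocprof-sys.cfg`; the `~/rocprof-sys.cfg` path is an assumption:

```
# Minimal sketch: collect the settings used in the rocprof-sys sections above
# into one configuration file (options not listed keep their defaults).
cat > ~/rocprof-sys.cfg << 'EOF'
ROCPROFSYS_PROFILE       = true
ROCPROFSYS_USE_SAMPLING  = true
ROCPROFSYS_SAMPLING_FREQ = 100
ROCPROFSYS_ROCM_EVENTS   = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy
ROCPROFSYS_USE_TIMEMORY  = true
ROCPROFSYS_FLAT_PROFILE  = true
EOF
export ROCPROFSYS_CONFIG_FILE=~/rocprof-sys.cfg
```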
## Rocprof-compute
* Load rocprof-compute:
```
module load rocm/6.4.0
module load rocprof-compute/6.4.0
```
* Reserve a GPU, compile the exercise, and execute rocprof-compute; observe how many times the code is executed:
```
salloc -N 1 --ntasks=1 --partition=small-g --gpus=1 --time=00:30:00
cp -r /project/project_465000532/exercises/HPCTrainingExamples/ .
cd HPCTrainingExamples/HIP/dgemm/
mkdir build
cd build
cmake ..
make
cd bin
srun -n 1 --gpus 1 rocprof-compute profile -n dgemm -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv
```

### Execution with all the analysis
* Run `srun -n 1 --gpus 1 rocprof-compute profile -h` to see all the options
* A workload is now created in the `workloads` directory under the name `dgemm` (the argument of `-n`), so we can analyze it:
```
srun -n 1 --gpus 1 rocprof-compute analyze -p workloads/dgemm/mi200/ &> dgemm_analyze.txt
```

### Roofline-only execution
* If you want only the roofline analysis, execute: `srun -n 1 rocprof-compute profile -n dgemm --roof-only -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv`
* When you use the `--roof-only` option, two roofline PDF files are created, one for FP32/FP64 and one for FP16/INT8
* If you also want the kernel names on the roofline, add `--kernel-names` to the command above; this creates another PDF file with the markers and their corresponding kernel names.

Analyze the workload again:
```
srun -n 1 --gpus 1 rocprof-compute analyze -p workloads/dgemm/mi200/ &> dgemm_analyze.txt
```
There is no need for `srun` in the analysis step, but we use it to avoid everyone running on the login node. Explore the file `dgemm_analyze.txt`.

* We can select specific IP blocks, for example:
```
srun -n 1 --gpus 1 rocprof-compute analyze -p workloads/dgemm/mi200/ -b 7.1.2
```
but you need to know the ID of the IP block.

### Visualize
* If you have rocprof-compute installed on your laptop (no ROCm is required for the analysis), you can download the data and execute:
```
rocprof-compute analyze -p workloads/dgemm/mi200/ --gui
```
* Open the web page http://IP:8050/ (the IP address is displayed in the output)

### Too large profiling data
* Use a tool like rocprof to get the top 10 kernels
* Then add the option `-k 0` to the commands above, where `0` is the `id` of one of the top-10 kernels (you can pass the kernel name instead). This way you analyze only a specific kernel. This has to be done 10 times to analyze all 10 kernels and visualize each of them separately (see the sketch below).
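A minimal sketch of that per-kernel loop, assuming the kernel `id`s of interest are `0` through `9` (check the actual ids from the rocprof output or a full analyze run first); the output file names `dgemm_analyze_kernel_<id>.txt` are hypothetical:

```
# Minimal sketch: analyze the top kernels one at a time, as described above.
# Assumption: the interesting kernel ids are 0-9; adjust the list as needed.
for id in $(seq 0 9); do
  srun -n 1 --gpus 1 rocprof-compute analyze -p workloads/dgemm/mi200/ \
    -k ${id} &> dgemm_analyze_kernel_${id}.txt
done
```

Each output file can then be explored (or visualized with `--gui`) separately.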