# LUMI-G Pre-Hackathon Note: Reservation: small_g (-\-reservation=small_g)` ## Rocprof * Get the exercise: https://github.com/ROCm-Developer-Tools/HIP-Examples/tree/master/mini-nbody * Compile and run the code ``` cd mini-nbody/hip ``` Can compile and run all with ``` ./HIP-nbody-orig.sh ``` Or just run compile and run one case ``` hipcc --offload-arch=gfx90a -I../ -DSHMOO nbody-orig.cpp -o nbody-orig ./nbody-orig 65536 65536, 161.871 ``` * Note that we can find the compile line and the executable name by checking the file `HIP-nbody-orig.sh` or by watching the output from running the script. ``` cat HIP-nbody-orig.sh ... EXE=nbody-orig ... ./$EXE 65536 ... ``` * The binary is called `nbody-orig` * Use rocprof with `--stats` ``` rocprof --stats nbody-orig 65536 RPL: on '221130_200946' from '/global/software/rocm/rocm-5.3.0' in '/global/home/gmarko/HIP-Examples/mini-nbody/hip' RPL: profiling '"nbody-orig" "65536"' RPL: input file '' RPL: output dir '/tmp/rpl_data_221130_200946_3670592' RPL: result dir '/tmp/rpl_data_221130_200946_3670592/input_results_221130_200946' ROCProfiler: input from "/tmp/rpl_data_221130_200946_3670592/input.xml" 0 metrics 65536, 159.960 ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_200946_3670592/input_results_221130_200946 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.csv' is generating File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.stats.csv' is generating ``` Files with the prefix results are created * Check the files results.csv You can see information for each kernel call with their duration ``` cat results.csv "Index","KernelName","gpu-id","queue-id","queue-index","pid","tid","grd","wgr","lds","scr","arch_vgpr","accum_vgpr","sgpr","wave_size","sig","obj","DispatchNs","BeginNs","EndNs","CompleteNs","DurationNs" 0,"bodyForce(Body*, float, int) [clone .kd]",0,0,0,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372809346673,1591372809872935,1591372836189584,1591372836215944,26316649 1,"bodyForce(Body*, float, int) [clone .kd]",0,0,2,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372836774261,1591372837000949,1591372863116796,1591372863132315,26115847 2,"bodyForce(Body*, float, int) [clone .kd]",0,0,4,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372863652552,1591372863877281,1591372889980009,1591372889994436,26102728 3,"bodyForce(Body*, float, int) [clone .kd]",0,0,6,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372890512133,1591372890735562,1591372916796147,1591372916817087,26060585 4,"bodyForce(Body*, float, int) [clone .kd]",0,0,8,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372917332974,1591372917556629,1591372943652575,1591372943667909,26095946 5,"bodyForce(Body*, float, int) [clone .kd]",0,0,10,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372944181896,1591372944405378,1591372970475883,1591372970491020,26070505 6,"bodyForce(Body*, float, int) [clone .kd]",0,0,12,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372971009527,1591372971233309,1591372997318181,1591372997339821,26084872 7,"bodyForce(Body*, float, int) [clone .kd]",0,0,14,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372997856209,1591372998080743,1591373024164495,1591373024180993,26083752 8,"bodyForce(Body*, float, int) [clone .kd]",0,0,16,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591373024701060,1591373024924818,1591373051023611,1591373051040364,26098793 9,"bodyForce(Body*, float, int) [clone .kd]",0,0,18,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591373051559851,1591373051782878,1591373077878145,1591373077902255,26095267 ``` * Check the statistics result file, one line per kernel ``` cat results.stats.csv "Name","Calls","TotalDurationNs","AverageNs","Percentage" "bodyForce(Body*, float, int) [clone .kd]",10,261124944,26112494,100.0 ``` * Profile the HIP calls with `--hip-trace` ``` rocprof --stats --hip-trace nbody-orig 65536 RPL: on '221130_201416' from '/global/software/rocm/rocm-5.3.0' in '/global/home/gmarko/HIP-Examples/mini-nbody/hip' RPL: profiling '"nbody-orig" "65536"' RPL: input file '' RPL: output dir '/tmp/rpl_data_221130_201416_3670892' RPL: result dir '/tmp/rpl_data_221130_201416_3670892/input_results_221130_201416' ROCTracer (pid=3670915): HIP-trace() 65536, 161.051 hsa_copy_deps: 0 scan ops data 29:30 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.copy_stats.csv' is generating dump json 19:20 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.hip_stats.csv' is generating dump json 51:52 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.stats.csv' is generating dump json 9:10 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating ``` Now we have new files with the `hip` in their name like below, check the file `results.hip_stats.csv` ``` cat results.hip_stats.csv "Name","Calls","TotalDurationNs","AverageNs","Percentage" "hipMemcpy",20,486845521,24342276,99.89375113830629 "hipLaunchKernel",10,467008,46700,0.09582337501179998 "hipMalloc",1,30570,30570,0.006272527610042495 "hipFree",1,14210,14210,0.0029156891507590398 "__hipPushCallConfiguration",10,3510,351,0.0007202018943817191 "__hipPopCallConfiguration",10,2520,252,0.0005170680267355932 ``` * Profile also the HSA API with the `--hsa-trace` ``` rocprof --stats --hip-trace --hsa-trace nbody-orig 65536 RPL: on '221130_201737' from '/global/software/rocm/rocm-5.3.0' in '/global/home/gmarko/HIP-Examples/mini-nbody/hip' RPL: profiling '"nbody-orig" "65536"' RPL: input file '' RPL: output dir '/tmp/rpl_data_221130_201737_3671219' RPL: result dir '/tmp/rpl_data_221130_201737_3671219/input_results_221130_201737' ROCProfiler: input from "/tmp/rpl_data_221130_201737_3671219/input.xml" 0 metrics ROCTracer (pid=3671242): HSA-trace() HSA-activity-trace() HIP-trace() 65536, 155.978 ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_201737_3671219/input_results_221130_201737 hsa_copy_deps: 1 scan hsa API data 5953:5954 hsa_copy_deps: 0 scan hip API data 51:52 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.csv' is generating File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.stats.csv' is generating dump json 9:10 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.hsa_stats.csv' is generating dump json 5963:5964 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.copy_stats.csv' is generating dump json 19:20 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.hip_stats.csv' is generating dump json 51:52 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating ``` * See the content of the file `results.hsa_stats.csv` ``` cat results.hsa_stats.csv "Name","Calls","TotalDurationNs","AverageNs","Percentage" "hsa_signal_wait_scacquire",50,264955082,5299101,82.69977799679005 "hsa_queue_create",1,39868279,39868279,12.443987854568068 "hsa_amd_memory_async_copy",20,4917141,245857,1.5347751249357586 "hsa_amd_signal_async_handler",20,4262555,213127,1.3304608069344652 "hsa_signal_store_screlease",40,1945998,48649,0.60739956889069 "hsa_amd_memory_lock_to_pool",20,1418202,70910,0.44265990170591873 "hsa_amd_memory_unlock",20,723957,36197,0.22596691758953363 "hsa_agent_get_info",80,671007,8387,0.20943976433821374 "hsa_amd_memory_pool_allocate",5,597926,119585,0.18662917157599068 "hsa_system_get_info",5005,367139,73,0.11459419296574767 "hsa_executable_load_agent_code_object",2,216629,108314,0.0676158768966984 "hsa_executable_freeze",2,156380,78190,0.048810504729771616 "hsa_amd_agents_allow_access",4,89470,22367,0.02792605101785821 "hsa_signal_create",57,51500,903,0.016074568318092074 "hsa_code_object_reader_create_from_memory",2,20940,10470,0.006535950690890253 "hsa_isa_get_info_alt",2,18180,9090,0.005674478680056581 "hsa_signal_load_relaxed",236,17880,75,0.005580840418009442 "hsa_system_get_major_extension_table",3,14920,4973,0.004656942899144344 "hsa_amd_profiling_get_async_copy_time",40,13430,335,0.004191872864310224 "hsa_executable_create_alt",2,9030,4515,0.0028185116876188626 "hsa_amd_memory_pool_get_info",106,7570,71,0.002362805478989456 "hsa_amd_agent_iterate_memory_pools",27,7440,275,0.0023222288987690297 "hsa_amd_memory_pool_free",1,4710,4710,0.0014701207141400712 "hsa_executable_get_symbol_by_name",15,3599,239,0.0011233470170255023 "hsa_queue_add_write_index_screlease",20,3500,175,0.0010924463905499467 "hsa_amd_profiling_get_dispatch_time",20,3110,155,0.0009707166498886669 "hsa_signal_silent_store_relaxed",40,2800,70,0.0008739571124399574 "hsa_amd_agent_memory_pool_get_info",19,2490,131,0.0007771975749912477 "hsa_iterate_agents",2,2250,1125,0.0007022869653535371 "hsa_queue_load_read_index_relaxed",20,2240,112,0.0006991656899519659 "hsa_signal_destroy",20,2190,109,0.0006835593129441095 "hsa_queue_load_read_index_scacquire",20,1790,89,0.0005587082968812585 "hsa_executable_symbol_get_info",30,1540,51,0.00048067641184197657 "hsa_amd_profiling_async_copy_enable",1,440,440,0.00013733611766913615 "hsa_agent_iterate_isas",1,400,400,0.00012485101606285104 "hsa_amd_profiling_set_profiler_enabled",1,140,140,4.3697855621997866e-05 "hsa_dispatch",10,0,0,0.0 ``` * Download the results.json file on your laptop From your laptop: `scp username@lumi.csc.fi:/path/results.json .` * Visit the web page: https://ui.perfetto.dev/ * Click on the top left menu, "Open Trace File on the left top" * Select the file results.json Zoom in/out: W/S Move left/right: A/D ![](https://i.imgur.com/ZKgVBKI.png) Read about the counters: ` vim /opt/rocm/rocprofiler/lib/gfx_metrics.xml` * Create a file with the contents: ``` cat rocprof_counters.txt pmc : Wavefronts VALUInsts VFetchInsts VWriteInsts VALUUtilization VALUBusy WriteSize pmc : SALUInsts SFetchInsts LDSInsts FlatLDSInsts GDSInsts SALUBusy FetchSize pmc : L2CacheHit MemUnitBusy MemUnitStalled WriteUnitStalled ALUStalledByLDS LDSBankConflict ``` * Execute with using the counters ``` rocprof --timestamp on -i rocprof_counters.txt nbody-orig 65536 RPL: on '221130_205737' from '/global/software/rocm/rocm-5.3.0' in '/global/home/gmarko/HIP-Examples/mini-nbody/hip' RPL: profiling '"nbody-orig" "65536"' RPL: input file 'rocprof_counters.txt' RPL: output dir '/tmp/rpl_data_221130_205737_3673574' RPL: result dir '/tmp/rpl_data_221130_205737_3673574/input0_results_221130_205737' ROCProfiler: input from "/tmp/rpl_data_221130_205737_3673574/input0.xml" gpu_index = kernel = range = 7 metrics Wavefronts, VALUInsts, VFetchInsts, VWriteInsts, VALUUtilization, VALUBusy, WriteSize 65536, 155.389 ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_205737_3673574/input0_results_221130_205737 RPL: result dir '/tmp/rpl_data_221130_205737_3673574/input1_results_221130_205737' ROCProfiler: input from "/tmp/rpl_data_221130_205737_3673574/input1.xml" gpu_index = kernel = range = 7 metrics SALUInsts, SFetchInsts, LDSInsts, FlatLDSInsts, GDSInsts, SALUBusy, FetchSize 65536, 156.996 ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_205737_3673574/input1_results_221130_205737 RPL: result dir '/tmp/rpl_data_221130_205737_3673574/input2_results_221130_205737' ROCProfiler: input from "/tmp/rpl_data_221130_205737_3673574/input2.xml" gpu_index = kernel = range = 6 metrics L2CacheHit, MemUnitBusy, MemUnitStalled, WriteUnitStalled, ALUStalledByLDS, LDSBankConflict 65536, 155.264 ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_205737_3673574/input2_results_221130_205737 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/rocprof_counters.csv' is generating ``` * Contents of the rocprof_counters.csv file ``` cat rocprof_counters.csv Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,Wavefronts,VALUInsts,VFetchInsts,VWriteInsts,VALUUtilization,VALUBusy,WriteSize,SALUInsts,SFetchInsts,LDSInsts,FlatLDSInsts,GDSInsts,SALUBusy,FetchSize,L2CacheHit,MemUnitBusy,MemUnitStalled,WriteUnitStalled,ALUStalledByLDS,LDSBankConflict,DispatchNs,BeginNs,EndNs,CompleteNs 0,"bodyForce(Body*, float, int) [clone .kd]",0,0,0,3673711,3673711,65536,256,0,0,20,4,16,64,0x0,0x7f2b4d282500,2048.0000000000,1212443.0000000000,12.0000000000,12.0000000000,100.0000000000,68.1476813493,7872.0000000000,131228.5000000000,65553.0000000000,0.0000000000,0.0000000000,0.0000000000,6.3483148000,9429.1875000000,96.5684331443,0.0250344612,0.0044357832,0.0102024550,0.0000000000,0.0000000000,1594244102859719,1594244111978746,1594244138305243,1594244138330792 ... ``` ## Omnitrace * Load Omnitrace ``` module use /project/project_465000502/software/omnitrace192/share/modulefiles/ module load omnitrace/1.9.2 ``` * Use your own space, for example: `mkdir /scratch/project_465000502/$USER` `cd /scratch/project_465000502/$USER` * Allocate resources with `salloc` `salloc -N 1 -p small-g --gpus=1 -t 10:00 -A project_465000502` * Check the various options and their values and also a second command for description `srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace` `srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief --description` * Create an Omnitrace configuration file with description per option `srun -n 1 omnitrace-avail -G omnitrace.cfg --all` * Declare to use this configuration file: `export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg` * Get the training examples: `cp -r /project/project_465000502/exercises/AMD/HPCTrainingExamples .` * Compile and execute saxpy * `cd HIP/saxpy` * `hipcc --offload-arch=gfx90a -O3 -o saxpy saxpy.cpp` * `time srun -n 1 ./saxpy` * Check the duration * Compile and execute Jacobi * `cd HIP/jacobi` * `make` * `time srun -n 1 --gpus 1 Jacobi_hip -g 1 1` * Check the duration #### Dynamic instrumentation * Execute dynamic instrumentation: `time srun –n 1 –-gpus 1 omnitrace-instrument -- ./saxpy` and check the duration * Execute dynamic instrumentation: `time srun –n 1 –-gpus 1 omnitrace-instrument -- ./Jacobi_hip -g 1 1` and check the duration * Check what the binary calls and gets instrumented: `nm --demangle Jacobi_hip | egrep -i ' (t|u) '` * Available functions to instrument: `srun -n 1 --gpus 1 omnitrace -v 1 --simulate --print-available functions -- ./Jacobi_hip -g 1 1` * the simulate option means that it will not execute the binary #### Binary rewriting (to be used with MPI codes and decreases overhead) * Binary rewriting: `srun -n 1 --gpus 1 omnitrace-instrument -v -1 --print-available functions -o jacobi.inst -- ./Jacobi_hip` * We created a new instrumented binary called jacobi.inst * Executing the new instrumented binary: `time srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1` and check the duration * See the list of the instrumented GPU calls: `cat omnitrace-jacobi.inst-output/TIMESTAMP/roctracer.txt` #### Visualization * Copy the `perfetto-trace.proto` to your laptop, open the web page https://ui.perfetto.dev/ click to open the trace and select the file #### Hardware counters * See a list of all the counters: `srun -n 1 --gpus 1 omnitrace-avail --all` * Declare in your configuration file: `OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy` * Execute: `srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1` and copy the perfetto file and visualize #### Sampling Activate in your configuration file `OMNITRACE_USE_SAMPLING = true` and `OMNITRACE_SAMPLING_FREQ = 100`, execute and visualize #### Kernel timings * Open the file `omnitrace-binary-output/timestamp/wall_clock.txt` (replace binary and timestamp with your information) * In order to see the kernels gathered in your configuration file, make sure that `OMNITRACE_USE_TIMEMORY = true` and `OMNITRACE_FLAT_PROFILE = true`, execute the code and open again the file `omnitrace-binary-output/timestamp/wall_clock.txt` #### Call-stack Edit your omnitrace.cfg: ``` OMNITRACE_USE_SAMPLING = true;  OMNITRACE_SAMPLING_FREQ = 100 ``` Execute again the instrumented binary and now you can see the call-stack when you visualize with perfetto. ## Omniperf * Lod Omniperf: ``` module load cray-python module use /project/project_465000502/software/omniperf108/modules module load omniperf export ROOFLINE_BIN=/project/project_465000502/software/omniperf108/bin/utils/rooflines/roofline-sle15sp3-mi200-rocm5 ``` * Reserve a GPU, compile the exercise and execute Omniperf, observe how many times the code is executed ``` salloc -N1 -n 2 -p small-g --gpus 1 -t 00:40:00 cp -r /project/project_465000502/exercises/AMD/HPCTrainingExamples . cd HPCTrainingExamples/HIP/dgemm/ mkdir build cd build cmake .. make cd bin srun -n 1 omniperf profile -n dgemm -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv ``` * Run `srun -n 1 --gpus 1 omniperf profile -h` to see all the options * Now is created a workload in the directory workloads with the name dgemm (the argument of the -n). So, we can analyze it ``` srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ &> dgemm_analyze.txt ``` * If you want to only roofline analysis, then execute: `srun -n 1 omniperf profile -n dgemm --roof-only -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv` There is no need for srun to analyze but we want to avoid everybody to use the login node. Explore the file `dgemm_analyze.txt` * We can select specific IP Blocks, like: ``` srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ -b 7.1.2 ``` But you need to know the code of the IP Block * If you have installed Omniperf on your laptop (no ROCm required for analysis) then you can download the data and execute: ``` omniperf analyze -p workloads/dgemm/mi200/ --gui ``` * Open the web page: http://IP:8050/ The IP will be displayed in the output