Rocprof
https://github.com/ROCm-Developer-Tools/HIP-Examples/tree/master/mini-nbody
Can compile and run all with
Or just run compile and run one case
- Note that we can find the compile line and the executable name by checking the file
HIP-nbody-orig.sh
or by watching the output from running the script.
Files with the prefix results are created
- Check the files results.csv
You can see information for each kernel call with their duration
cat results.csv
"Index","KernelName","gpu-id","queue-id","queue-index","pid","tid","grd","wgr","lds","scr","arch_vgpr","accum_vgpr","sgpr","wave_size","sig","obj","DispatchNs","BeginNs","EndNs","CompleteNs","DurationNs"
0,"bodyForce(Body*, float, int) [clone .kd]",0,0,0,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372809346673,1591372809872935,1591372836189584,1591372836215944,26316649
1,"bodyForce(Body*, float, int) [clone .kd]",0,0,2,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372836774261,1591372837000949,1591372863116796,1591372863132315,26115847
2,"bodyForce(Body*, float, int) [clone .kd]",0,0,4,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372863652552,1591372863877281,1591372889980009,1591372889994436,26102728
3,"bodyForce(Body*, float, int) [clone .kd]",0,0,6,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372890512133,1591372890735562,1591372916796147,1591372916817087,26060585
4,"bodyForce(Body*, float, int) [clone .kd]",0,0,8,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372917332974,1591372917556629,1591372943652575,1591372943667909,26095946
5,"bodyForce(Body*, float, int) [clone .kd]",0,0,10,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372944181896,1591372944405378,1591372970475883,1591372970491020,26070505
6,"bodyForce(Body*, float, int) [clone .kd]",0,0,12,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372971009527,1591372971233309,1591372997318181,1591372997339821,26084872
7,"bodyForce(Body*, float, int) [clone .kd]",0,0,14,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372997856209,1591372998080743,1591373024164495,1591373024180993,26083752
8,"bodyForce(Body*, float, int) [clone .kd]",0,0,16,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591373024701060,1591373024924818,1591373051023611,1591373051040364,26098793
9,"bodyForce(Body*, float, int) [clone .kd]",0,0,18,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591373051559851,1591373051782878,1591373077878145,1591373077902255,26095267
- Check the statistics result file, one line per kernel
- Profile the HIP calls with
--hip-trace
Now we have new files with the hip
in their name like below, check the file results.hip_stats.csv
- Profile also the HSA API with the
--hsa-trace
- See the content of the file
results.hsa_stats.csv
cat results.hsa_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"hsa_signal_wait_scacquire",50,264955082,5299101,82.69977799679005
"hsa_queue_create",1,39868279,39868279,12.443987854568068
"hsa_amd_memory_async_copy",20,4917141,245857,1.5347751249357586
"hsa_amd_signal_async_handler",20,4262555,213127,1.3304608069344652
"hsa_signal_store_screlease",40,1945998,48649,0.60739956889069
"hsa_amd_memory_lock_to_pool",20,1418202,70910,0.44265990170591873
"hsa_amd_memory_unlock",20,723957,36197,0.22596691758953363
"hsa_agent_get_info",80,671007,8387,0.20943976433821374
"hsa_amd_memory_pool_allocate",5,597926,119585,0.18662917157599068
"hsa_system_get_info",5005,367139,73,0.11459419296574767
"hsa_executable_load_agent_code_object",2,216629,108314,0.0676158768966984
"hsa_executable_freeze",2,156380,78190,0.048810504729771616
"hsa_amd_agents_allow_access",4,89470,22367,0.02792605101785821
"hsa_signal_create",57,51500,903,0.016074568318092074
"hsa_code_object_reader_create_from_memory",2,20940,10470,0.006535950690890253
"hsa_isa_get_info_alt",2,18180,9090,0.005674478680056581
"hsa_signal_load_relaxed",236,17880,75,0.005580840418009442
"hsa_system_get_major_extension_table",3,14920,4973,0.004656942899144344
"hsa_amd_profiling_get_async_copy_time",40,13430,335,0.004191872864310224
"hsa_executable_create_alt",2,9030,4515,0.0028185116876188626
"hsa_amd_memory_pool_get_info",106,7570,71,0.002362805478989456
"hsa_amd_agent_iterate_memory_pools",27,7440,275,0.0023222288987690297
"hsa_amd_memory_pool_free",1,4710,4710,0.0014701207141400712
"hsa_executable_get_symbol_by_name",15,3599,239,0.0011233470170255023
"hsa_queue_add_write_index_screlease",20,3500,175,0.0010924463905499467
"hsa_amd_profiling_get_dispatch_time",20,3110,155,0.0009707166498886669
"hsa_signal_silent_store_relaxed",40,2800,70,0.0008739571124399574
"hsa_amd_agent_memory_pool_get_info",19,2490,131,0.0007771975749912477
"hsa_iterate_agents",2,2250,1125,0.0007022869653535371
"hsa_queue_load_read_index_relaxed",20,2240,112,0.0006991656899519659
"hsa_signal_destroy",20,2190,109,0.0006835593129441095
"hsa_queue_load_read_index_scacquire",20,1790,89,0.0005587082968812585
"hsa_executable_symbol_get_info",30,1540,51,0.00048067641184197657
"hsa_amd_profiling_async_copy_enable",1,440,440,0.00013733611766913615
"hsa_agent_iterate_isas",1,400,400,0.00012485101606285104
"hsa_amd_profiling_set_profiler_enabled",1,140,140,4.3697855621997866e-05
"hsa_dispatch",10,0,0,0.0
- Download the results.json file on your laptop
From your laptop:
scp username@lumi.csc.fi:/path/results.json .
https://ui.perfetto.dev/
Zoom in/out: W/S
Move left/right: A/D

Read about the counters: vim /opt/rocm/rocprofiler/lib/gfx_metrics.xml
- Create a file with the contents:
- Execute with using the counters
- Contents of the rocprof_counters.csv file
cat rocprof_counters.csv
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,Wavefronts,VALUInsts,VFetchInsts,VWriteInsts,VALUUtilization,VALUBusy,WriteSize,SALUInsts,SFetchInsts,LDSInsts,FlatLDSInsts,GDSInsts,SALUBusy,FetchSize,L2CacheHit,MemUnitBusy,MemUnitStalled,WriteUnitStalled,ALUStalledByLDS,LDSBankConflict,DispatchNs,BeginNs,EndNs,CompleteNs
0,"bodyForce(Body*, float, int) [clone .kd]",0,0,0,3673711,3673711,65536,256,0,0,20,4,16,64,0x0,0x7f2b4d282500,2048.0000000000,1212443.0000000000,12.0000000000,12.0000000000,100.0000000000,68.1476813493,7872.0000000000,131228.5000000000,65553.0000000000,0.0000000000,0.0000000000,0.0000000000,6.3483148000,9429.1875000000,96.5684331443,0.0250344612,0.0044357832,0.0102024550,0.0000000000,0.0000000000,1594244102859719,1594244111978746,1594244138305243,1594244138330792
...
Omnitrace
We have made special builds of the Omnitools, omnitrace and omniperf for use in the exercises
salloc -p gpu --gpus 1 -t 00:40:00
Load the module or declare PATH
and LD_LIBRARY_PATH
- Check the various options and their values and also a second command for description
srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace
srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief --description
- Create an Omnitrace configuration file with description per option
srun -n 1 omnitrace-avail -G omnitrace_all.cfg --all
- Execute the binary:
time srun -n 1 --gpus 1 ./MatrixTranspose
and check the duration
Dynamic instrumentation
- Execute dynamic instrumentation:
time srun –n 1 –gpus 1 omnitrace -- ./MatrixTranspose
and check the duration
- Check what the binary calls and gets instrumented:
nm --demangle MatrixTranspose | egrep -i ' (t|u) '
- Available functions to instrument:
srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -- ./MatrixTranspose
- the simulate option means that it will not execute the binary
Binary rewriting
-
Binary rewriting: srun -n 1 --gpus 1 omnitrace -v -1 --print-available functions -o matrix.inst -- ./MatrixTranspose
- We created a new instrumented binary called matrix.inst
-
Executing the new instrumented binary: time srun -n 1 --gpus 1 ./matrix.inst
and check the duration
-
See the list of the instrumented GPU calls: cat omnitrace-matrix.inst-output/TIMESTAMP/roctracer.txt
Visualization
- Copy the
perfetto-trace.proto
to your laptop, open the web page https://ui.perfetto.dev/ click to open the trace and select the file
Hardware counters
- See a list of all the counters:
srun -n 1 --gpus 1 omnitrace-avail --all
- Declare in your configuration file:
OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy
- Execute:
srun -n 1 --gpus 1 ./matrix.inst
and copy the perfetto file and visualize
Sampling
Activate in your configuration file OMNITRACE_USE_SAMPLING = true
and OMNITRACE_SAMPLING_FREQ = 100
, execute and visualize
Kernel timings
- Open the file
omnitrace-binary-output/timestamp/wall_clock.txt
(replace binary and timestamp with your information)
- In order to see the kernels gathered in your configuration file, make sure that
OMNITRACE_USE_TIMEMORY = true
and OMNITRACE_FLAT_PROFILE = true
, execute the code and open again the file omnitrace-binary-output/timestamp/wall_clock.txt
Omniperf
- Reserve a GPU, compile the exercise and execute Omniperf, observe how many times the code is executed
-
Run srun -n 1 --gpus 1 omniperf profile -h
to see all the options
-
Now is created a workload in the directory workloads with the name vcopy_all (the argument of the -n). So, we can analyze it
- If you want to only roofline analysis, then execute:
srun -n 1 --gpus 1 omniperf profile -n vcopy_all --roof-only -- ./vcopy 1048576 256
There is no need for srun to analyze but we want to avoid everybody to use the login node. Explore the file vcopy_analyze.txt
- We can select specific IP Blocks, like:
But you need to know the code of the IP Block
- If you have installed Omniperf on your laptop (no ROCm required for analysis) then you can download the data and execute: