# Rocprofv3
## Jacobi
### Setup environment
```
module load rocm/6.4.0
```
These exercises are based on the AMD training examples repository at https://github.com/amd/HPCTrainingExamples.
You can clone it into your home directory with:
`git clone https://github.com/amd/HPCTrainingExamples`
* Download the examples repo (if necessary) and navigate to the `jacobi` exercises:
```
cd ~/HPCTrainingExamples/HIP/jacobi
```
### Compile and run one case
```
make clean
make
mpirun -np 2 ./Jacobi_hip -g 2 1
```
### Let's profile HIP
```
mpirun -np 2 rocprofv3 --hip-trace -- ./Jacobi_hip -g 2 1
```
Now you have two files per MPI process: one with the hardware information (`XXXXX_agent_info.csv`) and one with the HIP API trace (`XXXXX_hip_api_trace.csv`), where XXXXX is a number.
```
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HIP_COMPILER_API","__hipRegisterFatBinary",1389712,1389712,1,4762229062888604,4762229062892624
"HIP_COMPILER_API","__hipRegisterFunction",1389712,1389712,2,4762229062903414,4762229062910744
"HIP_COMPILER_API","__hipRegisterFatBinary",1389712,1389712,3,4762229062911814,4762229062911924
...
"HIP_RUNTIME_API","hipGetDeviceCount",1389712,1389712,9,4762229067837299,4762229201986925
"HIP_RUNTIME_API","hipStreamCreate",1389712,1389712,10,4762229253999055,4762229484333519
"HIP_RUNTIME_API","hipStreamCreate",1389712,1389712,11,4762229484352199,4762229502251764
"HIP_RUNTIME_API","hipEventCreateWithFlags",1389712,1389712,12,4762229502311284,4762229502317444
"HIP_RUNTIME_API","hipEventCreateWithFlags",1389712,1389712,13,4762229502318894,4762229502319244
"HIP_RUNTIME_API","hipEventCreateWithFlags",1389712,1389712,14,4762229502320134,4762229502320454
...
```
* `Correlation_Id`: unique identifier used to correlate HIP and HSA async calls during activity tracing.
* `Start_Timestamp`: begin time in nanoseconds (ns) when the call begins execution.
* `End_Timestamp`: end time in ns when the call finishes execution.
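As a quick sanity check, the duration of each call is just `End_Timestamp - Start_Timestamp`. A minimal sketch with `awk`, assuming one of the generated files is named `12345_hip_api_trace.csv` (the numeric prefix will differ on your run):
```
# Print the function name and duration in ns of every HIP API call,
# skipping the CSV header line.
awk -F',' 'NR > 1 { gsub(/"/, ""); print $2, $7 - $6 }' 12345_hip_api_trace.csv
```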
### Let's create statistics
```
mpirun -np 2 rocprofv3 --stats --hip-trace -- ./Jacobi_hip -g 2 1
```
Now there is one extra file per MPI process, called `XXXXX_hip_stats.csv`, with content like:
```
"Name","Calls","TotalDurationNs","AverageNs","Percentage","MinNs","MaxNs","StdDev"
"hipMemcpy",1005,567684477,564860.176119,47.99,277461,4165053,163501.123978
"hipStreamCreate",4,257540667,64385166.750000,21.77,79530,226259882,108165143.720195
"hipStreamSynchronize",2000,143870836,71935.418000,12.16,6990,173191,64446.616580
"hipGetDeviceCount",2,139874248,69937124.000000,11.82,830,139873418,98904855.476912
"hipMalloc",7,18917455,2702493.571429,1.60,1520,4763843,2484579.981128
...
```
The `Percentage` column shows how much of the traced execution time each HIP API call accounts for. All calls to a given HIP API are aggregated into a single row; the `Calls` column shows how many times that HIP command was called.
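If you want to reorder this table yourself, standard command-line tools work directly on the CSV. A small sketch, assuming one of the files is named `12345_hip_stats.csv`, that sorts by `TotalDurationNs` (column 3), largest first, keeping the header on top:
```
( head -1 12345_hip_stats.csv ; tail -n +2 12345_hip_stats.csv | sort -t, -k3 -rn )
```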
### Where are the kernels?
```
mpirun -np 2 rocprofv3 --stats --kernel-trace --hip-trace -- ./Jacobi_hip -g 2 1
```
We have one extra file per MPI process, called `XXXXX_kernel_stats.csv`:
```
"Name","Calls","TotalDurationNs","AverageNs","Percentage","MinNs","MaxNs","StdDev"
"NormKernel1(int, double, double, double const*, double*)",1001,358330487,357972.514486,53.00,354720,384680,1797.388416
"JacobiIterationKernel(int, double, double, double const*, double const*, double*, double*)",1000,172351563,172351.563000,25.49,165120,206241,7074.371069
"LocalLaplacianKernel(int, int, int, double, double, double const*, double*)",1000,133229404,133229.404000,19.71,122561,168920,8710.277191
"HaloLaplacianKernel(int, int, int, double, double, double const*, double const*, double*)",1000,9922658,9922.658000,1.47,9080,12160,315.466626
"NormKernel2(int, double const*, double*)",1001,2269165,2266.898102,0.3356,2000,3560,168.847901
"__amd_rocclr_fillBufferAligned",1,3640,3640.000000,5.384e-04,3640,3640,0.00000000e+00
```
To get information on each individual kernel call, remove the `--stats` flag.
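For example, the same run without aggregation (the per-dispatch kernel records then end up in a per-process kernel trace CSV instead of the stats file):
```
mpirun -np 2 rocprofv3 --kernel-trace --hip-trace -- ./Jacobi_hip -g 2 1
```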
### Create pftrace file for Perfetto and Visualize
`mpirun -np 2 rocprofv3 --kernel-trace --hip-trace --output-format pftrace -- ./Jacobi_hip -g 2 1`
Now we have only pftrace files, one per MPI process.
* Merge the pftraces, if you want: `cat *_results.pftrace > jacobi.pftrace`
* Download the trace to your laptop and load the file into Perfetto.
`scp -P 7002 aac6.amd.com:<path_to_file>/jacobi.pftrace jacobi.pftrace`
1. Open a browser and go to [https://ui.perfetto.dev/](https://ui.perfetto.dev/).
2. Click on `Open trace file` in the top left corner.
3. Navigate to the `jacobi.pftrace` file (or one of the unmerged pftrace files) that you just downloaded.
4. Use the keystrokes W, A, S, D to zoom and pan in the GUI:
```
Navigation
w/s Zoom in/out
a/d Pan left/right
```
Feel free to experiment with the other flags presented in the presentation.
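For example, to keep each experiment's output separate you can combine the tracing flags above with an output directory (a sketch: `-d`/`--output-directory` is available in recent rocprofv3 versions; check `rocprofv3 --help` on your system):
```
# Write all trace files for this run into the jacobi_trace directory.
mpirun -np 2 rocprofv3 --stats --kernel-trace --hip-trace -d jacobi_trace -- ./Jacobi_hip -g 2 1
```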
### Hardware Counters
Read about the hardware counters available for the GPU on this system (look for the gfx90a section):
```
less $ROCM_PATH/lib/rocprofiler/gfx_metrics.xml
```
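To look up a specific counter without paging through the whole file, a quick `grep` on the same path works, e.g.:
```
# Show the definition of one counter of interest with a line of context.
grep -i -B 1 -A 1 'MeanOccupancyPerCU' $ROCM_PATH/lib/rocprofiler/gfx_metrics.xml
```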
Create a `rocprof_counters.txt` file with the counters you would like to collect
```
vi rocprof_counters.txt
```
Content for `rocprof_counters.txt`:
```
pmc: VALUUtilization VALUBusy FetchSize WriteSize MemUnitStalled
pmc: GPU_UTIL CU_OCCUPANCY MeanOccupancyPerCU MeanOccupancyPerActiveCU
```
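If you prefer not to open an editor, the same file can be written directly from the shell (this produces exactly the content above):
```
cat > rocprof_counters.txt << 'EOF'
pmc: VALUUtilization VALUBusy FetchSize WriteSize MemUnitStalled
pmc: GPU_UTIL CU_OCCUPANCY MeanOccupancyPerCU MeanOccupancyPerActiveCU
EOF
```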
Execute with the counters we just added:
```
mpirun -np 2 rocprofv3 -i rocprof_counters.txt --kernel-trace --hip-trace -- ./Jacobi_hip -g 2 1
```
You'll notice that `rocprofv3` runs 2 passes, one for each set of counters we have in that file.
Now the data are in two different directories, `pmc_1` and `pmc_2`, one for each set of counters (i.e. one per pass).
Explore the content of the pmc_* directories.
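For example (the exact file names inside the directories can differ between ROCm versions):
```
# List what each pass produced, then peek at the CSVs from the first pass.
ls pmc_1 pmc_2
head pmc_1/*.csv
```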
Try to use the `--hsa-trace` option also.
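For example, adding HSA API tracing to the same counter run:
```
mpirun -np 2 rocprofv3 -i rocprof_counters.txt --hsa-trace --kernel-trace --hip-trace -- ./Jacobi_hip -g 2 1
```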
### Tips
For OpenMP offloading applications, do not forget to add `--kernel-trace` in order to capture the offloaded kernels.
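A minimal sketch, assuming an OpenMP offload binary named `./my_openmp_app` (a placeholder, not part of this exercise):
```
# Kernels launched from OpenMP target regions only appear when
# --kernel-trace is enabled; ./my_openmp_app is a placeholder name.
rocprofv3 --kernel-trace --stats -- ./my_openmp_app
```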