# LUMI pre-hackathon training - Oct. 2024
# Omniperf - part 2
--------------------------------------------------------------
## Environment for LUMI
```
module load CrayEnv
module load buildtools/24.03
module load PrgEnv-cray/8.5.0
module load cce/17.0.1
module load craype-accel-amd-gfx90a
module load craype-x86-trento
module load cray-python
module use /pfs/lustrep3/scratch/project_462000394/amd-sw/modules
module load rocm/6.0.3 omnitrace/1.12.0-rocm6.0.x omniperf/2.1.0
```
You can set up the following environment variables for the project you want to use:
```
export SALLOC_ACCOUNT=project_<your project ID>
export SBATCH_ACCOUNT=project_<your project ID>
```
# Omniperf Advanced Exercises
These exercises are meant to provide extra insight into the tuning of kernels. The exercise files are included in the following repository:
```
git clone https://github.com/amd/HPCTrainingExamples.git
cd HPCTrainingExamples/OmniperfExamples
```
## Exercise 5: Algorithmic Optimizations
This exercise compares a simple yAx kernel with a more efficient, but more complex, yAx kernel to demonstrate algorithmic improvements.
<details>
<summary><h3>Background: Acronyms and terms used in this exercise</h3></summary>
<ul>
<li><strong>L1:</strong> Level 1 Cache, the first level cache local to the Compute Unit (CU). If requested data is not found in the L1, the request goes to the L2</li>
<li><strong>L2:</strong> Level 2 Cache, the second level cache, which is shared by all Compute Units (CUs) on a GPU. If requested data is not found in the L2, the request goes to HBM</li>
<li><strong>HBM:</strong> High Bandwidth Memory is globally accessible from the GPU, and is a level of memory above the L2 cache</li>
<li><strong>CU:</strong> The Compute Unit is responsible for executing the user's kernels</li>
<li><strong>yAx:</strong> a vector-matrix-vector product, y*A*x, where y and x are vectors, and A is a matrix</li>
<li><strong>FP(32/16):</strong> 32- or 16-bit Floating Point numeric types</li>
</ul>
</details>
<details>
<summary><h3>Background: yAx Algorithmic Improvement Explanation</h3></summary>
Our approach up to this point could be described as having each thread sum up a row, as illustrated below:
<img src="threadrows.PNG"/>
However, this approach does not express all of the available parallelism. Namely, the partial sums within each row could also be computed in parallel.
Instead, we can assign each row to a wavefront and have the threads inside each wavefront compute partial sums in parallel.
Then, we reduce the partial sums atomically in shared memory, before completing the computation and reducing the final answer using global atomics.
This approach expresses more of the parallelism that is available, and would look something like the figure below:
<img src="wavefrontrow.PNG"/>
The expressed parallelism in each approach roughly corresponds to the number of red arrows in each figure.
</details>
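
The two decompositions described above can be sketched on the CPU. This is a minimal Python model, not the exercise's HIP code: the function names are hypothetical, the `lanes` loop stands in for threads that would run in parallel on the GPU, and the wavefront width of 64 is an assumption for MI200-class hardware.

```python
WAVEFRONT_SIZE = 64  # assumed number of lanes (threads) per wavefront


def yax_row_per_thread(y, A, x):
    """Model of the first approach: one 'thread' serially sums one full row."""
    result = 0.0
    for i, row in enumerate(A):              # each i stands for one thread
        row_sum = 0.0
        for j, a in enumerate(row):          # serial walk over the row
            row_sum += a * x[j]
        result += y[i] * row_sum             # stands in for a global atomic add
    return result


def yax_row_per_wavefront(y, A, x, lanes=WAVEFRONT_SIZE):
    """Model of the second approach: one 'wavefront' per row, lanes share the row."""
    result = 0.0
    for i, row in enumerate(A):              # each i stands for one wavefront
        partial = [0.0] * lanes
        for lane in range(lanes):            # these iterations are parallel on a GPU
            for j in range(lane, len(row), lanes):
                partial[lane] += row[j] * x[j]
        row_sum = sum(partial)               # stands in for the shared-memory reduction
        result += y[i] * row_sum             # stands in for the global atomic add
    return result


if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    x = [1.0, 1.0, 1.0]
    y = [1.0, 2.0]
    print(yax_row_per_thread(y, A, x), yax_row_per_wavefront(y, A, x, lanes=2))
```

Both functions compute the same value; the difference is only in how the work inside a row is divided, which is what the red arrows in the figures represent.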
### Initial Roofline Analysis
We should start by generating a roofline plot to see where the problem executable stands.
These plots can be generated with:
```
srun omniperf profile -n problem_roof_only --roof-only --kernel-names -- ./problem.exe
```
The plots will appear as PDF files in the `./workloads/problem_roof_only/MI200` directory, if generated on MI200 hardware.
They are also provided below for easy reference:
| Roofline Type | Roofline Legend | Roofline Plot |
|---------------|----------------------------------------------------|------------------------------------------------------|
|FP32 |<img src="https://raw.githubusercontent.com/amd/HPCTrainingExamples/main/OmniperfExamples/5-AlgorithmicOptimizations/exercise1_problem_kernelName_legend.png"/>|<img src="https://raw.githubusercontent.com/amd/HPCTrainingExamples/main/OmniperfExamples/5-AlgorithmicOptimizations/exercise5_problem_roofline_fp32.png"/> |
|FP16/INT8 |<img src="https://raw.githubusercontent.com/amd/HPCTrainingExamples/main/OmniperfExamples/5-AlgorithmicOptimizations/exercise1_problem_kernelName_legend.png"/>|<img src="https://raw.githubusercontent.com/amd/HPCTrainingExamples/main/OmniperfExamples/5-AlgorithmicOptimizations/exercise5_problem_roofline_int8_fp16.png"/> |
The performance of this kernel looks pretty close to being HBM bandwidth bound.
In the case of algorithmic optimizations, there may not be obvious evidence other than a suspicion that poor
usage of hardware resources may be improved by changing the overall approach.
In this case, we should be able to make better usage of both L1 and L2 resources by using wavefronts more efficiently
to better parallelize our computation.
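
As a rough cross-check on why this kernel sits near the bandwidth roof, the arithmetic intensity of yAx can be estimated by hand. The sketch below is an estimate under stated assumptions, not a measurement: it assumes each element of `A`, `x`, and `y` is read from memory exactly once, that elements are 8-byte doubles (as the `double*` kernel signature suggests), and that each matrix element costs one multiply and one add. The function name and the sizes `n` and `m` are hypothetical.

```python
def yax_arithmetic_intensity(n, m, bytes_per_element=8):
    """Estimated FLOPs per byte for y*A*x with an n-by-m matrix A."""
    flops = 2 * n * m                                   # one multiply and one add per element of A
    bytes_moved = bytes_per_element * (n * m + n + m)   # A, y and x each read once
    return flops / bytes_moved


if __name__ == "__main__":
    # Hypothetical problem size, chosen only for illustration.
    print(f"{yax_arithmetic_intensity(1024, 1024):.3f} FLOP/byte")
```

Under these assumptions the intensity stays below 0.25 FLOP per byte for any matrix size, which is consistent with a kernel limited by memory bandwidth rather than by compute.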
### Exercise Instructions:
To start, let's build and run `problem.exe`:
```
make
srun ./problem.exe
```
(*simulated output*)
```
yAx time 13 ms
```
This should be in line with our last solution. From the last exercise, we saw this output from `omniperf analyze` for this kernel:
```
--------
Analyze
--------
--------------------------------------------------------------------------------
0. Top Stat
---------------------------------------------------------------------------------------------------------------
| | KernelName | Count | Sum(ns) | Mean(ns) | Median(ns) | Pct |
---------------------------------------------------------------------------------------------------------------
| 0 | yax(double*, double*, double*, int, int, | 1.00 | 12427547.00 | 12427547.00 | 12427547.00 | 100.00 |
| | double*) [clone .kd] | | | | | |
---------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------
16. Vector L1 Data Cache
16.1 Speed-of-Light
___________________________________________________
_ Metric_ID _ Metric _ Avg _ Unit _
___________________________________________________
_ 16.1.0 _ Hit rate _ 49.98 _ Pct of peak _
___________________________________________________
_ 16.1.1 _ Bandwidth _ 10.88 _ Pct of peak _
___________________________________________________
_ 16.1.2 _ Utilization _ 98.15 _ Pct of peak _
___________________________________________________
_ 16.1.3 _ Coalescing _ 25.00 _ Pct of peak _
___________________________________________________
--------------------------------------------------------------------------------
17. L2 Cache
17.1 Speed-of-Light
_________________________________________________________________
_ Metric_ID _ Metric _ Avg _ Unit _
_________________________________________________________________
_ 17.1.0 _ Utilization _ 98.60 _ Pct _
_________________________________________________________________
_ 17.1.1 _ Bandwidth _ 9.40 _ Pct _
_________________________________________________________________
_ 17.1.2 _ Hit Rate _ 0.52 _ Pct _
_________________________________________________________________
_ 17.1.3 _ L2-Fabric Read BW _ 650.84 _ Gb/s _
_________________________________________________________________
_ 17.1.4 _ L2-Fabric Write and Atomic BW _ 0.00 _ Gb/s _
_________________________________________________________________
```
Looking at this data again, we see:
- L1 Cache Hit (`16.1.0`) is about 50%, which is fairly low for a "well performing" kernel.
- L2 Cache Hit (`17.1.2`) is about 0.5%, which is far too low for this kernel to be considered "well performing".
Let's run the profiling again, but dig more into detailed L1 and L2 stats
to see if we can make better use of the L1 and L2:
```
srun omniperf profile -n problem --no-roof -- ./problem.exe
```
(*output omitted*)
```
omniperf analyze -p workloads/problem/MI200 --dispatch 1 --block 16.3 16.4 17.3 17.2
```
The metrics we request are:
- `16.3` Detailed L1 cache access stats
- `16.4` Detailed L1-L2 transaction stats
- `17.3` Detailed L2 access stats
- `17.2` Detailed L2-Fabric transaction stats
The output from the `analyze` command should look something like:
```
--------
Analyze
--------
--------------------------------------------------------------------------------
0. Top Stats
0.1 Top Kernels
โโโโโโคโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโคโโโโโโโโโโคโโโโโโโโโโโโโโคโโโโโโโโโโโโโโคโโโโโโโโโโโโโโโคโโโโโโโโโ
โ โ Kernel_Name โ Count โ Sum(ns) โ Mean(ns) โ Median(ns) โ Pct โ
โโโโโโชโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโชโโโโโโโโโโโโโโชโโโโโโโโโโโโโโชโโโโโโโโโโโโโโโชโโโโโโโโโก
โ 0 โ yax(double*, double*, double*, int, int, โ 1.00 โ 13164269.00 โ 13164269.00 โ 13164269.00 โ 100.00 โ
โ โ double*) [clone .kd] โ โ โ โ โ โ
โโโโโโงโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโงโโโโโโโโโโงโโโโโโโโโโโโโโงโโโโโโโโโโโโโโงโโโโโโโโโโโโโโโงโโโโโโโโโ
0.2 Dispatch List
โโโโโโคโโโโโโโโโโโโโโโโคโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโคโโโโโโโโโโโ
โ โ Dispatch_ID โ Kernel_Name โ GPU_ID โ
โโโโโโชโโโโโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโโก
โ 0 โ 1 โ yax(double*, double*, double*, int, int, double*) [clone .kd] โ 4 โ
โโโโโโงโโโโโโโโโโโโโโโโงโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโงโโโโโโโโโโโ
--------------------------------------------------------------------------------
16. Vector L1 Data Cache
16.3 L1D Cache Accesses
โโโโโโโโโโโโโโโคโโโโโโโโโโโโโโโโโโโโโโคโโโโโโโโโโโโโคโโโโโโโโโโโโโคโโโโโโโโโโโโโคโโโโโโโโโโโโโโโโโ
โ Metric_ID โ Metric โ Avg โ Min โ Max โ Unit โ
โโโโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโโโโชโโโโโโโโโโโโโชโโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโก
โ 16.3.0 โ Total Req โ 524368.00 โ 524368.00 โ 524368.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.1 โ Read Req โ 524304.00 โ 524304.00 โ 524304.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.2 โ Write Req โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.3 โ Atomic Req โ 64.00 โ 64.00 โ 64.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.4 โ Cache BW โ 8392960.00 โ 8392960.00 โ 8392960.00 โ Bytes per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.5 โ Cache Hit Rate โ 49.98 โ 49.98 โ 49.98 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.6 โ Cache Accesses โ 131140.00 โ 131140.00 โ 131140.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.7 โ Cache Hits โ 65538.00 โ 65538.00 โ 65538.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.8 โ Invalidations โ 0.05 โ 0.05 โ 0.05 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.9 โ L1-L2 BW โ 4198528.00 โ 4198528.00 โ 4198528.00 โ Bytes per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.10 โ L1-L2 Read โ 65538.00 โ 65538.00 โ 65538.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.11 โ L1-L2 Write โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.12 โ L1-L2 Atomic โ 64.00 โ 64.00 โ 64.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.13 โ L1 Access Latency โ 483.56 โ 483.56 โ 483.56 โ Cycles โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.14 โ L1-L2 Read Latency โ 392.95 โ 392.95 โ 392.95 โ Cycles โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.15 โ L1-L2 Write Latency โ 6198.37 โ 6198.37 โ 6198.37 โ Cycles โ
โโโโโโโโโโโโโโโงโโโโโโโโโโโโโโโโโโโโโโงโโโโโโโโโโโโโงโโโโโโโโโโโโโงโโโโโโโโโโโโโงโโโโโโโโโโโโโโโโโ
16.4 L1D - L2 Transactions
โโโโโโโโโโโโโโโคโโโโโโโโโโโโโโคโโโโโโโโโคโโโโโโโโโโโโโโคโโโโโโโโโโโคโโโโโโโโโโโคโโโโโโโโโโโคโโโโโโโโโโโโโโโ
โ Metric_ID โ Metric โ Xfer โ Coherency โ Avg โ Min โ Max โ Unit โ
โโโโโโโโโโโโโโโชโโโโโโโโโโโโโโชโโโโโโโโโชโโโโโโโโโโโโโโชโโโโโโโโโโโชโโโโโโโโโโโชโโโโโโโโโโโชโโโโโโโโโโโโโโโก
โ 16.4.0 โ NC - Read โ Read โ NC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.1 โ UC - Read โ Read โ UC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.2 โ CC - Read โ Read โ CC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.3 โ RW - Read โ Read โ RW โ 65538.00 โ 65538.00 โ 65538.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.4 โ RW - Write โ Write โ RW โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.5 โ NC - Write โ Write โ NC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.6 โ UC - Write โ Write โ UC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.7 โ CC - Write โ Write โ CC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.8 โ NC - Atomic โ Atomic โ NC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.9 โ UC - Atomic โ Atomic โ UC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.10 โ CC - Atomic โ Atomic โ CC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.11 โ RW - Atomic โ Atomic โ RW โ 64.00 โ 64.00 โ 64.00 โ Req per wave โ
โโโโโโโโโโโโโโโงโโโโโโโโโโโโโโงโโโโโโโโโงโโโโโโโโโโโโโโงโโโโโโโโโโโงโโโโโโโโโโโงโโโโโโโโโโโงโโโโโโโโโโโโโโโ
--------------------------------------------------------------------------------
17. L2 Cache
17.2 L2 - Fabric Transactions
โโโโโโโโโโโโโโโคโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโคโโโโโโโโโโโโโคโโโโโโโโโโโโโคโโโโโโโโโโโโโคโโโโโโโโโโโโโโโโโ
โ Metric_ID โ Metric โ Avg โ Min โ Max โ Unit โ
โโโโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโโโโชโโโโโโโโโโโโโชโโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโก
โ 17.2.0 โ Read BW โ 4195335.34 โ 4195335.34 โ 4195335.34 โ Bytes per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.1 โ HBM Read Traffic โ 100.0 โ 100.0 โ 100.0 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.2 โ Remote Read Traffic โ 0.0 โ 0.0 โ 0.0 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.3 โ Uncached Read Traffic โ 0.01 โ 0.01 โ 0.01 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.4 โ Write and Atomic BW โ 0.11 โ 0.11 โ 0.11 โ Bytes per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.5 โ HBM Write and Atomic Traffic โ 100.0 โ 100.0 โ 100.0 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.6 โ Remote Write and Atomic Traffic โ 0.0 โ 0.0 โ 0.0 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.7 โ Atomic Traffic โ 0.0 โ 0.0 โ 0.0 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.8 โ Uncached Write and Atomic Traffic โ 0.0 โ 0.0 โ 0.0 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.9 โ Read Latency โ 266.15 โ 266.15 โ 266.15 โ Cycles โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.10 โ Write and Atomic Latency โ 480.86 โ 480.86 โ 480.86 โ Cycles โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.11 โ Atomic Latency โ โ โ โ Cycles โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.12 โ Read Stall โ โ โ โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.13 โ Write Stall โ โ โ โ Pct โ
โโโโโโโโโโโโโโโงโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโงโโโโโโโโโโโโโงโโโโโโโโโโโโโงโโโโโโโโโโโโโงโโโโโโโโโโโโโโโโโ
17.3 L2 Cache Accesses
โโโโโโโโโโโโโโโคโโโโโโโโโโโโโโโโโโโโโโโคโโโโโโโโโโโโโคโโโโโโโโโโโโโคโโโโโโโโโโโโโคโโโโโโโโโโโโโโโโโโโโโโ
โ Metric_ID โ Metric โ Avg โ Min โ Max โ Unit โ
โโโโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโโโโชโโโโโโโโโโโโโชโโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโโโโโโก
โ 17.3.0 โ Bandwidth โ 4217468.25 โ 4217468.25 โ 4217468.25 โ Bytes per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.1 โ Req โ 32948.97 โ 32948.97 โ 32948.97 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.2 โ Read Req โ 32884.77 โ 32884.77 โ 32884.77 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.3 โ Write Req โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.4 โ Atomic Req โ 64.00 โ 64.00 โ 64.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.5 โ Streaming Req โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.6 โ Probe Req โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.7 โ Cache Hit โ 0.52 โ 0.52 โ 0.52 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.8 โ Hits โ 172.26 โ 172.26 โ 172.26 โ Hits per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.9 โ Misses โ 32776.71 โ 32776.71 โ 32776.71 โ Misses per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.10 โ Writeback โ 0.02 โ 0.02 โ 0.02 โ Cachelines per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.11 โ Writeback (Internal) โ 0.00 โ 0.00 โ 0.00 โ Cachelines per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.12 โ Writeback (vL1D Req) โ 0.00 โ 0.00 โ 0.00 โ Cachelines per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.13 โ Evict (Internal) โ 32740.59 โ 32740.59 โ 32740.59 โ Cachelines per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.14 โ Evict (vL1D Req) โ 0.00 โ 0.00 โ 0.00 โ Cachelines per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.15 โ NC Req โ 0.03 โ 0.03 โ 0.03 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.16 โ UC Req โ 3.93 โ 3.93 โ 3.93 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.17 โ CC Req โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ 17.3.18 โ RW Req โ 32945.00 โ 32945.00 โ 32945.00 โ Req per wave โ
โโโโโโโโโโโโโโโงโโโโโโโโโโโโโโโโโโโโโโโงโโโโโโโโโโโโโงโโโโโโโโโโโโโงโโโโโโโโโโโโโงโโโโโโโโโโโโโโโโโโโโโโ
```
Profiling information at this level of detail is most useful once the high-level speed-of-light statistics
indicate that there may be a performance issue in a specific hardware subsystem.
Looking at the L1 stats, we see:
- L1 Total Req (`16.3.0`) and L1 Read Req (`16.3.1`) show we generate a lot of L1 requests
- L1 Cache Hit Rate (`16.3.5`) shows half the requests have to go out to the L2
Looking at the L2 (`17.3`, `17.2`) data, we see:
- L2 Req (`17.3.1`) is 32948.97
- L2 Read Req (`17.3.2`) is 32884.77
- L2 Hits (`17.3.8`) is 172.26
- L2 Misses (`17.3.9`) is 32776.71
- We are issuing a lot of requests to the L2 (`17.3.1`,`17.3.2`), but we almost never find the data in the L2 (`17.3.8`, `17.3.9`).
- L2 Read Bandwidth (`17.2.0`) is consequently very high, because the L2 almost always has to go out to HBM to find data.
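
The hit rates quoted above can be re-derived from the raw counters in the same output, which is a useful sanity check when reading these tables. The sketch below only does arithmetic on values copied from the `problem` profile shown earlier; the helper name is hypothetical, and reading "bytes divided by requests" as a per-request transfer size is an interpretation of this output, not a documented definition.

```python
def hit_rate_pct(hits, total):
    """Hit rate as a percentage of total accesses."""
    return 100.0 * hits / total


# Values copied from the problem.exe profile above (all per wave).
l1 = hit_rate_pct(65538.00, 131140.00)           # 16.3.7 Cache Hits / 16.3.6 Cache Accesses
l2 = hit_rate_pct(172.26, 172.26 + 32776.71)     # 17.3.8 Hits / (17.3.8 Hits + 17.3.9 Misses)

# 16.3.9 L1-L2 BW divided by the L1-L2 read, write and atomic requests (16.3.10 to 16.3.12).
bytes_per_l1_l2_req = 4198528.00 / (65538.00 + 0.00 + 64.00)

if __name__ == "__main__":
    print(f"L1 hit rate: {l1:.2f} %")
    print(f"L2 hit rate: {l2:.2f} %")
    print(f"Bytes per L1-L2 request: {bytes_per_l1_l2_req:.1f}")
```

The two rates agree with the reported `16.3.5` and `17.3.7` values (49.98 and 0.52), and the L1-L2 traffic works out to 64 bytes per request in this run.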
This data indicates that we should be able to make better usage of our memory system, so let's apply the algorithmic optimization present in `solution.cpp`:
```
cd solution
make
srun ./solution.exe
```
(*simulated output*)
```
yAx time: 0.4 ms
```
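
To put the two printed timings side by side, the speedup is simply the ratio of the runtimes. The sketch below uses the simulated values shown in this exercise (13 ms for `problem.exe`, 0.4 ms for `solution.exe`); the function name is hypothetical, and timings on a real run will differ.

```python
def speedup(baseline_ms, optimized_ms):
    """How many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


if __name__ == "__main__":
    # Simulated timings quoted in this exercise.
    print(f"Speedup: {speedup(13.0, 0.4):.1f}x")
```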
It should be noted again that algorithmic optimizations are usually the most expensive optimizations to implement, as they usually entail
re-conceptualizing the problem in a way that allows for a more efficient solution. However, as we see here, algorithmic optimization _can_
result in impressive speedups. A better runtime alone is not proof that we are using our caches more efficiently, so we have to profile the solution:
```
srun omniperf profile -n solution --no-roof -- ./solution.exe
```
(*output omitted*)
```
omniperf analyze -p workloads/solution/MI200 --dispatch 1 --block 16.3 16.4 17.3 17.2
```
The output for the solution should look something like:
```
--------
Analyze
--------
--------------------------------------------------------------------------------
0. Top Stats
0.1 Top Kernels
โโโโโโคโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโคโโโโโโโโโโคโโโโโโโโโโโโคโโโโโโโโโโโโโคโโโโโโโโโโโโโโโคโโโโโโโโโ
โ โ Kernel_Name โ Count โ Sum(ns) โ Mean(ns) โ Median(ns) โ Pct โ
โโโโโโชโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโชโโโโโโโโโโโโชโโโโโโโโโโโโโชโโโโโโโโโโโโโโโชโโโโโโโโโก
โ 0 โ yax(double*, double*, double*, int, int, โ 1.00 โ 392003.00 โ 392003.00 โ 392003.00 โ 100.00 โ
โ โ double*) [clone .kd] โ โ โ โ โ โ
โโโโโโงโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโงโโโโโโโโโโงโโโโโโโโโโโโงโโโโโโโโโโโโโงโโโโโโโโโโโโโโโงโโโโโโโโโ
0.2 Dispatch List
โโโโโโคโโโโโโโโโโโโโโโโคโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโคโโโโโโโโโโโ
โ โ Dispatch_ID โ Kernel_Name โ GPU_ID โ
โโโโโโชโโโโโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโโก
โ 0 โ 1 โ yax(double*, double*, double*, int, int, double*) [clone .kd] โ 4 โ
โโโโโโงโโโโโโโโโโโโโโโโงโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโงโโโโโโโโโโโ
--------------------------------------------------------------------------------
16. Vector L1 Data Cache
16.3 L1D Cache Accesses
โโโโโโโโโโโโโโโคโโโโโโโโโโโโโโโโโโโโโโคโโโโโโโโโโโโคโโโโโโโโโโโโคโโโโโโโโโโโโคโโโโโโโโโโโโโโโโโ
โ Metric_ID โ Metric โ Avg โ Min โ Max โ Unit โ
โโโโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโโโชโโโโโโโโโโโโชโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโก
โ 16.3.0 โ Total Req โ 16448.00 โ 16448.00 โ 16448.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.1 โ Read Req โ 16384.00 โ 16384.00 โ 16384.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.2 โ Write Req โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.3 โ Atomic Req โ 64.00 โ 64.00 โ 64.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.4 โ Cache BW โ 262208.00 โ 262208.00 โ 262208.00 โ Bytes per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.5 โ Cache Hit Rate โ 69.61 โ 69.61 โ 69.61 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.6 โ Cache Accesses โ 4097.00 โ 4097.00 โ 4097.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.7 โ Cache Hits โ 2852.00 โ 2852.00 โ 2852.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.8 โ Invalidations โ 0.05 โ 0.05 โ 0.05 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.9 โ L1-L2 BW โ 79680.00 โ 79680.00 โ 79680.00 โ Bytes per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.10 โ L1-L2 Read โ 1244.00 โ 1244.00 โ 1244.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.11 โ L1-L2 Write โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.12 โ L1-L2 Atomic โ 1.00 โ 1.00 โ 1.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.13 โ L1 Access Latency โ 748.90 โ 748.90 โ 748.90 โ Cycles โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.14 โ L1-L2 Read Latency โ 580.90 โ 580.90 โ 580.90 โ Cycles โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 16.3.15 โ L1-L2 Write Latency โ 236.73 โ 236.73 โ 236.73 โ Cycles โ
โโโโโโโโโโโโโโโงโโโโโโโโโโโโโโโโโโโโโโงโโโโโโโโโโโโงโโโโโโโโโโโโงโโโโโโโโโโโโงโโโโโโโโโโโโโโโโโ
16.4 L1D - L2 Transactions
โโโโโโโโโโโโโโโคโโโโโโโโโโโโโโคโโโโโโโโโคโโโโโโโโโโโโโโคโโโโโโโโโโคโโโโโโโโโโคโโโโโโโโโโคโโโโโโโโโโโโโโโ
โ Metric_ID โ Metric โ Xfer โ Coherency โ Avg โ Min โ Max โ Unit โ
โโโโโโโโโโโโโโโชโโโโโโโโโโโโโโชโโโโโโโโโชโโโโโโโโโโโโโโชโโโโโโโโโโชโโโโโโโโโโชโโโโโโโโโโชโโโโโโโโโโโโโโโก
โ 16.4.0 โ NC - Read โ Read โ NC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.1 โ UC - Read โ Read โ UC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.2 โ CC - Read โ Read โ CC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.3 โ RW - Read โ Read โ RW โ 1244.00 โ 1244.00 โ 1244.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.4 โ RW - Write โ Write โ RW โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.5 โ NC - Write โ Write โ NC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.6 โ UC - Write โ Write โ UC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.7 โ CC - Write โ Write โ CC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.8 โ NC - Atomic โ Atomic โ NC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.9 โ UC - Atomic โ Atomic โ UC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.10 โ CC - Atomic โ Atomic โ CC โ 0.00 โ 0.00 โ 0.00 โ Req per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโโโโโโค
โ 16.4.11 โ RW - Atomic โ Atomic โ RW โ 1.00 โ 1.00 โ 1.00 โ Req per wave โ
โโโโโโโโโโโโโโโงโโโโโโโโโโโโโโงโโโโโโโโโงโโโโโโโโโโโโโโงโโโโโโโโโโงโโโโโโโโโโงโโโโโโโโโโงโโโโโโโโโโโโโโโ
--------------------------------------------------------------------------------
17. L2 Cache
17.2 L2 - Fabric Transactions
โโโโโโโโโโโโโโโคโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโคโโโโโโโโโโโคโโโโโโโโโโโคโโโโโโโโโโโคโโโโโโโโโโโโโโโโโ
โ Metric_ID โ Metric โ Avg โ Min โ Max โ Unit โ
โโโโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโโชโโโโโโโโโโโชโโโโโโโโโโโชโโโโโโโโโโโโโโโโโก
โ 17.2.0 โ Read BW โ 65690.47 โ 65690.47 โ 65690.47 โ Bytes per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.1 โ HBM Read Traffic โ 100.0 โ 100.0 โ 100.0 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.2 โ Remote Read Traffic โ 0.0 โ 0.0 โ 0.0 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.3 โ Uncached Read Traffic โ 0.03 โ 0.03 โ 0.03 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.4 โ Write and Atomic BW โ 0.02 โ 0.02 โ 0.02 โ Bytes per wave โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.5 โ HBM Write and Atomic Traffic โ 100.0 โ 100.0 โ 100.0 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.6 โ Remote Write and Atomic Traffic โ 0.0 โ 0.0 โ 0.0 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.7 โ Atomic Traffic โ 0.0 โ 0.0 โ 0.0 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.8 โ Uncached Write and Atomic Traffic โ 0.0 โ 0.0 โ 0.0 โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.9 โ Read Latency โ 541.56 โ 541.56 โ 541.56 โ Cycles โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.10 โ Write and Atomic Latency โ 482.0 โ 482.0 โ 482.0 โ Cycles โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.11 โ Atomic Latency โ โ โ โ Cycles โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.12 โ Read Stall โ โ โ โ Pct โ
โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโโโค
โ 17.2.13 โ Write Stall โ โ โ โ Pct โ
โโโโโโโโโโโโโโโงโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโงโโโโโโโโโโโงโโโโโโโโโโโงโโโโโโโโโโโงโโโโโโโโโโโโโโโโโ
17.3 L2 Cache Accesses
╒═════════════╤══════════════════════╤══════════╤══════════╤══════════╤═════════════════════╕
│ Metric_ID   │ Metric               │ Avg      │ Min      │ Max      │ Unit                │
╞═════════════╪══════════════════════╪══════════╪══════════╪══════════╪═════════════════════╡
│ 17.3.0      │ Bandwidth            │ 79800.38 │ 79800.38 │ 79800.38 │ Bytes per wave      │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.1      │ Req                  │ 623.44   │ 623.44   │ 623.44   │ Req per wave        │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.2      │ Read Req             │ 622.45   │ 622.45   │ 622.45   │ Req per wave        │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.3      │ Write Req            │ 0.00     │ 0.00     │ 0.00     │ Req per wave        │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.4      │ Atomic Req           │ 1.00     │ 1.00     │ 1.00     │ Req per wave        │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.5      │ Streaming Req        │ 0.00     │ 0.00     │ 0.00     │ Req per wave        │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.6      │ Probe Req            │ 0.00     │ 0.00     │ 0.00     │ Req per wave        │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.7      │ Cache Hit            │ 17.65    │ 17.65    │ 17.65    │ Pct                 │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.8      │ Hits                 │ 110.02   │ 110.02   │ 110.02   │ Hits per wave       │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.9      │ Misses               │ 513.42   │ 513.42   │ 513.42   │ Misses per wave     │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.10     │ Writeback            │ 0.02     │ 0.02     │ 0.02     │ Cachelines per wave │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.11     │ Writeback (Internal) │ 0.00     │ 0.00     │ 0.00     │ Cachelines per wave │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.12     │ Writeback (vL1D Req) │ 0.00     │ 0.00     │ 0.00     │ Cachelines per wave │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.13     │ Evict (Internal)     │ 481.25   │ 481.25   │ 481.25   │ Cachelines per wave │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.14     │ Evict (vL1D Req)     │ 0.00     │ 0.00     │ 0.00     │ Cachelines per wave │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.15     │ NC Req               │ 0.03     │ 0.03     │ 0.03     │ Req per wave        │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.16     │ UC Req               │ 0.17     │ 0.17     │ 0.17     │ Req per wave        │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.17     │ CC Req               │ 0.00     │ 0.00     │ 0.00     │ Req per wave        │
├─────────────┼──────────────────────┼──────────┼──────────┼──────────┼─────────────────────┤
│ 17.3.18     │ RW Req               │ 623.25   │ 623.25   │ 623.25   │ Req per wave        │
╘═════════════╧══════════════════════╧══════════╧══════════╧══════════╧═════════════════════╛
```
Looking at the L1 data, we see:
- L1 Cache Hit Rate (`16.3.5`) has increased by about 20%.
- L1 Total Req (`16.3.0`) has decreased by ~32x, 16448 compared to 524368 previously.
- This results in fewer requests going out to the L2.
Looking at the L2 data, we see:
- L2 Req (`17.3.1`) has decreased by ~53x, 623.44 compared to 32948.97 previously.
- L2 Cache Hit (`17.3.7`) is ~34x higher than before: 17.65% compared to 0.52% previously.
- L2 Hits (`17.3.8`) is slightly lower than before: 110 compared to 172.
- L2 Misses (`17.3.9`) has decreased by ~64x, 513 compared to 32776.
- L2 Read bandwidth (`17.2.0`) has decreased due to the decrease in L2 misses.
- Because L2 misses are the requests that go out to HBM, ~64x fewer misses means far fewer requests going out to HBM than we had previously, which explains our observed speedup.
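The reduction factors quoted above can be rechecked directly from the counter values. The following is a minimal sketch in Python; the names `counters` and `reduction_factor` are hypothetical helpers introduced for illustration, and the numbers are the ones quoted in the bullets above:

```python
# Counter values quoted in the bullets above, as (previous, current) pairs.
counters = {
    "L1 Total Req (16.3.0)": (524368, 16448),
    "L2 Req (17.3.1)": (32948.97, 623.44),
    "L2 Misses (17.3.9)": (32776, 513),
}


def reduction_factor(before, after):
    """Return how many times smaller `after` is than `before`."""
    return before / after


factors = {name: reduction_factor(b, a) for name, (b, a) in counters.items()}
for name, factor in factors.items():
    print(f"{name}: ~{factor:.0f}x fewer")

# The L2 hit rate moved in the other direction: 0.52% before, 17.65% after.
l2_hit_gain = 17.65 / 0.52
print(f"L2 Cache Hit (17.3.7): ~{l2_hit_gain:.0f}x higher")
```

Recomputing the ratios this way is a quick sanity check when comparing two Omniperf runs by hand.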
### Solution Roofline Analysis
As a final step, we should check how this new implementation stacks up with the roofline.
These plots can be generated with:
```
srun omniperf profile -n solution_roof_only --roof-only --kernel-names -- ./solution.exe
```
The plots will appear as PDF files in the `./workloads/solution_roof_only/MI200` directory, if generated on MI200 hardware.
They are also provided below for easy reference:
| Roofline Type | Roofline Legend | Roofline Plot |
|---------------|----------------------------------------------------|------------------------------------------------------|
|FP32 |<img src="https://raw.githubusercontent.com/amd/HPCTrainingExamples/main/OmniperfExamples/5-AlgorithmicOptimizations/exercise1_problem_kernelName_legend.png"/>|<img src="https://raw.githubusercontent.com/amd/HPCTrainingExamples/main/OmniperfExamples/5-AlgorithmicOptimizations/exercise5_solution_roofline_fp32.png"/> |
|FP16/INT8 |<img src="https://raw.githubusercontent.com/amd/HPCTrainingExamples/main/OmniperfExamples/5-AlgorithmicOptimizations/exercise1_problem_kernelName_legend.png"/>|<img src="https://raw.githubusercontent.com/amd/HPCTrainingExamples/main/OmniperfExamples/5-AlgorithmicOptimizations/exercise5_solution_roofline_int8_fp16.png"/> |
As the Omniperf stats indicate, we are moving most data through the L1, which shows in the roofline as a decrease in Arithmetic Intensity for that cache layer.
We have a high hit rate in L1 and a fairly low hit rate in L2, but we end up having to go to HBM much less frequently than we did previously.
Our HBM bandwidth has therefore decreased, as a result of more efficient usage of our memory system.
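To make the link between data movement and roofline position concrete, here is a minimal sketch of the standard roofline arithmetic: arithmetic intensity at a memory level is FLOPs divided by bytes moved at that level, and the attainable rate is the smaller of the compute roof and intensity times the bandwidth roof. All numbers below are hypothetical and chosen only for illustration; they are not measurements from this exercise or specifications of any GPU, and the function names are assumptions of this sketch.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved at one memory level."""
    return flops / bytes_moved


def roofline_bound(ai, peak_flops, peak_bandwidth):
    """Attainable FLOP/s: min(compute roof, AI * bandwidth roof)."""
    return min(peak_flops, ai * peak_bandwidth)


# Hypothetical kernel: same FLOP count, different traffic per memory level.
flops = 1.0e9
bytes_before = {"L1": 4.0e9, "HBM": 4.0e9}  # no reuse: every byte comes from HBM
bytes_after = {"L1": 8.0e9, "HBM": 0.5e9}   # more L1 traffic, far less HBM traffic

ai_before = {lvl: arithmetic_intensity(flops, b) for lvl, b in bytes_before.items()}
ai_after = {lvl: arithmetic_intensity(flops, b) for lvl, b in bytes_after.items()}

# Hypothetical roofs (FLOP/s and bytes/s), NOT real hardware numbers.
peak_flops = 10.0e12
peak_hbm_bw = 1.0e12

hbm_bound_before = roofline_bound(ai_before["HBM"], peak_flops, peak_hbm_bw)
hbm_bound_after = roofline_bound(ai_after["HBM"], peak_flops, peak_hbm_bw)

print("L1 AI: ", ai_before["L1"], "->", ai_after["L1"])
print("HBM AI:", ai_before["HBM"], "->", ai_after["HBM"])
print("HBM-bound FLOP/s:", hbm_bound_before, "->", hbm_bound_after)
```

In this sketch the L1 point moves left (lower arithmetic intensity) because more bytes pass through L1, while the HBM point moves right because fewer bytes are fetched from HBM for the same work.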
### Roofline Comparison
The comparison of these two rooflines is confusing, because these algorithms use the memory system very differently.
It is important to keep in mind that our solution runs **29x** faster than the problem.
| Roofline Type | Problem Roofline | Solution Roofline |
|---------------|------------------------------------------------------|--------------------------------------------------------|
| FP32 | <img src="https://raw.githubusercontent.com/amd/HPCTrainingExamples/main/OmniperfExamples/5-AlgorithmicOptimizations/exercise5_problem_roofline_fp32.png"/> | <img src="https://raw.githubusercontent.com/amd/HPCTrainingExamples/main/OmniperfExamples/5-AlgorithmicOptimizations/exercise5_solution_roofline_fp32.png"/> |
| FP16/INT8 | <img src="https://raw.githubusercontent.com/amd/HPCTrainingExamples/main/OmniperfExamples/5-AlgorithmicOptimizations/exercise5_problem_roofline_int8_fp16.png"/>| <img src="https://raw.githubusercontent.com/amd/HPCTrainingExamples/main/OmniperfExamples/5-AlgorithmicOptimizations/exercise5_solution_roofline_int8_fp16.png"/> |
We see a significant speedup from problem to solution, but on the roofline it is difficult to determine which implementation is using the hardware more efficiently. The problem seems to be better, as its HBM point is very close to the achievable bandwidth, while the performance of the solution points seems to decrease.
The roofline, though useful for estimating efficiencies of kernels, still only shows one perspective of performance.
### Summary and Take-aways
This algorithmic optimization is able to work more efficiently out of the L1, generating far fewer
L2 requests that require expensive memory operations. Algorithmic optimizations are all but guaranteed
to have significant development overhead, but finding a more efficient algorithm can have a large impact
on performance. If profiling reveals inefficient use of the memory hardware, it could be worth thinking
about alternative algorithms.