# Msm-Mopro Shaders Profiling Report

## Profile Setup

- hardware: `MacBook M1 Pro 14", 16 GB Shared Virtual Memory`
- OS: `macOS 15.2 (24C98)`
- Xcode (Instruments): `16.0 (16A242)`
- swift-driver version: `1.115 Apple Swift version 6.0 (swiftlang-6.0.0.9.10 clang-1600.0.26.2)`
- msm-mopro commit hash: `85958b8a541e733221fe9117285968bc6c40bf90`

## Profiling Methodology

Profiling was performed on a binary compiled in release mode with debug symbols included. The binary was executed with the following arguments: `--exact msm::metal::msm::tests::test_msm_correctness_medium_sample --nocapture`.

With these arguments, a single test, `test_msm_correctness_medium_sample`, runs five times in a row. The data obtained from all of the runs are consistent, so to avoid redundancy only the values from the first run are provided in this report.

## Common Metrics

| Metric                               |                              Value (ms) |
| ------------------------------------ | --------------------------------------: |
| Overall Binary Run Time              |                                ~240,000 |
| Overall Shaders Run Time             |                                 182,600 |
| Longest-Running Shader (Cumulative)  | `bucket_wise_accumulation` (121,500.00) |
| Second Longest-Running Shader        |              `sum_reduction` (5,780.00) |
| Third Longest-Running Shader         |           `final_accumulation` (325.53) |

## Single Run Metrics

| Metric                           | Max Value (%) | Avg Value (%) | Comment                |
| :------------------------------- | ------------: | ------------: | :--------------------- |
| Top Performance Limiter          |       99.9517 |        7.8164 |                        |
| Compute Occupancy                |       63.2224 |        7.5858 |                        |
| GPU Write Bandwidth              |      224.7269 |        2.1583 | Values in Gb/s, not %  |
| GPU Read Bandwidth               |      170.8448 |        2.6435 | Values in Gb/s, not %  |
| ALU Limiter                      |       72.0604 |        6.7060 |                        |
| ALU Utilization                  |       64.4509 |        6.3585 |                        |
| Buffer Load Utilization          |       51.3569 |        2.2396 |                        |
| Buffer Store Utilization         |       34.0671 |        0.1608 |                        |
| Buffer Read Limiter              |       99.9517 |        3.1027 |                        |
| Buffer Write Limiter             |       54.2443 |        0.1618 |                        |
| GPU Last Level Cache Limiter     |       74.1960 |        1.1863 |                        |
| GPU Last Level Cache Utilization |       71.0583 |        0.9938 |                        |
| MMU TLB Miss Rate                |       72.8721 |        0.6302 |                        |
| MMU Utilization                  |       50.8915 |        0.7273 |                        |
| MMU Limiter                      |       72.3086 |        1.1514 |                        |
| F32 Utilization                  |       14.7544 |        0.0082 |                        |
| F16 Utilization                  |       38.3325 |        0.0157 |                        |

## Conclusions

1. The **Top Performance Limiter** max value is identical to the **Buffer Read Limiter** max value (99.9517%), which indicates that buffer read throughput is the main bottleneck in the whole pipeline and the first issue to address.
2. The large gap between the max and average values indicates that the workload reaches the GPU cores unevenly, in large chunks. This should be treated as the next issue to address.
3. The other performance issues visible in the data (relatively low ALU, F16, and F32 utilization) matter, but they should be reassessed in another profiling session once the first two issues have been fixed.
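
The arithmetic behind these conclusions can be reproduced from the tables with a short script. The following is a minimal sketch, not part of the msm-mopro code base: the numbers are copied by hand from this report, the helper names `share` and `max_to_avg_ratio` are hypothetical, and the max-to-average ratio is only an illustrative indicator of uneven load, not an Instruments counter. It also assumes that the per-shader times and the overall shader time are directly comparable (same units, same runs).

```python
# Values copied by hand from the Common Metrics table (milliseconds).
binary_ms = 240_000.0   # reported as approximate ("~240,000")
shaders_ms = 182_600.0
shader_ms = {
    "bucket_wise_accumulation": 121_500.00,
    "sum_reduction": 5_780.00,
    "final_accumulation": 325.53,
}

# (max %, avg %) pairs copied from the Single Run Metrics table.
single_run = {
    "Top Performance Limiter": (99.9517, 7.8164),
    "Buffer Read Limiter": (99.9517, 3.1027),
    "Buffer Write Limiter": (54.2443, 0.1618),
    "ALU Limiter": (72.0604, 6.7060),
    "GPU Last Level Cache Limiter": (74.1960, 1.1863),
    "MMU Limiter": (72.3086, 1.1514),
}


def share(part: float, whole: float) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


def max_to_avg_ratio(max_pct: float, avg_pct: float) -> float:
    """Hypothetical unevenness indicator: how far the peak is above the average."""
    return max_pct / avg_pct


# Share of the overall binary run time spent in shaders.
shader_share_of_binary = share(shaders_ms, binary_ms)

# Share of the overall shader time taken by each listed shader.
shader_shares = {name: share(ms, shaders_ms) for name, ms in shader_ms.items()}

# Conclusion 1: which individual limiter peaks at the same value as the top limiter?
top_max = single_run["Top Performance Limiter"][0]
matching_limiters = [
    name
    for name, (max_pct, _) in single_run.items()
    if name != "Top Performance Limiter" and max_pct == top_max
]

# Conclusion 2: peak-to-average gap for every metric.
ratios = {name: max_to_avg_ratio(mx, avg) for name, (mx, avg) in single_run.items()}

if __name__ == "__main__":
    print(f"shaders / binary: {shader_share_of_binary:.1f}%")
    for name, pct in shader_shares.items():
        print(f"{name}: {pct:.2f}% of shader time")
    print("limiters matching the top limiter peak:", matching_limiters)
    for name, ratio in ratios.items():
        print(f"{name}: max/avg = {ratio:.1f}x")
```

Running the same script against the values from a later profiling session would give a quick before/after comparison for the first two conclusions.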