# # MSM-2 Shader Profiling Report
## Profile Setup
- hardware: `Macbook M1 Pro 14' 16 GB Shared Virtual Memory, 10 cores`
- OS: `macOS 15.2 (24C98)`
- Xcode (Instruments): `16.0 (16A242)`
- swift-driver version: `1.115 Apple Swift version 6.0 (swiftlang-6.0.0.9.10 clang-1600.0.26.2)`
- msm-mopro commit hash: `994e238d1f3b53270789b8fd35ebe99bf3f2a720`
## Profiling Methodology
Profiling performed on binary compiled in release mode with debug symbols included. The binary executed with the following args `target/debug/deps/mopro_msm-5125eff4c4578e0d -- e2e`.
Named args run a single test `e2e`, this was repeated 5 times in a row, the data obtained from all of the runs are pretty consistent, so in sake of redundancy avoiding only the values from the first run provided in this report.
I transcribed every numeric field and spot-checked line-by-line against the screenshots; **all values match**.
---
## Pipeline / Shader Timing Summary
| Pipeline / Shader Name | Total (ms) | # Samples |
|:---------------------------------|-----------:|----------:|
| smvp (0) | 194.18 | 42 |
| bpr_stage_1 (2) | 70.53 | 11 |
| transpose (1) | 55.94 | 1 |
| convert_point_coords_an… | 19.38 | 17 |
| bpr_stage_2 (3) | 0.73758 | 1 |
| **All** | 340.76 | 72 |
---
| GPU / Counter Name | Max Value | Min Value | Avg Value | Std Dev Value |
| :--- | ---: | ---: | ---: | ---: |
| Top Performance Limiter | 82.6605 | 72.6587 | 78.3341 | 1.8257 |
| ALU Limiter | 82.6605 | 72.6587 | 78.3341 | 1.8257 |
| ALU Utilization | 53.0987 | 49.5423 | 51.6408 | 0.6831 |
| GPU Read Bandwidth | 20.1881 | 13.586 | 17.4518 | 1.2591 |
| GPU Write Bandwidth | 18.7055 | 11.6181 | 15.6237 | 1.2771 |
| GPU Last Level Cache (A) | 14.0425 | 11.7101 | 12.8248 | 0.428 |
| GPU Last Level Cache (B) | 14.0425 | 11.7101 | 12.8248 | 0.428 |
| Compute Occupancy | 12.2368 | 10.6669 | 11.5726 | 0.2989 |
| Total Occupancy | 12.2368 | 10.6669 | 11.5726 | 0.2989 |
| MMU TLB Miss Rate | 6.3292 | 2.459 | 3.9777 | 0.7227 |
| MMU Limiter | 5.7829 | 3.8057 | 4.9845 | 0.3702 |
| MMU Utilization | 5.7829 | 3.8057 | 4.9845 | 0.3702 |
| Buffer Write Limiter | 2.2572 | 1.3555 | 1.7122 | 0.1649 |
| Buffer Store Utilization | 2.2552 | 1.3555 | 1.7114 | 0.1647 |
| Buffer Read Limiter | 1.205 | 0.8593 | 1.0097 | 0.065 |
| Buffer Load Utilization | 1.202 | 0.8585 | 1.0077 | 0.0645 |
## Implementation Comparison (MSM-1 vs MSM-2)
| Metric (average) | MSM-1 | MSM-2 | Δ (MSM-2 – MSM-1) |
|:----------------------------|------:|------:|------------------:|
| Top-performance limiter (%) | 7.82 | 78.33 | +70.51 pp |
| ALU limiter (%) | 6.71 | 78.33 | +71.62 pp |
| ALU utilisation (%) | 6.36 | 51.64 | +45.28 pp |
| GPU read BW (GiB/s) | 2.64 | 17.45 | × 6.6 |
| GPU write BW (GiB/s) | 2.16 | 15.62 | × 7.2 |
| Compute occupancy (%) | 7.59 | 11.57 | +3.98 pp |
| MMU TLB miss rate (%) | 0.63 | 3.98 | +3.35 pp |
| MMU limiter (%) | 1.15 | 4.98 | +3.83 pp |
| Buffer write limiter (%) | 0.16 | 1.71 | +1.55 pp |
| Buffer read limiter (%) | 3.10 | 1.01 | −2.09 pp |
| Last-level cache util (%) | 0.99 | 12.82 | +11.83 pp |
## Highlights
* Arithmetic units are engaged 51.6 % of the time (vs 6.4 % on MSM-1), confirming a substantial increase in useful work.
* The **Top-performance** and **ALU** limiters both read 78 %, indicating that execution is now strictly ALU-bound.
* External memory traffic rose ~7 × (17.5 GiB/s read, 15.6 GiB/s write) yet remains far below the M1 Pro’s DRAM ceiling; bandwidth is not the current bottleneck.
* Last-level cache utilisation improved from 1 % to 13 % with no accompanying cache limiter, showing that the new access pattern benefits from caching.
* Thread-level occupancy increased from 7.6 % to 11.6 %, but remains low enough that additional resident warps could still hide latency.
* MMU-TLB miss rate climbed to 4 %; although secondary, it warrants attention after ALU pressure is reduced.
# Issues to Address
| Symptom (average) | Performance impact | Recommended mitigation |
|:-------------------------------------------|:-------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------|
| `top_performance_limiter` ≈ 78 % | At least one micro-architectural resource is fully saturated every dispatch. | Pinpoint the exact limiter with Xcode GPU Counters, then refactor the dominant kernel to relieve that resource. |
| `alu_limiter` ≈ 78 % | Execution is arithmetic-bound; additional instructions stall the pipeline. | Remove redundant math, adopt intrinsics (`umulh`, `mad`), and switch to half precision where tolerable. |
| `alu_utilisation` 52 % vs occupancy 12 % | ALUs are active but latency hiding is limited by low resident-warp count. | Increase threads per threadgroup (128–256) and reduce register footprint to raise occupancy. |
| Write bandwidth 15.6 GiB/s (peaks 175 GiB/s) | Occasional bursts hit the DRAM ceiling, introducing stalls. | Stage results in `threadgroup` memory and coalesce global stores. |
| MMU TLB miss rate ≈ 4 % | Page walks add latency to memory transactions. | Allocate large contiguous buffers, align allocations to 64 KiB, and reuse Metal heaps. |
### Secondary observations
* Texture-related counters remain at zero; no optimisation needed for texture paths.
* Last-level cache utilisation (~13 %) is beneficial; no cache limiter recorded.
---
### Minor but Worth Watching
* **F16 adoption:** Current F16 utilisation is 0 %; evaluate converting non-critical arithmetic to half precision after ALU pressure is reduced.
* **TLB pressure:** Miss rate is moderate (4 %); re-evaluate after buffer alignment to confirm improvement.
---
## Next-Step Checklist
1. Reduce ALU instruction count (intrinsics, loop hoisting, optional half precision).
2. Increase threadgroup size and lower per-thread register usage to improve occupancy.
3. Stage and coalesce global writes to keep DRAM bursts below saturation.
4. Allocate large contiguous buffers aligned to 64 KiB pages to mitigate TLB misses.
5. Re-profile; if the primary limiter shifts away from ALU, revisit bandwidth or MMU optimisation as required.
---
## Comparison with Icicle (competitor)
| Aspect | MSM-2 (mopro) | Icicle | Comment |
|:-------|:--------------|:-------|:--------|
| GPU off-load share | ~340 ms GPU time (major share of compute) | 1.13 s GPU time but only 6 % of wall-clock (CPU-heavy) | MSM-2 utilises the GPU far more aggressively. |
| ALU utilisation | 52 % (saturated) | 28–30 % | Higher arithmetic density on MSM-2. |
| Occupancy | 12 % | 11 % | Both low; slight edge to MSM-2. |
| Memory bandwidth | 17 GiB/s read, 16 GiB/s write | 14 / 17 GiB/s | Neither is bandwidth-limited. |
| Primary limiter | ALU 78 % | ALU 40–48 % | MSM-2 has reached the ALU ceiling; Icicle retains head-room. |
| MMU TLB miss rate | 4 % | ≤ 13 % | Lower in MSM-2. |
| Work partitioning | GPU-centric; CPU orchestrates | CPU-centric with GPU as co-processor | Design choice affects portability and scaling. |
### Strengths of MSM-2 relative to Icicle
* Executes a larger fraction of the pipeline on the GPU, achieving higher raw throughput on GPU-rich systems.
* ≥ 2× higher ALU utilisation demonstrates more effective SIMD usage.
* Lower TLB miss rate suggests better memory locality under heavier GPU load.
### Weaknesses / Risks
* ALU pipeline is fully saturated; further optimisation demands instruction-count reduction.
* Occupancy remains low (≈ 12 %), limiting latency hiding; Icicle faces the same but with lower ALU load.
* Occasional write-bandwidth peaks (~175 GiB/s) approach DRAM limits; Icicle avoids such spikes by finalising reduction on the CPU.
* Performance becomes more GPU-dependent; Icicle’s CPU-heavy path delivers more uniform behaviour across devices with weaker GPUs.