# Metal Profiling Manual

![instruments_profile](https://hackmd.io/_uploads/Hkf37dV4Jx.png)

## Why

Many folks end up simply measuring the wall-clock time it takes to run their Metal shader function. This approach has a few downsides:

1. Measurements are inconsistent: the timing is affected by all the other tasks the operating system handles during the test, which can produce widely dispersed results for the very same function.
2. It is barely reproducible: timings taken on one setup (device) can be completely incompatible with another device. This is less of a problem within the Apple ecosystem on average, but it can still produce completely misleading results on outlier devices (e.g. iPad Pro M4 vs. iPhone SE 3rd generation).
3. It leaves you blind to the exact reasons why the shader is not as optimized as it could be. Time is a derivative metric composed of various first-order ones, such as memory bus throughput, workload distribution across cores, and a few others.

Profiling is the tool that unveils those first-order metrics to the developer, so you can study them and optimize the code based on the insights they provide.

## Preconditions

1. Apple silicon based computer (M1 or newer)
2. macOS 14.0 or newer
3. Xcode 15.0 or newer

## Initial setup

1. Configure the given binary/library to be built with debug symbols; an example for Rust is provided below:
   ```toml
   [profile.release]
   debug = true
   ```
2. The build should be performed in release mode (the `--release` flag for Rust), because our goal is to measure the fastest version of our Metal library.
3. Open the **Instruments** app (an Xcode installation is required) and select the **Metal System Trace** template from the gallery that appears.
4. Select the binary to profile at the top of the Instruments window.
   1. You can add arguments to pass to the binary on launch.
   2.
You can set the working directory if necessary.

## Profiling setup

The default Metal template is configured to provide basic, thus limited, information about the binary, so it is worth enhancing the setup for your needs. In particular:

1. In the "Next Recording" section, for both the "Metal Application" and "Metal GPU Counters" options, set "Counter Set" to "Performance Limiters" and "Performance State" to "Maximum".
2. Add any missing instruments from the set of predefined ones by clicking "+Instrument".
   :::info
   All the purple cells are GPU related
   :::
3. Press the record button to start profiling.

## Profiling Report Interpretation

After the profiling run completes, something like the picture below appears on your screen. As you have just discovered, running the profiler is not a hard task; the hard part comes next: interpreting the results accurately.

![profile_metrics_overal](https://hackmd.io/_uploads/r1A3XuV4kl.png)

Each of the measures belongs to one of the following groups:

- **Task**: indicates a single task ongoing in time (e.g. the green `bucket_wise_accumulation (4)`); it represents the time frame of the running task that the other measures align with.
- **Resource Utilization Metric** (e.g. `ALU Utilization`): represents the utilization ratio of a given resource in a given tick.
- **Resource Limiter Metric** (e.g. `ALU Limiter`): represents the throughput occupancy ratio; when it hits 100%, that resource becomes the bottleneck of the whole computation pipeline.

Let's elaborate on the purpose of each of these groups.

### Task

It is worth noting that this measure allows you to map all the other measures, whether a limiter, a utilization ratio, or an absolute value, to the shader(s) actually running at that moment. It is hard to overvalue this bit, because it effortlessly gives you an idea of which piece of code to look at when designing an optimization.
### Resource Utilization Metric

**Utilization** metrics are all about the utilization of a particular computational part of the GPU pipeline; each is a ratio metric from 0 to 100%. Since the GPU is all about throughput, the ideal case to target is all utilization counters sitting near 100% simultaneously. It is worth noting that the worst-case scenario is when all of those counters measure around 0%, meaning the GPU is significantly starving. The case where a few of those utilizations hit the bar while others are somewhere in the middle should be considered the starting position for further optimization.

### Resource Limiter Metric

In fact, a limiter counter can be considered the other side of the same coin: if the utilization metric shows how heavily a part of the computational pipeline is loaded, the limiter counter shows how optimally it is loaded during task execution. Obviously, a limiter counter cannot have a lower value than its utilization counterpart, but the opposite situation, where the limiter counter is far higher than the utilization one, is not just possible but quite common. The best-case scenario is when paired utilization and limiter counters are on par with each other and, again, all of them hit the top bound of 100%, which corresponds to the most diverse and complete workload distribution among all the GPU compute units. The worst case appears when a limiter counter is significantly higher (by more than 10%) than its paired utilization counter. It is worth noting that limiters being low is never a concern by itself, because by definition they measure unintentional hits of the throughput upper bound; so if your limiters are low, you should rather focus on the previous step and improve utilization first.

## Optimisation framework

1. Run a profiling session (and check that all required metrics were successfully measured).
2.
Review the overall picture. Again, the best, yet unreachable, case is when all utilization counters are around 100%, so during the review pay closer attention to those **utilization counters that are around zero but shouldn't be** (e.g. buffer read/write, ALU, MMU, F16, F32).
3. Pick a single counter, likely the lowest utilization counter per the previous rule, and try to improve it significantly in any possible way, say by 20% or more.
4. If you succeed, move on to the next utilization counter.
5. After all sensible utilization counters have been increased, continue with the limiter counter that has the biggest gap between itself and its utilization counter. The goal here is to spread each workload as uniformly as possible; the target is to bring the limiter as close to the utilization counter as possible.
6. After you have solved the first one, proceed to the second, and so on.
7. In the end, the perfect picture is the following: all utilization counters sit around 100% across the whole computation, limiter counters are slightly higher by a fraction of a percent or so, and general counters such as **Compute Occupancy** and **Top Performance Limiter** are at their heights as well.

## Caveats

1. The GPU computational pipeline is quite a complicated thing, so be prepared for the possibility that improving one counter degrades another; while focused on a single improvement, keep an eye on all the others to catch any drops.
2. The Apple GPU is designed for a fairly narrow set of tasks: scene composition, frame rendering, and linear algebra compute (mostly related to LLM tasks or image processing). So the reasonable approach is to reduce your custom task to one of those.
3. Blockchain specific: as said above, the Apple GPU stack, besides graphics rendering tasks, is targeted at LLM cases.
This is why F16 (half precision) and F32 (full precision) cores are provided: LLM weights are mostly encoded with half or even lower precision. Working with types wider than that will likely leave these cores completely idle, so it is reasonable to consider decomposing your double (or even higher) precision values even further.

## Further reading

1. On how exactly to improve the code to make it use the Apple GPU more efficiently, please check this great article: [https://github.com/philipturner/metal-benchmarks](https://github.com/philipturner/metal-benchmarks).
2. On how to reach more extensive debugging tools for Metal performance shader debugging, please check this Apple article: [https://developer.apple.com/documentation/xcode/optimizing-gpu-performance](https://developer.apple.com/documentation/xcode/optimizing-gpu-performance).
3. On how to capture a Metal workload programmatically, please check the following article: [https://developer.apple.com/documentation/xcode/optimizing-gpu-performance](https://developer.apple.com/documentation/xcode/optimizing-gpu-performance).
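Returning to caveat 3 above, the decomposition idea can be sketched on the CPU side. This is a minimal Rust sketch, not any particular library's API: the helper names `to_limbs`/`from_limbs` are made up for illustration, and the 16-bit limb width is just one possible choice (in practice you would pick the limb width so that intermediate products still fit the precision of the GPU cores you target).

```rust
/// Split a u64 into four 16-bit limbs, least significant first.
/// (Hypothetical helper: the limb width is a tuning knob, chosen
/// so the per-limb arithmetic maps onto narrow GPU compute units.)
fn to_limbs(x: u64) -> [u16; 4] {
    [
        (x & 0xFFFF) as u16,
        ((x >> 16) & 0xFFFF) as u16,
        ((x >> 32) & 0xFFFF) as u16,
        ((x >> 48) & 0xFFFF) as u16,
    ]
}

/// Recombine the limbs back into the original u64.
fn from_limbs(limbs: [u16; 4]) -> u64 {
    limbs
        .iter()
        .enumerate()
        .fold(0u64, |acc, (i, &l)| acc | ((l as u64) << (16 * i)))
}

fn main() {
    let x = 0x0123_4567_89AB_CDEFu64;
    let limbs = to_limbs(x);
    // Round-trips losslessly: decompose, (do per-limb work on GPU), recombine.
    assert_eq!(limbs, [0xCDEF, 0x89AB, 0x4567, 0x0123]);
    assert_eq!(from_limbs(limbs), x);
    println!("{:#x} -> {:x?}", x, limbs);
}
```

The same pattern extends to 256-bit field elements common in blockchain workloads: an array of narrow limbs plus explicit carry handling, instead of one wide scalar the GPU has no native unit for.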