# Milestone 1 report | zkmopro GPU Exploration

## Overview

This report covers the first milestone of zkmopro GPU acceleration. In this milestone, we refactored the EC backend in Metal Shader, profiled our previous implementation to measure its performance, and developed an optimization plan and criteria for future work.

## Tasks

1. :heavy_check_mark: Adapted and refactored a better EC backend for Metal
   - [X] bigint backend implementation
   - [X] field backend implementation
   - [X] Montgomery multiplication implementation
   - [X] elliptic curve backend implementation
2. :heavy_check_mark: Profiled the current Metal MSM
3. :heavy_check_mark: Analyzed current performance bottlenecks using GPU tracing tools
4. :heavy_check_mark: Identified CPU-GPU latency issues
5. :heavy_check_mark: Produced a detailed plan to reduce latency and improve synchronization between CPU and GPU

## Adapted and Refactored Metal EC Backend

In this milestone, we integrated and refined a Metal-based elliptic curve (EC) backend inspired by the [ZPrize 2023 winning submission from Tal and Koh](https://github.com/z-prize/2023-entries/tree/main/prize-2-msm-wasm/webgpu-only/tal-derei-koh-wei-jie). The primary goal was to ensure broad compatibility with different limb sizes, explore multiple Montgomery multiplication approaches, and align our EC point algorithms with the MSM optimizations recommended in the [cuZK paper](https://eprint.iacr.org/2022/1321). Below is a summary of our main accomplishments and findings.

### 1. Support for Arbitrary Limb Sizes

- **Flexible Limb Configurations:** We developed an internal structure that can handle variable limb sizes. This allows us to dynamically pick the optimal limb size for different elliptic curves and hardware constraints.
- **Optimal Limb Size for BN254:** Preliminary benchmarks revealed that **16-bit limbs** strike the best balance between performance and compatibility for BN254, as detailed in the **Montgomery Multiplication** section below.
- **Reference:** Our strategy for variable-limb arithmetic is partially informed by [Gregor Mitscha-Baude’s writeup](https://github.com/mitschabaude/montgomery/blob/main/doc/zprize22.md#13-x-30-bit-multiplication).
- **Related PRs and issues:**
  - **BigInt Implementation ([#21](https://github.com/zkmopro/gpu-acceleration/pull/21)):** Foundations for large-integer computations on Metal.
  - **Field Implementation ([#22](https://github.com/zkmopro/gpu-acceleration/pull/22)):** Field arithmetic operations, including modular additions and subtractions.
  - **Improve Carry Handling of BigInt and Field Arithmetic ([#25](https://github.com/zkmopro/gpu-acceleration/issues/25)):** BigInt and field arithmetic operations need improved carry handling to prevent overflow during high-bit multiplications.

### 2. Montgomery Multiplication Implementations and Benchmarks

We tested three different Montgomery multiplication kernels, each using different optimizations and limb-size configurations.

1. **`mont_mul_modified`**
   - Achieved peak performance at **15-bit limbs** (129ms).
   - Based on [Mitscha-Baude’s approach](https://github.com/mitschabaude/montgomery#13-x-30-bit-multiplication) with some conditional branches omitted.
   - 16-bit limbs were inapplicable due to architectural constraints.
2. **`mont_mul_optimised`**
   - Similar performance to `mont_mul_modified` for 11–14 bit limbs.
   - Encountered correctness issues at higher limb sizes (≥15 bits), prompting further investigation.
   - It is effectively a streamlined version of `mont_mul_modified` with certain condition checks removed.
3. **`mont_mul_cios`**
   - Implements the **Coarsely Integrated Operand Scanning (CIOS)** approach described in [this paper](https://www.microsoft.com/en-us/research/wp-content/uploads/1996/01/j37acmon.pdf).
   - Performs best at **16-bit limbs**, completing in **114ms**, making it the fastest option in our tests.
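To make the CIOS structure concrete, here is a CPU-side C++ sketch (not the Metal kernel itself): it keeps the 16-bit limbs of our fastest configuration but uses a toy modulus p = 2^32 − 5 instead of BN254, purely to show the interleaved multiply-and-reduce loop that gives CIOS its name.

```cpp
#include <cassert>
#include <cstdint>

// Toy parameters (illustrative, NOT the BN254 values used in the kernel):
// p = 2^32 - 5 (prime), split into two 16-bit limbs; R = 2^32.
constexpr int NUM_LIMBS = 2;
constexpr uint32_t LIMB_MASK = 0xFFFF;
constexpr uint32_t P[NUM_LIMBS] = {0xFFFB, 0xFFFF};
constexpr uint64_t P_U64 = 0xFFFFFFFBULL;

// n0 = -p^{-1} mod 2^16, via Newton iteration on the low limb.
uint32_t neg_inv_p0() {
    uint32_t inv = 1;
    for (int i = 0; i < 5; i++)                 // precision doubles each step
        inv = (inv * (2u - P[0] * inv)) & LIMB_MASK;
    return (LIMB_MASK + 1 - inv) & LIMB_MASK;   // -inv mod 2^16
}

// CIOS: out = a * b * R^{-1} mod p. Each outer iteration interleaves one
// multiplication step with one Montgomery reduction step.
void mont_mul_cios(const uint32_t a[], const uint32_t b[], uint32_t out[]) {
    uint64_t t[NUM_LIMBS + 2] = {0};
    const uint64_t n0 = neg_inv_p0();
    for (int i = 0; i < NUM_LIMBS; i++) {
        // multiplication step: t += a * b[i]
        uint64_t c = 0, s;
        for (int j = 0; j < NUM_LIMBS; j++) {
            s = t[j] + (uint64_t)a[j] * b[i] + c;
            t[j] = s & LIMB_MASK; c = s >> 16;
        }
        s = t[NUM_LIMBS] + c;
        t[NUM_LIMBS] = s & LIMB_MASK;
        t[NUM_LIMBS + 1] = s >> 16;
        // reduction step: add m * p so the low limb becomes zero, then shift
        uint64_t m = (t[0] * n0) & LIMB_MASK;
        s = t[0] + m * P[0]; c = s >> 16;
        for (int j = 1; j < NUM_LIMBS; j++) {
            s = t[j] + m * P[j] + c;
            t[j - 1] = s & LIMB_MASK; c = s >> 16;
        }
        s = t[NUM_LIMBS] + c;
        t[NUM_LIMBS - 1] = s & LIMB_MASK;
        t[NUM_LIMBS] = t[NUM_LIMBS + 1] + (s >> 16);
    }
    // result is in [0, 2p): one conditional subtraction finishes the job
    uint64_t val = 0;
    for (int j = NUM_LIMBS; j >= 0; j--) val = (val << 16) | t[j];
    if (val >= P_U64) val -= P_U64;
    for (int j = 0; j < NUM_LIMBS; j++) { out[j] = (uint32_t)(val & LIMB_MASK); val >>= 16; }
}
```

With R = 2^32 we have R mod p = 5; so a = 2 and b = 3 in Montgomery form are the limb vectors {10, 0} and {15, 0}, and `mont_mul_cios` on them returns {30, 0}, i.e. (2·3)·R mod p.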
| **Algorithm**          | **16-bit limbs** | **15-bit limbs** | **14-bit limbs** | **13-bit limbs** | **12-bit limbs** | **11-bit limbs** |
|------------------------|------------------|------------------|------------------|------------------|------------------|------------------|
| **mont_mul_modified**  | Inapplicable     | 129ms            | 147ms            | 133ms            | 223ms            | 329ms            |
| **mont_mul_optimised** | Inapplicable     | Inapplicable     | 142ms            | 132ms            | 223ms            | 325ms            |
| **mont_mul_cios**      | <u>**114ms**</u> | 249ms            | 340ms            | 335ms            | Inapplicable     | Inapplicable     |

> **Note:** “Inapplicable” indicates failure at that limb size (e.g., due to overflow or kernel misconfiguration).

For a broader overview of Montgomery multiplication techniques, see [Koh’s writeup](https://weijiekoh.github.io/clientside/montgomery_multiplication.html).

```bash
# Reproduce the benchmark result with this test
cargo test --release --package mopro-msm --lib -- msm::metal_msm::tests::mont_backend::mont_benchmarks --nocapture --include-ignored
```

- **Related PRs and issues:**
  - **Optimized Montgomery Multiplication ([#6](https://github.com/zkmopro/gpu-acceleration/issues/6)):** Introduces the core MontMul kernels tested above.

### 3. EC Point Arithmetic in Jacobian Coordinates

- **Point Addition ([pADD](https://hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-0.html#addition-add-2007-bl)), Mixed Addition ([pMADD](https://hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-0.html#madd-2007-bl)), and Doubling ([pDBL](https://hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-0.html#doubling-dbl-2009-l)):** Implemented according to the [Explicit-Formulas Database](https://hyperelliptic.org/EFD).
- **Correctness and Stability:** Verified using unit tests on a range of random EC points and scalars, ensuring reliability of point arithmetic on the GPU.
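To make the Jacobian formulas concrete, below is a CPU-side C++ sketch of the `dbl-2009-l` doubling formula (the pDBL variant linked above), shown over a toy curve y² = x³ + 3 mod 101 (the same b = 3 as BN254, but a tiny field) rather than the actual Metal field types.

```cpp
#include <cassert>
#include <cstdint>

// Toy field: arithmetic mod a small prime (the real kernel works over
// BN254's 254-bit field via the Montgomery backend above).
constexpr int64_t P = 101;
int64_t md(int64_t x) { return ((x % P) + P) % P; }

struct Jac { int64_t X, Y, Z; };   // Jacobian: x = X/Z^2, y = Y/Z^3

// dbl-2009-l from the Explicit-Formulas Database (valid for a = 0 curves).
Jac jacobian_dbl(const Jac& p) {
    int64_t A = md(p.X * p.X);                              // A = X1^2
    int64_t B = md(p.Y * p.Y);                              // B = Y1^2
    int64_t C = md(B * B);                                  // C = B^2
    int64_t D = md(2 * (md((p.X + B) * (p.X + B)) - A - C));// D = 2*((X1+B)^2-A-C)
    int64_t E = md(3 * A);                                  // E = 3*A
    int64_t F = md(E * E);                                  // F = E^2
    int64_t X3 = md(F - 2 * D);
    int64_t Y3 = md(E * md(D - X3) - 8 * C);
    int64_t Z3 = md(2 * p.Y * p.Z);
    return {X3, Y3, Z3};
}

// Modular inverse by Fermat's little theorem (P prime).
int64_t inv(int64_t a) {
    int64_t r = 1, b = a, e = P - 2;
    for (; e; e >>= 1, b = md(b * b)) if (e & 1) r = md(r * b);
    return r;
}

// Convert back to affine to compare against the textbook tangent-line formula.
void to_affine(const Jac& p, int64_t& x, int64_t& y) {
    int64_t zi = inv(p.Z), zi2 = md(zi * zi);
    x = md(p.X * zi2);
    y = md(p.Y * md(zi2 * zi));
}
```

For the point (1, 2) on this curve, doubling in Jacobian coordinates and converting back to affine yields (68, 74), which matches the affine doubling formula and still satisfies the curve equation.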
```bash
# Run this command for the comprehensive tests
cargo test --release --package mopro-msm --lib -- msm::metal_msm::tests --nocapture
```

- **Related PRs and issues:**
  - **Refactor Task ([#20](https://github.com/zkmopro/gpu-acceleration/issues/20)):** Continual refinement of the Metal backend code structure to simplify kernel dispatch and reduce overhead.
  - **EC Point Algorithms Implementation ([#31](https://github.com/zkmopro/gpu-acceleration/pull/31)):** pADD, pMADD, and pDBL in Jacobian coordinates for faster group operations.

## Profile report

[This is the report](https://hackmd.io/@yaroslav-ya/rJkpqc_Nke) profiling the current MSM (running the test `--exact msm::metal::msm::tests::test_msm_correctness_medium_sample --nocapture`), conducted on a `MacBook M1 Pro 14"` by [yaroslav-ya](https://github.com/yaroslavyaroslav), who also shared how to profile during a zkmopro community call ([manual](https://hackmd.io/@yaroslav-ya/HkhUD6x4ye)). Here are the profiling metrics.

### Common Metrics

| Metric                           | Value (ms)                            |
|----------------------------------|---------------------------------------|
| Overall Binary Time Run          | ~240,000                              |
| Overall Shaders Time Run         | 182,600                               |
| Cumulative Longest Shader to Run | bucket_wise_accumulation (121,500.00) |
| Second Longest Shader to Run     | sum_reduction (5,780.00)              |
| Third Longest Shader to Run      | final_accumulation (325.53)           |

### Single Run Metrics

| Metric                           | Max Value (%)                          | Avg Value (%) | Comment |
|----------------------------------|----------------------------------------|---------------|---------|
| Top Performance Limiter          | <span style="color:red">99.9517</span> | 7.8164        |         |
| Compute Occupancy                | 63.2224                                | 7.5858        |         |
| GPU Write Bandwidth              | 224.7269                               | 2.1583        | Gb/s    |
| GPU Read Bandwidth               | 170.8448                               | 2.6435        | Gb/s    |
| ALU Limiter                      | 72.0604                                | 6.7060        |         |
| ALU Utilization                  | 64.4509                                | 6.3585        |         |
| Buffer Load Utilization          | 51.3569                                | 2.2396        |         |
| Buffer Store Utilization         | 34.0671                                | 0.1608        |         |
| Buffer Read Limiter              | <span style="color:red">99.9517</span> | 3.1027        |         |
| Buffer Write Limiter             | 54.2443                                | 0.1618        |         |
| GPU Last Level Cache Limiter     | 74.1960                                | 1.1863        |         |
| GPU Last Level Cache Utilization | 71.0583                                | 0.9938        |         |
| MMU TLB Miss Rate                | 72.8721                                | 0.6302        |         |
| MMU Utilization                  | 50.8915                                | 0.7273        |         |
| MMU Limiter                      | 72.3086                                | 1.1514        |         |
| F32 Utilization                  | 14.7544                                | 0.0082        |         |
| F16 Utilization                  | 38.3325                                | 0.0157        |         |

---

Overall, our profiling reveals several issues in the current `metal_msm` work.

### 1. Buffer Reading Throughput Issue

The **Top Performance Limiter** max value is identical to the **Buffer Read Limiter** max value, which implies that our current bottleneck is strongly related to **buffer reading throughput**:

* Buffer Read Limiter: 99.9517%
* GPU Read Bandwidth: 170.8448 Gb/s

This indicates that shader performance is primarily constrained by **memory access patterns** and **buffer reading efficiency**.

### 2. Uneven distribution across GPU cores

The significant gap between **Max Values** and **Avg Values** shows that the workload is passed to GPU cores unevenly, in large chunks. This should be the next issue to address. For example, the `Buffer Read Limiter` shows a 99.95% max but only a 3.10% average. This implies:

* Some threads/threadgroups are doing significantly more work than others.
* Some GPU cores may be heavily loaded while others are underutilized.
* This also creates bottlenecks, as some cores finish early while others continue processing.

For example, in our work:

```cpp
// perform bucket-wise accumulation: each thread serially folds every
// (bucket, point) pair assigned to its bucket
while (bucket_start_idx < max_idx && thread_id == buckets_indices[bucket_start_idx].x) {
    Point p = buckets_matrix[thread_id];
    buckets_matrix[thread_id] = p + p_buff[buckets_indices[bucket_start_idx].y];
    bucket_start_idx++;
}
```

Some threads have to process many bucket accumulations while others have none.
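This imbalance can be reproduced in miniature on the CPU. The sketch below (plain C++, not Metal; all parameters are made up for illustration) assigns uniformly random scalar chunks to buckets the way the accumulation pass does; since one thread owns one bucket and a SIMD group is gated by its slowest thread, the busiest bucket sets the pace for the whole pass.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Count how many (bucket, point) pairs each bucket receives when scalar
// chunks are uniformly random. One "thread" owns one bucket, so in the
// serial accumulation scheme a thread's while-loop runs counts[bucket]
// iterations -- and the slowest thread gates the others.
std::vector<uint32_t> bucket_counts(uint32_t num_points, uint32_t num_buckets,
                                    uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<uint32_t> chunk(0, num_buckets - 1);
    std::vector<uint32_t> counts(num_buckets, 0);
    for (uint32_t i = 0; i < num_points; i++) counts[chunk(rng)]++;
    return counts;
}
```

With, say, 2^16 points spread over 2^12 buckets, the average load is 16 pairs per bucket, but the busiest bucket typically receives noticeably more; that is exactly the Max ≫ Avg pattern visible in the table above.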
---

## Optimization plan

The optimization techniques and the issues to be solved depend on the current profile report. However, our current work on **refactoring the EC backend** and the future work (integrating cuZK algorithms for MSM) will change the code substantially. Thus, the plan will be adjusted dynamically based on our status and should serve as criteria to follow in future work.

* Data Structure:
  * Structure of Arrays (SoA)[^SoA1][^SoA2] generally performs better than Array of Structures (AoS) due to memory coalescing.
  * In fact, our future work of integrating cuZK's algorithm already includes this technique.
* Memory Access Pattern Restructuring
  * Use shared memory[^use_shared_memory]
* Dispatching Threadgroups
  * We will focus on optimizing performance on our experimental device.

---

## Conclusion

The current profiling results show that the previous `msm` work has several issues, including **uneven workload dispatch across GPU cores**, **suboptimal input data structures**, and **unoptimized threadgroup/thread dispatch sizes**.

### Review of previous work

Regarding uneven workload dispatch, `bucket_wise_accumulation` is a good example from our previous work:

```cpp
// each thread serially accumulates every pair assigned to its bucket
while (bucket_start_idx < max_idx && thread_id == buckets_indices[bucket_start_idx].x) {
    Point p = buckets_matrix[thread_id];
    buckets_matrix[thread_id] = p + p_buff[buckets_indices[bucket_start_idx].y];
    bucket_start_idx++;
}
```

In this case, some threads may have to wait for other threads still running this while loop, causing significant thread idling and reduced performance. This is what we have to avoid in future work.

---

Another case, this time of a poor **memory access pattern**, is in `accumulation_and_reduction_phase`:

```cpp
// neighboring threads read addresses that are buckets_len elements apart
Point bucket = buckets_matrix[thread_id * buckets_len + idx];
```

Here we used non-coalesced memory access while reading from `buckets_matrix`. This causes inefficient memory usage on the GPU, which is also a main cause of the low performance.
## References - [metal profiling manual by @Yar](https://hackmd.io/@yaroslav-ya/HkhUD6x4ye) - [profiling report by @Yar](https://hackmd.io/@yaroslav-ya/rJkpqc_Nke) - [Metal note by @FoodChain](https://hackmd.io/Bf7ZxFdgS0SBV7lRsE27TA) [^SoA1]: [What is SoA?](https://stackoverflow.com/questions/28676445/array-of-structures-aos-vs-structure-of-arrays-soa-on-random-reads-for-vecto) [^SoA2]: [AoS vs SoA particle data and GPU memory coalescing](https://github.com/AMReX-Codes/amrex/discussions/2747) [^use_shared_memory]: [metal-optimizing-memory-access](https://stackoverflow.com/questions/43864136/metal-optimizing-memory-access)