
Detailed Notes on the GPU Acceleration of MSM

Below is a deeper dive into the technical optimizations we implemented to accelerate Multi-Scalar Multiplication (MSM) on Apple devices. If you’ve read the main blog post, this section offers a closer look at our decisions, benchmarks, and the challenges we encountered along the way.


1. Why Optimize MSM?

Multi-Scalar Multiplication is a central operation in KZG polynomial commitments, widely used in zero-knowledge proofs. We typically see MSM consuming around 70% of the compute time, especially in the setup and proving steps. By speeding it up, we gain:

  • Faster Proof Generation: Reduces overall proving and key generation times.
  • Better Feasibility on Low-Power Devices: Apple’s M-series (Macs) and A-series (iPhones) have shown promise for on-device cryptography.
  • Immediate Impact: A range of cryptographic protocols (ZKPs, Verkle Trees, Merkle commitments) benefit from a more efficient MSM.

Our project leverages previous work by Jeff (@foodchain1028) and Moven (@moven0831), whose Ethereum Foundation grant focused on Arkworks MSM using Metal. We adapted many of their ideas for Halo2 (our target for EZKL). Additionally, teams like Ingonyama are pursuing related GPU optimizations on Apple devices (metal-poc).


2. Performance Snapshots

2.1 Mac Performance (M1 Pro)

Sample Timings (CPU vs. GPU)

| log2(N) | CPU Range | GPU Range |
|---------|-----------|-----------|
| 16 | 86.50–95.55 ms | 89.91–93.95 ms |
| 17 | 158.11–159.34 ms | 122.70–130.70 ms |
| 18 | 281.41–289.69 ms | 194.35–200.72 ms |
| 19 | 510.40–529.16 ms | 357.14–378.82 ms |
| 20 | 1.02–1.07 s | 0.61–0.63 s |
| 21 | 1.76–1.82 s | 1.14–1.16 s |
| 22 | 3.97–4.17 s | 2.21–2.25 s |
| 23 | 7.67–7.87 s | 4.59–4.63 s |

By accelerating MSM with Metal, we see a measurable improvement over CPU-only times, especially at higher orders (log2(N) ≥ 18).

2.2 iOS Performance (iPhone 15 Pro)

Sample Timings (CPU vs. GPU+CPU)

| log2(N) | CPU-Only | GPU+CPU |
|---------|----------|---------|
| 16 | 85 ms | 98 ms |
| 17 | 165 ms | 128 ms |
| 18 | 302 ms | 200 ms |
| 19 | 620 ms | 380 ms |
| 20 | 1,405 ms | 787 ms |
| 21 | 3,649 ms | 1,619 ms |

2.3 Integration with EZKL

Setup Time:

  • CPU Average: ~15.24 s
  • GPU Average: ~13.98 s (~8.3% improvement)

Prove Time:

  • CPU Average: ~18.42 s
  • GPU Average: ~16.72 s (~9.2% improvement)

In real-world usage (like EZKL’s full pipelines), the improvements were smaller than raw benchmarks suggested—likely due to special cases (e.g., many zero scalars, smaller circuit windows). This remains an area for future investigation.

Notes on Benchmarking Conditions
  • These benchmarks were conducted for 2^20-sized MSMs in EZKL, where the number of rows was artificially inflated from 2^16 (the original circuit size). This adjustment may not perfectly reflect the circuit's actual computation profile, introducing some potential inaccuracy into the measurements.
  • Additionally, these results were obtained on a Mac Studio rather than the MacBook M1 Pro used in the earlier benchmarks. While the Mac Studio offers better thermal performance and sustained throughput, the baseline differences between the two devices mean that these numbers should not be directly compared to the raw MSM benchmarks from earlier sections.

3. The Optimization Approach

Initially, a single MSM over 2^20 points took around 10 seconds on GPU in a naive implementation. By contrast, CPU benchmarks (Arkworks) hovered around 1.3 s, and the Halo2Curves MSM was roughly 1.1 s. After our optimizations, we hit 0.59 s on GPU: roughly 1.9× faster than the Halo2 CPU baseline and over 15× faster than our original GPU version.

We break down our improvements into three main categories:

3.1 Reducing Setup Overhead

  1. Amortizing GPU Setup

    • Problem: Every invocation reinitialized the Metal library/device, costing ~20 ms each time.
    • Solution: Cache the GPU device/context in a static resource so that subsequent calls skip the redundant setup (see the sketch after this list).
    • Result: Saved ~4% of total compute time in high-volume scenarios.
  2. Encoding Step Optimization

    • Problem: Converting from Halo2/Arkworks representations to a GPU-friendly format took up to 700 ms.
    • Solution: Use memory preallocation and parallelization in Rust to batch-convert points and scalars.
    • Result: Reduced encoding to a few tens of milliseconds.
  3. Allocating Directly on GPU

    • Problem: Copy overhead from CPU to GPU memory.
    • Solution: Exploit Apple’s shared memory architecture—encode data “in place” on GPU buffers.
    • Result: Avoided large data transfers, improving throughput.
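
To make items 1–3 concrete, here is a minimal Rust sketch of the pattern (not our exact code): a `OnceLock` caches the Metal state across calls, and rayon batch-converts scalars into a preallocated limb buffer. `MetalState`, `LIMBS`, and the limb layout are illustrative stand-ins.

```rust
use rayon::prelude::*;
use std::sync::OnceLock;

// 8 x u32 limbs per 256-bit element; the exact GPU layout is an assumption here.
const LIMBS: usize = 8;

// Hypothetical handle for the cached Metal objects (device, command queue,
// compiled pipelines); fields omitted in this sketch.
struct MetalState;

static METAL_STATE: OnceLock<MetalState> = OnceLock::new();

fn metal_state() -> &'static MetalState {
    // get_or_init runs its closure at most once, so the ~20 ms Metal
    // library/device setup is paid on the first MSM call only.
    METAL_STATE.get_or_init(|| MetalState /* initialize Metal here */)
}

// Batch-convert scalars into a preallocated, GPU-friendly u32-limb layout,
// splitting the work across cores with rayon.
fn encode_scalars(scalars: &[[u64; 4]]) -> Vec<u32> {
    let mut out = vec![0u32; scalars.len() * LIMBS];
    out.par_chunks_mut(LIMBS)
        .zip(scalars.par_iter())
        .for_each(|(dst, s)| {
            for (i, w) in s.iter().enumerate() {
                dst[2 * i] = *w as u32;
                dst[2 * i + 1] = (*w >> 32) as u32;
            }
        });
    out
}
```

With Apple's unified memory, `out` could instead be the contents of a shared-storage-mode Metal buffer, so the encoded data is written once and read by the GPU without a copy (item 3).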

3.2 Optimizing the MSM GPU Implementation

The original GPU approach was functional but left plenty of parallelism on the table. With Apple’s specific thread group constraints, we needed to refine both the Pippenger algorithm steps and how we dispatched threads.

  1. Memory & Parallel Bottlenecks

    • We discovered certain stages (initialization, reduction) were memory-bound. Careful profiling guided us to batch tasks more effectively.
  2. Workflow Overview

    1. Window Splitting
      • Our scalars are 255-bit (bn256), so we split them into windows to parallelize each segment of the scalar multiplication (steps 1–2 are sketched in Rust after this list).
    2. Bucket Index Preparation
      • For each (scalar, point), compute a (bucket index, point index).
      • We removed unnecessary allocations, reducing GPU bottlenecks.
    3. Sorting Buckets
      • Ideally done on GPU, but Apple doesn’t provide built-in sorting libraries. We tried a custom merge sort (prototype branch) but ran into thread-group memory limits.
      • Ultimately, we settled on a CPU-based in-place sort, which still leverages shared memory to cut overhead by ~150 ms.
    4. Bucket Accumulation
      • Once sorted, all points for a given bucket index can be added up in parallel.
      • Large outlier buckets caused imbalance: some threads had thousands of points to sum, while others idled. Our solution was to group ~64 threads to handle a range of buckets together.
      • This approach got complex. We suspect a simpler method might suffice, or we could tailor the thread groups dynamically based on bucket size.
    5. Window Reduction
      • After each bucket is accumulated, we perform a partial reduction. Moving from a single-thread approach to a multi-thread strategy cut reduction from ~800 ms to ~40 ms in some cases.
    6. Final Summation
      • Only about 10 remaining points needed to be combined, which was faster on CPU than paying the IO cost to run a tiny kernel on GPU.
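
For intuition, here is a minimal CPU-side Rust sketch of steps 1–2 (window splitting and bucket-index preparation). In our implementation this work runs inside Metal kernels, and a production Pippenger often uses signed digits; this sketch uses plain unsigned windows, and the exact bucket-index encoding is illustrative.

```rust
const WINDOW_BITS: usize = 15;  // window size c (15 worked well for us)
const SCALAR_BITS: usize = 255; // bn256 scalar width
const NUM_WINDOWS: usize = (SCALAR_BITS + WINDOW_BITS - 1) / WINDOW_BITS; // 17

// Extract window `w` (the WINDOW_BITS-bit digit starting at bit
// w * WINDOW_BITS) from a little-endian 4x64-bit scalar.
fn window_digit(scalar: &[u64; 4], w: usize) -> u32 {
    let bit = w * WINDOW_BITS;
    let (limb, shift) = (bit / 64, bit % 64);
    let mut v = scalar[limb] >> shift;
    if shift + WINDOW_BITS > 64 && limb + 1 < 4 {
        v |= scalar[limb + 1] << (64 - shift); // digit straddles a limb boundary
    }
    (v as u32) & ((1 << WINDOW_BITS) - 1)
}

// Step 2: emit one (bucket index, point index) pair per window, folding the
// window id into the bucket index so one flat list covers all windows. This
// is the list that later gets sorted and accumulated.
fn bucket_indices(scalars: &[[u64; 4]]) -> Vec<(u32, u32)> {
    let mut out = Vec::with_capacity(scalars.len() * NUM_WINDOWS);
    for (i, s) in scalars.iter().enumerate() {
        for w in 0..NUM_WINDOWS {
            out.push((((w as u32) << WINDOW_BITS) | window_digit(s, w), i as u32));
        }
    }
    out
}
```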

3.3 Balancing CPU & GPU

As discussed in the zkMopro GPU report, we also tried interleaving CPU and GPU tasks:

  • gpu_with_cpu approach:
    • Splits MSM workload between CPU and GPU.
    • The ratio depends on log2(N). For instance, at N = 2^20 around 66% of the work was handled by the GPU, while smaller MSM sizes (below 2^16) sometimes favored the CPU more.
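
A minimal sketch of how such a split can look, with placeholder types and stubbed backends (the real `msm_gpu`/`msm_cpu` would dispatch the Metal kernels and the CPU MSM respectively):

```rust
use std::thread;

// Placeholder types standing in for the concrete curve point/scalar types.
#[derive(Clone, Copy, Default)]
struct Point(u64);
#[derive(Clone, Copy)]
struct Scalar(u64);

fn add(a: Point, b: Point) -> Point {
    Point(a.0.wrapping_add(b.0)) // stands in for elliptic-curve addition
}

// Stubbed backends for illustration only.
fn msm_gpu(_points: &[Point], _scalars: &[Scalar]) -> Point { Point::default() }
fn msm_cpu(_points: &[Point], _scalars: &[Scalar]) -> Point { Point::default() }

// Split one MSM between GPU and CPU. `gpu_ratio` (~0.66 at N = 2^20 in our
// experiments) would be tuned per log2(N).
fn msm_gpu_with_cpu(points: &[Point], scalars: &[Scalar], gpu_ratio: f64) -> Point {
    let split = (points.len() as f64 * gpu_ratio) as usize;
    let (gpu_pts, cpu_pts) = points.split_at(split);
    let (gpu_scl, cpu_scl) = scalars.split_at(split);
    // Run both halves concurrently; MSM is a sum, so the partial results
    // combine with a single extra point addition.
    thread::scope(|s| {
        let gpu = s.spawn(|| msm_gpu(gpu_pts, gpu_scl));
        let cpu = msm_cpu(cpu_pts, cpu_scl);
        add(gpu.join().unwrap(), cpu)
    })
}
```

Because MSM is one large sum, the split adds essentially no overhead beyond scheduling the two halves.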

4. Observations in Real Usage

  1. Integration with EZKL

    • Raw benchmarks showed ~2× faster MSM on random inputs, but real-world usage gave ~9–10% total speedup.
    • Many zero scalars or smaller circuit sizes let the CPU outperform the GPU in certain segments.
    • Eliminating trivial MSM calls helped, but further investigation is needed.
  2. Compilation Overhead

    • Rebuilding EZKL takes ~7 minutes, slowing iteration and performance tuning.

Overall, while our isolated MSM tests displayed strong gains, full sequential proofs in EZKL didn't speed up as much. We suspect this is due to the unique, sometimes sparse nature of actual proof vectors.


5. Challenges

  1. Library-Specific Patterns
    • For now, sorting is CPU-bound. Fully GPU-based sorting remains a to-do (merge sort or radix sort).
  2. Memory Conversions
    • Point/scalar conversion still happens on CPU. We may eventually shift this logic entirely to Metal.
  3. Concurrency Bugs
    • We observed occasional failures when multiple MSMs ran in parallel. A temporary mutex ensures single-threaded execution (see the sketch after this list), but a deeper fix is needed.
  4. Hyperparameter Tuning
    • Window size, thread group size, and the CPU-GPU split all matter. A window size of 15 gave us the best performance for 2^20–2^24, likely because 15 divides 255 evenly (255 / 15 = 17 windows).
    • Comprehensive param sweeps take ~22 hours, so we postponed it. The experimental code is here.
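
For reference, the stop-gap from item 3 amounts to no more than a global lock around the MSM entry point; a minimal sketch:

```rust
use std::sync::Mutex;

// Temporary workaround: serialize every MSM invocation until the
// concurrent-dispatch failures are properly fixed.
static MSM_LOCK: Mutex<()> = Mutex::new(());

/// Wrap any MSM call so that only one runs at a time.
fn with_msm_lock<R>(msm_call: impl FnOnce() -> R) -> R {
    let _guard = MSM_LOCK.lock().unwrap(); // held until the call returns
    msm_call()
}
```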

6. Conclusion and Future Work

Despite the challenges, our efforts significantly improved MSM performance on Apple hardware. We’ve beaten naive GPU baselines by over 20× in some cases and achieved 2× speedups over CPU implementations for larger MSM sizes.

Potential Next Steps

  1. Simplify Accumulation
    • Reduce complexity in the bucket-accumulation phase.
  2. Resolve Parallelization Bugs
    • Lift the single-MSM mutex constraint and fix concurrency issues properly.
  3. Deep-Dive into GPU Performance
    • Investigate GPU usage in real-world proof generation to close the gap between theoretical and practical speedups.
  4. Explore FFT Metal Shaders
    • Extend or adapt solutions like metal-fft or Icicle’s approach.

Ultimately, our GPU acceleration for MSM is one more step in bringing zero-knowledge proofs—and advanced cryptography in general—closer to everyday devices. There’s plenty more to do, but the groundwork is laid for further performance gains.