# GPU Acceleration of MSM on Apple Devices

## Introduction

Mobile devices are our go-to platform for privacy-sensitive tasks. They hold all our photos, messages, emails, and even our location. With an array of sensors, they know more about us than we might like to admit. It’s no wonder they’re a match made in heaven with **Zero-Knowledge Proofs (ZKPs)**, which allow us to use this data securely without exposing it.

Yet, ZKPs have never been friendly to mobile devices, or even laptops. Modern proving pipelines demand massive amounts of memory and compute power, typically requiring servers to generate succinct proofs. But our goal is different: **everything on-device**.

After [successfully deploying EZKL on mobile devices](https://blog.ezkl.xyz/post/ios/), we wanted to improve the experience and make it more efficient. The obvious solution? Use the GPU. Modern devices have GPUs that are increasingly treated as dedicated compute engines rather than just graphics tools.

Our focus was on accelerating **Multi-Scalar Multiplication (MSM)**, a key operation in the KZG polynomial commitment scheme and the single biggest bottleneck in ZKP proving. MSM alone accounts for up to **70% of compute time**, so every bit of improvement here matters.

## Why Apple Devices?

We specifically targeted Apple devices for several reasons:

1. **Unified Memory**: Apple’s architecture allows the CPU and GPU to share memory, avoiding the heavy data transfer overhead seen on other platforms.
2. **Hardware Uniformity**: Unlike other ecosystems, Apple’s devices are relatively consistent in their CPU-GPU design, making them easier to optimize.
3. **Previous Work**: We had already focused on iOS bindings for EZKL, making Apple the logical next step.
4. **Shared Architecture**: Both A-series (iPhones) and M-series (MacBooks) chips share the same underlying architecture. This means optimizing for iPhones also optimizes for Macs.

And since much of EZKL’s development happens on M-series Macs, speeding them up directly helps us ship improvements sooner!

## Implementation & Results

We built on the foundational work by Jeff (tg [@foodchain1028](https://t.me/foodchain1028)) and Moven (tg [@moven0831](https://t.me/moven0831)), who focused on [MSM optimization using Metal for Arkworks](https://github.com/zkmopro/gpu-acceleration). Our goal was to bring similar improvements to **Halo2**, the framework required by EZKL.

### Key Steps:

1. **Splitting MSM**: We divided the workload between the CPU and GPU, leveraging the strengths of each.
2. **Unified Memory**: We optimized for Apple’s unified memory architecture to minimize data transfer costs.
3. **Parallelization**: Metal shaders were used to accelerate the largest compute chunks.

### Performance Gains:

- **2× faster** than the CPU-based implementation on both M-series Macs and iPhones for MSMs of size **2^20** (~1 million points).
- **15× faster** compared to the previous GPU baseline.
- **9% improvement** in overall proving time after integration into EZKL.

While the 9% improvement in proving time is less than we initially hoped, we suspect this is due to the specifics of how MSM is used in real proofs, especially in cases involving many zeros or smaller circuits. This needs further investigation.

| ![M1 Pro Performance: CPU vs Metal Enabled GPU (Log2 Scale)](https://hackmd.io/_uploads/rJXa_V_Pkl.png) | ![iPhone 15 Pro Performance: CPU-Only vs GPU+CPU (Log2 Scale)](https://hackmd.io/_uploads/r1J6_EOv1l.png) |
|-------------------------|-------------------------|
| **Figure 1**: Performance comparison of CPU vs GPU (Metal-enabled) on an M1 Pro for MSM computation. | **Figure 2**: CPU-only vs combined CPU+GPU performance on an iPhone 15 Pro for MSM computation. |

### Impact:

These optimizations unlock some interesting use cases:

- **On-Device KYC and Fraud Detection**: Sundial’s ZK-KYC Onflow, launching in Q1, can run private checks and image processing directly on the device, ensuring user data never leaves the phone.
- **Credit Scoring & Risk Assessment**: Trustless scoring models can run locally and update platforms like Sentiment without revealing sensitive financial information.
- **Future Potential**: With GPU acceleration in place, we can imagine a host of privacy-preserving applications leveraging this capability.

We can't wait to see what the community does with this!

## Next Steps

There’s still plenty of room for improvement:

- **Refine GPU-CPU Splitting**: Fine-tune heuristics for distributing workloads and GPU-specific parameters.
- **Incorporate New MSM Techniques**: Explore recent research, such as [ePrint 2022/1321](https://eprint.iacr.org/2022/1321.pdf).
- **Specialized Proving Systems**: Investigate low-power provers optimized for mobile devices ([ePrint 2024/1970](https://eprint.iacr.org/2024/1970)).
- **Address Other Bottlenecks**: Tackle FFT, another key operation in ZKP pipelines.

With these steps, we aim to push the boundaries of what’s possible on mobile devices, bringing ZKP computations closer to everyday applications.

## Links

- **Codebase**: [Metal MSM GPU Acceleration on GitHub](https://github.com/zkonduit/metal-msm-gpu-acceleration/tree/feat/further-gpu-acceleration)
- **Community Discussion**: [zkMopro GPU Optimizations on Telegram](https://t.me/zkmopro/474)
- **Optimization Notes**: [Detailed Breakdown on HackMD](https://hackmd.io/@guard/BJDmuEODke)
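
## Appendix: A Toy Sketch of Splitting MSM

For readers new to MSM, the "Splitting MSM" step can be illustrated with a small sketch. An MSM computes the sum of `scalars[i] * points[i]`, and because group addition is associative and commutative, the terms can be partitioned, summed independently, and combined at the end. The code below is only an illustration under loud assumptions: the "group" is integers modulo a prime (not an elliptic curve such as the ones used by Halo2), the "GPU" worker is an ordinary thread rather than a Metal kernel, and the names `msm`, `msm_split`, and `gpu_fraction` are hypothetical and not taken from the actual codebase.

```rust
use std::thread;

/// Toy modulus (2^61 - 1). A stand-in for a real curve group, chosen
/// only so the example runs without external crates.
const P: u64 = (1u64 << 61) - 1;

/// "Group addition" in the toy group: addition modulo P.
fn add(a: u64, b: u64) -> u64 {
    ((a as u128 + b as u128) % P as u128) as u64
}

/// "Scalar multiplication" in the toy group: multiplication modulo P.
fn scalar_mul(scalar: u64, point: u64) -> u64 {
    ((scalar as u128 * point as u128) % P as u128) as u64
}

/// Reference MSM: the sum over i of scalars[i] * points[i].
fn msm(scalars: &[u64], points: &[u64]) -> u64 {
    assert_eq!(scalars.len(), points.len());
    scalars
        .iter()
        .zip(points)
        .fold(0, |acc, (&s, &p)| add(acc, scalar_mul(s, p)))
}

/// Split MSM: the first `gpu_fraction` of the terms go to one worker
/// (a plain thread standing in for the GPU), the rest are handled on
/// the calling thread (the CPU). The partial sums are then added.
fn msm_split(scalars: &[u64], points: &[u64], gpu_fraction: f64) -> u64 {
    assert_eq!(scalars.len(), points.len());
    assert!((0.0..=1.0).contains(&gpu_fraction));
    let cut = ((scalars.len() as f64) * gpu_fraction) as usize;
    let (gpu_s, cpu_s) = scalars.split_at(cut);
    let (gpu_p, cpu_p) = points.split_at(cut);
    thread::scope(|scope| {
        let gpu = scope.spawn(|| msm(gpu_s, gpu_p));
        let cpu = msm(cpu_s, cpu_p);
        add(gpu.join().unwrap(), cpu)
    })
}

fn main() {
    let n = 1u64 << 10;
    let scalars: Vec<u64> = (0..n).map(|i| (i * 7919 + 13) % P).collect();
    let points: Vec<u64> = (0..n).map(|i| (i * 104_729 + 1) % P).collect();
    let full = msm(&scalars, &points);
    // Any split point must give the same result as the reference MSM.
    for frac in [0.0, 0.25, 0.5, 1.0] {
        assert_eq!(msm_split(&scalars, &points, frac), full);
    }
    println!("split MSM matches the reference for all tested fractions");
}
```

In this sketch `gpu_fraction` plays the role of the splitting heuristic mentioned under Next Steps: the result is the same for any value, so in a real system it would presumably be tuned purely for speed, based on how fast each processor handles its share.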