# GPU Acceleration for Client-Side ZK Proving on Mobile Devices
*Can mobile GPUs make zero-knowledge proofs fast enough for real-time use? We investigated — and the answer is more nuanced than you'd expect.*
---
## Introduction
Zero-knowledge proofs (ZKPs) are one of the most promising tools for bringing privacy to Ethereum. But there's a catch: proving is computationally expensive, and if we want privacy to be *user-controlled* — generated on the user's own device rather than delegated to a centralized server — we need client-side proving to be fast.
Today, it isn't. On mobile hardware, generating even a simple ZK proof can take several seconds. For privacy-preserving transactions, identity verification, or anonymous voting to feel native, proving times need to drop below 200ms.
GPU acceleration is one path to get there. GPUs offer massive parallelism — thousands of small cores executing operations simultaneously — which maps well onto certain ZK proving bottlenecks. But mobile GPUs are not desktop GPUs, and the gap between theoretical speedups and practical gains is significant.
This article presents our findings from working on [Deimos](https://github.com/AnonDevTeam/Deimos), a mobile zkVM benchmarking tool built by BlocSoc IIT Roorkee. We benchmarked hash functions (Blake2, Blake3, MiMC-256, Pedersen, Poseidon, Rescue-Prime, SHA-256) across four proving frameworks (Circom, Noir, Cairo, and imp1) on real Android and iOS hardware — and then investigated how GPU acceleration could improve these results.
What follows is a practical analysis grounded in real measurements: we'll start with our CPU-only baseline benchmarks from five different mobile devices, then examine where GPU acceleration helps, where it doesn't, and what it actually takes to ship it on a phone.
---
## The Mobile Proving Problem
Before diving into GPU acceleration, it helps to understand *why* client-side ZK proving is slow on mobile devices.
In a ZKP system, the prover must demonstrate knowledge of some secret (a "witness") by evaluating an arithmetic circuit — a system of equations called **constraints**. A simple hash like Poseidon might compile to ~250 constraints. SHA-256 compiles to over 30,000. The prover must satisfy *every* constraint, and typical real-world circuits involve hundreds of thousands to millions of them.
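To make "constraints" concrete, here is a minimal sketch of how an R1CS-style prover checks a witness against its constraint system. The 7-bit prime and the vector layout are illustrative only (real systems use ~254-bit field moduli, and this mirrors no particular library's API):

```python
# Toy R1CS check: each constraint (a, b, c) requires <a,w> * <b,w> == <c,w> (mod P).
P = 97  # toy prime modulus; real systems use ~254-bit primes (e.g. BN254)

def dot(vec, w):
    return sum(v * x for v, x in zip(vec, w)) % P

def satisfies(constraints, w):
    """True iff the witness vector w satisfies every constraint."""
    return all(dot(a, w) * dot(b, w) % P == dot(c, w) for a, b, c in constraints)

# Encode the single constraint x * x = y with witness layout w = [1, x, y]:
constraints = [([0, 1, 0], [0, 1, 0], [0, 0, 1])]
assert satisfies(constraints, [1, 5, 25])        # 5 * 5 = 25, witness valid
assert not satisfies(constraints, [1, 5, 26])    # 26 != 25, witness invalid
```

A real circuit is simply hundreds of thousands of these triples, and the prover must satisfy all of them at once.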
This is the kind of workload CPUs aren't optimized for: large-scale, repetitive arithmetic over finite fields. Mobile CPUs make it worse — they're power-constrained, thermally limited, and running at lower clock speeds than their desktop counterparts.
The natural question: can we offload this work to the GPU?
---
## Where GPU Acceleration Fits in ZK Proving
Not all parts of ZK proving benefit equally from GPU parallelism. The three main computational bottlenecks — and how they map to GPU workloads — are:
### 1. Multi-Scalar Multiplication (MSM)
MSM involves computing a weighted sum of elliptic curve points:
```
Result = s₁·G₁ + s₂·G₂ + ... + sₙ·Gₙ
```
This is the dominant cost in SNARKs (Groth16, PLONK). Each scalar-point multiplication is independent, making MSM **highly parallelizable** and an ideal GPU target. On desktop GPUs, MSM acceleration alone has delivered 5–10× speedups in tools like [ICICLE](https://github.com/ingonyama-zk/icicle).
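For intuition about why MSM parallelizes so well, here is a minimal sketch of the bucket (Pippenger) method that GPU MSM kernels implement. Integers mod a small prime stand in for elliptic-curve points (same algorithm, much simpler group law); the modulus, window width, and scalar size are toy values:

```python
# Pippenger bucket method, with integers mod Q under addition as a stand-in
# group for curve points. The inner bucket-accumulation loop is the
# embarrassingly parallel part that maps onto GPU threads.
Q = 10007    # stand-in group modulus (toy value)
C = 4        # window width in bits
BITS = 16    # scalar bit length

def msm_naive(scalars, points):
    return sum(s * p for s, p in zip(scalars, points)) % Q

def msm_pippenger(scalars, points):
    total = 0
    for w in reversed(range(0, BITS, C)):          # process windows high to low
        buckets = [0] * (1 << C)
        for s, p in zip(scalars, points):          # each pair is independent
            idx = (s >> w) & ((1 << C) - 1)
            if idx:
                buckets[idx] = (buckets[idx] + p) % Q
        running = acc = 0
        for b in range(len(buckets) - 1, 0, -1):   # acc = sum of b * buckets[b]
            running = (running + buckets[b]) % Q
            acc = (acc + running) % Q
        total = ((total << C) + acc) % Q           # C "doublings" shift the window
    return total

scalars = [3, 141, 59, 26535]
points  = [89, 793, 238, 4626]
assert msm_pippenger(scalars, points) == msm_naive(scalars, points)
```

On a GPU, the bucket accumulation is distributed across thousands of threads; the final bucket reduction is a cheap serial tail.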
### 2. Number Theoretic Transform (NTT)
NTT is the finite-field equivalent of the Fast Fourier Transform (FFT). It's used heavily in polynomial commitment schemes (KZG, FRI) to convert between coefficient and evaluation representations.
NTT has a butterfly structure similar to FFT — parallelizable, but **memory-bandwidth-bound**. GPUs can accelerate NTT, but the speedup is limited by how fast data moves between memory and compute cores. This is especially relevant on mobile, where memory bandwidth is shared between CPU and GPU.
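A toy radix-2 NTT over GF(257) shows the butterfly structure. The butterflies at each recursion level are independent (hence GPU-friendly), but every level must read and write the full array, which is where the memory-bandwidth bound comes from. Parameters here are illustrative; real provers use ~254-bit NTT-friendly fields:

```python
# Minimal radix-2 NTT over GF(257). OMEGA = 64 is a primitive 8th root of
# unity mod 257 (64**8 % 257 == 1).
P, OMEGA, N = 257, 64, 8

def ntt(coeffs, omega=OMEGA):
    if len(coeffs) == 1:
        return coeffs[:]
    even = ntt(coeffs[0::2], omega * omega % P)   # the recursive halves are
    odd  = ntt(coeffs[1::2], omega * omega % P)   # independent of each other
    out, w = [0] * len(coeffs), 1
    for i in range(len(coeffs) // 2):             # "butterfly" combine step
        out[i] = (even[i] + w * odd[i]) % P
        out[i + len(coeffs) // 2] = (even[i] - w * odd[i]) % P
        w = w * omega % P
    return out

def intt(evals):
    inv_n = pow(N, P - 2, P)                      # Fermat inverse of N
    res = ntt(evals, pow(OMEGA, P - 2, P))        # inverse uses omega^-1
    return [x * inv_n % P for x in res]

poly = [3, 1, 4, 1, 5, 9, 2, 6]
assert intt(ntt(poly)) == poly                    # coefficient <-> evaluation round trip
```

Each butterfly does trivial arithmetic on two array slots it just loaded, so throughput is dominated by how fast those loads and stores go, not by compute.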
### 3. Witness Generation
Before proving begins, the prover must compute the witness — the concrete values that satisfy the circuit. For hash-based circuits, this means actually computing the hash.
Witness generation is the *least* parallelizable step. It's typically sequential and depends on the specific hash function. For ZK-unfriendly hashes like SHA-256 or Keccak, whose circuits are large, witness generation can be a significant fraction of total proving time. GPU acceleration offers **modest** gains here, mostly for batch witness computation.
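The dependency chain is easy to see in a MiMC-style round function: round *i* cannot start until round *i−1* finishes. This toy sketch (tiny prime, eight made-up round constants, not the real MiMC parameters) records the per-round trace that becomes part of the witness:

```python
# Why witness generation resists parallelism: every round reads the previous
# round's output, so the rounds form a strict dependency chain.
# Toy parameters; real MiMC uses a ~254-bit field and many more rounds.
P = 101
ROUND_CONSTANTS = [3, 17, 42, 7, 99, 12, 56, 81]  # illustrative values only

def mimc_witness(x, key):
    trace = [x]                         # the per-round values feed the witness
    for c in ROUND_CONSTANTS:
        x = pow(x + key + c, 3, P)      # round i needs round i-1's output
        trace.append(x)
    return (x + key) % P, trace

digest, witness = mimc_witness(5, 7)
assert len(witness) == len(ROUND_CONSTANTS) + 1
```

A GPU can compute many *independent* hashes like this at once (batch witness generation), but it cannot speed up a single chain.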
### Summary
| Bottleneck | GPU Suitability | Key Constraint |
| -------------------- | --------------- | ---------------------------- |
| MSM | Excellent | Compute-bound, embarrassingly parallel |
| NTT | Good | Memory-bandwidth-bound |
| Witness Generation | Limited | Mostly sequential |
---
## ZK-Friendliness: Why the Hash Function Matters
The choice of hash function dramatically affects both the circuit size and the GPU acceleration potential.
**ZK-friendly hashes** (Poseidon, MiMC-256, Rescue-Prime, Pedersen) were designed to minimize constraint count in arithmetic circuits. Poseidon, for example, uses ~250 constraints — meaning the proving overhead is dominated by MSM and NTT, both of which benefit from GPU parallelism.
**ZK-unfriendly hashes** (SHA-256, Keccak-256, Blake2/Blake3) were designed for CPU efficiency, not arithmetic circuit efficiency. SHA-256 requires 30,000+ constraints. Keccak-256 can exceed 150,000. The resulting circuits are enormous, and while GPU-accelerated MSM and NTT help in absolute terms, the circuit is so large that proving times remain impractical for real-time mobile use.
This creates a counterintuitive dynamic:
> **GPU acceleration provides the *largest absolute* speedup for unfriendly hashes (because the circuits are bigger), but the *most practical impact* for friendly hashes (because the resulting times actually hit usable thresholds).**
If your application can choose its hash function, use Poseidon or Pedersen; GPU acceleration then becomes the difference between "slow" and "real-time." If you're stuck with Keccak-256 (e.g., for Ethereum compatibility), GPU acceleration helps, but don't expect miracles on mobile.
---
## Implementation Approaches
We evaluated four strategies for bringing GPU acceleration to mobile ZK proving, each targeting a different framework. The diagram below shows how a hybrid CPU/GPU proving pipeline would work compared to the current CPU-only approach.
```mermaid
flowchart TB
subgraph current["Current: CPU-Only Pipeline"]
direction LR
A1["Circuit<br>Compilation"] --> A2["Witness<br>Generation<br>(CPU)"]
A2 --> A3["MSM<br>(CPU)"]
A3 --> A4["NTT<br>(CPU)"]
A4 --> A5["Proof<br>Output"]
end
subgraph proposed["Proposed: Hybrid CPU/GPU Pipeline"]
direction LR
B1["Circuit<br>Compilation"] --> B2["Witness<br>Generation<br>(CPU)"]
B2 --> B3{"Circuit Size<br>> 10K?"}
B3 -->|Yes| B4["MSM + NTT<br>(GPU via Metal/Vulkan)"]
B3 -->|No| B5["MSM + NTT<br>(CPU)"]
B4 --> B6["Proof<br>Output"]
B5 --> B6
end
current ~~~ proposed
style current fill:#1a1a2e,stroke:#e94560,color:#fff
style proposed fill:#1a1a2e,stroke:#0f3460,color:#fff
```
### Approach 1: ICICLE for Circom (Groth16)
[ICICLE](https://github.com/ingonyama-zk/icicle) is Ingonyama's GPU-accelerated cryptographic library. It provides CUDA-based implementations of MSM and NTT that can replace the CPU-bound paths in provers like RapidSnark.
**Mobile adaptation:** ICICLE is built for CUDA (NVIDIA desktop GPUs). Porting to mobile requires a backend abstraction layer — Metal compute shaders on iOS, Vulkan compute shaders on Android. This is non-trivial: you're essentially reimplementing the core kernels for two different GPU APIs.
**Expected improvement:** 2–3× faster proving for large circuits.
### Approach 2: Barretenberg GPU for Noir (UltraPlonk)
Aztec's [Barretenberg](https://github.com/AztecProtocol/aztec-packages/tree/next/barretenberg) backend already has experimental GPU support for its UltraPlonk prover. Enabling GPU acceleration here is more about compilation flags and backend configuration than ground-up reimplementation.
**Mobile adaptation:** Barretenberg's GPU path targets CUDA/OpenCL. A Metal/Vulkan port is needed, but the abstraction is cleaner than ICICLE's tight CUDA coupling.
**Expected improvement:** 1.5–2× faster proving.
### Approach 3: Stwo for Cairo (STARKs)
StarkWare's [Stwo](https://github.com/starkware-libs/stwo) prover is a next-generation STARK prover that replaces the older Stone prover. Stwo is architecturally different — it uses circle STARKs and is fundamentally more efficient, independent of GPU acceleration.
**Important distinction:** The 10–30× speedup attributed to Stwo is primarily an *algorithmic* improvement over Stone, not a GPU-specific gain. Adding GPU acceleration (via WebGPU or Metal) on top of Stwo's already-faster architecture amplifies the benefit further, but conflating "new prover" with "GPU acceleration" would be misleading.
**Expected improvement:** 10–30× faster proving (algorithm + GPU combined).
### Approach 4: Hybrid CPU–GPU Strategy
Not every proof benefits from GPU offloading. For small circuits (< 10,000 constraints), the overhead of transferring data to the GPU, dispatching compute shaders, and reading results back can *exceed* the time saved.
A practical strategy:
- **Small circuits (< 10K constraints):** CPU-only. The transfer overhead erases GPU gains.
- **Medium circuits (10K–100K constraints):** GPU-accelerated MSM, CPU witness generation.
- **Large circuits (> 100K constraints):** Full GPU offload of MSM and NTT.
This hybrid approach requires runtime profiling and circuit-size heuristics, but it avoids the worst-case scenario of GPU acceleration actually *slowing down* small proofs.
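A minimal sketch of that routing heuristic, with placeholder backend labels and thresholds (real code would profile thresholds per device):

```python
# Three-tier CPU/GPU routing by constraint count. The thresholds mirror the
# tiers above; backend names are placeholders, not a real prover API.
SMALL, LARGE = 10_000, 100_000

def plan(constraint_count: int) -> dict:
    if constraint_count < SMALL:
        return {"msm": "cpu", "ntt": "cpu"}      # transfer overhead dominates
    if constraint_count < LARGE:
        return {"msm": "gpu", "ntt": "cpu"}      # offload the biggest win first
    return {"msm": "gpu", "ntt": "gpu"}          # full offload

assert plan(250) == {"msm": "cpu", "ntt": "cpu"}        # e.g. Poseidon
assert plan(30_000)["msm"] == "gpu"                     # e.g. SHA-256
assert plan(150_000) == {"msm": "gpu", "ntt": "gpu"}    # e.g. Keccak-256
```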
---
## Baseline Benchmarks: Where We Are Today (CPU-Only)
Before projecting GPU gains, here's where things stand right now. These are **real measurements** from the Deimos benchmarking suite, collected on actual mobile hardware using [MoPro](https://github.com/zkmopro/mopro) as the prover interface. All times are in seconds.
### Test Devices
| Device | Chipset | Platform | Tier |
| ------------ | -------------------- | -------- | ----------- |
| iPhone 13 | Apple A15 Bionic | iOS | Flagship |
| RMX3853 | MediaTek Dimensity | Android | Mid-range |
| AC2001 | MediaTek Helio | Android | Mid-range |
| A059 | MediaTek Helio | Android | Budget |
| SM-M315F | Samsung Exynos 9611 | Android | Budget |
### Circom (Groth16) — Proving Times
| Algorithm | iPhone 13 | A059 | AC2001 | RMX3710 | SM-M315F |
| ----------- | --------- | ------ | ------ | ------- | -------- |
| Poseidon | **0.05s** | 0.08s | 0.36s | — | 0.88s |
| Pedersen | **0.07s** | 0.53s | — | — | 0.46s |
| MiMC-256 | — | 0.19s | — | 0.52s | 1.92s |
| Blake3 | 0.70s | 1.23s | — | — | — |
| SHA-256 | 0.76s | 1.91s | 0.86s | 1.56s | — |
| Blake2s-256 | — | — | 1.78s | 2.90s | — |
| Keccak-256 | — | 2.90s | — | — | — |
### Noir (UltraPlonk) — Proving Times
| Algorithm | RMX3853 | A059 | AC2001 | RMX3710 | SM-M315F | iPhone 13 |
| ------------ | ------- | ------ | ------ | ------- | -------- | --------- |
| Poseidon (2F)| 0.53s | 1.19s | 2.62s | — | — | — |
| Anemoi | 0.61s | — | — | — | — | — |
| Rescue-Prime | 0.96s   | —      | —      | —       | —        | —         |
| Blake3 | 1.02s | — | — | — | — | — |
| Pedersen | — | — | — | — | 4.66s | 0.54s |
| MiMC | — | — | 2.49s | 3.08s | — | — |
| SHA-256 | 2.55s | — | — | — | — | — |
| Blake2 | — | — | — | 4.82s | 3.73s | — |
### imp1 Framework — Proving Times
| Algorithm | A059 | AC2001 | SM-M315F | RMX3710 |
| ----------- | ------ | ------ | -------- | ------- |
| Poseidon | — | — | 1.66s | — |
| MiMC-256 | — | — | 1.43s | — |
| SHA-256 | 1.47s | — | — | — |
| Blake2s-256 | — | 1.88s | 2.80s | 2.06s |
| Keccak-256 | 2.60s | — | 9.87s | — |
### Cairo (Stone Prover) — Proving Times
| Algorithm | RMX3710 | AC2001 |
| --------- | ------- | ------ |
| SHA-256 | 9.23s | 7.74s |
### Cross-Framework Comparison: SHA-256
SHA-256 is the one hash we benchmarked across all four frameworks, making it a useful control variable. The differences are stark:
| Framework | Device | Proving Time | Relative Speed |
| --------- | ------- | ------------ | -------------- |
| Circom | AC2001 | 0.86s | **Fastest** |
| imp1 | A059 | 1.47s | 1.7× slower |
| Noir | RMX3853 | 2.55s | 3.0× slower |
| Cairo | AC2001 | 7.74s | **9.0× slower**|
*Note: Different devices across rows mean these aren't perfectly apples-to-apples, but the trend is clear — framework choice matters as much as hash choice.*
### Key Takeaways from Baseline Data
Four patterns jump out from the measured data:
1. **ZK-friendly hashes are dramatically faster.** Poseidon proves in **0.05s** on iPhone 13 vs. **0.76s** for SHA-256 — a 15× gap on the *same device and framework*. This validates the ZK-friendliness analysis above.
2. **Device variance is enormous.** Poseidon/Circom ranges from 0.05s (iPhone 13) to 0.88s (SM-M315F) — a 17× spread. Any GPU acceleration strategy must account for the fact that the "mobile" category spans flagship to budget hardware.
3. **Cairo/Stone is painfully slow.** SHA-256 via the Stone prover takes 7–9 seconds, making it the strongest candidate for the Stwo upgrade discussed in Implementation Approaches above.
4. **Noir circuits are consistently slower than Circom** for equivalent hash functions, likely due to UltraPlonk's higher per-constraint overhead compared to Groth16.
These baselines are the starting point for our GPU acceleration projections below.
```mermaid
xychart-beta
title "Proving Time by Hash Function (iPhone 13, Circom)"
x-axis ["Poseidon", "Pedersen", "Blake3", "SHA-256"]
y-axis "Time (seconds)" 0 --> 1
bar [0.05, 0.07, 0.70, 0.76]
```
---
## Projected Performance with GPU Acceleration
The following tables present our **projected estimates** for GPU-accelerated proving times, based on desktop GPU benchmarks scaled for mobile GPU capabilities. These are not measured results — they represent our best-case modeling given the current state of mobile GPU compute APIs.
> **⚠️ Caveat:** These projections assume sustained GPU throughput without thermal throttling, which is unrealistic for mobile workloads beyond roughly 10–30 seconds of continuous load. Real-world performance will be lower. See [Mobile Hardware Constraints](#the-elephant-in-the-room-mobile-hardware-constraints) below.
### Circom (Groth16) — CPU vs. ICICLE-adapted GPU
| Input Size | CPU (RapidSnark) | GPU (projected) | Speedup |
| ---------- | ---------------- | --------------- | ------- |
| 32 bytes | ~50ms | ~40ms | 1.25× |
| 128 bytes | ~100ms | ~60ms | 1.67× |
| 512 bytes | ~300ms | ~120ms | 2.5× |
| 1024 bytes | ~600ms | ~200ms | 3× |
### Noir (UltraPlonk) — CPU vs. Barretenberg GPU
| Input Size | CPU | GPU (projected) | Speedup |
| ---------- | ------- | --------------- | ------- |
| 32 bytes | ~80ms | ~60ms | 1.33× |
| 128 bytes | ~150ms | ~90ms | 1.67× |
| 512 bytes | ~400ms | ~200ms | 2× |
| 1024 bytes | ~800ms | ~350ms | 2.3× |
### Cairo (Stwo) — Stone CPU vs. Stwo GPU
| Input Size | CPU (Stone) | GPU (Stwo, projected) | Speedup |
| ---------- | ----------- | --------------------- | ------- |
| 32 bytes | ~500ms | ~50ms | 10× |
| 128 bytes | ~1s | ~80ms | 12.5× |
| 512 bytes | ~3s | ~150ms | 20× |
| 1024 bytes | ~5s | ~250ms | 20× |
*Note: The Cairo/Stwo speedup is primarily architectural (circle STARKs vs. classic STARKs), not purely GPU-driven.*
### Memory Usage Comparison
| Implementation | Peak RAM | VRAM | Total |
| ------------------- | -------- | ----- | ------- |
| RapidSnark (CPU) | 2–3 GB | 0 | 2–3 GB |
| ICICLE-adapted (GPU)| 1–1.5 GB | 1 GB | 2–2.5 GB|
| Barretenberg CPU | 3–4 GB | 0 | 3–4 GB |
| Barretenberg GPU | 2 GB | 1.5 GB| 3.5 GB |
| Stone CPU | 4–5 GB | 0 | 4–5 GB |
| Stwo GPU | 2 GB | 2 GB | 4 GB |
GPU implementations shift memory pressure from RAM to VRAM. On mobile — where RAM and VRAM are the same physical memory — this doesn't reduce total memory usage, but it can reduce *peak allocation pressure* through streaming and pipelining.
---
## GPU Acceleration Priority by Hash Function
Based on our analysis, here's how we'd prioritize GPU acceleration efforts by framework and hash function.
### Circom Circuits (Groth16)
| Algorithm | GPU Benefit | Priority | Rationale |
| ----------- | ----------- | -------- | -------------------------------------- |
| Keccak-256 | Very High | Critical | Largest circuit (~150K+ constraints), most to gain |
| SHA-256 | High | High | ~30K constraints, significant room for improvement |
| Blake2/Blake3| Medium | Medium | Moderate circuit size, already reasonably fast |
| Pedersen | Medium | Medium | Benefits from MSM acceleration specifically |
| Poseidon | Low | Low | ~250 constraints, already fast on CPU |
| MiMC-256 | Low | Low | Small circuits, CPU is sufficient |
### Noir Circuits (UltraPlonk)
| Algorithm | GPU Benefit | Priority | Rationale |
| ------------ | ----------- | -------- | -------------------------------------- |
| Keccak-256 | Very High | Critical | Biggest improvement potential |
| SHA-256      | High        | High     | Large circuit (30K+ constraints), substantial speedup potential |
| Blake2/Blake3| Medium | Medium | Moderate gains |
| Anemoi | Medium | Medium | Newer algebraic hash, moderate circuit size |
| Rescue-Prime | Medium | Medium | Moderate improvement |
| Poseidon | Low | Low | CPU is sufficient |
---
## The Elephant in the Room: Mobile Hardware Constraints
Everything above assumes the GPU is available and capable. On mobile, that assumption breaks down in four critical ways.
### 1. Thermal Throttling
Mobile devices are passively cooled. Sustained GPU compute generates heat that the device can't dissipate, triggering thermal throttling — clock speed reduction — after 10–30 seconds of continuous load. A proof that benchmarks at 200ms in isolation might take 400ms when generated immediately after a previous proof, because the GPU hasn't cooled down.
**Implication:** GPU acceleration works best for *burst* proving (single proofs with recovery time), not *sustained* proving (batch operations).
### 2. Shared Memory Architecture
Unlike desktop systems where the GPU has dedicated VRAM, mobile GPUs share RAM with the CPU (unified memory architecture). This means:
- GPU memory bandwidth is substantially lower than desktop (50–100 GB/s vs. 500+ GB/s)
- NTT — which is memory-bandwidth-bound — sees reduced GPU benefit
- Large circuits may cause memory pressure that degrades *both* CPU and GPU performance
### 3. No Native 64-bit Arithmetic
ZK proving operates on large finite field elements — typically 254-bit or 256-bit integers. Desktop GPUs (CUDA) can accelerate this using native 64-bit integer multiply instructions. Most mobile GPUs lack 64-bit integer support entirely, requiring field arithmetic to be **emulated using 32-bit operations** — 4× the instruction count, at minimum.
This is perhaps the most underappreciated constraint. It doesn't eliminate the GPU advantage, but it significantly reduces the parallelism benefit.
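To see where the 4× instruction count comes from, here is the limb decomposition in Python (arbitrary-precision ints standing in for hardware registers): one 64×64 multiply becomes four 32×32 partial products plus shifts and adds. A 256-bit field element needs 8 such limbs, so a full field multiply expands further still:

```python
# Emulating a 64x64 -> 128-bit multiply with 32-bit partial products, the
# way a GPU without native 64-bit integer multiply must do it.
MASK32 = (1 << 32) - 1

def mul64_via_32(a: int, b: int) -> int:
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    # four 32x32 partial products replace one 64x64 multiply
    lo_lo = a_lo * b_lo
    lo_hi = a_lo * b_hi
    hi_lo = a_hi * b_lo
    hi_hi = a_hi * b_hi
    # recombine with shifts; carries are free here because Python ints
    # are unbounded, but on hardware carry propagation adds more ops
    return lo_lo + ((lo_hi + hi_lo) << 32) + (hi_hi << 64)

x, y = 0xDEADBEEFCAFEBABE, 0x0123456789ABCDEF
assert mul64_via_32(x, y) == x * y
```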
### 4. Immature Compute Shader Ecosystem
CUDA is the gold standard for GPU compute — mature, well-documented, with extensive library support. Mobile offers:
- **Metal Compute Shaders** (iOS): Capable but Apple-only, with limited third-party tooling for cryptographic workloads.
- **Vulkan Compute Shaders** (Android): Available on most modern devices, but driver quality varies dramatically across manufacturers (Qualcomm, Mali, PowerVR all behave differently).
- **WebGPU**: Emerging standard, but browser support is incomplete and performance overhead from the abstraction layer is non-trivial.
Building production-quality ZK proving on these APIs requires significantly more engineering effort than using CUDA on desktop.
---
## Architectural Recommendations
Based on our analysis, here's a practical roadmap for integrating GPU acceleration into a mobile ZK proving pipeline.
```mermaid
flowchart LR
subgraph phase1["Phase 1: Quick Wins"]
P1A["Switch to ZK-friendly<br>hashes where possible"]
P1B["Upgrade Cairo<br>from Stone to Stwo"]
end
subgraph phase2["Phase 2: Selective GPU"]
P2A["Add GPU-accelerated<br>MSM for large circuits"]
P2B["Implement hybrid<br>CPU/GPU routing"]
end
subgraph phase3["Phase 3: Full Pipeline"]
P3A["GPU-accelerated<br>NTT"]
P3B["Thermal-aware<br>scheduling"]
end
phase1 --> phase2 --> phase3
style phase1 fill:#0d4b3c,stroke:#10b981,color:#fff
style phase2 fill:#1e3a5f,stroke:#3b82f6,color:#fff
style phase3 fill:#4a1942,stroke:#a855f7,color:#fff
```
### Start with the hybrid approach
Don't GPU-accelerate everything. Profile your circuits, identify which ones are large enough to benefit, and route accordingly. Our benchmarks show that ZK-friendly hashes (Poseidon at 0.05s, Pedersen at 0.07s on iPhone 13) are already fast enough on CPU — GPU acceleration won't meaningfully improve a 50ms proof. Focus GPU resources on the circuits that need it: SHA-256, Keccak-256, Blake2/3.
The decision boundary is roughly:
```
if circuit.constraint_count > 10_000:
use_gpu_prover()
else:
use_cpu_prover()
```
### Choose ZK-friendly hashes where possible
The single biggest performance improvement doesn't come from GPU acceleration — it comes from choosing a hash function with fewer constraints. Switching from SHA-256 to Poseidon is a ~100× reduction in circuit size. No GPU can match that.
### Target MSM first
If you're building GPU acceleration from scratch, start with MSM. It's the most parallelizable operation, offers the highest speedup, and has the most reference implementations to learn from. NTT can follow — but its memory-bandwidth sensitivity makes it harder to optimize on mobile.
### Plan for thermal budgets
Design your proving pipeline with thermal constraints in mind. If your app needs to generate multiple proofs in sequence, insert cooling delays or degrade gracefully to CPU-only proving when temperature thresholds are hit.
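One hedged sketch of such a fallback; `read_soc_temperature_c` and the 40 °C threshold are hypothetical placeholders for a platform-specific thermal hook (Android surfaces thermal status via `PowerManager`, iOS via `ProcessInfo.thermalState`):

```python
# Thermal-aware backend selection. The temperature source and threshold are
# hypothetical; production code would use the platform thermal APIs.
THROTTLE_AT_C = 40.0

def choose_backend(read_soc_temperature_c) -> str:
    """Fall back to CPU proving once the SoC is too hot to sustain GPU load."""
    return "cpu" if read_soc_temperature_c() >= THROTTLE_AT_C else "gpu"

assert choose_backend(lambda: 45.0) == "cpu"   # hot: degrade gracefully
assert choose_backend(lambda: 30.0) == "gpu"   # cool: take the fast path
```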
### Watch for Stwo and WebGPU maturity
The Stwo prover and WebGPU standard are both evolving rapidly. Within 12–18 months, the mobile GPU compute landscape may look substantially different. Build abstractions that allow you to swap backends without rewriting your proving pipeline.
---
## Conclusion
GPU acceleration for client-side ZK proving is promising but not a silver bullet — especially on mobile. The theoretical speedups are real: 2–3× for Groth16, 1.5–2× for UltraPlonk, and potentially 10–30× when combining architectural improvements (Stwo) with GPU compute.
But mobile hardware imposes hard constraints that desktop benchmarks don't capture: thermal throttling, shared memory, missing 64-bit integer support, and an immature compute shader ecosystem. The practical gains on a phone are likely 30–50% lower than desktop projections.
The most impactful near-term strategy isn't aggressive GPU optimization — it's choosing ZK-friendly hash functions, applying GPU acceleration selectively to large circuits, and designing for the thermal realities of sustained mobile workloads. The tooling will improve. The physics of mobile thermals and memory bandwidth won't.
---
*This research was conducted as part of the [Deimos](https://github.com/Blocsoc-iitr/Deimos) project at BlocSoc, IIT Roorkee. Deimos is an open-source mobile zkVM benchmarking tool that evaluates client-side proving performance on real Android and iOS hardware.*