# GPU Acceleration for Client-Side ZK Proving on Mobile Devices

*Can mobile GPUs make zero-knowledge proofs fast enough for real-time use? We investigated — and the answer is more nuanced than you'd expect.*

---

## Introduction

Zero-knowledge proofs (ZKPs) are one of the most promising tools for bringing privacy to Ethereum. But there's a catch: proving is computationally expensive, and if we want privacy to be *user-controlled* — generated on the user's own device rather than delegated to a centralized server — we need client-side proving to be fast.

Today, it isn't. On mobile hardware, generating even a simple ZK proof can take several seconds. For privacy-preserving transactions, identity verification, or anonymous voting to feel native, proving times need to drop below 200ms.

GPU acceleration is one path to get there. GPUs offer massive parallelism — thousands of small cores executing operations simultaneously — which maps well onto certain ZK proving bottlenecks. But mobile GPUs are not desktop GPUs, and the gap between theoretical speedups and practical gains is significant.

This article presents our findings from working on [Deimos](https://github.com/AnonDevTeam/Deimos), a mobile zkVM benchmarking tool built by BlocSoc IIT Roorkee. We benchmarked hash functions (Blake2, Blake3, MiMC-256, Pedersen, Poseidon, Rescue-Prime, SHA-256) across proving frameworks (Circom, Noir, Cairo) on real Android and iOS hardware — and then investigated how GPU acceleration could improve these results.

What follows is a practical analysis grounded in real measurements: we'll start with our CPU-only baseline benchmarks from five different mobile devices, then examine where GPU acceleration helps, where it doesn't, and what it actually takes to ship it on a phone.

---

## The Mobile Proving Problem

Before diving into GPU acceleration, it helps to understand *why* client-side ZK proving is slow on mobile devices.
In a ZKP system, the prover must demonstrate knowledge of some secret (a "witness") by evaluating an arithmetic circuit — a system of equations called **constraints**. A simple hash like Poseidon might compile to ~250 constraints. SHA-256 compiles to over 30,000. The prover must satisfy *every* constraint, and typical real-world circuits involve hundreds of thousands to millions of them.

This is the kind of workload CPUs aren't optimized for: large-scale, repetitive arithmetic over finite fields. Mobile CPUs make it worse — they're power-constrained, thermally limited, and running at lower clock speeds than their desktop counterparts.

The natural question: can we offload this work to the GPU?

---

## Where GPU Acceleration Fits in ZK Proving

Not all parts of ZK proving benefit equally from GPU parallelism. The three main computational bottlenecks — and how they map to GPU workloads — are:

### 1. Multi-Scalar Multiplication (MSM)

MSM involves computing a weighted sum of elliptic curve points:

```
Result = s₁·G₁ + s₂·G₂ + ... + sₙ·Gₙ
```

This is the dominant cost in SNARKs (Groth16, PLONK). Each scalar-point multiplication is independent, making MSM **highly parallelizable** and an ideal GPU target. On desktop GPUs, MSM acceleration alone has delivered 5–10× speedups in tools like [ICICLE](https://github.com/ingonyama-zk/icicle).

### 2. Number Theoretic Transform (NTT)

NTT is the finite-field equivalent of the Fast Fourier Transform (FFT). It's used heavily in polynomial commitment schemes (KZG, FRI) to convert between coefficient and evaluation representations.

NTT has a butterfly structure similar to FFT — parallelizable, but **memory-bandwidth-bound**. GPUs can accelerate NTT, but the speedup is limited by how fast data moves between memory and compute cores. This is especially relevant on mobile, where memory bandwidth is shared between CPU and GPU.

### 3. Witness Generation

Before proving begins, the prover must compute the witness — the concrete values that satisfy the circuit. For hash-based circuits, this means actually computing the hash.

Witness generation is the *least* parallelizable step. It's typically sequential and depends on the specific hash function. For GPU-unfriendly hashes like SHA-256 or Keccak, witness generation can be a significant fraction of total proving time. GPU acceleration offers **modest** gains here, mostly for batch witness computation.

### Summary

| Bottleneck         | GPU Suitability | Key Constraint                         |
| ------------------ | --------------- | -------------------------------------- |
| MSM                | Excellent       | Compute-bound, embarrassingly parallel |
| NTT                | Good            | Memory-bandwidth-bound                 |
| Witness Generation | Limited         | Mostly sequential                      |

---

## ZK-Friendliness: Why the Hash Function Matters

The choice of hash function dramatically affects both the circuit size and the GPU acceleration potential.

**ZK-friendly hashes** (Poseidon, MiMC-256, Rescue-Prime, Pedersen) were designed to minimize constraint count in arithmetic circuits. Poseidon, for example, uses ~250 constraints — meaning the proving overhead is dominated by MSM and NTT, both of which benefit from GPU parallelism.

**ZK-unfriendly hashes** (SHA-256, Keccak-256, Blake2/Blake3) were designed for CPU efficiency, not arithmetic circuit efficiency. SHA-256 requires 30,000+ constraints. Keccak-256 can exceed 150,000. The resulting circuits are enormous, and while GPU-accelerated MSM and NTT help in absolute terms, the circuit is so large that proving times remain impractical for real-time mobile use.

This creates a counterintuitive dynamic:

> **GPU acceleration provides the *largest absolute* speedup for unfriendly hashes (because the circuits are bigger), but the *most practical impact* for friendly hashes (because the resulting times actually hit usable thresholds).**

If your application can choose its hash function — use Poseidon or Pedersen.
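To make the MSM bottleneck concrete, here is a toy sketch. Integers modulo a small prime stand in for elliptic-curve points (a real prover performs group operations over a 254-bit curve); the structural point is that every term of the sum is independent, so a GPU can compute all terms in parallel and only the final combine is sequential.

```python
# Toy MSM: Result = s1*G1 + s2*G2 + ... + sn*Gn, modulo p.
# Integers mod p stand in for curve points -- illustration only.
P = 2**31 - 1  # small prime for the sketch

def msm(scalars, points, p=P):
    # Every term s*g is independent of every other term:
    # this is what makes MSM an ideal GPU workload.
    terms = [(s * g) % p for s, g in zip(scalars, points)]
    # Only this final reduction requires combining results.
    acc = 0
    for t in terms:
        acc = (acc + t) % p
    return acc
```

Production GPU implementations don't sum terms naively like this; they typically use Pippenger-style bucketing to share work across scalars. But the per-term independence above is what the thousands of GPU cores exploit.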
For a friendly hash, GPU acceleration becomes the difference between "slow" and "real-time." If you're stuck with Keccak-256 (e.g., for Ethereum compatibility), GPU acceleration helps, but don't expect miracles on mobile.

---

## Implementation Approaches

We evaluated four strategies for bringing GPU acceleration to mobile ZK proving, each targeting a different framework. The diagram below shows how a hybrid CPU/GPU proving pipeline would work compared to the current CPU-only approach.

```mermaid
flowchart TB
    subgraph current["Current: CPU-Only Pipeline"]
        direction LR
        A1["Circuit<br>Compilation"] --> A2["Witness<br>Generation<br>(CPU)"]
        A2 --> A3["MSM<br>(CPU)"]
        A3 --> A4["NTT<br>(CPU)"]
        A4 --> A5["Proof<br>Output"]
    end
    subgraph proposed["Proposed: Hybrid CPU/GPU Pipeline"]
        direction LR
        B1["Circuit<br>Compilation"] --> B2["Witness<br>Generation<br>(CPU)"]
        B2 --> B3{"Circuit Size<br>> 10K?"}
        B3 -->|Yes| B4["MSM + NTT<br>(GPU via Metal/Vulkan)"]
        B3 -->|No| B5["MSM + NTT<br>(CPU)"]
        B4 --> B6["Proof<br>Output"]
        B5 --> B6
    end
    current ~~~ proposed
    style current fill:#1a1a2e,stroke:#e94560,color:#fff
    style proposed fill:#1a1a2e,stroke:#0f3460,color:#fff
```

### Approach 1: ICICLE for Circom (Groth16)

[ICICLE](https://github.com/ingonyama-zk/icicle) is Ingonyama's GPU-accelerated cryptographic library. It provides CUDA-based implementations of MSM and NTT that can replace the CPU-bound paths in provers like RapidSnark.

**Mobile adaptation:** ICICLE is built for CUDA (NVIDIA desktop GPUs). Porting to mobile requires a backend abstraction layer — Metal compute shaders on iOS, Vulkan compute shaders on Android. This is non-trivial: you're essentially reimplementing the core kernels for two different GPU APIs.

**Expected improvement:** 2–3× faster proving for large circuits.

### Approach 2: Barretenberg GPU for Noir (UltraPlonk)

Aztec's [Barretenberg](https://github.com/AztecProtocol/aztec-packages/tree/next/barretenberg) backend already has experimental GPU support for its UltraPlonk prover. Enabling GPU acceleration here is more about compilation flags and backend configuration than ground-up reimplementation.

**Mobile adaptation:** Barretenberg's GPU path targets CUDA/OpenCL. A Metal/Vulkan port is needed, but the abstraction is cleaner than ICICLE's tight CUDA coupling.

**Expected improvement:** 1.5–2× faster proving.

### Approach 3: Stwo for Cairo (STARKs)

StarkWare's [Stwo](https://github.com/starkware-libs/stwo) prover is a next-generation STARK prover that replaces the older Stone prover. Stwo is architecturally different — it uses circle STARKs and is fundamentally more efficient, independent of GPU acceleration.

**Important distinction:** The 10–30× speedup attributed to Stwo is primarily an *algorithmic* improvement over Stone, not a GPU-specific gain. Adding GPU acceleration (via WebGPU or Metal) on top of Stwo's already-faster architecture amplifies the benefit further, but conflating "new prover" with "GPU acceleration" would be misleading.

**Expected improvement:** 10–30× faster proving (algorithm + GPU combined).

### Approach 4: Hybrid CPU–GPU Strategy

Not every proof benefits from GPU offloading. For small circuits (< 10,000 constraints), the overhead of transferring data to the GPU, dispatching compute shaders, and reading results back can *exceed* the time saved. A practical strategy:

- **Small circuits (< 10K constraints):** CPU-only. The transfer overhead erases GPU gains.
- **Medium circuits (10K–100K constraints):** GPU-accelerated MSM, CPU witness generation.
- **Large circuits (> 100K constraints):** Full GPU offload of MSM and NTT.

This hybrid approach requires runtime profiling and circuit-size heuristics, but it avoids the worst-case scenario of GPU acceleration actually *slowing down* small proofs.

---

## Baseline Benchmarks: Where We Are Today (CPU-Only)

Before projecting GPU gains, here's where things stand right now.
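Measurements like the ones below reduce to repeatedly timing a prover call. A minimal harness might look like this — `prove` is a hypothetical stand-in for whatever prover binding you invoke (MoPro's actual API differs):

```python
# Minimal proving-time harness (sketch). `prove` is a placeholder
# for an actual prover binding; its signature here is hypothetical.
import time

def bench(prove, inputs, warmup=1, runs=5):
    for _ in range(warmup):
        prove(inputs)                    # warm caches / lazy init
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        prove(inputs)
        times.append(time.perf_counter() - t0)
    times.sort()
    return times[len(times) // 2]        # median of the timed runs
```

Warmup runs and a median (rather than a mean) keep one-off scheduler or thermal spikes from skewing the reported number — which matters on passively cooled phones.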
The numbers below are **real measurements** from the Deimos benchmarking suite, collected on actual mobile hardware using [MoPro](https://github.com/zkmopro/mopro) as the prover interface. All times are in seconds.

### Test Devices

| Device    | Chipset             | Platform | Tier      |
| --------- | ------------------- | -------- | --------- |
| iPhone 13 | Apple A15 Bionic    | iOS      | Flagship  |
| RMX3853   | MediaTek Dimensity  | Android  | Mid-range |
| AC2001    | MediaTek Helio      | Android  | Mid-range |
| A059      | MediaTek Helio      | Android  | Budget    |
| SM-M315F  | Samsung Exynos 9611 | Android  | Budget    |

### Circom (Groth16) — Proving Times

| Algorithm   | iPhone 13 | A059  | AC2001 | RMX3710 | SM-M315F |
| ----------- | --------- | ----- | ------ | ------- | -------- |
| Poseidon    | **0.05s** | 0.08s | 0.36s  | —       | 0.88s    |
| Pedersen    | **0.07s** | 0.53s | —      | —       | 0.46s    |
| MiMC-256    | —         | 0.19s | —      | 0.52s   | 1.92s    |
| Blake3      | 0.70s     | 1.23s | —      | —       | —        |
| SHA-256     | 0.76s     | 1.91s | 0.86s  | 1.56s   | —        |
| Blake2s-256 | —         | —     | 1.78s  | 2.90s   | —        |
| Keccak-256  | —         | 2.90s | —      | —       | —        |

### Noir (UltraPlonk) — Proving Times

| Algorithm     | RMX3853 | A059  | AC2001 | RMX3710 | SM-M315F | iPhone 13 |
| ------------- | ------- | ----- | ------ | ------- | -------- | --------- |
| Poseidon (2F) | 0.53s   | 1.19s | 2.62s  | —       | —        | —         |
| Anemoi        | 0.61s   | —     | —      | —       | —        | —         |
| Rescue-Prime  | 0.96s   | —     | —      | —       | —        | —         |
| Blake3        | 1.02s   | —     | —      | —       | —        | —         |
| Pedersen      | —       | —     | —      | —       | 4.66s    | 0.54s     |
| MiMC          | —       | —     | 2.49s  | 3.08s   | —        | —         |
| SHA-256       | 2.55s   | —     | —      | —       | —        | —         |
| Blake2        | —       | —     | —      | 4.82s   | 3.73s    | —         |

### imp1 Framework — Proving Times

| Algorithm   | A059  | AC2001 | SM-M315F | RMX3710 |
| ----------- | ----- | ------ | -------- | ------- |
| Poseidon    | —     | —      | 1.66s    | —       |
| MiMC-256    | —     | —      | 1.43s    | —       |
| SHA-256     | 1.47s | —      | —        | —       |
| Blake2s-256 | —     | 1.88s  | 2.80s    | 2.06s   |
| Keccak-256  | 2.60s | —      | 9.87s    | —       |

### Cairo (Stone Prover) — Proving Times

| Algorithm | RMX3710 | AC2001 |
| --------- | ------- | ------ |
| SHA-256   | 9.23s   | 7.74s  |

### Cross-Framework Comparison: SHA-256

SHA-256 is the one hash we benchmarked across all four frameworks, making it a useful control variable. The differences are stark:

| Framework | Device  | Proving Time | Relative Speed  |
| --------- | ------- | ------------ | --------------- |
| Circom    | AC2001  | 0.86s        | **Fastest**     |
| imp1      | A059    | 1.47s        | 1.7× slower     |
| Noir      | RMX3853 | 2.55s        | 3.0× slower     |
| Cairo     | AC2001  | 7.74s        | **9.0× slower** |

*Note: Different devices across rows mean these aren't perfectly apples-to-apples, but the trend is clear — framework choice matters as much as hash choice.*

### Key Takeaways from Baseline Data

Four patterns jump out from the measured data:

1. **ZK-friendly hashes are dramatically faster.** Poseidon proves in **0.05s** on iPhone 13 vs. **0.76s** for SHA-256 — a 15× gap on the *same device and framework*. This validates the ZK-friendliness analysis above.
2. **Device variance is enormous.** Poseidon/Circom ranges from 0.05s (iPhone 13) to 0.88s (SM-M315F) — a 17× spread. Any GPU acceleration strategy must account for the fact that the "mobile" category spans flagship to budget hardware.
3. **Cairo/Stone is painfully slow.** SHA-256 via the Stone prover takes 7–9 seconds, making it the strongest candidate for the Stwo upgrade discussed in Implementation Approaches above.
4. **Noir circuits are consistently slower than Circom** for equivalent hash functions, likely due to UltraPlonk's higher per-constraint overhead compared to Groth16.

These baselines are the starting point for our GPU acceleration projections below.
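For the record, the multipliers quoted in the takeaways fall directly out of the measured times (values copied from the Circom table):

```python
# Ratios cited in the takeaways, recomputed from measured times.
poseidon_iphone = 0.05   # s, Poseidon, Circom/Groth16, iPhone 13
sha256_iphone   = 0.76   # s, SHA-256, same device and framework
poseidon_m315f  = 0.88   # s, Poseidon, Circom, SM-M315F

hash_gap   = sha256_iphone / poseidon_iphone   # ~15x, same device
device_gap = poseidon_m315f / poseidon_iphone  # ~17.6x, flagship vs. budget
```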
```mermaid
xychart-beta
    title "Proving Time by Hash Function (iPhone 13, Circom)"
    x-axis ["Poseidon", "Pedersen", "Blake3", "SHA-256"]
    y-axis "Time (seconds)" 0 --> 1
    bar [0.05, 0.07, 0.70, 0.76]
```

---

## Projected Performance with GPU Acceleration

The following tables present our **projected estimates** for GPU-accelerated proving times, based on desktop GPU benchmarks scaled for mobile GPU capabilities. These are not measured results — they represent our best-case modeling given the current state of mobile GPU compute APIs.

> **⚠️ Caveat:** These projections assume sustained GPU throughput without thermal throttling, which is unrealistic for mobile workloads beyond ~15 seconds. Real-world performance will be lower. See [Mobile Hardware Constraints](#the-elephant-in-the-room-mobile-hardware-constraints) below.

### Circom (Groth16) — CPU vs. ICICLE-adapted GPU

| Input Size | CPU (RapidSnark) | GPU (projected) | Speedup |
| ---------- | ---------------- | --------------- | ------- |
| 32 bytes   | ~50ms            | ~40ms           | 1.25×   |
| 128 bytes  | ~100ms           | ~60ms           | 1.67×   |
| 512 bytes  | ~300ms           | ~120ms          | 2.5×    |
| 1024 bytes | ~600ms           | ~200ms          | 3×      |

### Noir (UltraPlonk) — CPU vs. Barretenberg GPU

| Input Size | CPU    | GPU (projected) | Speedup |
| ---------- | ------ | --------------- | ------- |
| 32 bytes   | ~80ms  | ~60ms           | 1.33×   |
| 128 bytes  | ~150ms | ~90ms           | 1.67×   |
| 512 bytes  | ~400ms | ~200ms          | 2×      |
| 1024 bytes | ~800ms | ~350ms          | 2.3×    |

### Cairo (Stwo) — Stone CPU vs. Stwo GPU

| Input Size | CPU (Stone) | GPU (Stwo, projected) | Speedup |
| ---------- | ----------- | --------------------- | ------- |
| 32 bytes   | ~500ms      | ~50ms                 | 10×     |
| 128 bytes  | ~1s         | ~80ms                 | 12.5×   |
| 512 bytes  | ~3s         | ~150ms                | 20×     |
| 1024 bytes | ~5s         | ~250ms                | 20×     |

*Note: The Cairo/Stwo speedup is primarily architectural (circle STARKs vs. classic STARKs), not purely GPU-driven.*

### Memory Usage Comparison

| Implementation       | Peak RAM | VRAM   | Total    |
| -------------------- | -------- | ------ | -------- |
| RapidSnark (CPU)     | 2–3 GB   | 0      | 2–3 GB   |
| ICICLE-adapted (GPU) | 1–1.5 GB | 1 GB   | 2–2.5 GB |
| Barretenberg CPU     | 3–4 GB   | 0      | 3–4 GB   |
| Barretenberg GPU     | 2 GB     | 1.5 GB | 3.5 GB   |
| Stone CPU            | 4–5 GB   | 0      | 4–5 GB   |
| Stwo GPU             | 2 GB     | 2 GB   | 4 GB     |

GPU implementations shift memory pressure from RAM to VRAM. On mobile — where RAM and VRAM are the same physical memory — this doesn't reduce total memory usage, but it can reduce *peak allocation pressure* through streaming and pipelining.

---

## GPU Acceleration Priority by Hash Function

Based on our analysis, here's how we'd prioritize GPU acceleration efforts by framework and hash function.

### Circom Circuits (Groth16)

| Algorithm     | GPU Benefit | Priority | Rationale                                          |
| ------------- | ----------- | -------- | -------------------------------------------------- |
| Keccak-256    | Very High   | Critical | Largest circuit (~150K+ constraints), most to gain |
| SHA-256       | High        | High     | ~30K constraints, significant room for improvement |
| Blake2/Blake3 | Medium      | Medium   | Moderate circuit size, already reasonably fast     |
| Pedersen      | Medium      | Medium   | Benefits from MSM acceleration specifically        |
| Poseidon      | Low         | Low      | ~250 constraints, already fast on CPU              |
| MiMC-256      | Low         | Low      | Small circuits, CPU is sufficient                  |

### Noir Circuits (UltraPlonk)

| Algorithm     | GPU Benefit | Priority | Rationale                                   |
| ------------- | ----------- | -------- | ------------------------------------------- |
| Keccak-256    | Very High   | Critical | Biggest improvement potential               |
| SHA-256       | High        | High     | Significant constraint reduction benefit    |
| Blake2/Blake3 | Medium      | Medium   | Moderate gains                              |
| Anemoi        | Medium      | Medium   | Newer algebraic hash, moderate circuit size |
| Rescue-Prime  | Medium      | Medium   | Moderate improvement                        |
| Poseidon      | Low         | Low      | CPU is sufficient                           |

---

## The Elephant in the Room: Mobile Hardware Constraints

Everything above assumes the GPU is available and capable. On mobile, that assumption breaks down in four critical ways.

### 1. Thermal Throttling

Mobile devices are passively cooled. Sustained GPU compute generates heat that the device can't dissipate, triggering thermal throttling — clock speed reduction — after 10–30 seconds of continuous load. A proof that benchmarks at 200ms in isolation might take 400ms when generated immediately after a previous proof, because the GPU hasn't cooled down.

**Implication:** GPU acceleration works best for *burst* proving (single proofs with recovery time), not *sustained* proving (batch operations).

### 2. Shared Memory Architecture

Unlike desktop systems where the GPU has dedicated VRAM, mobile GPUs share RAM with the CPU (unified memory architecture). This means:

- GPU memory bandwidth is substantially lower than desktop (50–100 GB/s vs. 500+ GB/s)
- NTT — which is memory-bandwidth-bound — sees reduced GPU benefit
- Large circuits may cause memory pressure that degrades *both* CPU and GPU performance

### 3. No Native 64-bit Arithmetic

ZK proving operates on large finite field elements — typically 254-bit or 256-bit integers. Desktop GPUs (CUDA) can accelerate this using native 64-bit integer multiply instructions. Most mobile GPUs lack 64-bit integer support entirely, requiring field arithmetic to be **emulated using 32-bit operations** — 4× the instruction count, at minimum.

This is perhaps the most underappreciated constraint. It doesn't eliminate the GPU advantage, but it significantly reduces the parallelism benefit.

### 4. Immature Compute Shader Ecosystem

CUDA is the gold standard for GPU compute — mature, well-documented, with extensive library support. Mobile offers:

- **Metal Compute Shaders** (iOS): Capable but Apple-only, with limited third-party tooling for cryptographic workloads.
- **Vulkan Compute Shaders** (Android): Available on most modern devices, but driver quality varies dramatically across manufacturers (Qualcomm, Mali, PowerVR all behave differently).
- **WebGPU**: Emerging standard, but browser support is incomplete and performance overhead from the abstraction layer is non-trivial.

Building production-quality ZK proving on these APIs requires significantly more engineering effort than using CUDA on desktop.

---

## Architectural Recommendations

Based on our analysis, here's a practical roadmap for integrating GPU acceleration into a mobile ZK proving pipeline.

```mermaid
flowchart LR
    subgraph phase1["Phase 1: Quick Wins"]
        P1A["Switch to ZK-friendly<br>hashes where possible"]
        P1B["Upgrade Cairo<br>from Stone to Stwo"]
    end
    subgraph phase2["Phase 2: Selective GPU"]
        P2A["Add GPU-accelerated<br>MSM for large circuits"]
        P2B["Implement hybrid<br>CPU/GPU routing"]
    end
    subgraph phase3["Phase 3: Full Pipeline"]
        P3A["GPU-accelerated<br>NTT"]
        P3B["Thermal-aware<br>scheduling"]
    end
    phase1 --> phase2 --> phase3
    style phase1 fill:#0d4b3c,stroke:#10b981,color:#fff
    style phase2 fill:#1e3a5f,stroke:#3b82f6,color:#fff
    style phase3 fill:#4a1942,stroke:#a855f7,color:#fff
```

### Start with the hybrid approach

Don't GPU-accelerate everything. Profile your circuits, identify which ones are large enough to benefit, and route accordingly. Our benchmarks show that ZK-friendly hashes (Poseidon at 0.05s, Pedersen at 0.07s on iPhone 13) are already fast enough on CPU — GPU acceleration won't meaningfully improve a 50ms proof. Focus GPU resources on the circuits that need it: SHA-256, Keccak-256, Blake2/3.

The decision boundary is roughly:

```
# Illustrative routing rule; use_gpu_prover / use_cpu_prover are
# placeholders for your actual prover entry points.
if circuit.constraint_count > 10_000:
    use_gpu_prover()   # MSM/NTT offload pays for the transfer overhead
else:
    use_cpu_prover()   # dispatch + copy costs would exceed the savings
```

### Choose ZK-friendly hashes where possible

The single biggest performance improvement doesn't come from GPU acceleration — it comes from choosing a hash function with fewer constraints.
Switching from SHA-256 to Poseidon is a ~100× reduction in circuit size. No GPU can match that.

### Target MSM first

If you're building GPU acceleration from scratch, start with MSM. It's the most parallelizable operation, offers the highest speedup, and has the most reference implementations to learn from. NTT can follow — but its memory-bandwidth sensitivity makes it harder to optimize on mobile.

### Plan for thermal budgets

Design your proving pipeline with thermal constraints in mind. If your app needs to generate multiple proofs in sequence, insert cooling delays or degrade gracefully to CPU-only proving when temperature thresholds are hit.

### Watch for Stwo and WebGPU maturity

The Stwo prover and WebGPU standard are both evolving rapidly. Within 12–18 months, the mobile GPU compute landscape may look substantially different. Build abstractions that allow you to swap backends without rewriting your proving pipeline.

---

## Conclusion

GPU acceleration for client-side ZK proving is promising but not a silver bullet — especially on mobile. The theoretical speedups are real: 2–3× for Groth16, 1.5–2× for UltraPlonk, and potentially 10–30× when combining architectural improvements (Stwo) with GPU compute.

But mobile hardware imposes hard constraints that desktop benchmarks don't capture: thermal throttling, shared memory, missing 64-bit integer support, and an immature compute shader ecosystem. The practical gains on a phone are likely 30–50% lower than desktop projections.

The most impactful near-term strategy isn't aggressive GPU optimization — it's choosing ZK-friendly hash functions, applying GPU acceleration selectively to large circuits, and designing for the thermal realities of sustained mobile workloads.

The tooling will improve. The physics of mobile thermals and memory bandwidth won't.

---

*This research was conducted as part of the [Deimos](https://github.com/Blocsoc-iitr/Deimos) project at BlocSoc, IIT Roorkee. Deimos is an open-source mobile zkVM benchmarking tool that evaluates client-side proving performance on real Android and iOS hardware.*
