# The Definitive CSP: Towards a Post-Quantum, On-Chain-Verifiable, Client-Side Proving System for ZK Applications

> ***Living research document**. This document records design constraints, evaluated alternatives, empirical findings, and the current implementation plan. It is expected to evolve as benchmarks, parameters, and implementation details are refined.*

## Abstract

We aim to design and implement a zero-knowledge proving system that is simultaneously: (i) **client-side feasible** (targeting mobile constraints on prover time, RAM and bandwidth), (ii) **post-quantum sound** (hash/code-based or lattice-based), (iii) **transparent** (eliminating setup security assumptions and distribution costs) and (iv) **directly verifiable on-chain** on an EVM L1 without relying on a SNARK wrapper. Current on-chain ZKP verification is dominated by Groth16 and PLONK-ish systems, but these rely on elliptic-curve pairings and are not post-quantum sound. We propose a research and engineering program that starts from **existing** components - specifically, WHIR-based constructions - and iteratively drives down proof size and verification gas cost while preserving client-side feasibility and developer usability.

---

## 1. Problem Statement And Success Criteria

### 1.1 Goal

Provide an answer to: **"What is the best client-side proving ZKP system that is PQ-sound and verifiable on-chain?"**

"Best" is defined by a three-way objective:

- **Client-side proving**: feasible on commodity phones (~4GB RAM target[^1]) under realistic latency budgets (comparable to or better than ProveKit in CSP benchmarks[^2] for reasons defined in [section 5.1](#51-Baseline-And-Performance-Expectations)).
- **Post-quantum soundness**: avoid pairing-based assumptions; prefer hash-/code-based soundness.
- **On-chain verification**: gas cost low enough for practical L1 settlement (see sections [2](#2-On-Chain-Verification-Constraints) and [7.4](#74-Verification-Cost-Targets)).

[^1]: https://hackmd.io/@clientsideproving/AvgMobileHardware
[^2]: https://ethproofs.org/csp-benchmarks

### 1.2 Concrete Success Metrics

We will track:

- **Prover time**, **peak RAM**, **proof size**, preprocessing sizes.
- **Verifier gas** (broken down into calldata + hashing + field arithmetic + control flow).
- **Security level** (targeting ≥128-bit where feasible; otherwise explicit roadmap to achieve it).

## 2. On-Chain Verification Constraints

### 2.1 Current Baseline: Groth16/PLONK-ish Systems

Groth16 verification is widely used because it is cheap and constant-size on-chain, but it is not post-quantum sound. PLONK-ish verifiers can be substantially more expensive in gas. For example, Base reported **~2,396,575 gas** for Noir (UltraHonk) verification in a passkey/ECDSA benchmark suite, vs ~347k gas for Groth16 implementations in that comparison.[^4]

[^4]: https://blog.base.dev/benchmarking-zkp-systems

### 2.2 On-Chain Verification of Hash-Based Proof Systems

STARK verification on Ethereum has historically been considered expensive; an Ethereum Research post (May 26, 2021) cites **~5M gas** for StarkNet proof verification at that time.[^5] This cost profile is acceptable for rollups, but not for client-side dApps where the user is expected to pay the gas cost of every transaction.

[^5]: https://ethresear.ch/t/checkpoints-for-faster-finality-in-starknet/9633

### 2.3 WHIR As A Promising Direction

Recent work explores making WHIR/FRI-style verification significantly cheaper. A WHIR Solidity verifier PoC reported **~1.9M gas** (although with parameters not necessarily at 128-bit security).[^3] This is already competitive with (and, in that comparison, better than) some PLONK-ish on-chain verifiers (e.g., the Noir/UltraHonk figure above).

[^3]: https://ethresear.ch/t/on-the-gas-efficiency-of-the-whir-polynomial-commitment-scheme/21301

## 3. Design Space and Scope Constraints

### 3.1 PQ Families

Two main PQ directions are commonly considered:

1. **Hash-/code-based proof systems** (often based on FRI/Reed–Solomon proximity testing), colloquially "STARKs", though not all of them are STARKs: e.g., Ligero and Binius use RS codes but are structurally distinct.
2. **Lattice-based approaches**, which can benefit from NTT-heavy arithmetic.

### 3.2 Lattice Verification And The NTT Precompile Line Of Work

ZKNOX reports practical results for on-chain lattice verification and motivates NTT acceleration via proposed precompiles.[^6] This is relevant because NTT is also a general acceleration primitive for ZK systems and STARKs in particular.[^7]

However, this direction does not currently translate into actionable design choices for our project. Existing STARK-style verifiers deployed on Ethereum are fundamentally Merkle-based: on-chain verification is dominated by Merkle path checking and hash computation and does not include NTTs. Leveraging an NTT precompile would therefore require a **substantial restructuring of the proof system itself**, rather than a localized verifier optimization. Moreover, practical use of such a precompile would require successful standardization and deployment of the precompile at the protocol level. At present, neither condition is guaranteed, and progress on one without the other does not yield a deployable system.

As a result, while NTT precompiles represent an important research direction, they do not currently provide a concrete basis for system design, benchmarking, or implementation within the scope of this work. Accordingly, we treat this line of work as **informative but out-of-scope** for the present study, and focus instead on constructions that can be realized using today’s EVM primitives.
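
To make the "Merkle-based" point concrete, the sketch below shows the shape of the check that dominates a hash-based verifier: recomputing a root from a leaf and its authentication path. It uses only hashing, with no NTT anywhere. This is a simplified illustration, not the verifier of any system cited here. The names `verify_merkle_path` and `merkle_root_and_path`, the binary-tree layout and the index convention are assumptions made for the example, and `hashlib.sha3_256` is only a stand-in for the EVM's `keccak256` (the two differ in padding, and Python's standard library does not ship Keccak-256).

```python
import hashlib


def h(data: bytes) -> bytes:
    # Stand-in hash: SHA3-256, NOT the EVM's keccak256 (different padding).
    return hashlib.sha3_256(data).digest()


def verify_merkle_path(root: bytes, leaf: bytes, index: int, path: list) -> bool:
    """Recompute the root from a leaf and its sibling hashes (bottom-up)."""
    node = h(leaf)
    for sibling in path:
        # The low bit of the index says whether the node is a right or left child.
        if index & 1:
            node = h(sibling + node)
        else:
            node = h(node + sibling)
        index >>= 1
    return index == 0 and node == root


def merkle_root_and_path(leaves: list, index: int):
    """Helper for the usage example; assumes a power-of-two number of leaves."""
    level = [h(x) for x in leaves]
    path = []
    while len(level) > 1:
        path.append(level[index ^ 1])  # sibling of the current node
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index >>= 1
    return level[0], path


leaves = [b"leaf-0", b"leaf-1", b"leaf-2", b"leaf-3"]
root, path = merkle_root_and_path(leaves, 2)
print(verify_merkle_path(root, b"leaf-2", 2, path))  # valid opening
print(verify_merkle_path(root, b"leaf-X", 2, path))  # tampered leaf
```

Each opened query costs one hash per tree level, which is why hashing and the calldata carrying the sibling hashes, rather than field transforms, drive the on-chain cost of this class of verifiers.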

[^6]: https://zknox.eth.limo/posts/2025/02/24/ETHEREUM_for_PQ_era_250224.html
[^7]: https://ethresear.ch/t/tasklist-for-post-quantum-eth/21296

### 3.3 Scope: WHIR-Based Constructions from Existing Implementations

We prioritize a path that is safe engineering-wise: **build only from existing components**, minimizing speculative cryptographic redesign. Concretely, WHIR has already been integrated in multiple projects (PCS usage):

- ProveKit (Spartan + WHIR): [https://github.com/worldfnd/ProveKit](https://github.com/worldfnd/ProveKit)
- Whirlaway (WHIR-based): [https://github.com/TomWambsgans/Whirlaway](https://github.com/TomWambsgans/Whirlaway)
- leanMultisig (continuation of Whirlaway work): [https://github.com/leanEthereum/leanMultisig](https://github.com/leanEthereum/leanMultisig)
- Ceno zkVM: [https://github.com/scroll-tech/ceno/](https://github.com/scroll-tech/ceno/)
- HyperPlonk + WHIR (p3-playground): [https://github.com/han0110/p3-playground](https://github.com/han0110/p3-playground)

This ecosystem gives us multiple candidates and implementation reference points, even though none of them currently implements an EVM verifier.

## 4. Small-Field WHIR-Based Constructions

### 4.1 Small Fields Vs Big Fields

A repeated pattern in recent "STARK-ish/IOP-ish" engineering is a move toward **small fields** (e.g., KoalaBear/BabyBear/Goldilocks) to reduce prover overhead and improve hardware efficiency.

We adopt the working hypothesis that moving WHIR verification from BN254 field elements (ProveKit, sol-whir) to ~31-bit field elements should reduce calldata costs substantially, because calldata is byte-proportional and field elements shrink.

### 4.2 Verifier Cost Anatomy

In the WHIR Solidity verifier analysis, calldata is a major contributor to total gas, alongside hashing (Merkle) and modular arithmetic.[^3] Thus, field element encoding directly affects the resulting gas cost.

## 5. Benchmarking Plan

### 5.1 Baseline And Performance Expectations

ProveKit (Spartan + WHIR) is adopted as the **baseline system** for client-side proving performance throughout this work. This choice is motivated by the fact that ProveKit:

- already integrates WHIR in a production-oriented prover stack,
- supports zero-knowledge,
- targets developers,
- and represents a realistic upper bound on engineering maturity among WHIR-based systems today.

We aim to match or outperform ProveKit’s prover performance by choosing a system with a different field and a different ZKP protocol, while removing constraints that make ProveKit unsuitable for direct post-quantum on-chain verification (e.g., reliance on BN254 and proof wrapping).

The central hypothesis is that moving from a large prime field (BN254) to small fields (e.g., KoalaBear) provides:

- lower arithmetic cost per operation,
- better cache locality and SIMD/vectorization friendliness[^8],
- reduced witness and transcript size.

Together, these should allow a WHIR-based system to achieve ProveKit-comparable or better prover performance.

Accordingly, in all benchmarks:

- ProveKit serves as the performance floor, and
- any candidate system failing to match ProveKit’s prover time or memory footprint is considered non-competitive, regardless of verifier advantages.

[^8]: https://github.com/tcoratger/whir-p3/pull/383

### 5.2 Preliminary Results

We compare four WHIR-based candidates with the explicit aim of choosing _one_ base system:

- **ProveKit** (Spartan + WHIR; big-field leaning; includes ZK features but also design choices aimed at SNARK wrapping)
- **HyperPlonk + WHIR** (p3-playground)
- **Whirlaway**
- **Ceno** (GKR + WHIR)

The benchmark plan is:

1. Update the WHIR dependency (https://github.com/tcoratger/whir-p3/) and the corresponding ZKP system code for all Plonky3-based benchmarked systems, and ensure consistent primitives (see [section 5.4](#54-Cryptographic-Primitives-For-EVM-Compatibility)).
2. Re-run benchmarks across candidates with unified parameters and measurement scripts.
3. Choose the base system using a **weighted objective** (proof size first, then verifier time/gas proxies, then prover RAM/time).

#### 5.2.1 Feb 5th, 2026 Update

The benchmark code is in https://github.com/alxkzmn/csp-benchmarks/tree/the-definitive-csp

The latest benchmarks were run on an M4 Pro laptop, using equivalent Keccak-256 circuits that perform full hashing for variable input size. The Keccak-256 circuit was chosen for the following reasons:

- A readily available [Keccak-256 AIR in Plonky3 library](https://github.com/succinctlabs/plonky3/tree/main/keccak-air) significantly reduces the engineering effort;
- The circuit is complex enough to judge the systems' capabilities (i.e. not as simple as a Poseidon AIR).

**Ceno** was discarded early during the test runs for the following reasons:

- No full WHIR support yet (only BaseFold)
- ~6s proving time and a 5+ GB RAM footprint for the smallest SHA-256 input size in the CSP benchmarking suite (128B). Such a RAM footprint is beyond our feasibility cutoff for client-side proving. Such a long proving time for such a small circuit ([compared to other systems](https://ethproofs.org/csp-benchmarks)) also hints that the system is not optimized for client-side devices, which may be out of scope for this zkVM.

The benchmarks were therefore run for **HyperPlonk-WHIR** and **Whirlaway** against the baseline of **ProveKit**.
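
The selection logic used so far, a hard feasibility cutoff on peak RAM followed by the priority order from the benchmark plan (proof size first, then verifier cost proxies, then prover RAM and time), can be sketched as a filter plus a lexicographic sort. Reading the "weighted objective" as a strict lexicographic order is our simplifying assumption. The `Candidate` type, its field names and every number in the usage example are hypothetical placeholders, not benchmark results.

```python
from dataclasses import dataclass

# ~4GB client-side RAM target from section 1.1, expressed in MB.
RAM_CUTOFF_MB = 4096


@dataclass(frozen=True)
class Candidate:
    name: str
    proof_size_kb: float  # primary criterion (calldata proxy)
    verify_ms: float      # verifier time as a gas proxy
    peak_ram_mb: float    # prover peak memory
    prove_ms: float       # prover time


def rank(candidates, ram_cutoff_mb=RAM_CUTOFF_MB):
    """Drop infeasible systems, then order the rest by priority."""
    feasible = [c for c in candidates if c.peak_ram_mb <= ram_cutoff_mb]
    return sorted(
        feasible,
        key=lambda c: (c.proof_size_kb, c.verify_ms, c.peak_ram_mb, c.prove_ms),
    )


# Hypothetical placeholder measurements, for illustration only.
systems = [
    Candidate("system-a", proof_size_kb=100, verify_ms=5, peak_ram_mb=500, prove_ms=900),
    Candidate("system-b", proof_size_kb=150, verify_ms=2, peak_ram_mb=300, prove_ms=400),
    Candidate("system-c", proof_size_kb=80, verify_ms=3, peak_ram_mb=6000, prove_ms=6000),
]
print([c.name for c in rank(systems)])
```

In this toy input, `system-c` has the smallest proof but is removed by the RAM cutoff before ranking, mirroring how a candidate can be excluded on feasibility grounds regardless of its other metrics.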

![collected_benchmarks_whir_keccak_proof_duration](https://hackmd.io/_uploads/rySqAg7vbg.png)
![collected_benchmarks_whir_keccak_proof_size](https://hackmd.io/_uploads/Sk4cAgQDZl.png)
![collected_benchmarks_whir_keccak_peak_memory](https://hackmd.io/_uploads/S1EcRxXvbe.png)
![collected_benchmarks_whir_keccak_preprocessing_size](https://hackmd.io/_uploads/rJr5CgXw-e.png)
![collected_benchmarks_whir_keccak_verify_duration](https://hackmd.io/_uploads/rkH90g7Pbg.png)

:::success
The results demonstrate that both proving systems outperform the baseline system.
:::

The systems' proving performance (proving time and RAM footprint) is well within the bounds of client-side proving. Let's turn our attention to the systems' potential for on-chain verifiability.

The results demonstrate that Whirlaway has the smallest proof sizes, whereas HyperPlonk has the lowest verification times. As the proof size is the dominating factor of gas cost via calldata size[^3], we need to assess whether Whirlaway could be the best candidate for the EVM verifier implementation. Whirlaway is already running at 128 bits of security, whereas HyperPlonk only has 100 bits of security.

*- TODO would updating HyperPlonk to 128 bits of security potentially increase the proof size (larger extension field)?*

*- TODO serialize the proofs into EVM-compatible format to directly compare with the 128kB calldata cutoff*

:::warning
Further research should focus on:
1) Matching the baseline security of 128 bits for HyperPlonk (merge the recent quintic extension for KoalaBear from Plonky3) and re-running the benchmarks to get the updated proof sizes;
2) Exploring ways to reduce the proof size for Whirlaway (priority) and HyperPlonk (optionally).
:::

### 5.3 Previous Results (Summer 2025)

Because the Plonky3-based WHIR implementation evolves quickly (e.g., ongoing improvements discussed publicly), older benchmark rounds become stale.[^9] You can find the results of the first benchmark round (Summer 2025) in [Appendix A](#Appendix-A-WHIR-Based-Proving-Systems-Research-Summer-2025).

[^9]: https://ethresear.ch/t/ntt-as-postquantum-and-starks-settlements-helper-precompile/21775

### 5.4 Cryptographic Primitives For EVM Compatibility

To make results reflect the eventual EVM verifier, we standardize to **Keccak** for:

- Transcript challenger (easy in Plonky3, abstracted by a trait)
- Merkle hashing (requires minor refactor)

This brings the verifier code closer to its target state of being fully EVM-compatible and optimized for cheap on-chain verification.

## 6. (Optional) RAM Footprint Reduction Techniques

If the winning candidate exceeds the client-side RAM target (~4GB), we may apply **streaming/time-space trade-offs for sumcheck** to reduce peak memory at some prover-time cost. The relevant line of work studies time-space trade-offs for sumcheck in streaming models.[^10]

[^10]: https://eprint.iacr.org/2024/1970, https://eprint.iacr.org/2024/524, https://eprint.iacr.org/2025/1473

## 7. On-Chain Verifier Design

### 7.1 Reference Implementations

We build from the existing Solidity verifier PoC and accompanying EVM-oriented prover modifications:

- Solidity verifier PoC: https://github.com/privacy-ethereum/sol-whir
- EVM-verifier-oriented prover branch: https://github.com/dmpierre/whir/tree/feat/evm-verifier
- Gas-efficiency writeup and breakdown[^3]

### 7.2 Expected Effects Of Small Field

We expect improvements mainly from:

- **Calldata shrinkage**: smaller field elements → fewer bytes.
- **Byte-wise Keccak input shrinkage**: modest hashing savings.
- **Arithmetic batching**: for 31-bit field elements, many additions/multiplications can be accumulated in 256-bit lanes before a single modular reduction step, potentially reducing `addmod`/`mulmod` usage.

### 7.3 Merkle Commitment Parameterization

We will incorporate "increase arity closer to the root" style optimizations where they help gas marginally, as already explored in prior analysis[^11]. The gain is expected to be incremental, not foundational.

[^11]: https://hackmd.io/@clientsideproving/whir-fri-verifier-opt

### 7.4 Verification Cost Targets

A pragmatic target is **< 1M gas** for verification on Ethereum mainnet (accepting that this will not beat Groth16’s ~300–400k class costs, but provides PQ soundness + transparency).

Comparison table:

| ZKP system | Verification gas cost |
|-|-|
| Groth16 | 348K[^4] |
| ***Target*** | ***<1M*** |
| ***WHIR (SotA)*** | ***1.9M***[^3] |
| Barretenberg | 2.4M[^4] |

## 8. Adding Zero-Knowledge (ZK) To A Non-ZK Base Protocol

The chosen base (e.g., HyperPlonk/Whirlaway/Ceno) may not be ZK by default. We plan to add ZK using established techniques for making sumcheck/IOP components zero-knowledge.[^12] We will also reuse implementation patterns from multilinear and WHIR-based systems that already ship ZK in practice (e.g., Spartan2 and ProveKit) to reduce engineering risk.

[^12]: https://eprint.iacr.org/2017/305

## 9. Frontend/Developer Experience Plan

### 9.1 Why "Not Noir"

Noir is widely adopted, but its target field and IR assumptions do not match our target stack. Our candidate systems imply:

- Different field (not BN254, but a small field like KoalaBear)
- Potentially different IR (not ACIR, but AIR with Plonky3 API)

This makes direct reuse of Noir libraries and tooling substantially harder, and may require re-implementing "precompile-like" gadgets (e.g., range checks), as observed in practice in ProveKit.

### 9.2 Cairo-Based Frontend Option

Cairo already targets small fields and AIR, matching the representation used by the Plonky3-based systems that we are considering. This reduces the potential amount of work on an adapter from Cairo to Plonky3 AIR.

## 10. Implementation Milestones

**M0 - Reproducible baseline**

- Pin versions, unify hash/transcript primitives (Keccak), and produce a reproducible benchmark harness.

**M1 - Updated benchmark round**

- Re-run ProveKit vs HyperPlonk+WHIR vs Whirlaway vs Ceno benchmark.
- Select winner using: proof size → verifier proxy cost → prover RAM/time.

**M2 - Solidity verifier for the chosen small-field WHIR system**

- Port the existing BN254 verifier to the chosen small field (or build from scratch while closely referencing the existing implementation).
- Implement all potential enhancements related to the small field (e.g., batch addmod/mulmod).
- Profile gas cost and apply further optimizations to reach the target range (ideally <1M).

**M3 - Add ZK**

- Apply zero-knowledge sumcheck/IOP techniques; validate no leakage and measure overhead.

**M5 - Frontend prototype**

- Cairo-to-(target AIR) compilation path for at least one canonical application (e.g., account abstraction gadget or private transfer toy app).

**M6 (Optional) - Low-memory proving option**

- Integrate streaming sumcheck if needed for the "4GB RAM" goal.

## 11. Open Questions

Miden, Binius64 and Nexus demonstrate fast verification times while retaining reasonable proof sizes[^2]. The authors of sol-whir cite this combination of properties as a motivation for creating the WHIR verifier[^3]. It is worth exploring whether these systems can be verified on-chain, starting by getting in touch with their developers.

---

## Appendix A: WHIR-Based Proving Systems Research (Summer 2025)

### TLDR:

- ProveKit is winning as a well-rounded client-side proving system with a good trade-off between proving time and RAM.
- Whirlaway is potentially usable for client-side proving thanks to its constant RAM footprint, although it loses to ProveKit in speed.
- HyperPlonk-WHIR excels at small circuits, but cannot compete with the other two at big circuits.

### Source Code

https://github.com/alxkzmn/whir-based

### Implementations Used

- https://github.com/worldfnd/ProveKit (Spartan+WHIR)
- https://github.com/han0110/p3-playground (HyperPlonk+WHIR)
- https://github.com/TomWambsgans/Whirlaway (A new multilinear PIOP + WHIR)

None of the implementations used is zero-knowledge.

### Challenges

- The proving systems under comparison use different prime fields, and the implementations are tightly coupled with the field. p3-playground and Whirlaway use the 31-bit KoalaBear field, and ProveKit uses BN254 (254-bit). Therefore, we have created two types of circuits to benchmark in ProveKit: structure-equivalent and input-equivalent. Structure-equivalent circuits perform equivalent operations, e.g., applying the same number of Poseidon permutations with the same Poseidon configuration to an input of the *same number of field elements* as the p3-based counterparts. Input-equivalent circuits operate on the *same byte-size input* as the p3-based counterparts.
- p3-playground and Whirlaway use the Plonky3 library for circuit building. Plonky3 uses AIR arithmetization, and only a limited set of AIRs is readily available (Poseidon2, Keccak).
This limited our choice of benchmarking targets;
- ProveKit doesn't re-implement all available precompiles of Noir, so we had to develop a Noir-native keccak circuit to compare against p3-playground and Whirlaway;
- Whirlaway does not support some of the Plonky3 AIR APIs, so we had to modify the existing keccak AIR to work with Whirlaway;
- The systems initially used slightly different WHIR parameters (e.g., Plonky3-based systems use 100 bits of security, citing unavailable extension fields to achieve 128 bits of security), so we had to adjust ProveKit's WHIR parameters to match the other two systems. However, this resulted in a negligible difference in performance.

### Findings

- ProveKit is winning as a well-rounded client-side proving system with a good trade-off between prover time and RAM.
- Whirlaway is potentially usable for client-side proving thanks to its constant RAM footprint, although it loses to ProveKit in speed.
- HyperPlonk-WHIR excels at small circuits, but cannot compete with the other two at big circuits.

You can find the benchmark results in the [spreadsheet](https://docs.google.com/spreadsheets/d/1WPbr_psCx7GuOUJ2Twa8M7APlqsUiY7DibQ1cdqxtOs/edit?usp=sharing). The benchmarks were performed on an M4 MacBook Air. All times are in milliseconds. "be" and "se" refer to the byte-equivalent (i.e., input-equivalent) and structure-equivalent circuits mentioned [above](#Challenges). "p3w" and "ow" refer to the WHIR parameters used in the Plonky3-based systems and in the original ProveKit, respectively. Poseidon circuits take inputs in the range of 2^6 .. 2^9 bytes, keccak circuits take inputs in the range of 2^5 .. 2^8 bytes.

All results in one plot per circuit:

![Screenshot 2025-09-03 at 21.47.54](https://hackmd.io/_uploads/SJDXvTS9ll.png)

ProveKit only:

![Screenshot 2025-09-02 at 19.04.03](https://hackmd.io/_uploads/HyNBkLNcee.png)

ProveKit times and RAM remain nearly constant across tested circuit sizes. The observed growth is negligible and, at worst, consistent with $O(\log n)$.
In contrast, for HyperPlonk, both proving time and RAM exhibit growth between $O(\log n)$ and $O(n \log n)$ relative to circuit size. For Whirlaway, proving time scales on the order of $O(n \log n)$, while RAM usage remains approximately constant with respect to circuit size.

#### Proving Time Breakdown

All times are in milliseconds.

##### Hyperplonk-WHIR

```mermaid
pie showData
    title HP-WHIR, Keccak circuit (2^5 input)
    "WHIR Commit" : 62.6
    "Prove PIOP" : 21.2
    "WHIR Open" : 168
```

```mermaid
pie showData
    title HP-WHIR, Keccak circuit (2^8 input)
    "WHIR Commit" : 500
    "Prove PIOP" : 150
    "WHIR Open" : 1510
```

##### Whirlaway

```mermaid
pie showData
    title Whirlaway, Poseidon circuit (2^6 input)
    "WHIR Commit" : 1.41
    "Zerocheck" : 3.93
    "Inner sumchecks" : 1.22
    "WHIR proof" : 10.1
```

```mermaid
pie showData
    title Whirlaway, Poseidon circuit (2^9 input)
    "WHIR Commit" : 1.97
    "Zerocheck" : 4.73
    "Inner sumchecks" : 1.85
    "WHIR proof" : 19.0
```

##### ProveKit

*TODO Re-do the proving time breakdown for Whirlaway on a bigger (keccak) circuit?*

```mermaid
pie title ProveKit, Keccak circuit (2^5 input)
    "Read input (71 ms)" : 71.3
    "Generate witness (0.3 ms)" : 0.312
    "Fill witness (1.7 ms)" : 1.67
    "Create IO pattern (0.5 ms)" : 0.512
    "Sumcheck (30.9 ms)" : 30.9
    "External row R1CS (16.5 ms)" : 16.5
    "WHIR PCS prover (691 ms)" : 691
    "Write output (11.1 ms)" : 11.1
```

```mermaid
pie title ProveKit, Keccak circuit (2^8 input)
    "Read input (125 ms)" : 125
    "Generate witness (1.9 ms)" : 1.89
    "Fill witness (2.4 ms)" : 2.35
    "Create IO pattern (0.6 ms)" : 0.55
    "Sumcheck (41.3 ms)" : 41.3
    "External row R1CS (21.1 ms)" : 21.1
    "WHIR PCS prover (774 ms)" : 774
    "Write output (10.3 ms)" : 10.3
```

As we can see, the times are dominated mainly by WHIR, and the share of WHIR stays fairly consistent regardless of the input size.

ProveKit timings include file I/O overhead (reading inputs and writing outputs). Based on the "Read input" and "Write output" slices above, this overhead is about 10% of the total for the 2^5 input and about 14% for the 2^8 input, so in an integrated application where data remains in memory, runtimes can be correspondingly lower.
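
The shares discussed above can be recomputed directly from the ProveKit breakdown. The following is a minimal sketch: the timings are copied from the two ProveKit pie charts, while the dictionary layout and the helper name `phase_share` are our own choices for illustration.

```python
# Phase timings in milliseconds, copied from the ProveKit pie charts above.
PROVEKIT_MS = {
    "keccak 2^5": {
        "Read input": 71.3, "Generate witness": 0.312, "Fill witness": 1.67,
        "Create IO pattern": 0.512, "Sumcheck": 30.9, "External row R1CS": 16.5,
        "WHIR PCS prover": 691, "Write output": 11.1,
    },
    "keccak 2^8": {
        "Read input": 125, "Generate witness": 1.89, "Fill witness": 2.35,
        "Create IO pattern": 0.55, "Sumcheck": 41.3, "External row R1CS": 21.1,
        "WHIR PCS prover": 774, "Write output": 10.3,
    },
}


def phase_share(timings, phases):
    """Fraction of the total runtime spent in the given phases."""
    total = sum(timings.values())
    return sum(timings[p] for p in phases) / total


for name, timings in PROVEKIT_MS.items():
    whir = phase_share(timings, ["WHIR PCS prover"])
    file_io = phase_share(timings, ["Read input", "Write output"])
    print(f"{name}: WHIR {whir:.1%}, file I/O {file_io:.1%}")
```

With these inputs the WHIR share comes out at roughly 84% and 79% of the total, and the file I/O share at roughly 10% and 14%, for the 2^5 and 2^8 inputs respectively.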