# Private collaborative zkVM

Zero-knowledge virtual machines (zkVMs) have democratized trusted compute, and prover networks have made them deployable at scale. Ironically, the one use case they still struggle to deliver is privacy: as [Vitalik noted](https://x.com/VitalikButerin/status/1838537946586320988), client-side proving remains impractical on many consumer devices. Progress is real—see [Valida](https://www.lita.foundation/blog/client-side-proving-and-verification-with-valida), [Jolt](https://a16zcrypto.com/posts/article/secure-efficient-zkvms-progress/), [Miden](https://0xmiden.github.io/miden-vm/intro/performance.html)—but this still only scratches the surface of what **private delegation** could unlock.

This document presents a private **collaborative** zkVM. Collaborative SNARKs (co-SNARKs) have a strong theoretical foundation ([OB21](https://eprint.iacr.org/2021/1530), [zkSaaS](https://eprint.iacr.org/2023/905), [DFS](https://eprint.iacr.org/2025/296)), and teams like [Taceo](https://taceo.io/) are already productizing them. To my knowledge, this is the first attempt to apply private delegation to a zkVM. If successful, it would make trust-oriented PETs much easier to build and use—and put the “zero-knowledge” back into “zkVM.”

> The ongoing implementation is available at [ChainSafe/co-zkvms](https://github.com/ChainSafe/co-zkvms).

## Background

### zk-SNARKs

A standard zk-SNARK syntax exposes three interfaces:

- $\text{Setup}() \to pp$
- $\text{Prove}(pp, x, w) \to \pi$
- $\text{Verify}(pp, x, \pi) \to \{0,1\}$

It's useful to split $\text{Prove}$ into three parts:

- $\text{WitnessExtend}(x, s, C) \to w$: given public inputs $x$ and secret inputs $s$, extend them into a full witness $w$ for circuit $C$.
- Witness-dependent (non-holographic) phase.
- Witness-independent (holographic) phase.

### Private proof delegation

**Problem:** zk-SNARK proving is resource- and time-intensive, limiting applicability and UX.

Private Proof Delegation (PPD) lets a resource-constrained client outsource the generation of a zero-knowledge proof to a more powerful server (or servers) while keeping their private data (witness) confidential. For an in-depth discussion of PPD and its approaches, see [these](https://hackmd.io/@namncc/H1DmbFBiye) [writeups](https://hackmd.io/@timofey/r1FuxwVsJg).

The **prove** algorithm becomes an MPC protocol over shares of $w$; **setup** and **verify** stay unchanged. MPC guarantees honest parties learn nothing beyond the public output.

$\text{WitnessExtend}(x,s,C)\to w$ can be (i) computed locally by the client—requiring knowledge of the full private input $s$; or (ii) outsourced and computed under MPC from shares of $s$; the subsequent proving then proceeds over secret shares.

### Delegated Collaborative zk-SNARKs

**Problem:** Standard zk-SNARKs assume a single prover who knows the entire witness; this fails when the witness is shared among multiple parties.

With Collaborative zk-SNARKs (co-SNARKs), multiple parties each hold a piece of the secret data (secret witness $w$). They then interact with each other using MPC to generate a single zk-SNARK. Unlike the classic co-SNARK setting, in delegated co-SNARKs the MPC nodes don't own the inputs; they merely provide their compute resources to let clients privately outsource collaborative proof generation. This is usually referred to as [**private shared state**](https://hackmd.io/@aztec-network/ByZ6tymhyl).
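To make the sharing step concrete, here is a minimal client-side sketch of splitting one input into 3-party replicated shares (the scheme described in the next section). It uses a toy `u64` modulus as a stand-in for the prime field and the `rand` crate for randomness; it is illustrative, not the co-zkvms API:

```rust
use rand::Rng;

/// Toy field: integers mod a prime. Real code uses a SNARK-friendly prime
/// field (e.g., an arkworks `Fr`); this stand-in keeps the sketch minimal.
const P: u64 = 0xffff_ffff_0000_0001;

fn add(a: u64, b: u64) -> u64 { ((a as u128 + b as u128) % P as u128) as u64 }
fn sub(a: u64, b: u64) -> u64 { add(a, P - b) }

/// Party i's replicated share: the pair (s_i, s_{i+1 mod 3}).
type Rep3Share = (u64, u64);

/// Client-side: split a secret (assumed < P) into three replicated shares,
/// one pair per MPC node.
fn share_rep3(secret: u64, rng: &mut impl Rng) -> [Rep3Share; 3] {
    let s0 = rng.gen_range(0..P);
    let s1 = rng.gen_range(0..P);
    let s2 = sub(sub(secret, s0), s1); // s0 + s1 + s2 = secret (mod P)
    [(s0, s1), (s1, s2), (s2, s0)]
}
```

Any single pair is uniformly distributed and reveals nothing about `secret`; any two of the three pairs reconstruct it (the honest-majority setting).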
As in single-client PPD, each client secret-shares their input; MPC nodes extend these shares to a full witness per the circuit/program and execute $\text{Prove}$ over shares. Setup and verification remain unchanged.

### MPC↔SNARK Co-design for Scalable Delegation

This is a short summary of the key techniques used in this work, which were introduced in the [**DFS** paper](https://eprint.iacr.org/2025/296) by _Hu et al._

**MPC (Replicated Secret Sharing).** 3-party RSS: a secret $s=s_0+s_1+s_2$, where party $P_i$ holds **two** field elements $(s_i,s_{i+1})$—hence “replicated.” Additions are local. Obtaining replicated product shares $[xy]$ needs one round (e.g., Beaver or resharing), but each $P_i$ can form **additive** product shares locally: set $c_i=x_i y_i + x_{i+1} y_i + x_i y_{i+1}$, then $\sum_i c_i=xy$; if the product is only used in further additions or opened, no cross-party communication is needed.

**Coordinator-centric model.** A lightweight coordinator (with links to all parties) performs the following tasks, which greatly reduce MPC overhead:

- Fiat–Shamir challenge derivation, avoiding costly multi-party randomness/consensus protocols and manipulation risks.
- Aggregation of constant-size node messages and final proof assembly; even for non-linear ops, parties compute local **additive** shares which the coordinator opens, so **no cross-party communication** is required. Because all messages flow through the coordinator, workers are unaware of each other, achieving *malicious security up to additive attacks* without extra cryptographic overhead.
- Offloading all heavy compute to workers enables near-linear scaling while the coordinator's work/communication stays **logarithmic** (<500 KB total traffic for a $2^{24}$-gate circuit; prior work ∼300 GiB). This is further amplified by MPC–SNARK co-design.

In private delegation, *the coordinator = delegator.* Since the delegator is inherently trusted (it knows the witness and seeks a correct proof), this does not weaken the threat model, but it may introduce liveness and UX concerns. Although the role cannot be naively outsourced, I will offer a workaround later in the document.

**SNARK co-design (sumcheck + holographic IOP; MSM).** Choosing linear, parallelizable SNARK techniques enables scalability by simply adding more workers, while the coordinator stays small:

- **Sumcheck.** At round $i$, the output is a univariate $p(t)=\sum_b f(r_1,\ldots,r_{i-1},t,b)$; by linearity, if $f=\sum_j f_j$ (chunking), then $p=\sum_j p_j$. Workers compute local $p_j$ (e.g., evaluations at $0/1$); the coordinator sums them, hashes the transcript to get $r_i$, broadcasts, and the dimension halves—perfect work-splitting with constant-size per-round communication.
- **Holography (Spartan-style).** Clean separation of witness-dependent vs. witness-independent work; the latter dominates (~80%) and runs outside MPC.
- **PCS (PST13/MSM-based).** MSMs partition and parallelize cleanly across workers; the coordinator aggregates constant-size openings, matching the distributed proving model.

### Jolt — Multilinear zkVM realizing lookup singularity

Among existing zkVMs, Jolt is uniquely suited for private delegation. The dominant uni-STARK/FRI-based designs rely on pervasive non-linear work (hashing, FFTs) that maps poorly to MPC, while Jolt uses multilinear SNARK techniques aligned with DFS's delegation-friendly co-design.

- **Multilinear representation.** Jolt encodes the witness (trace, instructions, memory) as multilinear polynomials $\tilde z$. Most prover work is linear and perfectly chunkable under RSS, with the coordinator only needing to aggregate constant-size per-round messages.
- **Sumcheck-based PIOP.** Jolt reduces all PIOP checks to multilinear sumchecks: instruction logic, bytecode, and RW memory via Lasso lookups (two grand-product arguments each, plus an extra sumcheck for instruction/memory outputs). An additional small “glue” R1CS is enforced via Spartan. These constraints are [uniform](https://hackmd.io/@srinathsetty/spartan#Uniform-constraint-systems), so expensive holographic work is avoided.
- **MSM-based PCS.** Jolt/Lasso use MSM-based multilinear PCS (e.g., HyperKZG, [Zeromorph](https://eprint.iacr.org/2023/917.pdf)), both analogous to PST13 with similar parallelization; opening cost is further amortized via a [reduction sumcheck](https://jolt.a16zcrypto.com/background/batched-openings.html).

Where MPC private delegation isn't a trivial fit:

- **Witness extension is heavy.** Building the extended witness (instruction outputs, memory timestamps, subtable wiring) uses many non-linear steps (indexing by a secret-shared value, A2B/B2A conversions, modular reductions), which require cross-party MPC communication; similarly, some instruction lookup-combination logic during the primary "collation" sumcheck is non-linear.
- **Commitments dominate prover cost.** With secret sharing, polynomial coefficients are padded with large correlated masks, so MSMs no longer benefit from “small-exponent” optimizations; this can be thought of as the cost of doing business with MPC.
- **Jolt is not natively zero-knowledge.** The paper [explicitly suggests](https://eprint.iacr.org/2023/1217.pdf#page=6) composing the final proof into a zkSNARK to achieve ZK. Alternatively, [adding random masking](https://jolt.a16zcrypto.com/future/zk.html) to sumchecks and using a zero-knowledge PCS (e.g., PST13 or Zeromorph) should work as well.

This work leverages Jolt's delegation-friendly structure and addresses these shortcomings with various cryptographic and performance-engineering techniques.

## Technical overview

The Jolt prover consists of multiple subprotocols:

1. Bytecode
2. Instruction lookups
3. Read/Write memory
4. Uniform Spartan ("glue" R1CS)

### Witness extension

The first step is witness extension: given the execution trace (a vector of steps, each either an ELF instruction, bytecode row, or memory operation) and memory IO, the prover generates multilinear polynomials for each subprotocol.

First, I observe that not all polynomials have to be secret-shared:

#### Bytecode

Only the immediate (`v_imm`) carries secret input; addresses/flags/register indices are structural and can remain public.

| Shared | Public |
| --- | --- |
| `v_read_write[6]` immediate | `a_read_write` address; `v_read_write[..5]` address; bitflags; `rd`, `rs1`, `rs2` register indices |

#### Instruction lookups

Flags only select which lookup to apply. `dim` is deterministic based on operand values; `read_cts`, `final_cts`, and `E_polys` are indexed by `dim`. Because subtables are public, opening these would let one reconstruct hidden operands, *although to a different extent*.
| Shared | Public |
| --- | --- |
| `dim` (subtable indices); `read_cts`, `final_cts`; `E_polys` (subtable values); `lookup_outputs` | `instruction_flags` |

#### RW memory

Read/write values encode secret state; addresses and timestamps enforce ordering/consistency and can be public.

| Shared | Public |
| --- | --- |
| `v_read_*` (regs/RAM reads), `v_write_*` (regs/RAM writes) | `a_ram` addresses; `t_*` timestamps |

#### R1CS

Chunks and many auxiliary polynomials are formed from hidden operands and must stay secret; flags and purely structural auxiliaries can be public.

| Shared | Public |
| --- | --- |
| `chunks_x`, `chunks_y`; `aux.*_lookup_operand`, `aux.product`, `aux.relevant_y_chunks`, `aux.next_pc_jump`, `aux.should_branch`, `aux.next_pc` | `circuit_flags`; `aux.write_lookup_output_to_rd`, `aux.write_pc_to_rd` |

### Polynomial commitment

Commitments are straightforward: for shared polynomials, each party commits to the coefficient vector formed by its **first** RSS component $s_i$ (ignoring $s_{i+1}$); summing the commitments from all 3 parties yields the commitment to the underlying coefficients, since $s_0+s_1+s_2=\text{coeff}$. Public-polynomial commitments are unchanged.

Pairing-based PCS are additively homomorphic: $\mathsf{Com}(a+b)=\mathsf{Com}(a)+\mathsf{Com}(b)$ and $\mathsf{Com}(\lambda a)=\lambda\cdot\mathsf{Com}(a)$. This enables scalable work distribution:

- **Chunking.** Split coefficients across workers; each computes a local MSM against SRS powers, $\sum_j a_j [\alpha^j]G_1$.
- **Pippenger MSM.** Locally windowed bucketing gives parallel MSMs; workers return one $G_1$ point each.
- **Aggregation.** The coordinator sums partial commitments (group additions) into the global commitment.

I leave further optimizations to **amortize large-value overhead** for future work.

### Instruction lookups

The instruction-lookups, bytecode, and read/write-memory provers share the same structure: offline memory checking via two grand products; instruction lookups and RW memory add an extra sumcheck to prove outputs. Many challenges are common across these modules; instruction lookups surfaced all of them, so I use it as the case study.

### Primary sumcheck

For a trace of length $m$ and $F$ distinct instructions, the collation sumcheck is

$$
\sum_{x\in\{0,1\}^{\log_2 m}} \Big[\widetilde{\mathrm{eq}}(r,x)\cdot \sum_{f\in\{0,1\}^{\log_2 F}} \widetilde{\operatorname{flags}_f}(x)\cdot g_f\!\big(\operatorname{terms}_f(x)\big)\Big]=0,
$$

where $g_f(\cdot)$ is the $f$-th instruction's collation. Most $g_f$ are non-interactive and linear, but some, like instructions using the $E_{\text{EQ}}$ subtable, require consecutive multiplications, which become a bottleneck on large traces. Furthermore, each round binds many polynomials and evaluates them at random points, so parallelism is essential. Mixing CPU and blocking I/O on one Rayon pool can starve the pool (I/O parks threads; required CPU tasks cannot start), causing deadlocks. *The solution:* use separate pools for CPU vs. I/O, connect them with bounded channels, and pin blocking network ops to the I/O pool only.
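Below is a minimal sketch of that arrangement, assuming a Rayon pool for CPU work and a dedicated OS thread for blocking network I/O, bridged by a bounded channel for backpressure; `NetMsg` and the pool size are illustrative, not the co-zkvms types:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Illustrative message type; the real protocol sends round polynomials/shares.
struct NetMsg(Vec<u8>);

fn main() {
    // Dedicated CPU pool: never issue blocking I/O from inside it.
    let cpu_pool = rayon::ThreadPoolBuilder::new()
        .num_threads(8)
        .build()
        .expect("cpu pool");

    // Bounded channel provides backpressure between the CPU and I/O stages.
    let (tx, rx) = sync_channel::<NetMsg>(4);

    // I/O lives on its own OS thread; blocking here cannot park CPU workers.
    let io = thread::spawn(move || {
        for NetMsg(bytes) in rx {
            // e.g., socket.write_all(&bytes) — blocking is fine on this thread
            let _ = bytes;
        }
    });

    cpu_pool.install(|| {
        // CPU-heavy sumcheck round work, parallelized with rayon inside `install`.
        let round_msg: Vec<u8> = (0u8..32).collect();
        // Hand off to the I/O thread instead of blocking a rayon worker.
        tx.send(NetMsg(round_msg)).expect("io thread alive");
    });

    drop(tx);
    io.join().unwrap();
}
```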
**SIMD-style batching to amortize I/O.** Treat the work like vector lanes: instead of computing $g_f$ for a single $\operatorname{terms}_f(x)$, batch along two axes.

1. **Domain-axis batching.** Sumcheck halves the domain size $m\!\to\! m/2$ each round. Let $S_i$ be the remaining $m/2^i$ points. For all $x\in S_i$, precompute and reuse $\widetilde{\mathrm{eq}}(r,x)$ and $\widetilde{\operatorname{flags}_f}(x)$.
2. **Degree-axis batching.** In round $i$, the prover sends a univariate of degree $d$[^1]. For each instruction $f$, precompute $\operatorname{terms}_f(x)$ at all $d$ evaluation points wherever $f$ is active ($\widetilde{\operatorname{flags}_f}(x \in S_i) = 1$).

**Flatten across both axes → batched $g_f$ → scatter back.** Use sparsity masks (active flags per instruction) to *gather* all needed subtable terms across $S_i$ and degrees into contiguous *per-instruction* batches. Evaluate each $g_f$ *once* on its packed batch (one call per instruction); then *scatter* results to their $(x,\text{degree})$ slots via the same masks, multiply by the flags, and accumulate into the round's inner sums. Finally, weight by $\widetilde{\mathrm{eq}}(r,x)$ and *reduce* over the halved domain. This preserves sparsity, maximizes locality, and collapses many tiny calls into a few large batched operations. For large $m$ (e.g., $m\!\ge\!2^{20}$), it yields an order-of-magnitude drop in I/O calls and framing overhead.

*Implementation:* [`instruction_lookups/worker.rs`](https://github.dev/nulltea/co-zkvms/blob/fae80f08666fead6878e8991a5b89b0f125fef50/co-jolt/src/jolt/vm/instruction_lookups/worker.rs#L467-L589).

[^1]: In each round the prover sends a univariate polynomial; for the primary sumcheck, $d=\max_f (\deg g_f) + 2$ (one from $\widetilde{\mathrm{eq}}$, one from $\mathrm{flag}_f$). Currently $d=8$.

### Offline memory checking

Jolt checks that each memory $E_i$ is well-formed using two subsets:

$$
\begin{aligned}
&\mathrm{read},\mathrm{write}\subseteq\{(a_i,v_i,t_i)\mid i\in [m]\},\\
&\mathrm{init},\mathrm{final}\subseteq\{(a_i,v_i,t_i)\mid i\in [M]\},
\end{aligned}
$$

where $a_i$ is the address (subtable index `dim[i]`), $v_i$ the subtable value `E_poly[i]`, and $t_i$ the timestamp (read counter `read_cts[i]`). Here, $m$ is the trace length and $M$ the number of instruction subtables. Using multiset hashing, it proves $\mathrm{read}\cup\mathrm{final}=\mathrm{write}\cup\mathrm{init}$. This is realized as a grand product over a binary tree of multiplication gates and evaluated via an [optimized GKR protocol](https://eprint.iacr.org/2013/351.pdf).

The init/final grand product is dense and constant-sized. The read/write grand product scales with the trace but is sparse (only one instruction is active per step). Jolt exploits this by inserting a “toggle” layer that sets inactive leaves to $1$, simplifying the calculation of later GKR layers.

**Layer construction.** Each GKR layer folds by dotting left/right halves; naively this yields many multiplications and high communication. For dense layers, the mitigation is natural—simply [batch](https://github.com/nulltea/co-zkvms/blob/ad5d6469dabdf293aae8cc2e803b057339a1b62e/co-jolt/src/poly/dense_interleaved_poly.rs#L135) the multiplies. For the sparse layer, defer independent multiplies within a layer and fulfill them in one batch at layer end (a *futures*-style trick[^2]), reducing I/O without changing the algebra. To prove a grand product, one proves a sequence of sumchecks, one per GKR layer.
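A simplified sketch of that deferral pattern, with a made-up `batch_mul` standing in for the batched one-round MPC multiplication of share vectors (the actual logic lives in the linked `sparse_interleaved_poly.rs`):

```rust
/// Placeholder share type; the real code uses rep3 field shares.
#[derive(Clone, Copy)]
struct Share(u64);

/// Stand-in for one MPC round: multiply many share pairs in a single
/// batched exchange instead of one round trip per product.
fn batch_mul(pairs: &[(Share, Share)]) -> Vec<Share> {
    // In MPC this would send one message carrying all local additive
    // products; here we just simulate the result.
    pairs.iter().map(|(a, b)| Share(a.0.wrapping_mul(b.0))).collect()
}

/// A "future": remembers where the product should land once fulfilled.
struct PendingMul { dst: usize, lhs: Share, rhs: Share }

fn fold_layer(layer: &mut Vec<Share>) {
    // Defer: record every independent multiply instead of executing it.
    let mut pending: Vec<PendingMul> = Vec::new();
    for i in 0..layer.len() / 2 {
        pending.push(PendingMul { dst: i, lhs: layer[2 * i], rhs: layer[2 * i + 1] });
    }
    // Fulfill: one batched call per layer, then scatter results back.
    let inputs: Vec<_> = pending.iter().map(|p| (p.lhs, p.rhs)).collect();
    let products = batch_mul(&inputs);
    for (p, out) in pending.iter().zip(products) {
        layer[p.dst] = out;
    }
    layer.truncate(layer.len() / 2);
}
```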
[^2]: A _future_ represents a value that is not available yet but will be computed later on. Implementation: [`sparse_interleaved_poly.rs`](https://github.com/nulltea/co-zkvms/blob/ad5d6469dabdf293aae8cc2e803b057339a1b62e/co-jolt/src/poly/sparse_interleaved_poly.rs#L152-L185).

### Distributed proving

Jolt operates on a sizable witness (255 polynomials of degree $\ge 2^{16}$) that must be kept in RAM during proving. Since secret sharing pads coefficients with randomness, the memory footprint increases significantly, creating a new bottleneck.

We can leverage Jolt's PIOP linearity to distribute this overhead across workers. Witness polynomials are *chunked coefficient-wise*: every worker holds all polynomials but fewer coefficients. Workers compute on their chunks, and the coordinator (or a lead worker per party) aggregates by summation. Memory per worker drops with $N_{\text{workers}}$, enabling scalable compute. Practically, this requires additional orchestration and a more elaborate networking layer.

I will use instruction lookups as a case study again; other modules follow similarly. In the primary sumcheck, workers operate on the $\mathrm{eq}$, $\mathrm{flags}$, `E_polys`, and `lookup_outputs` polynomials. All are divided[^3] into coefficient-wise chunks. Workers run the usual sumcheck, but the total number of rounds is reduced by $\log_2 N_{\text{workers}}$. Each worker communicates inside its 3-party MPC subnet; the coordinator has links to all workers/parties[^4] and receives $N_{\text{workers}}$ round messages, which are summed coefficient-wise. To avoid imposing this extra work on the coordinator, a lead worker subnet can aggregate instead (this works because share summation is linear). At the end of this shortened distributed sumcheck, the resulting polynomials have $N_{\text{workers}}$ unbound coefficients. The final $\log_2 N_{\text{workers}}$ rounds are then executed; we again delegate these to the lead worker subnet.

[^3]: In the final implementation, *chunking* of the execution trace occurs during *witness extension* by the coordinator; workers never touch data outside their chunk.

[^4]: Direct links to all workers are not strictly required: a *proxy* can multiplex per-worker channels with *end-to-end encryption* and *authentication*, simplifying coordinator networking and reducing handshake overhead.

### R1CS

Jolt uses R1CS to enforce RISC-V rules and cross-module consistency. The R1CS witness is assembled from selected coefficients of various Jolt polynomials, forming a *mixed polynomial* whose coefficients are either **shared** or **public**. Rust enums proved to be incredibly useful here. Two types encapsulate this: [`SharedOrPublic`](https://github.com/nulltea/co-zkvms/blob/dc2b785b3d3a4873f36215e6e272665199bba885/co-jolt/src/utils/shared_or_public.rs#L14) and [`MixedPolynomial`](https://github.com/nulltea/co-zkvms/blob/dc2b785b3d3a4873f36215e6e272665199bba885/co-jolt/src/poly/mixed_polynomial.rs#L12).

```rust
/// Stores and implements interaction between different value types.
enum SharedOrPublic<F: JoltField> {
    Public(F),
    Shared(Rep3PrimeFieldShare<F>),
    Additive(AdditiveShare<F>),
}

/// Polynomial comprised of `SharedOrPublic` coefficients.
struct MixedPolynomial<F: JoltField> {
    pub evals: Vec<SharedOrPublic<F>>,
    num_vars: usize,
    len: usize,
    party_id: PartyID,
}
```

With this representation, uniform Spartan proceeds unchanged: three sumchecks with the previously discussed challenges and optimizations (communication batching, Gruen split-eq, worker-enabled scalability).
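As an illustration of why the enum representation pays off, here is a simplified, self-contained sketch of how mixed addition might dispatch on the variant pair. It uses toy `u64` values in place of field elements and shares, and follows the rep3 rule that only the holders of the $s_0$ component fold a public constant into their share; it is not the actual `SharedOrPublic` implementation:

```rust
/// Toy stand-ins: real code uses prime-field elements and rep3 share types.
#[derive(Clone, Copy)]
enum Val {
    Public(u64),
    // Replicated share pair (s_i, s_{i+1}); arithmetic is mod 2^64 here
    // purely for brevity.
    Shared(u64, u64),
}

#[derive(Clone, Copy)]
struct PartyId(usize); // 0, 1, or 2

impl Val {
    /// Mixed addition: public+public stays public (no communication);
    /// shared+shared is componentwise; for public+shared, only the parties
    /// holding the s_0 component add the constant, so s_0 + s_1 + s_2 still
    /// sums to the new secret.
    fn add(self, rhs: Val, id: PartyId) -> Val {
        match (self, rhs) {
            (Val::Public(a), Val::Public(b)) => Val::Public(a.wrapping_add(b)),
            (Val::Shared(a0, a1), Val::Shared(b0, b1)) => {
                Val::Shared(a0.wrapping_add(b0), a1.wrapping_add(b1))
            }
            (Val::Public(c), Val::Shared(s0, s1)) | (Val::Shared(s0, s1), Val::Public(c)) => {
                match id.0 {
                    0 => Val::Shared(s0.wrapping_add(c), s1), // holds (s_0, s_1)
                    2 => Val::Shared(s0, s1.wrapping_add(c)), // holds (s_2, s_0)
                    _ => Val::Shared(s0, s1),                 // party 1 unchanged
                }
            }
        }
    }
}
```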
### Witness extension in MPC

The delegator can perform witness extension itself, but this is suboptimal: the secret-shared extended witness is too large (for a $2^{22}$-step trace, the secret-shared witness is ≈50 GB). It also rules out private shared state, since the coordinator must know all inputs to extend the witness. Moving witness extension into MPC fixes both, but introduces new challenges; instruction lookups illustrate these well.

In $\text{Prove}$ we use arithmetic shares for linear algebra and field relations. Witness extension needs bit ops, comparisons, and indexed lookups, so we use binary shares over a ring (e.g., `u32`). The difference is in representation: field elements $\rightarrow$ packed bits (unsigned integers); and in supported operations: local ADD→XOR, interactive MUL→AND. Conversions between the two use **A2B/B2A** (bit-decomposition/recomposition) protocols.

#### Generating subtable lookup indices

From word-sized operands, compute the indices of the queried subtables. Each operand is either public (instance) or a binary share (witness). This step is local with no communication; e.g., [`chunk_and_concatenate_operands`](https://github.com/a16z/jolt/blob/42de0ca1f581dd212dda7ff44feee806556531d2/jolt-core/src/utils/instruction_utils.rs#L130) $\rightarrow$ [`rep3_chunk_and_concatenate_operands`](https://github.com/ChainSafe/co-zkvms/blob/72ed5ccf68053152729dade897c890902906e5d2/co-jolt/src/utils/instruction_utils.rs#L56).

Next, we convert binary-shared indices to arithmetically shared `dim` polynomials via B2A. Per-share B2A is slow: many small I/O calls congest workers. Batching all B2As isn't ideal either: sending large packets leads to poor resource utilization, as workers do nothing while waiting on downloads. A middle ground works best: chunk the B2As and compute them in parallel on forked network subnets.

#### Constructing counters and E_polys

The `read_cts`/`final_cts` and `E_polys` polynomials encode subtable query counters and per-step queried values, respectively; constructing them involves indexing by subtable indices. To handle reads/writes by shared subtable indices, I use the [LUT gadget](https://github.com/TaceoLabs/co-snarks/blob/main/mpc-core/src/protocols/rep3_ring/gadgets/lut.rs) developed by the Taceo team based on the [MAESTRO](https://eprint.iacr.org/2024/1317.pdf) paper[^5]. For memory $i \in M$:

1. Initialize `materialized_subtables` and `final_cts` as arithmetically shared LUTs ($m=2^{16}$-size vectors of ring shares).
2. Generate a one-hot vector (OHV) for the subtable indices of all used trace steps—first as binary bit shares, then *inject these bits into ring shares*.
3. For each used trace step $j \in N$:
   - Read the counter from `final_cts` using the OHV and set it as `read_cts[j]`.
   - Increment the counter and write it back to `final_cts`.
   - Read from the public subtable LUT and set the result into `E_polys[j]`.

Naive OHV generation injects $m=2^{16}$ bits per access (one per subtable lookup index). This creates a bottleneck: total I/O blows up to $O(m \cdot N \cdot M)$. Instead, precompute a random OHV whose 1-position is a secret, uniformly random index $r \in M$, and use XOR shifting.

Preprocess (offline, repeatable):

- Sample $(r,\texttt{ohv}) \leftarrow \mathrm{FRandOHV}(M=2^k)$, where $\texttt{ohv}$ is a replicated one-hot $M$-length vector at position $r$.
Online, per address $a$:

- Given $a$ as a $k$-bit binary share, open $c=a \oplus r$ (one round).
- Use `ohv[i ^ c]` as the OHV bit for position $i \in M$.

With OHVs in place, step 3 requires no communication except the final resharing:

```
for (j, ohv) in memory_addr_ohvs.enumerate() {
    // Sample masks from correlated randomness (PRNG).
    let mut counter = prng.masking_element();
    let mut subtable_lookup = prng.masking_element();
    // c = a_j ^ r was opened for this access during preprocessing.
    for (i, b) in final_cts.iter_mut().enumerate() {
        let e = ohv[i ^ c];  // OHV bit (either 0 or 1)
        counter += *b * e;   // rep3 mul -> additive share
        *b += e;             // increment the selected counter in place
    }
    read_cts_local_a[j] = counter;
    for (i, b) in materialized_subtable.iter().enumerate() {
        let e = ohv[i ^ c];
        subtable_lookup += *b * e;
    }
    lookup_subtables_local_a[j] = subtable_lookup;
}
// Reshare additive `read_cts_local_a`, `lookup_subtables_local_a`
// -> rep3 `read_cts`, `lookup_subtables`.
```

Reusing the same randOHV (mask $r$) makes $c=a\oplus r$ linkable: $c_i=c_j \Leftrightarrow a_i=a_j$, and pairwise XORs $\Delta_{ij}=c_i\oplus c_j=a_i\oplus a_j$ expose frequency, strides, and clusters. If any anchor leaks (a single true $a_i$), then $r=c_i\oplus a_i$ reveals all $a_j=c_j\oplus r$. So the authors recommend not reusing it: precompute fresh randOHVs per access *in advance*, or at minimum rotate masks in short epochs to cap linkability.

#### Compute lookup outputs

At each execution-trace step, given the operand values (public or shared), compute the instruction output. The difficulty comes from non-linearity in many instructions (multiplication, [comparisons](https://github.com/TaceoLabs/co-snarks/blob/main/mpc-core/src/protocols/rep3/arithmetic.rs#L430-L720), [modulo reduction](https://link.springer.com/chapter/10.1007/978-3-642-17373-8_28)). MPC handles these, but the costs compound on large traces. Batching many steps per instruction amortizes I/O and improves locality, so the overhead remains small.

#### The remaining modules

Witness extension for R1CS, Read/Write Memory, and Bytecode is mostly local. The notable cost is casting ring binary shares into arithmetic prime-field shares. Future work aims to explore I/O-optimal B2A (e.g., [edaBits](https://eprint.iacr.org/2020/338.pdf)) to reduce this overhead.

[^5]: The core technique is based on the one-hot vector (OHV), a LUT-size bit mask (1 at the lookup index, 0 elsewhere) that turns a lookup into a linear inner product with a low-round online phase—$O(\sqrt N)$ or $O(\lambda \log N)$—but higher local compute cost (using PIR).

## Results

All tracing logs are provided in [the repo](https://github.com/nulltea/co-zkvms/tree/main/co-jolt/traces) and can be viewed in `chrome://tracing` or [ui.perfetto.dev](https://ui.perfetto.dev/).
#### Setup

- 3 parties (1 worker per party) — m7i.2xlarge, 8 vCPUs, 32 GiB RAM
- coordinator — m7i.2xlarge, 8 vCPUs, 32 GiB RAM
- program: [SHA2 chain (100-300 iterations)](https://github.com/nulltea/co-zkvms/blob/dc2b785b3d3a4873f36215e6e272665199bba885/co-jolt/examples/sha3-chain/guest/src/lib.rs#L6)

### Prover

#### Prover time — Per module

![image](https://hackmd.io/_uploads/SJvXhnYcge.png)

Older (less optimized) results on 3× m7i.8xlarge (32 vCPUs, 192 GiB RAM) demonstrate linear prover complexity:

![sha2-chain-100-old](https://hackmd.io/_uploads/SyiN4XaOee.png)

#### Prover time — Core operations

![image](https://hackmd.io/_uploads/SJZDhnFcxl.png)

### Witness extension

#### Witness — Per module

![image](https://hackmd.io/_uploads/HkvGLTF5el.png)

#### Witness — Core operations (aggregated across threads)

![image](https://hackmd.io/_uploads/B1DrmaK5xg.png)

## Future work

### Coordinator role delegation

Although coordinator work and I/O are logarithmic, assigning this role to the delegator is still problematic. It centralizes liveness: the coordinator must stay online and responsive, and any pause stalls the protocol. This is workable in practice: browsers can keep a tab active; Android maintains background connectivity ([Doze/Standby](https://developer.android.com/training/monitoring-device-state/doze-standby)); and iOS is more restricted, but [workarounds](https://developer.apple.com/documentation/usernotifications/pushing-background-updates-to-your-app) exist.

*Private shared state* is a bigger concern. Multiple delegators can secret-share inputs independently, but distributed coordination is questionable. With zero-knowledge via [masking polynomials](https://people.cs.georgetown.edu/jthaler/ProofsArgsAndZK.pdf#page=203), delegators could jointly sample masks, but this creates challenges with maintaining a shared transcript (Fiat–Shamir consistency).

An alternative is **delegating coordination to a TEE.** Delegators secret-share their inputs locally and send shares to workers; a TEE coordinator samples challenges and masks, aggregates constant-size messages, and assembles the proof. This improves UX at the cost of a TEE trust assumption. The degree of privacy compromise is a function of the information a non-ZK sumcheck leaks.

If a TEE is compromised, a malicious coordinator could send transparent (zero) or biased masks. Under Fiat–Shamir (ROM), a PPT “dishonest verifier”[^6] cannot bias challenges: each $r_i$ is fixed by the transcript via the random oracle ([Thaler p.78](https://people.cs.georgetown.edu/jthaler/ProofsArgsAndZK.pdf#page=78)). In a similar vein, have the coordinator sample the mask coefficients $g$ and the random weight $\rho$ from the transcript, and have workers commit to the masking polynomial $g$ ([Xie p.41](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-35.pdf#page=41)). This does not stop the coordinator from *using* different masks, but doing so will cause verification to fail[^7], revealing misbehavior.

Let's assume the worst case: the TEE is compromised and the operator is indifferent to detection. The sumcheck is unmasked and the verifier is dishonest but computationally bounded.
An unmasked sumcheck on an $\ell$-variate witness with per-round degree $\le d$ leaks exactly $\ell(d+1)$ independent linear constraints (the per-round $d\!+\!1$ evaluations, i.e., weighted inner sums), plus, in GKR, two MLE evaluations per layer ([Xie p.31](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-35.pdf#page=43), [Libra p.4](https://www.cs.yale.edu/homes/cpap/published/libra-crypto19.pdf#page=4)). This leakage is not negligible, but it is exponentially smaller than the full witness of size $2^{\ell}$ ([Chiesa p.3](https://arxiv.org/pdf/1704.02086#page=3), [Thaler p.38](https://people.cs.georgetown.edu/jthaler/ProofsArgsAndZK.pdf#page=38)). Thus, biasing masks in a single run cannot reconstruct the full witness. Conversely, since masks are sampled from the transcript and $g$ is committed, unless the TEE operator also colludes with parties to repeat proving many times (multi-run grinding), the proof will fail to verify, signaling misbehavior to delegators.

Alternatively, workers can generate masking polynomials via a distributed randomness source (DRB/VRF). Derive the mask coefficients as

$$
g := \text{FS}(\texttt{DVRF}); \quad c := \text{Commit}(g); \quad \rho := \text{FS}(c \,\|\, \texttt{round}).
$$

Commit to $g$ before sampling $\rho, r$, and open $g(r)$ when required. Any deviation from the committed $g$ fails the opening, and the proof rejects. This protects against a *non-colluding* malicious coordinator.

#### Zero-Knowledge

The techniques just described assume zero-knowledge via masking polynomials. The Jolt documentation lists additional approaches. Wrapping the Jolt proof in another zkSNARK (Groth16) works but requires an extra MPC prover. Taceo's optimized [`co-groth16`](https://github.com/TaceoLabs/co-snarks/tree/main/co-circom/co-groth16) is handy, but a Jolt-proof verification circuit does not yet exist. Moreover, this does not prevent an honest-but-curious coordinator from learning a logarithmic amount of information per proof, so it does not solve coordinator delegation.

[ZK-through-folding](https://eprint.iacr.org/2023/573.pdf#page=27) and [Pedersen-style](https://eprint.iacr.org/2017/1132.pdf) homomorphic commitments are also promising. I leave further analysis of all applicable approaches for future work. Regardless, this benefits both this project and upstream Jolt.

[^6]: A compromised coordinator functions as a [dishonest verifier](https://people.cs.georgetown.edu/jthaler/ProofsArgsAndZK.pdf#page=171), since it samples challenges.

[^7]: Workers must open $g(r)$ so the verifier can subtract $\rho g(r)$ ($\rho$ is sampled after the commitment to $g$ is written to the transcript); using a different mask breaks the opening and the check fails. [CFS'17](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-35.pdf#page=41)

### Known issues, Optimizations, and Features

- **Distributed proof delegation:** Currently implemented only for the primary sumcheck in instruction lookups. Extend work distribution to all prover modules and to witness extension for scalable memory and compute.
- **ZK for memory IO:** The Jolt verifier currently assumes knowledge of memory IO. Without recursion into a zkSNARK, this leaks values. Replace plaintext IO with hiding commitments and verify openings instead.
- **RISC-V trace generation in MPC:** The final component for private shared state: generate the RISC-V execution trace under MPC so inputs remain secret-shared end-to-end. [StoffelVM](https://github.com/Stoffel-Labs/StoffelVM) could be a good fit here.
- **Optimize for public values:** Some prover paths treat public polynomials as [trivial shares](https://github.com/nulltea/co-zkvms/blob/dc2b785b3d3a4873f36215e6e272665199bba885/co-jolt/src/poly/multilinear_polynomial.rs#L100) (e.g., the [Bytecode init/final grand product](https://github.com/nulltea/co-zkvms/blob/dc2b785b3d3a4873f36215e6e272665199bba885/co-jolt/src/jolt/vm/bytecode/worker.rs#L134-L136)). Move these to a public prover; this is done for the [timestamp range check](https://github.com/nulltea/co-zkvms/blob/dc2b785b3d3a4873f36215e6e272665199bba885/co-jolt/src/jolt/vm/read_write_memory/worker.rs#L80-L106) (not yet distributed).
- **Alternative PCS support:** Add multilinear, ZK PCS such as [Zeromorph](https://eprint.iacr.org/2023/917), [Hyrax](https://eprint.iacr.org/2017/1132), [BMMTV](https://eprint.iacr.org/2019/1177). Investigate techniques to reduce large-value commitment overhead (suggestions are welcome).
- **Existing Jolt optimizations:** Integrate the Spartan [small-value optimization](https://github.com/a16z/jolt/blob/42de0ca1f581dd212dda7ff44feee806556531d2/jolt-core/src/subprotocols/sumcheck.rs#L192) (SVO), [Twist and Shout](https://eprint.iacr.org/2025/105), and [Quarks](https://eprint.iacr.org/2020/1275). These are missing from the current implementation.
- **EVM verifier tweaks:** Only needed with PST13; otherwise unnecessary once the supported PCS are in place.
- **Jolt SDK extension:** Provide delegation and private-shared-state APIs (shared values; distributed witness derive macros).

*[RSS]: Replicated Secret Sharing
*[PIOP]: Polynomial Interactive Oracle Proof
*[PCS]: Polynomial Commitment Scheme
*[A2B]: Arithmetic to Binary
*[B2A]: Binary to Arithmetic
*[MSM]: Multi-Scalar Multiplication
*[MLE]: Multilinear Extension
*[ROM]: Random Oracle Model
*[TEE]: Trusted Execution Environment
*[PPT]: Probabilistic Polynomial-Time
*[DRB]: Distributed Randomness Beacon
*[VRF]: Verifiable Random Function