# Fat vs. Lean Precompiles in zkVMs: Short-Term Speed or Long-Term Sustainability?

Today, zkVM teams are in a race to be the first to achieve "real-time proving" of an Ethereum block with a practical amount of GPU hardware. This race has popularized a technique for cutting down both execution time and proving time: **precompiles**.

Precompiles in zkVMs are specialized, built-in circuits for expensive cryptographic operations (e.g., hashing, elliptic curve math, pairings). Instead of expressing these operations through the zkVM’s general instruction set—which would be slow and require millions of constraints—precompiles shortcut directly to optimized constraint systems designed for those operations.

Precompiles offer great advantages to the Ethereum ecosystem in its goal to SNARKify the execution layer, but little by little, a problem is brewing: **“How many precompiles are too many?”** On average, zkVM teams have about 10 precompiles baked into their systems. These precompiles are not easily adaptable across zkVMs, forcing each team to maintain its own patched crates. This fragmentation exposes Ethereum to multiple attack vectors if any of these patches are compromised.

The term **“fat-precompile”** refers to a precompile tasked with performing a heavy-duty operation that could instead be broken down into smaller units of work. These smaller units are referred to as **“lean-precompiles.”**

In this document, we will explore a program simulating a tripartite Diffie-Hellman key exchange. The program exercises common elliptic curve primitives such as scalar multiplication and pairings, operating over the `bn254` curve.

```rust
// General Guest Program for bn254 Key Exchange.
// Types and `pairing` come from the `substrate_bn` crate (crate name `bn`);
// `init_rands_bn_batched` is the benchmark's own helper for sampling scalars.
use bn::{pairing, Fr, Group, G1, G2};

pub fn general_guest_program() {
    let rands = init_rands_bn_batched();
    for rand in rands {
        // Generate private keys
        let alice_sk = rand;
        let bob_sk = rand + Fr::one();
        let carol_sk = bob_sk + Fr::one();

        // Generate public keys in G1 and G2
        let (alice_pk1, alice_pk2) = (G1::one() * alice_sk, G2::one() * alice_sk);
        let (bob_pk1, bob_pk2) = (G1::one() * bob_sk, G2::one() * bob_sk);
        let (carol_pk1, carol_pk2) = (G1::one() * carol_sk, G2::one() * carol_sk);

        // Each party computes the shared secret; all three values equal
        // e(G1, G2)^(alice_sk * bob_sk * carol_sk).
        let alice_ss = pairing(bob_pk1, carol_pk2).pow(alice_sk);
        let bob_ss = pairing(carol_pk1, alice_pk2).pow(bob_sk);
        let carol_ss = pairing(alice_pk1, bob_pk2).pow(carol_sk);
    }
}
```

The goal of this document is to propose that moving from fat-precompiles like `bn254-pairing` to lean-precompiles such as `bigint-mul_mod` is the better choice for the ecosystem in the long term. To test this idea, six programs were created. Each runs the guest program described above, with slight modifications:

1. **bn-pairing**: Uses the standard `substrate_bn` crate without precompile optimizations.
2. **bigint-pairing**: Uses a modified `substrate_bn` crate that relies on `crypto-bigint` for $U256$ operations, also without precompiles.
3. **ark-pairing**: Uses the `ark_bn254` crate from `arkworks` without precompile optimizations.
4. **bn-pairing-patched**: Identical to `bn-pairing` but enables SP1's "fat" `bn` precompile, which accelerates the full suite of `bn256` operations.
5. **bigint-pairing-patched**: Identical to `bigint-pairing` but enables a generic `bigint` precompile to accelerate low-level $U256$ arithmetic.
6. **ark-pairing-patched**: Identical to `ark-pairing`, but with bigint operations swapped for `sp1::mul_mod` precompile calls wherever they apply, which should cut down the execution cycle count.
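Before looking at the numbers, it helps to see where a lean precompile plugs in. The sketch below is illustrative only: the widths are shrunk to `u64` for brevity, and the zkVM intrinsic named in the comment is hypothetical, standing in for a generic `mul_mod` hook like the `bigint` and `sp1::mul_mod` patches above.

```rust
/// Software fallback: (a * b) mod m via a double-width intermediate.
fn software_mul_mod(a: u64, b: u64, m: u64) -> u64 {
    ((a as u128 * b as u128) % m as u128) as u64
}

/// The single low-level hook a field library routes every multiplication
/// through. Patching for a zkVM touches only this function; everything
/// built on top of it (curve adds, scalar muls, pairings) stays untouched.
fn mul_mod(a: u64, b: u64, m: u64) -> u64 {
    // On a zkVM target this branch would invoke the precompile instead,
    // e.g. gated behind `#[cfg(target_os = "zkvm")]` and calling a
    // hypothetical `zkvm_mul_mod(a, b, m)` syscall wrapper.
    software_mul_mod(a, b, m)
}

fn main() {
    // Toy field multiplication: 3 * 5 mod p stays below the modulus.
    let p = 0xFFFF_FFFF_0000_0001_u64; // illustrative 64-bit prime
    assert_eq!(mul_mod(3, 5, p), 15);
}
```

This boundary is exactly what makes a lean precompile portable: `substrate_bn`, `crypto-bigint`, and `arkworks` all bottom out in a handful of `mul_mod`-shaped operations, so one audited circuit can serve every library above it.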
### zkVM Guest Program Execution

The table below reports the cycle count and execution time for each guest program variant:

| Configuration | Base Crate | Precompile Enabled | Cycle Count | Execution Time |
| :--- | :--- | :--- | :--- | :--- |
| `bn-pairing` | `substrate_bn` | No | 1,105,498,339 | 26.8 s |
| `bigint-pairing` | `crypto-bigint` | No | 1,523,558,068 | 26.0 s |
| `ark-pairing` | `ark_bn254` | No | **428,207,591** | **7.96 s** |
| `bigint-pairing-patched` | `crypto-bigint` | Yes (`bigint`) | 518,877,400 | 11.8 s |
| `ark-pairing-patched` | `ark_bn254` | Yes (`ff-bigint`) | **466,770,300** | **9.03 s** |
| `bn-pairing-patched` | `substrate_bn` | Yes (`bn`) | **40,014,404** | **2.16 s** |
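These cycle counts come from executing the guest ELF on the host, without generating a proof. A minimal sketch of that measurement is shown below; it assumes SP1's `sp1-sdk` host API and a hypothetical ELF path, and exact method names vary between SDK releases.

```rust
// Host-side sketch: execute a guest ELF and read back its cycle count.
// `../elf/bn-pairing` is a hypothetical path to one of the six guests.
use sp1_sdk::{ProverClient, SP1Stdin};

const GUEST_ELF: &[u8] = include_bytes!("../elf/bn-pairing");

fn main() {
    let client = ProverClient::from_env();
    let stdin = SP1Stdin::new();

    // Execution alone is cheap and returns a report containing the
    // total instruction count shown in the table above.
    let (_public_values, report) = client
        .execute(GUEST_ELF, &stdin)
        .run()
        .expect("guest execution failed");
    println!("cycles: {}", report.total_instruction_count());
}
```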
#### Report Analysis of Execution

From the cycle count results above, a few key insights emerge:

1. **Baseline comparisons (no precompiles)**
   Among the unpatched programs, the `ark-pairing` implementation (using `ark_bn254`) was the most efficient, completing in **~428M cycles (7.96 s)** compared to over **1.1B cycles (26.8 s)** for `substrate_bn` and **1.5B cycles (26.0 s)** for `crypto-bigint`. This shows that *library design choices alone* (algorithmic optimizations, field representation, memory layout) can already create a several-fold difference before precompiles are even considered.

2. **Effect of lean-precompiles**
   Introducing the generic `bigint` precompile shifts cycle counts significantly:
   * `bigint-pairing` improves from **1.52B → 518M cycles** (a roughly **3× reduction**).
   * `ark-pairing` moves from **428M → 466M cycles**, a slight regression that may seem counterintuitive at first. The small increase comes from the overhead of crossing the VM/precompile boundary, suggesting that not all libraries benefit equally from low-level bigint acceleration.

   Still, the results confirm that **targeted bigint acceleration is enough to cut down the bulk of the cost** in many elliptic curve operations without requiring a “fat” operation-level precompile.

3. **Effect of fat-precompiles**
   By contrast, enabling the `bn` fat precompile yields dramatic performance gains:
   * `bn-pairing` drops from **1.1B cycles → 40M cycles** (a **27× improvement**). Execution time plunges to just **2.16 seconds**, far surpassing any lean-precompile approach.

   However, this comes at the cost of **hardcoding a full elliptic curve operation suite** into the zkVM. While effective in the short term, this increases long-term fragmentation: each zkVM team maintains its own patched version of `substrate_bn`, and switching curves or libraries requires new fat-precompiles to be designed and audited.

4. **Trade-offs and fragmentation risk**
   * Fat-precompiles deliver **immediate performance** but **lock the ecosystem** into brittle, curve-specific shortcuts.
   * Lean-precompiles offer **modularity and reusability**: any cryptographic library—whether `arkworks`, `crypto-bigint`, or `halo2curves`—can benefit from the same low-level `mul_mod` acceleration.
   * As zkVMs evolve to support multiple proving backends and cryptographic primitives, the lean-precompile path scales better with less maintenance overhead.

Fat-precompiles like `bn254-pairing` win by a landslide in raw numbers. But the data shows that **lean-precompiles narrow the gap enough** to make a compelling case: a generic `bigint` accelerator cuts execution by roughly **3×** while remaining flexible, auditable, and portable across libraries. The long-term health of zkVMs will depend not just on raw proving speed, but on **maintainability, interoperability, and security guarantees**. On that axis, lean-precompiles appear to be the more sustainable path forward.

### zkVM Guest Program Proof Generation

A very similar pattern emerged when generating proofs for these guest programs. The total time for proof generation and verification was recorded for each configuration. The results, derived from the prover and verifier logs, are summarized below. **Shorter times indicate better performance.**

| Configuration | Base Crate | Precompile Enabled | Cycle Count | Proof Generation Time | Verification Time | Total Time |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| `bigint-pairing` | `crypto-bigint` | No | 1,523,558,068 | ~4 hr 50 min | ~2 min 29 sec | ~4 hr 52 min |
| `bn-pairing` | `substrate_bn` | No | 1,105,498,339 | ~4 hr 56 min | ~1 min 52 sec | ~4 hr 58 min |
| `bigint-pairing-patched` | `crypto-bigint` | Yes (`bigint`) | 518,877,400 | ~3 hr 16 min | ~1 min 19 sec | ~3 hr 17 min |
| `ark-pairing-patched` | `ark_bn254` | **Yes** (`ff-bigint`) | **422,122,898** | **~1 hr 23 min** | **~44 sec** | **~1 hr 24 min** |
| `ark-pairing` | `ark_bn254` | **No** | **428,207,591** | **~1 hr 17 min** | **~44 sec** | **~1 hr 18 min** |
| `bn-pairing-patched` | `substrate_bn` | Yes (`bn`) | **40,014,404** | **~40 min 47 sec** | **~21 sec** | **~41 min 8 sec** 🏆 |
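The proving and verification timings above follow the same host-side pattern as the execution benchmark. The sketch below shows the end-to-end flow under the same assumptions (SP1's `sp1-sdk` API, hypothetical ELF path; method names may differ across SDK versions).

```rust
// Host-side sketch: prove a guest ELF and verify the proof, timing both.
use sp1_sdk::{ProverClient, SP1Stdin};
use std::time::Instant;

const GUEST_ELF: &[u8] = include_bytes!("../elf/bn-pairing-patched"); // hypothetical path

fn main() {
    let client = ProverClient::from_env();
    let (pk, vk) = client.setup(GUEST_ELF);
    let stdin = SP1Stdin::new();

    let start = Instant::now();
    let proof = client.prove(&pk, &stdin).run().expect("proving failed");
    println!("proof generation: {:?}", start.elapsed());

    let start = Instant::now();
    client.verify(&proof, &vk).expect("verification failed");
    println!("verification: {:?}", start.elapsed());
}
```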
#### Report Analysis of Proof Generation

Proof generation is where the real bottleneck for zkVMs lies, and the data reflects how precompiles impact not just execution but the cost of proving the execution trace. A few notable points stand out:

1. **Baseline proving costs (no precompiles)**
   Without precompiles, both `bn-pairing` and `bigint-pairing` suffer from prohibitively long proving times—**nearly 5 hours** in both cases. The cycle counts (1.1B–1.5B) directly inflate the proving trace, and while `bigint` and `bn` differ in raw execution cycles, their end-to-end proving times are almost indistinguishable because proof-system overhead dominates beyond a certain trace size. Interestingly, `ark-pairing` achieves far better numbers: **~1 hr 18 min total time**, roughly **4× faster** than the other baselines. This matches the execution results, where `arkworks` already had a leaner trace, reinforcing the importance of library efficiency before introducing precompiles.

2. **Effect of lean-precompiles**
   With `bigint-pairing-patched`, proof generation drops from nearly **5 hours → ~3 hr 16 min**, a reduction of roughly one-third. This shows that a generic low-level bigint accelerator directly reduces the proving burden without introducing curve-specific logic. However, the `ark-pairing` case demonstrates diminishing returns: patched vs. unpatched differ by only minutes (**~1 hr 18 min → ~1 hr 24 min**). This suggests that for already-optimized libraries, crossing the precompile boundary introduces enough overhead to offset the expected savings. This aligns with the execution analysis: lean-precompiles work best where the underlying library is heavy on raw arithmetic operations, but offer smaller gains for libraries that already optimize aggressively.

3. **Effect of fat-precompiles**
   The `bn-pairing-patched` case is the clear outlier winner. Proof generation drops to **~40 min**, a **7× speedup** over unpatched `bn` and roughly **2× faster** than `arkworks`. The fat-precompile effectively collapses the entire elliptic curve suite into optimized circuits, producing the shortest trace and lowest proof cost in practice. But this comes with the same systemic risks as in execution: heavy reliance on a curve-specific “black box” circuit that is difficult to port, audit, or reuse across zkVMs.

4. **Verification costs**
   Verification time remains small (tens of seconds to a few minutes) across all cases, showing that the primary bottleneck in zkVM performance lies overwhelmingly in **proof generation, not verification**. This reinforces why execution cycle counts correlate so strongly with prover time.

The data confirms a consistent story:

* **Fat-precompiles** like `bn254-pairing` deliver unmatched raw performance, with proof generation times under an hour—by far the fastest in this benchmark.
* **Lean-precompiles** deliver smaller but still meaningful gains, shaving **30–40%** off proving time for heavy bigint-based libraries, without locking zkVMs into specialized circuits.
* **Library efficiency matters as much as precompiles**: `arkworks` shows that smart software design can reduce proving times by roughly 4× even before precompiles are considered.

In short:

* If the goal is **short-term performance**, fat-precompiles win.
* If the goal is **long-term sustainability, ecosystem portability, and security**, lean-precompiles remain the healthier path—especially when combined with efficient cryptographic libraries like `arkworks`.

The benchmarks make one thing clear: **fat-precompiles win today on raw numbers**. A `bn254-pairing` precompile can slash execution cycles by 27× and proving time by 7×, delivering performance that looks almost impossible to match with lean-precompiles alone. But that short-term win comes with a hidden cost: fragmentation, maintainability burdens, and a reliance on curve-specific “black box” circuits that do not generalize well across zkVM ecosystems.

By contrast, **lean-precompiles like `mul_mod` offer a more modular and portable path**. They may not always produce headline-grabbing speedups, but they enable incremental improvements across any cryptographic library, reduce ecosystem fragmentation, and allow zkVM teams to share the same low-level primitives instead of maintaining siloed, patched crates. Importantly, the results show that lean-precompiles already cut cycle counts by roughly 3× and proving times by 30–40%, proving that we don’t need fat-precompiles to see meaningful gains.

The real opportunity lies ahead: zkVM teams understand their stacks better than anyone else, and with continued iteration they are in a position to **push the optimization frontier even further**. By investing in lean-precompile infrastructure and combining it with efficient cryptographic libraries, the ecosystem can continue closing the gap on performance—without sacrificing security, interoperability, or long-term maintainability.

Fat-precompiles are a quick sprint, but lean-precompiles set the pace for the marathon. If the Ethereum proving ecosystem is to scale sustainably, the case for porting to lean-precompiles is strong.