
Folding Schemes: Why they matter and how they are not employed in the best way possible (Part 1)

In this multi-part series, we aim to make folding schemes less mysterious.

In the first part, we focus on answering a simple question: what is the fundamental advantage of folding schemes over alternate proving schemes? We also examine why some projects experience poor performance despite using folding. TLDR: the culprit is sub-optimal choices in what is built around folding schemes (e.g., the circuits fed to them), not the folding schemes themselves.

In future parts, we will provide a gentle introduction to NeutronNova's and Neo's folding schemes. Here's the TLDR.

  • Among all existing folding schemes, NeutronNova has the best asymptotics and concrete efficiency. NeutronNova also shows how to efficiently handle the non-uniformity that arises when dealing with zkVMs. In particular, it provides an approach to build folding schemes for complex statements, including VM executions.
  • After NeutronNova, we will dive into Neo, which combines the best of all existing schemes:
    • incurs tiny recursion overheads (like Nova)
    • leverages small fields (like STARKs)
    • incurs pay-per-bit commitment costs (like Binius)
    • leverages the sum-check protocol (like Spartan and Jolt)
    • provides plausible post-quantum security

Now, let's begin. We start with our central question.

What is the fundamental advantage of folding schemes over traditional monolithic proving systems?

By traditional monolithic proof systems, we mean proof systems such as Spartan (which powers Jolt, the state-of-the-art zkVM) and STARKs (which underlies other zkVMs).

First, monolithic proving systems follow a relatively elaborate pipeline:

  1. Commit to the witness
  2. Run a Polynomial IOP (PIOP) – often introducing additional commitments like quotient polynomials
  3. Prove evaluations of the committed polynomials – in many cases more expensive than committing to the witness

Note that if we have a circuit with a few million constraints/rows in AIR, R1CS, or Plonkish (or their generalization CCS), the best proving system turns out to be Spartan. See this article for a deeper dive on Spartan. In the current article, we are concerned with very large computations (e.g., those that require many millions or billions of CPU cycles to run).

Folding-based systems, especially modern ones like NeutronNova, transform this paradigm by:

  • Eliminating step (3) entirely. Rather, folding schemes work with commitments to smaller witnesses and prove an evaluation only once for a small folded polynomial.
  • Drastically reducing the cost of step (2). For example, NeutronNova runs an early-stopping sum-check after committing to the witness. We are not aware of any PIOP that is more efficient than an early-stopping sum-check.
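
To make the sum-check idea concrete, here is a minimal, illustrative sketch (not NeutronNova's actual code) of a single sum-check round over a toy prime field: the prover sends a round polynomial, the verifier checks it against the claimed sum, and the evaluation table is folded down by one variable. An early-stopping sum-check runs only a few such rounds before handing off a much smaller claim. The field modulus and evaluation values are arbitrary choices for illustration.

```python
# Toy sketch: one round of the sum-check protocol for a multilinear
# polynomial given by its evaluation table over the Boolean hypercube.
P = 2**31 - 1  # toy prime modulus (illustrative assumption)

def sumcheck_round(evals):
    """Split the table on the first variable and return the round
    polynomial g(X) as the pair (g(0), g(1)). For a multilinear
    polynomial, g is linear, so these two values determine it."""
    half = len(evals) // 2
    g0 = sum(evals[:half]) % P   # first variable fixed to 0
    g1 = sum(evals[half:]) % P   # first variable fixed to 1
    return g0, g1

def fold_table(evals, r):
    """Fix the first variable to the verifier's challenge r, halving the table."""
    half = len(evals) // 2
    return [(evals[i] + r * (evals[half + i] - evals[i])) % P for i in range(half)]

# Example: a 3-variable multilinear polynomial (8 evaluations).
evals = [3, 1, 4, 1, 5, 9, 2, 6]
claim = sum(evals) % P
g0, g1 = sumcheck_round(evals)
assert (g0 + g1) % P == claim  # verifier's round-1 consistency check

# Early stopping: after round 1, the prover holds a table of half the size
# carrying the claim g(r); it can hand this smaller claim off to folding.
r = 7
evals2 = fold_table(evals, r)
assert sum(evals2) % P == (g0 + r * (g1 - g0)) % P
```

Each round halves the work; stopping early means the remaining (already small) claim can be deferred rather than proven immediately.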

Second, monolithic proof systems cannot scale beyond a certain size, as they run out of memory. To address this, they use recursion: produce proofs of smaller statements and recursively compose those proofs. This incurs additional prover work. Folding schemes are designed to make recursion efficient, even without producing a proof in the first place (i.e., they simply commit to a witness and run a prefix of the PIOP).
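
The core trick that makes this work is a random linear combination of witnesses under a linearly homomorphic commitment. The sketch below is illustrative only: it models the commitment as an inner product with fixed public "generators" over a prime field, whereas real schemes like Nova use Pedersen commitments over an elliptic curve group. The generators and challenge are arbitrary values chosen for the example.

```python
# Toy sketch: folding two committed witnesses into one. Because the
# commitment is linear, the verifier derives the folded commitment from
# the two old commitments alone, without recommitting or opening anything.
P = 2**31 - 1
GENS = [5, 11, 17, 23]  # fixed public "generators" (illustrative assumption)

def commit(w):
    # Linearly homomorphic toy commitment: inner product with generators.
    return sum(g * x for g, x in zip(GENS, w)) % P

def fold(w1, w2, r):
    # Random linear combination of the two witnesses.
    return [(a + r * b) % P for a, b in zip(w1, w2)]

w1, w2 = [1, 2, 3, 4], [9, 8, 7, 6]
c1, c2 = commit(w1), commit(w2)
r = 1234  # verifier's random challenge

w = fold(w1, w2, r)
# Homomorphism: commit(w1 + r*w2) == c1 + r*c2, so per-step evaluation
# proofs are unnecessary; only the final folded instance is ever opened.
assert commit(w) == (c1 + r * c2) % P
```

This is why folding-based recursion is cheap: each step is one commitment plus a few field operations, and the expensive opening happens once at the very end.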

Let's unpack what that means in practice.


🚀 Folding Schemes Slash Proof Generation Costs

In many traditional proof systems, opening polynomial evaluations (step 3) is surprisingly expensive, often more expensive than committing to the original witness.

Folding eliminates the need for repeated polynomial evaluations. For example, NeutronNova:

  • Uses no quotient commitments (which some protocols like Plonk or STARKs require)
  • Implements single-round sum-checks (instead of logarithmic rounds)

This means fewer rounds, fewer commitments, faster prover times, and lower recursion overheads.

Note: Hypothetically, assuming that the prover has unbounded memory, NeutronNova's total prover work is similar to the monolithic prover time in sum-check-based proof systems (e.g., Uniform Spartan). However, unbounded memory is not realistic (e.g., monolithic proof systems that do not use recursion run out of memory when proving a single Ethereum block) and NeutronNova is able to achieve the same prover time efficiency (as the hypothetical solution) while providing significant space efficiency benefits.

Additional Note: Besides the folding scheme, another major contribution of NeutronNova is to reduce more complex statements (e.g., those that arise in zkVMs) to a form where folding can be applied. Concretely, NeutronNova describes reductions by adapting techniques from existing monolithic state-of-the-art zkVMs like Jolt. So, a well-engineered NeutronNova will only be faster than Jolt, while providing significant space efficiency. The recipe in NeutronNova can be used to reduce other techniques in existing monolithic settings to a form where folding can be applied.


🔄 Folding Enables Cheap and Scalable Recursion

Another massive benefit: folding enables efficient recursion.

Most proof systems can't scale to large computations in a single shot. They rely on recursion, but naively combining proofs incurs massive overhead, especially when the base prover is expensive. Moreover, some proof systems like STARKs use the so-called "brute-force" recursion, which is significantly more expensive. An alternative approach is the very recent streaming sum-check. However, leveraging this requires adopting a host of other techniques and is concretely more expensive than a non-streaming linear-time sum-check or a folding-based solution such as NeutronNova.

Folding-based systems like Nova shine here. Recursion is built in, cheap, and incremental. Proof composition is natural, and there's no need to re-prove everything from scratch.


🧠 Implementation Details Matter (A Lot)

Despite this promise, many projects (e.g., Nexus's zkVM) that use folding exhibit poor performance compared to solutions built on monolithic proving systems. These systems miss out on performance because of how they use folding, rather than folding itself.

Below are a few critical mistakes we have seen in prior attempts that use folding schemes. Some projects fix some of these mistakes (e.g., switching from Merkle trees to offline memory checking) while simultaneously switching to a different proof system, and then attribute the improvement to the new proof system. This creates confusion.

  • Using Merkle trees for memory checking
    → Requires committing to thousands of random witness values (even when using SNARK-friendly hash functions like Poseidon) per memory op, which is totally impractical.
    → State-of-the-art approaches such as Nebula address this. For context, Nebula ports to the folding setting the now-popular (i.e., used in most zkVMs such as Jolt, SP1, and AirBender) offline memory checking method. Offline memory checking was first combined with succinct proof systems in the Spice and Spartan papers. Spice used elliptic-curve multiset hash functions (later refined and used in the SP1 zkVM), and Spartan used fingerprint-based multiset hash functions (later refined and used in the Jolt zkVM).
    → Even while using these state-of-the-art techniques, one must be careful not to pay for arbitrary witness values. If one does not implement the required optimizations described in the paper (which are not part of the core protocol), one will not reap the full benefits of Nebula. This is a common mistake that we see in the wild.
  • Using universal circuits that pay for every instruction even when they are not invoked. There are several different ways to handle non-uniformity. If you want to explore, the NeutronNova paper provides a detailed recipe. We will unpack this in a future post.
  • Using tiny step circuits. Folding works by repeatedly folding a step circuit. The work done in each step circuit should be larger than the recursive verifier circuit. Some of the early implementations of zkVMs based on folding used a single CPU cycle for each recursive step!
  • Linear folding instead of tree folding. Many zkVMs based on monolithic provers employ multiple parallel provers and use brute-force recursion. The same strategy should be used when folding, to get the best possible latency.
  • Sub-optimal witness generation
  • Ignoring bit-width optimizations during multi-scalar multiplication (MSM), a core contributor to folding costs. In fact, MSM performance can vary drastically based on implementation.
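
To illustrate the memory-checking point above, here is a minimal sketch of fingerprint-based offline memory checking, the multiset-hash idea from the Spartan/Jolt lineage. All constants and the tiny trace are illustrative assumptions, not any system's actual parameters. Each memory operation contributes a tuple (address, value, timestamp), and the checker verifies that the read and write multisets agree by comparing random fingerprints, instead of committing to Merkle paths per operation.

```python
import random

# Toy sketch of fingerprint-based offline memory checking.
P = 2**61 - 1  # toy prime modulus (illustrative assumption)

def fingerprint(tuples, gamma, tau):
    """Random fingerprint of a multiset of (addr, val, ts) tuples:
    product over tuples of (tau - (addr + gamma*val + gamma^2*ts)) mod P.
    Equal multisets give equal fingerprints; unequal multisets collide
    only with negligible probability over random gamma, tau."""
    acc = 1
    for a, v, t in tuples:
        acc = acc * (tau - (a + gamma * v + gamma * gamma * t)) % P
    return acc

# A tiny trace over a 2-cell memory: initialization, one write, and the
# corresponding reads (including the final-state audit).
writes = [(0, 0, 0), (1, 0, 0), (0, 42, 1)]
reads  = [(0, 0, 0), (0, 42, 1), (1, 0, 0)]

gamma, tau = random.randrange(P), random.randrange(P)
# The two multisets are equal (same tuples, reordered), so the
# fingerprints match; no per-operation hashing or Merkle paths needed.
assert fingerprint(writes, gamma, tau) == fingerprint(reads, gamma, tau)
```

The prover's per-operation cost is a handful of field multiplications on small, structured values, which is exactly what bit-width-aware commitments exploit.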

📊 Data Speaks: MSM Performance Benchmarks

Folding isn't just a theoretical win; it shines when backed by practical optimizations. Take multi-scalar multiplication (MSM), a core building block in ZK systems. Many libraries like halo2curves and arkworks don't optimize for small values, but Nova's commitment routines do.

Let's compare mean times (in milliseconds) for committing 2^20 scalars across different bit-widths. The benchmark code is available here.

| Input Bit-width | halo2curves Commit (ms) | Nova Specialized Commit (ms) |
| --------------- | ----------------------- | ---------------------------- |
| 1               | 226.72                  | 8.66                         |
| 10              | 249.75                  | 15.73                        |
| 16              | 370.24                  | 34.76                        |
| 32              | 417.82                  | 52.37                        |
| 64              | 282.60                  | 98.53                        |

This comparison makes two things clear:

  1. Bit-width aware MSMs matter. The Nova routines massively outperform unoptimized ones when committing small values.
  2. Poor MSM implementations can hide folding's potential. Libraries not optimized for real-world workloads leave performance on the table. In particular, existing curve libraries are optimized for committing to random/arbitrary values. As we can see from the table, small values are sometimes more expensive to commit to with halo2curves than larger values.

NOTE: halo2curves takes more time to commit to 16-bit and 32-bit values than 64-bit values. We suspect this is due to work imbalance among rayon threads.
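Why does bit-width awareness help? In the bucket (Pippenger) method, the number of windows processed scales with the scalars' bit-width, so small scalars need proportionally less work. The sketch below is illustrative only: integers mod P stand in for elliptic-curve points, with modular addition playing the role of point addition; the window size and inputs are arbitrary choices for the example.

```python
# Toy sketch of a bit-width-aware MSM via the bucket (Pippenger) method,
# over a toy additive group (integers mod P stand in for curve points).
P = 2**31 - 1

def msm_bucket(scalars, points, bit_width, window=4):
    """Compute sum(s_i * P_i), processing only `bit_width` bits per scalar
    in windows of `window` bits. Cost scales with bit_width, not with the
    full field size, which is where small-value commitments win."""
    acc = 0
    n_windows = (bit_width + window - 1) // window
    for w in reversed(range(n_windows)):      # most significant window first
        for _ in range(window):               # shift accumulator: acc *= 2^window
            acc = acc * 2 % P
        buckets = [0] * (1 << window)
        for s, pt in zip(scalars, points):    # drop each point into its bucket
            b = (s >> (w * window)) & ((1 << window) - 1)
            buckets[b] = (buckets[b] + pt) % P
        running, total = 0, 0                 # aggregate: total = sum b*buckets[b]
        for b in reversed(range(1, 1 << window)):
            running = (running + buckets[b]) % P
            total = (total + running) % P
        acc = (acc + total) % P
    return acc

scalars = [1, 7, 13, 255]                     # small (8-bit) scalars
points = [101, 202, 303, 404]
naive = sum(s * p for s, p in zip(scalars, points)) % P
assert msm_bucket(scalars, points, bit_width=8) == naive
```

With 8-bit scalars, this loop runs 2 windows instead of the 64 a 256-bit scalar would need; a library that always assumes full-width scalars throws that factor away.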


📌 Key Takeaways

  • Folding reduces the prover's cost by eliminating significant parts of the prover's work.
  • Recursive proof composition becomes practical and efficient.
  • Implementation details, especially memory checking and MSM, make or break performance.
  • Popular elliptic curve libraries often miss optimization opportunities, leaving performance on the table.

👉 In Part 2, we'll explore the NeutronNova folding scheme and how to handle non-uniformity.