# Optimizing the Sumcheck Prover: Small Values and Equality Polynomials
The sumcheck protocol is a core building block of modern SNARKs, but its prover runtime can be a significant bottleneck. This document details two complementary optimizations that accelerate sumcheck proving in common applications: **(1)** when the polynomial evaluations are small relative to the field size, and **(2)** when the summed polynomial contains an equality polynomial factor $\text{eq}(w, X)$.
References used for this note:
- *Bagad et al. paper:* https://eprint.iacr.org/2025/1117.pdf
- LambdaClass blog posts:
  - *Optimizing sumcheck:* https://blog.lambdaclass.com/optimizing-sumcheck/
  - *How factoring equality polynomials optimizes sumcheck:* https://blog.lambdaclass.com/how-factoring-equality-polynomials-optimizes-sumcheck/
## Notation and Assumptions
### Notation
| Symbol | Definition |
| :--- | :--- |
| $\ell$ | The number of variables in the multilinear polynomials. |
| $N$ | The size of the summation domain, $N = 2^\ell$. |
| $\mathbb{F}$ | A finite field. |
| $p_1, \dots, p_d$ | $d$ multilinear polynomials in $\ell$ variables. |
| $g(X)$ | The polynomial being summed, often $g(X) = \prod_{k=1}^d p_k(X)$. |
| $s_i(X_i)$ | The univariate polynomial sent by the prover in round $i$. |
| $r_1, \dots, r_\ell$ | Random challenges from the verifier. We denote $r_{[<i]} = (r_1, \dots, r_{i-1})$. |
| $\text{eq}(w, X)$ | The multilinear extension of the equality function for $w \in \{0, 1\}^\ell$. |
| $ss, sl, ll$ | Cost of a small-by-small, small-by-large, and large-by-large field multiplication. |
| $\ell_0$ | The number of initial rounds to apply the small-value optimization. |
| $U_d$ | An evaluation set for degree-$d$ polynomials, e.g., $U_d := \{\infty, 0, 1, \dots, d-1\}$. |
### Assumptions
1. **Streaming Access**: We assume the prover can stream evaluations of the base polynomials $p_k(x)$ for $x \in \{0, 1\}^\ell$ in lexicographical order. This is typical in zkVMs where evaluations correspond to an execution trace.
2. **Small Values**: In the "small-value" setting, we assume that for all $k \in [d]$ and $x \in \{0, 1\}^\ell$, the evaluation $p_k(x)$ is "small" (e.g., fits in a single machine word) relative to elements of the prime field $\mathbb{F}$.
## The Baseline: Linear-Time Sumcheck
The sumcheck protocol proves a claim of the form $\sum_{x \in \{0,1\}^\ell} g(x) = C$. In each round $i \in [1, \ell]$, the prover's main task is to compute the univariate polynomial:
\begin{equation}
s_i(X_i) = \sum_{x' \in \{0,1\}^{\ell-i}} g(r_1, \dots, r_{i-1}, X_i, x').
\end{equation}
The prover then sends evaluations of $s_i$ on a set $U_d$ to the verifier. A standard linear-time, linear-space algorithm maintains arrays of polynomial evaluations, halving their size in each round by binding the verifier's challenge.
**Cost Profile**:
* **Round 1**: All initial evaluations $\{p_k(x)\}$ are small. Computing $s_1$ involves only cheap $ss$ multiplications. Binding the first challenge $r_1$ requires $sl$ multiplications.
* **Round 2 onwards**: The polynomial evaluations are now random linear combinations involving $r_1$ and are thus "large" elements of $\mathbb{F}$. All subsequent computations, including products and challenge binding, require expensive $ll$ multiplications.
The dominant cost is therefore concentrated in the first few rounds after round 1, where the summation domain is largest. The total cost is dominated by $\approx \frac{d(d+1)}{2} N \cdot \text{cost}(ll)$. Our goal is to mitigate this high cost.
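To make the cost profile concrete, below is a minimal Python sketch of one round of the standard linear-time algorithm (not the exact implementation from the references): the prover evaluates $s_i$ at a few points, then binds the verifier's challenge, halving every table. The toy modulus, the example tables, and the use of the evaluation points $\{0, 1, 2\}$ rather than $U_d$ are assumptions made for illustration.
```python
P = 2**61 - 1  # toy prime modulus, for illustration only

def round_message(tables, points):
    """Evaluations of s_i at the given (finite) points."""
    half = len(tables[0]) // 2
    msgs = []
    for u in points:
        acc = 0
        for x in range(half):
            prod = 1
            for t in tables:
                lo, hi = t[x], t[x + half]                    # p_k(0, x'), p_k(1, x')
                prod = prod * ((lo + u * (hi - lo)) % P) % P  # p_k(u, x') by multilinearity
            acc = (acc + prod) % P
        msgs.append(acc)
    return msgs

def bind_challenge(tables, r):
    """Fix the current variable to the challenge r, halving every table."""
    half = len(tables[0]) // 2
    return [[(t[x] + r * (t[x + half] - t[x])) % P for x in range(half)]
            for t in tables]

p1 = [3, 1, 4, 1, 5, 9, 2, 6]                  # p_1 over {0,1}^3 (small values)
p2 = [2, 7, 1, 8, 2, 8, 1, 8]                  # p_2 over {0,1}^3 (small values)
print(round_message([p1, p2], [0, 1, 2]))      # round 1: only ss multiplications
tables = bind_challenge([p1, p2], 123456789)   # entries are now large
print(round_message(tables, [0, 1, 2]))        # round 2 onwards: ll multiplications
```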
## Optimization 1: Small Value Sumcheck
This optimization replaces expensive `ll` multiplications with cheaper `ss` and `sl` multiplications during the initial $\ell_0$ rounds of the protocol.
### Core Idea: Lagrange Decomposition
The idea is to separate the computation involving large random challenges from the computation involving small initial evaluations. For a round $i \le \ell_0$, the polynomial $s_i(u)$ for some $u \in U_d$ is a sum of products:
\begin{equation}
s_i(u) = \sum_{x' \in \{0,1\}^{\ell-i}} \underbrace{\prod_{k=1}^d p_k(\overbrace{r_1, \dots, r_{i-1}}^{\text{large}}, \overbrace{u, x'}^{\text{small/binary}})}_{\text{A function of } (r_1, \dots, r_{i-1})}.
\end{equation}
This expression, viewed as a function of the large challenges $r_{[<i]}$, is a multilinear polynomial. We can express it using Lagrange interpolation over a grid of small evaluation points. This yields a decomposition:
\begin{equation}
s_i(u) = \sum_{v \in U_d^{i-1}} \underbrace{\left( \sum_{x' \in \{0,1\}^{\ell-i}} \prod_{k=1}^d p_k(v, u, x') \right)}_{A_i(v, u): \text{ Accumulator}} \cdot \underbrace{L_v(r_1, \dots, r_{i-1})}_{\text{Lagrange Basis Poly}}.
\end{equation}
Here, $L_v$ is a multivariate Lagrange basis polynomial.
* The **Accumulators** $A_i(v,u)$ depend only on evaluations at "small" points from the grid $U_d^{i-1}$ and the hypercube $\{0,1\}^{\ell-i}$. They can be precomputed using only cheap `ss` multiplications.
* The **Online Phase** involves computing the Lagrange polynomials at the large challenges $r_{[<i]}$ and then taking an inner product with the precomputed accumulator vector. This primarily involves `sl` multiplications.
:::info
### Example: Small-Value Optimization
Let's illustrate the Lagrange decomposition with a simple setup.
* **Parameters**:
* Number of polynomials $d=2$.
* Number of variables $\ell=3$.
* Number of optimized rounds $\ell_0=2$.
* **Polynomials**:
* $p_1(X_1, X_2, X_3) = X_1 + X_2 + X_3$
* $p_2(X_1, X_2, X_3) = X_1 + 2X_2$
* The polynomial being summed is $g(X) = p_1(X) \cdot p_2(X)$.
* **Goal**: Compute the prover's message for **round $i=2$**. After receiving the first random challenge $r_1$, the prover must compute $s_2(u)$ for $u \in U_2 = \{\infty, 0, 1\}$. We'll compute the specific value $s_2(1)$.
The direct computation would be:
\begin{equation}
s_2(1) = \sum_{x_3 \in \{0,1\}} \underbrace{p_1(r_1, 1, x_3)}_{\text{large}} \cdot \underbrace{p_2(r_1, 1, x_3)}_{\text{large}}
\end{equation}
Since $r_1$ is a large, random field element, both $p_1$ and $p_2$ evaluations are large, making this an expensive `ll` multiplication inside the sum.
**Applying the Optimization**
The prover uses the Lagrange decomposition for round $i=2$. The expression for $s_2(u)$ is a quadratic polynomial in $r_1$, so we can interpolate it from its evaluations at three small points, which are our accumulators. The formula is:
\begin{equation}
s_2(u) = \underbrace{A_2(\infty, u)}_{\text{coeff of } r_1^2} \cdot r_1(r_1-1) + \underbrace{A_2(0, u)}_{f(0)} \cdot (1-r_1) + \underbrace{A_2(1, u)}_{f(1)} \cdot r_1
\end{equation}
**Step 1: Precomputation (Offline)**
The prover computes the accumulators for $u=1$. These calculations only involve small values (all inputs come from the small grid and the hypercube $\{0,1\}$), so they are cheap `ss` operations.
* **Compute $A_2(0, 1)$**:
\begin{align*}
A_2(0, 1) &= \sum_{x_3 \in \{0,1\}} p_1(0, 1, x_3) \cdot p_2(0, 1, x_3) \\
&= (p_1(0,1,0) \cdot p_2(0,1,0)) + (p_1(0,1,1) \cdot p_2(0,1,1)) \\
&= ((0+1+0) \cdot (0+2(1))) + ((0+1+1) \cdot (0+2(1))) \\
&= (1 \cdot 2) + (2 \cdot 2) = \mathbf{6}
\end{align*}
* **Compute $A_2(1, 1)$**:
\begin{align*}
A_2(1, 1) &= \sum_{x_3 \in \{0,1\}} p_1(1, 1, x_3) \cdot p_2(1, 1, x_3) \\
&= (p_1(1,1,0) \cdot p_2(1,1,0)) + (p_1(1,1,1) \cdot p_2(1,1,1)) \\
&= ((1+1+0) \cdot (1+2(1))) + ((1+1+1) \cdot (1+2(1))) \\
&= (2 \cdot 3) + (3 \cdot 3) = \mathbf{15}
\end{align*}
* **Compute $A_2(\infty, 1)$** (the leading coefficient):
This is the sum of the leading coefficients (in $X_1$) of the product. The leading coefficient of $p_1(X_1, 1, x_3)$ is 1, and for $p_2(X_1, 1, x_3)$ it's also 1.
\begin{align*}
A_2(\infty, 1) &= \sum_{x_3 \in \{0,1\}} \text{coeff}(p_1, X_1) \cdot \text{coeff}(p_2, X_1) \\
&= (1 \cdot 1) + (1 \cdot 1) = \mathbf{2}
\end{align*}
**Step 2: Online Computation**
The verifier sends a random challenge, say $r_1 = 10$. The prover computes $s_2(1)$ using the precomputed accumulators and the Lagrange formula.
\begin{align*}
s_2(1) &= A_2(\infty, 1) \cdot r_1(r_1-1) + A_2(0, 1) \cdot (1-r_1) + A_2(1, 1) \cdot r_1 \\
&= \underbrace{2 \cdot (10 \cdot 9)}_{sl} + \underbrace{6 \cdot (1-10)}_{sl} + \underbrace{15 \cdot 10}_{sl} \\
&= 180 - 54 + 150 = \mathbf{276}
\end{align*}
Notice that this computation only involves multiplying a small precomputed accumulator by a large value derived from $r_1$—these are all cheaper `sl` multiplications.
**Verification (Comparing with Direct Method)**
Let's compute $s_2(1)$ directly with $r_1=10$ to check our work.
\begin{align*}
s_2(1) &= \sum_{x_3 \in \{0,1\}} p_1(10, 1, x_3) \cdot p_2(10, 1, x_3) \\
&= (p_1(10,1,0) \cdot p_2(10,1,0)) + (p_1(10,1,1) \cdot p_2(10,1,1)) \\
&= ((10+1+0) \cdot (10+2)) + ((10+1+1) \cdot (10+2)) \\
&= \underbrace{(11 \cdot 12)}_{ll} + \underbrace{(12 \cdot 12)}_{ll} \\
&= 132 + 144 = \mathbf{276}
\end{align*}
We can see that we arrive at the same result in both cases.
The optimization successfully replaced two expensive `ll` multiplications with a cheap `ss` precomputation and a fast `sl` online phase.
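A short Python sketch that reproduces this worked example end to end (plain integer arithmetic stands in for field arithmetic; `p1`, `p2`, $u=1$, and $r_1=10$ are the values used above):
```python
p1 = lambda x1, x2, x3: x1 + x2 + x3
p2 = lambda x1, x2, x3: x1 + 2 * x2

u, r1 = 1, 10

# Step 1 (offline): accumulators from small evaluations only (ss multiplications).
A0   = sum(p1(0, u, x3) * p2(0, u, x3) for x3 in (0, 1))   # A_2(0, 1)   = 6
A1   = sum(p1(1, u, x3) * p2(1, u, x3) for x3 in (0, 1))   # A_2(1, 1)   = 15
Ainf = sum(1 * 1 for _ in (0, 1))                          # A_2(inf, 1) = 2

# Step 2 (online): small accumulators times challenge-derived values (sl mults).
s2_opt = Ainf * r1 * (r1 - 1) + A0 * (1 - r1) + A1 * r1

# Direct computation for comparison (ll multiplications).
s2_direct = sum(p1(r1, u, x3) * p2(r1, u, x3) for x3 in (0, 1))

assert s2_opt == s2_direct == 276
```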
:::
### The Hybrid Algorithm
A hybrid strategy is employed:
1. **Precomputation (Offline)**: For the first $\ell_0$ rounds, compute all necessary accumulator values $A_i(v, u)$. This is the most expensive part but uses only `ss` multiplications over streamed data.
2. **Rounds $i \in [1, \ell_0]$ (Online)**: For each round, compute the prover message $s_i(u)$ as the inner product from the Lagrange formula (see the sketch after this list). This is fast and dominated by `sl` multiplications.
3. **Transition (Round $\ell_0+1$)**: After round $\ell_0$, explicitly compute the evaluation arrays $\{p_k(r_{[\le \ell_0]}, x')\}$ for the remaining variables $x'$.
4. **Rounds $i \in [\ell_0+1, \ell]$ (Online)**: Switch back to the standard linear-time algorithm, which now operates on a smaller domain of size $N/2^{\ell_0}$ and uses `ll` multiplications.
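As a sketch of step 2, the snippet below computes the online inner product between an accumulator column $A_i(\cdot, u)$ and the Lagrange basis evaluated at the already-bound challenges. To stay short it uses the all-finite grid $\{0, 1, 2\}$ instead of $U_d$'s point at infinity; the modulus, the helper names, and the ordering of the accumulator column (matching `itertools.product` over the grid) are assumptions of this sketch.
```python
from itertools import product

P = 2**61 - 1          # toy prime modulus (assumption)
GRID = (0, 1, 2)       # finite stand-in for U_d (the infinity point is omitted)

def lagrange_1d(point, r):
    """L_point(r): the polynomial that is 1 at `point` and 0 elsewhere on GRID."""
    num, den = 1, 1
    for q in GRID:
        if q != point:
            num = num * (r - q) % P
            den = den * (point - q) % P
    return num * pow(den % P, -1, P) % P

def online_message(acc_column, challenges):
    """s_i(u) = sum_v A_i(v, u) * L_v(r_1, ..., r_{i-1})  (mostly sl multiplications)."""
    total = 0
    for v, a in zip(product(GRID, repeat=len(challenges)), acc_column):
        basis = 1
        for v_j, r_j in zip(v, challenges):
            basis = basis * lagrange_1d(v_j, r_j) % P
        total = (total + a * basis) % P
    return total

# Matches the worked example above: on this all-finite grid the accumulators are
# A_2(0,1)=6, A_2(1,1)=15, A_2(2,1)=28, and the challenge is r_1 = 10.
print(online_message([6, 15, 28], [10]))   # 276
```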
### Optimal Switchover Point $\ell_0$
The optimal choice of $\ell_0$ balances the exponential cost of precomputation against the savings in the standard rounds. If `ll` multiplications are $\kappa$ times more expensive than `ss` multiplications, the optimal $\ell_0$ is approximately:
\begin{equation}
\ell_0 \approx \log_{d+1} \left( \frac{\kappa \cdot d^2}{2(d-1)} \right).
\end{equation}
This formula balances the precomputation cost, which grows as $(d+1)^{\ell_0}$, with the linear-time cost, which is reduced by a factor of $2^{\ell_0}$.
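A quick numeric check of this heuristic; the inputs $\kappa = 30$ and $d = 3$ are made-up values for illustration:
```python
import math

def switchover(kappa, d):
    """ell_0 ~ log_{d+1}( kappa * d^2 / (2 * (d - 1)) )."""
    return math.log(kappa * d**2 / (2 * (d - 1)), d + 1)

print(switchover(kappa=30, d=3))   # ~3.0, i.e. run roughly three small-value rounds
```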
## Optimization 2: Equality Polynomial Sumcheck
This optimization targets the common case where $g(X) = \text{eq}(w, X) \cdot p(X)$. It drastically reduces memory by avoiding the need to store the $N$-sized evaluation table for $\text{eq}(w, X)$.
### Core Idea: Variable Splitting and Factoring
The multilinear equality polynomial has a factorization property. If we split the variables $X$ into two halves, $X = (X_L, X_R)$, then:
\begin{equation}
\text{eq}(w, X) = \text{eq}((w_L, w_R), (X_L, X_R)) = \underbrace{\text{eq}(w_L, X_L)}_{\text{Depends on } X_L} \cdot \underbrace{\text{eq}(w_R, X_R)}_{\text{Depends on } X_R}.
\end{equation}
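A small Python check of this factorization on Boolean inputs (as polynomials the identity holds over the entire field; the vector `w` below is arbitrary):
```python
from itertools import product

def eq(w, x):
    """Multilinear equality polynomial: prod_i (w_i * x_i + (1 - w_i) * (1 - x_i))."""
    out = 1
    for wi, xi in zip(w, x):
        out *= wi * xi + (1 - wi) * (1 - xi)
    return out

w = (1, 0, 1, 0)
for x in product((0, 1), repeat=4):
    assert eq(w, x) == eq(w[:2], x[:2]) * eq(w[2:], x[2:])
```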
We can exploit this by splitting the summation domain. In round $i$, the sum for $t_i(u)$ (the non-trivial part of the prover's message; the remaining factor $\text{eq}(w_{\le i}, (r_{[<i]}, u))$ over the already-bound variables and $X_i$ is multiplied in separately) can be rewritten as a nested sum:
\begin{equation}
t_i(u) = \sum_{x' \in \{0,1\}^{\ell-i}} \text{eq}(w_{>i}, x') \cdot p(r_{[<i]}, u, x').
\end{equation}
By splitting $x'$ into $x_L$ and $x_R$, we get:
\begin{equation}
t_i(u) = \sum_{x_R} \text{eq}(w_R, x_R) \cdot \left( \sum_{x_L} \text{eq}(w_L, x_L) \cdot p(r_{[<i]}, u, x_L, x_R) \right).
\end{equation}
This nested structure is key. The prover can first compute the inner sum over the smaller domain $\{0,1\}^{|x_L|}$ and then use those results in the outer sum.
### The Algorithm
The prover operates as follows for the first $\ell/2$ rounds:
1. **Precompute**: The prover only needs tables for $\text{eq}(w_L, \cdot)$ and $\text{eq}(w_R, \cdot)$, which are of size $O(\sqrt{N})$, instead of one table of size $O(N)$.
2. **Compute Inner Sum**: For each value of $x_R$, compute the inner sum over all $x_L$.
3. **Compute Outer Sum**: Combine the results from the inner sums, weighted by $\text{eq}(w_R, x_R)$, to get the final value.
This method reduces memory from $O(N)$ to $O(\sqrt{N})$ and saves approximately $N \cdot \text{cost}(ll)$ operations by avoiding multiplications with the full `eq` table.
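The sketch below mirrors these three steps. The dictionary `p_vals`, which stands in for the already-bound evaluations $p(r_{[<i]}, u, x_L, x_R)$, its key layout, and the use of plain integers are assumptions of this sketch; the zkVM example below exercises exactly this pattern.
```python
from itertools import product

def eq_table(w):
    """eq(w, x) for every Boolean x of the same length as w: O(2^len(w)) entries."""
    tab = {}
    for x in product((0, 1), repeat=len(w)):
        val = 1
        for wi, xi in zip(w, x):
            val *= wi * xi + (1 - wi) * (1 - xi)
        tab[x] = val
    return tab

def t_i(w_L, w_R, p_vals):
    """Nested sum using two sqrt(N)-sized eq tables instead of one N-sized table."""
    eq_L, eq_R = eq_table(w_L), eq_table(w_R)           # step 1: precompute
    total = 0
    for x_R, e_R in eq_R.items():
        inner = sum(e_L * p_vals[x_L + x_R]             # step 2: inner sum over x_L
                    for x_L, e_L in eq_L.items())
        total += e_R * inner                            # step 3: weighted outer sum
    return total
```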
:::info
### Example with a zkVM Trace
Let's set up a toy example to illustrate the idea: proving a fact about a program's execution trace inside a simple zkVM.
The core idea is to use the `eq` polynomial as a selector—a tool to pinpoint one exact state out of thousands or millions and check its properties.
#### The Scenario: Pinpointing a State in a VM Trace
Imagine we have a short, 4-cycle execution trace. We want to prove that the value in **Register A** was exactly **42** at a specific moment: during **cycle 2**, when the instruction was **LOAD**.
* **State Representation**: We can represent any state in our trace using $\ell=4$ binary variables. The total number of states is $N=2^4=16$.
* $(X_1, X_2)$: These two bits encode the **cycle number** (from `00` for cycle 0 to `11` for cycle 3).
* $(X_3, X_4)$: These two bits encode the **instruction opcode** (e.g., `00`=ADD, `01`=MUL, `10`=LOAD).
* **Trace Polynomial $p(X)$**: This is a multilinear polynomial that represents the entire execution trace. When evaluated at a specific state $x$, $p(x)$ gives us the value that was in **Register A** at that time.
* **Target Vector $w$**: This is our selector. It defines the exact state we want to isolate.
* Cycle 2 is binary `10`.
* The LOAD opcode is binary `10`.
* Therefore, our target state is $w = (1, 0, 1, 0)$.
* **The Goal**: The prover must compute the sum $S = \sum_{x \in \{0,1\}^4} \text{eq}(w, x) \cdot p(x)$. This sum effectively filters the entire trace, isolating the value of Register A at our target state. If the trace is valid, the result should be 42.
### The Brute-Force Method: The Memory Bottleneck
The straightforward approach would be to iterate through all 16 states of the trace. The `eq(w, x)` term would be zero for 15 of them, and one for the state where $x=w$. The sum would correctly collapse to $p(w)$.
The problem isn't the computation, but the **memory**. This method conceptually requires a lookup table for the polynomial $p(X)$ over all $N=16$ states. For a real-world trace with $\ell=30$ variables (${\sim}1$ billion states), that table becomes prohibitively large. ❌
### The Optimized Method: Divide and Conquer
This is where variable splitting shines. Instead of treating the state as a single 4-bit string, we recognize its internal structure.
**Step 1: Deconstruct the State and Reduce Memory**
We split the variables according to their meaning:
* **Cycle Variables**: $X_L = (X_1, X_2)$ with the target cycle $w_L = (1, 0)$.
* **Opcode Variables**: $X_R = (X_3, X_4)$ with the target opcode $w_R = (1, 0)$.
This allows us to replace one giant table of size 16 with two small tables of size 4: one for cycle checks and one for opcode checks.
* **Memory Required**: $4 + 4 = 8$ locations, instead of 16.
* **The General Case**: The memory cost drops from $O(N)$ to $O(\sqrt{N})$, i.e., from $2^\ell$ to $2^{\ell/2}$ table entries. This square-root reduction is the key benefit.
**Step 2: The Nested Sum - A Two-Stage Filter**
The sum is now rewritten as a nested computation, which acts as a two-stage filter:
\begin{equation}
S = \underbrace{\sum_{x_R \in \{\text{opcodes}\}} \text{eq}(w_R, x_R)}_{\text{Stage 1: Filter by Opcode}} \cdot \left( \underbrace{\sum_{x_L \in \{\text{cycles}\}} \text{eq}(w_L, x_L) \cdot p(x_L, x_R)}_{\text{Stage 2: Filter by Cycle}} \right)
\end{equation}
**Step 3: Execution Flow**
The prover executes this nested sum efficiently:
1. **Stage 1: Filter by Opcode**. The prover iterates through all possible opcodes. The term $\text{eq}(w_R, x_R) = \text{eq}((1,0), x_R)$ will be 1 only when $x_R$ is `(1,0)` (LOAD). For all other opcodes, the result is 0, so the prover can discard those parts of the trace without doing any further work on them.
2. **Stage 2: Filter by Cycle**. Now, working only on the subset of the trace related to the LOAD instruction, the prover iterates through the cycles. The term $\text{eq}(w_L, x_L) = \text{eq}((1,0), x_L)$ will be 1 only when $x_L$ is `(1,0)` (Cycle 2).
The inner sum therefore collapses to a single term:
\begin{align*}
\text{InnerSum} &= 1 \cdot p(\text{cycle } (1,0), \text{opcode } (1,0)) \\
&= p(1,0,1,0)
\end{align*}
Assuming a correct trace, we know $p(1,0,1,0) = 42$.
3. **Final Combination**. The final result is the single non-zero contribution from the outer loop, which is $1 \cdot \text{InnerSum} = 42$.
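A self-contained Python check of this toy trace (the Register A values other than 42 are made up):
```python
from itertools import product

def eq(w, x):
    out = 1
    for wi, xi in zip(w, x):
        out *= wi * xi + (1 - wi) * (1 - xi)
    return out

# Toy Register A column: 42 at state (cycle 2, LOAD) = (1,0,1,0), 7 elsewhere.
p = {x: 7 for x in product((0, 1), repeat=4)}
p[(1, 0, 1, 0)] = 42

w_L, w_R = (1, 0), (1, 0)                    # target cycle and target opcode
S = sum(eq(w_R, x_R) * sum(eq(w_L, x_L) * p[x_L + x_R]
                           for x_L in product((0, 1), repeat=2))
        for x_R in product((0, 1), repeat=2))
print(S)   # 42
```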
**Remark**
Both methods yield the same answer. However, the optimized "divide and conquer" strategy transforms a huge memory problem into a highly efficient, targeted lookup. By splitting the variables and processing the sum as a nested filter, the prover can navigate an exponentially large domain while storing only square-root-sized `eq` tables.
:::
## Summary
This document outlined two primary optimizations for the sumcheck prover that deliver significant performance improvements.
1. **Small-Value Optimization**: By using a hybrid strategy with a Lagrange-based precomputation phase for the first $\ell_0$ rounds, we replace a large number of expensive `ll` multiplications with cheaper `ss` and `sl` ones. This yields both asymptotic and practical speedups.
2. **Equality Polynomial Optimization**: By splitting the summation domain and factoring the `eq` polynomial, we reduce the prover's memory from $O(N)$ to $O(\sqrt{N})$ and save a substantial number of field multiplications.
When combined, these techniques make sumcheck-based systems like Jolt significantly faster and more memory-efficient, reducing the prover runtime by factors of 2-3x in typical scenarios and over 20x in memory-constrained settings.