# Optimizing the Sumcheck Prover: Small Values and Equality Polynomials
The sumcheck protocol is a core building block of modern SNARKs, but its prover runtime can be a significant bottleneck. This document details two complementary optimizations that accelerate sumcheck proving in common applications: **(1)** when the polynomial evaluations are small relative to the field size, and **(2)** when the summed polynomial contains an equality polynomial factor $\text{eq}(w, X)$.
References used for this note:
- *Bagad et al. paper:* https://eprint.iacr.org/2025/1117.pdf
- LambdaClass blog posts:
  - *Optimizing sumcheck:* https://blog.lambdaclass.com/optimizing-sumcheck/
  - *How factoring equality polynomials optimizes sumcheck:* https://blog.lambdaclass.com/how-factoring-equality-polynomials-optimizes-sumcheck/
## Notation and Assumptions
### Notation
| Symbol | Definition |
| :--- | :--- |
| $\ell$ | The number of variables in the multilinear polynomials. |
| $N$ | The size of the summation domain, $N = 2^\ell$. |
| $\mathbb{F}$ | A finite field. |
| $p_1, \dots, p_d$ | $d$ multilinear polynomials in $\ell$ variables. |
| $g(X)$ | The polynomial being summed, often $g(X) = \prod_{k=1}^d p_k(X)$. |
| $s_i(X_i)$ | The univariate polynomial sent by the prover in round $i$. |
| $r_1, \dots, r_\ell$ | Random challenges from the verifier. We denote $r_{[<i]} = (r_1, \dots, r_{i-1})$. |
| $\text{eq}(w, X)$ | The multilinear extension of the equality function for $w \in \{0, 1\}^\ell$. |
| $ss, sl, ll$ | Cost of a small-by-small, small-by-large, and large-by-large field multiplication. |
| $\ell_0$ | The number of initial rounds to apply the small-value optimization. |
| $U_d$ | An evaluation set for degree-$d$ polynomials, e.g., $U_d := \{\infty, 0, 1, \dots, d-1\}$. |
### Assumptions
1. **Streaming Access**: We assume the prover can stream evaluations of the base polynomials $p_k(x)$ for $x \in \{0, 1\}^\ell$ in lexicographical order. This is typical in zkVMs where evaluations correspond to an execution trace.
2. **Small Values**: In the "small-value" setting, we assume that for all $k \in [d]$ and $x \in \{0, 1\}^\ell$, the evaluation $p_k(x)$ is "small" (e.g., fits in a single machine word) relative to elements of the prime field $\mathbb{F}$.
## The Baseline: Linear-Time Sumcheck
The sumcheck protocol proves a claim of the form $\sum_{x \in \{0,1\}^\ell} g(x) = C$. In each round $i \in [1, \ell]$, the prover's main task is to compute the univariate polynomial:
\begin{equation}
s_i(X_i) = \sum_{x' \in \{0,1\}^{\ell-i}} g(r_1, \dots, r_{i-1}, X_i, x').
\end{equation}
The prover then sends evaluations of $s_i$ on a set $U_d$ to the verifier. A standard linear-time, linear-space algorithm maintains arrays of polynomial evaluations, halving their size in each round by binding the verifier's challenge.
**Cost Profile**:
* **Round 1**: All initial evaluations $\{p_k(x)\}$ are small. Computing $s_1$ involves only cheap $ss$ multiplications. Binding the first challenge $r_1$ requires $sl$ multiplications.
* **Round 2 onwards**: The polynomial evaluations are now random linear combinations involving $r_1$ and are thus "large" elements of $\mathbb{F}$. All subsequent computations, including products and challenge binding, require expensive $ll$ multiplications.
The dominant cost is therefore concentrated in the first few rounds after round 1, where the summation domain is largest. The total cost is dominated by $\approx \frac{d(d+1)}{2} N \cdot \text{cost}(ll)$. Our goal is to mitigate this high cost.
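To make the cost profile concrete, below is a minimal Python sketch of one round of the standard linear-time algorithm (not the exact implementation from the references): the prover evaluates $s_i$ at a few points, then binds the verifier's challenge, halving every table. The toy modulus, the example tables, and the use of the evaluation points $\{0, 1, 2\}$ rather than $U_d$ are assumptions made for illustration.
```python
P = 2**61 - 1  # toy prime modulus, for illustration only

def round_message(tables, points):
    """Evaluations of s_i at the given (finite) points."""
    half = len(tables[0]) // 2
    msgs = []
    for u in points:
        acc = 0
        for x in range(half):
            prod = 1
            for t in tables:
                lo, hi = t[x], t[x + half]                    # p_k(0, x'), p_k(1, x')
                prod = prod * ((lo + u * (hi - lo)) % P) % P  # p_k(u, x') by multilinearity
            acc = (acc + prod) % P
        msgs.append(acc)
    return msgs

def bind_challenge(tables, r):
    """Fix the current variable to the challenge r, halving every table."""
    half = len(tables[0]) // 2
    return [[(t[x] + r * (t[x + half] - t[x])) % P for x in range(half)]
            for t in tables]

p1 = [3, 1, 4, 1, 5, 9, 2, 6]                  # p_1 over {0,1}^3 (small values)
p2 = [2, 7, 1, 8, 2, 8, 1, 8]                  # p_2 over {0,1}^3 (small values)
print(round_message([p1, p2], [0, 1, 2]))      # round 1: only ss multiplications
tables = bind_challenge([p1, p2], 123456789)   # entries are now large
print(round_message(tables, [0, 1, 2]))        # round 2 onwards: ll multiplications
```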
## Optimization 1: Small Value Sumcheck
This optimization replaces expensive `ll` multiplications with cheaper `ss` and `sl` multiplications during the initial $\ell_0$ rounds of the protocol.
### Core Idea: Lagrange Decomposition
The idea is to separate the computation involving large random challenges from the computation involving small initial evaluations. For a round $i \le \ell_0$, the polynomial $s_i(u)$ for some $u \in U_d$ is a sum of products:
\begin{equation}
s_i(u) = \sum_{x' \in \{0,1\}^{\ell-i}} \underbrace{\prod_{k=1}^d p_k(\overbrace{r_1, \dots, r_{i-1}}^{\text{large}}, \overbrace{u, x'}^{\text{small/binary}})}_{\text{A function of } (r_1, \dots, r_{i-1})}.
\end{equation}
This expression, viewed as a function of the large challenges $r_{[<i]}$, is a multilinear polynomial. We can express it using Lagrange interpolation over a grid of small evaluation points. This yields a decomposition:
\begin{equation}
s_i(u) = \sum_{v \in U_d^{i-1}} \underbrace{\left( \sum_{x' \in \{0,1\}^{\ell-i}} \prod_{k=1}^d p_k(v, u, x') \right)}_{A_i(v, u): \text{ Accumulator}} \cdot \underbrace{L_v(r_1, \dots, r_{i-1})}_{\text{Lagrange Basis Poly}}.
\end{equation}
Here, $L_v$ is a multivariate Lagrange basis polynomial.
* The **Accumulators** $A_i(v,u)$ depend only on evaluations at "small" points from the grid $U_d^{i-1}$ and the hypercube $\{0,1\}^{\ell-i}$. They can be precomputed using only cheap `ss` multiplications.
* The **Online Phase** involves computing the Lagrange polynomials at the large challenges $r_{[<i]}$ and then taking an inner product with the precomputed accumulator vector. This primarily involves `sl` multiplications.
:::info
### Example: Small-Value Optimization
Let's illustrate the Lagrange decomposition with a simple setup.
* **Parameters**:
* Number of polynomials $d=2$.
* Number of variables $\ell=3$.
* Number of optimized rounds $\ell_0=2$.
* **Polynomials**:
* $p_1(X_1, X_2, X_3) = X_1 + X_2 + X_3$
* $p_2(X_1, X_2, X_3) = X_1 + 2X_2$
* The polynomial being summed is $g(X) = p_1(X) \cdot p_2(X)$.
* **Goal**: Compute the prover's message for **round $i=2$**. After receiving the first random challenge $r_1$, the prover must compute $s_2(u)$ for $u \in U_2 = \{\infty, 0, 1\}$. We'll compute the specific value $s_2(1)$.
The direct computation would be:
\begin{equation}
s_2(1) = \sum_{x_3 \in \{0,1\}} \underbrace{p_1(r_1, 1, x_3)}_{\text{large}} \cdot \underbrace{p_2(r_1, 1, x_3)}_{\text{large}}
\end{equation}
Since $r_1$ is a large, random field element, both $p_1$ and $p_2$ evaluations are large, making this an expensive `ll` multiplication inside the sum.
**Applying the Optimization**
The prover uses the Lagrange decomposition for round $i=2$. The expression for $s_2(u)$ is a quadratic polynomial in $r_1$, so we can interpolate it from its evaluations at three small points, which are our accumulators. The formula is:
\begin{equation}
s_2(u) = \underbrace{A_2(\infty, u)}_{\text{coeff of } r_1^2} \cdot r_1(r_1-1) + \underbrace{A_2(0, u)}_{f(0)} \cdot (1-r_1) + \underbrace{A_2(1, u)}_{f(1)} \cdot r_1
\end{equation}
**Step 1: Precomputation (Offline)**
The prover computes the accumulators for $u=1$. These calculations only involve small values (all inputs come from the small grid and the hypercube $\{0,1\}$), so they are cheap `ss` operations.
* **Compute $A_2(0, 1)$**:
\begin{align*}
A_2(0, 1) &= \sum_{x_3 \in \{0,1\}} p_1(0, 1, x_3) \cdot p_2(0, 1, x_3) \\
&= (p_1(0,1,0) \cdot p_2(0,1,0)) + (p_1(0,1,1) \cdot p_2(0,1,1)) \\
&= ((0+1+0) \cdot (0+2(1))) + ((0+1+1) \cdot (0+2(1))) \\
&= (1 \cdot 2) + (2 \cdot 2) = \mathbf{6}
\end{align*}
* **Compute $A_2(1, 1)$**:
\begin{align*}
A_2(1, 1) &= \sum_{x_3 \in \{0,1\}} p_1(1, 1, x_3) \cdot p_2(1, 1, x_3) \\
&= (p_1(1,1,0) \cdot p_2(1,1,0)) + (p_1(1,1,1) \cdot p_2(1,1,1)) \\
&= ((1+1+0) \cdot (1+2(1))) + ((1+1+1) \cdot (1+2(1))) \\
&= (2 \cdot 3) + (3 \cdot 3) = \mathbf{15}
\end{align*}
* **Compute $A_2(\infty, 1)$** (the leading coefficient):
This is the sum of the leading coefficients (in $X_1$) of the product. The leading coefficient of $p_1(X_1, 1, x_3)$ is 1, and for $p_2(X_1, 1, x_3)$ it's also 1.
\begin{align*}
A_2(\infty, 1) &= \sum_{x_3 \in \{0,1\}} \text{coeff}(p_1, X_1) \cdot \text{coeff}(p_2, X_1) \\
&= (1 \cdot 1) + (1 \cdot 1) = \mathbf{2}
\end{align*}
**Step 2: Online Computation**
The verifier sends a random challenge, say $r_1 = 10$. The prover computes $s_2(1)$ using the precomputed accumulators and the Lagrange formula.
\begin{align*}
s_2(1) &= A_2(\infty, 1) \cdot r_1(r_1-1) + A_2(0, 1) \cdot (1-r_1) + A_2(1, 1) \cdot r_1 \\
&= \underbrace{2 \cdot (10 \cdot 9)}_{sl} + \underbrace{6 \cdot (1-10)}_{sl} + \underbrace{15 \cdot 10}_{sl} \\
&= 180 - 54 + 150 = \mathbf{276}
\end{align*}
Notice that this computation only involves multiplying a small precomputed accumulator by a large value derived from $r_1$—these are all cheaper `sl` multiplications.
**Verification (Comparing with Direct Method)**
Let's compute $s_2(1)$ directly with $r_1=10$ to check our work.
\begin{align*}
s_2(1) &= \sum_{x_3 \in \{0,1\}} p_1(10, 1, x_3) \cdot p_2(10, 1, x_3) \\
&= (p_1(10,1,0) \cdot p_2(10,1,0)) + (p_1(10,1,1) \cdot p_2(10,1,1)) \\
&= ((10+1+0) \cdot (10+2)) + ((10+1+1) \cdot (10+2)) \\
&= \underbrace{(11 \cdot 12)}_{ll} + \underbrace{(12 \cdot 12)}_{ll} \\
&= 132 + 144 = \mathbf{276}
\end{align*}
We can see that we arrive at the same result in both cases.
The optimization successfully replaced two expensive `ll` multiplications with a cheap `ss` precomputation and a fast `sl` online phase.
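A short Python sketch that reproduces this worked example end to end (plain integer arithmetic stands in for field arithmetic; `p1`, `p2`, $u=1$, and $r_1=10$ are the values used above):
```python
p1 = lambda x1, x2, x3: x1 + x2 + x3
p2 = lambda x1, x2, x3: x1 + 2 * x2

u, r1 = 1, 10

# Step 1 (offline): accumulators from small evaluations only (ss multiplications).
A0   = sum(p1(0, u, x3) * p2(0, u, x3) for x3 in (0, 1))   # A_2(0, 1)   = 6
A1   = sum(p1(1, u, x3) * p2(1, u, x3) for x3 in (0, 1))   # A_2(1, 1)   = 15
Ainf = sum(1 * 1 for _ in (0, 1))                          # A_2(inf, 1) = 2

# Step 2 (online): small accumulators times challenge-derived values (sl mults).
s2_opt = Ainf * r1 * (r1 - 1) + A0 * (1 - r1) + A1 * r1

# Direct computation for comparison (ll multiplications).
s2_direct = sum(p1(r1, u, x3) * p2(r1, u, x3) for x3 in (0, 1))

assert s2_opt == s2_direct == 276
```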
:::
### The Hybrid Algorithm
A hybrid strategy is employed:
1. **Precomputation (Offline)**: For the first $\ell_0$ rounds, compute all necessary accumulator values $A_i(v, u)$. This is the most expensive part but uses only `ss` multiplications over streamed data.
2. **Rounds $i \in [1, \ell_0]$ (Online)**: For each round, compute the prover message $s_i(u)$ as the inner product from the Lagrange formula (see the sketch after this list). This is fast and dominated by `sl` multiplications.
3. **Transition (Round $\ell_0+1$)**: After round $\ell_0$, explicitly compute the evaluation arrays $\{p_k(r_{[\le \ell_0]}, x')\}$ for the remaining variables $x'$.
4. **Rounds $i \in [\ell_0+1, \ell]$ (Online)**: Switch back to the standard linear-time algorithm, which now operates on a smaller domain of size $N/2^{\ell_0}$ and uses `ll` multiplications.
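As a sketch of step 2, the snippet below computes the online inner product between an accumulator column $A_i(\cdot, u)$ and the Lagrange basis evaluated at the already-bound challenges. To stay short it uses the all-finite grid $\{0, 1, 2\}$ instead of $U_d$'s point at infinity; the modulus, the helper names, and the ordering of the accumulator column (matching `itertools.product` over the grid) are assumptions of this sketch.
```python
from itertools import product

P = 2**61 - 1          # toy prime modulus (assumption)
GRID = (0, 1, 2)       # finite stand-in for U_d (the infinity point is omitted)

def lagrange_1d(point, r):
    """L_point(r): the polynomial that is 1 at `point` and 0 elsewhere on GRID."""
    num, den = 1, 1
    for q in GRID:
        if q != point:
            num = num * (r - q) % P
            den = den * (point - q) % P
    return num * pow(den % P, -1, P) % P

def online_message(acc_column, challenges):
    """s_i(u) = sum_v A_i(v, u) * L_v(r_1, ..., r_{i-1})  (mostly sl multiplications)."""
    total = 0
    for v, a in zip(product(GRID, repeat=len(challenges)), acc_column):
        basis = 1
        for v_j, r_j in zip(v, challenges):
            basis = basis * lagrange_1d(v_j, r_j) % P
        total = (total + a * basis) % P
    return total

# Matches the worked example above: on this all-finite grid the accumulators are
# A_2(0,1)=6, A_2(1,1)=15, A_2(2,1)=28, and the challenge is r_1 = 10.
print(online_message([6, 15, 28], [10]))   # 276
```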
### Optimal Switchover Point $\ell_0$
The optimal choice of $\ell_0$ balances the exponential cost of precomputation against the savings in the standard rounds. If `ll` multiplications are $\kappa$ times more expensive than `ss` multiplications, the optimal $\ell_0$ is approximately:
\begin{equation}
\ell_0 \approx \log_{d+1} \left( \frac{\kappa \cdot d^2}{2(d-1)} \right).
\end{equation}
This formula balances the precomputation cost, which grows as $(d+1)^{\ell_0}$, with the linear-time cost, which is reduced by a factor of $2^{\ell_0}$.
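A quick numeric check of this heuristic; the inputs $\kappa = 30$ and $d = 3$ are made-up values for illustration:
```python
import math

def switchover(kappa, d):
    """ell_0 ~ log_{d+1}( kappa * d^2 / (2 * (d - 1)) )."""
    return math.log(kappa * d**2 / (2 * (d - 1)), d + 1)

print(switchover(kappa=30, d=3))   # ~3.0, i.e. run roughly three small-value rounds
```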
## Optimization 2: Equality Polynomial Sumcheck
This optimization targets the common case where $g(X) = \text{eq}(w, X) \cdot p(X)$. It drastically reduces memory by avoiding the need to store the $N$-sized evaluation table for $\text{eq}(w, X)$.
### Core Idea: Variable Splitting and Factoring
The multilinear equality polynomial has a factorization property. If we split the variables $X$ into two halves, $X = (X_L, X_R)$, then:
\begin{equation}
\text{eq}(w, X) = \text{eq}((w_L, w_R), (X_L, X_R)) = \underbrace{\text{eq}(w_L, X_L)}_{\text{Depends on } X_L} \cdot \underbrace{\text{eq}(w_R, X_R)}_{\text{Depends on } X_R}.
\end{equation}
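A small Python check of this factorization on Boolean inputs (as polynomials the identity holds over the entire field; the vector `w` below is arbitrary):
```python
from itertools import product

def eq(w, x):
    """Multilinear equality polynomial: prod_i (w_i * x_i + (1 - w_i) * (1 - x_i))."""
    out = 1
    for wi, xi in zip(w, x):
        out *= wi * xi + (1 - wi) * (1 - xi)
    return out

w = (1, 0, 1, 0)
for x in product((0, 1), repeat=4):
    assert eq(w, x) == eq(w[:2], x[:2]) * eq(w[2:], x[2:])
```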
We can exploit this by splitting the summation domain. In round $i$, the sum for $t_i(u)$ (the non-trivial part of the prover's message; the remaining factor $\text{eq}(w_{\le i}, (r_{[<i]}, u))$ over the already-bound variables and $X_i$ is multiplied in separately) can be rewritten as a nested sum:
\begin{equation}
t_i(u) = \sum_{x' \in \{0,1\}^{\ell-i}} \text{eq}(w_{>i}, x') \cdot p(r_{[<i]}, u, x').
\end{equation}
By splitting $x'$ into $x_L$ and $x_R$, we get:
\begin{equation}
t_i(u) = \sum_{x_R} \text{eq}(w_R, x_R) \cdot \left( \sum_{x_L} \text{eq}(w_L, x_L) \cdot p(r_{[<i]}, u, x_L, x_R) \right).
\end{equation}
This nested structure is key. The prover can first compute the inner sum over the smaller domain $\{0,1\}^{|x_L|}$ and then use those results in the outer sum.
### The Algorithm
The prover operates as follows for the first $\ell/2$ rounds:
1. **Precompute**: The prover only needs tables for $\text{eq}(w_L, \cdot)$ and $\text{eq}(w_R, \cdot)$, which are of size $O(\sqrt{N})$, instead of one table of size $O(N)$.
2. **Compute Inner Sum**: For each value of $x_R$, compute the inner sum over all $x_L$.
3. **Compute Outer Sum**: Combine the results from the inner sums, weighted by $\text{eq}(w_R, x_R)$, to get the final value.
This method reduces memory from $O(N)$ to $O(\sqrt{N})$ and saves approximately $N \cdot \text{cost}(ll)$ operations by avoiding multiplications with the full `eq` table.
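The sketch below mirrors these three steps. The dictionary `p_vals`, which stands in for the already-bound evaluations $p(r_{[<i]}, u, x_L, x_R)$, its key layout, and the use of plain integers are assumptions of this sketch; the zkVM example below exercises exactly this pattern.
```python
from itertools import product

def eq_table(w):
    """eq(w, x) for every Boolean x of the same length as w: O(2^len(w)) entries."""
    tab = {}
    for x in product((0, 1), repeat=len(w)):
        val = 1
        for wi, xi in zip(w, x):
            val *= wi * xi + (1 - wi) * (1 - xi)
        tab[x] = val
    return tab

def t_i(w_L, w_R, p_vals):
    """Nested sum using two sqrt(N)-sized eq tables instead of one N-sized table."""
    eq_L, eq_R = eq_table(w_L), eq_table(w_R)           # step 1: precompute
    total = 0
    for x_R, e_R in eq_R.items():
        inner = sum(e_L * p_vals[x_L + x_R]             # step 2: inner sum over x_L
                    for x_L, e_L in eq_L.items())
        total += e_R * inner                            # step 3: weighted outer sum
    return total
```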
:::info
### Example with a zkVM Trace
Let's set up a toy example to illustrate the idea: proving a fact about a program's execution trace inside a simple zkVM.
The core idea is to use the `eq` polynomial as a selector—a tool to pinpoint one exact state out of thousands or millions and check its properties.
#### The Scenario: Pinpointing a State in a VM Trace
Imagine we have a short, 4-cycle execution trace. We want to prove that the value in **Register A** was exactly **42** at a specific moment: during **cycle 2**, when the instruction was **LOAD**.
* **State Representation**: We can represent any state in our trace using $\ell=4$ binary variables. The total number of states is $N=2^4=16$.
* $(X_1, X_2)$: These two bits encode the **cycle number** (from `00` for cycle 0 to `11` for cycle 3).
* $(X_3, X_4)$: These two bits encode the **instruction opcode** (e.g., `00`=ADD, `01`=MUL, `10`=LOAD).
* **Trace Polynomial $p(X)$**: This is a multilinear polynomial that represents the entire execution trace. When evaluated at a specific state $x$, $p(x)$ gives us the value that was in **Register A** at that time.
* **Target Vector $w$**: This is our selector. It defines the exact state we want to isolate.
* Cycle 2 is binary `10`.
* The LOAD opcode is binary `10`.
* Therefore, our target state is $w = (1, 0, 1, 0)$.
* **The Goal**: The prover must compute the sum $S = \sum_{x \in \{0,1\}^4} \text{eq}(w, x) \cdot p(x)$. This sum effectively filters the entire trace, isolating the value of Register A at our target state. If the trace is valid, the result should be 42.
### The Brute-Force Method: The Memory Bottleneck
The straightforward approach would be to iterate through all 16 states of the trace. The `eq(w, x)` term would be zero for 15 of them, and one for the state where $x=w$. The sum would correctly collapse to $p(w)$.
The problem isn't the computation, but the **memory**. This method conceptually requires a lookup table for the polynomial $p(X)$ over all $N=16$ states. For a real-world trace with $\ell=30$ variables (${\sim}1$ billion states), that table becomes prohibitively large. ❌
### The Optimized Method: Divide and Conquer
This is where variable splitting shines. Instead of treating the state as a single 4-bit string, we recognize its internal structure.
**Step 1: Deconstruct the State and Reduce Memory**
We split the variables according to their meaning:
* **Cycle Variables**: $X_L = (X_1, X_2)$ with the target cycle $w_L = (1, 0)$.
* **Opcode Variables**: $X_R = (X_3, X_4)$ with the target opcode $w_R = (1, 0)$.
This allows us to replace one giant table of size 16 with two small tables of size 4: one for cycle checks and one for opcode checks.
* **Memory Required**: $4 + 4 = 8$ locations, instead of 16.
* **The General Case**: The memory cost drops from $O(N)$ to $O(\sqrt{N})$, i.e., from $2^\ell$ to $2^{\ell/2}$ table entries. This square-root reduction is the key benefit.
**Step 2: The Nested Sum - A Two-Stage Filter**
The sum is now rewritten as a nested computation, which acts as a two-stage filter:
\begin{equation}
S = \underbrace{\sum_{x_R \in \{\text{opcodes}\}} \text{eq}(w_R, x_R)}_{\text{Stage 1: Filter by Opcode}} \cdot \left( \underbrace{\sum_{x_L \in \{\text{cycles}\}} \text{eq}(w_L, x_L) \cdot p(x_L, x_R)}_{\text{Stage 2: Filter by Cycle}} \right)
\end{equation}
**Step 3: Execution Flow**
The prover executes this nested sum efficiently:
1. **Stage 1: Filter by Opcode**. The prover iterates through all possible opcodes. The term $\text{eq}(w_R, x_R) = \text{eq}((1,0), x_R)$ will be 1 only when $x_R$ is `(1,0)` (LOAD). For all other opcodes, the result is 0, so the prover can discard those parts of the trace without doing any further work on them.
2. **Stage 2: Filter by Cycle**. Now, working only on the subset of the trace related to the LOAD instruction, the prover iterates through the cycles. The term $\text{eq}(w_L, x_L) = \text{eq}((1,0), x_L)$ will be 1 only when $x_L$ is `(1,0)` (Cycle 2).
The inner sum therefore collapses to a single term:
\begin{align*}
\text{InnerSum} &= 1 \cdot p(\text{cycle } (1,0), \text{opcode } (1,0)) \\
&= p(1,0,1,0)
\end{align*}
Assuming a correct trace, we know $p(1,0,1,0) = 42$.
3. **Final Combination**. The final result is the single non-zero contribution from the outer loop, which is $1 \cdot \text{InnerSum} = 42$.
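A self-contained Python check of this toy trace (the Register A values other than 42 are made up):
```python
from itertools import product

def eq(w, x):
    out = 1
    for wi, xi in zip(w, x):
        out *= wi * xi + (1 - wi) * (1 - xi)
    return out

# Toy Register A column: 42 at state (cycle 2, LOAD) = (1,0,1,0), 7 elsewhere.
p = {x: 7 for x in product((0, 1), repeat=4)}
p[(1, 0, 1, 0)] = 42

w_L, w_R = (1, 0), (1, 0)                    # target cycle and target opcode
S = sum(eq(w_R, x_R) * sum(eq(w_L, x_L) * p[x_L + x_R]
                           for x_L in product((0, 1), repeat=2))
        for x_R in product((0, 1), repeat=2))
print(S)   # 42
```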
**Remark**
Both methods yield the same answer. However, the optimized "divide and conquer" strategy transforms a huge memory problem into a highly efficient, targeted lookup. By splitting the variables and processing the sum as a nested filter, the prover can navigate an exponentially large domain while storing only square-root-sized `eq` tables.
:::
## Summary
This document outlined two primary optimizations for the sumcheck prover that deliver significant performance improvements.
1. **Small-Value Optimization**: By using a hybrid strategy with a Lagrange-based precomputation phase for the first $\ell_0$ rounds, we replace a large number of expensive `ll` multiplications with cheaper `ss` and `sl` ones. This yields both asymptotic and practical speedups.
2. **Equality Polynomial Optimization**: By splitting the summation domain and factoring the `eq` polynomial, we reduce the prover's memory from $O(N)$ to $O(\sqrt{N})$ and save a substantial number of field multiplications.
When combined, these techniques make sumcheck-based systems like Jolt significantly faster and more memory-efficient, reducing the prover runtime by factors of 2-3x in typical scenarios and over 20x in memory-constrained settings.