# Speeding Up Sum-Check Proving
WARNING: A lot of equations ahead.
https://eprint.iacr.org/2025/1117.pdf
## What this doc covers
- [X] optimized sum-check proving algorithm for small values
- [ ] optimized sum-check proving algorithm for eq
- [ ] optimized implementation for $\mathfrak{sl}$ multiplications in large prime fields
## Motivation
### Jolt
Three components:
- A read-only memory checking protocol
  - optimized
- A read/write memory checking protocol
  - optimized
- A highly structured R1CS instance (a variant of Spartan)
  - needs to be optimized!
> "The sum-check protocol is so effective at reducing commitment costs that its own proving time now dominates the prover's end-to-end cost" 😆
### Prover Bottlenecks
Key Observation: The values being summed by Spartan's sum-check are *small*, because the execution trace consists of values stored in the VM's registers (32-bit words), while the proving field used in Jolt is very large.
**Categorization of multiplications**
- $\mathfrak{ss}$: small value x small value
- $\mathfrak{sl}$: small value x large value
- $\mathfrak{ll}$: large value x large value
**GOAL**
Reduce the number of $\mathfrak{ll}$ multiplications!
## Preliminaries
### MLE
The multilinear extension (MLE) maps any function $f: \{0,1\}^\ell \to \mathbb{F}$ to the unique multilinear polynomial $p:\mathbb{F}^\ell \to \mathbb{F}$ such that $p(x)=f(x)$ for all $x\in\{0,1\}^\ell$. It is built from the equality polynomial $\widetilde{\mathrm{eq}}$:
$$
\widetilde{\mathrm{eq}}(X, Y) = \prod_{i=1}^{\ell} \left( X_i \cdot Y_i + (1 - X_i) \cdot (1 - Y_i) \right)
$$
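For concreteness, here is a tiny Python sketch of $\widetilde{\mathrm{eq}}$ over a toy prime field (my own illustration, not from the paper; the modulus and helper name are arbitrary choices):

```python
P_MOD = 2**61 - 1  # toy prime modulus, chosen only for illustration

def eq_tilde(xs, ys, p=P_MOD):
    """prod_i (x_i * y_i + (1 - x_i) * (1 - y_i)) mod p."""
    acc = 1
    for x, y in zip(xs, ys):
        acc = acc * ((x * y + (1 - x) * (1 - y)) % p) % p
    return acc

# On boolean inputs, eq~ is 1 exactly when the two points coincide.
assert eq_tilde([1, 0, 1], [1, 0, 1]) == 1
assert eq_tilde([1, 0, 1], [1, 1, 1]) == 0
```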
Since we will be running the sum-check protocol on this polynomial $p$, we can express its evaluation at the challenges from the first $i$ rounds as follows:
$$
p(r_1, \ldots, r_i, x') = \sum_{y \in \{0,1\}^i} \widetilde{\mathrm{eq}}((r_1, \ldots, r_i), y) \cdot p(y, x')
$$
#### When $i = 1$
$$
\begin{align}
p(r_1, x')
&= \sum_{y \in \{0,1\}^1} \widetilde{\mathrm{eq}}(r_1, y) \cdot p(y, x')\\
&= (1 - r_1) \cdot p(0, x') + r_1 \cdot p(1, x')
\end{align}
$$
#### When $i = \ell$
$$
p(r_1, \ldots, r_\ell) = \sum_{y \in \{0,1\}^\ell} \widetilde{\mathrm{eq}}((r_1, \ldots, r_\ell), y) \cdot p(y)
$$
### Variant of Lagrange Interpolation
A conventional Lagrange interpolation of a polynomial looks like the following:
$$
s(X) = \sum_{i=0}^{d} s(x_i) \cdot \mathcal{L}_{U,i}(X),
\quad \text{where} \quad
\mathcal{L}_{U,i}(X) = \prod_{j \ne i} \frac{X - x_j}{x_i - x_j}.
$$
But the paper uses a variant which includes evaluating at the "point at infinity", where $s(\infty)$ is defined as the highest-degree coefficient of $s(X)$. Thus, for a degree-$d$ polynomial, we can evaluate it over $U = \{\infty\} \cup \{x_1, \ldots, x_d\}$ instead of $\{x_0, \ldots, x_d\}$ and write the polynomial as follows:
$$
s(X) = s(\infty) \cdot \prod_{k=1}^{d} (X - x_k) + \sum_{k=1}^{d} s(x_k) \cdot \mathcal{L}_{\{x_1,\ldots,x_d\},k}(X)
$$
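As a quick sanity check of this variant, here is a toy Python sketch (my own example, not the paper's code) that evaluates a degree-$d$ polynomial from $s(\infty)$ plus $d$ finite evaluations; the prime and helper names are illustrative:

```python
P_MOD = 101  # toy prime, chosen only for illustration

def poly_eval(coeffs, x, p=P_MOD):
    """Evaluate sum(coeffs[i] * x**i) mod p (coefficients low-to-high, Horner's rule)."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

def eval_via_infinity(s_inf, pts, vals, x, p=P_MOD):
    """s(x) = s(inf) * prod_k (x - x_k) + sum_k s(x_k) * L_k(x), with L_k over the finite points."""
    total = s_inf
    for xk in pts:
        total = total * ((x - xk) % p) % p
    for k, xk in enumerate(pts):
        num, den = vals[k], 1
        for j, xj in enumerate(pts):
            if j != k:
                num = num * ((x - xj) % p) % p
                den = den * ((xk - xj) % p) % p
        total = (total + num * pow(den, -1, p)) % p
    return total

coeffs = [7, 3, 0, 5]                 # s(X) = 5X^3 + 3X + 7, degree d = 3
pts = [1, 2, 3]                       # d finite evaluation points
vals = [poly_eval(coeffs, x) for x in pts]
s_inf = coeffs[-1]                    # s(infinity) = leading coefficient
assert eval_via_infinity(s_inf, pts, vals, 10) == poly_eval(coeffs, 10)
```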
### Sum-check
Every round, the prover has to send a univariate polynomial $s_i(X)$:
$$
s_i(X) = \sum\limits_{x\in\{0,1\}^{\ell-i}} p(r_1,\dots,r_{i-1},X,x)
$$
In this paper, we will mostly deal with sums of products of multilinear polynomials, so we define $s_i(X)$ with the summand being a product of $d$ multilinear polynomials:
$$
s_i(X) = \sum\limits_{x\in\{0,1\}^{\ell-i}} \prod\limits_{k=1}^d p_k(r_1,\dots,r_{i-1},X,x)
$$
We will also restrict ourselves to the case where $p_k(x)$ takes small values for all $k\in\{1,\dots,d\}$ and $x\in\{0,1\}^\ell$.
**Note that starting from round $i\ge 2$, the evaluations $p_k(r_1,\dots,r_{i-1},u,x)$ at any point $u$ are no longer small, due to the random challenges $r_1,\dots,r_{i-1}$ sampled from the big field.**
## Algorithm 1: Linear Time and Linear Space
So, we first create tables containing all evaluations of the $p_k$ over the boolean hypercube: $d$ arrays $P_1,\dots,P_d$, storing all evaluations of $p_1,\dots,p_d$ over $\{0,1\}^\ell$.
Since every round of sum-check binds one more variable to a random challenge, each table halves in size from round to round:
$$
\{0,1\}^\ell \to \{0,1\}^{\ell-1} \to \dots \to\{0,1\}
$$
To understand what the new table entries look like, recall that evaluating $p$ with only its first variable left free looks like the following:
$$
\begin{align}
p(X, x')
&= \sum_{y \in \{0,1\}^1} \widetilde{\mathrm{eq}}(X, y) \cdot p(y, x')\\
&= (1 - X) \cdot p(0, x') + X \cdot p(1, x')
\end{align}
$$
Applying this to our tables, we can write the entries after the first round of sum-check in terms of the original entries, e.g. for the all-zero index:
$$
P_k^{(1)}[\underbrace{0,\ldots,0}_{\ell-1\ \text{values}}]
= (1 - r_1)\cdot P_k^{(0)}[0,\underbrace{0,\ldots,0}_{\ell-1\ \text{values}}] + r_1\cdot P_k^{(0)}[1,\underbrace{0,\ldots,0}_{\ell-1\ \text{values}}],
$$
where the superscript $(i)$ denotes the current round of sum-check and $k \in \{1,\ldots,d\}$.

Since in each round we want to send the polynomial $s_i(X)$, which multiplies together the $d$ (partially evaluated) polynomials and sums over the remaining variables, we can express it directly in terms of the table entries:
$$
s_i(X) :=
\sum_{x' \in \{0,1\}^{\ell - i}}
\prod_{k=1}^{d}
\left(
\left(
P_k^{(i)}[1, x'] - P_k^{(i)}[0, x']
\right) \cdot X + P_k^{(i)}[0, x']
\right).
$$
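To make the table-based round concrete, here is a minimal Python sketch of one round of Algorithm 1 (my own simplification, not the paper's implementation). Assumptions: tables are flat arrays with the variable being bound as the top bit of the index, and $s_i$ is evaluated at the toy points $X = 0,\dots,d$ rather than the paper's evaluation set.

```python
P_MOD = 2**61 - 1  # illustrative prime modulus

def round_poly_evals(tables, x_points, p=P_MOD):
    """[s_i(X) for X in x_points], with
    s_i(X) = sum_{x'} prod_k ((P_k[1,x'] - P_k[0,x']) * X + P_k[0,x'])."""
    half = len(tables[0]) // 2
    evals = [0] * len(x_points)
    for xp in range(half):
        for t, X in enumerate(x_points):
            prod = 1
            for Pk in tables:
                lo, hi = Pk[xp], Pk[half + xp]
                prod = prod * (((hi - lo) * X + lo) % p) % p
            evals[t] = (evals[t] + prod) % p
    return evals

def fold(tables, r, p=P_MOD):
    """Bind the top variable to the challenge r: P[x'] <- (1-r)*P[0,x'] + r*P[1,x']."""
    half = len(tables[0]) // 2
    return [[((1 - r) * Pk[xp] + r * Pk[half + xp]) % p for xp in range(half)]
            for Pk in tables]

# Toy usage: d = 2 multilinear polynomials in l = 2 variables.
tables = [[1, 2, 3, 4], [5, 6, 7, 8]]      # evaluations over {0,1}^2
s1 = round_poly_evals(tables, [0, 1, 2])   # degree-2 round polynomial, 3 evaluations
tables = fold(tables, r=12345)             # halved tables for round 2
```

After the fold, the table entries are generic field elements, so from round 2 on almost every multiplication above is $\mathfrak{ll}$, which is exactly the cost issue discussed next.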
### Cost
Linear space is the problem: the full evaluation tables must be kept in memory. Time is also bad in practice, because after the first round the table entries involve the random challenges, so essentially every multiplication becomes an $\mathfrak{ll}$ multiplication.
## Algorithm 2: Quasilinear Time & Square-Root Space
Can we decrease space complexity at the expense of time complexity?
Yes, by not creating an evaluation table for the first $\ell/2$ rounds and directly computing the polynomial at each round.
Basically, we can compute the following for rounds $i = 1,...,\ell/2$:
$$
s_i(X) =
\sum_{x' \in \{0,1\}^{\ell - i}}
\prod_{k=1}^{d}
\left(
\sum_{b \in \{0,1\}^{i-1}}
\widetilde{\mathrm{eq}}\big(r_{<i}, b\big)
\cdot
p_k(b, X, x')
\right)
$$
And run Algorithm 1 for the remaining rounds.
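Under the same toy conventions as the Algorithm 1 sketch above (my own code, not the paper's), one of the first $\ell/2$ rounds can be computed directly from the original small-value tables, using the $\widetilde{\mathrm{eq}}$ weights for the already-bound challenges; the helper names and the evaluation points $X = 0,\dots,d$ are illustrative:

```python
P_MOD = 2**61 - 1  # illustrative prime modulus

def eq_weights(rs, p=P_MOD):
    """eq~((r_1,...,r_m), b) for every b in {0,1}^m, with b_1 as the top bit."""
    w = [1]
    for r in rs:
        w = [wi * f % p for wi in w for f in ((1 - r) % p, r % p)]
    return w

def algo2_round_evals(tables, rs, x_points, ell, p=P_MOD):
    """[s_i(X) for X in x_points], where i = len(rs) + 1, recomputed from the original tables."""
    i = len(rs) + 1
    w = eq_weights(rs, p)
    n_rest = 2 ** (ell - i)                  # size of the x' block
    evals = [0] * len(x_points)
    for xp in range(n_rest):
        # inner sum over b: fold the random prefix out of each table,
        # once for x_i = 0 and once for x_i = 1
        folded = []
        for Pk in tables:
            f0 = sum(w[b] * Pk[(2 * b + 0) * n_rest + xp] for b in range(len(w))) % p
            f1 = sum(w[b] * Pk[(2 * b + 1) * n_rest + xp] for b in range(len(w))) % p
            folded.append((f0, f1))
        for t, X in enumerate(x_points):
            prod = 1
            for f0, f1 in folded:
                prod = prod * (((f1 - f0) * X + f0) % p) % p
            evals[t] = (evals[t] + prod) % p
    return evals

# Toy usage: l = 3 variables, d = 2 polynomials, round i = 2 (one prior challenge).
tables = [list(range(1, 9)), list(range(2, 10))]
s2 = algo2_round_evals(tables, rs=[12345], x_points=[0, 1, 2], ell=3)
```

The trade-off is visible in the nested loops: every one of these rounds rescans all of the original table entries.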
### Cost
Space drops to roughly the square root of the table size, but time gets worse: each of the first $\ell/2$ rounds rescans the original evaluation tables, making the total time quasilinear.
## Algorithm 3: Less space and time using precomputation & accumulation
Can we delay the $\mathfrak{ll}$ multiplications as much as possible? Yes, by separating the part that depends on the random challenges from the part that does not.
Let's rewrite the equation from Algorithm 2 to achieve this. Expanding the product of the inner sums (one sum per $p_k$) and collecting the $r$-dependent factors gives:
$$
s_i(X) =
\sum_{b_1,\ldots,b_d \in \{0,1\}^{i-1}}
\underbrace{\prod_{j=1}^{i-1} \left( (1 - r_j)^{d-\sum_{k=1}^{d} b_k[j]} \cdot r_j^{\sum_{k=1}^{d} b_k[j]} \right)}_{\text{depends on } r}
\cdot
\underbrace{\sum_{x' \in \{0,1\}^{\ell - i}} \prod_{k=1}^{d} p_k(b_k, X, x')}_{\text{small values only}}
$$

Essentially, we have rewritten the equation as an inner product of two vectors of length $2^{d(i-1)}$ (the length comes from $\sum\limits_{b_1,\ldots,b_d\in\{0,1\}^{i-1}}$, which enumerates over all $d$-tuples of $(i-1)$-bit strings), where the first vector depends on the large challenges $r$, while the second vector is made up solely of small values. So we separated the cheap part from the expensive part, yay!
**There are two nice things about this new equation:**
1. we can incrementally compute the value that does depend on $r$
2. we can precompute the value that doesn't depend on $r$
### Some voodoo stuff...
But before we explain how the nice things work, we need to change the equation again...
Notice that two different ways of indexing $b$ are used:
$$
\sum\limits_{b_1,...,b_d\in\{0,1\}^{i-1}},\hspace{1cm} \sum\limits_{k=1}^{d}b_k[j]
$$
The first sum goes through all possible $d$-tuples of $(i-1)$-bit strings, while the second sum has more structure: it takes the $j$-th bit of each of $b_1,\ldots,b_d$ and sums them, which gives a value between $0$ and $d$.
And if we check the vector that relies on $r$,
$$
\prod_{j=1}^{i-1}
\big(
(1 - r_j)^{d-\sum_{k=1}^{d} b_k[j]}
\cdot
r_j^{\sum_{k=1}^{d} b_k[j]}
\big),
$$
we can see that it depends on $b_1,\ldots,b_d$ only through the sums $\sum_{k=1}^{d} b_k[j]$, and not on the individual bit strings.
So what can we do? We can introduce a new variable $v =\sum\limits_{k=1}^{d}b_k$ (i.e. the sum of the $j$-th bits of $b_1,\ldots,b_d$ equals $v[j]$) and enumerate over these sums instead. Since each entry of $v$ ranges over $[0,d]$, we move from enumerating over $2^{d(i-1)}$ values to $(d+1)^{i-1}$ values. This new set is always smaller for $d \ge 2$ (e.g. for $d = 3$ and $i = 4$ we get $4^3 = 64$ instead of $2^9 = 512$), since $d + 1 < 2^d$ implies
$$
2^{d(i-1)} > (d+1)^{i-1}
$$

Thus, we can rewrite the equation as follows:
$$
s_i(X) =
\sum_{v \in [0,d]^{i-1}}
\underbrace{\prod_{j=1}^{i-1} \left( (1 - r_j)^{d-v[j]} \cdot r_j^{v[j]} \right)}_{\text{depends on } r}
\cdot
\underbrace{\sum_{\substack{b_1, \ldots, b_d \in \{0,1\}^{i-1} \\ v = \sum_{k=1}^{d} b_k}}
\sum_{x' \in \{0,1\}^{\ell - i}} \prod_{k=1}^{d} p_k(b_k, X, x')}_{\text{small values only}}
$$
Yes, this is headache-inducing, but essentially the only parts that changed are the two underbraced terms. As we discussed above, the left term now enumerates over $(d+1)^{i-1}$ possible values.
The right term becomes slightly more complicated: it still iterates over all $2^{d(i-1)}$ tuples of bit strings, but in a more structured way, so that for a given value of $v$ it only includes the tuples whose bits sum to $v$.
### Incrementally computing the $r$-dependent term
Let's come back to why we did all this refactoring in the first place.
$$
\prod_{j=1}^{i-1} (1 - r_j)^{d - v[j]} r_j^{v[j]}
$$
So if we take a closer look at the $r$-dependent term, we can see that it is essentially a tensor product of $i-1$ vectors, each of length $d+1$. The best way to see this is to look at how each round's vector $\mathsf{R}_i$ builds on the previous one:
$$
\begin{align}
&\mathsf{R_1}=1,\\
&\mathsf{R_2}=\mathsf{R_1}\otimes\left((1-r_1)^{d-k}\cdot r_1^k\right)_{k=0}^d,\\
&\mathsf{R_3}=\mathsf{R_2}\otimes\left((1-r_2)^{d-k}\cdot r_2^k\right)_{k=0}^d...
\end{align}
$$
So essentially, at each round, the number of terms is blown up by a factor of $(d+1)$. This means that we don't have to recompute $r$-dependent terms in the sum-check protocol from scratch every round.
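Here is a minimal Python sketch of that incremental update (illustrative code, not from the paper); it assumes $\text{index}(v)$ reads $v$ as a base-$(d+1)$ number with $v[1]$ as the most significant digit:

```python
P_MOD = 2**61 - 1  # illustrative prime modulus

def extend_R(R, r, d, p=P_MOD):
    """R_{i+1} = R_i tensor ((1-r)^(d-k) * r^k for k = 0..d)."""
    factors = [pow((1 - r) % p, d - k, p) * pow(r, k, p) % p for k in range(d + 1)]
    # the newest challenge becomes the least significant base-(d+1) digit of index(v)
    return [Ri * f % p for Ri in R for f in factors]

# Usage: R_1 = [1]; after two challenges, R_3 has (d+1)^2 entries.
d = 3
R = [1]
for r in (17, 42):          # toy challenges r_1, r_2
    R = extend_R(R, r, d)
assert len(R) == (d + 1) ** 2
```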
### Precomputing the non-$r$-dependent term, i.e. the accumulator
Now let's look at the non-$r$-dependent term:
$$
\sum_{\substack{b_1, \ldots, b_d \in \{0,1\}^{i-1} \\ v = \sum_{k=1}^{d} b_k}}
\sum_{x'}
\prod_{k=1}^{d} p_k(b_k, X, x')
$$
Since this term depends only on $v$ and the indeterminate $X$, we can package it into a new function $\mathsf{A}_i(v,X)$ that we will call the accumulator. Note that it can be precomputed by the prover, as it does not depend on any $r$ values.
### Putting it all together
At each round in the sum-check protocol, the prover will send the following polynomial:
$$
s_i(X) = \sum\limits_{v\in[0,d]^{i-1}}\mathsf{R_i}[\text{index}(v)]\cdot \mathsf{A_i}(v,X)
$$
Note that this will only be done for some number $\ell_0$ of initial rounds (the exact value is determined by a more thorough cost analysis), but the general idea is that it uses significantly less memory, since the prover only keeps the accumulators and the random vector $\mathsf{R}_i$ rather than the entire evaluation table, and it is fast thanks to the precomputation.
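To check that the bookkeeping above is consistent, here is a small end-to-end Python experiment (entirely my own toy code and naming, with the accumulators computed by brute force): it compares $\sum_v \mathsf{R}_i[\text{index}(v)]\cdot\mathsf{A}_i(v,X)$ against the round polynomial computed directly from the multilinear extensions, for a tiny instance.

```python
from itertools import product

P_MOD = 2**61 - 1  # illustrative prime modulus

def mle_eval(table, point, p=P_MOD):
    """Evaluate the MLE of `table` (variable 1 is the top bit of the index) at `point`."""
    cur = list(table)
    for r in point:                       # bind one variable per iteration
        half = len(cur) // 2
        cur = [((1 - r) * cur[j] + r * cur[half + j]) % p for j in range(half)]
    return cur[0]

l, d, i = 4, 2, 3
tables = [[(3 * k + j) % 7 for j in range(2 ** l)] for k in range(d)]  # small values
rs = [17, 42]                                                          # challenges r_1, r_2

def direct_s(X):
    """s_i(X) computed straight from the definition."""
    return sum(
        mle_eval(tables[0], rs + [X, xp]) * mle_eval(tables[1], rs + [X, xp]) % P_MOD
        for xp in (0, 1)
    ) % P_MOD

def accumulator(v, X):
    """A_i(v, X): brute force over all (b_1, ..., b_d) whose column sums equal v."""
    total = 0
    for bs in product(product((0, 1), repeat=i - 1), repeat=d):
        if all(sum(b[j] for b in bs) == v[j] for j in range(i - 1)):
            for xp in (0, 1):
                prod = 1
                for k, b in enumerate(bs):
                    prod = prod * mle_eval(tables[k], list(b) + [X, xp]) % P_MOD
                total = (total + prod) % P_MOD
    return total

def R_weight(v):
    """prod_j (1 - r_j)^(d - v[j]) * r_j^v[j]."""
    w = 1
    for j, vj in enumerate(v):
        w = w * pow((1 - rs[j]) % P_MOD, d - vj, P_MOD) % P_MOD * pow(rs[j], vj, P_MOD) % P_MOD
    return w

for X in (0, 1, 2, 5):
    via_accumulators = sum(
        R_weight(v) * accumulator(v, X) % P_MOD
        for v in product(range(d + 1), repeat=i - 1)
    ) % P_MOD
    assert via_accumulators == direct_s(X)
```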
## Algorithm 4
(This part is going to be much simpler, I promise.)
But in Algorithm 3, we end up doing a lot of computation for the accumulator function. Is there a way to modify it to make it more efficient?
The key idea in Algorithm 4 is to use Lagrange interpolation, and to precompute the Lagrange coefficients required at every round.
We can express the summand of the round-$i$ sum-check polynomial as a polynomial in $i-1$ variables:
$$
F(Y_1, \ldots, Y_{i-1}) = \prod_{k=1}^{d} p_k(Y_1, \ldots, Y_{i-1}, X, x')
$$
evaluated at the random point $(r_1,...,r_{i-1})$. We intentionally ignore $X$ here, as we will assume it is evaluated at some point $u$ in the evaluation set $U_d$.
Given that $F$ has degree at most $d$ in each of the variables $Y_1,...,Y_{i-1}$, we can rewrite it using the following Lagrange basis polynomials:
$$
\sum_{v_1 \in U_d} \mathcal{L}_{U_d, v_1}(Y_1) \cdots
\sum_{v_{i-1} \in U_d} \mathcal{L}_{U_d, v_{i-1}}(Y_{i-1})
\prod_{k=1}^{d} p_k(\mathbf{v}, u, x')
=\sum_{\mathbf{v} \in U_d^{i-1}}
\prod_{j=1}^{i-1} \mathcal{L}_{U_d, v_j}(Y_j)
\cdot
\prod_{k=1}^{d} p_k(\mathbf{v}, u, x')
$$
Again, we can think of the product of Lagrange coefficients as a tensor product of $i-1$ vectors, one per bound variable, each indexed by $U_d$.
So the whole polynomial $s_i(X)$ that the prover sends to the verifier looks like the following:
$$
\begin{align}
s_i(X)
&=
\sum_{x' \in \{0,1\}^{\ell - i}}
\sum_{v \in U_d^{i-1}}
\prod_{j=1}^{i-1} \mathcal{L}_{U_d, v_j}(r_j) \cdot
\prod_{k=1}^{d} p_k(v, X, x')
\\[1em]
&=\sum_{v \in U_d^{i-1}}
\prod_{j=1}^{i-1} \mathcal{L}_{U_d, v_j}(r_j) \cdot
\sum_{x' \in \{0,1\}^{\ell - i}}
\prod_{k=1}^{d} p_k(v, X, x')
\end{align}
$$
And if we define an accumulator as
$$
\mathsf{A}_i(v,X)=\sum\limits_{x' \in \{0,1\}^{\ell - i}} \prod\limits_{k=1}^dp_k(v, X, x')
$$
we can see that the final equation looks the same as the one in Algorithm 3:
$$
s_i(X) = \sum\limits_{v\in[0,d]^{i-1}}\mathsf{R_i}[\text{index}(v)]\cdot \mathsf{A_i}(v,X)
$$
except that $\mathsf{R_i}$ is defined as follows:
$$
\begin{align}
&\mathsf{R_1}=1,\\
&\mathsf{R_2}=\mathsf{R_1}\otimes\left(\mathcal{L}_{U_d, k}(r_1)\right)_{k=0}^d,\\
&\mathsf{R_3}=\mathsf{R_2}\otimes\left(\mathcal{L}_{U_d, k}(r_2)\right)_{k=0}^d...
\end{align}
$$
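A minimal Python sketch of this Lagrange-based $\mathsf{R}_i$ (illustrative only; it assumes the finite evaluation set $U_d=\{0,\dots,d\}$, whereas the variant with the point at infinity from the preliminaries would change the basis factors accordingly):

```python
P_MOD = 2**61 - 1  # illustrative prime modulus

def lagrange_weights(r, d, p=P_MOD):
    """[L_{U_d, k}(r) for k in 0..d] with U_d = {0, ..., d} (standard Lagrange basis)."""
    out = []
    for k in range(d + 1):
        num, den = 1, 1
        for j in range(d + 1):
            if j != k:
                num = num * ((r - j) % p) % p
                den = den * ((k - j) % p) % p
        out.append(num * pow(den, -1, p) % p)
    return out

def extend_R(R, r, d, p=P_MOD):
    """R_{i+1} = R_i tensor (L_{U_d, k}(r))_{k=0..d}, same tensor structure as in Algorithm 3."""
    return [Ri * w % p for Ri in R for w in lagrange_weights(r, d, p)]

# Sanity check: the weights interpolate any degree-d polynomial from its values on U_d.
d, r = 3, 987654321
q = lambda x: (5 * x**3 + 2 * x + 9) % P_MOD
assert sum(w * q(k) for k, w in enumerate(lagrange_weights(r, d))) % P_MOD == q(r) % P_MOD
```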
### Cost
The improvement over Algorithm 3 is that we end up doing fewer multiplications for the accumulator polynomial. Please refer to the paper for a precise accounting.