$\newcommand{\zero}{\color{red}{\alpha}}$ $\newcommand{\one}{\color{lightgreen}{\beta}}$ $\newcommand{\xzero}[1]{\color{grey}{(1 - #1)}}$ $\newcommand{\xone}[1]{\color{grey}{#1}}$ $\def\stackbelow#1#2{\underset{\displaystyle\overset{\displaystyle\shortparallel}{#2}}{#1}}$

# Sumcheck and Open-Binius

### Background

In late 2023, [Justin Thaler](https://x.com/SuccinctJT) posted a [paper](https://people.cs.georgetown.edu/jthaler/small-sumcheck.pdf) on optimized sumcheck for fields of small characteristic. Having experience with low-level hardware-optimized [modular multiplication](https://www.ingonyama.com/blog/multi-precision-fast-modular-multiplication#:~:text=The%20Barrett%2DDomb%20modular%20multiplication,of%20the%20Barrett%2DDomb%20algorithm.), [Yuval Domb](https://x.com/yuval_domb) began exploring ways to further optimize Justin's results using his extensive knowledge of efficient multiplication techniques. He wondered whether the [Karatsuba algorithm](https://en.wikipedia.org/wiki/Karatsuba_algorithm) could be applied to reduce the number of base-field multiplications in Justin's optimized sumcheck prover algorithm (referred to as Algorithm 3 in the paper).

The Karatsuba algorithm reduces the number of half-size multiplications needed to multiply two large integers from four to three, offering a 25% improvement (we give a minimal code sketch of this saving below). While this may seem modest, Justin helped us see that even small improvements are significant in the context of the sumcheck protocol over smaller fields. This is particularly relevant as commitment schemes like the one in [Binius](https://www.irreducible.com/posts/binius-hardware-optimized-snark) have become so efficient that the sumcheck protocol could become a bottleneck in SNARKs. Thus, we focused on incorporating the Karatsuba algorithm into Justin's Algorithm 3 for the case when the sumcheck polynomial degree is $d = 2$. The two main challenges were:

- For the $d = 2$ case, how do we apply the Karatsuba algorithm across multiple rounds of the sumcheck protocol?
- If we solve the first challenge, how do we generalize the Karatsuba algorithm to any degree $d > 2$?

We successfully optimized Justin's algorithm, reducing the number of base-field multiplications in the sumcheck protocol over small fields from $\mathcal{O}(2^d)$ to $\mathcal{O}(d)$ using a generalized Karatsuba algorithm. We refer to this as Algorithm 4. The details of both Algorithms 3 and 4 are presented in our recent [paper](https://eprint.iacr.org/2024/1046)[^bdt24], co-authored with Justin. In this blog, we explain the motivation behind our work and how our algorithms can improve the sumcheck protocol over binary fields. We also provide a high-level overview of how the two algorithms function.

### Why Sumcheck over Binary Fields?

Recent work by Diamond and Posen [[DP23](https://eprint.iacr.org/2023/1784)][^a], titled _Binius_, successfully concludes the long-standing quest for enabling very fast SNARKs using standard, efficient hash functions like Keccak. Binius provides a significantly faster hashing-based commitment scheme for small values and integrates this scheme with sum-check-based polynomial IOPs. Efforts like [open-binius](https://github.com/ingonyama-zk/open-binius) have already started looking into accelerating SNARK proof generation over binary fields. With these developments, the sumcheck protocol over binary fields could soon become a bottleneck for such provers. These advancements inspired us to optimize the sum-check protocol for fields of small characteristic.
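To make the four-to-three saving concrete, here is a minimal, self-contained sketch (our own illustration, not taken from the paper or our repository) of the $d = 2$ building block: multiplying two linear polynomials $f_1(x) = a_1 + b_1 x$ and $f_2(x) = a_2 + b_2 x$. Plain `i64` values stand in for field elements, and the function names are ours.

```rust
// Schoolbook needs 4 coefficient multiplications; Karatsuba needs only 3.
// i64 values stand in for field elements; this is purely illustrative.

fn product_schoolbook(a1: i64, b1: i64, a2: i64, b2: i64) -> [i64; 3] {
    // f1 * f2 = a1*a2 + (a1*b2 + a2*b1)*x + b1*b2*x^2  -- 4 multiplications
    [a1 * a2, a1 * b2 + a2 * b1, b1 * b2]
}

fn product_karatsuba(a1: i64, b1: i64, a2: i64, b2: i64) -> [i64; 3] {
    let low = a1 * a2;                              // 1st multiplication
    let high = b1 * b2;                             // 2nd multiplication
    let mid = (a1 + b1) * (a2 + b2) - low - high;   // 3rd multiplication
    [low, mid, high]
}

fn main() {
    assert_eq!(product_schoolbook(3, 5, 2, 7), product_karatsuba(3, 5, 2, 7));
}
```

The whole question addressed in the rest of this post is how to exploit this kind of saving inside the sumcheck prover, across rounds and for arbitrary degree $d$.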
Binius enables efficient SNARKs over the binary field $\mathsf{GF}[2]$ and its tower-field extensions. Our algorithms for running the sumcheck protocol over the binary field and its tower extensions are typically an order of magnitude faster than existing algorithms. We are particularly interested in the factor improvement of our algorithms compared to the naive algorithm. Factor improvement is defined as the ratio of the estimated time to run the naive algorithm to the estimated time to run Algorithm 3 or 4. Thus, a higher factor improvement indicates better performance of Algorithm 3 or 4 relative to the naive algorithm.

\begin{aligned}
F_a := \frac{ n_{\textsf{bb}}^{(1)} \cdot t_{\textsf{bb}} + n_{\textsf{be}}^{(1)} \cdot t_{\textsf{be}} + n_{\textsf{ee}}^{(1)} \cdot t_{\textsf{ee}} }{ n_{\textsf{bb}}^{(a)} \cdot t_{\textsf{bb}} + n_{\textsf{be}}^{(a)} \cdot t_{\textsf{be}} + n_{\textsf{ee}}^{(a)} \cdot t_{\textsf{ee}} }
\end{aligned}

| Notation | Description |
| -------- | -------- |
| $a \in \{1, 3, 4\}$ | Algorithm index (naive algorithm is 1, our algorithms are 3 and 4) |
| $n_{\textsf{bb}}^{(a)}$ | Number of base-base multiplications in algorithm $a$ |
| $n_{\textsf{be}}^{(a)}$ | Number of base-extension multiplications in algorithm $a$ |
| $n_{\textsf{ee}}^{(a)}$ | Number of extension-extension multiplications in algorithm $a$ |
| $t_{\textsf{bb}}$ | Time required to perform one base-base multiplication |
| $t_{\textsf{be}}$ | Time required to perform one base-extension multiplication |
| $t_{\textsf{ee}}$ | Time required to perform one extension-extension multiplication |

Note that the unit times for each operation have been measured using the benchmarks in the [binius](https://gitlab.com/IrreducibleOSS/binius) repository on a 24-core Intel Core i9-13900K machine. Since our algorithms focus on minimizing extension-field multiplications at the cost of additional base-field multiplications, we expect the performance of our algorithms to be better when extension-field multiplications are significantly more expensive than base-field multiplications. Below is the ratio of running times of an extension-field multiplication to a base-field multiplication for different binary fields.

| Base field | Extension field | Degree of extension | $t_{\textsf{ee}} / t_{\textsf{bb}}$ |
| -------- | -------- | -------- | ----- |
| $\textsf{GF}[2]$ | $\textsf{GF}[2^{128}]$ | $128$ | $145 \times 10^{2}$ |
| $\textsf{GF}[2^2]$ | $\textsf{GF}[2^{128}]$ | $64$ | $2.5 \times 10^{2}$ |
| $\textsf{GF}[2^4]$ | $\textsf{GF}[2^{128}]$ | $32$ | $0.5 \times 10^{2}$ |

This aligns with our expectations: for $\textsf{GF}[2]$, a base-field multiplication is simply a one-bit AND, leading to a very high ratio of $t_\textsf{ee}$ to $t_\textsf{bb}$. This ratio decreases as the size of the base field increases.

We plot the factor improvement of our algorithms for these base fields with respect to the switchover round $t$. Since Algorithms 3 and 4 are memory-intensive, we use them only for the first $t$ rounds of the sumcheck protocol and then switch to the naive algorithm for the remaining rounds. For the plot below, we set the degree of the sumcheck polynomial to $d = 3$, as the most relevant use cases involve cubic sumcheck[^d3]. We set the sumcheck instance size to $2^\ell = 2^{20}$.

![image](https://hackmd.io/_uploads/B1lqNjewC.png)

The plot shows the factor improvement of our optimized algorithms (Algorithms 3 and 4) compared to the naive algorithm for different values of the switchover round $t$.
The y-axis represents the factor improvement, while the x-axis represents the switchover round $t$. The switchover round indicates the point at which we switch from our optimized algorithms to the naive algorithm during the sumcheck protocol. The key takeaways are:

1. **Optimal Switchover Point**: The plot reveals an optimal switchover point where the factor improvement reaches its peak, following a parabolic trend. Beyond this point, the performance benefits of continuing with the optimized algorithms diminish, and efficiency may decrease as the cost of the growing number of base-field multiplications starts to outweigh the savings on the more costly extension-field multiplications.

2. **Effect of Base Field Size**: The trend lines indicate that the factor improvement is greater for smaller base fields, such as $\textsf{GF}[2]$. This aligns with expectations, as the time ratio of extension-field to base-field multiplications is higher for smaller fields, making the optimized algorithms more advantageous. For practical applications involving cubic sumcheck ($d = 3$), the plot shows that our optimized algorithms can achieve up to an order of magnitude improvement over the naive algorithm. For instance, with $\textsf{GF}[2]$ as the base field, Algorithm 4 outperforms the naive algorithm by almost a factor of 30.

3. **Significant Gains from the Karatsuba Optimization**: Integrating the Karatsuba trick notably enhances the performance of Algorithm 3. For the base field $\textsf{GF}[2]$, the peak factor improvements of Algorithms 3 and 4 are 6 and 30, respectively, so Algorithm 4 is five times more efficient than Algorithm 3 in this case. Additionally, the optimal points occur at $t = 5$ for Algorithm 3 and $t = 8$ for Algorithm 4, indicating that Algorithm 4 can efficiently handle more initial rounds. This ensures superior overall performance and highlights the substantial benefits of the Karatsuba optimization.

Aside from runtime improvements, Algorithm 4 offers a significant advantage over Algorithm 3 in terms of memory efficiency. Both algorithms are memory-intensive as they require the prover to store the results of the base-field multiplications. However, the memory requirements of Algorithms 3 and 4 are $\mathcal{O}(2^{d \cdot t})$ and $\mathcal{O}((d + 1)^{t})$, respectively. For $t = 8$, Algorithm 4 requires only 0.26 MB of memory, while Algorithm 3 needs 67 MB to store the base-field products. This substantial difference makes Algorithm 4 much more suitable for client-side proving on memory-constrained devices.

Independent work of Gruen [[Gru24](https://eprint.iacr.org/2024/108)][^gru24] also seeks to optimize the sum-check prover over small fields by using the univariate skip technique. This technique applies the sumcheck protocol when the degree of the first variable of the sumcheck polynomial is much higher than that of the other variables. Although this results in prover costs similar to our algorithms, our techniques apply to existing sum-check-based SNARKs "as is". We discuss the differences between the techniques in further detail in our paper.

Next, we describe how our optimized algorithms work. Please note that the following section assumes familiarity with multilinear polynomials and basic linear algebra. If you prefer to avoid the mathematical details, feel free to skip ahead to the conclusion.

### The Premise

Let $p_1, p_2, \dots, p_d$ be multilinear polynomials in $\ell$ variables.
To run the sum-check protocol over the product of these polynomials, we define

\begin{aligned}
p(x) := p_1(x) \cdot p_2(x) \cdots p_d(x)
\end{aligned}

for all $x \in \{0, 1\}^\ell$. The degree of the product polynomial $p(x)$ in each variable is at most $d$. In round $i$ of the sum-check protocol, the prover needs to compute the polynomial

\begin{aligned}
s_i(c) := \sum_{x \in \{0, 1\}^{\ell - i}} p(r_1, r_2, \dots, r_{i - 1}, c, x)
\end{aligned}

for $c \in \{0, 1, \dots, d\}$. We focus on computing the following polynomial

\begin{aligned}
G_i(c, x) :=&\ p(r_1, r_2, \dots, r_{i - 1}, c, x) = \prod_{j=1}^{d} p_j(r_1, r_2, \dots, r_{i - 1}, c, x).
\end{aligned}

Since $p_j$ is multilinear, we can expand it as follows.

\begin{array}{lll}
p_j(r_1, \dots, r_{i - 1}, c, x) =& \overbrace{\xzero{r_1} \cdots \xzero{r_{i - 2}}\xzero{r_{i - 1}}}^{\textsf{challenge-terms}} \cdot & \overbrace{p_j(0, \dots, 0, 0, c, x)}^{\textsf{witness-terms}} \ + \\
& \xzero{r_1} \cdots \xzero{r_{i - 2}} \cdot \xone{r_{i - 1}} \cdot & p_j(0, \dots, 0, 1, c, x) \ + \\
& \xzero{r_1} \cdots \xone{r_{i - 2}} \cdot \xzero{r_{i - 1}} \cdot & p_j(0, \dots, 1, 0, c, x) \ + \\
& \xzero{r_1} \cdots \xone{r_{i - 2}} \cdot \xone{r_{i - 1}} \cdot & p_j(0, \dots, 1, 1, c, x) \ + \\
& \vdots \\
& \xone{r_1} \cdots \xone{r_{i - 2}} \cdot \xone{r_{i - 1}} \cdot & p_j(1, \dots, 1, 1, c, x).
\end{array}

The main idea of our algorithms is to separate the challenge terms (extension-field elements) from the witness terms (base-field elements). This keeps the extension-field and base-field multiplications separate in the expression of $G_i(c, x)$, hence minimizing the costlier extension-field multiplications. With this premise, we now want to figure out how exactly to compute $G_i(c, x)$. We start by separating out the first challenge $r_1$ as

\begin{aligned}
p_j(r_1, \dots, r_{i - 1}, c, x) = \Big( \xzero{r_1} \cdot \underbrace{p_j(0, \dots, r_{i-1}, c, x)}_{=: \ \zero_j} + \xone{r_1} \cdot \underbrace{p_j(1, \dots, r_{i-1}, c, x)}_{=: \ \one_j} \Big).
\end{aligned}

Substituting $\zero_j , \one_j \in \mathbb{F}$ to simplify the expression of $G_i(c, x)$, we get

\begin{aligned}
G_i(c, x) =&\ \prod_{j=1}^{d} (\xzero{r_1} \cdot \zero_j + \xone{r_1} \cdot \one_j).
\end{aligned}

At this point, we need to figure out a way to multiply $d$ linear polynomials. Such a product will become the building block of our algorithm.

### The Building Block

Let $f_1(x), f_2(x), \dots, f_d(x)$ be linear polynomials defined over a field $\mathbb{F}$ as

\begin{aligned}
f_j(x) := \xzero{x} \cdot \zero_j + \xone{x} \cdot \one_j
\end{aligned}

for all $j \in \{1, \dots, d\}$. We are interested in computing the product polynomial $f(x)$ as

\begin{aligned}
f(x) &= \prod_{j=1}^{d} f_j(x) = \prod_{j=1}^{d} \left(\xzero{x} \cdot \zero_j + \xone{x} \cdot \one_j\right).
\end{aligned}

Using schoolbook multiplication to compute this product, we get

\begin{array}{lll}
f(x) =& \xzero{x}^d \cdot \xone{x}^0 \ \cdot & \left( \zero_1 \cdot \zero_2 \ \cdots \ \zero_{d - 1} \cdot \zero_d \right) \ + \\[2pt]
& \xzero{x}^{d - 1} \cdot \xone{x}^1 \ \cdot & \left( \zero_1 \cdot \zero_2 \ \cdots \ \zero_{d - 1} \cdot \one_d \right) \ + \\[2pt]
& \xzero{x}^{d - 1} \cdot \xone{x}^1 \ \cdot & \left( \zero_1 \cdot \zero_2 \ \cdots \ \one_{d - 1} \cdot \zero_d \right) \ + \\[2pt]
& \xzero{x}^{d - 2} \cdot \xone{x}^2 \ \cdot & \left( \zero_1 \cdot \zero_2 \ \cdots \ \one_{d - 1} \cdot \one_d \right) \ + \\[2pt]
& \vdots \\[2pt]
& \xzero{x}^{1} \cdot \xone{x}^{d - 1} \ \cdot & \left( \one_1 \cdot \one_2 \ \cdots \ \one_{d - 1} \cdot \zero_d \right) \ + \\[2pt]
& \underbrace{\xzero{x}^0 \cdot \xone{x}^d \ \cdot}_{\textsf{variable-products}} & \underbrace{\left( \one_1 \cdot \one_2 \ \cdots \ \one_{d - 1} \cdot \one_d \right)}_{\textsf{coefficient-products}}.
\end{array}

\begin{equation}
\label{eq:schoolbook}
\implies f(x) = \sum_{k=0}^{2^d-1} \xzero{x}^{d - H(k)} \cdot \xone{x}^{H(k)} \cdot \prod_{j=1}^{d} \left(\xzero{b_j(k)} \cdot \zero_j + \xone{b_j(k)} \cdot \one_j \right)
\tag{1}
\end{equation}

where $H(k)$ is the Hamming weight of $k$ and $b_j(k)$ denotes the $j$-th most-significant bit of $k$. The expression of $f(x)$ consists of $2^d$ coefficient-products. Each coefficient-product is a $d$-way product, so we need $(d - 1)$ multiplications to compute one coefficient-product. In total, we need $2^d \cdot (d - 1)$ multiplications with the coefficients. The question is: can we do better and reduce the number of coefficient-products?

Inspired by the [Toom-Cook multiplication](https://en.wikipedia.org/wiki/Toom%E2%80%93Cook_multiplication) algorithm for multiplying two integers, we extend it to efficiently multiply $d$ linear polynomials. The main idea behind Toom-Cook-based polynomial multiplication is to multiply polynomials in their evaluation forms and then recover the monomial form of the resulting polynomial. Since the degree of $f(x)$ is $d$, we first need to choose an evaluation set[^eval] of size $(d + 1)$ to compute the evaluation forms of the polynomials $f_1, \dots, f_d$. The choice of the evaluation set is arbitrary as long as its size is $(d + 1)$. In this case, we choose it to be

$$
E = \left\{0, 1, -1, 2, -2, \dots, \left\lfloor\frac{d}{2}\right\rfloor, \ \infty \right\},
$$

to ensure that the magnitude of the integers in the evaluation set is minimized. We will explain the reason for this choice later. For example, when $d = 5$, we have

$$
E_5 := \{0, 1, -1, 2, -2, \infty \}.
$$

By convention, the evaluation of a polynomial at $\infty$ equals the coefficient of its highest-degree term. For example, if $g(x) = 7x^2 - 3x + 4$, then $g(\infty) = 7$. Returning to our product polynomial $f(x)$, for each evaluation point $e \in E$ we evaluate $f$ as

$$
f(e) = \prod_{j=1}^{d} f_j(e) = \prod_{j=1}^{d} \left( \xzero{e} \cdot \zero_j + \xone{e} \cdot \one_j \right).
$$

Notice that the coefficients are multiplied by the constants $\xzero{e}$ and $\xone{e}$. It is crucial that this operation imposes negligible overhead on the prover. This multiplication can indeed be considered negligible if the magnitude of the integer $\xone{e}$ is much smaller than the size of the field $\mathbb{F}$. Hence, the choice of minimizing the magnitude of the points in the evaluation set is pivotal for optimizing performance in practical implementations.
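To make this concrete, consider the smallest case $d = 2$ (a worked example of ours, not from the paper), with the evaluation set $E = \{0, 1, \infty\}$ of the required size $d + 1 = 3$. Since $f_j(x) = \zero_j + (\one_j - \zero_j) \cdot \xone{x}$ has leading coefficient $\one_j - \zero_j$, the three evaluations of $f$ are

$$
f(0) = \zero_1 \cdot \zero_2, \qquad
f(1) = \one_1 \cdot \one_2, \qquad
f(\infty) = (\one_1 - \zero_1) \cdot (\one_2 - \zero_2).
$$

In the monomial notation of the Karatsuba sketch from the first section, $a_j = \zero_j$ and $b_j = \one_j - \zero_j$, so these are exactly the same three products, compared to the $2^d = 4$ coefficient-products of expression $\eqref{eq:schoolbook}$.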
Computing $f(e)$ requires a $d$-way product and hence needs $(d - 1)$ multiplications in the field $\mathbb{F}$. Thus, we require a total of $(d + 1) \cdot (d - 1)$ multiplications to compute the evaluations of $f$ over $E$. To recover the monomial form $f(x) = c_0 + c_1 \cdot \xone{x} + \dots + c_d \cdot \xone{x}^d$, we write

\begin{aligned}
f(e_0) &= c_0 + c_1 \cdot e_0 + \cdots + c_d \cdot e_0^d, \\
f(e_1) &= c_0 + c_1 \cdot e_1 + \cdots + c_d \cdot e_1^d, \\
\vdots \\
f(e_d) &= c_0 + c_1 \cdot e_d + \cdots + c_d \cdot e_d^d,
\end{aligned}

for the evaluation points $e_0, \dots, e_d \in E$. We can rewrite this as

\begin{aligned}
\begin{bmatrix} f(e_0) \\[2pt] f(e_1) \\[2pt] \vdots \\[2pt] f(e_d) \end{bmatrix} =
\underbrace{\begin{bmatrix} \ 1 & e_0 & e_0^2 & \ldots & e_0^d \ \\ \ 1 & e_1 & e_1^2 & \ldots & e_1^d \ \\ \vdots & \vdots & \ldots & \vdots \\ \ 1 & e_d & e_d^2 & \ldots & e_d^d \ \end{bmatrix}}_{\textsf{evaluation matrix } V}
\begin{bmatrix} c_0 \\[2pt] c_1 \\[2pt] \vdots \\[2pt] c_d \end{bmatrix}.
\end{aligned}

Therefore, given its evaluations over $E$, the coefficients of $f(x)$ can be computed as

\begin{aligned}
\begin{bmatrix} c_0 \\[2pt] c_1 \\[2pt] \vdots \\[2pt] c_d \end{bmatrix} =
\underbrace{ \begin{bmatrix} \ 1 & e_0 & e_0^2 & \ldots & e_0^d \ \\ \ 1 & e_1 & e_1^2 & \ldots & e_1^d \ \\ \vdots & \vdots & \ldots & \vdots \\ \ 1 & e_d & e_d^2 & \ldots & e_d^d \ \end{bmatrix}^{-1}}_{\textsf{interpolation matrix } I \ := \ V^{-1}}
\begin{bmatrix} f(e_0) \\[2pt] f(e_1) \\[2pt] \vdots \\[2pt] f(e_d) \end{bmatrix}.
\end{aligned}

We can rewrite the polynomial $f(x)$ using its coefficients as

\begin{aligned}
f(\xone{x}) := \begin{bmatrix} \xone{1} & \xone{x} & \ldots & \xone{x}^d \end{bmatrix}
\begin{bmatrix} c_0 \\[2pt] c_1 \\[2pt] \vdots \\[2pt] c_d \end{bmatrix} =
\begin{bmatrix} \xone{1} & \xone{x} & \ldots & \xone{x}^d \end{bmatrix}
\begin{bmatrix} \ 1 & e_0 & e_0^2 & \ldots & e_0^d \ \\ \ 1 & e_1 & e_1^2 & \ldots & e_1^d \ \\ \vdots & \vdots & \ldots & \vdots \\ \ 1 & e_d & e_d^2 & \ldots & e_d^d \ \end{bmatrix}^{-1}
\begin{bmatrix} f(e_0) \\[2pt] f(e_1) \\[2pt] \vdots \\[2pt] f(e_d) \end{bmatrix}.
\end{aligned}

Note that the interpolation matrix depends only on the evaluation set $E$. Given the interpolation matrix, we define the interpolation maps as

\begin{aligned}
\begin{bmatrix} L_0(\xone{x}) & L_1(\xone{x}) & \ldots & L_d(\xone{x}) \end{bmatrix} :=
\begin{bmatrix} \xone{1} & \xone{x} & \ldots & \xone{x}^d \end{bmatrix}
\begin{bmatrix} \ 1 & e_0 & e_0^2 & \ldots & e_0^d \ \\ \ 1 & e_1 & e_1^2 & \ldots & e_1^d \ \\ \vdots & \vdots & \ldots & \vdots \\ \ 1 & e_d & e_d^2 & \ldots & e_d^d \ \end{bmatrix}^{-1}.
\end{aligned}

Note that the interpolation maps depend only on the evaluation set $E$ and the degree $d$, so they can be pre-computed and cached. Thus, we can write $f(\xone{x})$ as a linear combination of its evaluations over the set $E$:

\begin{equation}
f(\xone{x}) := \sum_{k=0}^{d} L_k(\xone{x}) \cdot f(e_k),
\end{equation}

\begin{equation}
\label{eq:toomcook}
\implies f(\xone{x}) = \sum_{k=0}^{d} L_k(\xone{x}) \cdot \prod_{j=1}^{d} \left( \xzero{e_k} \cdot \zero_j + \xone{e_k} \cdot \one_j \right).
\tag{2}
\end{equation}

Notice the similarity between expressions $\eqref{eq:schoolbook}$ and $\eqref{eq:toomcook}$, except that the latter consists of only $(d + 1)$ coefficient-products while the former consists of $2^d$. This reduction in the number of coefficient-products from $\mathcal{O}(2^d)$ to $\mathcal{O}(d)$ is the main advantage of Toom-Cook-based multiplication over schoolbook multiplication.
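Continuing the same $d = 2$ example (ours, with $E = \{0, 1, \infty\}$, and with the row of $V$ for the point $\infty$ taken to be $(0, 0, 1)$ per the convention above), the matrices are small enough to write out in full:

$$
V = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix},
\qquad
V^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ -1 & 1 & -1 \\ 0 & 0 & 1 \end{bmatrix},
$$

so the interpolation maps are $L_0(\xone{x}) = 1 - \xone{x}$, $L_1(\xone{x}) = \xone{x}$ and $L_2(\xone{x}) = \xone{x}^2 - \xone{x}$, and expression $\eqref{eq:toomcook}$ becomes

$$
f(\xone{x}) = (1 - \xone{x}) \cdot \zero_1 \zero_2 + \xone{x} \cdot \one_1 \one_2 + (\xone{x}^2 - \xone{x}) \cdot (\one_1 - \zero_1)(\one_2 - \zero_2),
$$

which recovers the quadratic product with three coefficient-products instead of four.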
We can write a general expression of $f(x)$ that encompasses both the schoolbook and Toom-Cook multiplication.

\begin{equation}
\label{eq:common}
f(\xone{x}) := \sum_{k=0}^{M} V_k(\xone{x}) \cdot \prod_{j=1}^{d} \left( \xzero{\epsilon_k} \cdot \zero_j + \xone{\epsilon_k} \cdot \one_j \right).
\tag{3}
\end{equation}

| | Schoolbook | Toom-Cook |
| -------- | -------- | -------- |
| Range $M$ | $2^d - 1$ | $d$ |
| Variable term $V_k(\xone{x})$ | $\xzero{x}^{d - H(k)} \cdot \xone{x}^{H(k)}$ | $L_k(\xone{x})$ (interpolation map) |
| Evaluation point $\xone{\epsilon_k}$ | $\xone{b_j(k)} \in \{0, 1\}$ | $\xone{e_k} \in E$ |

### Putting Everything Together

Let us refocus on computing $G_i(c, x)$, given as

\begin{aligned}
G_i(c, x) =&\ \prod_{j=1}^{d} (\xzero{r_1} \cdot \zero_{j} + \xone{r_1} \cdot \one_{j}) \quad \text{s.t.} \\[4pt]
\zero_{j} :=&\ p_j(\color{red}{0}, r_2, \dots, r_{i-1}, c, x), \\
\one_{j} :=&\ p_j(\color{lightgreen}{1}, r_2, \dots, r_{i-1}, c, x).
\end{aligned}

Using the general expression $\eqref{eq:common}$ of $f(\xone{x})$ with $\xone{x} = \xone{r_1}$, we get

\begin{aligned}
G_i(c, x) =&\ \sum_{k_1=0}^{M} V_{k_1}(\xone{r_1}) \cdot \prod_{j=1}^{d} \left( \xzero{\epsilon_{k_1}} \cdot \zero_{j} + \xone{\epsilon_{k_1}} \cdot \one_{j} \right).
\end{aligned}

Notice what we have done here: we have carefully separated out the challenge $\xone{r_1}$, which creates a new product term that _does not_ contain the challenge $\xone{r_1}$. Let us concentrate on the new product term:

\begin{aligned}
\left( \xzero{\epsilon_{k_1}} \cdot \zero_{j} + \xone{\epsilon_{k_1}} \cdot \one_{j} \right) = \xzero{\epsilon_{k_1}} \cdot &\ p_j(\color{red}{0}, r_2, \dots, r_{i-1}, c, x) \ + \\ \xone{\epsilon_{k_1}} \cdot &\ p_j(\color{lightgreen}{1}, r_2, \dots, r_{i-1}, c, x).
\end{aligned}

Since $p_j$ is multilinear, we can "absorb" the $\xone{\epsilon_{k_1}}$ inside $p_j$, that is:

\begin{aligned}
\left( \xzero{\epsilon_{k_1}} \cdot \zero_{j} + \xone{\epsilon_{k_1}} \cdot \one_{j} \right) = p_j(\xone{\epsilon_{k_1}}, r_2, \dots, r_{i-1}, c, x).
\end{aligned}

Plugging this back into the expression of $G_i(c, x)$, we get

\begin{aligned}
G_i(c, x) =&\ \sum_{k_1=0}^{M} V_{k_1}(\xone{r_1}) \cdot \prod_{j=1}^{d} p_j(\xone{\epsilon_{k_1}}, r_2, \dots, r_{i-1}, c, x).
\end{aligned}

Now we have a new product term that does not contain $\xone{r_1}$, so we can apply the same strategy to the subsequent challenges.
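Operationally, "absorbing" $\xone{\epsilon_{k_1}}$ into $p_j$ is simply the standard fold of the multilinear witness table at the point $\xone{\epsilon_{k_1}}$. Below is a minimal sketch of that operation (our own illustration: `i64` stands in for base-field elements, the first variable is stored as the most-significant index bit, and the function name is ours); the actual implementation linked in the conclusion works over binary tower fields.

```rust
/// Given the evaluations of a multilinear polynomial p over {0,1}^l
/// (indexed so that the FIRST variable is the most-significant bit),
/// return the evaluations of p(eps, .) over {0,1}^(l-1):
///     p(eps, y) = (1 - eps) * p(0, y) + eps * p(1, y).
/// For the small points eps in E, the scalings by eps and (1 - eps) are cheap.
fn fold_first_variable(table: &[i64], eps: i64) -> Vec<i64> {
    let half = table.len() / 2;
    let (p0, p1) = table.split_at(half); // p(0, .) and p(1, .)
    p0.iter()
        .zip(p1.iter())
        .map(|(&lo, &hi)| (1 - eps) * lo + eps * hi)
        .collect()
}

fn main() {
    // p(x1, x2) with evaluations [p(0,0), p(0,1), p(1,0), p(1,1)]
    let p = [2, 5, 7, 11];
    // p(-1, .) = 2 * p(0, .) - p(1, .)
    assert_eq!(fold_first_variable(&p, -1), vec![-3, -1]);
}
```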
We show how this works out for $\xone{r_2}$ and $\xone{r_3}$:

\begin{aligned}
G_i(c, x) =&\ \prod_{j=1}^{d} p_j(r_1, r_2, r_3, \dots, r_{i-1}, c, x) \\
=&\ \sum_{k_1=0}^{M} V_{k_1}(\xone{r_1}) \cdot \prod_{j=1}^{d} p_j(\xone{\epsilon_{k_1}}, r_2, r_3, \dots, r_{i-1}, c, x) \\
=&\ \sum_{k_1=0}^{M} V_{k_1}(\xone{r_1}) \sum_{k_2=0}^{M} V_{k_2}(\xone{r_2}) \cdot \prod_{j=1}^{d} p_j(\xone{\epsilon_{k_1}}, \xone{\epsilon_{k_2}}, r_3, \dots, r_{i-1}, c, x) \\
=&\ \sum_{k_1=0}^{M} V_{k_1}(\xone{r_1}) \sum_{k_2=0}^{M} V_{k_2}(\xone{r_2}) \sum_{k_3=0}^{M} V_{k_3}(\xone{r_3}) \cdot \prod_{j=1}^{d} p_j(\xone{\epsilon_{k_1}}, \xone{\epsilon_{k_2}}, \xone{\epsilon_{k_3}}, r_4, \dots, r_{i-1}, c, x).
\end{aligned}

Therefore, we can write the expression of $G_i(c, x)$ after separating all challenges as

\begin{aligned}
G_i(c, x) =&\ \sum_{k_1=0}^{M} \sum_{k_2=0}^{M} \ldots \sum_{k_{i - 1}=0}^{M} \underbrace{\left( \prod_{m = 1}^{i - 1} V_{k_m}(\xone{r_m}) \right)}_{\textsf{challenge-terms}} \cdot \underbrace{ \left( \prod_{j=1}^{d} p_j(\xone{\epsilon_{k_1}}, \xone{\epsilon_{k_2}}, \dots, \xone{\epsilon_{k_{i-1}}}, c, x) \right)}_{\textsf{witness-terms}}.
\end{aligned}

Note that the witness terms require only base-field multiplications, while the challenge terms require only extension-field multiplications. Here is a summary of the costs in round $i$.

| Type of mult. | # mults. for given $(c, x)$ | Total # mults. |
| -------- | -------- | -------- |
| base-base | $(d - 1) \cdot (M + 1)^{i - 1}$ | $(d - 1) \cdot (M + 1)^{i} \cdot 2^{\ell - i}$ |
| base-extension | $(d + 1)^{i - 1}$ | $(d + 1)^{i - 1}$ |
| extension-extension | $(d - 1) + (d + 1)^{i - 1}$ | $(d - 1) + (d + 1)^{i - 1}$ |

From the expression of $G_i(c, x)$, the number of base-extension multiplications should have been $(M + 1)^{i - 1}$. This equals $(d + 1)^{i - 1}$ for the Toom-Cook approach, as $M = d$. For the schoolbook approach, $M \neq d$, yet we still claim that the number of base-extension multiplications is $(d + 1)^{i - 1}$. Can you figure out why?
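To show how the separation into witness-terms and challenge-terms plays out in code, here is a structural sketch of computing the round-$i$ prover message $s_i(c) = \sum_{x} G_i(c, x)$ in the Toom-Cook case ($M = d$). This is our own illustration under simplifying assumptions, not the code from our repository: `i64` stands in for both base-field and extension-field values, the witness-products and challenge-terms are assumed to be already accumulated, and all names are ours.

```rust
/// Round-i prover message s_i(c) for c = 0..=d, given:
///  - `witness`: witness-products W[t][c][x] = prod_j p_j(eps_{k_1}, ..., eps_{k_{i-1}}, c, x),
///               where t is the flat index of the tuple (k_1, ..., k_{i-1}) in (d+1)^(i-1);
///               these are the base-field products,
///  - `challenge_terms`: Lambda[t] = prod_m V_{k_m}(r_m), the extension-field values.
fn round_polynomial(
    witness: &[Vec<Vec<i64>>],  // (d+1)^(i-1) x (d+1) x 2^(l-i)
    challenge_terms: &[i64],    // (d+1)^(i-1)
    d: usize,
) -> Vec<i64> {
    let mut s = vec![0i64; d + 1];
    for c in 0..=d {
        for (t, lambda) in challenge_terms.iter().enumerate() {
            // Base-field additions only: sum the witness-products over the Boolean cube.
            let inner: i64 = witness[t][c].iter().sum();
            // The only multiplications involving extension-field values are these
            // products of a challenge-term with a base-field sum.
            s[c] += lambda * inner;
        }
    }
    s
}

fn main() {
    // Toy shapes for d = 2, i = 2: 3 tuples (k_1), 3 values of c, 2 values of x.
    let witness = vec![vec![vec![1, 2], vec![3, 4], vec![5, 6]]; 3];
    let challenge_terms = vec![7, 11, 13];
    println!("{:?}", round_polynomial(&witness, &challenge_terms, 2));
}
```

The point of this layout is that the inner sums touch only base-field values, while extension-field values enter only through the final products with the challenge-terms, which is exactly the separation the algorithms are designed to exploit.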
### Conclusion

Building upon Justin Thaler's pioneering work on optimizing the sumcheck prover for small fields, we have developed a novel algorithm that delivers significant advancements in both computational efficiency and memory usage. We believe our algorithm effectively addresses a critical bottleneck in binary-field-based SNARKs: the sumcheck protocol.

We have implemented both of our optimized algorithms in Rust, and the source code is publicly available [here](https://github.com/ingonyama-zk/smallfield-super-sumcheck). To facilitate comparison with existing methods, we have also implemented algorithms from [[Thaler13](https://eprint.iacr.org/2013/351)][^t13] and [[CTY11](https://www.vldb.org/pvldb/vol5/p025_grahamcormode_vldb2012.pdf)][^cty11]. We welcome contributions to enhance the robustness and feature completeness of our implementation. Please get in touch with us at `suyash@ingonyama.com` for questions, feedback or suggestions.

**Acknowledgments**: This work would not have been possible without the invaluable help and insights from Justin Thaler. We also extend our gratitude to Omer Shlomovits and Yuval Domb for their feedback on earlier versions of this blog post.

[^a]: Benjamin E. Diamond and Jim Posen. Succinct arguments over towers of binary fields. Cryptology ePrint Archive, Paper [2023/1784](https://eprint.iacr.org/2023/1784), 2023.

[^eval]: The choice of the evaluation set is driven by the intention of minimizing overhead costs. In this case, we want the magnitude of the points in the evaluation set to be minimal. This is because our polynomials are of the form $g(x) = \xzero{x} \cdot \zero + \xone{x} \cdot \one$ and we want the terms $\xzero{x}$ and $\xone{x}$ to be "small" integers.

[^t13]: Justin Thaler. Time-optimal interactive proofs for circuit evaluation. In CRYPTO, 2013.

[^cty11]: Graham Cormode, Justin Thaler, and Ke Yi. Verifying computations with streaming interactive proofs. Proc. VLDB Endow., 5(1):25–36, 2011.

[^d3]: In the context of SNARKs, cubic sumcheck appears most often. For example, in the Jolt zkVM, we notice degree $d = 3$ in several places ([1](https://github.com/a16z/jolt/blob/5d1b57cb1dac6b675e90e5a1db1271c44366a6b4/jolt-core/src/subprotocols/grand_product_quarks.rs#L273-L280), [2](https://github.com/a16z/jolt/blob/5d1b57cb1dac6b675e90e5a1db1271c44366a6b4/jolt-core/src/jolt/vm/read_write_memory.rs#L1561-L1569), [3](https://github.com/a16z/jolt/blob/5d1b57cb1dac6b675e90e5a1db1271c44366a6b4/jolt-core/src/subprotocols/grand_product.rs#L1394-L1395), [4](https://github.com/a16z/jolt/blob/5d1b57cb1dac6b675e90e5a1db1271c44366a6b4/jolt-core/src/subprotocols/grand_product.rs#L215-L216), [5](https://github.com/a16z/jolt/blob/5d1b57cb1dac6b675e90e5a1db1271c44366a6b4/jolt-core/src/subprotocols/sumcheck.rs#L28-L64)), as well as in proof systems like Spartan ([6](https://github.com/a16z/jolt/blob/5d1b57cb1dac6b675e90e5a1db1271c44366a6b4/jolt-core/src/r1cs/spartan.rs#L237-L247)) and Marlin.

[^bdt24]: Suyash Bagad, Yuval Domb and Justin Thaler. The Sum-Check Protocol over Fields of Small Characteristic. Cryptology ePrint Archive, Paper [2024/1046](https://eprint.iacr.org/2024/1046), 2024.

[^gru24]: Angus Gruen. Some Improvements for the PIOP for ZeroCheck. Cryptology ePrint Archive, Paper [2024/108](https://eprint.iacr.org/2024/108), 2024.