Distributed GKR

# Distributed GKR $\newcommand{\add}{\textsf{add}} \newcommand{\mul}{\textsf{mul}} \newcommand{\FF}{\mathbb{F}} \newcommand{\P}{\mathcal{P}} \newcommand{\V}{\mathcal{V}} \newcommand{\eq}{\tilde{\mathsf{eq}}}$ ## References - [GKR Review](https://hackmd.io/WUh-NzVNRYeXdEamoQN6PQ) - [Virgo](https://eprint.iacr.org/2019/1482.pdf) GKR + ZK via novel PCS - [zkBridge / diVirgo](https://arxiv.org/abs/2210.00264) Distributed GKR ## TLDR - For data parallel circuits, the GKR prover can be distributed across $N$ machines with linear speed-up - Main idea: distributed sum-check - No additional communication between prover/verifier - Same proof size (assuming no use of PCS) - Significant communication between provers ## Data Parallel Circuits - Circuit $C$ is a union of sub-circuits $C_1, \ldots C_N$ (assume $N=2^n$) - No cross-circuit connections (Gate in layer $i$ of sub-circuit $C_j$ is connected only to gates in layer $i+1$ of sub-circuit $C_j$) - Sub-circuits do not have to be identical, but must have equal size $S_i =2^{s_i}$ at each layer (otherwise can be padded to have equal size) - Goal: Distribute GKR prover across $N$ machines ![image](https://hackmd.io/_uploads/BJPQuKp2A.png) ### Decomposing Indices (Note: To simplify notation we'll index starting from 1 throughout) - Observe: Layer $i$ of $C$ consists of $S_i \times N = 2^{s_i+n}$ gates - Say $y = (y_1, \ldots y_{s_i+n})$ indexes a gate in layer $i$ of $C$ using $s_i+n$ variables - Decompose: - First $s_i$ variables $(y_1, \ldots y_{s_i})$ indexing *gate within a sub-circuit* using $s_i$ variables - Last $n$ variables $(y_{s_i+1}, \ldots y_{s_i+n})$ indexing a *sub-circuit within* $C$ using $n$ variables ### Decomposing Multilinear Extension Polynomials **Claim:** Witness polynomial for layer $i$ of $C$ decomposes into sum of witness polynomials for layer $i$ of $C_1, \ldots C_N$ Recall the basis polynomials (say $x \in \FF^k$ fixed, $Y$ consists of $k$ variables) $$\eq(x|Y) = \prod_{i=1}^k x_i Y_i + (1-x_i)(1-Y_i)$$ Letting $x$ vary over the boolean hypercube $B_k$, these form a basis for the multilinear polynomials in $k$ variables. We use them as a "Lagrange basis" to form multilinear extensions: For witness values $W(1), \ldots W(2^k)$, the MLE is $$\tilde W(Y) = \sum_{x \in B_k} W(x) \eq(x | Y)$$ Observe that these basis polynomials decompose as products: $$\eq(x_1, \ldots x_{s+n} | Y_1, \ldots Y_{s+n}) = \eq(x_1, \ldots x_s| Y_1, \ldots Y_s) \eq(x_{s+1}, \ldots x_{s+n} | Y_{s+1}, \ldots Y_{s+n})$$ Therefore $$\begin{align*} \tilde W(Y) &= \sum_{x_1, \ldots x_{s+n}} W(x_1, \ldots x_{s+n}) \eq(x_1, \ldots x_{s+n} | Y) \\ &= \sum_{x_{s+1}, \ldots x_{s+n}} \bigg( \sum_{x_1, \ldots x_s} W(x_1, \ldots x_{s+n}) \eq(x_1, \ldots x_{s} | Y_1, \ldots Y_s) \bigg) \eq(x_{s+1}, \ldots x_{s+n} | Y_{s+1}, \ldots Y_{s+n}) \\ &= \sum_{\text{sub-circuit indices}} \quad \bigg( \text{witness poly of sub-circuit} \bigg) (\text{basis polynomial}) \\ &= \sum_{j=1}^N W^{(j)}(Y_1, \ldots Y_s) \eq([j] | Y_{s+1}, \ldots Y_{s+n}) \end{align*}$$ where we've defined the witness polynomial for the $j$th sub-circuit as $$ W^{(j)}(Y_1, \ldots Y_s) = \sum_{x_1, \ldots x_s} W(x_1, \ldots x_{s}, [j]) \eq(x_1, \ldots x_{s} | Y_1, \ldots Y_s)$$ ($[j]$ denotes the binary decomposition of $j$ in $n$ bits). ### Decomposing Wiring Polynomials TLDR: Because the circuit is data-parallel (no wiring between different sub-circuits), the $\tilde\add$, $\tilde\mul$ polynomials decompose by sub-circuit as well. More precisely say the $\tilde\mul$ polynomial for layer $i$ of the full circuit $C$ is $$\tilde\mul(X,Y,Z) = \sum_{x_1, \ldots x_{s_i+n}} \sum_{y_1, \ldots y_{s_{i+1}+n}} \sum_{z_1, \ldots z_{s_{i+1}+n}} \mul(x,y,z) \eq(x| X) \eq(y | Y) \eq(z | Z)$$ Because $C$ is data-parallel, we know the only potentially non-zero terms $\mul(x,y,z)$ are those for which $(x,y,z)$ label gates within the same sub-circuit. Therefore the above expression has redundant variables: $X_{s_i+1}, \ldots X_{s_i+n}$, $Y_{s_{i+1}+1}, \ldots Y_{s_{i+1}+n}$, and $Z_{s_{i+1}+1}, \ldots Z_{s_{i+1}+n}$ are all indexing sub-circuits. We can discard $Y_{s_{i+1}+1}, \ldots Y_{s_{i+1}+n}$ and $Z_{s_{i+1}+1}, \ldots Z_{s_{i+1}+n}$ and write $$\begin{align*} \tilde\mul(X, Y, Z) &= \tilde\mul(X_1, \ldots X_{s_i}, X_{s_i+1}, \ldots X_{s_i+n}, Y_1, \ldots Y_{s_{i+1}}, Z_1, \ldots Z_{s_{i+1}} ) \\ &= \sum_{j=1}^N \tilde\mul^{(j)}(X_1, \ldots X_{s_i}, Y,Z) \eq([j]| X_{s_i+1}, \ldots X_{s_i+n}) \\ &= \sum_{\text{sub-circuit indices}} (\text{mul poly. of sub-circuit})(\text{basis polynomial}) \end{align*}$$ That is, we use a multilinear extension of $\mul$ in $s_i + 2*s_{i+1} + n$ variables - saves $2n$ variables compared to a non-data-parallel circuit of same size as $C$ - structure as a sum over sub-circuits will allow for distributed sum-check ### GKR Polynomial GKR at layer $i$ applies sum-check to the polynomial $$f_{r_i}(Y,Z) = \tilde\add_i(r_i,Y,Z) \bigg(\tilde W_{i+1}(Y) + \tilde W_{i+1}(Z) \bigg) + \tilde\mul_i(r_i,Y,Z) \bigg( \tilde W_{i+1}(Y) \cdot \tilde W_{i+1}(Z) \bigg)$$ Using the data-parallel structure of $C$ and a similar argument, we can decompose this too: $$f_{r_i}(Y_1, \ldots Y_{s_{i+1}+n},Z_1, \ldots Z_{s_{i+1}}) = \sum_{j=1}^N f_{r_i}^{(j)}(Y_1, \ldots Y_{s_{i+1}},Z_1, \ldots Z_{s_{i+1}})\eq([j]| Y_{s_{i+1}+1}, \ldots Y_{s_{i+1}+n})$$ - Q: Would this work for a term like $\tilde{POW5}(Y) W(Y)^5$ ? ## Distributed Sum-Check In a given GKR round, prover's claim is of the form $$m_{i} = \sum_{x_1, \ldots x_{s+n}} f(x_1, \ldots x_{s+n})$$ This will be proven using sum-check in $s+n$ rounds. When $f$ is of the form $$\begin{align*}f(Y_1, \ldots Y_{s+n}) &= \sum_{j=1}^N f^{(j)}(Y_1, \ldots Y_s) \eq([j] | Y_{s+1}, \ldots Y_{s+n}) \\ &= \sum_{\text{sub-circuit indices}} (\text{sub-circuit poly})(\text{basis poly}) \end{align*}$$ we can distribute the work across $N$ prover by writing the sum as $$m_{i} = \sum_{j=1}^N \sum_{x_1, \ldots x_{s}} f^{(j)}(x_1, \ldots x_{s_i})$$ Each prover $\P_i$ holds the $s$-variate polynomial $f^{(i)}$. A single prover, say $\P_1$, will also act as coordinator. ### $s$ Distributed Rounds The first $s$ rounds of sum-check go as follows: 1. Each $\P_j$ computes the univariate polynomial $$f^{(j)}_1(Y) = \sum_{x_2, \ldots x_s} f^{(j)}(Y, x_2, \ldots x_s)$$ and sends $f^{(j)}_1$ to $\P_1$ (coordinator) 2. $\P_1$ computes $f_1(Y) = \sum_1^N f^{(j)}_1(Y)$, sends to $\V$ 3. $\V$ samples $r_1 \in \FF$, sends to $\P_1$ 4. $\P_1$ sends $r_1$ to each $\P_j$ 5. Each $\P_j$ computes $$f^{(j)}_2(Y) = \sum_{x_3, \ldots x_s} f^{(j)}(r_1, Y, x_3, \ldots x_s)$$ and sends $f^{(j)}_2$ to $\P_1$ (coordinator) 6. [continue round-by-round] 7. $\V$ samples $r_s \in \FF$ and sends to $\P_1$ 8. $\P_1$ sends $r_s$ to each $\P_j$ 9. $\P_j$ computes $f^{(j)}(r_1, \ldots r_s)$, sends to $\P_1$ ### $n$ Ordinary Rounds At this point the first $s$ of $s+n$ sum-check rounds have been performed and $\P_1$ holds the values $f^{(j)}(r_1, \ldots r_s)$ for $j=1,\ldots N$. The final $n$ rounds will be carried out between $\P_1$ and $\V$ using the polynomial $$f(r_1, \ldots r_s, Y_{s+1}, \ldots Y_{s+n}) = \sum_{j=1}^N f^{(j)}(r_1, \ldots r_s) \eq([j] | Y_{s+1}, \ldots Y_{s+n}) $$ # Expander SIMD Expander uses SIMD field implementations to prove/verify multiple instances of a circuit in parallel. They use some similar ideas to those above to achieve some form of aggregation. This is an attempt to write out the protocol they're effectively using. In `fn sumcheck_prove_gkr_layer`, they have a few sumcheck rounds in between the `x` and `y` sumcheck rounds (described by Libra as Phases 1 and 2). ```rust=36 helper.prepare_simd(); helper.prepare_x_vals(); for i_var in 0..helper.input_var_num { let evals = helper.poly_evals_at_rx(i_var, 2); let r = transcript_io::<C>(&evals, transcript); helper.receive_rx(i_var, r); } helper.prepare_simd_var_vals(); for i_var in 0..helper.simd_var_num { let evals = helper.poly_evals_at_r_simd_var(i_var, 2); let r = transcript_io::<C>(&evals, transcript); helper.receive_r_simd_var(i_var, r); } let vx_claim = helper.vx_claim(); transcript.append_challenge_f::<C>(&vx_claim); helper.prepare_y_vals(); for i_var in 0..helper.input_var_num { let evals = helper.poly_evals_at_ry(i_var, 2); let r = transcript_io::<C>(&evals, transcript); helper.receive_ry(i_var, r); } ``` Say $N$ is the SIMD packing width (16 for M31 and BabyBear). In `helper.poly_evals_at_rx(i_var, 2)`, the polynomial evaluation first produces $N$ polynomials: ```rust=399 let [p0, p1, p2] = self.xy_helper.poly_eval_at::<C>( var_idx, degree, &mut self.sp.v_evals, &mut self.sp.hg_evals, &self.layer.input_vals, &self.sp.gate_exists_5, ); ``` So far this is just the prover doing $N$ parallel sum-check rounds. But these $N$ polynomials are not returned, they are instead combined and evaluated: ```rust=408 [ Self::unpack_and_combine(p0, &self.sp.eq_evals_at_r_simd0), Self::unpack_and_combine(p1, &self.sp.eq_evals_at_r_simd0), Self::unpack_and_combine(p2, &self.sp.eq_evals_at_r_simd0), ] ``` where ```rust=326 pub(crate) fn unpack_and_combine(p: C::Field, coef: &[C::ChallengeField]) -> C::ChallengeField { let p_unpacked = p.unpack(); p_unpacked .into_iter() .zip(coef) .map(|(p_i, coef_i)| p_i * coef_i) .sum() } ``` The coefficients being used, `self.sp.eq_evals_at_r_simd0` are the evaluations $\eq(r_{simd}, b)$ over the boolean hypercube of size $n = \log_2 N$. So that means the polynomial returned by `helper.poly_evals_at_rx` is actually $$p(X) = \sum_{i=0}^{N-1} \eq(r_{simd},[i]) p^{(i)}(X)$$ > Note that $r_{simd}$ is passed in as an argument to this sum-check round, meaning it was computed over the course of the *previous* sum-check round, or squeezed from the initialized transcript in the first round. That suggests that in the background a polynomial in $s+n$ variables has been defined (where $s= \log_2 S$ is the number of $X$-variables) $$f(Z_0, \ldots Z_{n-1}, X_0, \ldots X_{s-1}) = \sum_{i=0}^{N-1} \eq(Z_0, \ldots Z_{n-1},[i]) f^{(i)}(X)$$ exactly as above in the distributed sum-check protocol. If we include the $Y_0,\ldots Y_{s-1}$ variables, we're doing sum-check on a polynomial that looks like $$f(Z_0, \ldots Z_{n-1}, X_0, \ldots X_{s-1},Y_0,\ldots Y_{s-1}) = \sum_{i=0}^{N-1} \eq(Z_0, \ldots Z_{n-1},[i]) f^{(i)}(X,Y)$$ In this case each polynomial $f^{(i)}$ has the form $f_1(g,X,Y)f_2(X)f_3(Y)$, and the Libra 2-phase trick is to sum over $f^{(i)}$ as $$\begin{align} \sum_{x,y} f^{(i)}(g,X,Y) & = \sum_x f_2(X) \bigg( \sum_y f_1(g,X,Y)f_3(Y) \bigg) \\ &= \sum_x f_2(X) h_g(X) \\ \text{where}\quad h_g(X) &:= \sum_y f_1(g,X,Y)f_3(Y) \end{align}$$ ### SIMD Arithmetization Rewriting this with the $n$ extra $Z$-variables to sum over SIMD copies: > Should we use a different $g^{(i)}$ in each SIMD copy? $$\begin{align} f(g,Z,X,Y) &:= \sum_{i=0}^{N-1} \eq(Z_0, \ldots Z_{n-1},[i]) f^{(i)}(g,X,Y) \\ \\ \sum_{z,x,y} f(g,Z,X,Y) &= \sum_z \eq(Z,[i]) \sum_{x} f_2^{(i)}(X) \bigg( \sum_y f_1^{(i)}(g,X,Y)f_3^{(i)}(Y) \bigg) \\ &= \sum_z \eq(Z,[i]) \sum_{x} f_2^{(i)}(X) h^{(i)}_g(X) \end{align}$$ When written like this it makes more sense to see the $\sum_x$ followed by the $\sum_z$ over the SIMD variables. Note that because we're using SIMD to evaluate $h^{(i)}_g(X)$ we would need to require all the $f_1^{(i)}, f_2^{(i)}$ polynomials to be equal, i.e. we're using identical sub-circuits. Therefore we may write $h^{(i)}_g = h_g$. So the $s+n$ rounds of sum-check reduce this down to a single claim $$h_g(r_0, \ldots r_{s-1}) = \texttt{vx_claim}$$ This gets proved with sum-check rounds over the $y$-variables, and again there's the `unpack_and_combine` thing going on, which suggests that actually the definition of $h_g$ should be again a sum over $N$ terms... OTOH in `prepare_y_vals` we are preparing the $f_3$ evaluations the prover will use in the $y$-sum-check rounds and that is a "fake SIMD" operation: although we store these values in `h_g_vals: Vec<Field>` (recall `Field` is the SIMD version of `ChallengeField`), we are actually always using `C::Field::from()` to convert extension field values to their SIMD version by just repeating the value $N$ times. And there's even a note in the code that this is not optimal, it was probably just complicated to make it work with their current types/fn signatures. Though whether or not the y-evaluations are all equal depends on whether the `bk_f` values used in `poly_eval_at` are all equal. Running the prover shows that they re not, so the SIMD computation is nontrivial here as well. ## MPI / SIMD What is the exact sum-check being proven in Expander? How are the MPI/SIMD copies of the circuit included in the arithmetization? ### $r_{z_0}, r_{SIMD}, r_{MPI}$ The first sum-check is kicked off by sampling three random points: $r_{z_0}, r_{SIMD}, r_{MPI}$. Their lengths are - $r_{z_0}$: `output_var_num` (log2 size of output layer) - $r_{SIMD}$: `C::get_field_pack_size().trailing_zeros()` (log2 of SIMD packing size) - $r_{MPI}$: `mpi_config.world_size().trailing_zeros()` (log2 of MPI world size) If $W_0$ is the output layer witness polynomial then we: - Evaluate $W_0(r_{z_0})$ using SIMD, i.e. for each of the SIMD-copies of the circuit. Obtain `claimed_v_simd`, an element of the SIMD field - Unpack and use `claimed_v_simd` as the coefficients of a multilinear polynomial and evaluate it at $r_{SIMD}$. Obtain `claimed_v_local`, an element of the challenge field - Collect all `config.world_size()`-many `claimed_v_local` and use these as the coefficients of a MLP, evaluate it at $r_{MPI}$. ### Indexing Distributed Provers Say - $i = 0, \ldots n-1$ indexes variables for current layer within a circuit. Use $x_i, y_i, z_i$ for concrete values, $X_i, Y_i, Z_i$ for variables. - $j=0, \ldots s-1$ indexes variables for SIMD (so SIMD packing size is $S=2^s$). Use $a_i$ for concrete values and $A_i$ for variables - $k=0,\ldots, m-1$ indexes variables for MPI (so MPI world size is $M=2^m$). Use $b_i$ for concrete values and $B_i$ for variables Then $W^{a,b}(x_0, \ldots x_{n-1})$ is a value in the output layer of the $a$th SIMD circuit for the $b$th MPI participant. The total circuit witness polynomial is therefore $$\begin{align*} \tilde W(Z, A, B) &= \sum_{b_0, \ldots b_{m-1}} \quad \sum_{a_0, \ldots a_{s-1}} \quad \sum_{z_0, \ldots z_{n-1}} W^{a,b}(z_0, \ldots z_{n-1}) \eq(z,Z) \eq(a,A) \eq(b,B) \\ \\ &=\sum_{b_0, \ldots b_{m-1}} \quad \sum_{a_0, \ldots a_{s-1}} \tilde W^{a,b}(Z)\eq(a,A) \eq(b,B) \end{align*}$$ in terms of the per-circuit witness polynomial $\tilde W^{a,b}(Z) = \sum_z W^{a,b}(z) \eq(z,Z)$.  ### Libra 2-phase sum-check Notation: Libra's convention of using $x,y$ to index the first/second inputs to the fan-in 2 mul gate. In this section we fix a single $(a,b)$ SIMD/MPI circuit and use $x, y$ as in Libra.) The GKR relation (expressing $W_{i-1}(Z)$ in terms of $W_i(X), W_i(Y)$) in Expander is $$\begin{align} W_{i-1}(g) &= \sum_{x_0, \ldots x_{n-1}, y_0, \ldots y_{n-1}} \mul(g,x, y) W_i(x) W_i(y) + \sum_{x_0, \ldots x_{n-1}} \add(g, x) W_i(x) \\ \\ &= \sum_{x_0, \ldots x_{n-1}} W_i(x) \bigg( \sum_{y_0, \ldots y_{n-1}} \mul(g, x, y) W_i(y) + \add(g,x) \bigg) \\ &= \sum_{x_0, \ldots x_{n-1}} W_i(x) h_g(x) \\ \text{where} \\ h_g(x) &:= \sum_{y_0, \ldots y_{n-1}} \mul(g, x, y) W_i(y) + \add(g,x) \end{align}$$  Sum-check over the $x$-variables ("phase 1") reduces $\sum_{x_0, \ldots x_{n-1}} W_i(x) h_g(x)$ to a single claim `val`$= W_i(r_x) h_g(r_x)$. We'll let the claimed value $W_i(r_x)$ be something for the prover to prove in next GKR layer. We'll apply sum-check ("phase 2") to claim on $h_g(r_x)$. Phase 2 will result in another claimed $W_i(r_y)$ evaluation, hence the two claims about $W_i$ that we reduce to 1 claim with an RLC (two-to-one reduction). ### Evaluating $h_g$ The prover computes all values of $h_g(x)$ in time proportional to the number of gates using the following sparse representations of the $\tilde\mul, \tilde\add$ polynomials: $$\begin{align} \tilde\mul(g,x,y) & = \sum_{z \in B_{n_i}} \mul(z,x,y) \eq(g,z) \\ \tilde\add(g,x) &= \sum_{z \in B_{n_i}} \add(z,x) \eq(g,z) \end{align}$$ That's written as a sum over all gates $z$ of the $i$th layer, but of course $\mul(z,x,y)=0$ if $z$ does not actually index a mul gate with inputs $x,y$. So really that is a sum over the mul gates of the $i$th layer (resp. add gates). Therefore an algorithm to compute $h_g$ is ``` for mul(z,x,y) in mul_gates // meaning z += mul(z,x,y) * x * y h_g[x] += mul(z,x,y) * eq(g,z) * W_i[y] for add(z,x) in add_gates // meaning z += add(z,x) * x h_g[x] += add(z,x) * eq(g,z) ``` This is what is done in `SumCheckGkrHelper::prepare_x_vals`. Note that the coefficients `mul(z,x,y), add(z, x)` and evaluations `eq(g,z)` are common to all the circuits (assuming they have same challenge point `g`), so SIMD-packing the various `W_i[y]` values allows the thread to evaluate `h_g` for all $S$ circuits simultaneously. Ignoring SIMD/MPI, what happens in `sumcheck_prove_gkr_layer` is 1. Evaluate $h_g(x)$ for all $x$ 2. Sum-check for $\sum_x h_g(x) W_i(x)$ (degree 2 summand) 3. Arrive at claimed value `vx_claim` of $h_g(r_x) W_i(r_x)$ (Not actually that value, but rather a multilinear combination of those values from the various circuits, due to SIMD/MPI) > Although it seems like `vx_claim` ought to be $h_g(r_x) W_i(r_x)$, it's actually `sp.mpi_var_v_evals[0]`, and the verifier uses it as a coefficient of $\add(g, r_x)$, suggesting it's actually just a SIMD/MPI-related coefficient...? 4. Clear `hg_vals` memory for reuse -- the new assigned values do *not* have the same interpretation as above. 5. TODO: What is new interpretation of `hg_vals` ??? We'll call this polynomial $h_g'(y)$ 6. Sum-check for $\sum_y h_g'(y) W_i(y)$ (degree 2) summand Observe - $W_i(r_x)$ is just a claimed evaluation. For second sum-check phase it's a constant. - $h_g(r_x)$ claim is what 2nd phase proves via sum-check. By definition $$h_g(r_x) = \sum_{y_0, \ldots y_{n-1}} \mul(g, r_x, y) W_i(y) + \add(g,r_x)$$ Second term is a constant that verifier can compute directly. First term is what we actually sum-check on: $\sum_{y_0, \ldots y_{n-1}} \mul(g, r_x, y) W_i(y)$ Therefore what we called $h'_g(y)$ above should just be $\mul(g,r_x, y)$. We can get all its values by ``` for mul(z,x,y) in mul_gates h_g'[y] += mul(z,x,y) * eq(g, z) * eq(r_x, x) ``` Note that in the code, $h_g'$ is actually just assigned to the same memory slice `h_g`, for efficiency. ### All together $$\begin{align*} \tilde W(Z, A, B) &= \sum_{b_0, \ldots b_{m-1}} \quad \sum_{a_0, \ldots a_{s-1}} \quad \sum_{z_0, \ldots z_{n-1}} W(a,b,z) \eq(z,Z) \eq(a,A) \eq(b,B) \\ \end{align*}$$ The total sum-check being done would then be $$\begin{align} \tilde W_{i-1}(r_z, r_{SIMD}, r_{MPI}) &= \sum_{b_0, \ldots b_{m-1}} \quad \sum_{a_0, \ldots a_{s-1}} \sum_{z_0,\ldots z_{n-1}} W_{i-1}(z,a,b)\eq(z,r_z)\eq(a,r_{SIMD}) \eq(b,r_{MPI}) \\ \\ &= \sum_{b_0, \ldots b_{m-1}} \eq(b,r_{MPI}) \sum_{a_0, \ldots a_{s-1}} \eq(a,r_{SIMD}) \left( \sum_{z_0,\ldots z_{n-1}} W_{i-1}(z,a,b)\eq(z,r_z) \right) \\ \\ &= \sum_{b_0, \ldots b_{m-1}} \eq(b,r_{MPI}) \sum_{a_0, \ldots a_{s-1}} \eq(a,r_{SIMD}) \left( \sum_{x_0, \ldots x_{n-1}} h_{r_z}(x,a,b) W_{i}(x,a,b) \right) \\ \\ &= \sum_{b_0, \ldots b_{m-1}} \eq(b,r_{MPI}) \sum_{a_0, \ldots a_{s-1}} \eq(a,r_{SIMD}) \sum_{x_0, \ldots x_{n-1}} W_i(x,a,b) \left( \sum_{y_0, \ldots y_{n-1}} \mul(r_z, x, y) W_i(y,a,b) + \add(r_z,x) \right) \end{align}$$ Fix MPI participant $b$: Due to SIMD, each round of the sumcheck $\sum_{x_0, \ldots x_{n-1}} h_{r_z}(x,a,b) W_{i}(x,a,b)$ is carried out simultaneously for all $a$. In other words, $S$ univariate polynomials are produced simultaneously. Then `unpack_and_combine` performs the $\sum_a \eq(a, r_{SIMD})$ over those individual polynomials, producing a single polynomial. We have effectively performed a round of the sum-check $$\sum_{x_0, \ldots x_{n-1}} \left( \sum_{a_0, \ldots a_{s-1}} \eq(a, r_{SIMD})h_{r_z}(x,a,b) W_{i}(x,a,b) \right)$$ for fixed MPI participant $b$. Now we combine those univariate polynomials for each $b$ with $r_{MPI}$. Effectively, we've computed the univariate polynomial for a round of sumcheck on $$\sum_{x_0, \ldots x_{n-1}} \left( \sum_{b_0, \ldots b_{m-1}} \sum_{a_0, \ldots a_{s-1}} \eq(b,r_{MPI}) \eq(a, r_{SIMD})h_{r_z}(x,a,b) W_{i}(x,a,b) \right)$$ This polynomial is what is actually written to transcript, corresponding to the coordinator having aggregated the work of each prover and sent that to verifier to obtain a single challenge. After the first $n$ rounds, the $x$ variables have been bound to specific challenge values $r_x$, leaving $$\begin{align} \text{intermediate sum claim} &= \sum_{b_0, \ldots b_{m-1}} \sum_{a_0, \ldots a_{s-1}} \eq(b,r_{MPI}) \eq(a, r_{SIMD})h_{r_z}(r_x,a,b) W_{i}(r_x,a,b) \\ &= \sum_{b_0, \ldots b_{m-1}} \sum_{a_0, \ldots a_{s-1}} h_{r_z}(r_x,a,b) W_{i}(r_x,a,b) \eq(b,r_{MPI}) \eq(a, r_{SIMD}) \end{align}$$ So now we're ready to do $S$ rounds over the SIMD variables, $M$ rounds for the MPI variables. > This expression suggests that they'd be degree 1 summands, but the implementation uses degree 3. Resolution is probably that the $a,b$ superscripts hide actual polynomial dependence on $a,b$...