# Distributed GKR
$\newcommand{\add}{\textsf{add}}
\newcommand{\mul}{\textsf{mul}}
\newcommand{\FF}{\mathbb{F}}
\newcommand{\P}{\mathcal{P}}
\newcommand{\V}{\mathcal{V}}
\newcommand{\eq}{\tilde{\mathsf{eq}}}$
## References
- [GKR Review](https://hackmd.io/WUh-NzVNRYeXdEamoQN6PQ)
- [Virgo](https://eprint.iacr.org/2019/1482.pdf) GKR + ZK via novel PCS
- [zkBridge / diVirgo](https://arxiv.org/abs/2210.00264) Distributed GKR
## TLDR
- For data parallel circuits, the GKR prover can be distributed across $N$ machines with linear speed-up
- Main idea: distributed sum-check
- No additional communication between prover/verifier
- Same proof size (assuming no use of PCS)
- Significant communication between provers
## Data Parallel Circuits
- Circuit $C$ is a union of sub-circuits $C_1, \ldots C_N$ (assume $N=2^n$)
- No cross-circuit connections (Gate in layer $i$ of sub-circuit $C_j$ is connected only to gates in layer $i+1$ of sub-circuit $C_j$)
- Sub-circuits do not have to be identical, but must have equal size $S_i =2^{s_i}$ at each layer (otherwise can be padded to have equal size)
- Goal: Distribute GKR prover across $N$ machines

### Decomposing Indices
(Note: To simplify notation we'll index starting from 1 throughout)
- Observe: Layer $i$ of $C$ consists of $S_i \times N = 2^{s_i+n}$ gates
- Say $y = (y_1, \ldots y_{s_i+n})$ indexes a gate in layer $i$ of $C$ using $s_i+n$ variables
- Decompose:
- First $s_i$ variables $(y_1, \ldots y_{s_i})$ indexing *gate within a sub-circuit* using $s_i$ variables
- Last $n$ variables $(y_{s_i+1}, \ldots y_{s_i+n})$ indexing a *sub-circuit within* $C$ using $n$ variables
### Decomposing Multilinear Extension Polynomials
**Claim:** Witness polynomial for layer $i$ of $C$ decomposes into sum of witness polynomials for layer $i$ of $C_1, \ldots C_N$
Recall the basis polynomials (say $x \in \FF^k$ fixed, $Y$ consists of $k$ variables)
$$\eq(x|Y) = \prod_{i=1}^k x_i Y_i + (1-x_i)(1-Y_i)$$
Letting $x$ vary over the boolean hypercube $B_k$, these form a basis for the multilinear polynomials in $k$ variables.
We use them as a "Lagrange basis" to form multilinear extensions: For witness values $W(1), \ldots W(2^k)$, the MLE is
$$\tilde W(Y) = \sum_{x \in B_k} W(x) \eq(x | Y)$$
Observe that these basis polynomials decompose as products:
$$\eq(x_1, \ldots x_{s+n} | Y_1, \ldots Y_{s+n}) = \eq(x_1, \ldots x_s| Y_1, \ldots Y_s) \eq(x_{s+1}, \ldots x_{s+n} | Y_{s+1}, \ldots Y_{s+n})$$
Therefore
$$\begin{align*}
\tilde W(Y) &= \sum_{x_1, \ldots x_{s+n}} W(x_1, \ldots x_{s+n}) \eq(x_1, \ldots x_{s+n} | Y) \\
&= \sum_{x_{s+1}, \ldots x_{s+n}} \bigg( \sum_{x_1, \ldots x_s} W(x_1, \ldots x_{s+n}) \eq(x_1, \ldots x_{s} | Y_1, \ldots Y_s) \bigg) \eq(x_{s+1}, \ldots x_{s+n} | Y_{s+1}, \ldots Y_{s+n}) \\
&= \sum_{\text{sub-circuit indices}} \quad \bigg( \text{witness poly of sub-circuit} \bigg) (\text{basis polynomial}) \\
&= \sum_{j=1}^N W^{(j)}(Y_1, \ldots Y_s) \eq([j] | Y_{s+1}, \ldots Y_{s+n})
\end{align*}$$
where we've defined the witness polynomial for the $j$th sub-circuit as
$$ W^{(j)}(Y_1, \ldots Y_s) = \sum_{x_1, \ldots x_s} W(x_1, \ldots x_{s}, [j]) \eq(x_1, \ldots x_{s} | Y_1, \ldots Y_s)$$
($[j]$ denotes the binary decomposition of $j$ in $n$ bits).
### Decomposing Wiring Polynomials
TLDR: Because the circuit is data-parallel (no wiring between different sub-circuits), the $\tilde\add$, $\tilde\mul$ polynomials decompose by sub-circuit as well.
More precisely say the $\tilde\mul$ polynomial for layer $i$ of the full circuit $C$ is
$$\tilde\mul(X,Y,Z) = \sum_{x_1, \ldots x_{s_i+n}} \sum_{y_1, \ldots y_{s_{i+1}+n}} \sum_{z_1, \ldots z_{s_{i+1}+n}} \mul(x,y,z) \eq(x| X) \eq(y | Y) \eq(z | Z)$$
Because $C$ is data-parallel, we know the only potentially non-zero terms $\mul(x,y,z)$ are those for which $(x,y,z)$ label gates within the same sub-circuit.
Therefore the above expression has redundant variables: $X_{s_i+1}, \ldots X_{s_i+n}$, $Y_{s_{i+1}+1}, \ldots Y_{s_{i+1}+n}$, and $Z_{s_{i+1}+1}, \ldots Z_{s_{i+1}+n}$ are all indexing sub-circuits. We can discard $Y_{s_{i+1}+1}, \ldots Y_{s_{i+1}+n}$ and $Z_{s_{i+1}+1}, \ldots Z_{s_{i+1}+n}$ and write
$$\begin{align*} \tilde\mul(X, Y, Z) &= \tilde\mul(X_1, \ldots X_{s_i}, X_{s_i+1}, \ldots X_{s_i+n}, Y_1, \ldots Y_{s_{i+1}}, Z_1, \ldots Z_{s_{i+1}} ) \\
&= \sum_{j=1}^N \tilde\mul^{(j)}(X_1, \ldots X_{s_i}, Y,Z) \eq([j]| X_{s_i+1}, \ldots X_{s_i+n}) \\
&= \sum_{\text{sub-circuit indices}} (\text{mul poly. of sub-circuit})(\text{basis polynomial})
\end{align*}$$
That is, we use a multilinear extension of $\mul$ in $s_i + 2*s_{i+1} + n$ variables
- saves $2n$ variables compared to a non-data-parallel circuit of same size as $C$
- structure as a sum over sub-circuits will allow for distributed sum-check
### GKR Polynomial
GKR at layer $i$ applies sum-check to the polynomial
$$f_{r_i}(Y,Z) = \tilde\add_i(r_i,Y,Z) \bigg(\tilde W_{i+1}(Y) + \tilde W_{i+1}(Z) \bigg) + \tilde\mul_i(r_i,Y,Z) \bigg( \tilde W_{i+1}(Y) \cdot \tilde W_{i+1}(Z) \bigg)$$
Using the data-parallel structure of $C$ and a similar argument, we can decompose this too:
$$f_{r_i}(Y_1, \ldots Y_{s_{i+1}+n},Z_1, \ldots Z_{s_{i+1}}) = \sum_{j=1}^N f_{r_i}^{(j)}(Y_1, \ldots Y_{s_{i+1}},Z_1, \ldots Z_{s_{i+1}})\eq([j]| Y_{s_{i+1}+1}, \ldots Y_{s_{i+1}+n})$$
- Q: Would this work for a term like $\tilde{POW5}(Y) W(Y)^5$ ?
## Distributed Sum-Check
In a given GKR round, prover's claim is of the form
$$m_{i} = \sum_{x_1, \ldots x_{s+n}} f(x_1, \ldots x_{s+n})$$
This will be proven using sum-check in $s+n$ rounds.
When $f$ is of the form
$$\begin{align*}f(Y_1, \ldots Y_{s+n}) &= \sum_{j=1}^N f^{(j)}(Y_1, \ldots Y_s) \eq([j] | Y_{s+1}, \ldots Y_{s+n}) \\
&= \sum_{\text{sub-circuit indices}} (\text{sub-circuit poly})(\text{basis poly})
\end{align*}$$
we can distribute the work across $N$ prover by writing the sum as
$$m_{i} = \sum_{j=1}^N \sum_{x_1, \ldots x_{s}} f^{(j)}(x_1, \ldots x_{s_i})$$
Each prover $\P_i$ holds the $s$-variate polynomial $f^{(i)}$. A single prover, say $\P_1$, will also act as coordinator.
### $s$ Distributed Rounds
The first $s$ rounds of sum-check go as follows:
1. Each $\P_j$ computes the univariate polynomial $$f^{(j)}_1(Y) = \sum_{x_2, \ldots x_s} f^{(j)}(Y, x_2, \ldots x_s)$$ and sends $f^{(j)}_1$ to $\P_1$ (coordinator)
2. $\P_1$ computes $f_1(Y) = \sum_1^N f^{(j)}_1(Y)$, sends to $\V$
3. $\V$ samples $r_1 \in \FF$, sends to $\P_1$
4. $\P_1$ sends $r_1$ to each $\P_j$
5. Each $\P_j$ computes $$f^{(j)}_2(Y) = \sum_{x_3, \ldots x_s} f^{(j)}(r_1, Y, x_3, \ldots x_s)$$ and sends $f^{(j)}_2$ to $\P_1$ (coordinator)
6. [continue round-by-round]
7. $\V$ samples $r_s \in \FF$ and sends to $\P_1$
8. $\P_1$ sends $r_s$ to each $\P_j$
9. $\P_j$ computes $f^{(j)}(r_1, \ldots r_s)$, sends to $\P_1$
### $n$ Ordinary Rounds
At this point the first $s$ of $s+n$ sum-check rounds have been performed and $\P_1$ holds the values $f^{(j)}(r_1, \ldots r_s)$ for $j=1,\ldots N$.
The final $n$ rounds will be carried out between $\P_1$ and $\V$ using the polynomial
$$f(r_1, \ldots r_s, Y_{s+1}, \ldots Y_{s+n}) = \sum_{j=1}^N f^{(j)}(r_1, \ldots r_s) \eq([j] | Y_{s+1}, \ldots Y_{s+n}) $$
# Expander SIMD
Expander uses SIMD field implementations to prove/verify multiple instances of a circuit in parallel. They use some similar ideas to those above to achieve some form of aggregation. This is an attempt to write out the protocol they're effectively using.
In `fn sumcheck_prove_gkr_layer`, they have a few sumcheck rounds in between the `x` and `y` sumcheck rounds (described by Libra as Phases 1 and 2).
```rust=36
helper.prepare_simd();
helper.prepare_x_vals();
for i_var in 0..helper.input_var_num {
let evals = helper.poly_evals_at_rx(i_var, 2);
let r = transcript_io::<C>(&evals, transcript);
helper.receive_rx(i_var, r);
}
helper.prepare_simd_var_vals();
for i_var in 0..helper.simd_var_num {
let evals = helper.poly_evals_at_r_simd_var(i_var, 2);
let r = transcript_io::<C>(&evals, transcript);
helper.receive_r_simd_var(i_var, r);
}
let vx_claim = helper.vx_claim();
transcript.append_challenge_f::<C>(&vx_claim);
helper.prepare_y_vals();
for i_var in 0..helper.input_var_num {
let evals = helper.poly_evals_at_ry(i_var, 2);
let r = transcript_io::<C>(&evals, transcript);
helper.receive_ry(i_var, r);
}
```
Say $N$ is the SIMD packing width (16 for M31 and BabyBear).
In `helper.poly_evals_at_rx(i_var, 2)`, the polynomial evaluation first produces $N$ polynomials:
```rust=399
let [p0, p1, p2] = self.xy_helper.poly_eval_at::<C>(
var_idx,
degree,
&mut self.sp.v_evals,
&mut self.sp.hg_evals,
&self.layer.input_vals,
&self.sp.gate_exists_5,
);
```
So far this is just the prover doing $N$ parallel sum-check rounds. But these $N$ polynomials are not returned, they are instead combined and evaluated:
```rust=408
[
Self::unpack_and_combine(p0, &self.sp.eq_evals_at_r_simd0),
Self::unpack_and_combine(p1, &self.sp.eq_evals_at_r_simd0),
Self::unpack_and_combine(p2, &self.sp.eq_evals_at_r_simd0),
]
```
where
```rust=326
pub(crate) fn unpack_and_combine(p: C::Field, coef: &[C::ChallengeField]) -> C::ChallengeField {
let p_unpacked = p.unpack();
p_unpacked
.into_iter()
.zip(coef)
.map(|(p_i, coef_i)| p_i * coef_i)
.sum()
}
```
The coefficients being used, `self.sp.eq_evals_at_r_simd0` are the evaluations $\eq(r_{simd}, b)$ over the boolean hypercube of size $n = \log_2 N$. So that means the polynomial returned by `helper.poly_evals_at_rx` is actually
$$p(X) = \sum_{i=0}^{N-1} \eq(r_{simd},[i]) p^{(i)}(X)$$
> Note that $r_{simd}$ is passed in as an argument to this sum-check round, meaning it was computed over the course of the *previous* sum-check round, or squeezed from the initialized transcript in the first round.
That suggests that in the background a polynomial in $s+n$ variables has been defined (where $s= \log_2 S$ is the number of $X$-variables)
$$f(Z_0, \ldots Z_{n-1}, X_0, \ldots X_{s-1}) = \sum_{i=0}^{N-1} \eq(Z_0, \ldots Z_{n-1},[i]) f^{(i)}(X)$$
exactly as above in the distributed sum-check protocol.
If we include the $Y_0,\ldots Y_{s-1}$ variables, we're doing sum-check on a polynomial that looks like
$$f(Z_0, \ldots Z_{n-1}, X_0, \ldots X_{s-1},Y_0,\ldots Y_{s-1}) = \sum_{i=0}^{N-1} \eq(Z_0, \ldots Z_{n-1},[i]) f^{(i)}(X,Y)$$
In this case each polynomial $f^{(i)}$ has the form $f_1(g,X,Y)f_2(X)f_3(Y)$, and the Libra 2-phase trick is to sum over $f^{(i)}$ as
$$\begin{align}
\sum_{x,y} f^{(i)}(g,X,Y) & = \sum_x f_2(X) \bigg( \sum_y f_1(g,X,Y)f_3(Y) \bigg) \\
&= \sum_x f_2(X) h_g(X)
\\
\text{where}\quad h_g(X) &:= \sum_y f_1(g,X,Y)f_3(Y)
\end{align}$$
### SIMD Arithmetization
Rewriting this with the $n$ extra $Z$-variables to sum over SIMD copies:
> Should we use a different $g^{(i)}$ in each SIMD copy?
$$\begin{align}
f(g,Z,X,Y) &:= \sum_{i=0}^{N-1} \eq(Z_0, \ldots Z_{n-1},[i]) f^{(i)}(g,X,Y) \\
\\
\sum_{z,x,y} f(g,Z,X,Y) &= \sum_z \eq(Z,[i]) \sum_{x} f_2^{(i)}(X) \bigg( \sum_y f_1^{(i)}(g,X,Y)f_3^{(i)}(Y) \bigg) \\
&= \sum_z \eq(Z,[i]) \sum_{x} f_2^{(i)}(X) h^{(i)}_g(X)
\end{align}$$
When written like this it makes more sense to see the $\sum_x$ followed by the $\sum_z$ over the SIMD variables. Note that because we're using SIMD to evaluate $h^{(i)}_g(X)$ we would need to require all the $f_1^{(i)}, f_2^{(i)}$ polynomials to be equal, i.e. we're using identical sub-circuits. Therefore we may write $h^{(i)}_g = h_g$.
So the $s+n$ rounds of sum-check reduce this down to a single claim $$h_g(r_0, \ldots r_{s-1}) = \texttt{vx_claim}$$
This gets proved with sum-check rounds over the $y$-variables, and again there's the `unpack_and_combine` thing going on, which suggests that actually the definition of $h_g$ should be again a sum over $N$ terms...
OTOH in `prepare_y_vals` we are preparing the $f_3$ evaluations the prover will use in the $y$-sum-check rounds and that is a "fake SIMD" operation: although we store these values in `h_g_vals: Vec<Field>` (recall `Field` is the SIMD version of `ChallengeField`), we are actually always using `C::Field::from()` to convert extension field values to their SIMD version by just repeating the value $N$ times. And there's even a note in the code that this is not optimal, it was probably just complicated to make it work with their current types/fn signatures.
Though whether or not the y-evaluations are all equal depends on whether the `bk_f` values used in `poly_eval_at` are all equal. Running the prover shows that they re not, so the SIMD computation is nontrivial here as well.
## MPI / SIMD
What is the exact sum-check being proven in Expander? How are the MPI/SIMD copies of the circuit included in the arithmetization?
### $r_{z_0}, r_{SIMD}, r_{MPI}$
The first sum-check is kicked off by sampling three random points: $r_{z_0}, r_{SIMD}, r_{MPI}$. Their lengths are
- $r_{z_0}$: `output_var_num` (log2 size of output layer)
- $r_{SIMD}$: `C::get_field_pack_size().trailing_zeros()` (log2 of SIMD packing size)
- $r_{MPI}$: `mpi_config.world_size().trailing_zeros()` (log2 of MPI world size)
If $W_0$ is the output layer witness polynomial then we:
- Evaluate $W_0(r_{z_0})$ using SIMD, i.e. for each of the SIMD-copies of the circuit. Obtain `claimed_v_simd`, an element of the SIMD field
- Unpack and use `claimed_v_simd` as the coefficients of a multilinear polynomial and evaluate it at $r_{SIMD}$. Obtain `claimed_v_local`, an element of the challenge field
- Collect all `config.world_size()`-many `claimed_v_local` and use these as the coefficients of a MLP, evaluate it at $r_{MPI}$.
### Indexing Distributed Provers
Say
- $i = 0, \ldots n-1$ indexes variables for current layer within a circuit. Use $x_i, y_i, z_i$ for concrete values, $X_i, Y_i, Z_i$ for variables.
- $j=0, \ldots s-1$ indexes variables for SIMD (so SIMD packing size is $S=2^s$). Use $a_i$ for concrete values and $A_i$ for variables
- $k=0,\ldots, m-1$ indexes variables for MPI (so MPI world size is $M=2^m$). Use $b_i$ for concrete values and $B_i$ for variables
Then $W^{a,b}(x_0, \ldots x_{n-1})$ is a value in the output layer of the $a$th SIMD circuit for the $b$th MPI participant.
The total circuit witness polynomial is therefore
$$\begin{align*}
\tilde W(Z, A, B) &=
\sum_{b_0, \ldots b_{m-1}} \quad \sum_{a_0, \ldots a_{s-1}} \quad \sum_{z_0, \ldots z_{n-1}} W^{a,b}(z_0, \ldots z_{n-1}) \eq(z,Z) \eq(a,A) \eq(b,B) \\
\\
&=\sum_{b_0, \ldots b_{m-1}} \quad \sum_{a_0, \ldots a_{s-1}} \tilde W^{a,b}(Z)\eq(a,A) \eq(b,B)
\end{align*}$$
in terms of the per-circuit witness polynomial $\tilde W^{a,b}(Z) = \sum_z W^{a,b}(z) \eq(z,Z)$.
<!-- Fixing some $b \in [0,M)$, i.e. one of the MPI participants, we have a single thread working on the polynomial
$$\tilde W(X,A,b) = \sum_{a} \tilde W^{a,b}(X) \eq(a,A)$$ for all $a$ simultaneously using SIMD instructions -->
### Libra 2-phase sum-check
Notation: Libra's convention of using $x,y$ to index the first/second inputs to the fan-in 2 mul gate. In this section we fix a single $(a,b)$ SIMD/MPI circuit and use $x, y$ as in Libra.)
The GKR relation (expressing $W_{i-1}(Z)$ in terms of $W_i(X), W_i(Y)$) in Expander is
$$\begin{align}
W_{i-1}(g) &=
\sum_{x_0, \ldots x_{n-1}, y_0, \ldots y_{n-1}} \mul(g,x, y) W_i(x) W_i(y) + \sum_{x_0, \ldots x_{n-1}} \add(g, x) W_i(x) \\
\\
&= \sum_{x_0, \ldots x_{n-1}} W_i(x) \bigg( \sum_{y_0, \ldots y_{n-1}} \mul(g, x, y) W_i(y) + \add(g,x) \bigg) \\
&= \sum_{x_0, \ldots x_{n-1}} W_i(x) h_g(x) \\
\text{where} \\
h_g(x) &:= \sum_{y_0, \ldots y_{n-1}} \mul(g, x, y) W_i(y) + \add(g,x)
\end{align}$$
<!-- NOTES
$W_i(A, Z) = \sum_{a} W^{a}(z) eq(a, A)$
$$W_{i-1}(A, Z) = \sum_{a,x,y} \mul(Z, x, y) W_i(x) W_i(y)$$
$$\begin{align}
\sum_{a, y} \mul(Z, r_x, y) W_i(a, r_x) W_i(a, y) &= \sum_{a,y} \mul(Z, r_x, y) \cdot \left( \sum_d W_i^{d}(r_x) \eq(d, a) \right) \cdot \left( \sum_e W_i^{e}(y) \eq(e, a) \right) \\
\sum_{a, y} \mul(Z, r_x, y) W_i(a, r_x) W_i(a, y) &= \sum_{a,y} \mul(Z, r_x, y) \cdot W_{i, r_x}(a) \cdot \left( \sum_e W_i^{e}(y) \eq(e, a) \right) \\
&= \sum_{a,y} \mul(Z, r_x, y) \left( \sum_d W_i^{d}(r_x) W_i^{d}(y) \eq^2(d, a) \right)
\end{align}$$
For $A,B$:
$$\begin{align}
W(A, B, X) &= \sum_a W^a(B, X) \eq(a, A) \\
&= \sum_a \eq(a, A) \left( \sum_b W^{a,b}(X) \eq(b, B) \right)
\end{align}$$
$$ W_i(A, B, X) = \sum_{a,b} W_i^{a,b}(X) eq(a, A) eq(b, B) $$
$$\begin{align}
\sum_{a,b,y} mul(Z, r_x, y) W(a, b, r_x) W(a, b, y) = \\ \sum_{a, b, y} mul(Z, r_x, y) \left( \sum_{\alpha, \beta} W^{\alpha,\beta}(r_x) \eq(a, \alpha) \eq(b, \beta) \right) \left( \sum_{\alpha, \beta} W^{\alpha,\beta}(y) \eq(a, \alpha) \eq(b, \beta) \right)
\end{align}$$
-->
Sum-check over the $x$-variables ("phase 1") reduces $\sum_{x_0, \ldots x_{n-1}} W_i(x) h_g(x)$ to a single claim `val`$= W_i(r_x) h_g(r_x)$. We'll let the claimed value $W_i(r_x)$ be something for the prover to prove in next GKR layer. We'll apply sum-check ("phase 2") to claim on $h_g(r_x)$. Phase 2 will result in another claimed $W_i(r_y)$ evaluation, hence the two claims about $W_i$ that we reduce to 1 claim with an RLC (two-to-one reduction).
### Evaluating $h_g$
The prover computes all values of $h_g(x)$ in time proportional to the number of gates using the following sparse representations of the $\tilde\mul, \tilde\add$ polynomials:
$$\begin{align}
\tilde\mul(g,x,y) & = \sum_{z \in B_{n_i}} \mul(z,x,y) \eq(g,z) \\
\tilde\add(g,x) &= \sum_{z \in B_{n_i}} \add(z,x) \eq(g,z)
\end{align}$$
That's written as a sum over all gates $z$ of the $i$th layer, but of course $\mul(z,x,y)=0$ if $z$ does not actually index a mul gate with inputs $x,y$. So really that is a sum over the mul gates of the $i$th layer (resp. add gates).
Therefore an algorithm to compute $h_g$ is
```
for mul(z,x,y) in mul_gates // meaning z += mul(z,x,y) * x * y
h_g[x] += mul(z,x,y) * eq(g,z) * W_i[y]
for add(z,x) in add_gates // meaning z += add(z,x) * x
h_g[x] += add(z,x) * eq(g,z)
```
This is what is done in `SumCheckGkrHelper::prepare_x_vals`. Note that the coefficients `mul(z,x,y), add(z, x)` and evaluations `eq(g,z)` are common to all the circuits (assuming they have same challenge point `g`), so SIMD-packing the various `W_i[y]` values allows the thread to evaluate `h_g` for all $S$ circuits simultaneously.
Ignoring SIMD/MPI, what happens in `sumcheck_prove_gkr_layer` is
1. Evaluate $h_g(x)$ for all $x$
2. Sum-check for $\sum_x h_g(x) W_i(x)$ (degree 2 summand)
3. Arrive at claimed value `vx_claim` of $h_g(r_x) W_i(r_x)$ (Not actually that value, but rather a multilinear combination of those values from the various circuits, due to SIMD/MPI)
> Although it seems like `vx_claim` ought to be $h_g(r_x) W_i(r_x)$, it's actually `sp.mpi_var_v_evals[0]`, and the verifier uses it as a coefficient of $\add(g, r_x)$, suggesting it's actually just a SIMD/MPI-related coefficient...?
4. Clear `hg_vals` memory for reuse -- the new assigned values do *not* have the same interpretation as above.
5. TODO: What is new interpretation of `hg_vals` ??? We'll call this polynomial $h_g'(y)$
6. Sum-check for $\sum_y h_g'(y) W_i(y)$ (degree 2) summand
Observe
- $W_i(r_x)$ is just a claimed evaluation. For second sum-check phase it's a constant.
- $h_g(r_x)$ claim is what 2nd phase proves via sum-check. By definition
$$h_g(r_x) = \sum_{y_0, \ldots y_{n-1}} \mul(g, r_x, y) W_i(y) + \add(g,r_x)$$
Second term is a constant that verifier can compute directly. First term is what we actually sum-check on: $\sum_{y_0, \ldots y_{n-1}} \mul(g, r_x, y) W_i(y)$
Therefore what we called $h'_g(y)$ above should just be $\mul(g,r_x, y)$. We can get all its values by
```
for mul(z,x,y) in mul_gates
h_g'[y] += mul(z,x,y) * eq(g, z) * eq(r_x, x)
```
Note that in the code, $h_g'$ is actually just assigned to the same memory slice `h_g`, for efficiency.
### All together
$$\begin{align*}
\tilde W(Z, A, B) &=
\sum_{b_0, \ldots b_{m-1}} \quad \sum_{a_0, \ldots a_{s-1}} \quad \sum_{z_0, \ldots z_{n-1}} W(a,b,z) \eq(z,Z) \eq(a,A) \eq(b,B) \\
\end{align*}$$
The total sum-check being done would then be
$$\begin{align}
\tilde W_{i-1}(r_z, r_{SIMD}, r_{MPI}) &= \sum_{b_0, \ldots b_{m-1}} \quad \sum_{a_0, \ldots a_{s-1}} \sum_{z_0,\ldots z_{n-1}} W_{i-1}(z,a,b)\eq(z,r_z)\eq(a,r_{SIMD}) \eq(b,r_{MPI}) \\
\\
&= \sum_{b_0, \ldots b_{m-1}} \eq(b,r_{MPI}) \sum_{a_0, \ldots a_{s-1}} \eq(a,r_{SIMD}) \left( \sum_{z_0,\ldots z_{n-1}} W_{i-1}(z,a,b)\eq(z,r_z) \right) \\
\\
&= \sum_{b_0, \ldots b_{m-1}} \eq(b,r_{MPI}) \sum_{a_0, \ldots a_{s-1}} \eq(a,r_{SIMD}) \left( \sum_{x_0, \ldots x_{n-1}} h_{r_z}(x,a,b) W_{i}(x,a,b) \right) \\
\\
&= \sum_{b_0, \ldots b_{m-1}} \eq(b,r_{MPI}) \sum_{a_0, \ldots a_{s-1}} \eq(a,r_{SIMD}) \sum_{x_0, \ldots x_{n-1}} W_i(x,a,b) \left( \sum_{y_0, \ldots y_{n-1}} \mul(r_z, x, y) W_i(y,a,b) + \add(r_z,x) \right)
\end{align}$$
Fix MPI participant $b$:
Due to SIMD, each round of the sumcheck $\sum_{x_0, \ldots x_{n-1}} h_{r_z}(x,a,b) W_{i}(x,a,b)$ is carried out simultaneously for all $a$. In other words, $S$ univariate polynomials are produced simultaneously.
Then `unpack_and_combine` performs the $\sum_a \eq(a, r_{SIMD})$ over those individual polynomials, producing a single polynomial. We have effectively performed a round of the sum-check
$$\sum_{x_0, \ldots x_{n-1}} \left( \sum_{a_0, \ldots a_{s-1}} \eq(a, r_{SIMD})h_{r_z}(x,a,b) W_{i}(x,a,b) \right)$$
for fixed MPI participant $b$.
Now we combine those univariate polynomials for each $b$ with $r_{MPI}$. Effectively, we've computed the univariate polynomial for a round of sumcheck on
$$\sum_{x_0, \ldots x_{n-1}} \left( \sum_{b_0, \ldots b_{m-1}} \sum_{a_0, \ldots a_{s-1}} \eq(b,r_{MPI}) \eq(a, r_{SIMD})h_{r_z}(x,a,b) W_{i}(x,a,b) \right)$$
This polynomial is what is actually written to transcript, corresponding to the coordinator having aggregated the work of each prover and sent that to verifier to obtain a single challenge.
After the first $n$ rounds, the $x$ variables have been bound to specific challenge values $r_x$, leaving
$$\begin{align}
\text{intermediate sum claim} &= \sum_{b_0, \ldots b_{m-1}} \sum_{a_0, \ldots a_{s-1}} \eq(b,r_{MPI}) \eq(a, r_{SIMD})h_{r_z}(r_x,a,b) W_{i}(r_x,a,b) \\
&= \sum_{b_0, \ldots b_{m-1}} \sum_{a_0, \ldots a_{s-1}} h_{r_z}(r_x,a,b) W_{i}(r_x,a,b) \eq(b,r_{MPI}) \eq(a, r_{SIMD})
\end{align}$$
So now we're ready to do $S$ rounds over the SIMD variables, $M$ rounds for the MPI variables.
> This expression suggests that they'd be degree 1 summands, but the implementation uses degree 3. Resolution is probably that the $a,b$ superscripts hide actual polynomial dependence on $a,b$...