# Hardware-friendliness of HyperPlonk

In this note we provide a hardware perspective on [HyperPlonk](https://eprint.iacr.org/2022/1355.pdf). We focus on its main building block, the multivariate SumCheck protocol, and compare its computational and memory complexity to that of an NTT (Number Theoretic Transform).

## Background - HyperPlonk

Plonk is one of the most widely adopted SNARKs in the industry. In vanilla Plonk, the execution trace produced by arithmetization is interpolated into univariate polynomials, so the resulting protocol relies heavily on NTTs. HyperPlonk is a new adaptation of Plonk in which the execution trace is interpolated over the boolean hypercube. The polynomial representation of the trace is then a multivariate polynomial of degree at most one in each variable, known as an MLE (MultiLinear Extension). A good overview of HyperPlonk can be found in Benedikt Bünz's [ZKSummit8 talk](https://www.youtube.com/watch?v=2JDBD5oMS0w&t=279s).

One key advantage of HyperPlonk is the elimination of large NTTs, a major computational bottleneck in Plonk for large circuits. By moving to the boolean hypercube, we no longer need univariate polynomials. Instead, HyperPlonk relies on multivariate polynomial arithmetic. Section 3 of the HyperPlonk paper is devoted to developing a toolbox for working with multivariate polynomials. The figure below, taken from the paper, shows how HyperPlonk is built out of this toolbox. As can be seen, at the root of it all is the classical SumCheck protocol, which is bound to become the main computational problem in HyperPlonk (polynomial commitments aside), replacing NTTs altogether.

![](https://i.imgur.com/GusntWV.png)

## Sumcheck in HyperPlonk

### A toy example

Consider an execution trace unrolled in the form of a vector. We illustrate the general idea using a vector of length $8$, consisting of the evaluations $\left\{ f_{i}\right\} _{i=0}^{7}$, where $f_i\in \mathbb{F}$. We interpolate these values using a multivariate polynomial $F(X_3,X_2,X_1)$ as follows:

| $f_{0}$ | $f_{1}$ | $f_{2}$ | $f_{3}$ | $f_{4}$ | $f_{5}$ | $f_{6}$ | $f_{7}$ |
| ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| $F(0,0,0)$ | $F(0,0,1)$ | $F(0,1,0)$ | $F(0,1,1)$ | $F(1,0,0)$ | $F(1,0,1)$ | $F(1,1,0)$ | $F(1,1,1)$ |

The elements in the first row are finite field elements, and the second row expresses the interpolation: entry $i$ is the evaluation of $F$ at the binary expansion of $i$. Lagrange interpolation gives

\begin{align}
F(X_3,X_2,X_1) = & f_0 (1-X_3) (1-X_2) (1-X_1) + f_1 (1-X_3)(1-X_2) X_1\\
& + f_2 (1-X_3) X_2 (1-X_1) + f_3 (1-X_3) X_2 X_1 \\
& + f_4 X_3 (1-X_2) (1-X_1) + f_5 X_3 (1-X_2) X_1 \\
& + f_6 X_3 X_2 (1-X_1) + f_7 X_3 X_2 X_1
\end{align}

where $X$ and $1-X$ are the Lagrange basis polynomials over $\{0,1\}$. This unique polynomial $F(X_3,X_2,X_1)$ is known as the MultiLinear Extension (MLE) of the vector.

The sumcheck problem in this case is

\begin{equation}
\sum_{X_1,X_2,X_3\in \{0,1\}} F(X_3,X_2,X_1) \stackrel{?}{=} C
\end{equation}

where $C\in \mathbb{F}$.
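To make the interpolation concrete, here is a minimal Rust sketch (our own illustration, not from the paper) that evaluates the MLE of a vector at an arbitrary point directly from the Lagrange basis above. The Goldilocks prime and the names `eval_mle`, `add`, `sub`, `mul` are our stand-ins, not any library's API.

```rust
// A toy prime field: the Goldilocks prime 2^64 - 2^32 + 1 (an assumption made
// for this sketch; any prime field would do). u128 arithmetic avoids overflow.
const P: u128 = 0xffff_ffff_0000_0001;

fn add(a: u128, b: u128) -> u128 { (a + b) % P }
fn sub(a: u128, b: u128) -> u128 { (a + P - b) % P }
fn mul(a: u128, b: u128) -> u128 { (a * b) % P }

/// Evaluate F(X_n, ..., X_1) at `point`, where point[j] is the value of
/// X_{j+1}; i.e. point[0] corresponds to the least significant bit of i.
fn eval_mle(f: &[u128], point: &[u128]) -> u128 {
    assert_eq!(f.len(), 1 << point.len());
    let mut acc = 0;
    for (i, &fi) in f.iter().enumerate() {
        // Lagrange basis at index i: prod_j (X_j if bit j of i is 1, else 1 - X_j).
        let mut basis = 1;
        for (j, &xj) in point.iter().enumerate() {
            basis = mul(basis, if (i >> j) & 1 == 1 { xj } else { sub(1, xj) });
        }
        acc = add(acc, mul(fi, basis));
    }
    acc
}

fn main() {
    let f: Vec<u128> = (0..8).collect(); // the vector f_0, ..., f_7
    // On the hypercube the MLE reproduces the table: f_3 = F(0,1,1), i.e.
    // X_3 = 0, X_2 = 1, X_1 = 1, so point = [X_1, X_2, X_3] = [1, 1, 0].
    assert_eq!(eval_mle(&f, &[1, 1, 0]), f[3]);
    println!("MLE check passed: F(0,1,1) = f_3 = {}", f[3]);
}
```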
The protocol proceeds with the following steps, where in each round the prover computes and commits to a univariate (linear) polynomial, and receives a challenge from the verifier.

| Round | Prover $\mathcal{P}$ | Communication | Verifier $\mathcal{V}$ |
|:-----:| ---------------------------------------------------- |:--------------------------------------------------------------- |:--------------------------------------------------------------------------------------------------------------------- |
| 1 | $r_1(X) := \sum_{X_2,X_3\in \{0,1\}} F(X_3, X_2,X)$ | $r_1(X)\longrightarrow\\\longleftarrow \alpha_1\in \mathbb{F}$ | $C\stackrel{?}{=} \sum_{X\in\{0,1\}} r_1(X)$ |
| 2 | $r_2(X) := \sum_{X_3\in \{0,1\}} F(X_3, X,\alpha_1)$ | $r_2(X)\longrightarrow\\\longleftarrow \alpha_2\in \mathbb{F}$ | $r_1(\alpha_1)\stackrel{?}{=} \sum_{X\in\{0,1\}} r_2(X)$ |
| 3 | $r_3(X) := F(X, \alpha_2,\alpha_1)$ | $r_3(X)\longrightarrow$ | $r_2(\alpha_2)\stackrel{?}{=} \sum_{X\in\{0,1\}} r_3(X)\\r_3(\alpha_3)\stackrel{?}{=} F(\alpha_3,\alpha_2,\alpha_1)$ |

What we would like to estimate is the prover's cost of evaluating the polynomials in each round. Committing to each of the linear univariate polynomials is not a computational bottleneck, since a linear polynomial has only two coefficients and the commitment amounts to a simple EC addition or a small MSM.

After a short calculation we find that the polynomial in round 1 has the structure

\begin{align}
r_1(X) &= \sum_{X_2,X_3\in\{0,1\}} F(X_3,X_2,X) =\sum_{i=0}^3 r_1^{(i)}(X)\\
r_1^{(i)}(X) &= f_{2i} (1-X) + f_{2i+1} X
\end{align}

It is easy to see that

\begin{equation}
\sum_{X\in \{0,1\}} r_1(X) = r_1(0) +r_1(1) = \sum_{i=0}^3 (f_{2i} + f_{2i+1}) = C
\end{equation}

Let the challenge at the end of round 1 be $\alpha_1 \in\mathbb{F}$. Then the polynomial in round 2 can be written as

\begin{align}
r_2(X) & = \sum_{X_3\in\{0,1\}} F(X_3,X,\alpha_1)= \sum_{i=0}^1 r_2^{(i)}(X)\\
r_2^{(i)}(X) &=r_1^{(2i)}(\alpha_1) (1-X) + r_1^{(2i+1)}(\alpha_1) X
\end{align}

Similarly, let the challenge at the end of round 2 be $\alpha_2 \in\mathbb{F}$. Then the polynomial in round 3 can be written as

\begin{align}
r_3(X) & = F(X,\alpha_2,\alpha_1)\\
&= r_2^{(0)}(\alpha_2) (1-X) + r_2^{(1)}(\alpha_2) X
\end{align}

The prover algorithm in this case is presented graphically in the following figure:

![](https://i.imgur.com/kKkEsH0.png)

The prover complexity is summarized in the following table (the notation $x\mathbb{F}$ refers to $x$ finite field elements in memory):

| Round | Action | Memory | Mult | Add |
| ----- | ------------------------------------------------------------------------------------------------------- | ------------- | ---- | --- |
| pre | Store $f_i$, $\forall i=0,1,\ldots ,7$ | $8\mathbb{F}$ | - | - |
| 1 | Read $f_i$, $\forall i=0,1,\ldots ,7$ <br /> commit $r_1(X)$ and receive $\alpha_1$ | | | |
| | Read $f_i$, $\forall i=0,1,\ldots ,7$ <br /> compute $r_1^{(i)}(\alpha_1)$, $\forall i=0,1,2,3$ | - | $8$ | $4$ |
| | Clear $f_i$, $\forall i=0,1,\ldots ,7$ <br /> store $r_1^{(i)}(\alpha_1)$, $\forall i=0,1,2,3$ | $4\mathbb{F}$ | - | - |
| 2 | Read $r_1^{(i)}(\alpha_1)$, $\forall i=0,1,2,3$ <br /> commit $r_2(X)$ and receive $\alpha_2$ | | | |
| | Read $r_1^{(i)}(\alpha_1)$, $\forall i=0,1,2,3$ <br /> compute $r_2^{(i)}(\alpha_2)$, $\forall i=0,1$ | - | $4$ | $2$ |
| | Clear $r_1^{(i)}(\alpha_1)$, $\forall i=0,1,2,3$ <br /> store $r_2^{(i)}(\alpha_2)$, $\forall i=0,1$ | $2\mathbb{F}$ | - | - |
| 3 | Read $r_2^{(i)}(\alpha_2)$, $\forall i=0,1$ <br /> commit $r_3(X)$ | | | |
| | Clear $r_2^{(i)}(\alpha_2)$, $\forall i=0,1$ | | | |
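For concreteness, here is a minimal sketch of these three rounds in code (our own illustration, with the same toy field as before and fixed constants standing in for the verifier's challenges). Each linear polynomial $r_k(X)$ is represented by the pair $(r_k(0), r_k(1))$, and the asserts mirror the verifier checks in the round table above.

```rust
const P: u128 = 0xffff_ffff_0000_0001; // same toy field as before
fn add(a: u128, b: u128) -> u128 { (a + b) % P }
fn sub(a: u128, b: u128) -> u128 { (a + P - b) % P }
fn mul(a: u128, b: u128) -> u128 { (a * b) % P }

/// A linear polynomial given by its evaluations at 0 and 1:
/// r(alpha) = (1 - alpha) * r(0) + alpha * r(1).
fn eval_linear(at0: u128, at1: u128, alpha: u128) -> u128 {
    add(mul(sub(1, alpha), at0), mul(alpha, at1))
}

fn main() {
    let mut layer: Vec<u128> = (0..8).collect(); // f_0, ..., f_7
    let c: u128 = layer.iter().fold(0, |a, &b| add(a, b)); // the claimed sum C
    let alphas = [5u128, 7, 11]; // stand-ins for the challenges alpha_1..alpha_3

    let mut claim = c;
    for &alpha in &alphas {
        // Prover: r_k(0) sums the even entries, r_k(1) the odd entries.
        let r_at_0 = layer.iter().step_by(2).fold(0, |a, &b| add(a, b));
        let r_at_1 = layer.iter().skip(1).step_by(2).fold(0, |a, &b| add(a, b));
        // Verifier: the previous claim must equal r_k(0) + r_k(1).
        assert_eq!(claim, add(r_at_0, r_at_1));
        claim = eval_linear(r_at_0, r_at_1, alpha); // next claim is r_k(alpha)
        // Prover: fold the table in half, layer[i] becomes r_k^{(i)}(alpha).
        layer = layer.chunks(2).map(|p| eval_linear(p[0], p[1], alpha)).collect();
    }
    // The single surviving value is F(alpha_3, alpha_2, alpha_1); in the real
    // protocol the verifier checks the final claim against a commitment opening.
    assert_eq!(layer[0], claim);
    println!("all round checks passed, final claim = {}", claim);
}
```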
### General form

The general form of the algorithm is easily extrapolated from this simple example. We consider an execution trace of size $T=2^n$ (for simplicity we assume the trace is always zero-padded to a power of 2). This trace can be interpolated by a multivariate polynomial $F(X_n,\ldots,X_1)$. The prover algorithm is then as follows:

* In round 1 the prover computes the linear polynomials
\begin{equation}
r_1^{(i)}(X) = (1-X) f_{2i} + X f_{2i+1},\quad \forall i=0,\ldots,T/2-1
\end{equation}
and commits to their sum
\begin{equation}
r_1(X) = \sum_{i=0}^{T/2-1} r_1^{(i)}(X) = \sum_{X_2,\ldots,X_n\in\{0,1\}} F(X_n,\ldots ,X_{2}, X)
\end{equation}
The prover then receives the challenge $\alpha_1$ from the verifier.
* In the $k$-th round the prover computes the linear polynomials
\begin{equation}
r_k^{(i)}(X) = (1-X) r^{(2i)}_{k-1}(\alpha_{k-1}) + X r^{(2i+1)}_{k-1}(\alpha_{k-1}),\quad \forall i=0,\ldots,T/2^k-1
\end{equation}
and commits to their sum
\begin{equation}
r_k(X) = \sum_{i=0}^{T/2^k-1} r_k^{(i)}(X) = \sum_{X_{k+1},\ldots,X_n\in\{0,1\}} F(X_n,\ldots ,X_{k+1}, X,\alpha_{k-1},\dots,\alpha_1)
\end{equation}
The prover then receives the challenge $\alpha_k$ from the verifier.
* In the final ($n$-th) round the prover computes and commits to the linear polynomial
\begin{equation}
r_{n}(X) = (1-X)r_{n-1}^{(0)}(\alpha_{n-1}) + X r_{n-1}^{(1)}(\alpha_{n-1})
\end{equation}

The overall prover complexity is thus $2T-4$ multiplications and $T-2$ additions, along with a memory requirement of $T$ field elements. In fact, one can do somewhat better by noting that

\begin{align}
r_k^{(i)}(X) = & (1-X) r^{(2i)}_{k-1}(\alpha_{k-1}) + X r^{(2i+1)}_{k-1}(\alpha_{k-1})\\
= & r^{(2i)}_{k-1}(\alpha_{k-1}) + X\left(r^{(2i+1)}_{k-1}(\alpha_{k-1}) - r^{(2i)}_{k-1}(\alpha_{k-1})\right)
\end{align}

which halves the number of required multiplications.

## Enter Fiat-Shamir

The Fiat-Shamir transformation takes an interactive public-coin proof, such as the sumcheck protocol, and converts it into a non-interactive proof. Instead of having the challenges $\alpha_i$ sent by the verifier, the transformation produces them at each round by hashing the transcript of the protocol so far. The transcript over which each hash is computed must include all of the verifiable results from the previous round, to avoid security vulnerabilities. An analysis of these vulnerabilities is presented [here](https://blog.trailofbits.com/2022/04/13/part-1-coordinated-disclosure-of-vulnerabilities-affecting-girault-bulletproofs-and-plonk/) and in much more technical detail [here](https://eprint.iacr.org/2016/771.pdf).

As a consequence, the rounds of the sumcheck protocol must be performed strictly sequentially, with no parallelization of the calculations between rounds. As is shown below, this fact seriously limits the ability to parallelize computations in HyperPlonk.
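A minimal sketch of the resulting non-interactive prover loop makes this concrete (our own illustration; `DefaultHasher` from the Rust standard library stands in for a proper transcript hash and is of course not cryptographically secure). The challenge of round $k$ is derived from $r_k$, so the fold of round $k$ cannot begin before the full pass that produced $r_k$ has finished.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const P: u128 = 0xffff_ffff_0000_0001; // same toy field as before
fn add(a: u128, b: u128) -> u128 { (a + b) % P }
fn sub(a: u128, b: u128) -> u128 { (a + P - b) % P }
fn mul(a: u128, b: u128) -> u128 { (a * b) % P }

/// Insecure stand-in for the Fiat-Shamir transcript hash.
fn fiat_shamir_challenge(transcript: &[u128]) -> u128 {
    let mut h = DefaultHasher::new();
    transcript.hash(&mut h);
    h.finish() as u128 % P
}

/// Non-interactive sumcheck prover over the 2^n evaluations of an MLE.
/// Returns the round messages (r_k(0), r_k(1)).
fn prove(mut layer: Vec<u128>) -> Vec<(u128, u128)> {
    let mut transcript = Vec::new();
    let mut messages = Vec::new();
    while layer.len() > 1 {
        // Pass 1: read the whole layer to compute the round polynomial r_k.
        let r_at_0 = layer.iter().step_by(2).fold(0, |a, &b| add(a, b));
        let r_at_1 = layer.iter().skip(1).step_by(2).fold(0, |a, &b| add(a, b));
        messages.push((r_at_0, r_at_1));
        // Hard stop: alpha_k depends on r_k, so the fold below cannot be
        // overlapped with the pass above, and rounds cannot be merged.
        transcript.push(r_at_0);
        transcript.push(r_at_1);
        let alpha = fiat_shamir_challenge(&transcript);
        // Pass 2: read the layer again and write the half-size fold, using the
        // multiplication-saving form r^{(2i)} + alpha * (r^{(2i+1)} - r^{(2i)}).
        layer = layer
            .chunks(2)
            .map(|p| add(p[0], mul(alpha, sub(p[1], p[0]))))
            .collect();
    }
    messages
}

fn main() {
    let trace: Vec<u128> = (0..(1u128 << 20)).collect();
    let messages = prove(trace);
    println!("{} strictly sequential rounds, each a full pass over memory", messages.len());
}
```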
## SumCheck in Hardware

Based on the above presentation of the non-interactive SumCheck, we can identify the following pattern for each round $i$:

- Read $2^i$ elements from memory
- Compute $\alpha_i = hash(transcript)$
- Compute the new $2^{i-1}$ elements
- Write $2^{i-1}$ elements to memory

Concretely, say we have $2^{30}$ field elements. In order to progress to the next round, with $2^{29}$ elements, we must read all $2^{30}$ elements from memory. We cannot proceed to the following round, with $2^{28}$ elements, until we complete the round with the $2^{29}$ elements. This *hard stop* means we must return to memory in each round: for size $2^{30}$, we must read from and write to memory 30 times.

For small enough sizes, say $2^{18}$, this memory can be SRAM, which is fast and unlikely to form a bottleneck. However, for large SumCheck sizes, e.g. $2^{30}$, which is what we expect to encounter in practice, there will be rounds that must work against DRAM, i.e. DDR or HBM, which is slow and will become the bottleneck instead of the SumCheck computation itself. The figure below illustrates the data flow for this SumCheck.

![](https://i.imgur.com/R1zYuiQ.png)

A memory bottleneck such as the one just described is exactly what we try to avoid when designing hardware accelerators, simply because there is not much we can do to accelerate this type of computation. By its nature, the vanilla SumCheck round structure is extremely serial, leaving little room for acceleration.

### NTT in Hardware

The obvious comparison is to how NTT is handled in hardware. In this note we will not discuss NTT at the depth we discussed SumCheck; for the sake of comparison, we provide the following diagram:

![](https://i.imgur.com/ByNZJhA.png)

NTT offers a good degree of parallelism, such that multiple rounds can happen without returning to memory. Observe that for NTT we do not have a *hard stop* due to Fiat-Shamir: the output of one computation (partial result of a round) can immediately feed the input of a second computation (partial input of the next round). As a rule of thumb, we can compute a full NTT of size $1k$ elements with just one access to memory. This is $10$x better than SumCheck! For example, we can solve an NTT of size $1$ billion ($\approx 1k^3$) with only $3$ round trips to memory, in each trip reading and writing all elements. This makes NTT compute-bound rather than memory-bound. In fact, our implementations of NTT on GPUs and FPGAs have accelerated the NTT computation so much that, for all relevant sizes, even compute is no longer a bottleneck! The next bottleneck in line, faced by these implementations, is the PCIe bandwidth used for communication with the host.

## Conclusions

After all, are we better off with Plonk and NTT rather than with HyperPlonk and MLE-SumCheck? At least from the hardware point of view, NTT, while not perfectly parallelizable, is still far more hardware-friendly than the MLE-SumCheck.

We note that the same round structure is found in FRI; for example, see the plonky2 FRI [implementation](https://github.com/mir-protocol/plonky2/blob/769175808448cecc7b431d7293e2d689af635a4b/plonky2/src/fri/prover.rs#L69). We assume the problem of hardware acceleration of FRI was not raised before because FRI in existing systems such as STARKs takes only a small fraction of the overall computation, especially compared to the many, and much larger, NTTs.

### Take this note with a grain of salt!

We focused here only on the "vanilla" SumCheck and its structure, and did not address the solution space at all. Hardware acceleration of SumCheck is an understudied problem, especially compared to NTT. It is likely that techniques to improve the hardware-friendliness of SumCheck exist or will be invented in the future: ranging from cryptography (Fiat-Shamir plays a role here), through new variants of SumCheck, to ideas enabling parallelism at higher levels of the protocol design (some of them presented in the HyperPlonk paper, such as batching).

**Departing thought:** Have we reached a point where ZK protocols must be co-designed with hardware in mind?

*Thanks to Tomer, Kobi, Suyash, Miki, Karthik, Oren and Yuval*