Hardware-friendliness of HyperPlonk

In this note we provide the hardware perspective on HyperPlonk. We focus on the main building block, the Multivariate SumCheck protocol, and compare its computational and memory complexity to that of an NTT (Number Theoretic Transform).

Background - HyperPlonk

Plonk is one of the most widely adopted SNARKs in the industry. In vanilla Plonk, after arithmetization, the resulting execution trace is interpolated into univariate polynomials, so the protocol relies heavily on NTTs. HyperPlonk is a new adaptation of Plonk in which the execution trace is interpolated on a boolean hypercube. The polynomial representation of the trace is therefore a multivariate polynomial of degree at most one in each variable, known as an MLE (MultiLinear Extension). A good overview of HyperPlonk can be found in Benedikt Bünz's ZKSummit8 talk.
One key advantage of HyperPlonk is the elimination of large NTTs, a major computational bottleneck in Plonk over large circuits. By moving to the boolean hypercube, we no longer need univariate polynomials. Instead, HyperPlonk relies on multivariate polynomial arithmetic. Section 3 of the HyperPlonk paper is devoted to developing the toolbox for working with multivariate polynomials. The figure below, taken from the paper, shows how HyperPlonk is built out of this toolbox. As can be seen, at the root of it all we have the classical SumCheck protocol, which is bound to become the main computational problem in HyperPlonk (polynomial commitments aside), replacing NTTs altogether.

Sumcheck in HyperPlonk

A toy example

Consider an execution trace unrolled in the form of a vector. We illustrate the general idea using a vector of length $8$, constituted of the polynomial evaluations $\{f_{i}\}_{i = 0}^{7}$, where $f_{i} \in F$. We interpolate these values using a multivariate polynomial $F (X_{3}, X_{2}, X_{1})$ as follows

$f_{0}$	$f_{1}$	$f_{2}$	$f_{3}$	$f_{4}$	$f_{5}$	$f_{6}$	$f_{7}$
$F (0, 0, 0)$	$F (0, 0, 1)$	$F (0, 1, 0)$	$F (0, 1, 1)$	$F (1, 0, 0)$	$F (1, 0, 1)$	$F (1, 1, 0)$	$F (1, 1, 1)$

The elements in the first row are finite field elements, and the second row shows the point of the boolean hypercube at which $F$ takes each of these values. We interpolate the values using Lagrange interpolation

$$
\begin{aligned}
F (X_{3}, X_{2}, X_{1}) = {} & f_{0} (1 - X_{3}) (1 - X_{2}) (1 - X_{1}) + f_{1} (1 - X_{3}) (1 - X_{2}) X_{1} \\
& + f_{2} (1 - X_{3}) X_{2} (1 - X_{1}) + f_{3} (1 - X_{3}) X_{2} X_{1} \\
& + f_{4} X_{3} (1 - X_{2}) (1 - X_{1}) + f_{5} X_{3} (1 - X_{2}) X_{1} \\
& + f_{6} X_{3} X_{2} (1 - X_{1}) + f_{7} X_{3} X_{2} X_{1}
\end{aligned}
$$

where $X$ and $1 - X$ are the Lagrange basis polynomials over the binary domain $\{0, 1\}$. This unique polynomial $F (X_{3}, X_{2}, X_{1})$ is known as the MultiLinear Extension of the vector. Let the sumcheck problem in this case be defined as follows

$$
\sum_{X_{3}, X_{2}, X_{1} \in \{0, 1\}} F (X_{3}, X_{2}, X_{1}) \overset{?}{=} C
$$

where $C \in F$. The protocol proceeds with the following steps. In each round the prover computes and commits to a linear univariate polynomial, and receives a challenge from the verifier.

Round	Prover $P$	Communication	Verifier $V$
1	$r_{1} (X) := \sum_{X_{3}, X_{2} \in \{0, 1\}} F (X_{3}, X_{2}, X)$	$r_{1} (X) ⟶$ , $⟵ α_{1} \in F$	$C \overset{?}{=} \sum_{X \in \{0, 1\}} r_{1} (X)$
2	$r_{2} (X) := \sum_{X_{3} \in \{0, 1\}} F (X_{3}, X, α_{1})$	$r_{2} (X) ⟶$ , $⟵ α_{2} \in F$	$r_{1} (α_{1}) \overset{?}{=} \sum_{X \in \{0, 1\}} r_{2} (X)$
3	$r_{3} (X) := F (X, α_{2}, α_{1})$	$r_{3} (X) ⟶$	$r_{2} (α_{2}) \overset{?}{=} \sum_{X \in \{0, 1\}} r_{3} (X)$ , then sample $α_{3} \in F$ and check $r_{3} (α_{3}) \overset{?}{=} F (α_{3}, α_{2}, α_{1})$
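
To make the three rounds in the table concrete, here is a minimal Python sketch of the toy protocol. It evaluates the round polynomials by brute force, straight from their definitions, so it is deliberately inefficient. The modulus `P`, the helper names `mle_eval`, `round_poly` and `run_sumcheck`, and the use of Python's `random` module for the verifier's challenges are illustrative assumptions and not part of the note. The round polynomials are also passed in the clear instead of being committed.

```python
import random
from itertools import product

P = 2**61 - 1  # illustrative prime modulus (assumption, not from the note)

def mle_eval(f, point):
    """Evaluate the multilinear extension of f at point = (X_n, ..., X_1)."""
    n = len(point)
    total = 0
    for idx, value in enumerate(f):
        term = value
        for j, x in enumerate(point):
            bit = (idx >> (n - 1 - j)) & 1      # X_n is the most significant bit
            term = term * (x if bit else 1 - x) % P
        total = (total + term) % P
    return total

def round_poly(f, n, alphas, x):
    """r_k(x) for k = len(alphas) + 1, with alphas = [alpha_1, ..., alpha_{k-1}]."""
    k = len(alphas) + 1
    total = 0
    for free in product((0, 1), repeat=n - k):  # (X_n, ..., X_{k+1})
        point = list(free) + [x] + list(reversed(alphas))
        total = (total + mle_eval(f, point)) % P
    return total

def run_sumcheck(f, n, rng):
    """Run prover and verifier together; return True if every check passes."""
    claim = sum(f) % P                          # the claimed sum C
    alphas = []
    for _ in range(n):
        r0 = round_poly(f, n, alphas, 0)
        r1 = round_poly(f, n, alphas, 1)
        if (r0 + r1) % P != claim:              # verifier's check for this round
            return False
        alpha = rng.randrange(P)                # verifier's challenge
        claim = (r0 + alpha * (r1 - r0)) % P    # r_k(alpha), since r_k is linear
        alphas.append(alpha)
    # Final check: r_n(alpha_n) == F(alpha_n, ..., alpha_1)
    return claim == mle_eval(f, list(reversed(alphas)))

f = [3, 1, 4, 1, 5, 9, 2, 6]
print(run_sumcheck(f, 3, random.Random(0)))     # True
```

The brute-force `round_poly` costs on the order of $T$ evaluations of $F$ per call, which is exactly the cost the folding trick described next avoids.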

What we would like to estimate is the prover's cost of evaluating the polynomials in each round. Committing to each of the linear univariate polynomials is not a computational bottleneck for the prover, since it is a simple EC addition or a small MSM.

After a short calculation we find that the polynomial in round 1 has the structure

$$
\begin{aligned}
r_{1} (X) & = \sum_{X_{3}, X_{2} \in \{0, 1\}} F (X_{3}, X_{2}, X) = \sum_{i = 0}^{3} r_{1}^{(i)} (X) \\
r_{1}^{(i)} (X) & = f_{2 i} (1 - X) + f_{2 i + 1} X
\end{aligned}
$$

It follows that

$$
\sum_{X \in \{0, 1\}} r_{1} (X) = r_{1} (0) + r_{1} (1) = \sum_{i = 0}^{3} (f_{2 i} + f_{2 i + 1}) = C
$$

Let the challenge at the end of round 1 be $α_{1} \in F_{p}$. Then the polynomial in round 2 can be written as

$$
\begin{aligned}
r_{2} (X) & = \sum_{X_{3} \in \{0, 1\}} F (X_{3}, X, α_{1}) = \sum_{i = 0}^{1} r_{2}^{(i)} (X) \\
r_{2}^{(i)} (X) & = r_{1}^{(2 i)} (α_{1}) (1 - X) + r_{1}^{(2 i + 1)} (α_{1}) X
\end{aligned}
$$

Similarly, let the challenge at the end of round 2 be $α_{2} \in F_{p}$. Then the polynomial in round 3 can be written as

$$
\begin{aligned}
r_{3} (X) & = F (X, α_{2}, α_{1}) \\
& = r_{2}^{(0)} (α_{2}) (1 - X) + r_{2}^{(1)} (α_{2}) X
\end{aligned}
$$

The prover algorithm in this case is presented graphically in the following figure

And the prover complexity is summarized in the following table (the notation $x F$ refers to $x$ finite field elements in memory)

Round	Action	Memory	Mult	Add
pre	Store $f_{i}$ , $\forall i = 0, 1, \dots, 7$	$8 F$	-	-
1	Read $f_{i}$ , $\forall i = 0, 1, \dots, 7$ , commit $r_{1} (X)$ and receive $α_{1}$	-	-	-
	Read $f_{i}$ , $\forall i = 0, 1, \dots, 7$ , compute $r_{1}^{(i)} (α_{1})$ , $\forall i = 0, 1, 2, 3$	-	$8$	$4$
	Clear $f_{i}$ , $\forall i = 0, 1, \dots, 7$ , store $r_{1}^{(i)} (α_{1})$ , $\forall i = 0, 1, 2, 3$	$4 F$	-	-
2	Read $r_{1}^{(i)} (α_{1})$ , $\forall i = 0, 1, 2, 3$ , commit $r_{2} (X)$ and receive $α_{2}$	-	-	-
	Read $r_{1}^{(i)} (α_{1})$ , $\forall i = 0, 1, 2, 3$ , compute $r_{2}^{(i)} (α_{2})$ , $\forall i = 0, 1$	-	$4$	$2$
	Clear $r_{1}^{(i)} (α_{1})$ , $\forall i = 0, 1, 2, 3$ , store $r_{2}^{(i)} (α_{2})$ , $\forall i = 0, 1$	$2 F$	-	-
3	Read $r_{2}^{(i)} (α_{2})$ , $\forall i = 0, 1$ , commit $r_{3} (X)$	-	-	-
	Clear $r_{2}^{(i)} (α_{2})$ , $\forall i = 0, 1$	-	-	-

General form

The general form of the algorithm is easily extrapolated from this simple example. We consider an execution trace of size $T = 2^{n}$ (for simplicity we assume that the trace is always zero padded to a power of 2). This trace can be interpolated using a multivariate polynomial $F (X_{n}, \dots, X_{1})$. The prover algorithm is then as follows:

In round 1 the prover computes the linear polynomials $r_{1}^{(i)} (X) = (1 - X) f_{2 i} + X f_{2 i + 1}$ , $\forall i = 0, \dots, T / 2 - 1$ , and commits to their sum $r_{1} (X) = \sum_{i = 0}^{T / 2 - 1} r_{1}^{(i)} (X) = \sum_{X_{n}, \dots, X_{2} \in \{0, 1\}} F (X_{n}, \dots, X_{2}, X)$ . The prover then receives the challenge $α_{1}$ from the verifier.
In the $k$-th round the prover computes the linear polynomials $r_{k}^{(i)} (X) = (1 - X) r_{k - 1}^{(2 i)} (α_{k - 1}) + X r_{k - 1}^{(2 i + 1)} (α_{k - 1})$ , $\forall i = 0, \dots, T / 2^{k} - 1$ , and commits to their sum $r_{k} (X) = \sum_{i = 0}^{T / 2^{k} - 1} r_{k}^{(i)} (X) = \sum_{X_{n}, \dots, X_{k + 1} \in \{0, 1\}} F (X_{n}, \dots, X_{k + 1}, X, α_{k - 1}, \dots, α_{1})$ . The prover then receives the challenge $α_{k}$ from the verifier.
In the final $n$-th round the prover computes and commits to the linear polynomial $r_{n} (X) = (1 - X) r_{n - 1}^{(0)} (α_{n - 1}) + X r_{n - 1}^{(1)} (α_{n - 1})$ .
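
The round structure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the modulus `P`, the function name `sumcheck_prover` and the `get_challenge` callback (standing in for the verifier) are hypothetical, and each round polynomial is represented by its two values $r_{k} (0)$ and $r_{k} (1)$ instead of a commitment.

```python
import random

P = 2**61 - 1  # illustrative prime modulus (assumption, not from the note)

def sumcheck_prover(f, get_challenge):
    """Return [(r_k(0), r_k(1)) for k = 1..n] for a table f of length T = 2^n >= 2."""
    table = [v % P for v in f]                 # T field elements in memory
    messages = []
    while True:
        r0 = sum(table[0::2]) % P              # r_k(0): sum of even-indexed entries
        r1 = sum(table[1::2]) % P              # r_k(1): sum of odd-indexed entries
        messages.append((r0, r1))              # a linear r_k is fixed by two values
        if len(table) == 2:                    # final round n: nothing left to fold
            return messages
        alpha = get_challenge(r0, r1)          # challenge alpha_k from the verifier
        # New entry i is r_k^{(i)}(alpha_k) = (1 - alpha_k) * table[2i] + alpha_k * table[2i+1]
        table = [((1 - alpha) * lo + alpha * hi) % P
                 for lo, hi in zip(table[0::2], table[1::2])]

# Usage: a stand-in verifier that records the random challenges it hands out.
rng = random.Random(1)
alphas = []

def random_challenge(r0, r1):
    alphas.append(rng.randrange(P))
    return alphas[-1]

f = [3, 1, 4, 1, 5, 9, 2, 6]
messages = sumcheck_prover(f, random_challenge)
```

Each pass halves the table, so the working set shrinks from $T$ to $T / 2$, $T / 4$ and so on, and the verifier's check in round $k$ is that $r_{k} (0) + r_{k} (1)$ equals $r_{k - 1} (α_{k - 1})$.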

The overall prover complexity is thus $2 T - 4$ multiplications and $T - 2$ additions, along with a memory requirement of $T$ field elements. In fact, one can do somewhat better by noting that

$$
\begin{aligned}
r_{k}^{(i)} (X) & = (1 - X) r_{k - 1}^{(2 i)} (α_{k - 1}) + X r_{k - 1}^{(2 i + 1)} (α_{k - 1}) \\
& = r_{k - 1}^{(2 i)} (α_{k - 1}) + X (r_{k - 1}^{(2 i + 1)} (α_{k - 1}) - r_{k - 1}^{(2 i)} (α_{k - 1}))
\end{aligned}
$$

which halves the number of required multiplications, at the cost of one extra subtraction per folded element.
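
As a quick check of these counts, here is a short derivation that follows the accounting of the toy table, where only the folding in rounds $1$ to $n - 1$ is charged. Round $k$ produces $T / 2^{k}$ folded values, so the total number of folded values is

$$
\sum_{k = 1}^{n - 1} \frac{T}{2^{k}} = T \left( 1 - \frac{1}{2^{n - 1}} \right) = T - 2
$$

In the form $(1 - X) a + X b$ each folded value costs two multiplications and one addition, which gives $2 (T - 2) = 2 T - 4$ multiplications and $T - 2$ additions. In the rearranged form $a + X (b - a)$ each folded value costs one multiplication, one subtraction and one addition, which gives $T - 2$ multiplications and $2 T - 4$ additions or subtractions. For the toy example with $T = 8$ this is $12$ multiplications and $6$ additions before the rearrangement, and $6$ multiplications and $12$ additions or subtractions after it.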

Enter Fiat-Shamir

The Fiat-Shamir transformation takes an interactive public-coin proof, such as the sumcheck protocol, and converts it into a non-interactive proof. Instead of having the challenges $α_{i}$ sent by the verifier, the transformation produces them in each round by computing a hash function of the previous transcript of the protocol. The transcript over which each hash is calculated must include all of the verifiable results from the previous round, to avoid security vulnerabilities. An analysis of these vulnerabilities is presented here and in much more technical detail here. As a consequence, the rounds of the sumcheck protocol must be performed sequentially, with no parallelization of the calculations between rounds. As is shown below, this fact seriously reduces the ability to parallelize computations in HyperPlonk.
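
The sequential dependency can be seen in a small non-interactive sketch. Everything specific here is an assumption made for illustration: SHA-256 as the hash, the 8-byte big-endian encoding in `absorb`, the domain tag `b"sumcheck-demo"`, the modulus `P`, and hashing the values $r_{k} (0)$, $r_{k} (1)$ directly in place of a commitment. A real system would use its own transcript format and hash.

```python
import hashlib

P = 2**61 - 1  # illustrative prime modulus (assumption, not from the note)

def absorb(transcript, *elements):
    """Append field elements to the transcript (hypothetical 8-byte encoding)."""
    for e in elements:
        transcript += e.to_bytes(8, "big")
    return transcript

def challenge(transcript):
    """Derive a field element from the whole transcript so far."""
    return int.from_bytes(hashlib.sha256(transcript).digest(), "big") % P

def prove(f):
    """Return the round messages and the final folded value F(alpha_n, ..., alpha_1)."""
    table = [v % P for v in f]
    transcript = absorb(b"sumcheck-demo", len(table), sum(table) % P)
    proof = []
    while len(table) > 1:
        r0, r1 = sum(table[0::2]) % P, sum(table[1::2]) % P
        proof.append((r0, r1))
        transcript = absorb(transcript, r0, r1)
        alpha = challenge(transcript)   # cannot be known before r0 and r1 are complete
        table = [(lo + alpha * (hi - lo)) % P
                 for lo, hi in zip(table[0::2], table[1::2])]
    return proof, table[0]

def verify(claim, proof, final_eval):
    """Replay the transcript and check every round, then the final evaluation."""
    claim %= P
    transcript = absorb(b"sumcheck-demo", 2 ** len(proof), claim)
    for r0, r1 in proof:
        if (r0 + r1) % P != claim:
            return False
        transcript = absorb(transcript, r0, r1)
        alpha = challenge(transcript)
        claim = (r0 + alpha * (r1 - r0)) % P
    return claim == final_eval

f = [3, 1, 4, 1, 5, 9, 2, 6]
proof, final_eval = prove(f)
```

In this sketch `final_eval` stands in for the evaluation of $F$ that the verifier would obtain through the polynomial commitment. The point to notice is that `challenge` needs sums over the entire current table, so the fold of round $k$ cannot start until round $k$'s pass over the data has finished.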

SumCheck in Hardware

Based on the above presentation of non-interactive SumCheck, we can identify the following pattern for each round $i$, where $2^{i}$ is the number of elements entering the round:
Read $2^{i}$ elements from memory
Compute $α_{i} = \mathrm{hash} (\mathrm{transcript})$
Compute new $2^{i - 1}$ elements
Write $2^{i - 1}$ elements to memory

Concretely, say we have $2^{30}$ field elements. In order to progress to the next round with $2^{29}$ elements, we must use (read from memory) all the $2^{30}$ elements. We cannot proceed to the following round, with $2^{28}$ elements, until we complete the round with the $2^{29}$ elements. This hard stop means we need to return to memory in each round. So, for example, for size $2^{30}$ we must read from and write to memory 30 times. For small sizes, say $2^{18}$, this memory can be SRAM, which is fast and unlikely to form a bottleneck. However, for large SumCheck sizes, e.g. $2^{30}$, which is what we expect to encounter in practice, there will be rounds that must work with DRAM, i.e. DDR or HBM, which is slow and will become the bottleneck instead of the SumCheck computation. The figure below illustrates the data flow for this SumCheck.

A memory bottleneck such as the one we just described is what we usually want to avoid when designing hardware accelerators, simply because there is not much we can do to accelerate these types of computations. By its nature, the vanilla SumCheck round structure is extremely serial, and therefore there is not much room for acceleration.
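
The per-round pattern described in this section can be tabulated with a small model. This is a rough sketch under stated assumptions: the function name `sumcheck_memory_rounds` is hypothetical, the SRAM capacity of $2^{18}$ elements is borrowed from the example size mentioned above and is not a measured figure, and a round is labelled "DRAM" simply when its input does not fit in that capacity.

```python
def sumcheck_memory_rounds(n, sram_log2=18):
    """List (i, reads, writes, read_source) for rounds i = n, n-1, ..., 1.

    Round i reads 2^i elements and writes 2^(i-1) elements, as in the
    pattern above. sram_log2 is an assumed on-chip capacity (log2 of elements).
    """
    rounds = []
    for i in range(n, 0, -1):
        source = "SRAM" if i <= sram_log2 else "DRAM"
        rounds.append((i, 2**i, 2**(i - 1), source))
    return rounds

rounds = sumcheck_memory_rounds(30)
total_reads = sum(reads for _, reads, _, _ in rounds)
total_writes = sum(writes for _, _, writes, _ in rounds)
dram_rounds = [i for i, _, _, source in rounds if source == "DRAM"]
print(len(rounds), total_reads, total_writes, len(dram_rounds))
```

Under these assumptions a SumCheck of size $2^{30}$ has 30 dependent passes, of which the 12 largest (those reading $2^{19}$ to $2^{30}$ elements) cannot be served from on-chip memory.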

NTT in Hardware

The obvious comparison here is to how NTT is handled in hardware. In this note we will not discuss NTT at the same level of depth we discussed the SumCheck. For the sake of comparison, we provide the following diagram:

NTT holds a good degree of parallelism, such that multiple rounds can happen without returning to memory. Observe that for NTT we do not have a hard stop due to Fiat-Shamir: the output of one computation (partial result of a round) can immediately feed the input to a second computation (partial input of the next round). As a rule of thumb, we can compute a full NTT of size 1k ($2^{10}$) elements with just one access to memory. This is $10$x better than SumCheck, which needs one memory round trip per round, i.e. $10$ for the same size. As an example, we can solve an NTT of size $1$ trillion with only $3$ round trips to memory, in this case reading and writing all elements.
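
A back-of-the-envelope model of this comparison is sketched below. It assumes that one memory pass of the NTT completes `log_chunk` butterfly layers on-chip, so that a size-$2^{n}$ NTT needs $\lceil n / \text{log\_chunk} \rceil$ passes, while SumCheck needs one pass per round. Both function names and the model itself are simplifying assumptions, not a description of any particular implementation.

```python
import math

def ntt_memory_passes(log_size, log_chunk):
    """Passes over memory if each pass finishes log_chunk NTT layers on-chip."""
    return math.ceil(log_size / log_chunk)

def sumcheck_memory_passes(log_size):
    """Vanilla SumCheck returns to memory once per round."""
    return log_size

print(ntt_memory_passes(10, 10), sumcheck_memory_passes(10))
print(ntt_memory_passes(40, 14), sumcheck_memory_passes(40))
```

With `log_chunk = 10` the model reproduces the rule of thumb of one pass for 1k elements against ten for SumCheck. For about one trillion elements ($\approx 2^{40}$), three passes correspond in this model to on-chip sub-NTTs of at least $2^{14}$ points. That chunk size is an inference from the model and not a figure stated in the note.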

This makes NTT computationally bottlenecked, and not memory bottlenecked. In fact, our implementations of NTT in GPUs and FPGAs managed to accelerate the NTT computation so much that for all relevant sizes even the compute is no longer a bottleneck! The next bottleneck in line faced by these implementations is the PCIe bandwidth, used for communication with the host.

Conclusions

After all - are we better off with Plonk and NTT rather than with HyperPlonk and MLE-SumCheck?

At least from the hardware point of view, NTT, while not perfectly parallel, is still far more hardware friendly than the MLE-SumCheck. We note that the same structure is found in FRI; for example, see the plonky2 FRI implementation. We assume that the problem with hardware acceleration of FRI was not raised before because FRIs in existing systems such as STARKs take only a small fraction of the overall computation, especially compared to the many, and much larger, NTTs.

Take this note with a grain of salt!

We focused here only on the "vanilla" SumCheck and its structure and did not address the solution space at all. The hardware acceleration of SumCheck is an understudied problem, especially compared to NTT. It is likely that techniques to improve the hardware-friendliness of the SumCheck exist or will be invented in the future, ranging from cryptography (Fiat-Shamir plays a role here), through new variants of SumCheck, to ideas enabling parallelism at higher levels of the protocol design (some of them presented in the HyperPlonk paper, such as Batching).

Departing thought: Have we reached a point where ZK protocols must be co-designed with hardware in mind?

Thanks to Tomer, Kobi, Suyash, Miki, Karthik, Oren and Yuval

Suyash Bagad

2022/12/27 13:40:20

The first round polynomial definition needs a slight correction, it should be: $r_1(X) := \sum_{x_3, x_2 \in \{0,1\}} F(x_3, x_2, X)$. Similarly, the second round polynomial must be: $r_2(X) := \sum_{x_3 \in \{0,1\}} F(x_3, X, \alpha_1)$

2022/12/27 13:42:01

Also, in round 3, the verifier equation has a tiny correction: $r_3(\alpha_3) = F(\alpha_3, \alpha_2, \alpha_1)$

2022/12/27 13:45:12

Before this line, the following should be mentioned: In practice, the prover does not send the round polynomials to the verifier as that would blow up the communication. Instead, the prover only sends commitments to the round polynomials.

2022/12/27 13:57:00

Small correction: $r_k^{(i)} = (1 - X) r_{k-1}^{(2i)} + X r_{k-1}^{(2i+1)}$

Michael Asa

2022/12/27 15:13:20

A straightforward optimization would be committing $r_2(X)$ and receiving $α_2$ while storing the elements of the previous layer.

2022/12/27 15:13:59

so this table is quite naive

josephljohnston

2023/02/04 16:03:41

we need to return to the memory in each round

Yeah, but only the tiny univariates need to be summed to memory and a challenge returned, yet you're rewriting the whole $2^i$ multivariate to memory each round. For whatever hardware constraints make you do this (e.g. GPUs without block synchronization), why don't the constraints apply to FFTs? FFTs seem even more of an issue since outputs need to be shared between units before the next round...