Try   HackMD

Hardware-friendliness of HyperPlonk

In this note we provide the hardware perspective on HyperPlonk. We focus on the main building block, the Multivariate SumCheck protocol and compare its computational and memory complexity to that of an NTT (Number Theoretic Transform).

Background - HyperPlonk

Plonk is one of the most widely adopted SNARKs in the industry. In vanilla Plonk after arithmetization, the resulting execution trace is interpolated into univariate polynomials and thus the resulting protocol relies heavily on NTTs. HyperPlonk is a new adaptation of Plonk, where the execution trace is interpolated on a boolean hypercube. Thus the polynomial representation of the trace is a multivariate polynomial with linear degree in each variable. This is known as an MLE (MultiLinear Extension). A good overview of Hyperplonk can be found in Benedikt Bunz ZKSummit8 talk.
One key advantage of HyperPlonk is the elimination of large NTTs, a major computational bottleneck in Plonk over large-circuits. By moving to the boolean hypercube, we no longer need univariate polynomials. Instead, HyperPlonk relies on multivariate polynomial arithemetic. Section 3 in the Hyperplonk paper is devoted to developing the toolbox for working with multivariate polynomials. The figure below, taken from the paper, shows how HyperPlonk is built out of this toolbox. As can be seen, at the root of it all we have the classical SumCheck protocol, which is bound to become the main computational problem in HyperPlonk (polynomial commitments aside), replacing NTTs all together.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Sumcheck in HyperPlonk

A toy example

Consider an execution trace unrolled in the form of a vector. We illustrate the general idea using a vector of length

8, constituted of the polynomial evaluations
{fi}i=07
, where
fiF
. We interpolate these values using a multivariate polynomial
F(X3,X2,X1)
as follows

f0
f1
f2
f3
f4
f5
f6
f7
F(0,0,0)
F(0,0,1)
F(0,1,0)
F(0,1,1)
F(1,0,0)
F(1,0,1)
F(1,1,0)
F(1,1,1)

The elements in the first row are finite field elements, and the second row represents the interpolation idea. We interpolate the values using the Lagrange interpolation

F(X3,X2,X1)=f0(1X3)(1X2)(1X1)+f1(1X3)(1X2)X1+f2(1X3)X2(1X1)+f3(1X3)X2X1)+f4X3(1X2)(1X1)+f5X3(1X2)X1+f6X3X2(1X1)+f7X3X2X1
where
X
and
1X
are Lagrange base polynomials defined in a binary field. This unique polynomial
F(X3,X2,X1)
is known as the MultiLinear Extension of a vector. Let the sumcheck problem in this case be defined as follows
Xi{0,1}F(X3,X2,X1)=?C

where
CF
. The protocol proceeds with the following steps, where in each round the prover computes and commits to a univariate polynomial (Linear), and recieves a challenge from the verifier.

Round Prover
P
Communication Verifier
V
1
r1(X):=X2,3{0,1}F(X3,X2,X)
r1(X)α1F
C=?X{0,1}r1(X)
2
r2(X):=X3{0,1}F(X3,X,α1)
r2(X)α2F
r1(α1)=?X{0,1}r2(X)
3
r3(X):=F(X,α2,α1)
r3(X)
r2(α2)=?X{0,1}r3(X)r3(α3)=?F(α3,α2,α1)

What we would like to estimate is the complexity of the prover in the evaluation of the polynomials in each round. The commitment of each of the linear univariate polynomials by the prover is not a computational bottleneck, since it is a simple EC addition or a small MSM.

After a short calculation we find that the structure of the polynomial at the end of round 1 is

r1(X)=Xi{0,1}F(X2,X2,X)=i=03r1(i)(X)r1(i)(X)=f2i(1X)+f2i+1X
It is obvious that
X{0,1}r1(X)=r1(0)+r1(1)=i=03(f2i+f2i+1)=C

Let the challenge at the end of round 1 be
α1Fp
. Then the polynomial in round 2 can be written as
r2(X)=Xi{0,1}F(X2,X,α1)=i=01r2(i)(X)r2(i)(X)=r1(2i)(α1)(1X)+r1(2i+1)(α1)X

Similarly let the challenge at the end of round 2 be
α2Fp
. Then the polynomial in round 3 can be written as
r3(X)=F(X,α2,α1)=r2(0)(α2)(1X)+r2(1)(α2)X

The prover algorithm in this case is presented graphically in the following figure
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

And the prover complexity is summarized in the following table (The notation

xF refers to
x
finite field elements in memory)

Round Action Memory Mult Add
pre Store
fi
,
i=0,1,,7
8F
- -
1 Read
fi
,
i=0,1,,7

commit
r1(X)
and receive
α1
Read
fi
,
i=0,1,,7

compute
r1(i)(α1)
,
i=0,1,2,3
-
8
4
Clear
fi
,
i=0,1,,7

store
r1(i)(α1)
,
i=0,1,2,3
4F
- -
2 Read
r1(i)(α1)
,
i=0,1,2,3

commit
r2(X)
and receive
α2
Read
r1(i)(α1)
,
i=0,1,2,3

compute
r2(i)(α2)
,
i=0,1
-
4
2
Clear
r1(i)(α1)
,
i=0,1,2,3

store
r2(i)(α2)
,
i=0,1
2F
- -
3 Read
r2(i)(α2)
,
i=0,1

commit
r3(X)
Clear
r2(i)(α2)
,
i=0,1

General form

The general form of the algorithm is easily extrapolated from this simple example. We consider an execution trace of size

T=2n (for simplicity we assume that the trace is always zero padded to a power of 2). This trace can be interpolated using a multivariate polynomial
F(Xn,,X1)
. The prover algorithm is then as follows:

  • In round 1 the prover computes the linear polynomials
    r1(i)(X)=(1X)f2i+Xf2i+1, i=0,,T/21
    and commits their sum
    r1(X)=i=0T/21r1(i)(X)=Xi{0,1}F(Xn,,X2,X)
    the prover then receives the challenge
    α1
    from the verifier.
  • In the
    k
    -th round the prover computes the linear polynomials
    rk(i)(X)=(1X)rk1(2i)(αk1)+Xrk1(2i+1)(αk1), i=0,,T/2k1
    and commits their sum
    rk(X)=i=0T/2k1rk(i)(X)=Xi{0,1}F(Xn,,Xk+1,X,αk1,,α1)
    the prover then receives the challenge
    αk
    from the verifier.
  • In the final
    n
    -th round the prover computes and commits the linear polynomial
    rn(X)=(1X)rn1(0)(αn1)+Xrn1(1)(αn1)

The overall prover complexity is thus

2T4 multiplications and
T2
additions, along with a memory requirement of
T
field elements. In fact, one can do somewhat better by noting that
rk(i)(X)=(1X)rk1(2i)(αk1)+Xrk1(2i+1)(αk1)=rk1(2i)(αk1)+X(rk1(2i+1)(αk1)rk1(2i)(αk1))

Which halves the number of required multiplications.

Enter Fiat-Shamir

The Fiat-Shamir transformation takes an interactive public coin proof such as the sumcheck protocol, and converts it into a non-interactive proof. Instead of having the challenges

αi be sent by the verifier, the transformation produces them at each round by computing a hash function of the previous transcript of the protocol. The transcript for which each hash function is calculated must include all of the verifiable results from the previous round, to avoid security vulnerabilities. An analysis of these vulnerablities is presented here and in much more technical detail here. As a consequence, each round in the sumcheck protocol must be performed sequentially, with no parallelization of the calculations between rounds. As is shown below, this fact seriously reduces the ability to parallelize computations in HyperPlonk.

SumCheck in Hardware

Based on the above presentation of non-interactive SumCheck, we can identify the following pattern for each round

i:

  • Read
    2i
    elements from memory
  • Compute
    ai=hash(transcript)
  • Compute new
    2i1
    elements
  • Write
    2i1
    elements to memory

Concretely, say we have

230 field elements. In order to progress to the next round with
229
elements, we must use (read from memory) all the
230
elements. We cannot proceed to the next round, with
228
elements, until we complete the round with the
229
elements. This hard stop means we need to return to the memory in each round. So for example, for size
230
we must read and write from/to the memory 30 times. For small enough sizes, say
218
this memory can be SRAM, which is fast and unlikely to form bottlenecks. However, for large SumCheck sizes, e.g.
230
, which is what we expect to encounter in practice, there will be rounds that must work with the DRAM, i.e. DDR or HBM, which is slow and will become the bottleneck instead of the SumCheck computation. The figure below illustrates the data flow for this SumCheck.

Memory bottleneck such as the one we just described is what we usually want to avoid when designing hardware accelerators - simply because there is not much we can do to accelerate these types of computations. By its nature, the vanilla SumCheck round-structure is extremely serial, and therefore there is not much room for acceleration.

NTT in Hardware

The obvious comparison here is to how NTT is handled in hardware. In this note we will not discuss NTT at the same level of depth we discussed the SumCheck. For the sake of comparison, we provide the following diagram:

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

NTT holds a good degree of parallelism such that multiple rounds can happen without returning to memory. Observe that for NTT we do not have a hard stop due to Fiat-Shamir - the output of one computation (partial result of a round) can immediately feed the input to a second computation (partial input of the next round). As a rule of thumb: We can compute a full NTT of size

1k elements with just one access to memory. This is
10
x better than SumCheck! As an example - we can solve an NTT of size
1
trillion with only
3
round trips to memory, in this case reading and writing all elements.

This makes NTT computationally bottlenecked, and not memory bottlenecked. In fact, our implementations of NTT in GPUs and FPGAs managed to accelerate the NTT computation so much that for all relevant sizes even the compute is no longer a bottleneck! The next bottleneck in line, faced by these implementations is the PCIe bandwidth, used for communication with the host.

Conclusions

After all - are we better off with Plonk and NTT rather than with HyperPlonk and MLE-SumCheck?

At least from the hardware point of view, NTT, while obviously not significantly parallel, is still way more hardware friendly than the MLE-SumCheck. We note that the same structure is found in FRI, for example, check out plonky2 FRI implementation. We assume that the problem with hardware acceleration of FRI was not raised before because FRIs in existing systems such as STARKs take only a small fraction of the overall computation, especially compared to the many, and much larger NTTs.

Take this note with a grain of salt!

We focused here only on the "vanilla" SumCheck and its structure and did not address the solution space at all. The hardware acceleration of SumCheck is an understudied problem, especially compared to an NTT. It is likely that the techniques to improve the hardware-friendliness of the SumCheck exist or will be invented in the future: ranging from cryptography (Fiat-Shamir plays a role here), through new variants of SumCheck, and to ideas enabling parallelism in higher levels of the protocol design (some of them presented in the HyperPlonk paper, such as Batching).

Departing thought: Have we reached a point where ZK protocols must be co-designed with hardware in mind?

Thanks to Tomer, Kobi, Suyash, Miki, Karthik, Oren and Yuval