The Cysic team would like to thank Luke Pearson and Binyi Chen for valuable feedback on this post.
Hyperplonk is an adaptation of the Plonk proof system to the Boolean hypercube that aims to eliminate the need for computing fast Fourier transforms (FFT) during proof generation. It also supports custom gates of higher degree without increasing the proof generation time.
In this article, we will analyze the performance of Hyperplonk from a hardware perspective, focusing on the pros and cons of its three computing components: multi-scalar multiplication (MSM), multi-linear extension (MLE), and sumcheck. We will also propose our design for these components and potential improvements to Hyperplonk for better performance on customized hardware.
Hyperplonk is a type of succinct non-interactive argument of knowledge (SNARK). One of the most important features of SNARKs is succinctness, which means that the size of the proof depends logarithmically on the size of the circuit being proved, and the verification time is proportional to the length of the statement plus the logarithmic size of the circuit. There are two primary components in constructing SNARKs: an interactive oracle proof (IOP) and a cryptographic commitment scheme.
The common paradigm for obtaining a SNARK from an IOP and commitment scheme is to compile the proof in the IOP using a commitment scheme.
There are three categories of SNARKs, based on the flavor of the IOP and commitment scheme used: those built from univariate polynomial IOPs, those built from multilinear polynomial IOPs, and those built from vector IOPs paired with vector commitments.
Hyperplonk belongs to the second category, while Plonk belongs to the first one. It is also possible to add zero-knowledgeness to SNARKs to obtain zkSNARKs, but for simplicity, we will not analyze the zk-Hyperplonk scheme in this article.
The key difference between Hyperplonk and Plonk is that Hyperplonk works on the Boolean hypercube, which can be represented as a bit string. Hyperplonk has two advantages: it eliminates the need for FFTs and can support high-degree customized gates in a more efficient way.
Let C be a circuit with a length-ℓ public input and n gates, where each gate can be an addition, a multiplication, or a custom gate G. The arithmetization of Plonk converts the circuit into a sequence of (left input, right input, output) triples, i.e., a matrix M, which represents the execution trace of the circuit. After committing to the matrix M, the Plonk prover needs to convince the verifier that the committed M satisfies the gate and wiring identities. Hyperplonk follows the paradigm of Plonk, replacing the FFTs in the ZeroCheck part with SumCheck, as shown in Figure 1:
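To make the gate identity concrete, here is a toy row check in Rust over a small prime field. The selector layout, the names (`Row`, `gate_ok`), and the 61-bit field are our illustrative choices, not Plonk's actual constraint system:

```rust
// Toy Plonk-style arithmetization check: each row of the execution-trace
// matrix M records a gate's two inputs and its output (a, b, c), and
// selector values (q_add, q_mul) encode the gate type. The identity the
// prover must establish for every row is
//     q_add * (a + b - c) + q_mul * (a * b - c) = 0   (mod P).
const P: u64 = 2305843009213693951; // 2^61 - 1, a toy prime field

#[derive(Clone, Copy)]
struct Row { q_add: u64, q_mul: u64, a: u64, b: u64, c: u64 }

fn gate_ok(r: Row) -> bool {
    let add = (r.a as u128 + r.b as u128) % P as u128;
    let mul = (r.a as u128 * r.b as u128) % P as u128;
    // Each term is selector * (gate output - claimed output), mod P.
    let t1 = r.q_add as u128 * ((add + P as u128 - r.c as u128) % P as u128) % P as u128;
    let t2 = r.q_mul as u128 * ((mul + P as u128 - r.c as u128) % P as u128) % P as u128;
    (t1 + t2) % P as u128 == 0
}
```

A custom gate G would simply contribute one more selector term to the same per-row identity.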
The current implementation of Hyperplonk is based on Arkworks, which is written in Rust. The following figure illustrates the dependency of components in the Hyperplonk system, where the blocks in gray were implemented in the Hyperplonk paper:
Based on the benchmark of the code on large constraint systems (Section 6.2 of the Hyperplonk paper), there are three major computing components:
In the following sections, we will delve into the details of these three components to analyze their hardware-friendliness.
The MSM component is usually the most time-consuming part of elliptic-curve-based SNARKs. Its goal is to compute k_1·P_1 + k_2·P_2 + … + k_n·P_n, where n is the size of the MSM, each k_i is a 256-bit scalar, and each P_i is a point on the curve. The product k_i·P_i represents the point scalar multiplication (PMULT) of k_i and P_i. The Pippenger algorithm is widely used to solve the MSM problem; it relies on basic arithmetic operations such as point addition (PADD) and point doubling (PDBL). Due to the inherent parallelism of MSM, it is quite friendly to GPU and hardware acceleration. The main difficulty lies in designing a highly efficient modular multiplication component, which is used throughout the proof generation process.
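As a sketch, the bucket (Pippenger) method can be illustrated in Rust over a toy additive group, with integers mod a prime standing in for curve points so that PADD is just modular addition. The window width, modulus, and names are our illustrative choices:

```rust
const P: u64 = 2305843009213693951; // 2^61 - 1, a toy prime modulus
const C: u32 = 4;                   // window (scalar slice) width in bits

// Stand-in for elliptic-curve point addition (PADD).
fn padd(a: u64, b: u64) -> u64 { ((a as u128 + b as u128) % P as u128) as u64 }

/// Compute sum_i k_i * G_i with the bucket (Pippenger) method.
fn msm(scalars: &[u64], points: &[u64]) -> u64 {
    let windows = (64 + C - 1) / C;
    let mut acc = 0u64;
    // Process scalar windows from most to least significant.
    for w in (0..windows).rev() {
        // Shift the running total left by C bits (C point doublings, PDBL).
        for _ in 0..C { acc = padd(acc, acc); }
        // Drop each point into the bucket named by its C-bit scalar slice.
        let mut buckets = vec![0u64; 1usize << C];
        for (k, g) in scalars.iter().zip(points) {
            let slice = ((*k >> (w * C)) & ((1u64 << C) - 1)) as usize;
            if slice != 0 { buckets[slice] = padd(buckets[slice], *g); }
        }
        // Weighted sum over buckets, sum_j j * buckets[j], via suffix sums.
        let (mut run, mut win) = (0u64, 0u64);
        for j in (1..(1usize << C)).rev() {
            run = padd(run, buckets[j]);
            win = padd(win, run);
        }
        acc = padd(acc, win);
    }
    acc
}
```

The bucket accumulation loop is exactly the part that maps onto the accumulate table in the FPGA design described below.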
Our FPGA-MSM design is a fully pipelined architecture, which means that in every cycle, the processing element is able to finish a point addition operation. The performance of a fully pipelined architecture is usually bounded by the bandwidth of PCIe. There are three modules in the architecture: the EC point adder, the MSM accumulate table, and the ALU input queue, as shown below:
The input to the MSM component consists of the projective coordinates of a point (whose length depends on the specific curve being used) and a c-bit slice of the scalar. The sectional scalar is first checked against the accumulate table to see if the table already holds a partial sum for that slice value. If it does, that point is read out and referred to as the old sum. The new point and the old sum are then sent to the MSM EC point adder to compute their addition, with the result being stored back into the accumulate table.
The EC point adder has a very deep pipeline in the circuit, and a related tricky problem is handling read-access to the same table entry before the prior result is written back. A naive approach will introduce stalls and result in suboptimal performance. To address this issue, we implemented dynamic reorder logic using the associative law of point-addition to guarantee that all read-after-write dependencies are correctly handled.
We announced our initial result for the MSM based on a single Xilinx VU13P at the Berkeley ZKP workshop, and now are working on optimizing this part, as well as a highly customized FPGA server. We will announce the performance of MSM on this server soon.
The task of MLE is to compute the evaluation of a multilinear polynomial modulo a prime number. The MLE task is parameterized by two things: the number of variables n and the prime number p. The input to MLE consists of a sequence of coefficients c_0, …, c_{2^n - 1} and a sequence of values x_1, …, x_n, where each number is 256 bits long. The MLE computes the following polynomial:

f(x_1, …, x_n) = Σ_{b ∈ {0,1}^n} c_b · x^b mod p,

where x^b denotes the monomial x_1^{b_1} · x_2^{b_2} ⋯ x_n^{b_n}, with b read off from the binary expansion of the coefficient index. For instance, let n = 2; the polynomial to be computed is

f(x_1, x_2) = c_0 + c_1·x_1 + c_2·x_2 + c_3·x_1·x_2 mod p.
If we compute the polynomial directly, we need to deal with 2^n terms, each containing a coefficient and up to n variables, which results in approximately n·2^n multiplications. However, upon closer inspection, we realize that we can merge terms to reduce the number of required multiplications. We can then divide the computation into n layers of multiplication and n layers of addition, which are interleaved. This reduces the number of needed multiplications to roughly 2^n, as shown in the following figure:
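The layered evaluation can be sketched as follows, with a toy 61-bit prime standing in for the 256-bit field and the function names being our own. Each layer folds adjacent coefficient pairs with one variable, so the total multiplication count is 2^{n-1} + 2^{n-2} + … + 1 = 2^n − 1:

```rust
const P: u64 = 2305843009213693951; // toy 61-bit prime field

fn add(a: u64, b: u64) -> u64 { ((a as u128 + b as u128) % P as u128) as u64 }
fn mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % P as u128) as u64 }

/// Evaluate f(x) = sum_{b in {0,1}^n} c_b * x^b layer by layer.
fn mle_eval(coeffs: &[u64], xs: &[u64]) -> u64 {
    assert_eq!(coeffs.len(), 1 << xs.len());
    let mut layer = coeffs.to_vec();
    // Fold one variable per layer: (c_even, c_odd) -> c_even + c_odd * x_i.
    for &x in xs {
        let next: Vec<u64> = layer
            .chunks(2)
            .map(|pair| add(pair[0], mul(pair[1], x)))
            .collect();
        layer = next;
    }
    layer[0] // after n layers a single value remains
}
```

For n = 2 this computes (c_0 + c_1·x_1) + (c_2 + c_3·x_1)·x_2, which is the merged form of c_0 + c_1·x_1 + c_2·x_2 + c_3·x_1·x_2.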
This is a rough outline of how we can use a CPU to compute this polynomial, with each thread computing a single MLE. One disadvantage of this design is the large amount of memory required to store intermediate values. If we try to use this design in hardware acceleration, we will face two problems:
Our improved design builds on the layered CPU implementation. We noticed that subsequent layers do not need to wait for previous layers to finish; instead, all layers can work simultaneously, connected by queues. The basic module computes a + b·x_i for a pair of inputs (a, b), using the merged form as shown in the following diagram:
In our initial MLE design, the coefficient vector is read in as a stream and placed in queue Q_1. Q_1 is connected to a basic module that takes two coefficients from Q_1 and computes a + b·x_1. The result is placed in queue Q_2. Similarly, Q_2 is connected to another basic module that performs the same computation with x_2, and its result is placed in queue Q_3. This process is repeated until the n-th queue: Q_n is connected to a basic module that computes a + b·x_n and outputs the final result.
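This queue-connected initial design can be modeled in software with one thread per basic module and channels standing in for the hardware FIFOs. This is only a behavioral sketch under our own naming, not the FPGA implementation:

```rust
use std::sync::mpsc;
use std::thread;

const P: u64 = 2305843009213693951; // toy 61-bit prime field

// The basic module's operation: a + b * x (mod P).
fn fold(a: u64, b: u64, x: u64) -> u64 {
    ((a as u128 + b as u128 * x as u128 % P as u128) % P as u128) as u64
}

fn mle_pipeline(coeffs: Vec<u64>, xs: Vec<u64>) -> u64 {
    // Q_1 is fed with the coefficient stream.
    let (tx, mut rx) = mpsc::channel::<u64>();
    for c in coeffs { tx.send(c).unwrap(); }
    drop(tx);
    // One basic module (thread) per variable, each draining its input
    // queue in pairs and pushing the folded stream into the next queue.
    let mut handles = Vec::new();
    for x in xs {
        let (next_tx, next_rx) = mpsc::channel::<u64>();
        let input = rx;
        handles.push(thread::spawn(move || {
            while let (Ok(a), Ok(b)) = (input.recv(), input.recv()) {
                next_tx.send(fold(a, b, x)).unwrap();
            }
        }));
        rx = next_rx;
    }
    // Q_n's module emits the single final value.
    let out = rx.recv().unwrap();
    for h in handles { h.join().unwrap(); }
    out
}
```

All stages run concurrently: a later stage starts consuming as soon as the earlier stage has produced its first pair, mirroring how the hardware layers overlap.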
Our final design is based on this initial concept, but with several improvements, such as multiplexing the components above instead of duplicating them. We have expanded it into a fully pipelined architecture that can accept a 256-bit input every cycle, similar to our MSM module. The MLE module enjoys the same hardware-friendliness as the MSM module, but its performance is likewise limited by PCIe bandwidth.
Essentially, the sumcheck protocol between the prover and verifier is round-dependent, meaning the information exchanged in the current round depends on that of previous rounds. The sumcheck protocol can be made non-interactive using the Fiat-Shamir heuristic, which involves using a hash function to compute the challenge based on the transcript. Omer Shlomovits wrote an excellent post discussing the hardware friendliness of the sumcheck module, and we agree with the arguments presented there. Given the serial nature of the sumcheck protocol and the hardware-resistant nature of hash functions, it seems difficult to design a hardware acceleration solution that offers significant advantages over CPU/GPU approaches, though we were still inspired by Justin Thaler's tweet about possible hardware optimizations for sumcheck.
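The round dependency can be made concrete with a minimal sumcheck sketch for a multilinear polynomial given by its evaluation table over {0,1}^n. Each round the prover sends the degree-1 polynomial as the pair (s(0), s(1)); the verifier checks s(0) + s(1) against the running claim and replies with a challenge r, after which the table is folded at X = r. The explicit challenge list and toy field here are illustrative; in practice each r_i comes from a Fiat-Shamir hash of the transcript, which is exactly what makes the rounds serial:

```rust
const P: u64 = 2305843009213693951; // toy 61-bit prime field

fn add(a: u64, b: u64) -> u64 { ((a as u128 + b as u128) % P as u128) as u64 }
fn mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % P as u128) as u64 }
fn sub(a: u64, b: u64) -> u64 { add(a, P - b) }

/// Run all rounds of sumcheck; returns true if every round's check passes.
fn sumcheck(mut evals: Vec<u64>, challenges: &[u64]) -> bool {
    // The claimed sum over the whole hypercube.
    let mut claim = evals.iter().fold(0, |acc, &v| add(acc, v));
    for &r in challenges {
        let half = evals.len() / 2;
        // s(0) and s(1): sums over the half-cubes with x_i = 0 and x_i = 1.
        let s0 = evals[..half].iter().fold(0, |acc, &v| add(acc, v));
        let s1 = evals[half..].iter().fold(0, |acc, &v| add(acc, v));
        if add(s0, s1) != claim { return false; }
        // Fold the table at X = r: g(r, b) = (1 - r) g(0, b) + r g(1, b).
        let next: Vec<u64> = (0..half)
            .map(|j| add(mul(sub(1, r), evals[j]), mul(r, evals[half + j])))
            .collect();
        evals = next;
        // New claim: s(r) = s(0) + r * (s(1) - s(0)), since s has degree 1.
        claim = add(s0, mul(r, sub(s1, s0)));
    }
    claim == evals[0] // final check against g(r_1, ..., r_n)
}
```

Because round i's fold cannot start until r_i is known, and r_i depends on a hash of round i's message, the hardware cannot stream across rounds the way the MSM and MLE pipelines do.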
One potential improvement is to reduce the number of required sumchecks in Hyperplonk. In particular, the paper already mentions (as does Benedikt Bünz's tweet) that the number of sumchecks can be reduced by batching MLEs. Using this technique, the number of sumchecks is always two, no matter how complex the custom gate is; thus it is much smaller than the number of MLEs.
Compared to FFT-based SNARK systems, the Hyperplonk proof system as it currently stands does not offer significant hardware friendliness. Based on our discussion, the sumcheck module is less hardware-friendly than the MSM and MLE modules. Novel techniques, either algorithmic or hardware-based, are needed to improve the performance of sumcheck on hardware.
As Omer Shlomovits also mentioned in his post, an alternative approach is to adopt a hardware mindset in the design of SNARKs. We tend to agree with this point, not because we are a hardware company, but due to the performance demands of the community. For instance, the hardware-friendliness of Hyperplonk can be improved using batching. However, is there a more hardware-friendly variant of Hyperplonk? Stay tuned – we may have a solution!