# Review of zero knowledge proof systems in FPGA I review some of the open-source submissions to the ZPrize 2022 competition[^zprize][^zkpod] to get the gist of where we are with zero knowledge (ZK) cryptography acceleration on FPGAs. ## Preliminaries Cryptography is the science (or art, or both?) of conveying any kind of information securely over an untrusted medium. Over the years more and more kinds of secure communication were established. ZK proofs allow to communicate securely with higher degrees of privacy than before thus paving the way to new modes of interaction and new business models. Examples of applications of ZK proofs are - proving a statement about some data while keeping that data private - a person has at least $x$ in their bank account - identity match without revealing the person's DNA, passport data, etc. - anonymous authorisation - passwordless access to a website - prove that an entity is not resident in blacklisted countries without revealing the actual residence - anonymous payments - shielding the payer and the payee for all but the tax authority - making purchases on a blockchain without everyone being able to witness that on the ledger - trusted compute - outsource an expensive computation and validate that the result was computed correctly - on a blockchain, not requiring the validators to redo the computation made by a block producer in order to verify the block A zero knowledge proof is a communication protocol between parties called a prover and a verifier. The prover $P$ aims to convince the verifier $V$ that a statement $S$ of interest is true without revealing any other information about the statement except its validity. An interesting feature of validity of ZK proofs is that the proof convinces the verifier up to some probability which has to be essentially 1 for a protocol to be *complete*. If the prover doesn't have a *witness* $w$ to the validity of $S$ and yet tries to cheat and make the verifier accept $S$, the verifier should reject such a proof with very high probability. This is known as *knowledge soundness*. ## ZPrize 2022 FPGA categories Competitions started in June 2022 and finished in September 2022. The prize format allowed blockchain projects Aleo, Jump Crypto, Polygon, Trapdoor Tech and ZK Validator to have access to a diverse group of FPGA engineers who were incentivised to produce high-quality, high-performance designs which we can now all enjoy[^zprize-repo]. ### Multi-scalar multiplication By looking at the results achieved by the finalists, it appears that MSM was a challenge to accelerate above the provided baseline. In both cases the improvement is one order of magnitude compared to $\mathtt{arkworks}$ software which takes about 70 sec to compute the same MSM problem on an Intel Xeon Platinum 8380 CPU at 2.30 GHz. ![](https://i.imgur.com/Za2yIjk.png) ZK-SNARKs[^how-zk-snarks] are a kind of ZK proofs that have small proof size and fast verification time. However this is achieved at a trade-off of extra compexity in the prover which becomes a bottleneck in the protocol. Every ZK-SNARK prover has to efficiently evaluate polynomials at points on behalf of the verifier. The cryptographic primitive employed in proving and verifying polynomial relations is a Polynomial Commitment (PC) scheme. PC allows to bind the prover to a polynomial $P$ so that it can prove the evaluation $P(k)$ at any field point $k$ without revealing the value of $P$ at other points. ZK-SNARKs Plonk, Marlin, Groth16, BulletProofs, Supersonic and others all use various PC schemes that spend a significant amount of time computing multi-scalar multiplications (MSMs), estimated between 70% and 90% of the total prover time[^pipe-msm]. MSM is defined as (in the logarithmic form which expresses exponentiations as multiplications and multiplications as additions) $$ \mathrm{MSM}(x, G) = \sum_{j=0}^{n-1} x_j G_j $$ where $G_j$ are points on an elliptic curve group $\mathcal{G}$ over a prime field $\mathbb{F}_q$ and $x_j$ are scalars in $\mathbb{F}_{|\mathcal{G}|}$. MSM can be reduced to a decades-old multi-product problem (MPP) which has been found to have an asymptotically optimal algorithm by Pippenger. MPP asks, for a sequence of elements $t_0, \dots, t_{n-1}$, to compute the following sequence of products using as few multiplications as possible: $$ y_i = \prod_{j=0}^{n-1} t_j^{x_{i,j}} $$ where $i = 0, \dots, k - 1$ and $x_{i,j} \in \mathbb{Z}_r$ for $r \geq 2$. Implementing MSM on FPGA requires working with efficient fundamental primitives: - modular multiplication - elliptic curve addition For the general case of a finite field, modular reduction needs to work with arbitrary primes, even though it is conceivable to optimise reduction further for specific classes of primes. These general methods include Barrett reduction, Montgomery reduction and methods based on lookup tables. #### Hardcaml team Hardcaml came with their implementation of an optimised Pippengers MSM algorithm for the AWS F1 instance specialised on the BLS12-377 G1 elliptic curve. Rather than implementing the entire Pippenger's algorithm, the solution focuses on high thoughput and only implements point additions. The adder is fully pipelined, accepts inputs on every clock cycle and takes 238 cycles to compute the result, which amounts to 850 ns at 280 MHz. To reduce FPGA resource requirements, the solution transtorms coordinates from affine points on a Weierstrass curve to extended projective points on an equivalent twisted Edwards curve. This removes some modulo additins and a modulo multiplication by a constant. The PCIe latency and post-processing on the host are mitigated by starting MSM operations on the card while points are streamed from the host. When the host receives and starts processing the result, the card processes the next batch of points. Instead of performing a full Barrett reduction or full modular addition/subtraction, only a coarse reduction is performed, allowing additive error to accumulate through the point adder. This error is corrected using BRAMs as ROMs to store coefficients that can be used to reduce values to their modular equivalents. The high-level system architecture looks as follows ![](https://hackmd.io/_uploads/BkXKtkXNh.png) The design based on the AWS shell requires Xilinx Runtime environment (XRT) which is a likely factor limiting the kernel frequency to 280 MHz. The main processing element is written in Ocaml using the Jane Street Hardcaml library which has a Verilog backend and hooks into Xilinx proprietary IP cores. This benefits code readability and software simulation. #### Team Jump Crypto This implementation of MSM is again specialised on BLS12-377 and optimized to compute a fixed-base MSM for uniform random (UR) scalars, with lowest latency. The MSM for $N = 2^{26}$ UR scalars is computed in 5.58 sec. The novelty of this architecture is a fully pipelined curve adder that runs (and is saturated) at 250 MHz and scheduler to reorder operations and maximise the utilisation of the curve adder. The processing element consists of the following components: ![](https://hackmd.io/_uploads/SJMMtl7E3.png) Curve Adder computes one or multiple elliptic curve addition operators. Multiple adders may be used in different phases of the bucket algorithm, for example to compute in different coordinate systems. MSM Controller handles the workflow of the MSM. MSM Init initialises the system, stores points, optionally performs pre-computaiton on points, e.g. conversion in Twisted Edwards coordinates. The MSM component computes the actual MSM, given $N$ scalars. The bucket algorithm is implemented as a number of iterations of bucket accumulation, then bucket aggregation, and final result aggregation phases. Each phase relies on a Curve Adder for the actual computation. Bucket accumulation is driven by a Scheduler to maximise the performance. The FPGA architecture is as follows: ![](https://hackmd.io/_uploads/H1dIse7V2.png) The implementation is in SystemVerilog on the FPGA and Rust on the host. One might wonder if it would beat the team Hardcaml timing if the processing element were scrutinised a bit further to increase the clock frequency by a mere 30 MHz to the same 280 MHz. ### Number-theoretic transform Results in this category ranged broadly. The baseline implementation could not be used at least for benchmarking purposes. ![](https://i.imgur.com/ROwDM2E.png) Number-theoretic transformations (NTT) are essential building blocks for zero-knowledge proof protocols based on error-correcting codes (ECC), such as STARK and Plonky2. This prize focused on using AMD-Xilinx Alveo u55n (*aka* Varium C1100 Blockchain Accelerator Card) to accelerate NTT in Polygon Zero’s Goldilocks field. The NTT is a generalisation of the discrete Fourier transform to an integer quotient ring $\mathbb{Z}_Q$ instead of a complex number field. It is required that $Q$ is a NTT-friendly prime, meaning that $\mathbb{Z}_Q$ has an $N$-th root of unity for an NTT of size $N$. The NTT is a useful tool as it allows the usage of a range of well-known mathematical tricks that also apply to the Fourier domain. Particularly of interest for ECC is the property that a discrete convolution of two polynomials is equivalent to a pointwise multiplication in the NTT domain. This is interesting because the time complexity of convolution is $O(n^2)$ while the complexity of a pointwise multiplication is only $O(n)$. However the polynomials also need to be transformed to and from the NTT domain, which has time complexity $O(n \log_2 n)$. This is illustrated by the diagram below: ![](https://hackmd.io/_uploads/BJQJfGmV2.png) A brief overview of the forward NTT using the Cooley-Tukey Butterfly (a division pattern of the decimation process into layers) and the inverse NTT using the Gentleman-Sande Butterfly can be found in Kris Keersmaekers' MSc thesis[^ntru-ntt-hw]. These were the algorithms used by the winning team Supranational as well as Hardcaml, COSIC and Jump Crypto. #### Hardcaml team Let's have a look at the Hardcaml contribution because it has a straightforward design. It performs a single transform size configured at build time. For the ZPrize competition the transform size is $2^{24}$. The design is based around the 4-step algorithm which decomposes the full $2^{24}$ NTT into multiple $2^{12}$ NTTs across columns and rows of a $2^{12} \times 2^{12}$ matrix. Here is a summary of this algorithm: 1. Lay out the input of size $2^{24}$ as a $2^{12} \times 2^{12}$ matrix in row-major order. 2. Perform a length $2^{12}$ NTT along all columns and write the results back in place. 3. Multiply the matrix by $[x^{i * j}]$, where $[x]$ is the N-th root of unity of the underling field, and $N = 2^{24}$. 4. Perform a length $2^{12}$ NTT along all rows and write the results back in place. 5. Transpose the matrix. The overall complexity is roughly equivalent to a single $2^{24}$ INNT, although an extra twiddle factor correction pass at step 3 is required between the column and row phases. On-chip memory usage is drastically reduced, and it becomes possible to implement multiple smaller INNT cores for improved performance through parallelism. The implementation has a very simple architecture with only two types of cores: - Hardcaml NTT processing element (PE) in Ocaml - PCIe and HBM controller core in Vitis HLS C++ There can be at least 8 PEs accessing exclusive regions on HBM. The number of PEs should be a power of 2. There are multiple versions of the PEs in the submissions repo, with clock frequencies ranging from 200 to 300 MHz. #### Supranational team The winning team implemented their design in SystemVerilog and were able to reach 500 MHz clock frequency, in which case the feed and sink rates are both 125 GB/sec. At a peak bandwidth of 460 GB/sec, the HBM memory is the only available resource capable of supporting the combined feed and sink bandwidth need of 250 GB/sec while also providing enough storage for $2^{24}$ points. However achieving a 460 GB/sec HBM bandwidth requires extra thought. The HBM memory subsystem on the u55n FPGA is split into 32 channels, each capable of a $1/32$ of 460 GB/sec or about 14 GB/sec. The only way to achieve the 250 GB/sec need is to distribute the $2^{24}$ points across the 32 channels. Apart from the higher clock frequency, other winning factors were the pipelined design of the Butterfly and the modularity of having a separate Twiddle Factor Generator within the NTT core. ### Proof of succinct work (PoSW) This was the ZPrize category that received no submissions. PoSW is a novel work-based consensus algorithm for blockchains where the work that is proven is the generation of a SNARK proof. Miners compete to provide a valid solution to the PoSW puzzle by repeatedly generating SNARK proofs until they satisfy a given difficulty level provided by the protocol. If the PoSW prize recurs next time ZPrize is held, it may receive high-quality submissions since there might be teams that worked on it after since the 2022 competition. ## Closing words It is a joy to see such a collection of open-source marvel which is ZPrize 2022. The submissions are designed, implemented and documented to a high standard. It leaves no doubt that these implementations will be a common point of reference and a source of inspiration in the coming years. AMD-Xilinx are working on a new FPGA accelerator card targeted in particular at ZK cryptography. Among the likely parameters of that card are PCIe 5 and 32 GB HBM. If this card becomes available for the next ZPrize, we may see a prize category dedicated specifically to it. For example, PoSW can benefit from the large HBM which may help increasing the amount of SNARK proofs that are generated in parallel. If the space requirements of the PCIe 5 core are reasonable, this can allow to at least double the theoretical DMA throughput compared to the current designs that commonly work on 16-lane PCIe 3, which will in turn lead to lower latencies and new possible design choices. ## References [^zprize]: https://www.zprize.io/blog/announcing-zprize-results [^zkpod]: https://zeroknowledge.fm/266-2/ [^how-zk-snarks]: https://arxiv.org/abs/1906.07221 [^pipe-msm]: https://eprint.iacr.org/2022/999 [^zprize-repo]: https://github.com/z-prize/2022-entries [^ntru-ntt-hw]: https://github.com/KULeuven-COSIC/NTRU_NTT_HW