Hardware Review: GPUs , FPGAs and Zero Knowledge Proofs

# Hardware Review: GPUs , FPGAs and Zero Knowledge Proofs # The Story So Far Zero Knowledge Proofs (ZKPs) are of growing importance for decentralized systems. Initially gaining popularity with projects such as [ZCash](https://z.cash/), they are most widely known today as a scaling solution for Ethereum. Utilizing ZKPs, Layer 2's ([Scroll](https://scroll.io/) and [Polygon](https://polygon.technology/) for example) also known as Rollups are developing [zkEVMs](https://vitalik.ca/general/2022/08/04/zkevm.html) (Zero Knowledge Ethereum Virtual Machines). These zkEVMs process a bundle of transactions and produce a single, small proof (KB in size). This proof can be used to verify to the entire network that all the transactions in this bundle are valid and correct. However, their utility doesn't stop there. ZKPs allow for all sorts of interesting applications. - Private Layer 1 - Mina for example hides the details of transactions and data on its blockchain allowing users to just prove the validity of their transactions without revealing specific information about the transaction itself. [neptune.cash](neptune.cash) and Aleo are also very interesting projects. - Decentralized Cloud - [FedML](https://www.fedml.ai/) and [together.xyz](https://www.together.xyz/) are allowing ML (Machine learning) models to be trained in a decentralized way, outsourcing computation to the network and validating the computation using ZKPs. [Druglike](https://druglike.com/) is building a more open drug discovery platform utilizing similar techniques. - Decentralized Storage - [Filecoin](https://filecoin.io/) is revolutionizing cloud storage and ZKPs are at its core, allowing storage providers to prove that they have stored the correct data over a period of time. - Gaming - More advanced versions of [Darkforest](https://zkga.me/) are likely to emerge, requiring realtime proving. ZKPs may also scale games, reduce the ability to cheat and perhaps along with Decentralized Cloud, games could pay for their own hosting fees! - Identity - Decentralized identity is also now possible with ZKP. A lot of interesting projects are being built around this idea. Some of my favorites are [notebook](https://www.notebooklabs.xyz/) and [zkemail](https://zkemail.xyz/). The implications of ZKPs and decentralized systems are tremendous. However, there are still many roadblocks and challenges. The process of generating a proof is extremely resource-intensive, requiring complex mathematical operations to be performed many times. As developers seek to build ever more ambitious and complex protocols and systems utilizing ZKP technologies, they demand larger proofs, faster performance, and better optimizations, both for proof generation and verification. In this article, we shall explore the current state of hardware acceleration and how it is going to play a key role in advancing ZKP applications. # The different types of Zero-Knowledge Proofs There are two popular zero-knowledge technologies available today zk-STARKs and zk-SNARKs (we are ignoring Bulletproofs in this article). <table> <tr> <td> </td> <td>zk-STARK </td> <td>zk-SNARK </td> </tr> <tr> <td>Complexity - Prover </td> <td>O(N * poly-log(N)) </td> <td>O(N * log(N)) </td> </tr> <tr> <td>Complexity - Verifier </td> <td>O(poly-log(N)) </td> <td>O(1) </td> </tr> <tr> <td><a href="https://docs.google.com/presentation/d/1gfB6WZMvM9mmDKofFibIgsyYShdf0RV_Y8TLz3k1Ls0/edit#slide=id.g443ebc39b4_0_110">Proof size</a> </td> <td>45KB-200KB </td> <td>~ 288 Bytes </td> </tr> <tr> <td>Post quantum resistance </td> <td>Yes (hash function based) </td> <td>No </td> </tr> <tr> <td>Trusted setup </td> <td>No </td> <td>Yes </td> </tr> <tr> <td>Zero Knowledge </td> <td>Both </td> <td>Yes </td> </tr> <tr> <td>Interactivity </td> <td>Interactive or non-interactive </td> <td>Non interactive </td> </tr> <tr> <td>Developer documentation </td> <td>Limited - but expanding </td> <td>Very well documented </td> </tr> </table> As illustrated above both have differences and tradeoffs. The most important to highlight is that the trusted setup of a zk-SNARK based system relies on the assumption that at least one honest player took part in the trusted setup process and destroyed their key. In terms of verifying zk-STARK vs zk-SNARK on chain, zk-SNARK are considerably cheaper in terms of gas. zk-SNARK proofs are also significantly smaller making it cheaper to store on chain. Currently, zk-SNARKs are more popular then zk-STARKs. Most likely because zk-SNARKs are way older. Another reason may be that Zcash one of the earliest ZK blockchains utilized zk-SNARKs so most of the development tools and documentation currently revolve around the zk-SNARK ecosystem. # What are we accelerating? Lets first understand how a developer would approach building a ZK application. They would need two components: 1. A way to express the computation in a zk friendly manner (DSL or Low level library). 2. A proving system, this system should implement two functions Prove and Verify. ### DSLs (domain-specific language) and low level libraries When it comes to DSLs there are many: [Circom](https://docs.circom.io/), [Halo2](https://zcash.github.io/halo2/), [Noir](https://medium.com/aztec-protocol/introducing-noir-the-universal-language-of-zero-knowledge-ff43f38d86d9) and many more. Today Circom is perhaps the most popular. A popular example for a low level library is [Arkworks](https://github.com/arkworks-rs); others are being developed such as [ICICLE](https://github.com/ingonyama-zk/icicle), which is a CUDA-accelerated library. These DSLs are designed to output constraint systems. Unlike a normal program which would be evaluated as `x := y * z` the same program in constraint form would look more like `x - y * z = 0`. As a result of representing programs as constraint systems we find that operations such as addition or multiplication can be represented in a single constraint. However more complex operations such as bit manipulation requires hundreds of constraints! Thus it becomes complex to implement certain hash functions as ZK friendly programs. ### Proving systems A proving system is a piece of software that fulfills two main roles: generating a proof and verifying a proof. Proving systems normally take as inputs, Circuit, witness and public / private parameters. Two of the most popular proving systems are Groth16 and PLONK (HyperPlonk, UltraPlonk). They differ from each other in many ways. <table> <tr> <td> </td> <td>Trusted Setup </td> <td>Proof size </td> <td>Prover Complexity </td> <td>Verifier Complexity </td> <td>Control over constraint optimization </td> </tr> <tr> <td>Groth16 </td> <td>Circuit specific </td> <td>small </td> <td>Low </td> <td>Low </td> <td>Low </td> </tr> <tr> <td>Plonk </td> <td>Universal </td> <td>Large </td> <td>High </td> <td>Constant </td> <td>High </td> </tr> </table> Groth16 requires a lengthy trusted setup, but provides us with faster proving and verifying times as well as a constant proof size. Plonk offers a Universal setup, each time you generate a proof a preprocessing phase will be executed for the circuit you are running. Verifying time however is constant despite supporting many circuits. Both also differ when it comes to optimizing your constraints. Groth16 gives you little control and constraint sizes grow linearly while Plonk lets you have more flexibility. \ \ Overall it's important to understand that performance is directly tied to the number of constraints your ZKP application has. The more constraints the more operations the Prover must compute. ### Bottlenecks Proof systems contain two main mathematical operations: MSM and FFT. Today we will be exploring systems where both are present, in such systems MSM may take over ~60% of runtime while FFT takes up ~ 30%. MSM requires a high number of memory accesses which in most cases consumes a significant amount of a machine's memory, as well as performing many scalar multiplications. FFT algorithms often require rearranging the input data into a specific order to perform the transform. This can require a large amount of data movement and can be a bottleneck, especially for large input sizes. Cache utilization may also be an issue if memory access patterns do not fit well into the cache hierarchy. # Beyond Software Optimizations Hardware acceleration is a must to fully optimize ZKP performance. This is reminiscent of how ASICs and GPUs took over Bitcoin and Ethereum mining. While software optimizations are extremely important and valuable they have limitations. Hardware optimizations can improve parallelism by designing FPGAs with multiple processing units for example which reduce issues with software-based parallelism, such as thread synchronization or vectorization, memory access as well can be optimized more efficiently by improving memory hierarchy and access pattern. Today in ZKP there are two major trends: GPUs and FPGAs. In the short term, GPUs have the price advantage, especially after ETH moved over to Proof Of Stake leaving massive amounts of GPUs without use. GPUs also have the advantage of short development times and provide developers with massive parallelism “out of the box” FPGAs however should catch up, especially when comparing form factor and power consumption. FPGAs also provide better latency for high speed data streams. FPGAs when clustered together also provide better performance versus clusters of GPUs. ### GPU vs FPGA #### GPU * GPUs provide a fast development time. Frameworks such as CUDA and OpenCL are all documented and popular. * GPUs are extremely “power hungry”. Even when a developer is exploiting data level parallelism and thread level parallelism, GPUs still consume a lot of power. * GPUs are readily available; consumer grade GPUs are currently cheap and readily available. #### FPGA * FPGAs have a way more complex development cycle and require more specialized engineers. FPGAs allow an engineer to implement alot of “low level” optimizations not possible with GPUs. * FPGAs provide lower latency especially when processing large data streams. * FPGAs are more expensive then GPUs and not as readily available. ### ZPrize ZPrize is one of the most important initiatives currently in the ZK space. ZPrize is a competition encouraging teams to develop hardware acceleration for key ZKP bottlenecks (MSM, NTT, Poseidon and more) on a wide range of platforms (FPGA, GPU, Mobile and browser) with the judgment criteria being latency, throughput and efficiency. The results are interesting. GPU improvements where huge for example: * MSM was improved 131% from 5.86 seconds to 2.52 seconds. On the other hand FPGA solutions often even lacked any benchmarks. Unlike the GPU results which had prior benchmarks to compare to most of the FPGA solutions where the first time such algorithms were open sourced, future improvement is expected. I think these results also display a map of where the majority of developers feel comfortable developing their hardware acceleration solutions. Many teams competed for the GPU category, but only a very small group of teams competed for the FPGA category. I'm not sure if this is due to lack of skills when it comes to developing for FPGAs or the majority of FPGA companies / teams developing in secret to sell their solutions commercially. ZPrize made available to the community some extremely [interesting research work](https://www.zprize.io/blog/announcing-zprize-results). As more rounds of ZPrize commence I think we will be seeing increasingly interesting solutions and problems arise. # Hardware acceleration in action Let's take Groth16 as an example. If we inspect the [rapidsnark](https://github.com/iden3/rapidsnark/blob/main/src/groth16.cpp#L101) implementation of Groth16. We can see that two main operations are occurring: [FFT](https://github.com/iden3/rapidsnark/blob/8b254247fd34b523c79ec1b582a4402343bc8094/src/groth16.cpp#L101) (fast fourier transform) and [MSM](https://github.com/iden3/rapidsnark/blob/8b254247fd34b523c79ec1b582a4402343bc8094/src/groth16.cpp#L171) (multi scalar multiplication) ; both these primitives are common in many proving systems. The amount of time they are executed is directly correlated with the number of constraints a circuit has, naturally as more complex circuits are written the number of constraints will grow. To visualize how dominant these operations can be let's have a look at a benchmark performed by [Ingonyama](https://ingonyama.com/) on a Groth16 circuit (Vanilla [C2](https://lotus.filecoin.io/storage-providers/get-started/tasks/#commit-2) process from filecoin executed on a 32GB sector): ![](https://i.imgur.com/slrcEtD.png) ## On MSM As seen above MSM can take up a huge chunk of runtime for many Prover’s and serves as a serious bottleneck, this makes MSM a vital primitive that needs to be accelerated. ![](https://i.imgur.com/nvPtKGX.png) [Source ](https://drive.google.com/viewerng/viewer?url=https://eprint.iacr.org/2022/999.pdf) ### What is MSM? Multi-Scalar Multiplication (MSM) is an algorithm for calculating the sum of multiple scalar multiplications. $MSM(a, G) = \sum\limits_{j=0}^{n-1} a_j G_j$ * $G_j \in G$ - points from an Elliptic Curve group. * $a_0 \ldots a_n$ - Scalars * $MSM(a,G) \in G$ - a single EC (elliptic curve point) point We can break MSM down to two main sub components * Modular Multiplication * Elliptic curve point addition ![](https://i.imgur.com/3wlgPaC.png) Both of these are addressed in Ingonyama’s [PipeMSM paper](https://drive.google.com/viewerng/viewer?url=https://eprint.iacr.org/2022/999.pdf) in detail. ### MSMs role in Groth16 MSMs role in Groth16 can be understood at a very high level quite trivially. Given: * $P$ - Program * $X$ - secret inputs * $n$ - Size of $X$ * $G_n \ldots G_{n}$ - a group of EC points derived from $P$ proving key. * $a_n \ldots a_{n}$ - these integers are produced by the prover while executing the circuit. The result of $MSM(a,G)$ is included in the proof output, meaning for a Prover to produce a valid proof he must compute MSM. ## GPU vs FPGA in terms of MSM As a case study of two leading MSM implementations, we will compare PipeMSM and Sppark, with PipeMSM implemented on FPGA and Sppark on GPU. ### PipeMSM vs Sppark [Sppark](https://github.com/supranational/sppark) and PipeMSM implement MSM using state of the art Pippenger MSM, also known as bucket algorithm; plenty of articles have been written on this subject ([example](https://hackmd.io/@drouyang/SyYwhWIso#Pippenger-Algorithm-for-Multi-Scalar-Multiplication-MSM)). Sppark is implemented using CUDA; their version is highly parallel and archives impressive results running on the latest GPUs. PipeMSMs however optimizes not only the algorithm but provides optimisations down to the mathematical primitives of Modular Multiplication and EC Addition. PipeMSM addresses the whole “MSM stack” implementing a range of mathematical tricks and optimisations aimed at making MSM more hardware friendly and a hardware design aimed at lowering latency as well as focusing heavily on parallelization. Let's quickly review PipeMSMs implementation. ### Low latency PipeMSM is designed to provide low latency when performing many MSM on large amounts of inputs. GPUs cannot provide deterministic low latency due to Dynamic frequency scaling where the GPU will adjust its clock speed based on the workload. \ \ The FPGA implementation can provide deterministic performance and latency for specific workloads, due to an optimized hardware design. ### Addition implementation EC Addition is implemented as a low latency, highly parallel and complete formula (complete means it correctly computes the sum of _any_ two points in the EC group). EC Addition is used in the EC Adder Module in a pipeline fashion to reduce latency. \ We choose to represent our coordinates as projective coordinates because fewer divisions and multiplications are involved in computation. To demonstrate this; given $P: (X_1,Y_1,Z_1)$ and $Q: (X_2,Y_2,Z_2)$: we would calculate $(X_3,Y_3,Z_3)$ as $s = (Y2 * Z1 - Y1 * Z2) / (X2 * Z1 - X1 * Z2)$ $X3 = s^2 - (X1 + X2)$ $Y3 = s * (X1 - X3) - Y1$ $Z3 = Z1 * Z2$ As seen above we perform only a single division while affine coordinates would contain three division operations. We use a [Complete formula](https://www.sciencedirect.com/science/article/pii/S0022314X85710888) for EC Addition. The main advantage here is that we don't have to handle edge cases. The complete formula; ![](https://i.imgur.com/ccrux0Q.png) We implement this addition formula in a pipeline and parallel fashion, our FPGA EC Adder circuit can perform this formula with only 12 multiplications, 17 additions, and 3 multiplications by 3. While the original requires 15 modular multiplications, 13 additions, and 4 multiplications by 3. ![](https://i.imgur.com/sYqY55h.png) *EC Adder module - a highly parallel, low latency circuit performing Addition and Subtraction* ### Modular Multiplication Implementation We utilize Barrett Reduction and Karatsuba algorithms. Barrett Reduction is a technique to efficiently compute $r &\equiv ab \mod n$ when $n$ is fixed. Barrett Reduction attempts to replace division (an expensive operation) by approximating division. An error tolerance can be chosen to describe a range for which we are looking for a correct result. We found that a large error tolerance allows for a smaller reduction constant to be used; this reduces the size of the intermediate values used in the modular reduction operation, which in turn reduces the number of multiplications required to compute the final result. In the blue square below we can see that due to our high error tolerance we must perform multiple checks to find the accurate result. ![](https://i.imgur.com/IWtJh9a.png) In short, the Karatsuba algorithm is used to optimize the multiplications we perform in Barrett Reduction. Karatsuba reduces the multiplication of two n-digit numbers to three multiplications of n/2-digit numbers; these multiplications can be reduced to the level where they are narrow enough to be computed by the hardware directly. ### MSM Using the primitives described above we developed a low latency highly parallel implementation of MSM. ![](https://i.imgur.com/cgsp7kx.jpg) I will provide a brief explanation (for a complete explanation please refer [here](https://eprint.iacr.org/2022/999.pdf)) of how this system works. Given $MSM(a, G) = \sum\limits_{j=0}^{n-1} a_j G_j$ we reduce it to; ![](https://i.imgur.com/YDCuG9h.png) *for exact steps refer [here](https://eprint.iacr.org/2022/999)* Essentially $K$ is $a_j$ split into chunks / partitions. These smaller chunks can be used to calculate partial sums of $G$ using an efficient algorithm to calculate $G^[k]$. So instead of calculating the whole MSM at once, we pass each $G^[k]$ into the EC Adder in parallel. The results from the EC Adder are accumulated into the final MSM. ### Results ![](https://i.imgur.com/JB0s1P4.png) Above we see a comparison between Sppark and PipeMSM. \ \ What stands out most is the significant power savings FPGAs offer, this could be extremely important in the future for large scale Proof generation operations. \ We should also note that our implementation was prototyped and benchmarked @125Mhz increasing clock speed to @500Mhz could boost calculation times by up to 4X or more. Another advantage of using our [FPGAs](https://github.com/ingonyama-zk/cloud-ZK) is that they can be fit into small form factor servers as they consume less energy and produce less heat. We are in the early days of FPGA engineering for ZKP, further improvements to algorithms should be expected. Additionally it's important to note that while the FPGA is calculating MSM, the CPU is idle, it may be possible to achieve even faster times by combining the FPGA with system resources (CPU/GPU). # ZPUs Zero Knowledge Processing Units are likely to arise as a result of the growing demand for ZKP solutions. As we start seeing ZKP integrated into databases, applications, gaming, IoT, and as we see blockchain adoption becoming more widespread, the demand for solutions more efficient than FPGAs and GPUs shall arise. ZPUs similar to ASICs once produced will outperform all existing solutions in terms of power and computation. Unlike FPGAs and GPUs they require event more resources and research to produce since once manufactured they are not possible to modify. Perhaps in the future we may even see ZPUs being intergrated into consumer devices similar to Apples Neural Engine in its mac devices. ## Conclusion Zero knowledge proofs (ZKP) have emerged as a crucial technology for decentralized systems. ZKPs are going to expand way beyond just scaling Etheruem but rather allow for outsourcing of compute to untrusted 3rd parties, allow the development of systems impossible before such as decentralized cloud computing, identity systems and much more. Optimizations for ZKP have traditionally focused on software-level improvements, but as demand for better performance grows we will be seeing more advanced acceleration solutions involving both hardware and software. Currently we will mostly see GPUs being used, due to short development time using CUDA and the fact that consumer grade GPUs are currently very cheap and abundant. GPUs also offer incredible performance. GPUs are not likely to vanish from this landscape anytime soon. As more sophisticated development teams enter the space we will most likely witness FPGAs taking the lead in terms of power efficiency and performance. FPGAs provide more lower level customization then GPUs as well as more configuration options. FPGAs have faster development speed and flexibility than ASICS. With the world of ZKPs evolving at rapid speed FPGAs can be re-flashed with different programs for different systems and update solutions. However to produce near real time proofs will likely require a combination of GPU/FPGA/ASICS configuration together depending on the system we are generating proofs for. ZPUs are also likely to evolve providing large scale proof generators and low power devices with an extremely efficient solution (I will write about this in an upcoming article).