ZK Hardware Acceleration: The Past, the Present and the Future

# ZK Hardware Acceleration: The Past, the Present and the Future *By Luke Pearson and the Cysic team* TL;DR: - In hardware accelerating ZKP, FPGA has the same performance-per-watt level as GPU, but cannot compete with GPU on the performance-per-dollar metric. - ASIC outperforms FPGA and GPU on the above two metrics, but takes longer to get to the market. ## Introduction The significance of zero-knowledge proofs (ZKP) has grown exponentially in recent years, emerging as one of the most crucial innovations in computer science over the past half-century. This can be attributed to the fact that ZKPs have the potential to dramatically enhance the scalability of blockchain platforms such as Ethereum. A key aspect of ZKPs is their ability to significantly increase the transactions per second (tps) on various blockchain platforms, relying solely on mathematical principles rather than trust. By enabling validators to consolidate multiple transactions into a single, concise proof, ZKPs ensure both accuracy and integrity throughout the process. ZKPs offer many other features that make them essential components of various scaling and privacy solutions, including ZK aggregations like StarkNet, private ZK aggregations like Aztec, and Layer 1 chains like Mina, Filecoin, Manta and Aleo. Nonetheless, ZKP is not without its limitations, as the process of generating proofs can be extremely resource-intensive in terms of both time and energy. The creation of proofs is often slowed down by the need for numerous complex mathematical operations, such as exponentiations, inversions, and bilinear pairing computations. Consequently, it remains a challenge to optimize ZKP solutions in order to fully harness their potential. To overcome these issues for all proposed ZKP constructions, it is of utmost importance to develop hardware acceleration methods. Namely, they can be accelerated 10-1000 times through the use of specialized hardware such as Field Programmable Gate Arrays (FPGAs) and Application Specific Integrated Circuits (ASICs). In this article, we will provide an overview of the computational challenges associated with ZKPs, followed by a discussion of potential improvements that could help address these issues and enhance the efficiency of these cryptographic techniques ### Zero-Knowledge Proofs: zkSNARKs and zkSTARKs zkSNARK (Zero-Knowledge Succinct Non-Interactive ARgument of Knowledge) scheme is a type of ZKP that allows a prover to convince a verifier that the prover knows a witness without revealing any information about that witness. The scheme consists of four algorithms: Setup, KeyGen, Prove and Verify. The setup algorithm produces some structured reference string, which will be used by the KeyGen algorithm to generate a proving key and a verification key for some specified circuit. The prover, with the proving key and the statement/witness pair, can generate a ZK proof with respect to the relation of the statement/witness pair through the specified circuit. The verifier can check the validness of ZK proofs using the verification key and the statement. A zkSNARK scheme needs to satisfy the following features: - Completeness: If the prover possesses the witness, they can produce a valid proof, and the verifier will always accept it. - Zero-knowledgeness: The prover can provide evidence that it possesses the secret witness, without disclosing any information about it to the verifier or anyone else. - Soundness: Creating a valid proof without requiring a witness is not possible for a dishonest prover. The other variant is called zkSTARK (Zero-Knowledge Scalable and Transparent ARgument of Knowledge). It can function with or without zero knowledge, and can be either interactive or non-interactive protocols. They can also substitute zkSNARK as an interactive protocol. Unlike zkSNARK, interactive proofs do not need a trusted setup phase, making STARK systems a better option. STARK systems have overcome this disadvantage by utilizing the Interactive Oracle Proofs (IOP) domain, which relies on Fast Reed-Solomon codes to improve scalability, leading to the development of zkSTARK proof systems. Furthermore, zkSTARK based systems traditionally work with logarithmic complexity for both the prover and verifier, making them efficient and preventing one party from performing a Denial of Service (DoS) attack since each side's computations take a similar amount of time. Another notable property of zkSTARK is the conjectured post-quantum security, while zkSNARK is not due to the Shor's quantum algorithm. The zkSTARK system offers post-quantum security, provided that the hash function employed within the framework is itself resistant to quantum computing attacks. In the realm of blockchain technology, the abbreviations SNARK and STARK are often used in place of zkSNARK and zkSTARK, as either when privacy is not implemented or required for differing blockchain use cases. Consequently, throughout this article, we will also adopt the simplified term "zk" in our discussions and explanations. There are two important use cases of ZKPs: - Outsourcing verifiable computation: Suppose we need to do a computation that is either expensive or impossible to run due to the underlying platform constraints (e.g., a smart phone, a Raspberry Pi, a laptop, or even a blockchain network such as Ethereum). In such cases, we may have to run the computation on a third-party service, which can quickly and cheaply return the output of the computation (such as an oracle service like Chainlink or an AWS Lambda function). However, in many instances, we must depend on the assumption that the computation has been executed correctly, thereby granting the service provider the potential to generate inaccurate results. Such results could lead to severe consequences and compromise the integrity of the system. - Private computation: In cases where the computation is not costly to perform locally but some parts of it need to be private, ZKPs can also be useful. In other words, ZKP can ensure that the computations have been performed correctly without even disclosing the private values to the outsourced provider. For instance, if we want to demonstrate that we know the 1000-th Fibonacci number without disclosing the actual number or prove that we made a valid payment without revealing our identity or the amount paid, ZKPs can easily help achieve this goal. Even further, we can selectively disclose some private values while hide other parts through ZKPs (which is also known as selective disclosure) In this context of blockchain, the above use cases of ZKP are presented as: - Layer 2 scaling: By incorporating ZKPs into verifiable computation, Layer 1 blockchains can delegate transaction processing to off-chain, high-performance systems, commonly referred to as Layer 2s. This approach enhances blockchain scalability while maintaining robust security measures, striking a balance between efficiency and security. StarkWare and Scroll have developed scalable smart contract platforms that employ a specialized virtual machine to execute ZK-friendly code, and Aztec allows their layer 2 programs to run privately, protecting the information of a user's transactions. - Private layer 1s: Aleo, Mina, Manta and Zcash are layer 1 blockchains that use ZKPs to enable transactors to conceal senders, receivers, or amounts, either by default (as in Aleo) or through an opt-in process (as in Mina and Zcash). - Data Compression: ZKPs are utilized by Mina and Celo to compress the blockchain data required for syncing to the most recent state of the chain into a brief proof. - Decentralized Storage: Filecoin also uses ZKPs (running on GPUs) to prove that nodes in the network store data correctly. Consequently, as cryptocurrency adoption continues to expand, the demand for enhanced performance and privacy is also expected to rise, fueling the need for more sophisticated applications of ZKPs. As developers create new applications and protocols, ZKPs are poised to play a crucial role in facilitating secure and efficient decentralized systems. These systems will be capable of handling large-scale transaction processing while preserving user privacy, meeting the evolving requirements of the digital currency landscape. ## How come ZKPs are slow, and what can be done to increase their speed? To prove a computation using ZKPs, it is necessary to convert the computation from a classical program to a ZK-friendly format. There are two ways to do this: either by using a library written in some existent language, like Arkworks (in Rust), or by using a Domain Specific Language (DSL) such as Cairo or Circom that compiles down to the required primitives for generating the proof. The more complicated the operation, the longer it takes to generate the proof. In addition, some operations are not inherently ZK-friendly and require additional work to be made so. For instance, operations like bitwise functions utilized in SHA or Keccak may not be friendly with ZKPs, leading to extended proof generation times. This can occur even for operations that are relatively inexpensive when executed on a classical computer. After converting your computation into a ZK-friendly format, you select some inputs and submit them to a proof system. There are numerous proof systems, such as Groth16, PLONK, HyperPlonk, UltraPlonk, Sonic, Spartan, and STARK. All these systems share the same basic function of accepting a ZK-friendly computation with inputs and generating a proof as output. Depending on the proof system, the proof generation process may differ, but the bottleneck ends up being similar. There are mainly two computational tasks that are often used in the context of ZKPs: - Multiplications over large vectors of numbers, which can be either field or group elements. In this case, there are also two specific types of multiplications: variable-base and fixed-base multi-scalar multiplications (MSMs). - Number Theoretic Transform (NTT) and Inverse NTT. However, there are also techniques available for NTT-less proof systems, such as the recent HyperPlonk system. In proof systems where both NTTs and MSMs exist, about 60% of the time generating a proof is spent on MSMs, and the rest is dominated by NTTs. Both MSMs and NTTs have performance challenges that can be addressed in the following ways: - MSMs can be executed on multiple threads, enabling parallel processing. However, when dealing with large vectors of elements, such as 67 million elements, the multiplications may still be slow and demand considerable memory resources. Additionally, MSMs present scalability challenges and can remain sluggish even when extensively parallelized. - Frequent data shuffling during the algorithm makes NTTs hard to distribute across a computing cluster, and they require significant bandwidth when run on hardware due to the need to load and unload elements from large datasets. This can cause slowdowns even though the hardware operations are fast. For instance, if a hardware chip has 16GB of memory or less, running NTTs on a >100GB dataset would require loading and unloading data over the wire, which can significantly slow down the operations. We believe that hardware acceleration is a vital component in enabling blockchain protocols to attain the necessary scalability for practical applications. Currently, blockchain protocols are constrained by their on-chain compute and storage capacities, as well as their networking bandwidth. By optimizing off-chain hardware and proof generation, we are confident that we can substantially boost the performance of blockchain networks, making them more effective and efficient. ### ZPrize Competition for Zero-knowledge Tools ZPrize is a collective initiative within the blockchain industry, comprising over 32 partners and sponsors who contribute their time, resources, and efforts to ensure its success. The program provides a range of challenges and programs to inspire participants to develop inventive solutions that promote sustainability and reduce carbon emissions. They took pride in achieving both objectives, which marks a significant advancement in the field of zero-knowledge cryptography. The outcomes of the ZPrize surpassed their expectations, as evidenced by the chart provided below. For the majority of the prizes for which they received submissions, the average enhancement from the baseline was more than five times. Notably, some prizes involving FPGAs lacked a public benchmark, making these submissions the first public demonstration of the algorithm implementations in open-source form. Here are a few noteworthy achievements: - Multiscalar multiplication on a GPU for a large-scale problem instance ($2^{26}$ elements) showed improvement from 5.8 seconds to 2.5 seconds, enabling more complex usage of zkSNARKs. - Poseidon zero-knowledge friendly hash function implementation reduced the constraint count by half, resulting in a significant decrease in the cost of creating Merkle trees inside a SNARK. - Multiscalar multiplication on an Android mobile device for a small-scale problem instance improved from 2.4 seconds to approximately half a second, facilitating ZK-based mobile payments. In the coming months, they plan to feature the work of all the teams who participated in this initiative. As the field of zero-knowledge cryptography evolves, novel methods and obstacles will arise. They aspire to regularly organize the ZPrize competition to facilitate the progress of this technology and make it available to the public through a collection of open-source materials. ### Technical Approaches for Accelerating ZKP The bottleneck in proof generation typically stems from the multiplication of large number vectors, including fields or group elements, multiscalar multiplications of variable or fixed cardinality, and NTT and inverse NTT. In proof systems utilizing both NTT and MSM, MSM contributes to approximately 60% of the proof generation time, while NTT dominates the remaining time. Although MSM is parallelizable, it remains slow, necessitating significant memory resources and often consuming most of the available memory on the device. On the other hand, NTT relies heavily on frequent data shuffling, which complicates acceleration by distributing the load across computing clusters. Additionally, NTT demands substantial bandwidth to run on hardware, as loading and unloading data over the network can notably decelerate operations, despite the hardware operations themselves being quite fast. However, there are methods to enhance the performance of both MSM and NTT in order to optimize proof generation processes. MSM is commonly implemented using Pippenger's algorithm to minimize the number of group addition operations. This method has a straightforward hardware implementation, which includes a group addition unit and a bucket table as its basic components. From a system design perspective, MSM is highly accelerator-friendly for the following reasons: - Group addition is a static operation (no data-dependent control flow) with intensive computation and relatively small input/output, making it well-suited for a fully-pipelined architecture. This allows accelerators to achieve nearly 100% hardware utilization, compared to the 20% utilization seen in the best GPU implementations.Decades of research have made group addition a stable operation, so it is unlikely to be replaced within the life-span of an ASIC accelerator. - The pain points of MSM implementation on massively parallel hardware with relatively weak global scheduling capabilities (like GPUs) can be easily addressed by adding an extra hardware scheduler to the accelerator. For example, updating the bucket table is more efficiently handled in this way, avoiding the risk of write-conflicts. In summary, accelerating MSM using specialized hardware is highly advantageous. However, the main issue is the lack of such hardware in the market. Cycis has designed an MSM accelerator using Xilinx FPGAs, which demonstrates the best system-wise performance to date (around 40ms per $2^{26}$ MSM on the BN254 curve). Unfortunately, the high cost of FPGA chips for professional users compared to GPUs designed for general gamers, along with the process lag of FPGAs (14nm for FPGA vs enhanced 5nm for GPU), are two significant drawbacks that negate the utilization advantage of FPGA-based accelerators. The NTT is composed of layers of butterfly operators. While the textbook layer-by-layer NTT implementation can quickly become bottlenecked by memory bandwidth utilization (due to strided access), this problem can be resolved by executing a batch of layers simultaneously and properly reordering NTT data during computation. As a rule of thumb, memory access can be reorganized into block access with a granularity of approximately (size of on-chip memory capacity) / ($k$-th root of the NTT points), where $k$ is the adjustable parameter representing the number of layer groups. By using this method, both GPUs and accelerators can achieve excellent performance. Hardware acceleration can bolster the performance of blockchain protocols by deploying optimized algorithms on diverse hardware technologies, such as GPUs, FPGAs, or ASICs. Enhancing software and algorithmic implementations to more effectively utilize existing resources like CPUs and GPUs, as well as developing custom hardware in conjunction with novel algorithms tailored for specific hardware configurations, can facilitate hardware acceleration. By doing so, the performance of blockchain networks, which are presently constrained by their on-chain computing and storage capacities and network bandwidth, can be significantly improved. ## What is the best hardware to use: now and the future? To determine the best hardware technology to implement the acceleration techniques discussed above, we need to consider that ZKPs are still in their early stages of development. As such, there is limited standardization on system parameters and choice of proof system, such as FFT width or bit-size of elements. Additionally, there are two more factors we need to consider: - Performance per dollar: This measures how much capital you need to pay for the hardware performance. Based on our experience, FPGA cannot compete with GPU on this metric. Based on ZPrize competition submissions and our internal benchmark results, a single RTX3090 GPU can produce roughly 2.5x higher performance than a single high-end FPGA. And the price of a high-end FPGA and a high-end GPU is roughly the same, which means an FPGA will be roughly 2.5x more expensive than a GPU system in the end. - Performance per watt: This metric is about how much you have to spend to maintain the hardware, which is basically the electric bill. FPGAs are roughly at the same level as GPUs on this metric. Based on the above two metrics, does this mean that FPGAs cannot outperform GPU? The Answer is NO. Although a single FPGA chip cannot compete with a single GPU, a massively interconnected FPGA system can beat a massively interconnected GPU system. A top-end GPU system can have at most 10 GPU cards in total due to the constraints of PCIe slots. However, dozens of FPGA chips can be connected together with customized, programmable, high-bandwidth interconnect, therefore reaching a performance level that can beat the GPU system. Cysic's massively-connected FPGA system is a direct testimony of this insight. ### How about ASIC for ZKP? ASIC is unanimously believed to be the ideal hardware for ZKP. However, there are three majors concerns over building ASIC for ZKP. - Programmability: In terms of ZKP logic modifications, ASICs have write-once business logic, which necessitates rebuilding the entire system from the beginning. On the contrary, FPGAs or GPU can be reprogrammed numerous times, enabling the use of the same hardware across various projects with different proof systems or potential updates. This attribute makes FPGAs a more versatile alternative compared to ASICs. - High cost: Building ASIC is a capital-intensive game. It's common to spend $8m for a full-mask 28nm ASIC chip design with 20 engineers, or $25m for a full-mask 5nm/4nm ASIC design with 30 engineers. - Time to market. The time required for ASIC design, production, and deployment is typically 12 to 18 months or more, which is much longer than FPGA/GPU. The first issue can be addressed using a framework in hardware, called Instruction-Set Architecture (ISA). An ISA refers to the abstract framework and design of a hardware system, such as CPU and hardware accelerator. It represents the set of basic instructions and operations that a processor can execute. These instructions include arithmetic operations, data movement, logical operations, control flow, and other low-level functions. Using ISA, the hardware accelerating ZKP can be divided into three layers: - The scheme layer: The various ZK proof systems are in this layer, such as Groth16 and Plonk. The top layer calls the kernel layer to fulfill the computation. - The kernel layer: The job for the kernel layer is to compute different functions that are used by the top layer. The functions include multi-scalar multiplication, number theoretic transfer, polynomial evaluation and various hash functions. It relies on the instruction layer to perform the computation. - The instruction layer: This layer is the ISA, which covers the most elementary operations, including arithmetic operations, data movement, logical operations and control flow. The above structure is illustrated in the following picture: ![](https://i.imgur.com/CtZXC84.png) This ISA structure provides a solution to support multiple ZK systems and the potential updates. An AZK (ASIC for ZK) can be programmed using the ZK-ISA to support the versatility of ZK systems. For the other concerns about doing ASIC, it is determined by the market. Currently the valuation of ZK-related projects adds up to more than $10b, we believe it is worthwhile to spend around $10m to build an ASIC to improve the efficiency of ZK proof generation. This time-to-market issue is inherent in all ASIC designs. Fortunately, the supply chain issue for ASIC has become much better now than before. The only solution to address this issue is to start early and warmup the supply chain thoroughly. By the time ASICs come out, it can demolish any other forms of ZK hardware. ## Conclusion ZKPs possess the potential to facilitate scalable and private payment systems, as well as smart contract platforms, substantially enhancing the usability and security of blockchain technology. It is true that ZKPs introduce overheads that have historically impeded their adoption. Factors such as significant computation and verification costs, along with the complexity of implementing ZKP-based systems, can create barriers to entry for numerous developers. Nonetheless, ongoing research and development in the field of ZKPs are addressing these challenges, simplifying the implementation and application of these techniques in practice. Considering the advantages of GPUs in terms of flexibility, ease of deployment, and expected performance, we believe that companies concentrating on GPU solutions are more likely to thrive in the coming months before ASIC technology emerges. If ASICs attain dominant scale and stability, they may become the preferred hardware for optimized algorithms. Consequently, ZKPs are expected to maintain a crucial role in enabling increasingly advanced and secure blockchain systems in the future.

Read more

Cysic Network Litepaper

Analyzing the Hyperplonk Proof System, a Hardware Perspective