We would like to thank Zhenfei Zhang for insightful discussions.
We show the performance of Cysic in proving the Keccak function:
With the advance of proving techniques, such as better algorithms and faster software implementations, more than one million Poseidon hashes can now be proven per second on a MacBook. The speed of proving hash functions matters in one possible future of Ethereum, the Verge. As Vitalik wrote in the article, "Verkle trees are vulnerable to quantum computers, and so if we replace the current KECCAK Merkle Patricia tree with Verkle trees, we will later have to replace the trees again". A performant proving speed for the Keccak function is therefore crucial to realizing a post-quantum secure STARKed binary hash tree. However, the best existing technology can only prove several thousand Keccak calls per second, which is far from practical for the Verge approach. The main reason for the slow Keccak proving is the SNARK/STARK-unfriendliness of Keccak and other conventional hash functions, such as BLAKE.
In this article, we use the GKR proof system to analyze the proving speed of Keccak on Cysic's C1 chip, which consists of a 64-by-64 array of processing elements (PEs) with massive bandwidth to both on-chip SRAM and off-chip DRAM. There are two main hardware products based on the C1 chip, which is designed to support various proof backends:
We show that, with a novel GKR-specific design of the Keccak proving process, our C1 chip can prove 1.31M Keccak functions per second. We remark that this design of the Keccak proving process can also be used on other hardware platforms, such as CPUs, which could see similar improvements.
Before diving into the technical details of the optimizations, it is necessary to review some basics of the Keccak function and the GKR proof system.
Keccak is a cryptographic hash function family and the foundation of the Secure Hash Algorithm 3 (SHA-3) standard. It is a member of the sponge construction family, which enables a flexible approach to hashing and other cryptographic applications like random number generation and authenticated encryption. The Keccak function can be described in the following code snippet:
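An illustrative Python sketch of the function (ours, simplified for clarity; all names are our own): the Keccak-f[1600] permutation with its step mappings, wrapped in a SHA3-256 sponge so the whole thing can be checked against a standard library implementation.

```python
# Illustrative sketch of Keccak-f[1600] and a SHA3-256 sponge around it.
import hashlib

MASK = (1 << 64) - 1

def rol(v, n):
    # rotate a 64-bit lane left by n bits
    return ((v << n) | (v >> (64 - n))) & MASK

def rho_offsets():
    # rotation offsets for the rho step, generated by the standard recurrence
    off = [[0] * 5 for _ in range(5)]
    x, y = 1, 0
    for t in range(24):
        off[x][y] = ((t + 1) * (t + 2) // 2) % 64
        x, y = y, (2 * x + 3 * y) % 5
    return off

def round_constants():
    # iota round constants from the degree-8 LFSR in the Keccak spec
    rcs, R = [], 1
    def bit():
        nonlocal R
        b = R & 1
        R <<= 1
        if R & 0x100:
            R ^= 0x171
        return b
    for _ in range(24):
        rc = 0
        for j in range(7):
            rc |= bit() << ((1 << j) - 1)
        rcs.append(rc)
    return rcs

OFF, RC = rho_offsets(), round_constants()

def keccak_f(A):
    # A[x][y] is a 64-bit lane; 24 rounds of theta, rho, pi, chi, iota
    for rnd in range(24):
        C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
        D = [C[(x - 1) % 5] ^ rol(C[(x + 1) % 5], 1) for x in range(5)]  # theta
        for x in range(5):
            for y in range(5):
                A[x][y] ^= D[x]
        B = [[0] * 5 for _ in range(5)]
        for x in range(5):  # rho (rotate) and pi (permute lanes)
            for y in range(5):
                B[y][(2 * x + 3 * y) % 5] = rol(A[x][y], OFF[x][y])
        for x in range(5):  # chi: the only non-linear (AND) step
            for y in range(5):
                A[x][y] = B[x][y] ^ (~B[(x + 1) % 5][y] & B[(x + 2) % 5][y] & MASK)
        A[0][0] ^= RC[rnd]  # iota
    return A

def sha3_256(msg):
    # sponge: rate 136 bytes, SHA-3 padding 0x06 ... 0x80, lanes little-endian
    rate, A = 136, [[0] * 5 for _ in range(5)]
    buf = bytearray(msg) + b"\x06"
    buf += b"\x00" * (-len(buf) % rate)
    buf[-1] |= 0x80
    for o in range(0, len(buf), rate):
        for i in range(rate // 8):
            A[i % 5][i // 5] ^= int.from_bytes(buf[o + 8 * i:o + 8 * i + 8], "little")
        keccak_f(A)
    return b"".join(A[i % 5][i // 5].to_bytes(8, "little") for i in range(4))

# sanity check against the standard library's SHA3
assert sha3_256(b"abc") == hashlib.sha3_256(b"abc").digest()
```

Only the chi step is non-linear; everything else is XORs and bit-permutations, which is exactly the structure the GKR design below exploits.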
As we can see in the code, the typical Keccak function consists of 24 rounds, each made up of several steps ($\theta$, $\rho$, $\pi$, $\chi$, $\iota$):
The GKR proof system (short for “Goldwasser-Kalai-Rothblum proof system”) is a cryptographic interactive proof system introduced by Shafi Goldwasser, Yael Kalai, and Guy Rothblum in their seminal 2008 work. The GKR proof system leverages techniques such as:
The GKR proof system we use here is slightly different from the original GKR system in the following manner:
The proof conclusion condition remains the same: when the claims are only about the input wires of the circuit, the proof concludes. One benefit of this slightly different GKR proof system is that it allows a sumcheck-free proof process for bit-permute operations and XORs with constants.
The target number of Keccak instances we consider here is 1M. In the following discussion, we use the term "logic bit" to refer to a wire defined in the Keccak hash function, and "witness bit" to refer to the duplicates of that logic bit across the 1M instances. For example, since the Keccak hash function uses a 1,600-bit state, we have 1,600 logic bits, each corresponding to 1M witness bits. Our design consists of two parts:
We encode the 1,600-bit state as 1,600 20-variable multilinear polynomials (since $2^{20} \approx$ 1M), and our proof state consists of a full set of 1,600 pending claims about the evaluations of these polynomials. Recall that the Keccak function consists of three types of operations.
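In symbols (our notation), logic bit $\ell$ across the $2^{20}$ instances is encoded as the multilinear extension

$$\tilde b_\ell(x_1,\dots,x_{20}) \;=\; \sum_{j \in \{0,1\}^{20}} \widetilde{\mathrm{eq}}(x, j)\, b_\ell[j], \qquad \ell = 1, \dots, 1600,$$

where $b_\ell[j] \in \{0,1\}$ is the value of logic bit $\ell$ in instance $j$, and the proof state is the set of pending claims $\{\tilde b_\ell(\vec r) = v_\ell\}_{\ell=1}^{1600}$ at a common evaluation point $\vec r$.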
Now, the bit-permutation corresponds to a no-operation (nop) from the prover's perspective: the prover and verifier only need to agree on relabeling the claimed values among the set of 1,600 polynomials of the 1,600 logic bits, provided the evaluation points of the claims are the same. For instance, if some of the claims before the permutation are
$$\tilde b_1(\vec r) = v_1, \qquad \tilde b_2(\vec r) = v_2, \qquad \tilde b_3(\vec r) = v_3,$$
then after a permutation that, say, swaps logic bits 1 and 2, the corresponding claims become
$$\tilde b_2(\vec r) = v_2, \qquad \tilde b_1(\vec r) = v_1, \qquad \tilde b_3(\vec r) = v_3,$$
i.e., the claims are simply relabeled and no cryptographic work is performed.
The XOR operation ($\oplus$) corresponds, over a characteristic-2 field, to the addition operation ($+$). Since each logic bit is encoded into a standalone polynomial, the two operands simply come from two different polynomials. Moreover, no sparse polynomial is needed to link the two addition operands. Specifically, we don't need to do the following, as in the traditional GKR proof system:
$$\tilde V_i(z) = \sum_{x,y \in \{0,1\}^k} \widetilde{\mathrm{add}}_i(z,x,y)\left(\tilde V_{i+1}(x) + \tilde V_{i+1}(y)\right)$$
or
$$\tilde V_i(z) = \sum_{x,y \in \{0,1\}^k} \widetilde{\mathrm{mult}}_i(z,x,y)\,\tilde V_{i+1}(x)\,\tilde V_{i+1}(y),$$
where $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ are the multilinear extensions of the sparse polynomials that link the input and output ports of the gates in the traditional GKR implementation.
If we have:
$$\tilde a(\vec r) = v_a, \qquad \tilde b(\vec r) = v_b,$$
we can do the following sumcheck-free proof state reduction for $c = a \oplus b$:
$$\tilde c(\vec r) = v_c,$$
where
$$\tilde c = \tilde a + \tilde b \qquad \text{and} \qquad v_c = v_a + v_b.$$
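As a quick sanity check of this linearity, here is a toy Python sketch over GF(2^8), a small characteristic-2 stand-in for the real field (all names are ours): the multilinear extension of the XOR of two bit-columns equals the field sum of their extensions at any evaluation point.

```python
import random

def gf_mul(a, b):
    # carry-less multiplication in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def mle_eval(vals, point):
    # evaluate the multilinear extension of `vals` (length 2^k) at `point`;
    # in characteristic 2, field addition is XOR and 1 - r equals 1 ^ r
    vals = list(vals)
    for r in point:
        h = len(vals) // 2
        vals = [gf_mul(vals[i], 1 ^ r) ^ gf_mul(vals[h + i], r) for i in range(h)]
    return vals[0]

rng = random.Random(1)
k = 3
a = [rng.randrange(2) for _ in range(2 ** k)]   # one logic bit, 2^k instances
b = [rng.randrange(2) for _ in range(2 ** k)]
c = [u ^ v for u, v in zip(a, b)]               # the XOR'd logic bit
r = [rng.randrange(256) for _ in range(k)]      # random evaluation point
# sumcheck-free reduction: the claim for c is the field sum of the claims for a and b
assert mle_eval(c, r) == mle_eval(a, r) ^ mle_eval(b, r)
```

Because $a \oplus b = a + b$ for bits in a characteristic-2 field and multilinear extension is linear, the identity holds at every point, not just on the hypercube.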
The AND operation ($\wedge$) appears only in the $\chi$ step of the Keccak function, corresponding to the multiplication operation ($\times$) in the field. It can be simplified because the two operands come from the same instance position of two different polynomials, i.e., for logic bits $a$, $b$ and output $c = a \wedge b$:
$$c[j] = a[j] \wedge b[j] \quad \text{for every instance } j \in \{0,1\}^{20}.$$
Therefore, we proceed differently from traditional GKR. In traditional GKR, we do
$$\tilde V_i(z) = \sum_{x,y \in \{0,1\}^k} \widetilde{\mathrm{mult}}_i(z,x,y)\,\tilde V_{i+1}(x)\,\tilde V_{i+1}(y).$$
Instead, we have the following reduction:
$$\tilde c(\vec r) = \sum_{x \in \{0,1\}^{20}} \widetilde{\mathrm{eq}}(\vec r, x)\,\tilde a(x)\,\tilde b(x),$$
where
$$\widetilde{\mathrm{eq}}(\vec r, x) = \prod_{j=1}^{20} \bigl(r_j x_j + (1 - r_j)(1 - x_j)\bigr).$$
This leads to a reduction of claims as follows: after a single sumcheck over the 20 instance variables, the pending claim is replaced by two new claims at a fresh random point,
$$\{\tilde c(\vec r) = v_c\} \;\longrightarrow\; \{\tilde a(\vec r\,') = v_a,\ \tilde b(\vec r\,') = v_b\}.$$
Combining the above three observations, the prover's work is greatly simplified for the following reasons:
In fact, in the whole proof process, we only need one sumcheck for each round of state updates in the Keccak function.
Encoding each logic bit as a standalone polynomial is nearly harmless for the prover, but it significantly increases the verifier's cost. The solution is to combine the claims on the 1,600 polynomials using a random linear composition. In most cases, this resembles the original GKR proof system operating on a single polynomial over two sets of variables, one indexing the logic bits and one indexing the parallel instances. The essence of the sumcheck-free claim reduction for the XOR and bit-permute operations is captured by an efficient reduction for the following type of "half-sumcheck":
with the following state transition:
Compared with a normal sumcheck, such a "half"-sumcheck:
Putting everything together, we can simplify the proving process of each state-update round of the Keccak function into one full sumcheck for the multiplication gate ($\times$) and an almost free "half"-sumcheck. The full sumcheck covers the AND gate in the $\chi$ step, and the "half"-sumcheck covers all the remaining operations (XOR and bit-permute). This is summarized in the following figure:
The full sumcheck proceeds as follows:
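To make the structure concrete, here is a toy Python sketch (ours, over a small prime field rather than the system's real field) of the full sumcheck for a claim of the form $\tilde c(\vec r) = \sum_x \widetilde{\mathrm{eq}}(\vec r, x)\,\tilde a(x)\,\tilde b(x)$, reducing an AND-gate claim to claims about $\tilde a$ and $\tilde b$ at a fresh random point:

```python
import random

P = 2 ** 61 - 1  # toy prime field; a stand-in for the system's actual field

def fold(tab, r):
    # fix the top variable of a multilinear table to the challenge r
    h = len(tab) // 2
    return [(tab[i] + r * (tab[h + i] - tab[i])) % P for i in range(h)]

def mle_eval(tab, point):
    for r in point:  # point[0] binds the top variable
        tab = fold(tab, r)
    return tab[0]

def eq_table(rs):
    # table of eq~(rs, x) over all x in {0,1}^k, with rs[0] indexing the top bit
    tab = [1]
    for r in reversed(rs):
        tab = [t * (1 - r) % P for t in tab] + [t * r % P for t in tab]
    return tab

def sumcheck_product(eq, a, b, rng):
    # prove and verify  claim = sum_x eq[x] * a[x] * b[x]  (degree 3 per variable)
    k = len(a).bit_length() - 1
    claim = sum(e * u % P * v for e, u, v in zip(eq, a, b)) % P
    chals = []
    for _ in range(k):
        h = len(a) // 2
        evals = []
        for t in range(4):  # a degree-3 univariate needs 4 evaluation points
            s = 0
            for i in range(h):
                et = (eq[i] + t * (eq[h + i] - eq[i])) % P
                ut = (a[i] + t * (a[h + i] - a[i])) % P
                vt = (b[i] + t * (b[h + i] - b[i])) % P
                s = (s + et * ut % P * vt) % P
            evals.append(s)
        assert (evals[0] + evals[1]) % P == claim  # verifier's round check
        r = rng.randrange(P)
        chals.append(r)
        claim = 0  # next claim = g(r), by Lagrange interpolation on {0,1,2,3}
        for t in range(4):
            num = den = 1
            for u in range(4):
                if u != t:
                    num = num * (r - u) % P
                    den = den * (t - u) % P
            claim = (claim + evals[t] * num % P * pow(den, P - 2, P)) % P
        eq, a, b = fold(eq, r), fold(a, r), fold(b, r)
    assert claim == eq[0] * a[0] % P * b[0] % P  # final oracle check
    return chals, a[0], b[0]  # the two reduced claims a~(chals), b~(chals)

# usage: reduce a claim about c = a AND b to claims about a and b
rng = random.Random(0)
k = 3
a_bits = [rng.randrange(2) for _ in range(2 ** k)]
b_bits = [rng.randrange(2) for _ in range(2 ** k)]
c_bits = [u & v for u, v in zip(a_bits, b_bits)]
r_vec = [rng.randrange(P) for _ in range(k)]
# the pending claim c~(r_vec) equals the sum the sumcheck will prove
assert mle_eval(c_bits, r_vec) == sum(
    e * u * v for e, u, v in zip(eq_table(r_vec), a_bits, b_bits)) % P
chals, va, vb = sumcheck_product(eq_table(r_vec), a_bits, b_bits, rng)
assert va == mle_eval(a_bits, chals) and vb == mle_eval(b_bits, chals)
```

In the real system the sum runs over 20 instance variables per polynomial; here $k = 3$ keeps the trace small while preserving the degree-3 round structure.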
Once the algorithm is fixed, the computational complexity is almost determined. However, in a software implementation (or when mapping to hardware accelerators), there are still constant-factor optimizations that can be pushed further. In the following, we describe some field-agnostic optimizations.
The first opportunity arises in the first round of sumcheck. In binary circuits, encoding a single wire requires only one bit per instance, while directly storing a field element requires 128 bits. Therefore, from both a memory- and bandwidth-saving perspective, the first round of sumcheck can be optimized: it reads the circuit data in compressed form (e.g., one byte encoding 8 field elements) directly from memory and decompresses it into field elements on-chip for computation.
This idea can be extended to the first four layers of sumcheck, if the hardware can support efficient table lookups with 256 entries (or, in general, the first $s$ layers, if efficient lookup in a $2^{2^{s-1}}$-entry table is supported). Specifically, for the first four layers of sumcheck, the circuit data in compressed form is read directly from memory, and on-chip SRAM is used for lookup-based decompression into field elements. Assuming the random folding parameters for the first, second, and third layers are $r_1$, $r_2$, and $r_3$, respectively:
```
# layer 1: 2-entry table, bit -> field element
table1[x]                      = x                         # table1[0] = 0, table1[1] = 1
# layer 2: 4-entry table indexed by bits (x, y)
table2[x, y]                   = (1 - r_1)*x + r_1*y
# layer 3: 16-entry table indexed by (x, y, z, w)
table3[x, y, z, w]             = (1 - r_2)*table2[x, y] + r_2*table2[z, w]
# layer 4: 256-entry table indexed by (x, y, z, w, a, b, c, d)
table4[x, y, z, w, a, b, c, d] = (1 - r_3)*table3[x, y, z, w] + r_3*table3[a, b, c, d]
```
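To illustrate the lookup-based decompression, here is a toy Python sketch (ours, over a stand-in prime field): it builds the 256-entry table for challenges $r_1$, $r_2$, $r_3$ and checks that one byte-indexed lookup reproduces three rounds of folding on 8 packed wire bits.

```python
import random

P = 2 ** 61 - 1  # stand-in prime field

def build_table(r1, r2, r3):
    # 256 entries: byte B packs 8 one-bit wires (x, y, z, w, a, b, c, d),
    # with x in the least-significant bit
    tab = []
    for B in range(256):
        x, y, z, w, a, b, c, d = ((B >> i) & 1 for i in range(8))
        lo = ((1 - r2) * ((1 - r1) * x + r1 * y)
              + r2 * ((1 - r1) * z + r1 * w)) % P
        hi = ((1 - r2) * ((1 - r1) * a + r1 * b)
              + r2 * ((1 - r1) * c + r1 * d)) % P
        tab.append(((1 - r3) * lo + r3 * hi) % P)
    return tab

def fold_low(vals, r):
    # one sumcheck fold of the lowest variable: pairs (2i, 2i+1)
    return [((1 - r) * vals[2 * i] + r * vals[2 * i + 1]) % P
            for i in range(len(vals) // 2)]

rng = random.Random(3)
r1, r2, r3 = (rng.randrange(P) for _ in range(3))
table = build_table(r1, r2, r3)

packed = bytes(rng.randrange(256) for _ in range(32))  # 256 wire bits, compressed
# one lookup per byte replaces three folding rounds over the unpacked bits
looked_up = [table[B] for B in packed]
bits = [(B >> i) & 1 for B in packed for i in range(8)]
folded = fold_low(fold_low(fold_low(bits, r1), r2), r3)
assert looked_up == folded
```

One 256-entry lookup thus stands in for seven field multiplications per 8 wire bits, and the packed bytes are the only data that ever cross the memory bus.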
This approach can significantly reduce memory reads/writes, intermediate storage requirements (which peak after the first SumCheck round), and the computational effort of the folding operations in the SumCheck dynamic programming calculations.
Next, we discuss detailed strategies for carefully organizing data reads/writes and computation sequences to minimize memory access.
Algorithmically, a complete sumcheck consists of two alternating phases, Tic and Toc (assuming the gate has degree $d$ and each of the input polynomials has $k$ variables, with $k = 31$ in our case):
- $|\text{inputs}| \times (d+1) \times 2^{k+1-i}$
- $|\text{outputs}| \times (d+1) \times 2^{k+1-i}$
- $|\text{outputs}| \times (d+1)$
First, within each phase of Tic, the operations along the $2^{k+1-i}$ dimension are completely element-wise parallel. These operations can be pipelined rather than storing large intermediate results in memory after each step, thereby reducing memory-access costs.
Second, we can consider fusing Toc-$i$ with Tic-$(i+1)$ to save memory reads/writes. (Note that Tic-$i$ cannot be fused with Toc-$i$, because the latter requires the random challenge.)
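A sketch of the fusion idea in Python (ours, with a stand-in prime field): the fold of round $i$ (Toc-$i$) and the evaluation pass of round $i+1$ (Tic-$(i+1)$) traverse the same tables, so they can be combined into a single pass that writes each folded table once and never reads it back in between.

```python
P = 2 ** 61 - 1  # stand-in prime field (the real system's field differs)

def toc_tic_fused(eq, a, b, r_prev):
    # Toc-i: fold each table with the round-i challenge r_prev.
    # Tic-(i+1): accumulate the 4 evaluations of the next round's degree-3
    # polynomial. Both happen in one pass over memory.
    h = len(a) // 2   # table size after folding
    q = h // 2        # pair stride for the next round's top variable
    folded = ([0] * h, [0] * h, [0] * h)
    evals = [0, 0, 0, 0]
    for i in range(q):
        pairs = []
        for tab, out in zip((eq, a, b), folded):
            lo = (tab[i] + r_prev * (tab[h + i] - tab[i])) % P
            hi = (tab[q + i] + r_prev * (tab[h + q + i] - tab[q + i])) % P
            out[i], out[q + i] = lo, hi
            pairs.append((lo, hi))
        for t in range(4):
            e, u, v = ((x + t * (y - x)) % P for x, y in pairs)
            evals[t] = (evals[t] + e * u % P * v) % P
    return folded, evals

# check against the unfused reference: fold first, then a separate pass
import random

def fold(tab, r):
    h = len(tab) // 2
    return [(tab[i] + r * (tab[h + i] - tab[i])) % P for i in range(h)]

rng = random.Random(5)
eq0, a0, b0 = ([rng.randrange(P) for _ in range(8)] for _ in range(3))
r = rng.randrange(P)
(feq, fa, fb), ev = toc_tic_fused(eq0, a0, b0, r)
assert [feq, fa, fb] == [fold(eq0, r), fold(a0, r), fold(b0, r)]
```

The fused loop reads each input entry once and produces both the folded tables and the next round's message, halving the number of passes over DRAM-resident data.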
The main body of the Cysic C1 chip consists of a 64-by-64 array of processing elements (PEs) with massive bandwidth to both on-chip SRAM and off-chip DRAM. The PEs are connected by two sets of interconnects:
The control logic is designed for fully pipelined operation in regular dataflows, where data access patterns have simple analytical expressions, while maintaining the flexibility to handle irregular dataflows.
The PE array includes general-purpose ALUs for standard 32/64-bit arithmetic and bitwise operations, as well as specialized ALUs tailored for cryptographic tasks such as hashing (Blake2, Poseidon, Keccak) and modular arithmetic (ranging from 32-bit up to 384-bit width). The ALUs are categorized based on their bandwidth requirements and computational complexity:
Outside the PE array, the C1 chip also features a high-performance CPU to execute arbitrary code, particularly for generating random challenges for Fiat-Shamir transformations. The chip communicates with the host CPU via a USB 3.0 interface. The host can be a server, a PC, or even a laptop, and multiple C1 chips can be utilized simultaneously by a single host.
If we use $s = 4$ layers of lookup tables, with $d = 3$ and $k = 31$ (1M hashes, or 2G wires being proved together), then the computation per layer of Keccak is as follows (the computation cost is normalized to bytes to clearly show the relative cost of computation and memory access):
Putting everything together, the evaluation results are shown in the following table:
| | Cysic C1 Chip | AMD 7950X3D |
|---|---|---|
| Performance (Keccak proofs/second) | 1.31M | 15k |
| Process node | TSMC 12nm | TSMC 5nm |
As the GPU implementation is not yet ready, we do not report GPU performance. As we mentioned in the introduction, our design of the Keccak proving process on the GKR system could improve proving speed on other hardware platforms as well. Our C1 chip demonstrates strong performance over CPU and, likely, GPU implementations. Although the Cysic C1 chip will be fabricated on a much older node than the CPU or GPU hardware, it still outperforms the CPU by a large margin. This strong performance is largely due to its architectural advantages: a high-speed memory system and ZK-specific ALU design.
We are excited to show that the Cysic C1 chip is capable of proving more than enough Keccak functions per second, which paves the way for Ethereum's Verge. The Keccak function is just one instance demonstrating the power of the Cysic C1 chip. We envision that the highly efficient Cysic C1 chip will expand the use cases of ZK technology.