# How Fast We Can Go: Proving a Million Keccak Functions Per Second
*We would like to thank Zhenfei Zhang for insightful discussions.*
### TL;DR
We show the performance of Cysic in proving Keccak function:
- The Cysic C1 chip can prove **1.31M** Keccak functions per second;
- Achieving this requires in-depth optimization in both code and algorithm design, in addition to powerful hardware.
## Introduction
With advances in proving techniques, whether through better algorithms or faster software implementations, more than one million Poseidon hash functions can now be proven per second on a MacBook. The speed of proving hash functions matters for one possible future of Ethereum, the [Verge](https://vitalik.eth.limo/general/2024/10/23/futures4.html). As Vitalik wrote in that article, "*Verkle trees are vulnerable to quantum computers, and so if we replace the current KECCAK Merkle Patricia tree with Verkle trees, we will later have to replace the trees again*". A high proving speed for the Keccak function is therefore crucial for realizing a post-quantum secure STARKed binary hash tree. However, the best existing technology can only prove several thousand Keccak calls per second, which is far from practical for the Verge approach. The main reason Keccak proving is slow is the SNARK/STARK-unfriendliness of the Keccak function, a property shared by other conventional hash functions such as BLAKE.
In this article, we use the GKR proof system to analyze the proving speed of Keccak on Cysic's C1 chip, which consists of a 64-by-64 array of processing elements (PEs) with massive bandwidth to both on-chip SRAM and off-chip DRAM. There are two main hardware products based on the C1 chip, which is designed to support various proof backends:
- ZK-Air: a portable machine smaller than a Mac mini. It can be connected to laptops and cellphones via USB-C for client-side proving. It consists of a single C1 chip and will outperform the latest GPU card in ZK proving.
- ZK-Pro: a powerful machine about the size of a mining machine. It consists of multiple C1 chips and does not need to be connected to any external device for ZK proof generation. It is mainly intended for server-side proving and will be faster than a GPU server in ZK proving.
We show that, with a novel GKR-specific design of the Keccak proving process, our C1 chip can prove **1.31M** Keccak functions per second. We remark that this design can also be applied on other hardware platforms, such as CPUs, which could see similar improvements.
## Preliminaries
Before diving into the technical details of the optimization, it is necessary to review some basics of the Keccak function and the GKR proof system.
### Keccak Function
Keccak is a cryptographic hash function family and the foundation of the Secure Hash Algorithm 3 (SHA-3) standard. It is a member of the sponge construction family, which enables a flexible approach to hashing and other cryptographic applications like random number generation and authenticated encryption. The Keccak function can be described in the following code snippet:
![Screenshot 2024-11-22 at 12.16.28](https://hackmd.io/_uploads/rJ9nBYaGke.png)
As we can see in the code, the Keccak function consists of 24 rounds, each composed of the following steps (a minimal code sketch of one round follows this list):
- $\theta$ step: This step aims to introduce diffusion and increase the mixing of bits within the state array. This step is purely made of XOR operations, which are used to distribute information and propagate changes across the state array.
- $\rho$ step: After the $\theta$ step, the state array undergoes the $\rho$ step, which rotates the bits of each lane by a predefined offset. This step enhances the diffusion of bits and ensures that each bit interacts with a wide range of other bits.
- $\pi$ step: After the $\rho$ step, the state array proceeds with the $\pi$ step, which rearranges the positions of the lanes within the state array, contributing to a high degree of confusion and diffusion.
- $\chi$ step: The $\chi$ step introduces nonlinearity and plays a crucial role in amplifying the effects of small changes in the input, resulting in significant changes throughout the state array. Each bit in the state array is combined with two other bits during the $\chi$ step to produce a new bit value.
- $\iota$ step: Finally, the $\iota$ step XORs a round-dependent constant into the state, breaking the symmetry between rounds.
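To make the structure concrete, here is a minimal, illustrative Python sketch of a single round operating on a 5-by-5 array of 64-bit lanes. The names `RHO_OFFSETS` and `ROUND_CONSTANT` are placeholders standing in for the fixed tables of the Keccak specification, so this is a structural sketch rather than a drop-in implementation; it only shows where the XOR, rotation/permutation, and AND operations appear.

```python
MASK = (1 << 64) - 1  # lanes are 64-bit words

def rot64(v, n):
    """Rotate a 64-bit lane left by n bits."""
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK

# Placeholders: the real rotation offsets and round constants are fixed
# tables defined in the Keccak specification and are omitted here.
RHO_OFFSETS = [[0] * 5 for _ in range(5)]
ROUND_CONSTANT = 0x0000000000000001

def keccak_round(A, rc=ROUND_CONSTANT):
    """One round on a 5x5 array A of 64-bit lanes, indexed A[x][y]."""
    # theta: column parities, pure XOR diffusion
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    D = [C[(x - 1) % 5] ^ rot64(C[(x + 1) % 5], 1) for x in range(5)]
    A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
    # rho + pi: rotate bits within each lane, then permute lane positions
    B = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            B[y][(2 * x + 3 * y) % 5] = rot64(A[x][y], RHO_OFFSETS[x][y])
    # chi: the only nonlinear (AND) step
    A = [[B[x][y] ^ ((~B[(x + 1) % 5][y] & MASK) & B[(x + 2) % 5][y])
          for y in range(5)] for x in range(5)]
    # iota: XOR a round constant into one lane
    A[0][0] ^= rc
    return A

# Example: apply one round to the all-zero state.
state = [[0] * 5 for _ in range(5)]
state = keccak_round(state)
```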
### GKR Proof System
The GKR proof system (short for “Goldwasser-Kalai-Rothblum proof system”) is a cryptographic interactive proof system introduced by Shafi Goldwasser, Yael Kalai, and Guy Rothblum in their seminal 2008 work. The GKR proof system leverages techniques such as:
- Sum-check Protocol: A fundamental subroutine used to verify claims about sums of polynomials.
- Layer-by-layer Verification: Each layer of the circuit is verified sequentially to ensure correctness.
The GKR proof system we use here is slightly different from the original GKR system in the following manner:
- Property of the claims: In the original GKR system, the claim is about evaluating the multilinear extension (MLE) of **only one** polynomial at a given point, while ours is about evaluating the MLEs of **one or several** polynomials at several given points.
- Proof state transitions: In the original GKR proof system, every transition step removes one claim from the set and adds another, while in ours, **several** claims can be removed and added at the same time.
The termination condition remains the same: when the remaining claims are only about the input wires of the circuit, the proof concludes. One benefit of this modified GKR proof system is that it allows a sumcheck-free proof process for bit-permutation operations and XORs with constants.
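For intuition, the proof state in this modified system can be pictured as a set of pending claims, each asserting that the MLE of a named polynomial takes a given value at a given point. The following minimal Python sketch uses our own names (`Claim`, `ProofState`), not any Cysic API; it only illustrates how several claims may be removed and added in a single transition step.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Claim:
    poly: str                  # which multilinear polynomial (e.g. a logic bit)
    point: Tuple[int, ...]     # evaluation point (field elements)
    value: int                 # claimed MLE value at that point

class ProofState:
    """A set of pending claims; the proof ends when only input-wire claims remain."""
    def __init__(self, claims):
        self.claims = set(claims)

    def transition(self, removed, added):
        """One step: several claims are discharged and several new ones appear."""
        self.claims -= set(removed)
        self.claims |= set(added)

    def done(self, input_polys):
        return all(c.poly in input_polys for c in self.claims)
```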
### Initial Design Thoughts
The target number of Keccak instances we consider here is 1M. In the following discussion, we use the term "logic bit" to refer to the wire defined in the Keccak hash function, and "witness bit" to refer to the duplicated "logic bit" across the 1M instances. For example, since the Keccak hash function uses a 1,600-bit state, we have 1,600 logic bits and each corresponds to 1M witness bits. Our design thoughts consist of two parts:
+ In the first part, we demonstrate the benefit of encoding each logic bit as a standalone 20-variable multilinear polynomial, which greatly simplifies the prover's work. We ignore the verifier-side cost for now.
+ In the second part, we discuss how the verifier-side cost can be reduced using a linear composition of the 1,600 polynomials, which somewhat resembles the original GKR system working on a single 31-variable multilinear polynomial, while preserving the benefits of the per-bit encoding (see the variable count right after this list).
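As a quick sanity check on the variable counts (assuming the 1,600 logic bits are padded up to $2^{11} = 2{,}048$ index slots):

$$1{,}600 \le 2^{11}, \qquad 11 + 20 = 31 \text{ variables}, \qquad 2^{11} \times 2^{20} = 2^{31} \approx 2\text{G padded wires per round state},$$

which matches the "2G wires" figure used in the evaluation section.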
#### The first part
We encode the 1,600-bit state as 1,600 separate 20-variable multilinear polynomials, and our proof state consists of a set of 1,600 pending claims about the evaluations of these polynomials. Recall that the Keccak function consists of three types of operations:
+ Bit permutation
+ XOR operation ($\oplus$)
+ And operation ($\wedge$)
Now, a bit permutation corresponds to a no-operation (nop) from the prover's perspective: the prover and verifier only need to agree on exchanging the claimed values among the 1,600 polynomials of the 1,600 logic bits, provided the evaluation points of the claims are the same. For instance, suppose two of the polynomials before the permutation satisfy
$$\tilde{w}_1(r_1, r_2, r_3) = v_1, \tilde{w}_2(r_1, r_2, r_3) = v_2$$
After the permutation $\tilde{w}'_2 = \tilde{w}_1$, $\tilde{w}'_1 = \tilde{w}_2$, the claims become
$$\tilde{w}'_1(r_1, r_2, r_3) = v_2, \tilde{w}'_2(r_1, r_2, r_3) = v_1.$$
The XOR operation ($\oplus$) corresponds, over a characteristic-2 field, to the addition operation ($+$). Since each logic bit is encoded as a standalone polynomial, the two operands must come from two different polynomials. Moreover, no sparse polynomial is needed to link the two addition operands. Specifically, we don't need to do the following, as in the traditional GKR proof system:
$$\tilde{w}_{out}(r_1, r_2, r_3) = \sum_{s_1, s_2, s_3, u_1, u_2, u_3 \in \{0,1\}} \widetilde{ADD}\big((r_1, r_2, r_3), (s_1, s_2, s_3), (u_1, u_2, u_3)\big)\big(\tilde{w}_{in}(s_1, s_2, s_3) + \tilde{w}_{in}(u_1, u_2, u_3)\big)$$
or
$$\tilde{w}_{out}(r_1, r_2, r_3) = \sum_{s_1, s_2, s_3 \in \{0,1\}} \widetilde{ADD}\big((r_1, r_2, r_3), (s_1, s_2, s_3)\big)\,\tilde{w}_{in}(s_1, s_2, s_3)$$
where $\widetilde{ADD}$ (and the $\widetilde{MUL}$ appearing below) are the multilinear extensions of the sparse predicates that link the input and output wires of the gates in the traditional GKR implementation.
If we have:
$$\forall b_1, b_2, b_3 \in \{0, 1\}, w_{out}(b_1, b_2, b_3) = c_1 \cdot w_1(b_1, b_2, b_3) + c_2 \cdot w_2(b_1, b_2, b_3),$$
we can perform the following sumcheck-free proof state reduction:
$$\forall r_1, r_2, r_3 \in F, \tilde{w}_{out} (r_1, r_2, r_3) = c_1 \cdot \tilde{w}_1(r_1, r_2, r_3) + c_2 \cdot \tilde{w}_2(r_1, r_2, r_3), $$
which reduces the claim on $\tilde{w}_{out}$ into claims on $\tilde{w}_1$ and $\tilde{w}_2$ at the same point:
$$\tilde{w}_{out}(r_1, r_2, r_3) = v_1 \rightarrow (\tilde{w}_1(r_1, r_2, r_3) = v_2, \tilde{w}_2(r_1, r_2, r_3) = v_3),$$
with the verifier simply checking that $v_1 = c_1 \cdot v_2 + c_2 \cdot v_3$.
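Below is a minimal Python sketch of these two sumcheck-free reductions. It works over an illustrative prime field `P` rather than the characteristic-2 field used in the text, and the helper names are ours; the point is only that the verifier replaces claims without ever running a sumcheck.

```python
P = 2**61 - 1  # illustrative prime; the text works over a characteristic-2 field

def reduce_permutation_claims(claims, perm):
    """Bit permutation is a no-op for the prover: claims are simply relabeled to
    the pre-permutation polynomials; evaluation points and values are unchanged."""
    return {perm[name]: value for name, value in claims.items()}

def reduce_linear_claim(v_out, c1, v1, c2, v2):
    """Sumcheck-free reduction for w_out = c1*w_1 + c2*w_2 (XOR corresponds to
    c1 = c2 = 1 over characteristic 2).  The claim `w_out(r) = v_out` is replaced
    by two claims at the same point r; the verifier only checks a linear relation
    between the claimed values."""
    assert v_out % P == (c1 * v1 + c2 * v2) % P, "inconsistent claimed values"
    return [("w_1", v1), ("w_2", v2)]
```

For the permutation example above, `reduce_permutation_claims({"w'_1": v2, "w'_2": v1}, {"w'_1": "w_2", "w'_2": "w_1"})` returns the same values attached to the pre-permutation polynomials, exactly the relabeling described in the text.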
The AND operation ($\wedge$) appears only in the $\chi$ step of the Keccak function and corresponds to the multiplication operation ($\times$) in the field. It can be simplified because the two operands come from the same position of two different polynomials, i.e.
$$\forall b_1, b_2, b_3 \in \{0, 1\}, w_{out}(b_1, b_2, b_3) = w_1(b_1, b_2, b_3) \cdot w_2(b_1, b_2, b_3).$$
Therefore, we proceed differently from traditional GKR. In traditional GKR, we do
$$\tilde{w}_{out}(r_1, r_2, r_3) = \sum_{s_1, s_2, s_3, u_1, u_2, u_3 \in \{0,1\}} \widetilde{MUL}\big((r_1, r_2, r_3), (s_1, s_2, s_3), (u_1, u_2, u_3)\big)\big(\tilde{w}_{in}(s_1, s_2, s_3) \cdot \tilde{w}_{in}(u_1, u_2, u_3) \big).$$
Instead, we have following reduction:
$$\tilde{w}_{out}(r_1, r_2, r_3) = \sum_{s_1, s_2, s_3 \in \{0,1\}} \beta(r_1, s_1) \beta(r_2, s_2) \beta(r_3, s_3) \big( \tilde{w}_a(s_1, s_2, s_3) \cdot \tilde{w}_b(s_1, s_2, s_3) \big),$$
where
$$
\begin{eqnarray}
\beta &:& F \times F \to F \\
\beta(x, y) &=& xy + (1-x)(1-y)
\end{eqnarray}
$$
Running a sumcheck on this relation leads to a reduction of claims as follows, where $(r'_1, r'_2, r'_3)$ is the random point fixed during the sumcheck:
$$\tilde{w}_{out}(r_1, r_2, r_3) = v_1 \rightarrow (\tilde{w}_a(r'_1, r'_2, r'_3) = v_2, \tilde{w}_b(r'_1, r'_2, r'_3) = v_3)$$
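The identity behind this reduction can be checked numerically on a toy instance. The sketch below is plain Python over an illustrative prime field (not the field of the deployed system): it evaluates $\tilde{w}_{out}$ by table folding and compares it against the hypercube sum that the sumcheck actually proves.

```python
from itertools import product
import random

P = 2**61 - 1  # illustrative prime field
n = 3
cube = list(product((0, 1), repeat=n))

def beta(x, y):
    return (x * y + (1 - x) * (1 - y)) % P

def mle_eval_fold(table, point):
    """Evaluate the MLE of `table` (values on {0,1}^n in lexicographic order)
    at `point`, by successively folding the table one variable at a time."""
    for x in point:
        half = len(table) // 2
        table = [((1 - x) * table[i] + x * table[half + i]) % P for i in range(half)]
    return table[0]

# An AND gate applied point-wise on the hypercube.
w_a = {b: random.randint(0, 1) for b in cube}
w_b = {b: random.randint(0, 1) for b in cube}
w_out = {b: w_a[b] & w_b[b] for b in cube}

r = [random.randrange(P) for _ in range(n)]
lhs = mle_eval_fold([w_out[b] for b in cube], r)     # \tilde{w}_out(r)
# On the hypercube the MLEs agree with the tables, so \tilde{w}_a(s) = w_a[s], etc.
rhs = sum(beta(r[0], s[0]) * beta(r[1], s[1]) % P * beta(r[2], s[2]) % P
          * w_a[s] * w_b[s] % P
          for s in cube) % P
assert lhs == rhs   # this is exactly the sum that the full sumcheck proves
```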
Combining the above three observations, the prover's work is greatly simplified for the following reasons:
- The bit-permutation is now free;
- The XOR operation, which dominates the Keccak function, is nearly free. This is because it does not require a sumcheck; the prover and verifier only need to communicate the evaluations of the polynomials;
- The AND operation does not benefit as much as the XOR operation, as it still requires a sumcheck; however, it contributes only a small part of the Keccak function.
In fact, in the whole proof process, we only need one sumcheck for each round of state updates in the Keccak function.
#### The second part
Encoding each logic bit as a standalone polynomial is nearly harmless for the prover, but it significantly increases the verifier's cost. The solution is to combine the claims on the 1,600 polynomials using linear composition. In most cases, this is similar to the original GKR proof system working on a single polynomial over two sets of variables: $\vec{u}\in {\{0,1\}}^{11}$ indexing the logic bits and $\vec{r}\in {\{0,1\}}^{20}$ indexing the parallel instances. The essence of the sumcheck-free claim reduction for XOR and bit-permutation operations is captured by an efficient reduction for the following type of "half-sumcheck":
$$\forall \vec{u} \in F^{11}, \vec{r} \in F^{20}: \quad \tilde{w}_{out}(\vec{u}, \vec{r}) = \sum_{\vec{s} \in \{0,1\}^{11}} \widetilde{M}(\vec{u}, \vec{s}) \cdot \tilde{w}_{in}(\vec{s}, \vec{r}),$$
with the following state transition:
$$\tilde{w}_{out}(\vec{u}, \vec{r}) = v_1 \rightarrow \tilde{w}_{in}(\vec{u}', \vec{r}) = v_2$$
Compared with a normal sumcheck, such a "half"-sumcheck:
- Only refreshes the first set of 11 variables, corresponding to the logic bits ($\vec u \to \vec u'$), while leaving the second set of 20 variables, corresponding to the 1M parallel instances, unchanged.
- Has a proving cost that depends only on the number of logic wires and is independent of the number of parallel instances; i.e., it is on the order of $2^{11}$ instead of $2^{31}$ in our case. This makes the "half"-sumcheck nearly free for the prover (see the toy sketch after this list).
- In this step, we use the values of $\tilde w_{in}(\vec b, \vec r)$ for all $\vec b \in {\{0,1\}}^{11}$ and a fixed $\vec r \in F^{20}$, but we prefer not to compute them from scratch. We can obtain them as a free by-product of the sumcheck immediately before this step, with little post-processing cost (independent of the factor $2^{20}$).
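Here is a toy Python sketch of why the "half"-sumcheck is cheap, with $p = 3$ logic-bit variables instead of 11 and an arbitrary sparse 0/1 map `M` standing in for the XOR/bit-permutation relations (all names, sizes, and field choices here are illustrative): every array touched has size $2^p$, with the 20 instance variables already fixed at $\vec r$, so nothing scales with the $2^{20}$ parallel instances.

```python
import random

P = 2**61 - 1
p = 3                                   # 11 in the real design; 3 keeps the toy small

# T[b] = w_in_tilde(b, r) for all b in {0,1}^p, with r (the 20 instance variables)
# already fixed -- obtained as a by-product of the previous full sumcheck,
# so the 2^20 instance dimension is never touched again here.
T = [random.randrange(P) for _ in range(2**p)]

# M: a 0/1 linear map over logic bits (XORs and bit permutations), stored
# sparsely as {output index: [input indices that are XORed together]}.
M = {0: [1, 2], 1: [0], 2: [2, 3, 5], 3: [4], 4: [6], 5: [7], 6: [0, 6], 7: [3]}

# Values of w_out_tilde(b, r) on the logic-bit hypercube:
# cost O(2^p * row weight), independent of the number of parallel instances.
T_out = [sum(T[j] for j in M[i]) % P for i in range(2**p)]

def fold_eval(table, point):
    """Evaluate the MLE of `table` at `point` by successive folding."""
    for x in point:
        half = len(table) // 2
        table = [((1 - x) * table[i] + x * table[half + i]) % P for i in range(half)]
    return table[0]

u = [random.randrange(P) for _ in range(p)]
v_out = fold_eval(T_out, u)             # the claimed value of w_out_tilde(u, r)
```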
Putting everything together, we can simplify the proving process of each state-update round of the Keccak function into one full sumcheck for the gate $(a, b, c) \Rightarrow a \cdot b + c$ and one almost-free "half"-sumcheck. The full sumcheck covers the "AND" gate in the $\chi$ step, and the "half"-sumcheck covers all the remaining operations (XOR and bit permutation). This is summarized in the following figure:
![image](https://hackmd.io/_uploads/Syn4ChGXyl.png)
The full sumcheck proceeds as follows:
![image](https://hackmd.io/_uploads/rJX_OneXyl.png)
## Software-Hardware Codesign of Keccak Function Implementation
Once the algorithm is fixed, the computational complexity is almost determined. However, in the software implementation (or when mapping to hardware accelerators), there are still constant-factor optimizations that can be pushed further. In the following, we describe some field-agnostic optimizations.
### The polynomials only take 0/1 values on a hypercube
The first opportunity arises in the first round of sumcheck. In a binary circuit, encoding a single wire requires only one bit per instance, while directly storing a field element requires 128 bits. Therefore, from both a memory- and bandwidth-saving perspective, the first round of sumcheck can be optimized: it reads the circuit data in compressed form (e.g., one byte encoding 8 field elements) directly from memory and decompresses it into field elements on-chip for computation.
This idea can be extended to the first four layers of sumcheck, if the hardware supports efficient table lookups with 256 entries (or, in general, to the first $s$ layers, if efficient lookup in a $2^{2^{s-1}}$-entry table is supported). Specifically, for the first four layers of sumcheck, the circuit data in compressed form is read directly from memory, and on-chip SRAM is used for lookup-based decompression into field elements. Assuming the random folding parameters for the first, second, and third layers are $r_1$, $r_2$, and $r_3$, respectively (a sketch of the table construction follows this list):
- **Layer 1**: For each decoded bit, a 2-entry lookup table is used to obtain the field value:
- `table[0] = 0`
- `table[1] = 1`
- **Layer 2**: For each decoded 2-bit value, a 4-entry lookup table is used to obtain the field value:
- `table[x, y] = (1 - r_1) * x + r_1 * y`
- **Layer 3**: For each decoded 4-bit value, a 16-entry lookup table is used to obtain the field value:
- `table[x, y, z, w] = (1 - r_2) * ((1 - r_1) * x + r_1 * y) + r_2 * ((1 - r_1) * z + r_1 * w)`
- **Layer 4**: For each decoded 8-bit value, a 256-entry lookup table is used to obtain the field value:
- `table[x, y, z, w, a, b, c, d] = (1 - r_3) * ((1 - r_2) * ((1 - r_1) * x + r_1 * y) + r_2 * ((1 - r_1) * z + r_1 * w)) + r_3 * ((1 - r_2) * ((1 - r_1) * a + r_1 * b) + r_2 * ((1 - r_1) * c + r_1 * d))`
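The tables above can be built incrementally: each layer's table is the previous layer's table folded with the next challenge. The following Python sketch is purely illustrative (an arbitrary prime field, made-up challenge values, and a bit-ordering convention in which the earlier-listed variable sits in the lower half of the index); it is not the on-chip format.

```python
P = 2**61 - 1  # illustrative prime field

def next_layer_table(prev, r):
    """Fold an m-entry table with challenge r into an m*m-entry table:
    new[hi * m + lo] = (1 - r) * prev[lo] + r * prev[hi]."""
    m = len(prev)
    return [((1 - r) * prev[lo] + r * prev[hi]) % P
            for hi in range(m) for lo in range(m)]

# Layer 1: a single decoded bit maps straight to a field element.
table1 = [0, 1]
# Layers 2-4: fold with the round challenges r1, r2, r3.
r1, r2, r3 = 0x1234, 0x5678, 0x9abc          # stand-ins for Fiat-Shamir challenges
table2 = next_layer_table(table1, r1)        #   4 entries
table3 = next_layer_table(table2, r2)        #  16 entries
table4 = next_layer_table(table3, r3)        # 256 entries

# Decompression: one byte of circuit data = 8 witness bits = one layer-4 element.
def decode_byte(byte):
    return table4[byte]
```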
This approach can significantly reduce memory reads/writes, intermediate storage requirements (which peak after the first SumCheck round), and the computational effort of the folding operations in the SumCheck dynamic programming calculations.
### Optimizing computational pipeline to reduce memory access
Next, we discuss detailed strategies for carefully organizing data reads/writes and computation sequences to minimize memory access.
Algorithmically, a complete sumcheck consists of two alternating phases (assuming the gate has degree $d$ and each of the input polynomials has $k$ variables; $k=20$ in our case):
- **Tic**: The prover/verifier interaction of the $i$-th round of sumcheck, involving:
    - Interpolate the input polynomials to the necessary degree. This results in `|inputs| × (d+1) × 2^{k+1-i}` values.
    - Compute the gate values at the interpolated positions. This results in `|outputs| × (d+1) × 2^{k+1-i}` values.
    - Sum up to obtain `|outputs| × (d+1)` results.
    - Generate the random challenge for the $i$-th round.
- **Toc**: Fold the inputs of the $i$-th round to produce the inputs for the $(i+1)$-th round, which halves the size.
First, within each phase of **Tic**, the operations along the dimension $2^{k+1-i}$ are completely element-wise parallel. These operations can be pipelined rather than storing large intermediate results in memory after each step, thereby reducing memory access costs.
Second, we can consider fusing **Tic-$(i+1)$** with **Toc-$i$** to save memory reads/writes. (Note that **Tic-$i$** cannot be fused with **Toc-$i$** because the latter requires the random challenge.)
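As a concrete reference point, here is a minimal Python sketch of the Tic/Toc structure for a degree-2 product gate over an illustrative prime field (toy sizes, with `random` standing in for Fiat-Shamir). The comments mark which parts are element-wise and therefore pipelineable, and where the Toc fold of round $i$ could be fused with the Tic of round $i+1$.

```python
import random

P = 2**61 - 1  # illustrative prime field
k = 4          # 20 in the real design; small here so the sketch runs instantly

# Input tables for a degree-2 "product" gate: we prove sum_b F[b] * G[b].
F = [random.randrange(P) for _ in range(2**k)]
G = [random.randrange(P) for _ in range(2**k)]
claim = sum(f * g for f, g in zip(F, G)) % P

for i in range(k):
    half = len(F) // 2
    # --- Tic: prover/verifier interaction for round i ---------------------
    # Interpolate each input at X = 0, 1, 2 (a degree-2 gate needs d+1 = 3 points),
    # evaluate the gate, and sum.  All of this is element-wise along `half`
    # entries, so in hardware it is pipelined rather than materialized in memory.
    h = []
    for t in (0, 1, 2):
        acc = 0
        for j in range(half):
            ft = ((1 - t) * F[j] + t * F[half + j]) % P
            gt = ((1 - t) * G[j] + t * G[half + j]) % P
            acc = (acc + ft * gt) % P
        h.append(acc)
    assert (h[0] + h[1]) % P == claim          # verifier's consistency check
    r = random.randrange(P)                    # stands in for a Fiat-Shamir challenge
    # New claim = h(r), via quadratic interpolation through (0,h0), (1,h1), (2,h2).
    inv2 = pow(2, P - 2, P)
    claim = (h[0] * (r - 1) * (r - 2) * inv2
             - h[1] * r * (r - 2)
             + h[2] * r * (r - 1) * inv2) % P
    # --- Toc: fold the tables for round i+1 (halves their size) -----------
    # In an optimized pipeline this fold is fused with the Tic of round i+1,
    # so the folded tables never make a round trip through DRAM.
    F = [((1 - r) * F[j] + r * F[half + j]) % P for j in range(half)]
    G = [((1 - r) * G[j] + r * G[half + j]) % P for j in range(half)]

assert claim == F[0] * G[0] % P                # final check against the inputs' MLEs
```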
## Evaluation Results on Cysic C1 Chip
The main body of the Cysic C1 chip consists of a 64-by-64 array of processing elements (PEs) with massive bandwidth to both on-chip SRAM and off-chip DRAM. The PEs are connected by two sets of interconnects:
1. A **mesh-topology network-on-chip (NoC)** optimized for multidimensional array transpositions with power-of-two dimensions.
2. A **tree-topology network** for broadcasting duplicated data to all PEs.
The control logic is designed for fully pipelined operation in regular dataflows, where data access patterns have simple analytical expressions, while maintaining the flexibility to handle irregular dataflows.
The PE array includes both general-purpose ALUs for standard 32/64-bit arithmetic and bitwise operations, as well as specialized ALUs tailored for cryptographic tasks such as hashing (Blake2, Poseidon, Keccak) and modular arithmetic (ranging from 32-bit to up to 384-bit width). ALUs are categorized based on their bandwidth requirements and computational complexity:
- **Light ALUs**: High bandwidth but low computational complexity (e.g., 64-bit integer operations excluding division and 64-bit field operations) are installed in all PEs.
- **Heavy ALUs**: High computational complexity (e.g., 384-bit field operations) are shared among PEs at a ratio of 4:1, 8:1, or even higher.
Outside the PE array, the C1 chip also features a high-performance CPU to execute arbitrary code, particularly for generating random challenges for Fiat-Shamir transformations. The chip communicates with the host CPU via a USB 3.0 interface. The host can be a server, a PC, or even a laptop, and multiple C1 chips can be utilized simultaneously by a single host.
If we use $s=4$ layers of lookup tables, with $k=20$, $p=11$, and $n=k+p=31$ (1M hashes, or 2G wires proved together), then the cost per layer of Keccak is as follows (the computation cost is normalized to bytes to clearly show the relative cost of computation and memory access; the unit conversion is spelled out after the list):
- Memory access: $1.5 \times 2^{31} \times 128 \, \text{bit} = 48 \, \text{GB},$
- Computation: $12 \times 2^{31} \times 128 \, \text{bit} = 384 \, \text{GB}.$
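For reference, the unit conversion behind these figures (one 128-bit field element is 16 bytes, and GB is read as $2^{30}$ bytes):

$$1.5 \times 2^{31} \times 16\,\text{B} = 1.5 \times 2^{35}\,\text{B} = 48\,\text{GB}, \qquad 12 \times 2^{31} \times 16\,\text{B} = 12 \times 2^{35}\,\text{B} = 384\,\text{GB}.$$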
Putting everything together, the evaluation results are shown in the following table:
| | Cysic C1 Chip | AMD 7950X3D |
| :---------------- | :------: | :------: |
| Performance (Keccak/s) | 1.31M | 15k |
| Process node | TSMC 12nm | TSMC 5nm |
As the GPU implementation is not yet ready, we do not report GPU performance. As mentioned in the introduction, our design of the Keccak proving process on the GKR system can likely improve the proving speed on other hardware platforms as well. Our C1 chip demonstrates strong performance over the CPU and, we expect, over GPU implementations. Although the Cysic C1 chip is fabricated on a much older node than the CPU or GPU hardware, it still outperforms the CPU by a large margin. This strong performance is largely due to its architectural advantages, high-speed memory architecture, and ZK-specific ALU design.
## Conclusion
We are excited to show that the Cysic C1 chip is capable of proving more than enough Keccak functions per second, which paves the way for Ethereum's Verge. The Keccak function is just one example that demonstrates the power of the Cysic C1 chip. We envision that the highly efficient Cysic C1 chip will expand the use cases of ZK technology.