# Hypercube IR: The Foundation of Zero-Knowledge Proof Systems

In this article, we introduce our ongoing work on zero-knowledge proof (ZKP)-specific hardware. At Cysic, we are developing a custom proof acceleration chip designed specifically for ZKP workloads. This chip leverages a non-traditional spatial architecture to maximize parallelism and boost throughput for polynomial computations. Furthermore, multiple chips can be interconnected at high speed to create a scalable multi-core system, significantly enhancing performance and memory capacity to support larger and more complex proof tasks.

## Introduction

Zero-knowledge (ZK) is a cryptographic concept that allows one party (the prover) to prove to another party (the verifier) that a statement is true without revealing any information beyond the truth of the statement itself. This principle is fundamental to privacy-preserving technologies and secure authentication. In the past decade, we have witnessed significant developments in both the theoretical and practical aspects of ZKP technology.

One of the major obstacles in developing high-performance ZKP proving systems is the lack of portability of high-performance code. This lack of portability manifests in several ways:

- It prevents the rapid adoption of new designs from the cryptography community, due to the high cost of rewriting a high-performance backend.
- It prevents existing proving systems from quickly adapting to different hardware, such as CPUs, GPUs (including Nvidia's CUDA, Apple's Metal, and AMD's ROCm), and, more importantly, the upcoming ZKP-specific acceleration chips.
- It makes it difficult for ZKP hardware accelerator manufacturers to promote their products, as hardware companies must bear the huge cost of adapting each proving system's software backend.
Ideally, the ZKP community should have a public domain-specific intermediate representation (IR), akin to TorchScript in the AI community, to serve as a bridge between proving-system development teams and high-performance hardware teams. Here we introduce the Hypercube IR, a minimalistic format tailored for ZKP as well as fully homomorphic encryption (FHE).

## Design Rationale

The workload in the ZKP process cannot be fully represented by combinations of a limited set of basic kernels (excluding special cases like NTT). Therefore, ZKP requires a programmable IR rather than an operator library built primarily on enumeration (such as linear-algebra-based BLAS libraries). For example, certain steps in Plonkish proof systems depend entirely on their constraint polynomials, which are in turn influenced by the set of enabled custom gate types. Another example is in proof systems using SumCheck, where the polynomials being summed often involve customizable multivariate low-degree polynomials with multiple multilinear polynomial inputs. Even for polynomial commitment schemes (PCS) based on Merkle trees using fixed hash functions, implementations vary in the hashing order of data across different projects. These examples highlight the critical importance of programmability in ZKP-specific IRs.

On the other hand, general-purpose processor IRs like LLVM or PTX, while highly flexible, contain excessive low-level details tailored to CPUs or a single GPU architecture. This not only increases the complexity of the upstream programs generating the IR but also creates challenges for other types of hardware (e.g., ZKP-specific chips or parallel computing systems), often yielding counterproductive results. For instance:

- Native modular operations: Some hardware natively supports modular arithmetic as a primitive data type. However, LLVM/PTX might decompose modular operations into sequences of 32/64-bit integer operations.
Downstream hardware compilers then face the difficult task of reverse-engineering these sequences to recognize them as high-level modular operations (e.g., identifying a 256-bit modular multiplication hidden in 100 instructions), which is extremely challenging.
- Parallel architectural optimization: Many hardware platforms employ large-scale hierarchical parallelism and ZKP/FHE-optimized cache/interconnect architectures. This applies across multiple levels: between cores on a chip, chips on a card, cards in a machine, and even across machines. Hardware designers typically optimize their platforms by leveraging regularity in ZKP computation patterns (e.g., NTT or SumCheck) to improve the utilization of data caches and interconnect topologies. However, if LLVM/PTX representations are used for ZKP computation, such regularity becomes obscured by pointer-arithmetic details, rendering these optimizations ineffective.

Thus, a clear conclusion is that ZKP IRs must explicitly expose the regularity in computational data access patterns, as this represents cross-platform, performance-critical information. This can be achieved by constraining data access patterns to a minimal set of well-defined primitives, on the condition that the IR retains sufficient flexibility to meet programmability requirements. The gains from these trade-offs are significant. *The more you want, the more you have to assume.* These constraints unlock optimization opportunities across software and hardware:

- **Hardware Neutrality**: The IR resides at the intersection of computational models that existing heterogeneous hardware can implement efficiently.
- **Compiler Optimization**: Compilers can fully automate data layout management, optimizing for the hardware and computational context while enabling kernel-fusion opportunities.
- **Unlocking Efficient Hardware Architectures**: Hardware can adopt spatially efficient multi-core parallelism, decoupled access-execute (DAE) memory architectures, and statically scheduled network communication using optimization solvers.

## Computational Representation

The primitive operations required in ZKP proof processes can be divided into two categories:

- Category I: computations with regular data access patterns and massive parallelism;
- Category II: highly flexible, irregular computations that are relatively small in total workload.

The purpose of establishing such a classification is to improve the cost-effectiveness of designing high-performance computing systems: if we can demonstrate that the vast majority of computation time is dominated by a small set of regular patterns, then we can focus on optimizing those patterns while relegating the remaining edge-case computations to existing general-purpose systems as a fallback. From this perspective, whether designing custom chips or constructing highly parallel systems from off-the-shelf hardware, identifying such a classification is essential.

Below, we first introduce a taxonomy of the primitives required for ZKP computations, then illustrate their usage and the completeness of their coverage of ZKP functionality through examples, described in a mathematical rather than code-centric manner. In identifying these primitives, we prioritize minimizing the number of primitive types needed to cover most ZKP operations: if an operation can be decomposed into combinations of simpler primitives, it is not classified as a primitive itself. While real-world hardware implementations often benefit from merging multiple primitives into a single step for efficiency (commonly termed kernel fusion), this is considered an internal compiler optimization detail; the primitive abstraction interface should remain conceptually simple. In the following, we list the primitives in Categories I and II.
For ease of reading, we use different colors to represent the different primitives.

### Category I

#### <font color=blue>Pointwise Operations on Multidimensional Arrays</font>

This primitive can be described mathematically as follows:

$$\left(O_1(x_1,..., x_N),..., O_U(x_1,..., x_N)\right) = f\left(p_1(x_1,..., x_N),..., p_S(x_1,..., x_N)\right),$$

where $f: \mathbb{F}_q^S \rightarrow \mathbb{F}_q^U$ is a function with $S$ inputs and $U$ outputs, $(p_1,..., p_S)$ is a collection of $N$-dimensional arrays, and $(O_1,..., O_U)$ is a collection of $N$-dimensional arrays. The function $f$ here can be

- A set of $S$-arity polynomials over $\mathbb{F}_q$, typically derived from
  - constraint polynomials;
  - NTT butterfly operations;
  - SumCheck on complex polynomials.
- Cryptographic hash functions, such as
  - binary hash functions (e.g., Keccak/SHA), where $\mathbb{F}_q$ elements are cast to binary data;
  - native $\mathbb{F}_q$ hash functions (e.g., the Poseidon family), primarily used in Merkle tree construction.

An example that uses this primitive is the NTT butterfly operation, where $(U'_1(x_1,..., x_N), V'_1(x_1,..., x_N))$ is computed as

$$\left(U_1(x_1,..., x_N) + V_1(x_1,..., x_N),\ T(x_1,..., x_N)(U_1(x_1,..., x_N) - V_1(x_1,..., x_N)) \right).$$

#### Regular Data Movement: <font color=orange>Hypercube Transformations</font>

This primitive can be described as

$$O(x_1,..., x_N) = p(expr_1,..., expr_M),$$

where each $expr_i$ can be a constant 0 or 1, $x_i$ (direct indexing), or $1 - x_i$ (complement indexing).
An example using this primitive is the first layer of the NTT computation, where we need to take all even/odd elements of the input as separate arrays:

$$ U_1(x_1,..., x_{N-1}) = p_0(x_1,..., x_{N-1}, 0). $$

#### Regular Data Movement: <font color=green>1-Dimensional Cyclic Shifts</font>

The next primitive can be described by the following equation:

$$O(x_1,..., x_N; y_1,..., y_M) = p(x_1,..., x_N; y_{i_1},..., y_{i_M}),$$

where $\text{binary}(y_{i_1},..., y_{i_M}) = \left(\text{binary}(y_1,..., y_M) + \text{offset}\right) \bmod 2^M$, and $\text{offset}$ is an integer specifying the distance of the cyclic shift. Here the input $p$ is an $\mathbb{F}_q$ array of $2^{M + N}$ elements, and the output $O$ is an $\mathbb{F}_q$ array of $2^{M + N}$ elements. This primitive is used in the permutation constraint in Plonk, as we illustrate below.

#### <font color=cyan>Array Concatenation</font>

The concatenation primitive is described as

$$O(x_1,..., x_N; y_1,..., y_M) = p_{\text{binary}(y_1,..., y_M)}(x_1,..., x_N),$$

where $\text{binary}(y_1,..., y_M)$ converts a boolean array $(y_1,..., y_M)$ into the boolean string $y_1y_2\cdots y_M$. Here the inputs $p_0,..., p_{2^M - 1}$ are $2^M$ arrays of size $2^N$ to be concatenated, and the output $O$ is an $\mathbb{F}_q$ array of $2^{M+N}$ elements. This concatenation primitive is used in the first layer of the NTT operation.

#### Irregular Data Movement: <font color=purple>Grouped Gather</font>

The grouped gather primitive can be expressed as

$$O(x_1,..., x_L; y_1,..., y_M; z_1,..., z_N) = p(x_1,..., x_L; y_1,..., y_M; z'_1,..., z'_{N'}),$$

where $\text{binary}(z'_1,..., z'_{N'}) = \text{IndexTable}(\text{binary}(y_1,..., y_M, z_1,..., z_N))$. The function $\text{IndexTable}$ is a table of size $2^{M+N}$ with values ranging from $0$ to $2^{N'} - 1$. The input array $p \in \mathbb{F}_q^{2^{L+M+N'}}$ is the source array to be looked up from, while the output $O \in \mathbb{F}_q^{2^{L+M+N}}$ is the destination array.
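To make the Category I primitives concrete, here is a minimal Python sketch of our own (not part of the IR specification): an array over $\mathbb{F}_q$ is modeled as a flat list whose index has binary expansion $(x_1,..., x_N)$ with $x_1$ the most significant bit, matching the $\text{binary}$ convention above. The modulus `Q = 97`, the function names, and the tuple encoding of the index expressions are all illustrative choices.

```python
# Minimal model: a 2**N-element F_q array is a flat list whose index has
# binary expansion (x_1, ..., x_N), with x_1 the most significant bit.
Q = 97  # toy modulus for illustration; real systems use large ZKP-friendly primes


def pointwise(f, *arrays):
    """Pointwise Operations: apply f independently at every index."""
    return [f(*vals) for vals in zip(*arrays)]


def hypercube_transform(p, n_out, exprs):
    """Hypercube Transformations: O(x_1..x_N) = p(expr_1..expr_M), where
    each expr is 0, 1, ('x', i) for x_i, or ('nx', i) for 1 - x_i."""
    out = []
    for idx in range(2 ** n_out):
        x = [(idx >> (n_out - 1 - k)) & 1 for k in range(n_out)]  # (x_1..x_N)
        src = 0
        for e in exprs:
            if e in (0, 1):
                bit = e                     # constant indexing
            elif e[0] == 'x':
                bit = x[e[1] - 1]           # direct indexing
            else:
                bit = 1 - x[e[1] - 1]       # complement indexing
            src = src * 2 + bit
        out.append(p[src])
    return out


def cyclic_shift(p, m_bits, offset):
    """1-Dimensional Cyclic Shifts: within each block of 2**m_bits trailing
    indices, read from position (y + offset) mod 2**m_bits."""
    m = 2 ** m_bits
    return [p[i - i % m + ((i % m) + offset) % m] for i in range(len(p))]


def concat(parts):
    """Array Concatenation: O(x; y) = p_binary(y)(x); the trailing bits y
    select the input array, so the parts are interleaved."""
    m = len(parts)
    return [parts[i % m][i // m] for i in range(m * len(parts[0]))]


def grouped_gather(p, index_table):
    """Grouped Gather (single group, L = 0): O[i] = p[IndexTable[i]]."""
    return [p[j] for j in index_table]


# Extract the even elements of p (fix x_3 = 0), as in the first NTT layer.
p = list(range(8))
evens = hypercube_transform(p, 2, [('x', 1), ('x', 2), 0])  # -> [0, 2, 4, 6]
```

Fixing the last coordinate to 0 selects every even-indexed element, exactly the $U_1(x_1,..., x_{N-1}) = p_0(x_1,..., x_{N-1}, 0)$ split used in the NTT example.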
### Category II

#### <font color=pink>Flexible Low-Volume Computations</font>

There is only one primitive in Category II, which handles irregular computations. The primitive is described as

$$(O_1,..., O_U) = f(p_1,..., p_S),$$

where $f$ is a CPU-executable function, such as a hash function for Fiat-Shamir transforms or Merkle tree path verification in FRI-IOP. A typical example using this primitive is the Fiat-Shamir step following Merkle tree construction in FRI:

$$(\text{NewState}, \text{Challenges}) := \text{FiatShamir}(\text{OldState}, L_{N, A}, L_{N, B}, L_{N, C}, L_{N, D}).$$

## Running Examples

In this part, we show how ZKP operations can be constructed from the primitives defined above.

### NTT DIF Operation

In the NTT DIF operation, the input $p_0$ is an array of $2^N$ elements in $\mathbb{F}_q$ and the output $O$ is an array of $2^N$ elements in $\mathbb{F}_q$; there is also a parameter table called the twiddle factors. Each layer of the NTT computation corresponds to a butterfly operation. During this process, the $2^N$-sized array is split into two $2^{N-1}$-sized arrays, followed by butterfly operations. The butterfly operation itself can be expressed as element-wise operations, while <font color=orange>Hypercube Transformations</font> combined with <font color=cyan>Array Concatenation</font> provide the data layout conversions these operations lack. The first round of the NTT operation is described by the following four equations:

$$U_1(x_1,..., x_{N-1}) = p_0(x_1,..., x_{N - 1}, 0), \quad V_1(x_1,..., x_{N-1}) = p_0(x_1,..., x_{N - 1}, 1).$$

The first two equations can be expressed using <font color=orange>Hypercube Transformations</font>.
The third equation, expressed using <font color=blue>Pointwise Operations on Multidimensional Arrays</font>, is

$$U'_1(x_1,..., x_{N-1}) = U_1(x_1,..., x_{N-1}) + V_1(x_1,..., x_{N-1}),$$

and

$$V'_1(x_1,..., x_{N - 1}) = T(x_1,..., x_{N - 1})(U_1(x_1,..., x_{N-1}) - V_1(x_1,..., x_{N-1})).$$

The last equation is expressed using <font color=cyan>Array Concatenation</font> as

$$p_1(x_1,..., x_N) = [U'_1, V'_1]_{x_N}(x_1,..., x_{N-1}).$$

It is worth mentioning that the flow illustrated above may look suboptimal from a hardware execution perspective (especially to experienced GPU programmers): for example, it creates many temporary intermediate arrays solely for data layout adjustments instead of rewriting the index expressions in subsequent operations. However, such temporaries are not difficult for the platform-specific compiler to eliminate afterward when suitable.

### Merkle Tree

Merkle tree computation is used in polynomial commitment schemes (PCS); it takes $(p_1,..., p_{4+8S})$, each an element of $\mathbb{F}_q^{2^N}$, and outputs the Merkle root along with a sampled path via the Fiat-Shamir transformation. In Plonky2, it targets matrix-shaped witness data (with $2^N$ rows and $4 + 8S$ columns) by first hashing each row into a fixed-size digest ($4\times 64$-bit field elements), then generating a conventional Merkle tree from the resulting single column of digests. First, it hashes the first 12 columns using the Poseidon hash function $H$:

$$(A_1(x_1,..., x_N), B_1(x_1,..., x_N), C_1(x_1,..., x_N), D_1(x_1,..., x_N)) = H(p_1(x_1,..., x_N),..., p_{12}(x_1,..., x_N)),$$

where this step can be carried out by <font color=blue>Pointwise Operations on Multidimensional Arrays</font>.
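Putting the four equations together, one DIF layer can be sketched in a few lines of Python. This is our own toy model: the modulus `Q = 17`, the function name, and the unit twiddle table are illustrative, and we follow the even/odd split written above (a textbook DIF implementation instead splits the array into its first and second halves).

```python
Q = 17  # toy prime for illustration; real NTTs use large NTT-friendly moduli


def ntt_dif_layer(p, twiddles):
    """One butterfly layer expressed with the three primitives:
    even/odd split (Hypercube Transformations), butterfly (Pointwise
    Operations), and reassembly (Array Concatenation on x_N)."""
    # U_1(x) = p_0(x, 0), V_1(x) = p_0(x, 1): even / odd elements
    u, v = p[0::2], p[1::2]
    # Pointwise butterfly: U' = U + V, V' = T * (U - V)
    u2 = [(a + b) % Q for a, b in zip(u, v)]
    v2 = [(t * (a - b)) % Q for a, b, t in zip(u, v, twiddles)]
    # p_1(x_1..x_N) = [U', V']_{x_N}(x_1..x_{N-1}): interleave on the last bit
    out = [0] * len(p)
    out[0::2], out[1::2] = u2, v2
    return out


# With unit twiddles on [1, 2, 3, 4]: U = [1, 3], V = [2, 4],
# so U' = [3, 7] and V' = [(1 - 2) % Q, (3 - 4) % Q] = [16, 16].
print(ntt_dif_layer([1, 2, 3, 4], [1, 1]))  # -> [3, 16, 7, 16]
```

As noted above, the temporaries `u`, `v`, `u2`, `v2` exist only for layout bookkeeping; a platform-specific compiler would fuse them away.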
Second, we hash the next 8 columns to obtain $(A_2(x_1,..., x_N), B_2(x_1,..., x_N), C_2(x_1,..., x_N), D_2(x_1,..., x_N))$ as

$$H(A_1(x_1,..., x_N),..., D_1(x_1,..., x_N), p_{13}(x_1,..., x_N),..., p_{20}(x_1,..., x_N)),$$

again using <font color=blue>Pointwise Operations on Multidimensional Arrays</font>. The hash computation continues in this way until the last 8 columns, where $(A_S(x_1,..., x_N), B_S(x_1,..., x_N), C_S(x_1,..., x_N), D_S(x_1,..., x_N))$ is computed as

$$H(A_{S-1}(x_1,..., x_N),..., D_{S-1}(x_1,..., x_N), p_{5+8(S-1)}(x_1,..., x_N),..., p_{12+8(S-1)}(x_1,..., x_N)).$$

Then, using <font color=orange>Hypercube Transformations</font>, we can fold the length in half:

$$L_{1, A, 0}(x_2,..., x_N) = A_S(0, x_2,..., x_N), \quad L_{1, A, 1}(x_2,..., x_N) = A_S(1, x_2,..., x_N),$$

and $L_{1, B, 0}, L_{1, B, 1}, L_{1, C, 0}, L_{1, C, 1}, L_{1, D, 0}, L_{1, D, 1}$ can be computed similarly. These values are then hashed to obtain $(L_{1, A}, L_{1, B}, L_{1, C}, L_{1, D})$ as

$$H(L_{1, A, 0}, L_{1, A, 1}, L_{1, B, 0}, L_{1, B, 1}, L_{1, C, 0}, L_{1, C, 1}, L_{1, D, 0}, L_{1, D, 1}).$$

The hash computation continues until we obtain $(L_{N, A}, L_{N, B}, L_{N, C}, L_{N, D})$ as the Merkle root. Finally, <font color=pink>Flexible Low-Volume Computations</font> is used to sample the Merkle path:

$$(\text{NewState}, \text{Challenges}) := \text{FiatShamir}(\text{OldState}, L_{N, A}, L_{N, B}, L_{N, C}, L_{N, D}).$$

### SumCheck (Dense Case)

SumCheck is commonly used in the GKR and HyperPlonk proof systems. For GKR proofs on data-parallel circuits, or for HyperPlonk's Plonkish-style handling of irregular circuits, most computations can be represented as dense operations.
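The row-hash-then-fold structure above can be sketched as follows. This is a heavily simplified model of our own, not Plonky2's implementation: SHA-256 stands in for the Poseidon hash $H$, a digest is a single integer rather than a 4-element $(A, B, C, D)$ tuple, and the chunked absorption of 12-then-8 columns is collapsed into one hash call per row. All names are illustrative.

```python
import hashlib


def h(*vals):
    """Stand-in for the Poseidon hash H; a real system uses a native F_q
    permutation, not SHA-256 (illustrative substitution)."""
    data = b"".join(v.to_bytes(32, "big") for v in vals)
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def merkle_root(columns):
    """Hash each row of the witness matrix into a digest (Pointwise
    Operations), then repeatedly split the digest column on the leading
    bit (Hypercube Transformations) and hash the paired halves, until a
    single root remains."""
    n_rows = len(columns[0])  # must be a power of two
    layer = [h(*(col[r] for col in columns)) for r in range(n_rows)]
    while len(layer) > 1:
        half = len(layer) // 2
        # L_{k,0}(x_2..x_N) = L(0, x_2..x_N), L_{k,1}(x_2..x_N) = L(1, x_2..x_N)
        lo, hi = layer[:half], layer[half:]
        layer = [h(a, b) for a, b in zip(lo, hi)]
    return layer[0]


root = merkle_root([[1, 2, 3, 4], [5, 6, 7, 8]])  # 4 rows, 2 columns
```

Note that splitting on the leading bit $x_1$ pairs row $k$ with row $k + 2^{N-1}$, exactly as the $L_{1, A, 0} / L_{1, A, 1}$ fold above prescribes.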
Although the full SumCheck process is unsuitable for diagrammatic representation due to its large number of rounds, its computational building blocks can be categorized intuitively as follows:

- Weighted folding: First, we use <font color=orange>Hypercube Transformations</font> to compute $$U(x_1,..., x_{N-1}) := p_0(x_1,..., x_{N-1}, 0), \quad V(x_1,..., x_{N-1}) := p_0(x_1,..., x_{N-1}, 1).$$ Then <font color=blue>Pointwise Operations on Multidimensional Arrays</font> is used to get $$O(x_1,..., x_{N-1}) := \alpha\times U(x_1,..., x_{N-1}) + \beta \times V(x_1,..., x_{N-1}).$$
- Interpolation of multivariate polynomial values: First, we use <font color=orange>Hypercube Transformations</font> to compute $$U(x_1,..., x_{N-1}) := p_0(x_1,..., x_{N-1}, 0), \quad V(x_1,..., x_{N-1}) := p_0(x_1,..., x_{N-1}, 1).$$ Then <font color=blue>Pointwise Operations on Multidimensional Arrays</font> is used to get $$(O_1(x_1,..., x_{N-1}),..., O_d(x_1,..., x_{N-1})) := f(U(x_1,..., x_{N-1}), V(x_1,..., x_{N-1})).$$
- Index reordering: This is used in HyperPlonk to compute $$p(1, x_1,..., x_N) - p(x_1,..., x_N, 0) \times p(x_1,..., x_N, 1).$$ The above formula can be implemented by creating three intermediates using <font color=orange>Hypercube Transformations</font>: $$p' := p(1, x_1,..., x_N), \quad p'':= p(x_1,..., x_N, 0), \quad p''' := p(x_1,..., x_N, 1).$$

### Sparse Computation

Some proof systems use sparse polynomials to represent irregular circuits. These systems skip zero-element computations during proving via <font color=purple>Grouped Gather</font> operations that aggregate the non-zero elements into dense arrays, enabling subsequent dense-style computation. For instance, in GKR's SumCheck phase, the weighted folding of a sparse polynomial $p$,

$$O(x_1,..., x_{N-1}) := \alpha\times p(x_1,..., x_{N-1}, 0) + \beta \times p(x_1,..., x_{N-1}, 1),$$

is computed on a dense array of size $2^{\lceil \log_2 |O| \rceil}$, where $|O|$ denotes the number of non-zero entries of $O$.
Let $R = \lceil \log_2 |O| \rceil$; then we have

$$[O](x_1,..., x_R) := \alpha\times A(x_1,..., x_R) + \beta \times B(x_1,..., x_R),$$

where $[O]$, $A$ and $B$ are dense arrays of $2^R$ elements. $[O]$ denotes the dense representation of the sparse polynomial $O$, while $A$ and $B$ are auxiliary arrays obtained via the <font color=purple>Grouped Gather</font> primitive from the dense representation $[p]$ of $p$, i.e.

$$A(x_1,..., x_R) := [p](y_1,..., y_K),$$

where the tuple $(y_1,..., y_K) \in \{0,1\}^K$ is defined as the unique value such that $\text{binary}(y_1,..., y_K) = \text{IndexTable}_1(\text{binary}(x_1,..., x_R))$; a similar computation yields $B$ with $\text{IndexTable}_2$. Here $\text{IndexTable}_1$ and $\text{IndexTable}_2$ are pre-computed $2^R$-element tables specifying, for each dense position, the corresponding source index in $[p]$.

### Permutation Constraint (Plonk)

In almost every Plonkish proof system, it is required to prove that two arrays contain identical element sets, which is typically transformed into a product-equality proof over the array elements. This process often involves cyclic access patterns:

- Prefix-Product Computation: Given the array $[a_0, a_1, \ldots, a_{2^N - 1}]$, we compute the (exclusive) prefix product $[1, a_0, a_0 a_1, \ldots, \prod_{i<k} a_i, \ldots]$. In other words, we compute $O$ from $p$ such that $$ O(x_1, x_2,..., x_N) := \begin{cases} 1, & \text{if } (x_1, \ldots, x_N) = (0, \ldots, 0),\\ p(x'_1, x'_2, \ldots, x'_N) \times O(x'_1, x'_2, \ldots, x'_N), & \text{otherwise,} \end{cases} $$ where $(x'_1,..., x'_N) \in \{0,1\}^N$ is defined as the unique value such that $\text{binary}(x'_1,..., x'_N) = \text{binary}(x_1,..., x_N) - 1$.
The array $O$ can be computed recursively in logarithmically many rounds in its size. Note that the recursion splits on the last index bit $x_N$ (even/odd positions), so that each product in $T$ pairs adjacent elements and the prefix products compose correctly:
- Step 1: Compute $$\begin{aligned} U(x_1,..., x_{N-1}) &:= p(x_1,..., x_{N-1}, 0),\\ V(x_1,..., x_{N-1}) &:= p(x_1,..., x_{N-1}, 1),\\ T(x_1,..., x_{N-1}) &:= U(x_1,..., x_{N-1}) \times V(x_1,..., x_{N-1}). \end{aligned}$$
- Step 2: Recursively compute the prefix product $C$ of the half-sized array $T$; we obtain $C$ with values $$ C(x_1,..., x_{N-1}) := \begin{cases} 1, & \text{if } (x_1,..., x_{N-1}) = (0,..., 0),\\ T(x'_1,..., x'_{N-1}) \times C(x'_1,..., x'_{N-1}), & \text{otherwise,} \end{cases} $$ where $(x'_1,..., x'_{N-1}) \in \{0,1\}^{N-1}$ is the unique value such that $\text{binary}(x'_1,..., x'_{N-1}) = \text{binary}(x_1,..., x_{N-1}) - 1$.
- Step 3: Finish with $$\begin{aligned} U'(x_1,..., x_{N-1}) &:= 1 \times C(x_1,..., x_{N-1}),\\ V'(x_1,..., x_{N-1}) &:= U(x_1,..., x_{N-1}) \times C(x_1,..., x_{N-1}),\\ O(x_1,..., x_N) &:= [U', V']_{x_N}(x_1,..., x_{N-1}). \end{aligned}$$
- Data Rotation: For univariate polynomial expressions like $g(x) - h(x)g(\omega\cdot x)$, where $x$ iterates over the set $\Omega = \{\omega^k \mid k \in \mathbb{N}\}$, we implement cyclic access via an offset-1 <font color=green>1-Dimensional Cyclic Shift</font> when storing $g$ and $h$ in multi-dimensional arrays: $$U(x_1,..., x_N) := g(x'_1,..., x'_N),$$ where $(x'_1,..., x'_N) \in \{0,1\}^N$ is the unique value such that $\text{binary}(x'_1,..., x'_N) = (\text{binary}(x_1,..., x_N) + 1) \bmod 2^N$.
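The three-step prefix-product recursion can be checked with a short Python sketch of our own (the modulus `Q = 97` and the function name are illustrative). Under the $\text{binary}$ convention used throughout, the split must pair even/odd positions, i.e. fix the last index bit; this is what makes the recursion reproduce the exclusive prefix product.

```python
Q = 97  # toy modulus for illustration


def prefix_product(p):
    """Exclusive prefix product O[k] = prod_{i < k} p[i] mod Q, via the
    logarithmic recursion: pair even/odd elements, recurse on the
    half-sized products, then interleave the results."""
    if len(p) == 1:
        return [1]
    # Step 1: U = even elements, V = odd elements, T = U * V (pointwise)
    u, v = p[0::2], p[1::2]
    t = [(a * b) % Q for a, b in zip(u, v)]
    # Step 2: recursive prefix product of the half-sized array T
    c = prefix_product(t)
    # Step 3: U' = 1 * C, V' = U * C, then interleave on the last bit
    out = [0] * len(p)
    out[0::2] = c
    out[1::2] = [(a * cc) % Q for a, cc in zip(u, c)]
    return out


print(prefix_product([2, 3, 4, 5]))  # -> [1, 2, 6, 24]
```

Each level performs one Hypercube Transformation pair, one pointwise multiply, and one concatenation, so an array of $2^N$ elements needs $N$ rounds of these primitives.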