This article is a follow-up to our earlier post on the hardware-friendliness of HyperPlonk, expanding on claims made about the Sumcheck protocol. In that post, we argued that the performance bottleneck in Sumcheck stems not from raw computation, but from memory access patterns.
This is a critical distinction: ideally, cryptographic protocols like Sumcheck should be compute-bound rather than memory-bound, since modern hardware—particularly GPUs and specialized accelerators—is optimized for high-throughput arithmetic but often struggles when constrained by memory bandwidth.
This memory-bound behavior is especially evident in the use of Number Theoretic Transforms (NTTs), a fundamental building block for polynomial operations in Sumcheck. Although NTTs are highly parallelizable, they require frequent and structured access to large arrays of field elements, placing significant pressure on memory subsystems and cache hierarchies. As a result, even when the underlying arithmetic is relatively inexpensive, memory access overhead can dominate overall performance.
Addressing this limitation calls for optimized memory layouts, hardware-aware scheduling strategies, and rethinking systems design to push hardware toward its computational limits, not its memory ceilings.
This follow-up aims to empirically validate our earlier claim by profiling the Sumcheck implementations within ICICLE. Through this analysis, we offer concrete evidence that memory access is indeed the dominant bottleneck, and provide insights that can inform future optimization and hardware acceleration strategies for Sumcheck and similar protocols.
The Sumcheck protocol is typically used to prove a claim about the sum of a polynomial evaluated over a multi-dimensional domain (often the boolean hypercube {0,1}^d). In other words, the prover aims to convince the verifier that the sum of a polynomial P(x1, x2, …, xd), evaluated over all possible values of the variables x1, x2, …, xd, equals a specific value. More formally, the claim can be described as:

$$\sum_{(x_1, x_2, \ldots, x_d) \in \{0,1\}^d} P(x_1, x_2, \ldots, x_d) = C,$$

where C is the claimed sum.
The Sumcheck protocol is an iterative process with two primary calculations at each iteration (round), sketched in code below:
1. The prover evaluates the round polynomial: a low-degree univariate polynomial in the current variable, obtained by summing over the remaining variables.
2. The verifier generates a challenge, which is used to fold the polynomial: the current variable is fixed to the challenge value, halving the polynomial's size for the next round.
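To make these two calculations concrete, here is a minimal CUDA sketch of a single round under simplifying assumptions: one multilinear polynomial (so the round polynomial has degree 1 and two evaluation points suffice), a toy 31-bit prime field, and kernel names of our own choosing rather than ICICLE's.

```cuda
#include <cstdint>

// Toy 31-bit prime field to keep the sketch self-contained; real
// implementations such as ICICLE's work over much larger fields.
constexpr uint64_t P = 2147483647ULL; // 2^31 - 1

// Calculation 1: evaluate the round polynomial. With a single multilinear
// polynomial it has degree 1, so S(0) = sum f(2i) and S(1) = sum f(2i+1)
// determine it. Terms are < 2^31, so a 64-bit accumulator holds up to
// 2^33 of them without overflow; reduce modulo P once at the end.
// (A real kernel would use a hierarchical reduction rather than global
// atomics; atomics keep the sketch short.)
__global__ void round_poly_eval(const uint64_t* f, size_t half_n,
                                unsigned long long* s0,
                                unsigned long long* s1) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i >= half_n) return;
  atomicAdd(s0, f[2 * i]);
  atomicAdd(s1, f[2 * i + 1]);
}

// Calculation 2: fold with the verifier's challenge r, fixing one variable
// and halving the polynomial: f'(i) = f(2i) + r * (f(2i+1) - f(2i)).
__global__ void fold(const uint64_t* f, uint64_t* f_next, size_t half_n,
                     uint64_t r) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i >= half_n) return;
  uint64_t a = f[2 * i], b = f[2 * i + 1];
  f_next[i] = (a + r * ((b + P - a) % P)) % P; // products fit in 64 bits
}
```

Note the memory profile: run as two separate passes, these kernels read every element of the polynomial twice per round (once to evaluate, once to fold) while performing only a few arithmetic operations per element. This ratio is the heart of the memory-bound argument, and the fused variant described later removes the redundant pass.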
Plonk, a widely used SNARK in the industry, relies heavily on NTTs for its computations. HyperPlonk, a variant of Plonk, aims to improve parallelization by, among other changes, replacing NTTs with the Sumcheck protocol.
However, our previous blog post challenges this assumption. We argue that the supposed advantages of Sumcheck’s parallelizability in HyperPlonk may be overstated. The core issue is that the primary bottleneck in Sumcheck stems from memory access, not computation.
Our Sumcheck implementations on CPU and GPUs (both CUDA and Metal), introduced in ICICLE V3.5, use the Fiat-Shamir heuristic to transform the protocol into a non-interactive form, eliminating the need for round-based communication.
Beyond supporting a fixed set of operations, these implementations allow user-defined products of multilinear polynomials through a high-level abstraction called the Program class. This class enables developers to define custom element-wise lambda functions over vectors, which are then compiled into optimized backend code for the target hardware. ICICLE V3.5 shipped with built-in programs, such as the EQ(AB-C) combine function used in the experiments below, while still allowing full flexibility to specify more complex constraints. The system is designed with hardware-agnostic optimization in mind, taking advantage of multi-threading on CPUs and parallelism on GPUs to minimize memory overhead and maximize arithmetic throughput. These improvements are part of ICICLE's broader effort to make cryptographic primitives both portable and performant across architectures.
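To illustrate the idea (as a hedged analogue, not ICICLE's actual Program API), a user-defined combine function can be thought of as a device functor that the compiler inlines into a generic element-wise kernel; the functor and kernel names below are ours:

```cuda
#include <cstdint>

constexpr uint64_t P = 2147483647ULL; // same toy prime field as above

// Combine function in the spirit of EQ(AB-C): evaluated pointwise over
// the four input vectors, with all arithmetic modulo P.
struct EqTimesABMinusC {
  __device__ uint64_t operator()(uint64_t eq, uint64_t a,
                                 uint64_t b, uint64_t c) const {
    uint64_t ab = (a * b) % P;            // a, b < 2^31, product fits
    return (eq * ((ab + P - c) % P)) % P; // eq * (a*b - c) mod P
  }
};

// Generic element-wise kernel, analogous in role to a compiled Program:
// the functor is inlined, so a custom combine adds arithmetic but no
// extra passes over memory.
template <typename Combine>
__global__ void combine_kernel(const uint64_t* eq, const uint64_t* a,
                               const uint64_t* b, const uint64_t* c,
                               uint64_t* out, size_t n, Combine f) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) out[i] = f(eq[i], a[i], b[i], c[i]);
}
```

A launch looks like combine_kernel<<<blocks, threads>>>(d_eq, d_a, d_b, d_c, d_out, n, EqTimesABMinusC{}). Each output element costs four reads and one write against only a handful of modular multiplications, which already hints at the memory-heavy ratio profiled below.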
The CUDA implementation structures the computation rounds slightly differently. In each round, it computes both the folding from the previous round and the round polynomial evaluations for the current one, allowing us to eliminate unnecessary memory accesses. Notably, in the first round, only the round polynomial is evaluated—no folding is performed. This design results in the most memory-efficient approach we’re aware of, performing only the memory operations strictly required for folding.
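A hedged sketch of this fused structure follows (our simplification over the toy field from the earlier sketch, not the actual ICICLE kernel). The challenge is read through a device pointer, anticipating the device-resident Fiat-Shamir flow described next:

```cuda
#include <cstdint>

constexpr uint64_t P = 2147483647ULL; // toy prime field, as above

// Fused round j (j >= 1): fold round j-1's polynomial with challenge r
// and accumulate round j's evaluations in the same pass, so every input
// element is read from DRAM exactly once.
__global__ void fold_and_eval(const uint64_t* f, uint64_t* f_next,
                              size_t half_n, const uint64_t* r,
                              unsigned long long* s0,
                              unsigned long long* s1) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i >= half_n) return;
  // Fold: the only reads of f in this round.
  uint64_t a = f[2 * i], b = f[2 * i + 1];
  uint64_t folded = (a + *r * ((b + P - a) % P)) % P;
  f_next[i] = folded;
  // Evaluate: the folded value's index parity decides whether it
  // contributes to S_j(0) (even positions) or S_j(1) (odd positions),
  // so no second pass over f_next is ever issued.
  if (i & 1) atomicAdd(s1, folded);
  else       atomicAdd(s0, folded);
}
```

In the first round only the evaluation half runs (there is no previous challenge to fold with), which is also why compute throughput roughly doubles between the first and second rounds in the measurements below.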
The Fiat-Shamir challenges and round polynomial accumulations are computed directly on the device, eliminating the need to transfer data back to the host after each round and then send the results back to the device. This avoids costly round-trip data movement and improves overall throughput.
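Put together, the host loop might look like the following sketch, reusing the kernels above. Here derive_challenge is a deliberately fake stand-in for the device-side Fiat-Shamir hash (it is not cryptographic and not ICICLE's transcript logic); its only purpose is to show that challenges are produced where the data lives, with a single device-to-host copy at the very end:

```cuda
#include <cstdint>
#include <utility>

constexpr uint64_t P = 2147483647ULL; // toy prime field, as above

// Stand-in for the device-side Fiat-Shamir step: NOT a real hash.
__global__ void derive_challenge(const unsigned long long* s0,
                                 const unsigned long long* s1,
                                 uint64_t* r) {
  *r = ((*s0 % P) * 31 + (*s1 % P) + 7) % P;
}

// Host loop: folding, evaluation, and challenge derivation all stay on
// the GPU. d_buf needs n/2 elements; d_sums holds one (S_j(0), S_j(1))
// pair per round and is assumed zero-initialized.
void prove_on_device(uint64_t* d_f, uint64_t* d_buf, size_t n,
                     unsigned long long* d_sums, uint64_t* d_r,
                     int rounds, unsigned long long* h_proof) {
  const int T = 256;
  for (int j = 0; j < rounds; ++j) {
    unsigned long long* d_s = d_sums + 2 * j;
    size_t len = n >> (j == 0 ? 1 : j); // threads needed this round
    int B = (int)((len + T - 1) / T);
    if (j == 0) { // first round: evaluation only, no folding
      round_poly_eval<<<B, T>>>(d_f, len, d_s, d_s + 1);
    } else {      // fold with r_{j-1} and evaluate S_j in one pass
      fold_and_eval<<<B, T>>>(d_f, d_buf, len, d_r + (j - 1), d_s, d_s + 1);
      std::swap(d_f, d_buf);
    }
    derive_challenge<<<1, 1>>>(d_s, d_s + 1, d_r + j);
  }
  // The only device-to-host transfer: the accumulated round polynomials.
  cudaMemcpy(h_proof, d_sums, 2 * rounds * sizeof(unsigned long long),
             cudaMemcpyDeviceToHost);
}
```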
We will use this implementation to examine the claim that Sumcheck is memory-bound rather than compute-bound.
We used NVIDIA's Nsight Compute to profile ICICLE’s CUDA implementation and identify performance bottlenecks, running the analysis on an NVIDIA L4 GPU. The profiler provides detailed metrics for each CUDA kernel, including runtime, memory bandwidth, and register usage. To determine the protocol’s limiting factor, we focus on the Speed of Light (SoL) throughput metrics for both memory and compute. The SoL metric indicates how close a given sub-unit is operating to its theoretical maximum throughput, helping us assess whether the bottleneck is due to memory access or computation.
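For reference, a run along these lines reproduces the measurement (the binary name is a placeholder for your own benchmark; the two metric names are the compute and memory SoL throughput metrics reported by recent Nsight Compute versions):

```sh
# Profile the first 13 kernel launches and report the Speed of Light
# throughput for the SM (compute) and the memory subsystem.
ncu --launch-count 13 \
    --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed \
    ./sumcheck_bench
```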
For the purpose of this check we used four polynomials of size 2^25 (a size in the range utilized by HyperPlonk) with the combine function EQ(AB-C) (R1CS zerocheck).
The chart below illustrates the compute and memory SoL throughput for the primary CUDA kernel in ICICLE’s Sumcheck implementation. The data covers the first 13 rounds of the protocol, providing insight into performance trends over the course of execution.
The data reveals important insights into the relationship between memory and compute throughput during the initial rounds of execution. Notably, memory SoL throughput operates near peak capacity in these early stages, while compute SoL throughput remains relatively low in comparison.
A program is considered memory-bound when limitations in memory throughput prevent it from fully utilizing available compute resources. As the chart indicates, while memory operates near peak capacity, compute resources remain underutilized—clearly pointing to a memory bottleneck.
Additionally, as the rounds progress, a key pattern emerges: while memory throughput gradually declines, compute throughput remains steady—and even shows a slight increase. The observed doubling of compute throughput between the first and second rounds stems from the structure of the protocol, as previously explained. Notably, compute throughput only begins to decrease once memory throughput drops below 60%, reinforcing the conclusion that memory remains the primary limiting factor. The chart below explains these findings.
The L1 hit rate increases as memory throughput decreases.
As the rounds progress, the polynomials shrink in size and begin to fit within the cache, resulting in higher cache hit rates. This reduces memory throughput, as more data can be accessed directly from the cache rather than from slower main memory. Consequently, compute throughput increases due to the faster data access enabled by caching.
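To put rough numbers on this (assuming 32-byte field elements, e.g., a 256-bit field): after r folding rounds, each of the four polynomials holds 2^(25-r) elements, for a combined working set of 4 × 2^(25-r) × 32 B = 2^(32-r) bytes. That is 4 GB at the start, drops below the L4's 48 MB L2 cache after about seven rounds, and shrinks to 512 KB by round 13.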
The increase in compute utilization that accompanies faster memory access is a strong indication that the program is memory-bound.
The evidence clearly points to memory access as the primary bottleneck in the system. This is supported by the consistently high memory SoL throughput alongside low compute utilization, as well as the observed behavior of compute throughput and L1 cache hit rates: as memory utilization decreases across rounds, compute performance improves—further reinforcing the memory-bound nature of the workload.
In other words, the system’s overall performance is limited by memory access speed rather than computational throughput.
Empirical evidence suggests that the Sumcheck protocol is inherently memory-bound—its performance is limited more by memory access speeds than by computational power. As a result, optimization efforts should focus on improving memory efficiency rather than purely enhancing computation.
Focusing on memory optimization can yield significant performance gains in Sumcheck implementations—even on hardware with limited computational resources.