### **Technical Report: Performance Benchmark of bn256 Implementations in the SP1 zkVM**
This report details a performance comparison of different Rust libraries for `bn256` elliptic curve operations within the Succinct SP1 Zero-Knowledge Virtual Machine (zkVM). The experiment benchmarks the performance of three distinct cryptographic libraries (`substrate_bn`, `crypto-bigint`, and `arkworks`) and quantifies the impact of specialized SP1 precompiles on execution efficiency.
**Date:** August 8, 2025
**Platform:** SP1 zkVM
**Hardware:** Apple MacBook Pro (M3 Max)
### **Objective**
The primary goal of this expanded experiment is to benchmark the performance difference between:
* The standard `substrate_bn` crate.
* A modified `substrate_bn` that uses the `U256` type from `crypto-bigint`.
* The `ark_bn254` crate from the `arkworks` ecosystem.
This comparison is conducted with and without SP1's specialized precompiles to measure their impact. The key performance metric is the total cycle count required to execute the guest program.
### **Methodology**
A test program simulating a tripartite Diffie-Hellman key exchange was used to create a consistent workload involving common elliptic curve operations like scalar multiplication and pairings.
#### **Guest Program Logic**
The core logic, executed inside the zkVM, performs the following steps in a loop:
1. Generates three private keys ($sk_A$, $sk_B$, $sk_C$).
2. Calculates the corresponding public keys in groups $G_1$ and $G_2$ via scalar multiplication (e.g., $pk_{A1} = G_1 \cdot sk_A$).
3. Computes a shared secret using a combination of pairing operations and exponentiation in the target group $G_T$. For example, Alice computes her shared secret as $ss_A = e(pk_{B1}, pk_{C2})^{sk_A}$.
4. Asserts that all three computed shared secrets are equal, verifying the correctness of the cryptography.
```rust
// General guest program for the bn256 tripartite key exchange.
// Types and `pairing` come from the `substrate_bn` crate (exposed as `bn`);
// `init_rands_bn_batched` is a host-supplied helper yielding random scalars.
use bn::{pairing, Fr, Group, G1, G2};

pub fn general_guest_program() {
    let rands = init_rands_bn_batched();
    for rand in rands {
        // Generate private keys
        let alice_sk = rand;
        let bob_sk = rand + Fr::one();
        let carol_sk = bob_sk + Fr::one();

        // Generate public keys in G1 and G2
        let (alice_pk1, alice_pk2) = (G1::one() * alice_sk, G2::one() * alice_sk);
        let (bob_pk1, bob_pk2) = (G1::one() * bob_sk, G2::one() * bob_sk);
        let (carol_pk1, carol_pk2) = (G1::one() * carol_sk, G2::one() * carol_sk);

        // Each party computes the shared secret
        let alice_ss = pairing(bob_pk1, carol_pk2).pow(alice_sk);
        let bob_ss = pairing(carol_pk1, alice_pk2).pow(bob_sk);
        let carol_ss = pairing(alice_pk1, bob_pk2).pow(carol_sk);
        assert!(alice_ss == bob_ss && bob_ss == carol_ss);
    }
}
```
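The final assertion holds by bilinearity of the pairing: each party's exponentiation collapses to the same element of $G_T$:

$$
ss_A = e(pk_{B1}, pk_{C2})^{sk_A} = e(G_1 \cdot sk_B,\ G_2 \cdot sk_C)^{sk_A} = e(G_1, G_2)^{sk_A\, sk_B\, sk_C},
$$

and symmetrically for $ss_B$ and $ss_C$, so all three shared secrets equal $e(G_1, G_2)^{sk_A\, sk_B\, sk_C}$.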
#### **Experimental Configurations**
Six distinct guest programs were benchmarked:
1. **bn-pairing**: Uses the standard `substrate_bn` crate without precompile optimizations.
2. **bigint-pairing**: Uses a modified `substrate_bn` crate that relies on `crypto-bigint` for `U256` operations, also without precompiles.
3. **ark-pairing**: Uses the `ark_bn254` crate from `arkworks` without precompile optimizations.
4. **bn-pairing-patched**: Identical to `bn-pairing` but enables SP1's "fat" `bn` precompile, which accelerates the full suite of `bn256` operations.
5. **bigint-pairing-patched**: Identical to `bigint-pairing` but enables a generic `bigint` precompile to accelerate low-level `U256` arithmetic.
6. **ark-pairing-patched**: Identical to the `ark-pairing` guest program, but with the bigint operations swapped for `sp1::mul_mod` precompile calls wherever they can be applied, which should cut the execution cycle count.
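The "patched" variants are typically produced by overriding the dependency in the guest's `Cargo.toml` with a precompile-aware fork from SP1's `sp1-patches` organization. A sketch of the mechanism (the exact repository and tag names here are illustrative, not verbatim):

```toml
# Cargo.toml of the guest program (repository/tag names illustrative)
[patch.crates-io]
substrate-bn = { git = "https://github.com/sp1-patches/bn", tag = "substrate_bn-v0.6.0-patch-v1" }
```

With the patch in place, the guest source is unchanged; the fork routes the hot operations through SP1 syscalls at build time.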
### **Results**
The execution cycle counts and times for each configuration are summarized below. **Lower cycle counts indicate higher performance.**
| Configuration | Base Crate | Precompile Enabled | Cycle Count | Execution Time |
| :--- | :--- | :--- | :--- | :--- |
| `bn-pairing` | `substrate_bn` | No | 1,105,498,339 | 26.8 s |
| `bigint-pairing` | `crypto-bigint` | No | 1,523,558,068 | 26.0 s |
| `ark-pairing` | `ark_bn254` | No | **428,207,591** | **7.96 s** |
| `bigint-pairing-patched` | `crypto-bigint` | Yes (`bigint`) | 518,877,400 | 11.8 s |
| `ark-pairing-patched` | `ark_bn254` | Yes (`ff-bigint`) | **422,122,898** | **2.16 s** |
| `bn-pairing-patched` | `substrate_bn` | Yes (`bn`) | **40,014,404** | **2.16 s** |
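The speedup figures quoted in the analysis below follow directly from these cycle counts; a small stdlib-only Rust check of the arithmetic:

```rust
// Speedup ratios recomputed from the cycle counts in the results table.
fn ratio(slower: u64, faster: u64) -> f64 {
    slower as f64 / faster as f64
}

fn main() {
    const BN_PAIRING: u64 = 1_105_498_339;
    const BIGINT_PAIRING: u64 = 1_523_558_068;
    const ARK_PAIRING: u64 = 428_207_591;
    const BIGINT_PATCHED: u64 = 518_877_400;
    const BN_PATCHED: u64 = 40_014_404;

    println!("ark vs bn:              {:.1}x", ratio(BN_PAIRING, ARK_PAIRING));
    println!("ark vs bigint:          {:.1}x", ratio(BIGINT_PAIRING, ARK_PAIRING));
    println!("bn precompile gain:     {:.1}x", ratio(BN_PAIRING, BN_PATCHED));
    println!("bigint precompile gain: {:.1}x", ratio(BIGINT_PAIRING, BIGINT_PATCHED));
    println!("ark vs bigint-patched:  {:.1}x", ratio(BIGINT_PATCHED, ARK_PAIRING));
}
```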
### **Analysis & Discussion**
The results provide clear insights into the performance characteristics of the libraries and the effectiveness of different optimization strategies within the zkVM.
#### **Performance Without Precompiles**
In the standard RISC-V execution environment, the choice of library has a profound impact on performance:
* The **`ark-pairing`** program (428M cycles) was the clear winner, proving to be approximately **2.6x faster** than `bn-pairing` (1.1B cycles) and **3.5x faster** than `bigint-pairing` (1.5B cycles). This indicates that the `arkworks` implementation of `bn254` is highly optimized for this workload, even without zkVM-specific accelerations.
* The original `substrate_bn` implementation was ~38% faster than the version modified to use `crypto-bigint`, suggesting its native integer arithmetic is more efficient for this use case.
#### **The Power of Precompiles: Specialization vs. Generalization**
The impact of precompiles is dramatic and highlights a key tradeoff between specialized and generalized acceleration.
* **"Fat" `bn` Precompile:** Enabling the `bn` precompile (`bn-pairing-patched`) reduced the cycle count from 1.1B to just **40 million**, a staggering **~27.6x performance improvement**. This demonstrates the immense value of a high-level, "fat" precompile that accelerates not just integer math, but the entire elliptic curve group and field logic.
* **Generic `bigint` Precompile:** The generic `bigint` precompile (`bigint-pairing-patched`) also provided a benefit, cutting cycles from 1.5B to 519M—a **~2.9x improvement**. However, its impact is limited because it only accelerates the fundamental `U256` arithmetic, leaving the complex and expensive curve-specific logic to be executed as regular, less efficient instructions.
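To make the narrow scope of the generic precompile concrete, the following toy sketch (a 64-bit stand-in for `U256`, purely illustrative) shows the modular-multiplication hot path that a `bigint`-style precompile replaces. Everything above this layer—the field towers, curve arithmetic, and Miller loop—still executes as ordinary RISC-V instructions:

```rust
/// Toy modular multiplication: the only operation a generic bigint
/// precompile accelerates (shown at 64-bit width for illustration;
/// the real precompile operates on 256-bit integers).
fn mul_mod(a: u64, b: u64, m: u64) -> u64 {
    ((a as u128 * b as u128) % m as u128) as u64
}

fn main() {
    // A single pairing performs a huge number of these multiplications...
    assert_eq!(mul_mod(7, 9, 10), 3);
    // ...but the curve and field logic wrapping them is not accelerated.
    println!("mul_mod(7, 9, 10) = {}", mul_mod(7, 9, 10));
}
```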
#### **Key Insight: Optimized Library vs. Generic Precompile**
The most surprising and crucial finding of this expanded research is the comparison between `ark-pairing` and `bigint-pairing-patched`.
* The `ark-pairing` program, with **no precompiles** (428M cycles), was **~1.2x faster** than the `bigint-pairing-patched` program (519M cycles), which **used a precompile**.
This result strongly suggests that a well-optimized, general-purpose cryptographic library (`arkworks`) can be more performant than a less-optimized library that is only aided by generic, low-level precompiles. High-level algorithmic optimizations within the library itself proved more valuable than just accelerating the underlying integer math.
### **Conclusion**
This benchmark analysis leads to two primary conclusions for developers building ZK applications on SP1:
1. **Specialized Precompiles are Supreme:** The most effective path to performance is using specialized, high-level precompiles. The SP1 `bn` precompile delivered an order-of-magnitude performance gain that is unmatched by any other method. For applications involving `bn256`, using this precompile is essential for efficiency.
2. **Library Choice is Critical:** In the absence of a specialized precompile, the choice of the underlying cryptographic library is paramount. A highly optimized, general-purpose library like **`arkworks`** can provide significant performance benefits, even outperforming other libraries that have been augmented with generic, low-level arithmetic precompiles.
Therefore, the recommended strategy for developers is to **prioritize specialized, high-level precompiles** whenever available. If such a precompile does not exist for a required cryptographic primitive, selecting a modern, highly-optimized library is the next best step for achieving performant and practical ZK programs.
_Report structured by an LLM (Gemini)... :)_