Benchmark Hash in SNARK

Use cases

Aggregation for hash-based signatures
- Optimistic requirement: 500k hash/s ^[1]
STARKed binary hash tree
- Worst-case requirement: 100k hash/s ^[2]
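
The optimistic figure follows directly from the scenario in footnote 1 (8k signatures, 250 hashes per signature, 4 seconds). A minimal sketch of that arithmetic, with variable names chosen here only for illustration:

```python
# Figures from footnote 1: 8k signatures, 250 hashes each, proven within 4 seconds.
signatures = 8_000
hashes_per_signature = 250
window_seconds = 4

# Required proving throughput in hashes per second.
required_rate = signatures * hashes_per_signature / window_seconds
print(f"{required_rate:,.0f} hash/s")  # 500,000 hash/s
```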

Requirements

Post-quantum friendly
Provable soundness
Not-too-complex to write spec

Benchmark candidates

plonky3 - FRI + AIR
- Blake3 circuit, Keccak circuit and Poseidon2 circuit are implemented
stwo - FRI + AIR
- Blake2s circuit and Poseidon2 circuit are implemented
binius
- Keccak circuit and Vision circuit are implemented
hashcaster - GKR on binary field
- Keccak circuit is implemented but no PCS connected yet
expander - GKR
- Keccak circuit and Poseidon circuit are implemented

Benchmarks

Machine spec
- CPU: i9-13900K
- RAM: 64GB (2x Crucial DDR5 5600 32GB Dual Channel)
- OS: Linux 5.15.0-124-generic x86-64
- Env:
  - RAYON_NUM_THREADS: 4
Code: https://github.com/han0110/bench-hash-in-snark/tree/3398a8c
Output
- perm: Number of permutations proven (2^10 to 2^20)
- time: Average proving time over 10 samples
- throughput: Average throughput over 10 samples (= perm / time)
- proof_size: Average proof size over 10 samples
- peak_mem: Peak memory usage, measured by running the proving command under /usr/bin/time (if the process is killed due to OOM, the row is filled with -)
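
As a quick check of how the throughput column relates to perm and time, here is a small sketch. The helper names `throughput` and `format_rate` are hypothetical and not taken from the repository. The tables report the average of per-sample throughputs, so recomputing from the rounded average time can differ in the last digits.

```python
def throughput(log_perm: int, seconds: float) -> float:
    """Permutations proven per second for 2**log_perm permutations."""
    return (1 << log_perm) / seconds


def format_rate(per_second: float) -> str:
    """Render a rate in the same style as the tables (K/s above 1000)."""
    if per_second >= 1000:
        return f"{per_second / 1000:.2f} K/s"
    return f"{per_second:.2f} /s"


# plonky3 blake3 at 2^10 permutations took 59.02 ms on average.
print(format_rate(throughput(10, 59.02e-3)))  # 17.35 K/s
```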

Every implementation proves a single permutation call with a fixed initial state. If the actual input is longer than the block length of the hash function, the implementation needs additional constraints that depend on the construction, but their cost should be negligible compared to the permutation.

plonky3

FRI parameters:

Field size: 124 bits (degree-4 extension of a 31-bit field)
Rate: ¹/₂
Number of queries: 256
Poseidon2 parameters (notation following https://eprint.iacr.org/2023/323.pdf)
- $n = 31,\ t = 16,\ d = 3,\ R_{F} = 8,\ R_{P} = 20$
- $p = 2^{31} - 2^{24} + 1$ (KoalaBear)

The 124-bit field is too small to reach 128-bit provable soundness; a field of roughly 192 bits is needed, but no such option is currently available.
This makes the throughput below a bit optimistic.
The number of queries also needs to increase slightly to reach 128-bit provable soundness exactly, but it won't be far from the current value.
This makes the proof_size below a bit optimistic.
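
One way to see where 256 queries comes from is a common back-of-the-envelope estimate that credits each FRI query with about $\tfrac{1}{2}\log_2(1/\rho)$ bits of soundness for rate $\rho$. This estimate is an assumption made here for illustration and is not stated by the benchmark code; it ignores the field-size-dependent error terms, which is why the field size and query count caveats above still apply. The function name is hypothetical; `log_inv_rate` mirrors the meaning of the `PCS_LOG_INV_RATE` variable used in the commands.

```python
import math


def queries_for_bits(security_bits: int, log_inv_rate: int) -> int:
    """Queries needed if each query contributes 0.5 * log2(1/rate) bits.

    Simplified estimate (an assumption): only the query-phase term is counted.
    """
    bits_per_query = 0.5 * log_inv_rate
    return math.ceil(security_bits / bits_per_query)


print(queries_for_bits(128, 1))  # rate 1/2 -> 256
```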

To reproduce, get into the code repository and run:

for i in $(seq 10 20); do
    RAYON_NUM_THREADS=4 ./bench.sh plonky3 blake3 $i
    RAYON_NUM_THREADS=4 ./bench.sh plonky3 keccak $i
    RAYON_NUM_THREADS=4 ./bench.sh plonky3 poseidon2 $i
done;
python3 render_table.py

`hash`	`perm`	`time`	`throughput`	`proof_size`	`peak_mem`
`blake3`	`2¹⁰`	`59.02 ms`	`17.35 K/s`	`9.85 MB`	`100.58 MB`
`blake3`	`2¹¹`	`120.67 ms`	`16.97 K/s`	`9.96 MB`	`170.68 MB`
`blake3`	`2¹²`	`242.09 ms`	`16.92 K/s`	`10.08 MB`	`316.95 MB`
`blake3`	`2¹³`	`483.01 ms`	`16.96 K/s`	`10.20 MB`	`605.91 MB`
`blake3`	`2¹⁴`	`982.55 ms`	`16.67 K/s`	`10.33 MB`	`1.16 GB`
`blake3`	`2¹⁵`	`2.06 s`	`15.88 K/s`	`10.47 MB`	`2.29 GB`
`blake3`	`2¹⁶`	`4.35 s`	`15.06 K/s`	`10.62 MB`	`4.54 GB`
`blake3`	`2¹⁷`	`10.11 s`	`12.96 K/s`	`10.77 MB`	`9.06 GB`
`blake3`	`2¹⁸`	`23.10 s`	`11.35 K/s`	`10.93 MB`	`18.09 GB`
`blake3`	`2¹⁹`	`49.49 s`	`10.59 K/s`	`11.10 MB`	`36.15 GB`
`blake3`	`2²⁰`	`-`	`-`	`-`	`-`

`keccak`	`2¹⁰`	`587.01 ms`	`2.33 K/s`	`3.89 MB`	`692.65 MB`
`keccak`	`2¹¹`	`1.20 s`	`2.27 K/s`	`4.04 MB`	`1.34 GB`
`keccak`	`2¹²`	`2.50 s`	`2.19 K/s`	`4.19 MB`	`2.66 GB`
`keccak`	`2¹³`	`5.06 s`	`2.16 K/s`	`4.35 MB`	`5.31 GB`
`keccak`	`2¹⁴`	`10.74 s`	`2.03 K/s`	`4.52 MB`	`10.61 GB`
`keccak`	`2¹⁵`	`23.03 s`	`1.90 K/s`	`4.70 MB`	`21.20 GB`
`keccak`	`2¹⁶`	`54.17 s`	`1.61 K/s`	`4.89 MB`	`42.39 GB`
`keccak`	`2¹⁷`	`-`	`-`	`-`	`-`
`keccak`	`2¹⁸`	`-`	`-`	`-`	`-`
`keccak`	`2¹⁹`	`-`	`-`	`-`	`-`
`keccak`	`2²⁰`	`-`	`-`	`-`	`-`

`poseidon2`	`2¹⁰`	`3.61 ms`	`283.96 K/s`	`1.68 MB`	`44.39 MB`
`poseidon2`	`2¹¹`	`4.50 ms`	`455.39 K/s`	`1.76 MB`	`44.80 MB`
`poseidon2`	`2¹²`	`6.04 ms`	`677.63 K/s`	`1.85 MB`	`45.20 MB`
`poseidon2`	`2¹³`	`9.86 ms`	`830.99 K/s`	`1.95 MB`	`44.31 MB`
`poseidon2`	`2¹⁴`	`16.98 ms`	`965.01 K/s`	`2.06 MB`	`44.86 MB`
`poseidon2`	`2¹⁵`	`34.77 ms`	`942.38 K/s`	`2.17 MB`	`53.09 MB`
`poseidon2`	`2¹⁶`	`78.66 ms`	`833.12 K/s`	`2.30 MB`	`96.97 MB`
`poseidon2`	`2¹⁷`	`161.45 ms`	`811.83 K/s`	`2.43 MB`	`184.11 MB`
`poseidon2`	`2¹⁸`	`327.44 ms`	`800.58 K/s`	`2.57 MB`	`359.19 MB`
`poseidon2`	`2¹⁹`	`647.94 ms`	`809.16 K/s`	`2.71 MB`	`708.80 MB`
`poseidon2`	`2²⁰`	`1.34 s`	`779.93 K/s`	`2.87 MB`	`1.38 GB`

Same FRI parameters, but with rate = ¹/₄

To reproduce, get into the code repository and run:

for i in $(seq 10 20); do
    RAYON_NUM_THREADS=4 PCS_LOG_INV_RATE=2 ./bench.sh plonky3 blake3 $i
    RAYON_NUM_THREADS=4 PCS_LOG_INV_RATE=2 ./bench.sh plonky3 keccak $i
    RAYON_NUM_THREADS=4 PCS_LOG_INV_RATE=2 ./bench.sh plonky3 poseidon2 $i
done;
python3 render_table.py

`hash`	`perm`	`time`	`throughput`	`proof_size`	`peak_mem`
`blake3`	`2¹⁰`	`95.24 ms`	`10.75 K/s`	`5.10 MB`	`167.62 MB`
`blake3`	`2¹¹`	`184.65 ms`	`11.09 K/s`	`5.16 MB`	`311.15 MB`
`blake3`	`2¹²`	`375.61 ms`	`10.90 K/s`	`5.22 MB`	`600.53 MB`
`blake3`	`2¹³`	`766.45 ms`	`10.69 K/s`	`5.29 MB`	`1.15 GB`
`blake3`	`2¹⁴`	`1.58 s`	`10.34 K/s`	`5.36 MB`	`2.28 GB`
`blake3`	`2¹⁵`	`3.30 s`	`9.94 K/s`	`5.43 MB`	`4.54 GB`
`blake3`	`2¹⁶`	`6.98 s`	`9.39 K/s`	`5.51 MB`	`9.06 GB`
`blake3`	`2¹⁷`	`16.11 s`	`8.14 K/s`	`5.59 MB`	`18.08 GB`
`blake3`	`2¹⁸`	`37.18 s`	`7.05 K/s`	`5.67 MB`	`36.14 GB`
`blake3`	`2¹⁹`	`-`	`-`	`-`	`-`
`blake3`	`2²⁰`	`-`	`-`	`-`	`-`

`keccak`	`2¹⁰`	`942.43 ms`	`1.45 K/s`	`2.04 MB`	`1.34 GB`
`keccak`	`2¹¹`	`1.93 s`	`1.42 K/s`	`2.12 MB`	`2.66 GB`
`keccak`	`2¹²`	`3.95 s`	`1.38 K/s`	`2.20 MB`	`5.30 GB`
`keccak`	`2¹³`	`8.07 s`	`1.35 K/s`	`2.28 MB`	`10.60 GB`
`keccak`	`2¹⁴`	`17.35 s`	`1.26 K/s`	`2.37 MB`	`21.18 GB`
`keccak`	`2¹⁵`	`37.05 s`	`1.18 K/s`	`2.46 MB`	`42.35 GB`
`keccak`	`2¹⁶`	`-`	`-`	`-`	`-`
`keccak`	`2¹⁷`	`-`	`-`	`-`	`-`
`keccak`	`2¹⁸`	`-`	`-`	`-`	`-`
`keccak`	`2¹⁹`	`-`	`-`	`-`	`-`
`keccak`	`2²⁰`	`-`	`-`	`-`	`-`

`poseidon2`	`2¹⁰`	`4.03 ms`	`254.37 K/s`	`903.14 KB`	`44.68 MB`
`poseidon2`	`2¹¹`	`4.41 ms`	`464.28 K/s`	`950.17 KB`	`44.64 MB`
`poseidon2`	`2¹²`	`7.68 ms`	`533.45 K/s`	`0.98 MB`	`44.70 MB`
`poseidon2`	`2¹³`	`12.51 ms`	`654.95 K/s`	`1.03 MB`	`44.86 MB`
`poseidon2`	`2¹⁴`	`27.83 ms`	`588.72 K/s`	`1.09 MB`	`52.23 MB`
`poseidon2`	`2¹⁵`	`59.70 ms`	`548.83 K/s`	`1.15 MB`	`95.39 MB`
`poseidon2`	`2¹⁶`	`123.37 ms`	`531.22 K/s`	`1.22 MB`	`182.63 MB`
`poseidon2`	`2¹⁷`	`249.28 ms`	`525.81 K/s`	`1.29 MB`	`357.43 MB`
`poseidon2`	`2¹⁸`	`500.62 ms`	`523.64 K/s`	`1.36 MB`	`706.47 MB`
`poseidon2`	`2¹⁹`	`1.02 s`	`514.07 K/s`	`1.44 MB`	`1.37 GB`
`poseidon2`	`2²⁰`	`2.09 s`	`502.81 K/s`	`1.52 MB`	`2.74 GB`

stwo

FRI parameters:

Field size: 124 bits (degree-4 extension of a 31-bit field)
Rate: ¹/₂
Number of queries: 256
Poseidon2 parameters (notation following https://eprint.iacr.org/2023/323.pdf)
- $n = 31,\ t = 16,\ d = 5,\ R_{F} = 8,\ R_{P} = 14$
- $p = 2^{31} - 1$ (Mersenne31)

The 124-bit field is too small to reach 128-bit provable soundness; a field of roughly 192 bits is needed, but no such option is currently available.
This makes the throughput below a bit optimistic.
The number of queries also needs to increase slightly to reach 128-bit provable soundness exactly, but it won't be far from the current value.
This makes the proof_size below a bit optimistic.

To reproduce, get into the code repository and run:

for i in $(seq 10 20); do
    RAYON_NUM_THREADS=4 ./bench.sh stwo blake2s $i
    RAYON_NUM_THREADS=4 ./bench.sh stwo poseidon2 $i
done;
python3 render_table.py

`hash`	`perm`	`time`	`throughput`	`proof_size`	`peak_mem`
`blake2s`	`2¹⁰`	`807.93 ms`	`1.27 K/s`	`3.28 MB`	`772.36 MB`
`blake2s`	`2¹¹`	`898.52 ms`	`2.28 K/s`	`3.28 MB`	`853.77 MB`
`blake2s`	`2¹²`	`1.06 s`	`3.85 K/s`	`3.32 MB`	`1018.40 MB`
`blake2s`	`2¹³`	`1.43 s`	`5.73 K/s`	`3.30 MB`	`1.31 GB`
`blake2s`	`2¹⁴`	`1.90 s`	`8.62 K/s`	`3.43 MB`	`2.04 GB`
`blake2s`	`2¹⁵`	`3.10 s`	`10.58 K/s`	`3.57 MB`	`3.49 GB`
`blake2s`	`2¹⁶`	`5.35 s`	`12.24 K/s`	`3.67 MB`	`6.40 GB`
`blake2s`	`2¹⁷`	`10.02 s`	`13.08 K/s`	`3.79 MB`	`12.21 GB`
`blake2s`	`2¹⁸`	`19.57 s`	`13.40 K/s`	`3.96 MB`	`23.83 GB`
`blake2s`	`2¹⁹`	`-`	`-`	`-`	`-`
`blake2s`	`2²⁰`	`-`	`-`	`-`	`-`

`poseidon2`	`2¹⁰`	`7.28 ms`	`140.66 K/s`	`892.29 KB`	`36.17 MB`
`poseidon2`	`2¹¹`	`12.66 ms`	`161.80 K/s`	`1.15 MB`	`36.61 MB`
`poseidon2`	`2¹²`	`19.98 ms`	`204.98 K/s`	`1.25 MB`	`36.62 MB`
`poseidon2`	`2¹³`	`45.35 ms`	`180.63 K/s`	`1.40 MB`	`40.01 MB`
`poseidon2`	`2¹⁴`	`85.72 ms`	`191.14 K/s`	`1.53 MB`	`75.77 MB`
`poseidon2`	`2¹⁵`	`137.20 ms`	`238.84 K/s`	`1.63 MB`	`147.96 MB`
`poseidon2`	`2¹⁶`	`289.18 ms`	`226.63 K/s`	`1.70 MB`	`291.55 MB`
`poseidon2`	`2¹⁷`	`523.06 ms`	`250.59 K/s`	`1.79 MB`	`579.75 MB`
`poseidon2`	`2¹⁸`	`1.03 s`	`253.99 K/s`	`1.91 MB`	`1.13 GB`
`poseidon2`	`2¹⁹`	`1.94 s`	`269.86 K/s`	`2.00 MB`	`2.25 GB`
`poseidon2`	`2²⁰`	`2.44 s`	`429.44 K/s`	`2.12 MB`	`4.50 GB`

Same FRI parameters, but with rate = ¹/₄

To reproduce, get into the code repository and run:

for i in $(seq 10 20); do
    RAYON_NUM_THREADS=4 PCS_LOG_INV_RATE=2 ./bench.sh stwo blake2s $i
    RAYON_NUM_THREADS=4 PCS_LOG_INV_RATE=2 ./bench.sh stwo poseidon2 $i
done;
python3 render_table.py

`hash`	`perm`	`time`	`throughput`	`proof_size`	`peak_mem`
`blake2s`	`2¹⁰`	`1.35 s`	`756.10 /s`	`1.81 MB`	`1.53 GB`
`blake2s`	`2¹¹`	`1.49 s`	`1.38 K/s`	`1.81 MB`	`1.67 GB`
`blake2s`	`2¹²`	`1.74 s`	`2.36 K/s`	`1.81 MB`	`1.93 GB`
`blake2s`	`2¹³`	`2.24 s`	`3.65 K/s`	`1.81 MB`	`2.46 GB`
`blake2s`	`2¹⁴`	`2.92 s`	`5.61 K/s`	`1.88 MB`	`3.82 GB`
`blake2s`	`2¹⁵`	`4.70 s`	`6.97 K/s`	`1.94 MB`	`6.67 GB`
`blake2s`	`2¹⁶`	`7.92 s`	`8.28 K/s`	`2.01 MB`	`12.37 GB`
`blake2s`	`2¹⁷`	`-`	`-`	`-`	`-`
`blake2s`	`2¹⁸`	`-`	`-`	`-`	`-`
`blake2s`	`2¹⁹`	`-`	`-`	`-`	`-`
`blake2s`	`2²⁰`	`-`	`-`	`-`	`-`

`poseidon2`	`2¹⁰`	`7.11 ms`	`144.09 K/s`	`721.87 KB`	`36.41 MB`
`poseidon2`	`2¹¹`	`11.88 ms`	`172.43 K/s`	`746.97 KB`	`36.50 MB`
`poseidon2`	`2¹²`	`21.57 ms`	`189.87 K/s`	`810.15 KB`	`36.62 MB`
`poseidon2`	`2¹³`	`32.85 ms`	`249.40 K/s`	`857.73 KB`	`36.67 MB`
`poseidon2`	`2¹⁴`	`48.68 ms`	`336.57 K/s`	`899.84 KB`	`63.93 MB`
`poseidon2`	`2¹⁵`	`126.33 ms`	`259.38 K/s`	`947.25 KB`	`123.99 MB`
`poseidon2`	`2¹⁶`	`216.96 ms`	`302.07 K/s`	`987.39 KB`	`244.27 MB`
`poseidon2`	`2¹⁷`	`398.75 ms`	`328.71 K/s`	`1.02 MB`	`484.21 MB`
`poseidon2`	`2¹⁸`	`819.53 ms`	`319.87 K/s`	`1.07 MB`	`964.48 MB`
`poseidon2`	`2¹⁹`	`1.64 s`	`319.43 K/s`	`1.15 MB`	`1.88 GB`
`poseidon2`	`2²⁰`	`2.30 s`	`455.03 K/s`	`1.20 MB`	`3.76 GB`

binius

FRI parameters:

Field size: 128 bits
Rate: ¹/₂
Number of queries: 241
Provable soundness: 100 bits

The 128-bit field is too small to reach 128-bit provable soundness; a field of roughly 150 bits is needed, but no such option is currently available.
This makes the throughput below a bit optimistic.
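
The pairing of 241 queries with 100 bits at rate ¹/₂ is consistent with a simple unique-decoding estimate in which each query fails to catch a cheating prover with probability $(1 + \rho)/2$. The sketch below assumes that simplified model (the full analysis in https://eprint.iacr.org/2024/504 has more terms), and the function name is hypothetical. Under the same model, rate ¹/₄ with 96 bits gives the 142 queries mentioned in the comments.

```python
import math


def queries_unique_decoding(security_bits: int, rate: float) -> int:
    """Queries needed when each query has soundness error (1 + rate) / 2.

    Simplified model (an assumption): other error terms are ignored.
    """
    per_query_error = (1 + rate) / 2
    bits_per_query = -math.log2(per_query_error)
    return math.ceil(security_bits / bits_per_query)


print(queries_unique_decoding(100, 1 / 2))  # 241
print(queries_unique_decoding(96, 1 / 4))   # 142
```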

To reproduce, get into the code repository and run:

for i in $(seq 10 20); do
    RAYON_NUM_THREADS=4 ./bench.sh binius keccak $i
    RAYON_NUM_THREADS=4 ./bench.sh binius vision $i
done;
python3 render_table.py

`hash`	`perm`	`time`	`throughput`	`proof_size`	`peak_mem`
`keccak`	`2¹⁰`	`372.85 ms`	`2.75 K/s`	`429.68 KB`	`153.11 MB`
`keccak`	`2¹¹`	`734.93 ms`	`2.79 K/s`	`453.10 KB`	`288.49 MB`
`keccak`	`2¹²`	`1.41 s`	`2.90 K/s`	`536.74 KB`	`551.79 MB`
`keccak`	`2¹³`	`2.86 s`	`2.86 K/s`	`561.60 KB`	`1.06 GB`
`keccak`	`2¹⁴`	`5.59 s`	`2.93 K/s`	`588.65 KB`	`2.11 GB`
`keccak`	`2¹⁵`	`11.12 s`	`2.95 K/s`	`619.60 KB`	`4.26 GB`
`keccak`	`2¹⁶`	`21.96 s`	`2.98 K/s`	`710.77 KB`	`8.50 GB`
`keccak`	`2¹⁷`	`43.68 s`	`3.00 K/s`	`745.79 KB`	`16.75 GB`
`keccak`	`2¹⁸`	`87.99 s`	`2.98 K/s`	`780.49 KB`	`34.11 GB`
`keccak`	`2¹⁹`	`-`	`-`	`-`	`-`
`keccak`	`2²⁰`	`-`	`-`	`-`	`-`

`vision`	`2¹⁰`	`571.78 ms`	`1.79 K/s`	`847.87 KB`	`155.05 MB`
`vision`	`2¹¹`	`886.24 ms`	`2.31 K/s`	`867.41 KB`	`235.49 MB`
`vision`	`2¹²`	`1.47 s`	`2.78 K/s`	`890.87 KB`	`391.64 MB`
`vision`	`2¹³`	`2.37 s`	`3.46 K/s`	`974.54 KB`	`705.71 MB`
`vision`	`2¹⁴`	`4.18 s`	`3.92 K/s`	`999.43 KB`	`1.28 GB`
`vision`	`2¹⁵`	`8.04 s`	`4.08 K/s`	`1.00 MB`	`2.44 GB`
`vision`	`2¹⁶`	`15.19 s`	`4.31 K/s`	`1.03 MB`	`4.77 GB`
`vision`	`2¹⁷`	`29.34 s`	`4.47 K/s`	`1.12 MB`	`9.54 GB`
`vision`	`2¹⁸`	`57.11 s`	`4.59 K/s`	`1.16 MB`	`19.07 GB`
`vision`	`2¹⁹`	`113.15 s`	`4.63 K/s`	`1.19 MB`	`37.87 GB`
`vision`	`2²⁰`	`-`	`-`	`-`	`-`

Same FRI parameters, but with rate = ¹/₄

To reproduce, get into the code repository and run:

for i in $(seq 10 20); do
    RAYON_NUM_THREADS=4 PCS_LOG_INV_RATE=2 ./bench.sh binius keccak $i
    RAYON_NUM_THREADS=4 PCS_LOG_INV_RATE=2 ./bench.sh binius vision $i
done;
python3 render_table.py

`hash`	`perm`	`time`	`throughput`	`proof_size`	`peak_mem`
`keccak`	`2¹⁰`	`440.88 ms`	`2.32 K/s`	`316.43 KB`	`173.86 MB`
`keccak`	`2¹¹`	`827.01 ms`	`2.48 K/s`	`331.88 KB`	`330.25 MB`
`keccak`	`2¹²`	`1.62 s`	`2.53 K/s`	`384.24 KB`	`633.39 MB`
`keccak`	`2¹³`	`3.27 s`	`2.51 K/s`	`402.57 KB`	`1.22 GB`
`keccak`	`2¹⁴`	`6.46 s`	`2.54 K/s`	`421.90 KB`	`2.42 GB`
`keccak`	`2¹⁵`	`13.06 s`	`2.51 K/s`	`441.98 KB`	`4.82 GB`
`keccak`	`2¹⁶`	`25.85 s`	`2.54 K/s`	`498.96 KB`	`9.81 GB`
`keccak`	`2¹⁷`	`51.52 s`	`2.54 K/s`	`521.92 KB`	`19.70 GB`
`keccak`	`2¹⁸`	`103.61 s`	`2.53 K/s`	`545.87 KB`	`38.87 GB`
`keccak`	`2¹⁹`	`-`	`-`	`-`	`-`
`keccak`	`2²⁰`	`-`	`-`	`-`	`-`

`vision`	`2¹⁰`	`625.80 ms`	`1.64 K/s`	`739.43 KB`	`164.54 MB`
`vision`	`2¹¹`	`951.49 ms`	`2.15 K/s`	`754.16 KB`	`255.32 MB`
`vision`	`2¹²`	`1.52 s`	`2.69 K/s`	`769.65 KB`	`432.86 MB`
`vision`	`2¹³`	`2.57 s`	`3.19 K/s`	`822.04 KB`	`779.41 MB`
`vision`	`2¹⁴`	`4.57 s`	`3.58 K/s`	`840.40 KB`	`1.42 GB`
`vision`	`2¹⁵`	`8.79 s`	`3.73 K/s`	`859.76 KB`	`2.76 GB`
`vision`	`2¹⁶`	`16.91 s`	`3.88 K/s`	`879.87 KB`	`5.42 GB`
`vision`	`2¹⁷`	`33.11 s`	`3.96 K/s`	`936.88 KB`	`10.92 GB`
`vision`	`2¹⁸`	`65.25 s`	`4.02 K/s`	`959.87 KB`	`21.57 GB`
`vision`	`2¹⁹`	`127.21 s`	`4.12 K/s`	`983.85 KB`	`43.49 GB`
`vision`	`2²⁰`	`-`	`-`	`-`	`-`

hashcaster

Since hashcaster only implements the PIOP part, the benchmark below pairs it with the Binius commitment scheme for completeness.

FRI parameters:

Field size: 128 bits
Rate: ¹/₂
Number of queries: 241
Provable soundness: 100 bits

The 128-bit field is too small to reach 128-bit provable soundness; a field of roughly 150 bits is needed, but no such option is currently available.
This makes the throughput below a bit optimistic.
The iota step in keccak-f is not implemented yet, though its cost should be negligible because it can be deferred and merged into the linear layer.
This makes the throughput below a bit optimistic.

To reproduce, get into the code repository and run:

for i in $(seq 10 20); do
    RAYON_NUM_THREADS=4 ./bench.sh hashcaster keccak $i
done;
python3 render_table.py

`hash`	`perm`	`time`	`throughput`	`proof_size`	`peak_mem`
`keccak`	`2¹⁰`	`84.53 ms`	`18.17 K/s`	`522.81 KB`	`47.09 MB`
`keccak`	`2¹¹`	`126.91 ms`	`24.21 K/s`	`540.91 KB`	`46.80 MB`
`keccak`	`2¹²`	`202.20 ms`	`30.39 K/s`	`622.99 KB`	`86.95 MB`
`keccak`	`2¹³`	`338.92 ms`	`36.26 K/s`	`642.52 KB`	`168.63 MB`
`keccak`	`2¹⁴`	`676.60 ms`	`36.32 K/s`	`664.24 KB`	`333.86 MB`
`keccak`	`2¹⁵`	`1.35 s`	`36.44 K/s`	`689.87 KB`	`662.56 MB`
`keccak`	`2¹⁶`	`2.87 s`	`34.22 K/s`	`779.48 KB`	`1.31 GB`
`keccak`	`2¹⁷`	`6.15 s`	`31.99 K/s`	`806.55 KB`	`2.59 GB`
`keccak`	`2¹⁸`	`12.87 s`	`30.54 K/s`	`835.80 KB`	`5.12 GB`
`keccak`	`2¹⁹`	`25.81 s`	`30.47 K/s`	`868.95 KB`	`10.20 GB`
`keccak`	`2²⁰`	`52.34 s`	`30.05 K/s`	`966.10 KB`	`20.39 GB`

expander

Poseidon parameters (notation following https://eprint.iacr.org/2023/323.pdf)
- $n = 31,\ t = 16,\ d = 5,\ R_{F} = 8,\ R_{P} = 14$
- $p = 2^{31} - 1$ (Mersenne31)

The PCS is not finished yet, so the commitment is just the raw witness.
This makes the proof_size below much bigger than expected.
The 93-bit field used in GKR is too small to reach 128-bit provable soundness.
This makes the throughput below a bit optimistic.

To reproduce, get into the code repository and run:

cd expander/circuit/
go run gf2_keccak.go
go run m31_poseidon.go
cd ../../
for i in $(seq 10 20); do
    RAYON_NUM_THREADS=4 ./bench.sh expander keccak $i
    RAYON_NUM_THREADS=4 ./bench.sh expander poseidon $i
done
python3 render_table.py

`hash`	`perm`	`time`	`throughput`	`proof_size`	`peak_mem`
`keccak`	`2¹⁰`	`481.14 ms`	`2.13 K/s`	`518.82 KB`	`3.93 GB`
`keccak`	`2¹¹`	`1.10 s`	`1.87 K/s`	`788.70 KB`	`7.86 GB`
`keccak`	`2¹²`	`2.32 s`	`1.76 K/s`	`1.28 MB`	`15.70 GB`
`keccak`	`2¹³`	`4.97 s`	`1.65 K/s`	`2.30 MB`	`31.34 GB`
`keccak`	`2¹⁴`	`-`	`-`	`-`	`-`
`keccak`	`2¹⁵`	`-`	`-`	`-`	`-`
`keccak`	`2¹⁶`	`-`	`-`	`-`	`-`
`keccak`	`2¹⁷`	`-`	`-`	`-`	`-`
`keccak`	`2¹⁸`	`-`	`-`	`-`	`-`
`keccak`	`2¹⁹`	`-`	`-`	`-`	`-`
`keccak`	`2²⁰`	`-`	`-`	`-`	`-`

`poseidon`	`2¹⁰`	`2.40 ms`	`426.93 K/s`	`222.76 KB`	`61.77 MB`
`poseidon`	`2¹¹`	`4.99 ms`	`410.02 K/s`	`357.09 KB`	`62.34 MB`
`poseidon`	`2¹²`	`9.68 ms`	`423.27 K/s`	`619.41 KB`	`119.96 MB`
`poseidon`	`2¹³`	`18.74 ms`	`437.14 K/s`	`1.11 MB`	`235.33 MB`
`poseidon`	`2¹⁴`	`37.95 ms`	`431.75 K/s`	`2.12 MB`	`465.13 MB`
`poseidon`	`2¹⁵`	`78.41 ms`	`417.88 K/s`	`4.12 MB`	`926.41 MB`
`poseidon`	`2¹⁶`	`180.71 ms`	`362.65 K/s`	`8.13 MB`	`1.81 GB`
`poseidon`	`2¹⁷`	`449.54 ms`	`291.57 K/s`	`16.14 MB`	`3.59 GB`
`poseidon`	`2¹⁸`	`1.09 s`	`241.31 K/s`	`32.14 MB`	`7.18 GB`
`poseidon`	`2¹⁹`	`2.33 s`	`225.40 K/s`	`64.15 MB`	`14.36 GB`
`poseidon`	`2²⁰`	`4.59 s`	`228.30 K/s`	`128.15 MB`	`28.74 GB`

Possible further optimizations

hashcaster
- Use binius PCS on polyval field directly

^[1] 8k signatures * 250 hashes per signature within 4 seconds in the optimistic scenario.
^[2] https://vitalik.eth.limo/general/2024/10/23/futures4.html#starked-binary-hash-trees
https://ethresear.ch/t/proposal-delay-stateroot-reference-to-increase-throughput-and-reduce-latency/20490

asn-d6

2024/12/27 12:12:36

Number of query

Thanks for this great document! Just curious, in all the proof systems here, the number of queries is independent of the size of the statement? Like, for plonky3 we do 256 queries, regardless of whether we are proving 2^10 poseidons, or 2^20 poseidons?

Han

2024/12/30 02:52:50

FRI-Binius has its own analysis and it's independent of the size of the statement https://eprint.iacr.org/2024/504.

2024/12/30 11:32:56

Thanks for the responses! WRT Binius, it seems like the 2024/504 paper you linked has an improved scheme (construction 6.1) compared to the one in the benchmarks. See top of page 35: it shows a scheme with 142-144 queries, and rate = 1/4. According to table 2, it seems like smaller proof and smaller verifier, but larger prover costs. BTW, I assume that Binius is not created to compete with the other proof systems on Poseidon; and it's mainly good for logic-gate hash functions like Keccak, right?

2024/12/31 04:37:06

As far as I understand, current Binius already implements 2024/504: when I set rate = 1/4 and security_bits = 96, it also gives 142 queries, as the paper says, and indeed the proving time is worse than with rate = 1/2. I'll also try to add a Binius benchmark with a different rate later.
> I assume that Binius is not created to compete with the other proof systems on Poseidon; and it's mainly good for logic-gate hash functions like Keccak, right?
Yes, exactly.

2024/12/27 12:15:52

577.77

Throughput is actually a useful stat. Good idea to include it. I'm actually surprised to see it fluctuate like this. I would expect it to increase as the size of the statement increases, because of amortization. However, it actually drops between 2^14 poseidons and 2^20 poseidons in the case of plonky3. I was expecting this would be because of memory usage, but the memory usage of plonky3 is actually surprisingly low. It's so low, we should probably employ some kind of space-time tradeoff to increase throughput by using more RAM.

2024/12/30 03:16:07

On the memory usage, it seems proving Poseidon can be this cheap because we don't waste any bits in the finite field, unlike when proving a traditional hash function. As for the space-time tradeoff, the current bottleneck is the LDE computation and merklization, so we could focus on exploiting those if there is any chance.

2024/12/30 11:45:44

Interesting. Do you have an intuition on the runtime % of LDE computation and merklization compared to the rest of the proof system steps for plonky3 with 2^20 poseidons?

2024/12/31 04:57:11

Here is the trace plonky3 generates: https://gist.github.com/han0110/240e711ee6c1d0a26245a1c898c56b10
The main components are:
- commit witness (LDE + merklization): 68%
- compute+commit quotient: 12%
- open: 17%
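
To put these shares in perspective, here is a rough sketch that applies them to the 1.34 s measured for 2^20 Poseidon2 permutations at rate ¹/₂. It assumes the trace percentages carry over to that measurement. It then uses Amdahl's law to estimate the overall gain if only the witness commitment were sped up. The function name and the 2x factor are purely illustrative.

```python
# Average proving time for plonky3, 2^20 Poseidon2 permutations, rate 1/2 (from the table).
total_seconds = 1.34

# Runtime shares reported in the trace breakdown above.
shares = {
    "commit witness (LDE + merklization)": 0.68,
    "compute+commit quotient": 0.12,
    "open": 0.17,
}

for name, share in shares.items():
    print(f"{name}: {share * total_seconds:.2f} s")


def overall_speedup(share: float, local_speedup: float) -> float:
    """Amdahl's law: whole-run speedup when one component gets faster."""
    return 1 / ((1 - share) + share / local_speedup)


# Hypothetical: the witness commitment alone becomes 2x faster.
print(f"{overall_speedup(0.68, 2):.2f}x")  # 1.52x
```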

2024/12/27 12:17:58

much bigger than expected.

Yep. I also see the throughput continuously dropping, and the memory doubling. I guess these are all artifacts of the lack of PCS? Do we make use of the fact that we are performing the same computation over-and-over in GKR? GKR should be able to take advantage of this.

2024/12/30 03:34:22

> I guess these are all artifacts of the lack of PCS?
Hmm, I don't think so. As far as I understand, at least the memory usage should be linear in the size of the statement; not sure why the throughput continuously drops.
> Do we make use of the fact that we are performing the same computation over-and-over in GKR?
I think expander already implements the data-parallel feature, but I'm not sure whether I use it correctly, so I will double check with them.

2024/12/27 12:22:58

2.12 MB

I'm a bit sad by how big the STARKs ended up being across all proof systems. I was expecting them somewhere in the order of 100kb to 500kb, but we are very far from that. This goes back to my question about number of queries being static. Is the proof size increase just because the depth of the tree changes (and hence the merkle proofs become larger)? Or we do more queries for bigger statements? Even swapping FRI with STIR would only provide a 35% improvement or so, according to Figure 2 of the STIR paper. Looking at the data, it seems really hard to go below 500kb, regardless of the number of hash invocations.

2024/12/30 03:41:58

> Is the proof size increase just because the depth of the tree changes (and hence the merkle proofs become larger)?
Yes, I think so. I use the same number of queries for all statement sizes.
> I was expecting them somewhere in the order of 100kb to 500kb, but we are very far from that.
To make the proof size smaller, a simple approach is to choose a smaller rate. To reach 100kb to 500kb we might need a rate between 1/16 and 1/4, and this makes the throughput much worse. Let me try to run the benchmark again with a smaller rate to see how bad it is; perhaps it's not that bad.

2025/01/01 08:19:57

Another observation: with a higher folding arity and shared opening of the top layers of the Merkle tree, we can cut the proof size to 1/2~1/3 of its current size. Binius already does this, so its proof size looks better; plonky3 can also adopt this trick (iirc plonky2 already has it).