
On Benchmarking Zero Knowledge Proof Systems

In the Zero Knowledge sphere, there exists a plethora of proof systems and languages: from zk-SNARK languages like Zokrates, Leo, and Vamp-IR, to STARK machines like Triton VM, Miden VM, and RISC Zero. In the current landscape, it is difficult for developers to make an informed decision about which system best fits their use case. This benchmarking endeavor seeks to alleviate the problem by making performance numbers for the various systems readily known, and by providing an easy way to compare the same program across languages and systems.

Currently, the number of backends and programs being compared is limited; however, community contributions should amplify the value the benchmarking project provides.

In particular, the benchmarking results and programs will be most useful to five kinds of individuals:

  1. Users who are building an application with a Zero Knowledge System.
  2. Compiler writers who wish to target a Zero Knowledge System.
  3. Users who have a choice of the Zero Knowledge System in their application.
  4. Zero Knowledge System implementers who wish to know optimization points.
  5. Performance-oriented users who wish to see optimal code in their desired system.

Individuals 4 and 5 can use these results in a straightforward manner: the numbers show what to improve in their programs and compilers, and serve as a comparison point if any result seems suspect.

Individuals 1 through 3 all have the system as a free variable and can pick the language/system best suited to their problem domain. It is important to note that, for these individuals, the benchmark results are not the full story; they should be taken as one data point in a fuller analysis that asks questions like:

  • What is my speed versus expressivity trade-off?
  • What kind of tooling does my application need? Do I have the time to develop said tooling?
  • How much storage space per proof generated am I willing to spend?
  • Do I want to work within a particular proving system?

These are all equally important questions, and no single system wins in every category. For example, RISC Zero may be slower than another STARK system like Miden, but it lets your users write in a language they are already familiar with (Rust, C, etc.) with little effort, whereas Miden has the advantage of being faster but requires writing tooling, or using a language your users are unlikely to be familiar with.

With these caveats made apparent, we can now discuss the data collection methodology.

Methodology

Most Zero Knowledge Systems are implemented in Rust, which gives the benchmarking initiative an easy through line for implementing a general framework.

In particular, we define the ZeroKnowledge trait in Rust to act as our API across any Zero Knowledge System. Future work will incorporate systems and languages that do not expose a Rust API into the benchmark.

pub trait ZeroKnowledge {
    type C;
    type R;

    fn name(&self) -> String;
    fn compile(&self) -> Self::C;
    fn prove(&self, setup: &Self::C) -> Self::R;
    fn verify(&self, receipt: Self::R, program: &Self::C) -> ();

    fn prove_and_verify(&self) -> () {
        let circuit = self.compile();
        let receipt = self.prove(&circuit);
        self.verify(receipt, &circuit);
    }
}

The compile, prove, and verify functions are responsible for compiling the circuit, running the prover, and running the verifier, respectively, with prove_and_verify chaining all three. The name function names the circuit for data-reporting purposes.

The prove and verify functions, in particular, are where most of the benchmarking interest lies. Since every Zero Knowledge Proof System offers some kind of prover and verifier, this interface is easy to implement for any such system.
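To illustrate, here is a minimal sketch of what an implementation might look like. Everything below (SquareRoot, Compiled, Receipt, and the function bodies) is hypothetical stand-in code, not the API of any real backend; a real implementation would delegate to the system's own compiler, prover, and verifier:

// Hypothetical circuit: prove knowledge of a square root of `square`.
struct SquareRoot {
    root: u64,
    square: u64,
}

// Stand-ins for a backend's compiled-circuit and proof types.
struct Compiled(u64);
struct Receipt(u64);

impl ZeroKnowledge for SquareRoot {
    type C = Compiled;
    type R = Receipt;

    fn name(&self) -> String {
        "square-root".to_string()
    }

    fn compile(&self) -> Self::C {
        // A real backend would build its circuit representation here.
        Compiled(self.square)
    }

    fn prove(&self, setup: &Self::C) -> Self::R {
        // A real backend would run its prover on the witness here.
        let _ = setup;
        Receipt(self.root * self.root)
    }

    fn verify(&self, receipt: Self::R, program: &Self::C) -> () {
        // A real backend would check the proof against the circuit here.
        assert_eq!(receipt.0, program.0);
    }
}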

For the benchmarking itself, we have decided to go with the criterion micro-benchmarker. We use criterion to aggregate the average times of each interface function (compile, prove, verify, and prove_and_verify). For the sake of speed, we collect only twenty samples each, otherwise keeping the default benchmarking settings.
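Concretely, the wiring into criterion looks roughly like the sketch below. Here Sudoku is a placeholder for any implementor of the ZeroKnowledge trait above; the reduced sample size is the only deviation from criterion's defaults:

use criterion::{criterion_group, criterion_main, Criterion};

// Benchmark each interface function of one implementor of ZeroKnowledge.
fn bench_sudoku(c: &mut Criterion) {
    let program = Sudoku::default(); // hypothetical implementor
    let circuit = program.compile();

    c.bench_function("sudoku/compile", |b| b.iter(|| program.compile()));
    c.bench_function("sudoku/prove", |b| b.iter(|| program.prove(&circuit)));
    c.bench_function("sudoku/prove_and_verify", |b| b.iter(|| program.prove_and_verify()));
}

criterion_group! {
    name = benches;
    // Twenty samples rather than criterion's default of one hundred.
    config = Criterion::default().sample_size(20);
    targets = bench_sudoku
}
criterion_main!(benches);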

On Different Zero Knowledge Systems

As alluded to earlier, when analyzing the data it is important to understand the capability of each system's underlying machinery. We can break the base machinery down into a few broad categories:

  1. zk-SNARKs
  2. zk-STARKs
  3. Recursive zk-SNARKs

Even within these categories, great differences can be seen. Within zk-SNARKs, for example, there is a divide between R1CS-style systems and Plonkish systems. Plonk-style systems are typically universal: every gate is an instance of the same fixed low-degree polynomial equation, so one setup can be reused for any circuit up to a size bound. Many R1CS-style systems (such as Groth16) are not universal and thus require a new setup for every circuit. The sketch below illustrates the fixed gate equation.
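The following is a schematic sketch of a Plonk-style gate, q_l*a + q_r*b + q_o*c + q_m*(a*b) + q_c = 0, using plain signed integers in place of prime-field elements. The same fixed equation expresses different operations purely through the selector values, which is what makes the circuit shape reusable across programs:

// Selector values (q_*) fix which operation a gate performs.
struct Gate { q_l: i64, q_r: i64, q_o: i64, q_m: i64, q_c: i64 }

impl Gate {
    // The one equation every gate must satisfy.
    fn satisfied(&self, a: i64, b: i64, c: i64) -> bool {
        self.q_l * a + self.q_r * b + self.q_o * c + self.q_m * a * b + self.q_c == 0
    }
}

fn main() {
    let add = Gate { q_l: 1, q_r: 1, q_o: -1, q_m: 0, q_c: 0 }; // a + b = c
    let mul = Gate { q_l: 0, q_r: 0, q_o: -1, q_m: 1, q_c: 0 }; // a * b = c
    assert!(add.satisfied(3, 5, 8));
    assert!(mul.satisfied(3, 5, 15));
}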

With that said, this categorization is useful for the rough properties we can surmise. A good comparison table can be found in the awesome zero knowledge proofs repository. Recursive zk-SNARKs are not included in that table; for analysis purposes, one can lump them in with zk-SNARKs. Where the difference shows up is in the expressivity of each system:

  1. zk-SNARKs: Limited Expressive Power
  2. zk-STARKs: Turing Complete Systems
  3. Recursive zk-SNARKs: Turing Complete Systems

For non-recursive zk-SNARKs, evaluating something as basic as

if (2 = 3) then 3 + 5 else 5 + 8

requires executing both the then branch (3 + 5) and the else branch (5 + 8). This is in contrast to both zk-STARKs and recursive zk-SNARKs, which allow executing only one branch [1]. This difference in expressivity is why projects like Triton VM, Miden VM, and RISC Zero can all be Turing complete.
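For intuition, a non-recursive zk-SNARK "flattens" the branch above: both arms are computed, and a boolean selector combines them arithmetically, roughly as in the sketch below. Plain integers stand in for field elements; in a real circuit, the selector b would additionally be constrained by b * (b - 1) = 0:

// result = b * then_branch + (1 - b) * else_branch
fn select(b: u64, then_val: u64, else_val: u64) -> u64 {
    b * then_val + (1 - b) * else_val
}

fn main() {
    // if (2 = 3) then 3 + 5 else 5 + 8
    let b = u64::from(2 == 3);             // selector is 0: the condition is false
    let result = select(b, 3 + 5, 5 + 8);  // both sums are still computed
    assert_eq!(result, 13);
}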

Another point to consider when comparing data is that many of these projects sit at different levels of abstraction. For example, we may have benchmark results for RISC Zero code and Miden code. The RISC Zero code is generated from Rust; that is to say, we have taken a von Neumann programming language, compiled it down to the RISC-V instruction set, and then converted that into the underlying zk-STARK circuit. The Miden code, meanwhile, may have been written directly in Miden assembly, meaning it only has to be translated from Miden's stack representation to the underlying zk-STARK representation. The Miden code used for benchmarking is thus closer to the underlying zk-STARK representation and will likely show better performance characteristics.

Further, machines like Miden and Triton have static call graphs (the MAST in Miden's case, and similarly in Triton's), which means that all jumps and branches on these machines must be known statically. This imposes various limitations when building on top of them. Due to these limitations, and to ease maintaining multiple non-trivial programs, we have also provided minimal languages on top of the assembly languages. Compiler writers (audience 2) should be interested in these, to see what kinds of languages are easy to build on such machines; the sketch below gives a feel for the static-call-graph constraint.
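A static call graph of this kind can be pictured as a tree of code blocks fixed at compile time. The types below are a loose simplification in the spirit of Miden's MAST, not its actual data structures:

// Every possible branch arm and loop body is a node of the tree, so the
// full control-flow structure is known before the program ever runs.
enum Instr { Push(u64), Add, Eq }

enum Node {
    Block(Vec<Instr>),           // straight-line code
    Join(Box<Node>, Box<Node>),  // run the first node, then the second
    Split(Box<Node>, Box<Node>), // if: both arms are fixed in advance
    Loop(Box<Node>),             // while: the body is fixed in advance
}

fn main() {
    // An if-expression: both possible arms exist in the tree up front.
    let _prog = Node::Split(
        Box::new(Node::Block(vec![Instr::Push(3), Instr::Push(5), Instr::Add])),
        Box::new(Node::Block(vec![Instr::Push(5), Instr::Push(8), Instr::Add])),
    );
}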

Anoma's Stake

This article is published as part of Anoma's research output, as such research is useful to the Anoma project. In fact, Anoma fits all five kinds of interested individuals:

  1. Users who are building an application with a Zero Knowledge System.

Anoma, through the Taiga protocol, will offer an interface for evaluating validity predicates (hereafter referred to as VPs) in a private circuit.

  2. Compiler writers who wish to target a Zero Knowledge System.

Anoma as a project has many compilers that wish to target zero knowledge circuits: from Vamp-IR, which seeks to be the "LLVM of Zero Knowledge Systems", to GEB, a categorical language that seeks to formally verify compilation from languages such as the lambda calculus to polynomial circuits [2].

  3. Users who have a choice of the Zero Knowledge System in their application.

Anoma is a heterogeneous protocol and thus allows developers to choose the Zero Knowledge System that works best for their use case.

  4. Zero Knowledge System implementers who wish to know optimization points.

Anoma has created PlonKup, a proving system that combines Plonk with plookup.

  5. Performance-oriented users who wish to see optimal code in their desired system.

The aforementioned VPs run in a system with gas costs, so Anoma developers will want to optimize their circuits.

The Current Data

The data currently collected is preliminary, so the following caveats should be kept in mind when looking at it:

  1. No effort has yet been made to measure startup time.
  2. Not all examples are representative of real-world use cases.
    • Many are designed around showing certain behaviors and edge cases of particular systems.
  3. Not all programs may be written in the most optimal way.

The data can be seen in an HTML-rendered format: here.

All results were taken on an AMD Ryzen 7 5700X CPU (8 cores, 16 threads @ 3.4 GHz).

| Sudoku       | Miden               | Plonk: 3 by 3            | Risc                     | Halo: 3 by 3             |
|--------------|---------------------|--------------------------|--------------------------|--------------------------|
| All Combined | 321.75 ms (✅ 1.00x) | 128.18 ms (🚀 2.51x faster) | 1.39 s (❌ 4.32x slower)  | 353.85 ms (✅ 1.10x slower) |
| Compile      | 2.92 ms (✅ 1.00x)   | 60.78 ms (❌ 20.84x slower) | 726.71 us (🚀 4.01x faster) | 271.36 ms (❌ 93.06x slower) |
| Prove        | 313.32 ms (✅ 1.00x) | 64.96 ms (🚀 4.82x faster)  | 1.37 s (❌ 4.36x slower)  | 84.19 ms (🚀 3.72x faster)  |
| Verify       | 1.94 ms (✅ 1.00x)   | 4.89 ms (❌ 2.52x slower)   | 1.78 ms (✅ 1.09x faster) | 3.21 ms (❌ 1.65x slower)   |
| Fibonacci    | Miden: iter-93      | Miden: fixed-92            | Miden: fixed-50            | Risc0: iter-93             | Risc0: iter-50             | Risc0: fixed-50            | Risc0: fixed-92            |
|--------------|---------------------|----------------------------|----------------------------|----------------------------|----------------------------|----------------------------|----------------------------|
| All Combined | 317.49 ms (✅ 1.00x) | 151.65 ms (🚀 2.09x faster) | 154.28 ms (🚀 2.06x faster) | 336.80 ms (✅ 1.06x slower) | 333.00 ms (✅ 1.05x slower) | 332.51 ms (✅ 1.05x slower) | 333.55 ms (✅ 1.05x slower) |
| Compile      | 64.50 us (✅ 1.00x)  | 53.27 us (✅ 1.21x faster)  | 41.98 us (✅ 1.54x faster)  | 113.36 us (❌ 1.76x slower) | 111.91 us (❌ 1.74x slower) | 114.57 us (❌ 1.78x slower) | 114.30 us (❌ 1.77x slower) |
| Prove        | 316.77 ms (✅ 1.00x) | 151.03 ms (🚀 2.10x faster) | 150.90 ms (🚀 2.10x faster) | 334.39 ms (✅ 1.06x slower) | 332.21 ms (✅ 1.05x slower) | 330.52 ms (✅ 1.04x slower) | 332.26 ms (✅ 1.05x slower) |
| Verify       | 1.90 ms (✅ 1.00x)   | 1.88 ms (✅ 1.01x faster)   | 1.88 ms (✅ 1.01x faster)   | 1.67 ms (✅ 1.14x faster)   | 1.67 ms (✅ 1.14x faster)   | 1.65 ms (✅ 1.15x faster)   | 1.65 ms (✅ 1.15x faster)   |
| Fibonacci    | Miden: iter-1000  | Risc0: iter-1000           |
|--------------|-------------------|----------------------------|
| All Combined | 2.76 s (✅ 1.00x)  | 2.74 s (✅ 1.00x faster)    |
| Compile      | 63.84 us (✅ 1.00x)| 108.18 us (❌ 1.69x slower) |
| Prove        | 2.72 s (✅ 1.00x)  | 2.70 s (✅ 1.01x faster)    |
| Verify       | 2.11 ms (✅ 1.00x) | 12.08 ms (❌ 5.73x slower)  |

Footnotes


  1. This does not imply that it is cheaper to execute only one branch, as it would be on a von Neumann machine. For simpler branches, it is still more efficient to execute both. ↩︎

  2. GEB is much more general, as it seeks to be a categorical compiler pipeline from any language to another, allowing programmers to write code in more flexible ways, or to trade expressive power for various mathematical properties. ↩︎