![](https://i.imgur.com/WeIvTiX.png =150x) **Home Edition**

# Discussion notes #8.2: A Benchmarking Framework for (Zero-Knowledge) Proof Systems

Presenters: Daniel Benarroch and Justin Thaler

Authors:
- Daniel Benarroch
- Aurelien Nicolas
- Justin Thaler
- Eran Tromer

To be presented on 2020-05-14.

Resources:
* [Latest PDF version](https://docs.zkproof.org/pages/standards/accepted-workshop3/proposal-benchmarking.pdf)
* [Miro Whiteboard](https://zkproof.org/workshop3-board)
* [SoK Working Group](https://community.zkproof.org/g/WG_BENCHMARK)
* [Additional related links](https://hackmd.io/@HtwXZr-PTFCniCs7fWFSmQ/B1AwbdI_8)

----

## Real-time notes

_Note taker: Mary Maller_.

> Others are welcome to augment/annotate using notes. Add your name. ---MyName

PRESENTATION

**Problem statement:** We would like an apples-to-apples comparison method for benchmarking ZKPs.

**Difficulties:**
- There are many different tradeoffs, whose importance might depend on the application.
- May depend on modes of operation, e.g. amortisation techniques.
- May depend on the type or size of the statement.
- Different optimisation efforts.
- Different security levels and assumptions.

**Goals:**
- Carefully enumerate challenges and subtleties.
- Articulate best practices for benchmarking and reporting.
- Discuss tooling to aid benchmarking.
- Put forward a concrete framework for benchmarking a proof system.

**Concerns with goals:**
- Avoid stifling innovation by putting artificial constraints on designers.
- Avoid biases, e.g.
inadvertently favouring some approaches over others.

**Quantitative costs:** prover time, prover space, verifier time, verifier space, proof size, public key size, rounds of interaction, security level.

**Qualitative costs:** hardness assumptions, setup assumptions, zero-knowledge, ease of verifying correctness, parallelization and acceleration.

**Four categories of challenges:** identifying functionalities, designer flexibility, measurement and accounting, underlying environment.

**Identifying functionalities:**
- Many statements of interest.
- Large tradeoff space.
- Data universe structure.
- We aim to choose a representative set of functionalities.

**Designer flexibility:** e.g. hash function pre-images: does the protocol designer get to choose the hash function?

**Example benchmarking functionality: set membership.** Branching: we can break down the functionalities used to build a set-membership proof.

**Enabling flexibility of choice:**
- High-level functionality: full design flexibility; optimise the statement for the proof machinery.
- Intermediate-level functionality: optimise the front end (encoding method) and proof machinery for a given instantiation.
- Low level: ?
- Bottom level: ?

**Important considerations:**
- Overfitting: hard to prevent people from optimising for a specific benchmark functionality; danger of inflating the importance of the bottom level.
- Should designers be able to submit functionalities?
- Cryptographic functionalities are often of interest as subroutines in larger protocols.
- A set-membership benchmark may not assume any structure. If the structure is application-dependent, what should we do?

**Measurement and accounting:** The literature is rife with ambiguity regarding what counts towards the costs.

**Underlying environment:**
- All hardware favours some operations over others.
- Critical to have portable and modifiable implementations.
- Machines with massive memory are more forgiving to memory-hungry back-ends.
- Benchmarking parallelization is subtle. Recommend that at least one runtime benchmark be done on a single thread.
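The single-thread recommendation above can be enforced by the harness itself rather than trusted to each implementation. A minimal sketch in Python, assuming a Linux host; the `bench_single_thread` helper and the choice of core 0 are illustrative, not part of the proposal:

```python
import os
import time

def bench_single_thread(fn, *args, repeats=5):
    """Time fn best-of-N on one pinned CPU core (wall-clock seconds)."""
    if hasattr(os, "sched_setaffinity"):  # Linux-only API; skip elsewhere
        os.sched_setaffinity(0, {0})      # pin this process to core 0
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Example: time a stand-in "prover" workload.
elapsed = bench_single_thread(lambda: sum(i * i for i in range(10**5)))
```

Pinning the process does not stop a library from spawning worker threads, so a real harness would also set the back-end's thread-count knobs (e.g. an environment variable) before launching the prover.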
**Measurement method:**
- Simply run a command and parameters.
- Standard functionalities.
- Self-reported metrics, because implementers know best.
- Simple print in a standard format.

**Process and publication:**
- Curate a list of test commands.
- Accumulate measurements into a global results file.
- Aggregate results into tables and graphs; publish regularly on GitHub.

----

DISCUSSION

**Question 1 (Kostas Chalkias):** Could we have a better categorisation of functionalities in the community reference document? Maybe with real examples?
**Answer 1:** This proposal will update the community reference document. Which functionalities to include is a delicate question, especially as we want to minimise the number of functionalities. The multiple levels might help. The community could decide to be less ambitious than having a fully representative set of functionalities.

**Question 2:** How do you prove that the benchmark has been run correctly?
**Answer 2:** Need to have the same hardware. Put out implementations that are easy to port, run, and modify.

**Question 3:** How representative are single-threaded implementations?
**Answer 3:** Comparing single-threaded implementations has fewer subtleties. We could include a well-designed multithreaded comparison technique as well.

**Question 4:** Should we build a selection of benchmarks for primitives that are used to build zero-knowledge proofs?
**Answer 4:** Extra work would need to be done to refactor the properties of the benchmarks.

**Question 5:** Are there specific things we haven't yet discussed?
**Answer 5:** How can we encourage participation?

**Question 6:** Does anyone have concerns that a small set of functionalities could produce negative effects?
**Answer 6:** Protocol designers will ignore the benchmarks if they don't fit their specific goals. There are advantages to focusing on a few important problems.

**Question 7:** Can we predict which benchmarks will be important 5 years from now?
**Answer 7:** No.

**Question 8:** Should functionalities be high level, or should they focus on e.g. elliptic curve operations?
**Answer 8:** Might become non-representative if we do that. Multiexps could work, though. Could focus on functionalities that are most used in today's circuits.

Comments:

- Daira: We should definitely include how to measure multithreaded performance (and encourage that to be included).
- Daniel: Recursion is interesting because it can be considered both as a standalone functionality and as a tool in larger applications. It would be good to find a way to compare properly.
- Daira: Recursion doesn't restrict you for other functionalities.
- Eran: External properties. Two approaches: treat them orthogonally, or consider the properties part of a higher-level functionality. The latter approach seems more flexible and lets us focus on useful cases.
- Daira: Pitfall: not all batching is equivalent. Might be batching the same type of statement or different types of statements.
- Daniel: Batching all together or one after the other is a further difference.
- Antoine: Counting the number of R1CS constraints, the number of multi-exps, or the size of FFTs should give us meaningful information.
- Eran: Those who care about the bottom line of the ZKPs do not care about the ingredients. Researchers care a lot about the ingredients, as it lets them know where to focus improvement efforts. The bottom line seems more useful for practitioners. However, if people are interested in other metrics, those could definitely be added to the scope.
- Daira: Code could log the time of different steps in the full protocol.
- Justin: Complication in counting group exponentiations: the cost of a group exponentiation is highly dependent on which group is being used.
- Antoine: We're not even using the same programming languages. This will make a huge difference to performance.
- Daniel: Maybe right now we could start with one functionality; this won't be fully representative, but it's a place to start.
- Daniel: At the highest level for set membership, one could use accumulate-and-prove.
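The "self-report metrics, print in a standard format" idea from the measurement-method discussion could be as simple as one JSON line per run, which the harness appends to the global results file. A sketch, with illustrative field names (the proposal does not fix a schema):

```python
import json

def report(framework, functionality, prover_time_s, verifier_time_s,
           proof_size_bytes, security_bits, threads=1):
    """Serialise one benchmark result as a single JSON line.

    Field names are illustrative; a real harness would fix a schema so
    results from different provers can be aggregated into tables/graphs.
    """
    record = {
        "framework": framework,
        "functionality": functionality,
        "prover_time_s": prover_time_s,
        "verifier_time_s": verifier_time_s,
        "proof_size_bytes": proof_size_bytes,
        "security_bits": security_bits,
        "threads": threads,
    }
    return json.dumps(record)

# Example: a prover prints this line; the harness appends it to a
# results file (e.g. results.jsonl) for later aggregation.
line = report("example-snark", "set-membership", 12.3, 0.004, 192, 128)
print(line)
```

One record per line keeps accumulation trivial (append-only) and lets the publication step aggregate with standard tooling.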
- Eran: In the top-level functionality we must give the designers some flexibility to say how they implement it.
- Daira: The multi-level approach is absolutely necessary to avoid overfitting.

## Discussion topics

_Suggestions welcome! Please append at the end, and the moderators will incorporate into the schedule._

~15 minutes each, by default.

- How do we NOT stifle innovation?
- Is such a framework, identifying a small number of "representative" functionalities, desirable?
- At least a standard way of presenting the benchmark in question / best practices / guidelines to reduce biases
- Functionalities and design flexibility
- How do we prevent (or encourage) designers from (not) tailoring solutions to benchmark functionalities?
- Are any levels more important than others, and do they represent the full spectrum of design choices?
- What are the functionalities we should choose?
- The structure of data is highly application-dependent; what should we do?