Verkle Trees - Proof creation/verification notes

Update (2023-11-28): this document was created ~2023-05 – all the ideas are still relevant. I've also worked on a parallelized version in the Go library that you might also find interesting here.

This document contains notes about implementation optimizations for Verkle Tree proof generation and verification.

Some of these optimizations are present in the spec implementation, while others aren't, since the spec code is optimized for readability rather than speed. We list them here and provide links to the reference implementation and other documents with explanations (if available).

Proof generation

Preprocessing
- Collecting polynomials in eval form involves doing toFr at each relevant internal node to prove, which means doing inversions.
  - Optimization: If you're collecting polynomials of relevant internal nodes recursively, at least batch inverses per node (i.e.: between 1 and 256 Point->Fr transformations per internal node). Do the same on leaf nodes.
  - Idea: There's no dependency between levels, so you can put every inversion of the whole tree walk into a single batch.
  - Idea: The EL pipeline might have already serialized the tree. Try caching the already calculated Frs to avoid redoing the work altogether. (This might consume extra memory.)
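
The batching ideas above rely on Montgomery's trick: n inversions are replaced by one inversion plus roughly 3n multiplications. Below is a minimal Python sketch of that trick. The modulus `P` is a toy prime used as a stand-in for the real field, and `batch_inverse` is a hypothetical helper name, not the go-ipa or spec API.

```python
P = 2**61 - 1  # toy prime modulus (assumption), a stand-in for the real field


def batch_inverse(values, p=P):
    """Invert every element of `values` mod p using a single modular inversion."""
    n = len(values)
    # prefix[i] holds the product of values[0..i-1]
    prefix = [1] * (n + 1)
    for i, v in enumerate(values):
        if v % p == 0:
            raise ZeroDivisionError("cannot invert zero")
        prefix[i + 1] = prefix[i] * v % p

    running_inv = pow(prefix[n], -1, p)  # the only inversion in the whole batch
    out = [0] * n
    for i in range(n - 1, -1, -1):
        # running_inv == 1 / (values[0] * ... * values[i])
        out[i] = running_inv * prefix[i] % p
        running_inv = running_inv * values[i] % p
    return out
```

To batch across levels, append the denominators needed by every internal and leaf node to one list during the tree walk and call the helper once at the end.
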
Multiproof
- Fiat-Shamir: We start by adding all Cs to the transcript, which requires serializing the points. You probably store them in projective form in memory, so normalize them to affine in batch mode (i.e. Montgomery's trick on the $Z$ coordinates). [go-ipa]
- Aggregate all polynomials by evaluation domain before starting with any heavy work. This avoids repeated heavy work when you're aggregating many openings (which is usually the case). [go-ipa]
- When calculating $g(x)$, we have to do polynomial division on the domain:
  - Rewrite $q_{m}$ in terms of $q_{j}$. [spec code, explanation]
  - Removing field inversions in $q_{j}$. [spec code, explanation]
  - Leverage precomputed terms when calculating $\frac{A'(x_{m})}{A'(x_{i})}$. [spec code, explanation]
- When calculating $g_{1}(x)$, do batch inversion for the $\frac{1}{t - z_{i}}$ terms.
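
As an illustration of the division-on-the-domain items above, here is a hedged Python sketch of computing $q(X) = \frac{f(X) - f(x_{m})}{X - x_{m}}$ in evaluation form. It assumes a toy prime `P`, a toy domain $x_{i} = i$ of size 8 (Verkle uses 256), and hypothetical table names (`A_PRIME`, `A_PRIME_INV`, `INV_DIFF`). It uses $q_{j} = \frac{f_{j} - y}{x_{j} - x_{m}}$ for $j \neq m$ and $q_{m} = -\sum_{j \neq m} \frac{A'(x_{m})}{A'(x_{j})} q_{j}$, with every inverse read from a precomputed table.

```python
P = 2**61 - 1  # toy prime modulus (assumption)
D = 8          # toy domain size (assumption); Verkle uses 256
DOMAIN = list(range(D))


def a_prime(i):
    """A'(x_i) = product over j != i of (x_i - x_j)."""
    acc = 1
    for j in DOMAIN:
        if j != i:
            acc = acc * (i - j) % P
    return acc


# Tables computed once at startup, so no inversion happens per opening.
A_PRIME = [a_prime(i) for i in DOMAIN]
A_PRIME_INV = [pow(a, -1, P) for a in A_PRIME]
# 1/(x_j - x_m) depends only on the difference j - m, so a small table suffices.
INV_DIFF = {k: pow(k % P, -1, P) for k in range(-(D - 1), D) if k != 0}


def divide_on_domain(f, m):
    """Evaluations on the domain of q(X) = (f(X) - f(x_m)) / (X - x_m)."""
    y = f[m]
    q = [0] * D
    for j in DOMAIN:
        if j == m:
            continue
        q[j] = (f[j] - y) * INV_DIFF[j - m] % P
        # Accumulate q_m = -sum_{j != m} A'(x_m) / A'(x_j) * q_j
        q[m] = (q[m] - A_PRIME[m] * A_PRIME_INV[j] % P * q[j]) % P
    return q
```

The expression for $q_{m}$ follows from $\sum_{i} \frac{q_{i}}{A'(x_{i})} = 0$, which holds because $q$ has degree at most $d - 2$.
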
Inner product argument
- In the step that evaluates outside of the domain, $f(z) = A(z) \sum_{i = 0}^{d - 1} \frac{f_{i}}{A'(x_{i}) (z - x_{i})}$:
  - Remember that we already have $\frac{1}{A'(x_{i})}$ precomputed; use it directly. [spec code]
  - We have many $\frac{1}{z - x_{i}}$ terms; batch those inverse calculations. [spec code]
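
The barycentric evaluation above can be sketched as follows. This is a minimal Python illustration, assuming a toy prime `P`, a toy domain $x_{i} = i$ of size 8, and hypothetical helper names; it is not the spec or go-ipa code. It reads $\frac{1}{A'(x_{i})}$ from a table and spends a single inversion on all the $\frac{1}{z - x_{i}}$ terms.

```python
import math

P = 2**61 - 1  # toy prime modulus (assumption)
D = 8          # toy domain size (assumption); Verkle uses 256
DOMAIN = list(range(D))

# Precomputed once: 1 / A'(x_i), with A'(x_i) = prod_{j != i} (x_i - x_j)
A_PRIME_INV = [
    pow(math.prod(i - j for j in DOMAIN if j != i) % P, -1, P) for i in DOMAIN
]


def batch_inverse(values):
    """Montgomery's trick: one inversion for the whole list."""
    prefix = [1]
    for v in values:
        prefix.append(prefix[-1] * v % P)
    inv = pow(prefix[-1], -1, P)
    out = [0] * len(values)
    for i in reversed(range(len(values))):
        out[i] = inv * prefix[i] % P
        inv = inv * values[i] % P
    return out


def evaluate_outside_domain(f, z):
    """f(z) from the evaluations f_i on the domain, for z outside the domain."""
    if z % P in DOMAIN:
        raise ValueError("z is in the domain; read f[z] directly")
    diffs = [(z - x) % P for x in DOMAIN]
    inv_diffs = batch_inverse(diffs)  # all 1/(z - x_i) at once
    a_z = 1
    for d in diffs:
        a_z = a_z * d % P             # A(z) = prod (z - x_i)
    acc = 0
    for f_i, ap_inv, d_inv in zip(f, A_PRIME_INV, inv_diffs):
        acc = (acc + f_i * ap_inv % P * d_inv) % P
    return a_z * acc % P
```
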

Proof verification

Proof deserialization
- Make sure you use Tonelli-Shanks with precomputed tables [not in spec code, original impl in Gottfried repo, port in go-ipa with gnark, explanation, perf impact results]
Multiproof
- To calculate $g_{2}(t)$ (and $E$):
  - You'll need the $\frac{r^{i}}{t - z_{i}}$ terms; batch the denominator inverses. [go-ipa]
  - As mentioned for the prover, remember to aggregate polynomials by evaluation point first, which will save some work down the road. [go-ipa]
- Note that calculating $E$ is an MSM. Consider using an MSM algorithm instead of the naive sum of scalar multiplications. [go-ipa]
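
A hedged sketch of how the $\frac{r^{i}}{t - z_{i}}$ scalars might be computed: since every $z_{i}$ lies in the evaluation domain, many openings share the same denominator, so it is enough to invert one value per distinct evaluation point and to do that in a single batch. The prime `P` is a toy stand-in and the function names are hypothetical; the resulting scalars would then feed the MSM that produces $E$.

```python
P = 2**61 - 1  # toy prime modulus (assumption)


def batch_inverse(values):
    """Montgomery's trick: one inversion for the whole list."""
    prefix = [1]
    for v in values:
        prefix.append(prefix[-1] * v % P)
    inv = pow(prefix[-1], -1, P)
    out = [0] * len(values)
    for i in reversed(range(len(values))):
        out[i] = inv * prefix[i] % P
        inv = inv * values[i] % P
    return out


def opening_scalars(zs, r, t):
    """Return [r^i / (t - z_i)] with one inverse per distinct evaluation point."""
    distinct = sorted(set(zs))
    inverses = batch_inverse([(t - z) % P for z in distinct])
    inv_by_z = dict(zip(distinct, inverses))
    scalars, r_pow = [], 1
    for z in zs:
        scalars.append(r_pow * inv_by_z[z] % P)
        r_pow = r_pow * r % P  # next power of r
    return scalars
```
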
Inner product argument
- When evaluating outside the domain step, use the same optimizations mentioned for the prover.
- Only compute basis changes at the end [not in spec code, EIP code, Dankrad's blog post, my handwritten "explanation" of folding trick, go-ipa]
- Do batch inversion for the needed challenges for the basis changes. [go-ipa]
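
To illustrate the "basis changes at the end" item, the Python sketch below compares folding the basis on every round against computing one scalar per original basis element and applying them all at once. It is a toy model under loud assumptions: field elements mod a toy prime `P` stand in for group elements (folding is linear, so the algebra is the same), and the fold rule is taken to be $G' = G_{L} + x^{-1} G_{R}$; the real convention depends on the implementation.

```python
P = 2**61 - 1  # toy prime modulus (assumption)


def fold_each_round(basis, x_invs):
    """Naive approach: halve the basis every round, G' = G_L + x^{-1} * G_R."""
    g = list(basis)
    for x_inv in x_invs:
        half = len(g) // 2
        g = [(g[i] + x_inv * g[i + half]) % P for i in range(half)]
    return g[0]


def folding_scalars(x_invs):
    """Scalars s_i such that the fully folded basis equals sum_i s_i * G_i."""
    s = [1]
    # The last round's challenge ends up on the lowest bit of the index i.
    for x_inv in reversed(x_invs):
        s = s + [v * x_inv % P for v in s]
    return s


def fold_at_the_end(basis, x_invs):
    """Deferred approach: in a real verifier this sum is a single MSM."""
    s = folding_scalars(x_invs)
    return sum(si * gi for si, gi in zip(s, basis)) % P
```

With group elements, the per-round version performs scalar multiplications in every round, while the deferred version only needs field multiplications to build the scalars plus one MSM of the original basis size. The `x_invs` list is where the batch inversion of the challenges would be used.
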

Impact

To give a sense of the speedup available to an implementation that is missing some of these optimizations, here are some results from go-ipa for 16,000 random polynomial openings:

For 16000 polynomials:
        Proving:
                Proof serialization duration: (0.11ms -> 0.11ms)
                Proof creation duration: (664ms -> 271ms [2.45x speedup])
                ...
        Verification:
                Proof deserialization duration: 0.11ms
                Total duration: (1166ms -> 38ms [30x speedup])
                ...

Legend: (BEFOREms -> AFTERms, [SPEEDUPx])

The optimized implementation is still mostly single-threaded, and doesn't require extra memory (e.g. precomputed tables).

Closing

Do you know any other missing trick? Ping me!
