
Nova wishlist and next steps

Things we want to improve in Nova and the Nova-based ecosystem.

Context: These are issues that came out of work at the ZK Vietnam Residency, specifically towards a ZK VM design (https://hackmd.io/0bcDwP5QQp-eiAMVEulH7Q) and assessing the feasibility of a different architecture vs. e.g. the current Halo2-based ZKEVM CE.

Audience: (i) ourselves, as we keep working on it, (ii) upstream maintainers, to understand pain points, (iii) the Zuzalu hackathon in mid-April, (iv) the wider community.

Nova (code)

Upstream repo: https://github.com/microsoft/nova

PSE fork: https://github.com/privacy-scaling-explorations/nova

Parallel Nova: Make it work

  • First of all, we should definitely solve all of the bugs that the current proposal has.
    • The final fold verification never succeeds.
    • The algorithm that digests the folds in parallel is really basic, and it provides no guarantee that the standalone F instances are actually forwarded into the top-level folding aggregation.
    • We have no guarantees that the folding proposal for the parallel case is theoretically correct. The paper doesn't provide any guidelines either, so it would be nice to double-check with the authors as well as to test it exhaustively.
  • Figure out why the parallel effort doesn't seem to improve performance much (around a 35% speed improvement).

Note that the first clear thing is that now, for each step F, we're performing 3 folds instead of 1. This is needed to do all the consistency checks required by the parallelization, but at the same time it slows down the parallel solution significantly.

We should explore the performance of the F processing a bit more and identify whether any steps can be done better.

Also, it's important to highlight that the folds should be as big as possible, since the advantage of Nova mostly comes from not needing to compute FFTs. Hence, small F instances don't benefit much from it.

  • The parallel solution currently uses a lot of RAM in order to work. That could be caused by the extreme parallelization we're trying to apply with rayon.
    We should profile the memory to find out where we're allocating the most and see if there's any way to improve it.
    The current implementation requires us to store the current level of the tree in order to perform the folding for the next level, which could be the main reason for the memory overhead.

The idea for the future is that instead of having multiple threads computing F in parallel, we have different machines on a network which serve as parallel provers.

This should decrease memory consumption significantly, as multithreading will then be used within the processing of a single F rather than across multiple instances at the same time (so mostly dedicated to MSM speedup).
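
To make the memory point concrete, below is a minimal sketch of the level-by-level tree reduction the current PoC performs with rayon. `Node` and `fold_pair` are placeholders for the real folded instance/witness pairs and the NIFS folding step, not actual Nova types; the point is that every level of the tree is materialised in full before the next one is computed.

```rust
use rayon::prelude::*;

#[derive(Clone, Debug)]
struct Node(u64); // placeholder for a folded (instance, witness) pair

// Stands in for the NIFS fold (plus the extra consistency folds mentioned above).
fn fold_pair(left: &Node, right: &Node) -> Node {
    Node(left.0.wrapping_add(right.0))
}

/// Folds one whole level of the tree in parallel to produce the next level.
/// Keeping the full `level` vector alive between iterations is the storage
/// overhead discussed above; a networked prover pool would shard it instead.
fn fold_tree(mut level: Vec<Node>) -> Node {
    while level.len() > 1 {
        level = level
            .par_chunks(2)
            .map(|pair| match pair {
                [l, r] => fold_pair(l, r),
                [single] => single.clone(), // odd node is carried up unchanged
                _ => unreachable!(),
            })
            .collect();
    }
    level.into_iter().next().expect("non-empty input")
}

fn main() {
    let leaves: Vec<Node> = (0..8u64).map(Node).collect();
    println!("{:?}", fold_tree(leaves));
}
```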

  • We should add significantly more testing. It would also be nice to include some assertions in the code to prevent misbehaviours that we might be missing now.
  • It would be nice to polish the PoC made in Nova and Nova-Scotia for the parallel Nova implementation. But it might be better to first do an exhaustive analysis of the proposed solution and determine its correctness as well as its performance implications.

This basically means that we need to address all the points above before we polish and upstream the PoC done at the Vietnam Residency.

Confidence around GPU speedup

Some exploration was done towards taking advantage of the Neptune-cuda and pasta-msm features. The issue was that the AWS servers we had access to either did not have GPUs or had GPUs that were impossible to configure. Hence, we weren't able to test the performance of the GPU backend with powerful GPU cards.

The intuition is that if we have big-enough folds with several private and public inputs, the speedup should be considerable, especially since the only heavy operations we perform are multi-scalar multiplications (MSMs) and hashing.

A big leftover from this Residency that we would love to see happen is a set of benchmarks that gives serious intuition about the expected performance gains when these features are used.

API improvements

The API of Nova as well as Nova-Scotia is a bit painful to work with.
There are a lot of improvements we could make so that the parallel solution is easier to implement:

Nova Scotia (code)

  1. Improve witness memory handling (WASM only?)
    • Currently runs out of memory for large witness
    • Nova vanilla is memory-efficient, so solving this would remove a constraint on programs written in Circom
    • This may just be a WASM constraint
      • Possible upstream Circom issue: For Nova Scotia, it'd be useful to get C++ witness gen to work on M1 as Nova is useful for large circuits and M1 is an increasingly common dev machine
  2. Better generic abstractions on the API
    • Aside from the fixed types, the library uses F: PrimeField and G: Group, when the ideal scenario (to avoid trait madness) is to just use G: Group and invoke G::Scalar whenever we need something that implements PrimeField (see the sketch after this list).
  3. Simplify handling of Circom input
    • Abstract away decimal/hex/bytes etc conversions
    • This means we would ideally have an internal type which implements From<F:PrimeField> or something similar, and we'd just use it instead of the current conversion madness.
  4. Minor: don't overload existing types in upstream
    • Makes it harder to write some generic functions
    • E.g. R1CS vs R1CSCircom type, some things around C1 and StepCircuit (?)
  5. Support different curve cycles
    • E.g. BN254/Grumpkin and secp/secq. This will be almost for free if we address the generic enhancements in the current API.
  6. Upstream Pasta curves to Circom
    • Currently done in fork https://github.com/nalinbhardwaj/circom/tree/pasta
    • In general it would be nice to upstream trait-based curve-cycle support (or something similar) to Circom, so that devs using Nova-Scotia don't need to worry about these things at all.
  7. Remove the code duplication between the C++ and the WASM recursive circuit witness generation processes.

See https://github.com/privacy-scaling-explorations/Nova-Scotia/blob/parallel_nova/src/lib.rs#L51-L65 as an example.
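
As a rough illustration of point 2 (and of the type-alias note below), the generic style we are after is a single `G: Group` parameter, reaching the field through the associated type `G::Scalar`. The function below is hypothetical, not existing Nova-Scotia API; it just shows the bound structure using the `group`/`ff` traits and the pasta curves.

```rust
use ff::PrimeField;
use group::Group;
use pasta_curves::pallas;

// Before: fn sum_inputs<G: Group, F: PrimeField>(inputs: &[F]) -> F { ... }
// After: one generic parameter is enough, since `G::Scalar: PrimeField`.
fn sum_inputs<G: Group>(inputs: &[G::Scalar]) -> G::Scalar {
    // Trivial body, only here to exercise the field type.
    inputs.iter().fold(G::Scalar::from(0u64), |acc, x| acc + *x)
}

fn main() {
    let inputs = [pallas::Scalar::from(1u64), pallas::Scalar::from(2u64)];
    let acc = sum_inputs::<pallas::Point>(&inputs);
    println!("{}-bit scalar field, acc = {:?}", pallas::Scalar::NUM_BITS, acc);
}
```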

  1. No type aliases; make everything single-trait-based.
    The type aliases G1, G2, F1, F2 make the library extremely confusing. We should delete them and make everything trait-based everywhere, in the spirit of the sketch above.

  2. Find better abstractions for CircomInput, which is hard to work with. See: https://github.com/privacy-scaling-explorations/Nova-Scotia/blob/parallel_nova/src/lib.rs#L121-L124
    It would also be nice to be able to abstract this away from the user; a rough sketch follows below.
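
For the CircomInput point, a rough sketch of the kind of wrapper we have in mind: one type that owns the decimal/hex parsing so callers never touch string conversions. `InputValue` and `to_field` are hypothetical names, not existing Nova-Scotia API, and the conversion is simplified to values that fit in a `u64`; the real thing should go through the field's byte representation.

```rust
use ff::PrimeField;

/// One Circom input value as it typically appears in the JSON input file.
enum InputValue {
    Decimal(String),
    Hex(String),
}

impl InputValue {
    /// Parse into the scalar field (simplified: only values fitting in u64).
    fn to_field<F: PrimeField>(&self) -> Option<F> {
        match self {
            InputValue::Decimal(s) => s.parse::<u64>().ok().map(F::from),
            InputValue::Hex(s) => u64::from_str_radix(s.trim_start_matches("0x"), 16)
                .ok()
                .map(F::from),
        }
    }
}

fn main() {
    let (x, y) = (InputValue::Decimal("42".into()), InputValue::Hex("0x2a".into()));
    let x: pasta_curves::Fq = x.to_field().unwrap();
    let y: pasta_curves::Fq = y.to_field().unwrap();
    assert_eq!(x, y); // both parse to the same field element
}
```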

Supernova (code)

  • Should we start something from scratch for it based on Nova?
  • Are the current implementations like https://github.com/jules/supernova good enough? Or should we instead own a made-from-scratch codebase for SuperNova so that it is easier to collaborate on and manage?
  • Improved version of SuperNova.
    • This would allow us to implement ZK VMs with multiple opcodes, going from O(N*(C*L)) to O(N*(C+L)) complexity (https://hackmd.io/0bcDwP5QQp-eiAMVEulH7Q#SuperNova-VM); see the cost sketch after this list.
    • The codebase has a lot of TODOs and things left to fix later. We should also evaluate whether the trait-system design of the codebase is a good-enough base/starting point, or otherwise come up with a better design for it.
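
An informal cost accounting behind these figures, following the spec's notation (N = number of VM steps, L = number of supported opcodes, C = cost of a single opcode circuit):

```latex
% Nova needs one universal circuit containing all L opcodes at every step,
% while SuperNova (non-uniform IVC) only folds the circuit selected by the
% program counter at each step:
\begin{aligned}
\text{Nova-style VM:}      &\quad O(N \cdot C \cdot L) \\
\text{SuperNova-style VM:} &\quad O(N \cdot (C + L))
\end{aligned}
```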

Nova benchmarks

  • Improve Nova benchmarks. See current results. Specifically, we want to (i) ensure correctness (verification checks etc.), (ii) clean up the tests and make them easier to run (less manual editing, use Criterion; see the sketch after this list), and (iii) upstream to more standardized benchmark efforts (zk-bench.org or the Celer Network benches).

  • The Bellperson SHA256 Celer Network benchmark appears to be more performant. Reproduce this and understand why there's a difference.
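
For point (ii), a hedged sketch of what a Criterion harness around the proving step could look like; the actual circuit construction and the Nova calls (e.g. `PublicParams::setup`, `RecursiveSNARK::prove_step`) are elided and only referenced in comments.

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_prove_step(c: &mut Criterion) {
    // Expensive one-off setup stays outside the measured closure:
    // let pp = PublicParams::setup(circuit_primary, circuit_secondary);
    c.bench_function("nova_prove_step", |b| {
        b.iter(|| {
            // let snark = RecursiveSNARK::prove_step(&pp, /* ... */).unwrap();
            // assert!(snark.verify(&pp, /* ... */).is_ok()); // keep correctness checks in the bench
            black_box(0u64) // placeholder workload
        })
    });
}

criterion_group!(benches, bench_prove_step);
criterion_main!(benches);
```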

Specs and other

ZK VM spec: https://hackmd.io/0bcDwP5QQp-eiAMVEulH7Q

  1. Improve ZK VM spec in general
    • Laundry list for now, can factor out later
    • zkLLVM: better understanding of opcodes and compiler changes
    • Memory/mini VM PoC (e.g. add+mul, pc=1..2)
    • Better understanding of privacy and limits around it
    • See if we can tweak LLVM so that the IR chunks we get are big enough to get the maximum benefit from the folds.

(Other codebases? E.g. Halo2 support, Plonkish.)