# Nova wishlist and next steps

Things we want to improve in Nova and the Nova-based ecosystem.

Context: these issues came out of work at the ZK Vietnam Residency, specifically towards a ZK VM design (https://hackmd.io/0bcDwP5QQp-eiAMVEulH7Q) and assessing the feasibility of a different architecture vs. e.g. the current Halo2-based ZKEVM CE.

Audience: (i) ourselves as we keep working on this, (ii) upstream maintainers, to understand the pain points, (iii) the [Zuzalu hackathon in mid-April](https://zuzaluzk.com/nova), (iv) the wider community.

## Nova (code)

Upstream repo: https://github.com/microsoft/nova
PSE fork: https://github.com/privacy-scaling-explorations/nova

### Parallel Nova: Make it work

- First of all, we should definitely fix all of the bugs in the current proposal:
  - The final fold verification never succeeds.
  - The algorithm to digest the folds in parallel is very basic, and provides no security guarantee that the standalone $F'$ instances are actually forwarded to a top-level folding aggregation.
  - We have no guarantee that the folding proposal for the parallel case is theoretically correct. The paper doesn't provide any guidelines either, so it would be good to double-check with the authors as well as to test it exhaustively.
- Figure out why the parallel effort doesn't seem to improve performance much (around a 35% speed improvement).

:::danger
Note that the first clear issue is that now, **for each step ($F'$) we perform 3 folds instead of 1**. This is needed for all the consistency checks required by the parallelization but, at the same time, it slows down the parallel solution significantly.
:::

We should explore the performance of the $F'$ processing a bit more and identify whether any steps can be done better.

:::info
It's also important to highlight that **the folds should be big**, as big as possible, since the advantage of Nova mostly comes from not needing to compute `FFT`s. Hence, small $F'$ instances don't benefit much from it.
:::

- **The parallel solution currently uses a lot of RAM** in order to work. This could be caused by the extreme parallelization we apply with [`rayon`](https://github.com/rayon-rs/rayon). We should profile the memory to find out where we allocate the most and see whether there's any way to improve it. The current implementation requires us to store the current level of the tree in order to perform the folding for the next level, which could be the main reason for the memory overhead.

:::info
The idea for the future is that instead of having multiple threads computing $F'$ in parallel, **we have different machines on a network which serve as parallel provers**.
:::

This would decrease memory consumption significantly, as multithreading would be used within the $F'$ processing instead of across multiple instances at the same time (so mostly dedicated to MSM speedup).

- We should add significantly more testing. It would also be good to include assertions in the code to catch misbehaviours that we might be missing now.
- It would be nice to polish the PoC made in [Nova](https://github.com/privacy-scaling-explorations/nova/tree/parallel_prover_bench) and [Nova-Scotia](https://github.com/privacy-scaling-explorations/nova-scotia/tree/parallel_nova) for the `Parallel Nova` implementation. But it might be better to first do an exhaustive analysis of the proposed PoC and determine its correctness as well as its performance implications.

:::info
This basically means that we need to address all the points marked above before we polish and upstream the PoC done at the Vietnam Residency.
:::

### Confidence around GPU speedup

Some exploration was done into taking advantage of the [`Neptune-cuda`](https://github.com/lurk-lab/neptune#rust-feature-flags) and [`pasta-msm`](https://github.com/lurk-lab/pasta-msm) features. The issue was that the AWS servers we had access to did not have GPUs, or had GPUs that were impossible to configure.
Hence, we weren't able to test the performance of the GPU backend with powerful GPU cards.

:::success
The intuition is that if we have big-enough folds with several private and public inputs, the speedup should be considerable, especially since the only heavy operations we perform are multiscalar multiplications and hashing.
:::

A big leftover from this Residency that we would love to see happen is benchmarks that give serious intuition about the expected performance gains when these features are used.

### API improvements

The API of Nova, as well as that of Nova-Scotia, is a bit painful to work with. There are a lot of improvements we could make so that the parallel solution is easier to implement:

- Make the API inside Nova and Nova-Scotia easier to use by providing more abstractions & automation capabilities. https://github.com/privacy-scaling-explorations/Nova/blob/parallel_prover_bench/src/lib.rs#L191-L200 is an example of all the inputs that need to be provided. It would be much better if we had a structure that wraps them up and provides constructors and other helpful methods. This is specifically useful for the parallel case, where we created the `FoldInput` struct to manage the parallel witnessing; see https://github.com/privacy-scaling-explorations/Nova/blob/parallel_prover_bench/src/parallel_prover.rs#L652-L656. We believe this can be significantly improved so that all the public inputs and outputs can be generated prior to the IVC accumulation/folding, as well as allowing easier interfaces for witness generation. See a first try in this direction here: https://github.com/privacy-scaling-explorations/Nova-Scotia/blob/parallel_nova/src/lib.rs#L93-L172
- Currently the PoC is focused on implementing `Parallel Nova`, not on API ergonomics. We should try to unify the API for the parallel and regular cases, and let Nova do the re-arranging work over the folds behind the scenes.
This means that the API would be the same (or as close as possible) and we wouldn't need to worry about providing public inputs/outputs. We could also consider a `parallel` feature flag so that writing the same circuit allows us to prove it either sequentially or in parallel.
- One of the later things to pay attention to would be the development of an FPGA-based prover. This would significantly speed up the overall implementation; combined with the fact that we don't do FFTs, the outcome would be a significant performance boost for the protocol implementation. It would also be easy to gate behind a feature flag.
- We would find it useful to make some changes to the API so that we already account for receiving all of the `FoldInput`s, and let Nova internally handle the ordering + the checks of the folds depending on the feature flag used. See https://github.com/privacy-scaling-explorations/Nova/blob/parallel_prover_bench/src/parallel_prover.rs#L652-L656 and https://github.com/privacy-scaling-explorations/Nova/blob/parallel_prover_bench/src/parallel_prover.rs#L658-L716
- We've also experimented with using both sides of the curve cycle so that we can double the amount of work done at the same time. This means that in both `pallas` and `vesta` we perform useful work, instead of just verifying fold-accumulation correctness on one side and doing the useful program-logic work on the other. See:

## Nova Scotia (code)

1. Improve witness memory handling (WASM only?)
   - Currently runs out of memory for large witnesses.
   - Nova vanilla is memory-efficient, so solving this would remove a constraint for programs written in Circom.
   - This may just be a WASM constraint.
   - Possible upstream Circom issue: for Nova Scotia, it'd be useful to get C++ witness generation working on M1, as Nova is useful for large circuits and M1 is an increasingly common dev machine.
2. Better generic abstractions in the API
   - Aside from the fixed types, the library uses `F: PrimeField` and `G: Group`, when the ideal scenario to avoid trait madness is to just use `G: Group` and invoke `G::Scalar` whenever we need something that implements `PrimeField`.
3. Simplify handling of Circom input
   - Abstract away decimal/hex/bytes etc. conversions.
   - Ideally we'd have an internal type which implements `From<F: PrimeField>` or something similar, and we'd just use it instead of the current conversion madness.
4. Minor: don't overload existing types in upstream
   - Makes it harder to write some generic functions.
   - E.g. R1CS vs R1CSCircom type, some things around C1 and StepCircuit (?)
5. Support different curve cycles
   - E.g. BN254/Grumpkin and secp/secq. This will be almost free if we address the generic enhancements in the current API.
6. Upstream Pasta curves to Circom
   - Currently done in a fork: https://github.com/nalinbhardwaj/circom/tree/pasta
   - In general, it would be nice to upstream trait-based curve-cycle support (or something similar) to Circom, so that devs using `Nova-Scotia` don't need to worry about these things at all.
7. Remove the code duplication between the C++ and WASM recursive-circuit witness generation processes.
   - https://github.com/privacy-scaling-explorations/Nova-Scotia/blob/parallel_nova/src/lib.rs#L261-L262
   - https://github.com/privacy-scaling-explorations/Nova-Scotia/blob/parallel_nova/src/lib.rs#L174-L177
   - See https://github.com/privacy-scaling-explorations/Nova-Scotia/blob/parallel_nova/src/lib.rs#L51-L65 as an example.
8. Remove the type aliases and make everything single-trait-based. See [this](https://github.com/privacy-scaling-explorations/Nova-Scotia/blob/parallel_nova/src/lib.rs#L32-L41): the type aliases `G1`, `G2`, `F1`, `F2` make the library extremely confusing. We should delete them and make everything trait-based everywhere.
9. Find better abstractions for `CircomInput`, which is hard to work with.
   See: https://github.com/privacy-scaling-explorations/Nova-Scotia/blob/parallel_nova/src/lib.rs#L121-L124
   It would also be nice to be able to abstract this away from the user.

## Supernova (code)

- Should we start something from scratch for it, based on Nova? Are the current implementations, like https://github.com/jules/supernova, good enough? Or should we instead own a made-from-scratch codebase for SuperNova so that it is easier to collaborate on and manage?
- Improved version of SuperNova: this would allow us to implement ZK VMs with multiple opcodes, going from `O(N*(C*L))` to `O(N*(C+L))` complexity (https://hackmd.io/0bcDwP5QQp-eiAMVEulH7Q#SuperNova-VM).
- The codebase has a lot of TODOs and things left to fix later. We should also evaluate whether the trait-system design of the codebase is a good-enough base/starting point, or otherwise come up with a better design for it.

## Nova benchmarks

- Improve the [Nova benchmarks](https://github.com/privacy-scaling-explorations/nova-bench). See the current [results](https://hackmd.io/0gVClQ9IQiSXHYAK0Up9hg?view=). Specifically, we want to (i) ensure correctness (verification checks etc.), (ii) clean up the tests and make them easier to run (less manual editing, use criterion), (iii) upstream to more standardized benchmark efforts (zk-bench.org or the Celer Network benches).
- The Bellperson SHA256 Celer Network benchmark appears to be more performant. Reproduce this and understand why there's a difference.

## Specs and other

ZK VM spec: https://hackmd.io/0bcDwP5QQp-eiAMVEulH7Q

1. Improve the ZK VM spec in general
   - Laundry list for now; can factor out later
   - zkLLVM: better understanding of opcodes and compiler changes
   - Memory/Mini VM PoC (e.g. add+mul, pc=1..2)
   - Better understanding of privacy and the limits around it
   - See if we can tweak LLVM so that the IR chunks we get are big enough to get the maximum benefit from the folds.

(...other code bases? Like Halo2 support, Plonkish...)
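As a sketch of the "Memory/Mini VM PoC (e.g. add+mul, pc=1..2)" item above, here is a minimal two-opcode VM with a program counter. Everything in it (`Opcode`, `VmState`, `step`) is hypothetical scaffolding and not Nova/SuperNova API; in a SuperNova-style design, each opcode would map to its own step circuit and `pc` would select which circuit gets folded next.

```rust
// Minimal two-opcode (add + mul) VM sketch with a program counter.
// Hypothetical names throughout; nothing here is Nova/SuperNova API.

#[derive(Clone, Copy, Debug)]
enum Opcode {
    Add { dst: usize, a: usize, b: usize },
    Mul { dst: usize, a: usize, b: usize },
}

#[derive(Clone, Debug, PartialEq)]
struct VmState {
    pc: usize,
    regs: Vec<u64>,
}

/// One VM step: execute the opcode at `pc` and advance the counter.
/// This is the transition that a step circuit would have to prove.
fn step(state: &VmState, program: &[Opcode]) -> VmState {
    let mut next = state.clone();
    match program[state.pc] {
        Opcode::Add { dst, a, b } => next.regs[dst] = state.regs[a].wrapping_add(state.regs[b]),
        Opcode::Mul { dst, a, b } => next.regs[dst] = state.regs[a].wrapping_mul(state.regs[b]),
    }
    next.pc += 1;
    next
}

/// Run the program to completion, one step (one fold) at a time.
fn run(mut state: VmState, program: &[Opcode]) -> VmState {
    while state.pc < program.len() {
        state = step(&state, program);
    }
    state
}

fn main() {
    // regs = [3, 4, 0]; compute r2 = (r0 + r1) * r0 = 7 * 3 = 21
    let program = [
        Opcode::Add { dst: 2, a: 0, b: 1 },
        Opcode::Mul { dst: 2, a: 2, b: 0 },
    ];
    let end = run(VmState { pc: 0, regs: vec![3, 4, 0] }, &program);
    println!("pc={}, regs={:?}", end.pc, end.regs); // pc=2, regs=[3, 4, 21]
}
```

In a real PoC, `step` would become the step function $F$ whose transition is proven, and `pc`/`regs` would live in the public IVC state carried between folds.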
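The generics cleanup suggested in the Nova Scotia list (take only `G: Group` and reach the field via `G::Scalar`, instead of threading a separate `F: PrimeField` parameter) can be sketched in isolation. The `Group`/`PrimeField` traits below are minimal hypothetical stand-ins, not the real `ff`/`group` crate traits:

```rust
// Minimal stand-in traits (NOT the real `ff`/`group` crates), just to
// illustrate the shape of the bounds.
trait PrimeField: Clone {
    fn zero() -> Self;
    fn add(&self, other: &Self) -> Self;
}

trait Group {
    type Scalar: PrimeField;
}

// Before: two independent type parameters the caller must keep consistent.
fn sum_inputs_before<G: Group, F: PrimeField>(inputs: &[F]) -> F {
    inputs.iter().fold(F::zero(), |acc, x| acc.add(x))
}

// After: a single parameter; the field type is derived from the group,
// so it can never be mismatched with it.
fn sum_inputs_after<G: Group>(inputs: &[G::Scalar]) -> G::Scalar {
    inputs.iter().fold(G::Scalar::zero(), |acc, x| acc.add(x))
}

// Toy instantiation so the sketch is runnable.
impl PrimeField for u64 {
    fn zero() -> Self { 0 }
    fn add(&self, other: &Self) -> Self { self + other }
}

struct DummyGroup;
impl Group for DummyGroup {
    type Scalar = u64;
}

fn main() {
    let xs = [1u64, 2, 3];
    // The "before" style needs both parameters spelled out at the call site.
    assert_eq!(sum_inputs_before::<DummyGroup, u64>(&xs), 6);
    // The "after" style only needs the group.
    assert_eq!(sum_inputs_after::<DummyGroup>(&xs), 6);
    println!("ok");
}
```

The same pattern would apply to the library's structs and impls: keeping only the group parameter and the `G::Scalar` projection removes the `G1`/`G2`/`F1`/`F2` alias zoo mentioned in item 8.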