# Towards a Nova-based ZK VM
---
## A really quick summary of our journey
<img src="https://i.imgur.com/1XYtlno.png" width="400px">
Note:
- "Unlimited" parallelization (Not true, but sounds cool)
- No FFTs!
- There's already an implementation that we could bench
https://github.com/microsoft/Nova
---
## Motivations
😕😕We wanted to do ZKWASM!! Why did we end up here?😕😕
- Alternative designs to ZKEVM
- New proving systems sounded really interesting
- WASM doesn't have gas -> Unbounded computations
- Aggregation would be a nightmare.
---
## Goals
- Be able to sketch out memory and program-execution consistency.
- Figure out if `Parallel Supernova` was feasible.
- Be able to implement a Parallel-ready version of Nova in Rust.
- Benchmark the overall solution to see how it compares to other state-of-the-art proving systems available to developers.
---
## The Parallelization Promised Land
* Since the IVC process separates the folding into many small steps, can we allocate different threads, or even different machines, to subsections of the proof?
* This would give us almost unlimited parallelization capacity: an arbitrarily long computation could be split across an arbitrary number of machines.
Unfortunately, Nova can't support this out of the box.
---
## Curve Cycling and Sequential Nova
**Folding != Proving**
* Each folding step only guarantees that, if it is performed correctly and is given two satisfiable relaxed R1CS instances, it produces a satisfiable relaxed R1CS instance.
* Therefore we must prove that every fold was done correctly, so that validity of the final instance implies validity of all the folded steps.
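For reference, one NIFS folding step combines two committed relaxed R1CS instances $(\bar E_1, u_1, \bar W_1, x_1)$ and $(\bar E_2, u_2, \bar W_2, x_2)$ using a verifier challenge $r$ and a commitment $\bar T$ to the cross term (roughly, in the Nova paper's notation):
$$
\bar E \gets \bar E_1 + r\,\bar T + r^2\,\bar E_2, \qquad
u \gets u_1 + r\,u_2, \qquad
\bar W \gets \bar W_1 + r\,\bar W_2, \qquad
x \gets x_1 + r\,x_2
$$
Nothing in this step by itself shows it was carried out honestly; that is exactly what the F' circuit has to verify.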
---
**But folding is a wrong-field operation: it directly manipulates points on the elliptic curve, whose coordinates live in the curve's base field rather than the scalar field the circuit works over**
* Therefore, to be efficient, we have to use curve cycling
* We run two Nova folding chains and two F' circuits alongside each other, where each chain's circuit verifies its own F and the folding of **the other** Nova chain
---
## How sequential Nova looks in Practice
```graphviz
digraph structs {
node [shape=box];
newrank=true;
style=dashed;
compound=true;
subgraph cluster1 {
label="Primary curve"
subgraph some {
rank = same
U₃
U₂
U₁
U₀
ubrick2 [style=invis];
ubrick1 [style=invis];
ubrick0 [style=invis];
}
subgraph some1 {
rank = same
u₃
u₂
u₁
u₀
ufold2 [shape=circle, label="Fold"]
ufold1 [shape=circle, label="Fold"]
ufold0 [shape=circle, label="Fold"]
}
U₃ -> u₃ [style=invis, weight=100];
U₂ -> u₂ [style=invis, weight=100];
U₁ -> u₁ [style=invis, weight=100];
U₀ -> u₀ [style=invis, weight=100];
u₃ -> ubrick2 [style=invis, weight=100];
u₂ -> ubrick1 [style=invis, weight=100];
u₁ -> ubrick0 [style=invis, weight=100];
ubrick2 -> u₂ [style=invis, weight=100];
ubrick1 -> u₁ [style=invis, weight=100];
ubrick0 -> u₀ [style=invis, weight=100];
ubrick2 -> ufold2 [style=invis, weight=100];
ubrick1 -> ufold1 [style=invis, weight=100];
ubrick0 -> ufold0 [style=invis, weight=100];
{U₂, u₂} -> ufold2;
ufold2 -> U₃;
{U₁, u₁} -> ufold1;
ufold1 -> U₂ ;
{U₀, u₀} -> ufold0;
ufold0 -> U₁;
}
subgraph cluster2 {
rankdir="BT";
label="Secondary curve"
subgraph some1 {
rank = same
r₃
r₂
r₁
r₀
rfold2 [shape=circle, label="Fold"]
rfold1 [shape=circle, label="Fold"]
rfold0 [shape=circle, label="Fold"]
}
subgraph some {
rank = same
R₃
R₂
R₁
R₀
rbrick2 [style=invis];
rbrick1 [style=invis];
rbrick0 [style=invis];
}
R₃ -> r₃ [style=invis, weight=100];
R₂ -> r₂ [style=invis, weight=100];
R₁ -> r₁ [style=invis, weight=100];
R₀ -> r₀ [style=invis, weight=100];
r₃ -> rbrick2 [style=invis, weight=100];
r₂ -> rbrick1 [style=invis, weight=100];
r₁ -> rbrick0 [style=invis, weight=100];
rbrick2 -> r₂ [style=invis, weight=100];
rbrick1 -> r₁ [style=invis, weight=1000];
rbrick0 -> r₀ [style=invis, weight=1000];
rbrick2 -> rfold2 [style=invis, weight=1000];
rbrick1 -> rfold1 [style=invis, weight=1000];
rbrick0 -> rfold0 [style=invis, weight=1000];
{R₂, r₂} -> rfold2;
rfold2 -> R₃;
{R₁, r₁} -> rfold1;
rfold1 -> R₂ ;
{R₀, r₀} -> rfold0;
rfold0 -> R₁;
}
u₃ -> r₃ [style=invis, weight=1000];
u₂ -> r₂ [style=invis, weight=1000];
u₁ -> r₁ [style=invis, weight=1000];
u₀ -> r₀ [style=invis, weight=1000];
u₁ -> rfold0 [color=green];
u₂ -> rfold1 [color=green];
u₃ -> rfold2 [color=green];
r₁ -> ufold0 [color=green];
r₂ -> ufold1 [color=green];
r₃ -> ufold2 [color=green];
}
```
---
## Parallelizing Nova
* The challenge is that we can't just fold instances and then use the result as a proof: at each fold we have to construct new R1CS instances that validate the folds.
* We do this by curve cycling and arranging the execution like this:
---
## Parallelizing Nova (cont.)
```graphviz
digraph hierarchy {
node [fontname=Monospace,fontsize=10,shape=box]
"F3" -> {"F1"} [dir=back, label=" (0,2)"];
"F3" -> {"F5"} [dir=back, label=" (4,6)"];
"F7" -> {"F9"} [dir=back, label=" (8,10)"];
"F7" -> {"F3"} [dir=back, label=" (0,6)"];
"F1" -> {"F0"} [dir=back,color="green", label=" (0,0)"];
"F1" -> {"F2"} [dir=back,color="green", label=" (2,2)"];
"F5" -> {"F4"} [dir=back, label=" (4,4)"];
"F5" -> {"F6"} [dir=back, label = " (6,6)"];
"F9" -> {"F8"} [dir=back, label=" (8,8)"];
"F9" -> {"F10"} [dir=back, label=" (10,10)"];
F0 [label="F0", color="green"];
F2 [label="F2", color="green"];
F4 [label="F4"];
F6 [label="F6"];
F8 [label="F8"];
F10 [label="F10"];
"(init)" [style=dashed];
"(end)" [style=dashed];
"proof" [dir=blue, style=dashed, label="Proof: \n (z0 => z10) "];
{
rank=same;
F7 -> "proof"
}
{
rank=same;
"(init)" -> "F0" [label="z0"];
"F10" -> "(end)" [label="z10"];
"F8"
}
}
```
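A hedged Rust sketch of how this tree can be reduced layer by layer across worker threads; `Node`, `combine`, and `reduce` are hypothetical placeholders, not the Nova API, and real nodes would also carry the running relaxed instances on both curves plus the circuit attesting each fold:
```rust
// Illustrative only: reduce the leaves of the tree above pairwise, in parallel,
// until a single root node (the final proof claim) remains.
use rayon::prelude::*;

#[derive(Clone)]
struct Node {
    // Range of execution steps (i, j) this node attests to, i.e. z_i -> z_j.
    start: usize,
    end: usize,
}

/// Fold two adjacent subtrees into a node claiming the combined range.
/// The combining node itself executes the F step that sits between the two
/// ranges, which is why `right.start` is two steps after `left.end` in the diagram.
fn combine(left: &Node, right: &Node) -> Node {
    assert_eq!(left.end + 2, right.start);
    Node { start: left.start, end: right.end }
}

/// Reduce a layer of nodes pairwise until a single root remains.
fn reduce(mut layer: Vec<Node>) -> Node {
    while layer.len() > 1 {
        layer = layer
            .par_chunks(2)
            .map(|pair| match pair {
                [l, r] => combine(l, r),
                [single] => single.clone(),
                _ => unreachable!(),
            })
            .collect();
    }
    layer.remove(0)
}

fn main() {
    // Leaves F0, F2, ..., F10 from the diagram, each covering a single step (i, i).
    let leaves: Vec<Node> = (0..=10).step_by(2).map(|i| Node { start: i, end: i }).collect();
    let root = reduce(leaves);
    assert_eq!((root.start, root.end), (0, 10)); // one claim for z0 => z10
}
```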
---
## Parallelizing F' in Depth
Given two nodes $(L, l)$ and $(R, r)$ claiming $(z_i, z_{k-1})$ and $(z_k, z_j)$, we fold all instances together with three foldings, then we create a proof that $F(z_{k-1}) = z_k$ and that the triple folding is correct. The folding on the primary curve is proven by the opposing secondary circuit, which in turn proves our folding of $(L, l)$ and $(R, r)$.
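In symbols (just restating the text above, with the names used in the diagram on the next slide):
$$
L \gets \mathrm{Fold}(L_1, l_1), \qquad
R \gets \mathrm{Fold}(R_1, r_1), \qquad
N \gets \mathrm{Fold}(L, R)
$$
and the new node additionally proves that $F(z_{k-1}) = z_k$ and that these three folds were computed correctly, with the fold-correctness check living on the opposing curve.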
---
## Parallelizing F' in Depth (cont)
```graphviz
digraph structs {
node [shape=box];
newrank=true;
style=dashed;
compound=true;
subgraph cluster1 {
labeljust="l";
labelloc="b";
label="Primary curve"
L₁
l₁
R₁
r₁
R;
R₁ -> R [label=" Right Fold "];
r₁ -> R;
L₁ -> L [label=" Left Fold"];
l₁ -> L;
L -> N [label=" Final Fold"];
R -> N;
}
subgraph cluster2 {
label="Secondary curve"
u
}
u -> N [label = " Proves Fold", style=bold];
u -> R₁ [label = " Input", style=dashed];
u -> r₁ [label = " Input", style=dashed];
u -> L₁ [label = " Input", style=dashed];
u -> l₁ [label = "Input ", style=dashed];
}
```
---
## Parallelization result
Super naive POC in the Microsoft Research Nova Lib:
1. Proves any combination of nodes that are one step apart.
2. Achieves splitting any computation of N fold steps into N proof steps across N/2 processes.
3. Still has bugs and is not optimized yet, but memory use and proving time should end up not much worse than sequential Nova.
---
## Extracting Execution Trace
To prove arbitrary computation we need to extract the execution trace and prove that every step of the computation is correct.
```graphviz
digraph structs {
node [shape=box];
newrank=true;
style=dashed;
compound=true;
subgraph cluster1 {
label="Wasmi's Engine"
subgraph some {
rank = same
e [label="Executor", color=red]
ex [label="Engine Executor"]
ee [label="Engine"]
l [label="Linker"]
}
l -> ee [weight=100];
ee -> ex [weight=100];
ex -> e [weight=100];
}
subgraph cluster2 {
rankdir="BT";
label="Injector"
subgraph some1 {
rank = same
t [label="Trace"]
p [label="Predator"]
}
p -> t [label="1..*"]
}
e -> p [style=dashed];
}
```
We picked `wasmi` as a starting point, but this approach can be applied to any kind of VM.
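As a rough illustration (hypothetical types, not wasmi's real interfaces), the injector boils down to a callback that the instrumented executor invokes once per executed instruction:
```rust
// Hedged sketch: `TraceRecord` and `Tracer` are made-up shapes for the data the
// injector collects from the executor; the real integration hooks into wasmi internals.

/// One row of the execution trace, emitted for every executed instruction.
#[derive(Debug, Clone)]
pub struct TraceRecord {
    pub time: u64,        // global step counter
    pub pc: u32,          // program counter before the step
    pub opcode: u16,      // numeric opcode that was executed
    pub stack_depth: u32, // stack height after the step
}

/// Whatever the instrumented executor can call back into (the "Injector" above).
pub trait Tracer {
    fn on_step(&mut self, record: TraceRecord);
}

/// Simplest possible tracer: keep every record in memory for later proving.
#[derive(Default)]
pub struct VecTracer {
    pub records: Vec<TraceRecord>,
}

impl Tracer for VecTracer {
    fn on_step(&mut self, record: TraceRecord) {
        self.records.push(record);
    }
}
```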
---
## zkMemory Module
You might be aware that the memory can be modelled as a simple state machine with three instructions `INIT`, `READ`, `WRITE` and a configurable `WORD_SIZE` ([details here](https://hackmd.io/vF1jobzsRoubyUASqQG0Zw?both)).
| Address | Time | Instruction | Value |
| ------- | ---- | ----------- | --------- |
| 0x...00 | 1 | INIT | 0x...0000 |
| 0x...00 | 2 | READ | 0x...0000 |
| 0x...00 | 3 | WRITE | 0x...0a20 |
| 0x...20 | 4 | INIT | 0x...0010 |
| 0x...20 | 5 | READ | 0x...0010 |
| 0x... | .. | ... | ... |
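A minimal Rust sketch of that state machine (illustrative names, with a single `u64` word standing in for the configurable `WORD_SIZE`):
```rust
// Toy zkMemory trace builder: every INIT/READ/WRITE appends one row of the table above.
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Instruction {
    Init,
    Read,
    Write,
}

/// One row of the memory trace table: (Address, Time, Instruction, Value).
#[derive(Debug, Clone, Copy)]
pub struct MemoryEntry {
    pub address: u64,
    pub time: u64,
    pub instruction: Instruction,
    pub value: u64,
}

#[derive(Default)]
pub struct MemoryTrace {
    time: u64,
    state: HashMap<u64, u64>,
    pub entries: Vec<MemoryEntry>,
}

impl MemoryTrace {
    fn log(&mut self, address: u64, instruction: Instruction, value: u64) {
        self.time += 1;
        self.entries.push(MemoryEntry { address, time: self.time, instruction, value });
    }

    pub fn init(&mut self, address: u64, value: u64) {
        self.state.insert(address, value);
        self.log(address, Instruction::Init, value);
    }

    pub fn write(&mut self, address: u64, value: u64) {
        self.state.insert(address, value);
        self.log(address, Instruction::Write, value);
    }

    pub fn read(&mut self, address: u64) -> u64 {
        let value = *self.state.get(&address).expect("READ before INIT");
        self.log(address, Instruction::Read, value);
        value
    }
}
```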
---
## Memory consistency with examples
The idea is to commit to the memory state of the VM before and after the $F'$ execution.
With vector commitments we have the following operations:
- Open(Com, Idx) = Value
- Edit(Com, Idx, NewValue) = NewCommitment
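A hypothetical Rust interface for these two operations (not any existing crate's API; a concrete instantiation could be the KZG/Lagrange construction on the next slides):
```rust
/// Illustrative trait for the vector-commitment operations used in the memory argument.
pub trait VectorCommitment {
    type Commitment;
    type Value;
    type OpeningProof;

    /// Open(Com, Idx) = Value: reveal position `idx` together with an opening proof.
    fn open(&self, com: &Self::Commitment, idx: usize) -> (Self::Value, Self::OpeningProof);

    /// Check an opening produced by `open`.
    fn verify(&self, com: &Self::Commitment, idx: usize, value: &Self::Value, proof: &Self::OpeningProof) -> bool;

    /// Edit(Com, Idx, NewValue) = NewCommitment: update one position, get the new commitment.
    fn edit(&self, com: &Self::Commitment, idx: usize, new_value: &Self::Value) -> Self::Commitment;
}
```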
Note:
So we can open all the positions we want and prove inside the R1CS that they are equal to the witness values we will then operate on.
One challenge with a VM is that you need to make sure your memory is consistent.
---
## Vector commitments
Remember that a vector commitment commits to a vector $a_0,…,a_{n−1}$ and lets you prove that you committed to $a_i$ for some $i$.
$$
p(X) = \sum_{i=0}^{i_{max}}Mem_i \prod_{\substack{j=0 \\ j \neq i}}^{i_{max}} {X-j\over i-j}
$$
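A toy Rust evaluation of this polynomial over a small prime field, just to make the formula concrete (a real deployment works over the commitment scheme's scalar field):
```rust
// Naive Lagrange evaluation of p(X) = sum_i Mem_i * prod_{j != i} (X - j)/(i - j) mod P.
const P: u64 = 65_537; // toy prime modulus

fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

fn inv(x: u64) -> u64 {
    pow_mod(x, P - 2) // Fermat's little theorem
}

fn eval_lagrange(mem: &[u64], x: u64) -> u64 {
    let x = x % P;
    let mut acc = 0;
    for (i, &m) in mem.iter().enumerate() {
        let mut term = m % P;
        for j in 0..mem.len() {
            if i == j {
                continue;
            }
            let num = (x + P - j as u64) % P;
            let den = (i as u64 + P - j as u64) % P;
            term = term * num % P * inv(den) % P;
        }
        acc = (acc + term) % P;
    }
    acc
}

fn main() {
    let mem = [7, 11, 13, 17];
    // p(i) = Mem_i for every index i, which is exactly what the commitment needs.
    for (i, &m) in mem.iter().enumerate() {
        assert_eq!(eval_lagrange(&mem, i as u64), m);
    }
}
```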
Note:
We can reproduce this using the Kate commitment scheme: Let $p(X)$ be the polynomial that for all $i$ evaluates as $p(i)=a_i$.
This completely matches the behaviour we need where we will commit to all the Memory positions for example, and we need to prove inside of the circuit that we are providing the witnesses that satisfy the openings at particular places of the memory.
As per https://dankradfeist.de/ethereum/2020/06/16/kate-polynomial-commitments.html, we know there is such a polynomial, and we can, for example, compute it using Lagrange interpolation.
---
## MUL opcode example
$$
C_M = \sum_{i=0}^{MaxMemPos}Mem_i \prod_{\substack{j=0 \\ j \neq i}}^{MaxMemPos} {X-j\over i-j} \\
C_S = \sum_{i=0}^{MaxStackPos}Stack_i \prod_{\substack{j=0 \\ j \neq i}}^{MaxStackPos} {X-j\over i-j}
$$
---
## MUL opcode example
$$
\mathbb{Opcode}_{MUL} \gets a \times b = c
$$
$$
Fetch(Mem(idx_a)) \gets \mathbb O(C_M, idx_a) = a\\
Fetch(Mem(idx_b)) \gets \mathbb O(C_M, idx_b) = b\\
Set(Mem(idx_{c+1})) \gets \mathbb E(C_{M+1}, c_{+1}, idx_{c+1})\\
$$
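A runnable toy model of these checks (the "commitment" below is just the memory vector itself, so the logic can be exercised end to end; in the real scheme `open`/`edit` are the vector-commitment operations $\mathbb O$ and $\mathbb E$ above):
```rust
// Toy model of the F' checks for one MUL step: operands open against the pre-state
// commitment, the result opens against the post-state commitment, and a * b = c holds.
type Com = Vec<u64>; // stand-in for C_M

fn open(com: &Com, idx: usize) -> u64 {
    com[idx]
}

fn edit(com: &Com, idx: usize, new_value: u64) -> Com {
    let mut next = com.clone();
    next[idx] = new_value;
    next
}

fn check_mul_step(c_m: &Com, c_m_next: &Com, idx_a: usize, idx_b: usize, idx_c: usize) -> bool {
    let a = open(c_m, idx_a);
    let b = open(c_m, idx_b);
    let c = open(c_m_next, idx_c);
    // In the circuit this is the R1CS constraint a * b = c plus the commitment-update check.
    a.wrapping_mul(b) == c && *c_m_next == edit(c_m, idx_c, c)
}

fn main() {
    let c_m = vec![3, 5, 0];          // a = 3 at idx_a = 0, b = 5 at idx_b = 1
    let c_m_next = edit(&c_m, 2, 15); // c = 15 written at idx_c = 2
    assert!(check_mul_step(&c_m, &c_m_next, 0, 1, 2));
}
```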
Note:
**Note that $c$ is obtained from the commitment of the memory at the end of the step ($C_{M+1}$)**
---
## MUL opcode example
$$
Set(Stack(a)) \gets \mathbb E(C_{S+1}, a, idx_{a+1})\\
Set(Stack(b)) \gets \mathbb E(C_{S+1}, b, idx_{b+1})\\
Set(Stack(c_{+1})) \gets \mathbb E(C_{S+1}, c_{+1}, idx_{c+1})\\
$$
$$
R1CS(a \times b = c_{+1})
$$
Note:
We now check the Stack consistency by checking that $a$, $b$ and $c$ are at the correct positions in the Stack at the end.
Now we apply a constraint in R1CS which checks the correctness of the MUL execution.
---
## MUL opcode example
:::success
- The folds require that the `Public Inputs` of the current fold are the `Public Outputs` of the previous one. In that way, at the very top of the aggregation/folding tree we are sure all the memory is consistent.
- We can also account for program-counter correctness with this technique, and indeed for most of the things we have been able to imagine related to VM execution traces.
:::
Note:
We have just proven the correctness of the memory handling for one Opcode. This has 2 important considerations:
---
## Benchmarks
- Benchmarks to understand Nova performance
- Recursively hashing SHA256 $k$ times:
- $h(h(h(x)))$
- Nova, Circom (Groth16) and Halo2 (KZG)
- Rationale: Similar to [zk-benchmark](https://github.com/celer-network/zk-benchmark) but recursive to take advantage of Nova+IVC
- Also many hashes in a single fold
- Disclaimer - preliminary, take with grain of salt
Note:
celer-network sha256 standard
30k constraints sha256
some shortcuts so might not be 100% complete
---
## Many hashes in a single fold
$\underbrace{h(h(\dots h(x)))}_{d \text{ times}} \rightarrow \underbrace{h(h(\dots h(\cdot)))}_{d \text{ times}} \rightarrow \dots$
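As plain Rust (using the `sha2` crate; inside Nova this relation is of course expressed as an R1CS circuit rather than native code), the per-fold step function is simply:
```rust
// Apply SHA-256 d times per folding step; k recursive hashes become k/d folds.
use sha2::{Digest, Sha256};

fn step(z: [u8; 32], d: usize) -> [u8; 32] {
    let mut state = z;
    for _ in 0..d {
        state = Sha256::digest(state).into();
    }
    state
}

fn main() {
    let z0 = [0u8; 32];
    // k = 100 recursive hashes, arranged as 10 folds with d = 10 hashes each.
    let via_folds = (0..10).fold(z0, |z, _| step(z, 10));
    assert_eq!(via_folds, step(z0, 100));
}
```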
---
## Proving systems
| Framework | Arithmetization | Algorithm | Curve |
|------------------|-----------------|-----------|--------|
| Circom (snarkjs) | R1CS | Groth16 | BN254 |
| Nova (seq) | Relaxed R1CS | Nova | Pasta |
| Nova (par PoC) | Relaxed R1CS | Nova | Pasta |
| Halo2 | Plonkish | KZG | BN254 |
---
## Prover time on powerful laptop
*Hardware: MBP M1 Max 2021, 64GB memory*
| k | Circom | Nova total d=1 | Nova step-sum d=1 | Halo 2 (KZG) |
|-------|--------|------------------|---------------------|--------------|
| 1 | 0.3s | 0.2s | 0.1s | 0.8s |
| 10 | 7.3s | 2.4s | 1.2s | 0.8s |
| 100 | 62s | 24s | 12.5s | 1.6s |
| 1000 | - | 240s | 125s | 25s |
Note:
Total includes witness gen and some other overhead, calling from Nova Scotia
Step sum is sum of all individual folding steps
Not including SNARK verification done at end
Will vary d parameter next slide
---
## Prover time on powerful server
*Hardware: Server with 72 cores and ~350GB RAM*
| k | Nova d=1 | Nova d=10 | Nova d=100 | Nova d=100 par | Halo 2 (KZG) |
|---------|----------------|-----------|-------------|----------------|--------------|
| 100 | 19s | 3.6s | - | - | 2.5s |
| 1000 | 190s | 36s | 26.5s | 28.5s | 41.6s |
| 10000 | 1900s | 360s | **265s** | **226s** | 389.1s |
| 100000 | 19000s | | | | ? |
Note:
Not apples-to-apples, Circom in-circuit, Halo2 lookup table, but also standard op
We see Nova faster than Halo2 both for seq and par for d=100, par has bugs
---
## Memory usage and SRS
*Hardware: Server with 72 cores and ~350GB RAM*
| k | Nova (seq) d=1 | Halo 2 (KZG) | Nova (par PoC) |
|---------|-----------------|--------------|-------------|
| 100 | 1.6GB | 3.7GB | 9GB |
| 1000 | 1.6GB | 32GB | 244GB |
| 10000 | 1.6GB | 245GB | OOM |
| 100000 | 1.6GB | ? | ? |
Note:
Circom k=100 3m constraints, ptau23
Halo2 plonkish better, 2^25 for k=10k
Nova seq has constant memory overhead; par should too but currently has a bug
---
## Benchmark conclusion
1. Nova is memory-efficient.
2. R1CS vs Plonkish matters a lot.
3. Consider constraints vs recursion overhead.
4. Large circuits: Nova seems faster than Halo2.
Note:
Mem matters, Circom and Halo2 req big SRS OOM, low-mem env
Plonkish lookup table prover time for Halo2 w d=1, motivates alt folding like Sangria
SHA256 d=1 no constraints 30k not big vs recursive overhead 10k to see perf gain, for d=10,100 see better perf - suggest larger folding
Suboptimal par see ~75% improvement over Halo2 for 10k rec hashes
Check link for more/future
---
## Conclusion and next steps
- Nova new but promising for ZKVM etc
- Nova wish list/next steps
- Parallel, GPU, API, ...
- Nova, Nova Scotia, Supernova, Nova Bench
- Contribute to [pse/zkvm-ideas](https://github.com/privacy-scaling-explorations/zkvm-ideas) repo
- Join Nova track at Zuzalu hackathon
Note:
New and promising, great opportunity
Things to improve in Nova-based ecosystem, fix par eg also maybe FPGA
Work on this and other stuff at Nova track Zuzalu
---
## Links and Q&A
- [zkresear.ch post](https://zkresear.ch/t/towards-a-nova-based-zk-vm/105)
- [pse/zkvm-ideas](https://github.com/privacy-scaling-explorations/zkvm-ideas)
- [Nova-based ZKVM spec](https://hackmd.io/0bcDwP5QQp-eiAMVEulH7Q?view=)
- [Benchmarks](https://hackmd.io/0gVClQ9IQiSXHYAK0Up9hg?view=)
- [Nova wishlist and next steps](https://hackmd.io/80cPXRc0Sa2MPuYX4Wft8w?view=)
Note:
Thanks! Check out above, zkresear.ch post