
Requirement

  • Verify a list of transactions
  • At the beginning of each transaction or internal call, verify one of:
    • Simple ether transfer, then go to the next transaction
    • Contract creation, then begin bytecode execution
    • Contract interaction, then begin bytecode execution
  • In each step of bytecode execution, we need to verify all kinds of possible results, which include every opcode with its different success and failure cases.
  • At the end of each internal call, execution resumes to the caller. (A call ends with any failing execution result, or with a successful STOP, RETURN, or REVERT)
  • At the end of each transaction, handle the refund and coinbase reward, then continue to the next transaction. (A valid transaction always ends with a successful STOP, RETURN, or REVERT)

What's annoying is that there are 142 opcodes in total, and each of them has a few possible execution results, which sum to around 470 possible execution results (plus a few call initialization cases).

If we treat ErrOutOfGas as 2 different errors ErrOutOfConstantGas and ErrOutOfDynamicGas, and verify the former together with other common errors that can be verified easily as a single case (others like ErrStackUnderflow, ErrStackOverflow, ErrWriteProtection, ErrDepth, ErrInsufficientBalance), then we have 187 possible execution results, which seems better but is still a lot for a circuit.

In the end there will be fewer, because we will find some opcodes can be handled without any extra cost (like DUP* and SWAP*).

In the current implementation I didn't treat common errors as a single case because I thought it wouldn't save too much, but now it seems we should do this to avoid implementation redundancy. Then it seems only around 30 opcodes need 2 cases (success and dynamic OOG) and only CREATE* needs to be taken care of specially. Others only need to focus on the success case. — han
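
For illustration, the grouping might look like the following sketch; the type and variant names here are hypothetical, not the actual implementation's:

```rust
// Hypothetical grouping of execution results; a sketch, not the real impl.
enum ExecutionResult {
    // success path, one case per op gadget
    Success { opcode: u8 },
    // ~30 opcodes with dynamic gas need a dedicated OOG case
    ErrOutOfDynamicGas { opcode: u8 },
    // CREATE* needs special treatment
    Create { is_create2: bool },
    // common errors verified together as a single case: ErrOutOfConstantGas,
    // ErrStackUnderflow, ErrStackOverflow, ErrWriteProtection, ErrDepth,
    // ErrInsufficientBalance, ...
    CommonError,
}
```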

See here for the estimation of possible execution results, or see evm-opcodes by wolflo for a quick reference on EVM opcodes, which also explains the gas costs fantastically.

The pre-compiled contracts are not considered yet; it seems we could treat them as special opcodes that constrain the address to some fixed values.

Current Implementation

Limitation

Assume finally we have each polynomial degree bounded to $2^{25}$; then we have:

  • Gate degree bound: $9$
  • Subset argument degree bound (input + table): $7$

    Reference. Not entirely sure about this, need @yingtong's confirmation. — han

Layout

  • Width: 32 + some fixed and advice selectors
  • Height: 10
  • Rotation:
    • 0~9 for all columns
    • 10 for tracking cells accessible to next step (gc, sp, etc)
  • Cell Usage:
    • For Branching:
      • 1 bool cell is_padding to identify whether the step is padding
      • 1 bool cell is_executing to identify whether it's executing bytecode or initializing a call
      • 80 bool cells q_ops for op gadgets; each op gadget can handle more than 1 op (80 comes from 142 - (num_push - 1) - (num_dup - 1) - (num_swap - 1), and could be less in the future)
    • Other:
      • 14 tracking cells, which are accessible to the next step, like global_counter, call_id, bytecode_source_kind, bytecode_source, stack_pointer, etc.
      • 7 * 32 generic usage cells (7 comes from the most costly case: the success of CALL, which pops 7 words from the stack)
  • Subset Arguments
    • 7 bus mapping lookups (7 comes from the same reason as above)
    • 32 byte range lookups
    • 32 fixed table lookups
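
The layout above could be configured roughly like the following sketch, assuming halo2's API; all names here are hypothetical:

```rust
use halo2::arithmetic::FieldExt;
use halo2::plonk::{Advice, Column, ConstraintSystem, Fixed};

const STEP_WIDTH: usize = 32; // generic advice cells per row
const STEP_HEIGHT: usize = 10; // rows per step; rotation 10 peeks at the next step

struct StepConfig {
    // fixed selector enabled at the first row of each step
    q_step: Column<Fixed>,
    // 32 advice columns holding branching cells (is_padding, is_executing,
    // q_ops), tracking cells, and generic usage cells across the 10 rows
    advices: Vec<Column<Advice>>,
}

impl StepConfig {
    fn configure<F: FieldExt>(meta: &mut ConstraintSystem<F>) -> Self {
        Self {
            q_step: meta.fixed_column(),
            advices: (0..STEP_WIDTH).map(|_| meta.advice_column()).collect(),
        }
    }
}
```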

Potential Optimization

Assuming memory size has a hard bound $2^{24}$

When memory is used up to $2^{24}$ bytes, the gas cost will be around 18 times the current block gas limit $30000000$ ($3 \cdot 2^{19} + \frac{2^{2 \cdot 19}}{512} = 538443776 \approx 18 \cdot 30000000$).
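
A quick sanity check of that number, using the yellow paper's memory expansion cost $C_{mem}(a) = 3a + \lfloor a^2 / 512 \rfloor$ with $a$ in words:

```rust
// Memory expansion gas: 3a + a^2 / 512, with a the memory size in words.
fn memory_gas(words: u64) -> u64 {
    3 * words + words * words / 512
}

fn main() {
    let words = (1u64 << 24) / 32; // 2^24 bytes = 2^19 words
    let gas = memory_gas(words);
    assert_eq!(gas, 538_443_776);
    // prints "~17.95x the block gas limit"
    println!("~{:.2}x the block gas limit", gas as f64 / 30_000_000.0);
}
```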

If we can assume the memory size will always be within $2^{24}$ (the prover can't create a proof if the memory size is out of this range), then any memory offset or size popped from the stack will only cost 3 cells instead of 32 (3 bytes suffice to represent values up to $2^{24}$), and we can reduce word usage from 7 to 4.

The reason is that most opcodes use at most 3 words plus some auxiliary witness, and the opcodes that use more than 3 words always contain 2~4 memory offsets and sizes, which can be packed together into a single word under this assumption. Then we can make the height of the layout 7 instead of 10.

But I'm not sure whether such an incompatibility with the EVM would break some current DApps; this needs more investigation.

Narrow down circuit to reduce subset arguments

Currently we have so many subset arguments. A subset argument costs 3 commitments and 5 evaluations (shuffled table + shuffled input with access to the previous rotation + 1 shuffle product with access to the next rotation), which seems really costly.

Reference. Not sure if it's correct in our case, need @yingtong and @kilic's confirmation. — han

A narrower circuit seems reasonable because we can then make good use of a single subset argument, but it still has some drawbacks:

  • Pros
    • Less subset arguments
  • Cons
    • More rotation accesses will be required, because we still need just as many cells to put the witness in
    • We need more effort to take care of the layout (being aware of which columns the witness should go in for lookups)
    • An extra witness selector to enable the subset argument
    • Some data redundancy (seems like no big deal)

For each subset argument, we handle the layout as follows:

  • Bus mapping lookup
    • TODO
  • Byte range lookup
    • Put word in the same row (already implemented)
  • Fixed table lookup
    • Add another tag column
    • Set the tag value when enabled, otherwise set it to 0
    • Look up by (tag * input, tag * table), so disabled rows match the all-zero row (see the sketch below)
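
A minimal sketch of that fixed table lookup, assuming halo2's lookup API; `tag`, `input`, `table_tag`, and `table_value` are hypothetical columns:

```rust
// Multiplying both sides by the tag makes a disabled row (tag = 0) look up
// (0, 0), which is satisfied by the all-zero row of the fixed table.
meta.lookup(|meta| {
    let tag = meta.query_advice(tag, Rotation::cur());
    let input = meta.query_advice(input, Rotation::cur());
    let table_tag = meta.query_fixed(table_tag, Rotation::cur());
    let table_value = meta.query_fixed(table_value, Rotation::cur());

    vec![(tag * input, table_tag * table_value)]
});
```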

Other Random Thoughts

Allow Higher Gate Degree Bound

There are several approaches to do branching (creating bool expressions for branches). To represent $N$ unrolled branches, we can do one of the following:

| Description | Degree Cost | Cell Cost | Note |
| --- | --- | --- | --- |
| Lagrange basis in a single cell | $N$ | $1$ | State circuit currently uses this |
| Binary composition in $\log_2(N)$ cells | $\log_2(N)$ | $\log_2(N)$ | |
| Stand-alone booleans in $N$ cells | $1$ | $N$ | Current implementation uses this nestedly |

Wondering if there are more choices to do efficient branching. — han

Since we have 186 possible execution results (might be much less in the future), we can use binary composition with 8 bits to represent each case, which leads to a degree-8 synthetic selector.
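
For concreteness, such a synthetic selector could be built like the following sketch, assuming halo2 expressions; `synthetic_selector` is a hypothetical helper:

```rust
use halo2::{arithmetic::FieldExt, plonk::Expression};

// Build the selector for `case` out of 8 boolean-constrained cells:
// the product of 8 bit/(1 - bit) factors, hence degree 8.
fn synthetic_selector<F: FieldExt>(bits: &[Expression<F>; 8], case: usize) -> Expression<F> {
    bits.iter().enumerate().fold(
        Expression::Constant(F::one()),
        |acc, (i, bit)| {
            // take `bit` when the i-th bit of `case` is 1, else `1 - bit`
            if (case >> i) & 1 == 1 {
                acc * bit.clone()
            } else {
                acc * (Expression::Constant(F::one()) - bit.clone())
            }
        },
    )
}
```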

It would be too high for the previous limitation, but if we relax the gate degree bound to 16, then we can reduce the 80 selectors to only 8.

It seems not that beneficial, because we only save height from 7 to 5, though our total polynomial count would also be halved. And the linearisation could become much more complicated.

Outsource Word Operation

The reason the previous approach doesn't save much is that the 4 words still dominate the step cost. If we could outsource word operations, then combined with the previous approach, the circuit could be much narrower.

So operations like arithmetic, bitwise, comparison, etc., which need to be verified byte by byte, would be handled in another circuit and exposed as a table.

For some special cases like bytecode copy, we can assume the bytecode table is constrained to order the bytecode by index, then do a lookup like:

```rust
meta.lookup(|meta| {
    // ...

    // encode the table side with 32 contiguous rotations instead of
    // decomposing the word in the evm circuit step
    let a_word = meta.query_advice(advice_opcode, Rotation::cur());
    let t_word = (0..32)
        .map(|rotation| meta.query_advice(table_opcode, Rotation(rotation)))
        .fold(Expression::Constant(F::zero()), |acc, byte| {
            // random linear combination of the 32 bytes with randomness r
            acc * r.clone() + byte
        });

    vec![
        // ...
        (a_word, t_word),
    ]
});
```

Then we don't need to decompose the word into 32 bytes in the evm circuit.

For other cases like memory copy, we need some additional constraints on the bus mapping to perform such a trick.

Replace Fixed Selector with Witness

Currently there is a fixed selector serving as a pivot to enable each step; it is enabled in the first row of every 10 rows to fit the most costly case.

What if we replace the fixed selector with a witness one: we use another fixed selector to constrain the first row's witness pivot to 1, and then each case constrains the witness pivot of the (N+1)-th row to 1, where N is the number of rows the case requires. Then we never waste any rows.

To visualize, we can treat the circuit as a giant table, each gadget is just sliding from top to bottom to verify the cells satisfy the constraints when it's enabled, so we should be able to do this trick.
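
A sketch of these pivot constraints, assuming halo2 gates; `q_first`, `pivot`, and `q_case` are hypothetical columns, and `N` stands for one particular case's row count:

```rust
const N: i32 = 4; // hypothetical: rows required by this particular case

// The very first row must start a step.
meta.create_gate("first pivot", |meta| {
    let q_first = meta.query_fixed(q_first, Rotation::cur());
    let pivot = meta.query_advice(pivot, Rotation::cur());
    vec![q_first * (pivot - Expression::Constant(F::one()))]
});

// When this case is active at an enabled pivot, force the pivot at the
// (N+1)-th row (rotation N) to 1, so the next step begins immediately
// after this one and no rows are wasted.
meta.create_gate("next pivot", |meta| {
    let pivot = meta.query_advice(pivot, Rotation::cur());
    let q_case = meta.query_advice(q_case, Rotation::cur());
    let pivot_next = meta.query_advice(pivot, Rotation(N));
    vec![pivot * q_case * (pivot_next - Expression::Constant(F::one()))]
});
```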

Not sure if there are some drawbacks (security issues, more complicated linearisation for the verifier, etc.).