In halo2, the lookup is flexible to be configured, anything able to be turned into Expression
could used to be item: Tuple[int, ...]
or table: Set[Tuple[int, ...]]
in lookup, and then it assert item in table
. The Expression
includes constant, queried fixed, advice or instance column at arbitrary rotation, addition/multiplication of Expression
.
The motivation to have multiple circuits as lookup tables is because EVM contains many circuit unfriendly operation like random read-write data access, wrong field operation (ECDSA on secp256k1), traditional hash function (keccak256), etc… And many of them accept variable lenght input.
These expensive operations makes it hard to design a EVM circuit to verify computation trace because each step could possibly contain some of them. So we try to separate these expensive operations to other single-purpose circuits which have more friendly layout, and use them by a lookup (or serveral lookups) to communicate input and output to outsource the effort.
The reason lookup with input and output could be used to outsource the effort is that we know the table lookup-ed is configured with constraints to verify the input and output are in some relationship. For example, we let Bytecode circuit to holds a set of tuple (code_hash, index, opcode)
, and each code_hash
is verified to be the keccak256 digest of opcodes it contains, then in EVM circuit we can load opcode
with (code_hash, program_counter)
by lookup to Bytecode circuit.
However, sometimes there are some properties we can't ensure by only lookups. For example, the amount of all item
should equal to the size of table
, which is required by between EVM circuit and State circuit to prevent extra malicious write. In such case (assert set(items) == table
), we need some extra constraint to ensure the relationship is correct. A naive approach is to also count all item
in State circuit, which in the end is the size of the table
, and constraint it to be equal to the one counted in EVM circuit.
The original approach is using lookup to move the meaningful items from State circuit to bus mapping, which is another private table, and verify the bus mapping has degree bound that eqaul to the one counted in EVM circuit.
But it turns out that we also need to count in State circuit, otherwise the prover could insert something in bus mapping but skip it in State circuit. We can try to do lookup from bus mapping to State circuit to avoid the counting in State circuit, but it just makes bus mapping seem to be a redundant layer.
In general, such case is more like reorder something to be friendly to perform other constraint instead of subset, which should be easy to add in halo2 since subset argument already uses such shuffle argument. But for flexibility, we might have multiple lookup to State circuit likeassert set(items1).intersection(set(items2)) == {0} and set(items1) + set(items2) == table
) , which is not a simple shuffle, so counting items in State circuit seems to be a more general solution.
han
Each circuit is layouted to be friendly to build their own custom constraints. When circuits encounter some expensive operations, they can outsource the effort to other circuits by lookup. The relationship between circuits would be like:
In the end the circuits would be assembled depending on their dimension and the desired capacity. For example, we can just combine 2 different circuits by using different columns, or stack them using same columns with extra selectors.
Name | Value | Description |
---|---|---|
MAX_MEMORY_ADDRESS |
2**40 - 1 |
max memory address allowed [1] |
MAX_GAS |
2**64 - 1 |
max gas allowed |
MAX_ETHER |
2**256 - 1 |
max value of ether allowed [2] |
EVM circuit iterates over transactions included in proof to verify each execution step of transaction is valid. Basically the scale of a step is same as in EVM, so usually we handle one opcode per step, except those opcodes like SHA3
or CALLDATACOPY
that operate on variable size of memory would require multiple steps.
The scale of a step somehow could be different depends on the approach, an extreme case is to implement a VM with reduced instruction set (like TinyRAM) to emulate EVM, which would have a much smaller step, but not sure how it compares to current approach.
han
To verify if a step is valid, we first enumerate all possible execution results of a step in the EVM including success and error cases, and then build the custom constraint to verify the step transition is correct for each execution result.
In each step, we constrain it to enable one of execution results, and specially constrain the first step to enable BEGIN_TX
, then repeats the step to verify the full execution trace. Also each step is given access to next step to propagate the tracking information, by putting constraints like assert next.program_counter == curr.program_counter + 1
.
It's intuitive to have each opcode as a branch in step. However, EVM has so rich opcodes that some of them are very similar like {ADD,SUB}
, {PUSH*}
, {DUP*}
and {SWAP*}
that seem to be handled by almost identical constraint with small tweak (to swap a value or automatically done due to linearity), it seems we could reduce our effort to only implement it once to handle multiple opcodes in single branch.
In addition, an EVM state transition could also contain serveral kinds of error cases, we also need to take them into consideration to be equivalent to EVM. It would be annoying for each opcode branch to handle their own error cases since it needs to halt the step and return the execution context to caller.
Fortunately, most error cases are easy to verify with some pre-built lookup table even they could happen to many opcodes, only some tough errors like out of gas due to dynamic gas usage need to be verified one by one. So we further unroll all kinds of error cases as kinds of execution result.
So we can enumerate all possible execution results and turn EVM circuit into a finite state machine like:
In current implementation, we ask opcode implementer to also implement error cases, which seems to be redundant efforts. If we do this, they can focus more on opcode's success case. Also error cases are usually easier to verify, so I think it also reduces the overall implementation complexity.
han
In EVM, the interpreter has ability to do any random access to data like block context, account balance, stack and memory in current scope, etc… And some of them are read-write and others are read-only.
In EVM circuit, we leverage the concept Circuit as a lookup table to duplicate these random accessible data to other circuits in different layout and verify they are consistent and valid. After these random accessible data are verified, we can use them just like they are tables. Here are the tables currently used in EVM circuit.
For read-write accessible data, EVM circuit lookup State circuit with a sequentially rw_counter
(read-write counter) to make sure the read-write access is chronological, and a flag is_write
to have State circuit to check data is consistency between different write accesses.
For read-only accessible data, EVM circuit lookup Bytecode circuit, Tx circuit, Call circuit directly.
In EVM, state write could be reverted if any call fails. There are many kinds of state write, a complete list can be found here.
In EVM circuit, each call is attached with a flagis_persistent
to know if it succeeds in the end or not, so ideally we only need to do reversion on those kinds of state write which affects future execution before reversion:
TxAccessListAccount
TxAccessListStorageSlot
AccountNonce
AccountBalance
AccountCodeHash
AccountStorage
Others we don't need to do reversion because they don't affect future execution before reversion, we only write them when is_persistent
is 1
:
TxRefund
AccountDestructed
Another tag is
TxLog
, which also doesn't affect future execution. It should be explained where to write such record to after we decide where to build receipt trie.
han
To enable state write reversion, we need some meta information of a call:
is_persistent
- To know if we need reversion or notrw_counter_end_of_reversion
- To know at which point in the future we should revertstate_write_counter
- To know how many state write we have done by nowThen at each state write, we first check if is_persistent
is 0
, if so we do an extra state write at rw_counter_end_of_reversion - state_write_counter
with the old value, which reverts the state write in a reverse order.
In EVM circuit, there are 3 kinds of opcode source for execution or copy:
(code_hash, index, opcode)
(tx_id, TxTableTag.Calldata, index, opcode)
(rw_counter, False, caller_id, index, opcode)
Before we fetch opcode from any source, it checks if the index is in the given range, if not, it follows the behavior of current EVM to implicitly returning 0
.
EVM supports internal call triggered by opcodes. In EVM circuit, the opcodes (like CALL
or CREATE
) that trigger internal call will save its own call_state
into State circuit, setup next call's context, and initialize next step's call_state
to start a new environment. Then the opcodes (like RETURN
or REVERT
) and error cases that halt will restore caller's call_state
and set it back to next step.
For a simple CALL
example with illustration (many details are hided for simplicity):
main
TODO Explain each execution result
Tx circuit iterates over transactions included in proof to verify each transaction has valid signature. It also verifies the built transaction merkle patricia trie has same root hash as public input.
Main part of Tx circuit will be instance columns whose evaluation values are built by verifier directly. See the issue for more details.
To verify if transaction has valid signature, it hashes the RLP encoding of transaction and recover the address of signer with signature, then verifies the signer address is correct.
It serves as a lookup table for EVM circuit to do random access of any field of transaction.
To prevent any skip of transaction, we verify te amount of transactions in Tx circuit is equal to the amount that verified in EVM circuit.
State circuit iterates over random read-write access records of EVM circuit to verify each data is consistent between different writes. It also verifies the state merkle patricia trie root hash has a valid transition from old to new one incrementally, where both are from public input.
To verify if data is consistent, it first verify all access records are grouped by their unique identifier and sorted by order of access, then verify the records between writes are consistent. It also verifies data is in correct format.
It serves as a lookup table for EVM circuit to do consistent random read-write access.
To prevent any malicious insertion of access record, we also verify the amount of random read-write access records in State circuit is equal to the amount in EVM circuit (the final value of rw_counter
).
The first thing to ensure data is consistent between different writes is to give each data an unique identifier, group them by the unique identifier and then sort them by order of access rw_counter
. Here are all kinds of data with their unique identifier:
Tag | Unique Index | Values |
---|---|---|
TxAccessListAccount |
(tx_id, account_address) |
(is_warm, is_warm_prev) |
TxAccessListAccountStorage |
(tx_id, account_address, storage_slot) |
(is_warm, is_warm_prev) |
TxRefund |
(tx_id) |
(value, value_prev) |
Account |
(account_address, field_tag) |
(value, value_prev) |
AccountStorage |
(account_address, storage_slot) |
(value, value_prev) |
AccountDestructed |
(account_address) |
(is_destructed, is_destructed_prev) |
CallContext |
(call_id, field_tag) |
(value) |
Stack |
(call_id, stack_address) |
(value) |
Memory |
(call_id, memory_address) |
(byte) |
Different tags have different constriant on their grouping and values.
Most tags also keep the previous value *_prev
for convenience, which helps reduce the lookup when EVM circuit is performing a write with a diff
to current value, or performing a write with a reversion.
EVM's memory expands implicitly, for example, when the memory is empty and it enounters a mload(32)
, EVM first expands to memory size to 64
, and then load the bytes just initialized to push to stack, which is always a 0
.
The implicit expansion behavior makes even the simple MLOAD
and MSTORE
complicated in EVM circuit, so we have a trick to outsource the effort to State circuit by constraining the first record of each memory unit to be a write or have value 0
. It saves the variable amount of effort to expand memory and ignore those never used memory, only used memory addresses will be initlized with 0
so as lazy initialization.
This concept is also used in another case: the opcode
SELFDESTRUCT
also has ability to update the variable amount of data, it reset thebalance
,nonce
,code_hash
, and everystorage_slot
even it's not used in a step. So for each state under account, we can add arevision_id
handle such case, see here for details.
TODO Convert this into an issue for discussion
han
main
TODO Explain each tag
Bytecode circuit iterates over contract bytecodes to verify each bytecode has valid hash.
It serves as a lookup table for EVM circuit to do random access of any index of bytecode.
ECDSA circuit iterates over signatures to verify recovered public key from signature is valid, or verify signature is invalid.
It serves as a lookup table for EVM and Tx circuit to do public key recover.
Keccak256 circuit iterates over hashes to verify each hash is valid.
It serves as a lookup table for EVM, Bytecode, Tx and MPT circuit to do hash.
MPT circuit (Merkle Patricia Trie circuit) iterates over merkle patricia trie transition to verify each update is valid.
It serves as a lookup table for State and Tx circuit to do merkle patricia trie update.
The explicit max memory address in EVM is actually 32 * (2**32 - 1)
, which is the one that doesn't make memory expansion gas cost overflow u64
. In our case, memory address is allowed to be 5 bytes, but will constrain the memory expansion gas cost to fit u64
in success case. ↩︎
I didn't find a explicit upper bound on value of ether (for balance
or gas_price
) in yellow paper, but handling unbounded big integer seems unrealistic in circuit, so with u256
as a hard bound seems reasonable. ↩︎