# zkVM Ethereum - OpenVM guest input & optimized MPT implementation

## Introduction

This document explores a new potential candidate for a standardized *ExecutionWitness*, specifically in the Merkle Patricia Trie (MPT) input encoding and implementation. In this case, we explore the current design used in [openvm-reth-benchmark](https://github.com/axiom-crypto/openvm-reth-benchmark). That repository is written in Rust, but this won't be relevant: we want to focus on the data structures and algorithms used, to see whether this design can be a standard, i.e., usable by other ELs written in other languages.

The rest of the document has three main parts:

- *Guest program input format*: specifies which bytes are passed to the guest program as input. This is the only part relevant for standardizing the Guest Program API.
- *Trie deserialization optimizations*: explains how an efficient implementation can deserialize these bytes into a usable sparse MPT for block execution. Note that this is not part of any standardization; it is up to the EL guest programs to use the bytes as they see fit. Still, we explain an optimized implementation so others can be inspired by it.
- *Benchmarks*: we benchmarked this guest program input and sparse MPT implementation on 500 consecutive mainnet blocks against [eth-act/zkvm-ethereum-mpt](https://github.com/eth-act/zkvm-ethereum-mpt) (the sparse MPT used in zeth). This zeth-MPT was already shown to be faster than the default sparse MPT in Reth.

## Guest program input format

The guest program receives the Ethereum block pre-state in a format called `EthereumStateBytes`. This would be an alternative version of the `nodes` field in the current [proposed standardized format](https://hackmd.io/@jsign/zkvm-eth-minimal-execution-witness#Proposal).
### EthereumStateBytes Structure

```python
@dataclass
class EthereumStateBytes:
    # A tuple representing a serialized account state trie:
    # - 1st field: number of nodes in the serialized trie
    # - 2nd field: bytes buffer containing the serialized trie
    state_trie: tuple[int, bytes]

    # A list of tuples representing serialized storage tries.
    # Each tuple contains:
    # - 1st field: account address (B256)
    # - 2nd field: number of nodes in the serialized storage trie
    # - 3rd field: bytes buffer containing the serialized storage trie
    storage_tries: list[tuple[B256, int, bytes]]
```

The *number of nodes* (`int`) lets the guest program know upfront how many nodes to expect in each trie, enabling efficient pre-allocation of memory during deserialization. The serialized trie(s) (`bytes`) are encoded in a format that allows for efficient deserialization. Let's dive deeper into the trie serialization format.

### Trie Serialization Format

Each trie (state or storage) is serialized into a contiguous byte buffer using a depth-first pre-order traversal.

```python
def encode_trie(node: Node) -> bytes:
    """
    Encodes an MPT node and all its children into bytes.
    Uses depth-first pre-order traversal.
    """
    result = bytearray()

    # 1. First, RLP-encode the current node
    rlp_encoded = rlp_encode_node(node)
    result.extend(rlp_encoded)

    # 2. Pad to 4-byte alignment (more about this later)
    padding_len = (4 - (len(rlp_encoded) % 4)) % 4
    result.extend(b'\x00' * padding_len)

    # 3. Recursively encode children (depth-first)
    if isinstance(node, BranchNode):
        for child in node.children:
            if child is not None:
                result.extend(encode_trie(child))
    elif isinstance(node, ExtensionNode):
        result.extend(encode_trie(node.child))
    # Leaf, Digest, and Null nodes have no children

    return bytes(result)
```

The serialization format has two key properties:

1. Pre-order traversal: parent nodes are encoded before their children, enabling streaming deserialization without hash maps (i.e., no extra hashing or random memory access). Note a drawback: repeated sub-tries aren't deduplicated.
2. 4-byte alignment padding: after each node's RLP encoding, zero bytes are added if needed to align to the OpenVM word boundary (4 bytes). This lets us avoid unaligned memory access when we `keccak` RLP-encoded node slices in this buffer, which would otherwise be a performance hit. Instead of 4 bytes, we could change this to 8 bytes if that works better for all zkVMs.

To aid understanding, let's look at an example sparse MPT and what its serialization looks like:

![image](https://hackmd.io/_uploads/ByuO8prVWe.png)

In the above diagram, we can see:

- An example sparse MPT containing a root node (BranchNode) with only two revealed children: Subtrie-1 and Subtrie-2. The rest of the children's subtries aren't revealed (i.e., they are not needed for the block execution).
- The `Trie encoding` represents the bytes resulting from encoding the trie. This is a flat byte array. For readability, the `Subtrie-1` and `Subtrie-2` sections are explained separately in the `Subtrie-{1|2} =` sub-diagrams.
- If you read `Trie encoding` from left to right, you can deserialize the trie in one pass in a recursive manner.

## Trie deserialization optimizations

Given the `Trie encoding` received in the guest input for the state trie and the many storage tries, we need to efficiently transform it into a usable structure to: i) validate it is correct, and ii) use it to execute the state transition function (STF).

The deserialized MPT is represented roughly like this Python-like snippet:

```python
class Mpt:
    root_id: NodeId  # NodeId = integer
    nodes: list[NodeData]

enum NodeData:
    Empty
    Branch(array[Optional[NodeId], 16])
    Leaf(&bytes, &bytes)
    Extension(&bytes, NodeId)
    Digest(&bytes)
```

The `Mpt.nodes` is a (growable) vector that will contain all the nodes.
When created, it is pre-allocated using the "number of trie nodes" hint received in the input (e.g., `state_trie: tuple[int, bytes]` mentioned before). The first item of the tuple (`int`) allows us to:

1. Know how many nodes the pre-state trie will have, so that when deserializing the RLP-encoded nodes, we know we need *at least* that many.
2. Estimate a capacity for `nodes`. Trie writes during post-state root calculation can create new nodes, so the `nodes` vector might grow. The current implementation estimates the capacity of `nodes` as `nodes + (nodes / 2)`. Getting the capacity right is important: appending a new node past capacity during post-state root calculation would trigger a reallocation of the whole vector, which is costly.

Additionally, the `NodeId` values you see in `NodeData` refer to indices in the `Mpt.nodes` vector. The idea is that all the trie nodes live in this vector, so instead of using memory references, we can use vector indices to "point to" children.

`NodeData` is an enum which represents all the node types that we can materialize in the trie:

- `Branch`: an array of 16 `Optional[NodeId]` values, one per child. If a value is `Null`, that child is empty. If it is a `NodeId`, it indicates the index in `nodes` where the child lives (i.e., instead of using memory pointers, we use indices into this vector).
- `Leaf`: contains two memory pointers **into the original `Trie encoding` bytes received in the guest program inputs**: one for the path and the other for the value.
- `Extension`: contains a memory pointer into the original `Trie encoding` for the path, and a `NodeId` with the index in `nodes` of the child.
- `Digest`: represents a node for an unrevealed sub-trie, i.e., what we referenced as `A`, `B`, etc., in the previous diagram.
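To make the decoding side more concrete, here is a simplified, hypothetical Python sketch of a one-pass pre-order decoder in the spirit described above. It uses a toy tag-based node encoding instead of real RLP (the tags, length prefixes, bitmap, and helper names are illustrative, not part of the actual format), but it shows the key ideas: a single left-to-right pass, `NodeId` indices instead of memory pointers, zero-copy `memoryview` references into the input buffer, and skipping the 4-byte alignment padding.

```python
from dataclasses import dataclass

NodeId = int  # an index into Mpt.nodes, mirroring the document's design

# Simplified NodeData variants. Leaf/Extension/Digest hold zero-copy views
# into the original input buffer (the "&bytes" in the text) instead of copies.
@dataclass
class Digest:
    hash: memoryview

@dataclass
class Leaf:
    path: memoryview
    value: memoryview

@dataclass
class Extension:
    path: memoryview
    child: NodeId

@dataclass
class Branch:
    children: list  # 16 entries of Optional[NodeId]

@dataclass
class Mpt:
    root_id: NodeId
    nodes: list

def _align4(off: int) -> int:
    # Skip the zero padding that aligns each encoded node to 4 bytes.
    return (off + 3) & ~3

def decode_trie(buf: bytes, node_count_hint: int) -> Mpt:
    """One-pass, pre-order decode of the toy encoding into an index-based MPT."""
    view = memoryview(buf)
    nodes: list = []  # a real implementation would pre-allocate with the hint
    off = 0

    def decode_node() -> NodeId:
        nonlocal off
        tag = view[off]; off += 1
        my_id = len(nodes)  # the parent is appended before its children
        if tag == 0:    # Digest: 32-byte hash of an unrevealed sub-trie
            nodes.append(Digest(view[off:off + 32])); off = _align4(off + 32)
        elif tag == 1:  # Leaf: length-prefixed path and value
            plen = view[off]; off += 1
            path = view[off:off + plen]; off += plen
            vlen = view[off]; off += 1
            nodes.append(Leaf(path, view[off:off + vlen])); off = _align4(off + vlen)
        elif tag == 2:  # Extension: length-prefixed path; child follows pre-order
            plen = view[off]; off += 1
            path = view[off:off + plen]; off = _align4(off + plen)
            nodes.append(Extension(path, -1))   # placeholder child id
            nodes[my_id].child = decode_node()  # child comes next in the stream
        elif tag == 3:  # Branch: 2-byte bitmap of present children, then children
            bitmap = int.from_bytes(view[off:off + 2], "big"); off = _align4(off + 2)
            nodes.append(Branch([None] * 16))
            for i in range(16):
                if bitmap & (1 << i):
                    nodes[my_id].children[i] = decode_node()
        else:
            raise ValueError(f"unknown tag {tag}")
        return my_id

    root = decode_node()
    assert len(nodes) >= node_count_hint  # the hint must be a lower bound
    return Mpt(root, nodes)
```

Note how no hash map is needed: because the stream is pre-order, the decoder always knows which parent slot the next node fills, and every cross-reference is just an index into the `nodes` vector.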
To understand this better, we can also look at the following diagram:

![image](https://hackmd.io/_uploads/rkaftM3HZl.png)

The dotted arrows (i.e., *weak pointers*) are `NodeId` references within the same array. The non-dotted arrows are memory pointers into the original `Trie encoding` byte array received as input.

To summarize the benefits of this implementation:

- `Mpt.nodes` is a heap-allocated vector with the reserved capacity explained before, i.e., enough to hold the pre-state nodes, plus some extra capacity for new nodes created during post-state root calculation.
- A memory arena is created by preallocating `BUMP_ARENA_SIZE` bytes:
    - `BUMP_ARENA_SIZE` can be estimated to avoid extra allocations. This memory arena can be allocated and "leaked", meaning we don't care about its lifetime (which simplifies memory management). Using this memory arena means we make a single allocation at the start, and any subsequent memory needs are served by a quick "bump allocator".
    - It will contain any allocations required during trie decoding and post-state root calculation (e.g., new data for created nodes -- `GX` in the diagram).
- In short, `Mpt.nodes` is a pre-allocated vector in the heap, so decoding pre-state nodes and creating post-state nodes do not trigger new allocations. Any other required allocation always happens in the memory arena, which is also pre-allocated.
- A `Mpt` is created by performing a single-pass read of the guest program input. Since the main backing data structure is a vector (`Mpt.nodes`), we don't use hash maps, thus avoiding expensive hashes or scattered memory accesses.
- References between nodes use `NodeId`s, which are indices into this vector. A `NodeId` can be smaller than a memory pointer (depending on the zkVM).
- For pre-state node types like `Leaf` and `Extension` which have *paths* or *values*, we reference the original input array.
This avoids new allocations and memory copying by referencing this data directly in the RLP-encoded nodes.

There are two extra optimizations done in `Mpt`:

- The `Mpt` has an extra `rlp_scratch` buffer which mainly serves as a reusable buffer to encode nodes when the post-state root is calculated and _new_ nodes must be RLP-encoded to calculate their hashes.
- There is a _mirror_ array of `Mpt.nodes` named `Mpt.cached_references` which caches the keccak256 calculations over `Mpt.nodes`, just to avoid duplicated work, i.e., roughly `keccak256(Mpt.nodes[i]) = Mpt.cached_references[i]`.

Some open questions or general remarks:

- We might need a more systematic way to decide on good constants, such as the memory arena's initial capacity (and potential growth) and the `Mpt.nodes` capacity, as well as the implications of these being insufficient for performance or as DoS vectors.
- The "node counts" hint received as guest input is a potential DoS vector; additional checks are needed to prevent an easy out-of-memory (OOM) crash.
- Compared to the current `debug_executionWitness` format, the serialized MPT isn't deduplicated. This has an impact on two fronts: i) input size, and ii) extra `keccak256` work caused by duplicated nodes. It might be worth analyzing the size overhead of this proposed format, since it might not be significant considering the other advantages.

## Benchmarks

A benchmark was run on 500 consecutive mainnet blocks (`[23533500-23533999]`). We used the Reth [guest program](https://github.com/eth-act/zkevm-benchmark-workload), which is a thin wrapper on unmodified Reth code. The wrapper allows plugging in different implementations of a stateless-MPT abstraction. Today, the default stateless-MPT implementation can be found [here](https://github.com/eth-act/zkvm-ethereum-mpt); it was originally created by the Risc0 team and used in zeth.
The following is the result of executing all the mainnet blocks and _only_ changing the stateless-MPT implementation behind the Reth MPT trait:

```
### Summary Statistics

Metric                mainnet-2353..    zkevm-metric..    Diff
--------------------------------------------------------------------------------
Total proving time    4h 59m 23.67s     4h 13m 56.83s     -15.2%
Avg throughput        624.25K gas/s     735.96K gas/s     +17.9%

### Per-Fixture Proving Time Difference (mainnet-2353.. vs zkevm-metric..)

Metric    Value
-----------------------------------
Mean      -15.1%
Median    -15.5%
Min       -22.0%
Max       +3.4%
```

The median speedup in **total block proving time** is ~16%, which is quite significant. For a more detailed per-block comparison, see the [full output](https://gist.githubusercontent.com/jsign/8c10af2fd883ea2841fb577dbf23ff40/raw/1498d14d29ce65cf37499eaef6588a7a590df066/trie-benchmark).