This document is an attempt to start exploring the idea of storing the Ethereum state in a highly compact form, to reduce storage costs. It relates to storing the most up-to-date world state as used by full nodes. It is not suitable for archival nodes.
This document will be updated as more insights and hard numbers are obtained.
<span style="color: #880">Highlighted text</span> is speculative, pending investigation / expert opinion.
# Rationale
## The problem
The classic and most straightforward way for storing Ethereum data is by RLP-encoding it and then hashing it, and using the hash as the identifier used to look up the data in the database. Here's an example merkle tree:
```graphviz
digraph {
graph [];
node [shape=record fontname="Monospace" width="5.6"];
root [label="{(A) Branch (tree root)|Child #0 hash: 0x2774b304dd5356a8fc2631df50c11ac\l\ 940e7b75f8f4dd0e5b91dc3e4e1bdcb50\lChild #1 hash: nil\l...\lChild #f hash: 0x180e52d9644f2f46d621d8bdc0075db\l\ d9a5bc6e1891eb4974f3b1f3cfef6daa4\l|{<child0>0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|<childf>f}}"];
node0 [label="{(B) Leaf|Remainder path: 0xd538189e68464913ba78b5f493fd53\l\ 397ab7798479002d356aeb3332873f38\l|Value: 0x22ef116dd16e260b637a1a0a9d060a3a\l\l}"];
nodef [label="{(C) Extension|Remainder path: 0x3805459bbc1fc201514433d61fe17a\l\ 63fd7fe99eb9a0cdf05b2e0a4869581f\l|Child hash: 0x6be68666ce6ca5a1baf2779d77f56ec\l\ 76be68666ce6ca5a1baf2779d77f56ec7\l}"];
nodef0 [label="{(D) Branch|Child #1 hash: 0xd8ade416ad3b176182e25a41b00d69a\l\ 4b599c45b6ca461672f0f3eff6541cb37\lChild #4 hash: 0xfc52e45fbf2a477a6f22cf959269d9f\l\ acf806ad820684f606f478b1e542fa27b\l|{0|<child1>1|2|3|<child4>4|5|6|7|8|9|a|b|c|d|e|f}}"];
nodef03 [width="2.4" label="{(E) Leaf|Remainder path: nil\l|Value: 0x528b\l\l}"];
nodef04 [width="4.7" label="{(F) Leaf|Remainder path: nil\l|Value: 0x60cac43b2f4f000042e503b1c4d9018\l\ cc45483ecaa5af88e5532697c4dfdc606\l}"];
root:child0 -> node0;
root:childf -> nodef;
nodef -> nodef0;
nodef0:child1 -> nodef03;
nodef0:child4 -> nodef04;
}
```
The content of the bottom leaves `E` and `F` are RLP-encoded and hashed, and these hashes are stored in branch `D` (child #1 and #4 hashes). That branch is hashed as well (including the hashes of its children) and that hash is held in the extension node `C`. The extension node and its sibling leaf `B` have their hashes computed and stored in the top branch `A` (at offsets 0x0 and 0xf), the tree root. The hash of that branch is the "root hash" (not shown in the diagram).
The way that data would be stored in a flat key ⟶ value database is essentially as follows:
<pre>
<font color="#f88">(tree root)</font> ⟶
[ <font color="#f88">0x2774b304dd5356a8fc2631df50c11ac940e7b75f8f4dd0e5b91dc3e4e1bdcb50</font>,
nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil,
<font color="#f88">0x180e52d9644f2f46d621d8bdc0075dbd9a5bc6e1891eb4974f3b1f3cfef6daa4</font> ]
<font color="#f88">0x2774b304dd5356a8fc2631df50c11ac940e7b75f8f4dd0e5b91dc3e4e1bdcb50</font> ⟶
[ 0xd538189e68464913ba78b5f493fd53397ab7798479002d356aeb3332873f38,
0x22ef116dd16e260b637a1a0a9d060a3a ]
<font color="#f88">0x180e52d9644f2f46d621d8bdc0075dbd9a5bc6e1891eb4974f3b1f3cfef6daa4</font> ⟶
[ 0x3805459bbc1fc201514433d61fe17a63fd7fe99eb9a0cdf05b2e0a4869581f,
<font color="#f88">0x6be68666ce6ca5a1baf2779d77f56ec76be68666ce6ca5a1baf2779d77f56ec7</font> ]
<font color="#f88">0x6be68666ce6ca5a1baf2779d77f56ec76be68666ce6ca5a1baf2779d77f56ec7</font> ⟶
[ nil, <font color="#f88">0xd8ade416ad3b176182e25a41b00d69a4b599c45b6ca461672f0f3eff6541cb37</font>,
nil, nil, <font color="#f88">0xfc52e45fbf2a477a6f22cf959269d9facf806ad820684f606f478b1e542fa27b</font>,
nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil ]
<font color="#f88">0xd8ade416ad3b176182e25a41b00d69a4b599c45b6ca461672f0f3eff6541cb37</font> ⟶
[ nil, 0x528b ]
<font color="#f88">0xfc52e45fbf2a477a6f22cf959269d9facf806ad820684f606f478b1e542fa27b</font> ⟶
[ nil, 0x60cac43b2f4f000042e503b1c4d9018cc45483ecaa5af88e5532697c4dfdc606 ]
</pre>
The hashes have been <font color="#f88">highlighted</font> above. As you can see, most of the data stored on disk is actually hashes.
## Obtaining a node's value
To look up a node in the tree, we need its _logical key_. In the example above, leaf node `F`'s logical key is `0xf3805459bbc1fc201514433d61fe17a63fd7fe99eb9a0cdf05b2e0a4869581f4`. It represents a path down the tree. Each hex character is called a _nibble_. Leaves always have 64-nibble paths, which are encoded as 32 bytes (256 bits).
To get the leaf value from the database, we first need to fetch the root node (branch `A`). The first nibble is `0xf` meaning that leaf resides somewhere inside branch `A`'s child #15 (`0xf`). The hash of that child is `0x180e52d9644f2f46d621d8bdc0075dbd9a5bc6e1891eb4974f3b1f3cfef6daa4`, and we use it to perform another database lookup to obtain its value. Since it's an extension node (`C`), we verify that its _remainder path_ matches the next nibbles in our key (`3805459bbc1fc201514433d61fe17a63fd7fe99eb9a0cdf05b2e0a4869581f`). Once confirmed, we can go ahead and fetch the node that it points to (branch `D`). At that point we only have one nibble left to resolve (`0x4`). We fetch child #4 by its hash and obtain leaf `F`.
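The traversal just described can be sketched in Python. This is a minimal illustration only, with hypothetical node tuples and short string "hashes" standing in for a real client's RLP encoding and 256-bit hashes:

```python
# A sketch of the lookup walk, assuming a hypothetical `db` mapping
# (node hash -> decoded node). Node shapes (illustrative, not real):
#   ("branch", [child_hash_or_None] * 16)
#   ("extension", remainder_nibbles, child_hash)
#   ("leaf", remainder_nibbles, value)

def lookup(db, root_hash, key_nibbles):
    node = db[root_hash]
    i = 0  # current position within the logical key's nibbles
    while True:
        kind = node[0]
        if kind == "branch":
            child = node[1][key_nibbles[i]]
            if child is None:
                return None  # key not present in the tree
            node, i = db[child], i + 1
        elif kind == "extension":
            remainder = node[1]
            if key_nibbles[i:i + len(remainder)] != remainder:
                return None  # remainder path mismatch
            node, i = db[node[2]], i + len(remainder)
        else:  # leaf
            remainder = node[1]
            return node[2] if key_nibbles[i:] == remainder else None
```

Walking the example tree with this sketch, the key `f34…` resolves through branch `A`, extension `C` and branch `D` down to leaf `F`, in four database lookups.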
## Using shorter database identifiers
If you think about it, using hashes as database lookup keys is wasteful. If your database were to store, say, 1 billion items, then ideally you'd use a 30-bit identifier (which can represent 2^30^ = 1,073,741,824 items) to uniquely identify each item. That's a lot shorter than 256-bit hashes.
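The arithmetic is easy to check:

```python
import math

def min_id_bits(item_count):
    """Smallest identifier width (in bits) that can address item_count items."""
    return max(1, math.ceil(math.log2(item_count)))

# 2**30 = 1,073,741,824, so one billion items fit in a 30-bit
# identifier, versus the 256 bits of a full hash.
```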
Let's see how that would look, taking our example above of branch node `D` referencing leaves `E` and `F`, and assigning them 32-bit identifiers:
<pre>
<s><font color="#f88">0x6be68666ce6ca5a1baf2779d77f56ec76be68666ce6ca5a1baf2779d77f56ec7</font></s> <font color="#7bf">0x6b3ee8bf</font> ⟶
[ nil,
<s><font color="#f88">0xd8ade416ad3b176182e25a41b00d69a4b599c45b6ca461672f0f3eff6541cb37</font></s> <font color="#7bf">0xb1f2922c</font>,
nil, nil,
<s><font color="#f88">0xfc52e45fbf2a477a6f22cf959269d9facf806ad820684f606f478b1e542fa27b</font></s> <font color="#7bf">0x223bbe40</font>,
nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil ]
<s><font color="#f88">0xd8ade416ad3b176182e25a41b00d69a4b599c45b6ca461672f0f3eff6541cb37</font></s> <font color="#7bf">0xb1f2922c</font> ⟶
[ nil, 0x528b,
<font color="#f88">0xd8ade416ad3b176182e25a41b00d69a4b599c45b6ca461672f0f3eff6541cb37</font> ]
<s><font color="#f88">0xfc52e45fbf2a477a6f22cf959269d9facf806ad820684f606f478b1e542fa27b</font></s> <font color="#7bf">0x223bbe40</font> ⟶
[ nil, 0x60cac43b2f4f000042e503b1c4d9018cc45483ecaa5af88e5532697c4dfdc606,
<font color="#f88">0xfc52e45fbf2a477a6f22cf959269d9facf806ad820684f606f478b1e542fa27b</font> ]
</pre>
The new 32-bit identifiers are highlighted in <font color="#7bf">blue</font>, and the ones that are no longer needed were <s><font color="#f88">struck through</font></s>.
Note that we still need to store one copy of the merkle hash, so we moved the hashes into the leaf nodes.
Also note that the new 32-bit identifiers need to be stored twice: once as a database key, and once in the parent node pointing to its child.
Still, we end up storing 320 bits (256 + 32 + 32) instead of 512 bits (256 + 256) per parent-child relationship.
But can we do better?
## Using tree paths as database identifiers
Ethereum's modified Patricia tree is designed to be structurally compact: most layers of the tree consist of branch nodes with a full set of 16 children each, and only the few lowest layers are sparse.
This is achieved by letting leaves that logically reside at the bottom of the tree sit higher up, holding the _remainder_ of their logical path inside their payload. The same goes for extension nodes.
Furthermore, the keys used to store data in the tree are computed as the _Keccak_ hash (the precursor to SHA-3) of an Ethereum account address. Hashing produces uniformly distributed keys, which has the nice side effect of preventing hotspots and sparse regions, leading to a balanced, compact tree.
<span style="color: #880">As of now the Ethereum world state tree should be around ~10 levels deep (I'll verify this and update this document). Levels 1-8 should be comprised exclusively of branch nodes with a full set of 16 children each. Levels 9-11 should become progressively sparser, and I'd expect to see extension nodes and leaves exclusively at that range.</span>
A path pointing to a node near the bottom of the tree will be ~10 nibbles long, i.e. 40 bits (5 bytes).
Now let's take a look at the example tree again, with the tree paths annotated:
```graphviz
digraph {
graph [];
node [shape=record width="4"];
root [label="{Branch (tree root)|Tree path: (empty)|{<child0>0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|<childf>f}}"];
node0 [label="{Leaf|Tree path: 0|Remainder path: d538189e68464913ba78b5f4\n93fd53397ab7798479002d356aeb3332873f38\l}"];
nodef [label="{Extension|Tree path: f|Remainder path: 3805459bbc1fc201514433d6\n1fe17a63fd7fe99eb9a0cdf05b2e0a4869581f}"];
nodef0 [label="{Branch|Tree path: f3|{0|<child1>1|2|3|<child4>4|5|6|7|8|9|a|b|c|d|e|f}}"];
nodef03 [label="{Leaf|Tree path: f31|Remainder path: nil}"];
nodef04 [label="{Leaf|Tree path: f34|Remainder path: nil}"];
root:child0 -> node0;
root:childf -> nodef;
nodef -> nodef0;
nodef0:child1 -> nodef03;
nodef0:child4 -> nodef04;
}
```
Note that a node's _logical path_ (i.e. its key) does not match its _tree path_. For example, leaf `B`'s logical path is `0d538189e68464913ba78b5f493fd53397ab7798479002d356aeb3332873f38` but its tree path is simply `0`. Branch `D`'s logical path is `f3805459bbc1fc201514433d61fe17a63fd7fe99eb9a0cdf05b2e0a4869581f` due to the extension node above it, but its tree path is `f3`.
If we use tree keys as our database identifiers, there's a very nice bonus: we can _deduce_ a child's tree key as follows:
* The tree key of a branch node's child is equal to the branch node's own tree key plus the child offset (`0`-`f`)
* The tree key of an extension node's (only) child is equal to the extension node's own tree key plus the first nibble of its remainder path.
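These two rules are trivial to implement; here's a sketch, with nibble paths as hex strings (the node representations are hypothetical):

```python
def branch_child_key(branch_key, child_offset):
    """Tree key of a branch child: parent key + the child's nibble (0-f)."""
    return branch_key + format(child_offset, "x")

def extension_child_key(ext_key, remainder_path):
    """Tree key of an extension's only child: parent key + the first
    nibble of the extension's remainder path."""
    return ext_key + remainder_path[0]
```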
Meaning, nodes that are persisted to the database do not need to contain their children database identifiers. Let's look at our previous example, now using tree keys:
<pre>
<s><font color="#f88">0x6be68666ce6ca5a1baf2779d77f56ec76be68666ce6ca5a1baf2779d77f56ec7</font></s> <font color="#7bf">0xf3</font> ⟶
[ nil,
<s><font color="#f88">0xd8ade416ad3b176182e25a41b00d69a4b599c45b6ca461672f0f3eff6541cb37</font></s>,
nil, nil,
<s><font color="#f88">0xfc52e45fbf2a477a6f22cf959269d9facf806ad820684f606f478b1e542fa27b</font></s>,
nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil ]
<s><font color="#f88">0xd8ade416ad3b176182e25a41b00d69a4b599c45b6ca461672f0f3eff6541cb37</font></s> <font color="#7bf">0xf31</font> ⟶
[ nil, 0x528b,
<font color="#f88">0xd8ade416ad3b176182e25a41b00d69a4b599c45b6ca461672f0f3eff6541cb37</font> ]
<s><font color="#f88">0xfc52e45fbf2a477a6f22cf959269d9facf806ad820684f606f478b1e542fa27b</font></s> <font color="#7bf">0xf34</font> ⟶
[ nil, 0x60cac43b2f4f000042e503b1c4d9018cc45483ecaa5af88e5532697c4dfdc606,
<font color="#f88">0xfc52e45fbf2a477a6f22cf959269d9facf806ad820684f606f478b1e542fa27b</font> ]
</pre>
The tree paths in this example are short due to it being a small tree. In practice we expect them to be ~10 nibbles long, i.e. 40 bits. Meaning, we can now store a parent-child relationship, including the child hash, in 296 bits (256 + 40) instead of 512 (256 + 256).
## Cons of tree paths as identifiers
### No state history
This technique has a huge drawback: it is impossible to store different versions of a certain tree node. The tree path always resolves to the same node. This implies that it's impossible to store Ethereum state at different points in time, as is required for archival nodes.
But since the target audience of this storage scheme is full nodes that only need the current world state, it's less of an issue.
### No merkle hash lookup
Another point to consider is that we can't look up a node by its merkle hash. <span style="color: #880">Luckily, as far as I'm aware the Ethereum RPC APIs do not use merkle hashes as identifiers, except perhaps the root hash, which is a special case.</span>
### No reuse of identical state
When merkle hashes are used as the key to look up a piece of data, a nice side effect is that if that piece of data is embedded in several locations in the tree, it will actually be stored just once. <span style="color: #880">I assume this isn't very common, and that the storage savings described in this document far outweigh this perk.</span>
## Pros of tree paths as identifiers
### No need for state pruning
Storing tree nodes by their Merkle hashes, and having nodes reference child nodes by their hash, produces a *directed acyclic graph*, not a tree. Meaning, a node can have multiple parents.
When the content of that node changes (either its own payload or the content of one of its children), so does its hash. When we persist that new hash and value, it would be tempting to delete ("prune") the previous hash and value from the database, but this can only be done if the old value is no longer referenced.
To prune old state that is no longer reachable from the current root node, we would have to employ reference counting, or some form of "garbage collection", or to clone the active state into a new database and delete the old, bigger state database.
By using tree paths we store an actual tree and not a graph, and don't need to deal with that problem.
### Quicker leaf lookup
It should be possible to reduce the number of hops needed to resolve a leaf. Given that leaves reside around 10 levels deep in the tree, and we're pretty sure level 8 branches have virtually no gaps, we could just fetch the ancestor branch at level 8 (by taking the first 8 nibbles of the leaf key), and continue downwards from there. Meaning, we'll reach the leaf in about 3 lookups instead of 10. In the rare case we don't find the ancestor branch, we can try again one or two levels higher up.
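A sketch of that shortcut, with `db` a hypothetical tree-path ⟶ node mapping:

```python
def deepest_stored_ancestor(db, key_nibbles, start_depth=8):
    """Jump straight to the ancestor branch at `start_depth`; if it's
    missing (a sparse region), retry one level higher, and so on.
    Returns the node and its depth, from which normal top-down
    traversal continues toward the leaf."""
    for depth in range(start_depth, -1, -1):
        node = db.get(key_nibbles[:depth])
        if node is not None:
            return node, depth
    return None, 0  # empty tree
```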
### Quicker loading of ancestor nodes
When we need to supply a Merkle proof for a certain node, we need to provide the hashes of that node along with all its ancestors, all the way to the tree root. Since we know the tree path of each ancestor, we could just perform a multi-get query or parallel queries to the database to fetch all the nodes at once (e.g. 10 queries assuming the tree is about 10 levels deep).
We also need this when a value is being updated in the tree; we need to update all ancestor Merkle hashes, so we need to read them beforehand.
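Since each ancestor's tree path is simply a prefix of the node's own path, the whole key set for the multi-get is known up front:

```python
def ancestor_tree_paths(tree_path):
    """Tree paths of all ancestors, root first (the empty path is the root)."""
    return [tree_path[:i] for i in range(len(tree_path))]

# All of these can be handed to a single multi-get (RocksDB exposes
# this as MultiGet) or fetched with parallel point queries.
```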
### Fast tree enumeration and bulk loading subtrees
If the tree is persisted using the popular RocksDB database, then the key ⟶ values will be stored by their tree path order, e.g.:
```
0 ⟶ ...
02 ⟶ ...
1 ⟶ ...
12345 ⟶ ...
1d ⟶ ...
8 ⟶ ...
fff ⟶ ...
```
RocksDB offers efficient iterators to read a range of keys. If we're interested in, say, the `1` sub-tree, then we can make a query for keys in the lexicographical range of `1` to `2`, meaning `12345` and `1d` would be included.
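Here's a sketch of such a range query, with an in-memory sorted list standing in for the RocksDB iterator:

```python
def subtree_items(sorted_items, prefix):
    """All (key, value) pairs whose tree path starts with `prefix`.

    Over lowercase hex-nibble keys, `prefix + "g"` is a cheap exclusive
    upper bound, since "g" sorts above every hex nibble.
    """
    lo, hi = prefix, prefix + "g"
    return [(k, v) for (k, v) in sorted_items if lo <= k < hi]
```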
This can be useful for tools, backups, traversal, node data format updates and various background jobs operating on the whole tree.
More importantly, Snap Sync RPC requests can be handled more efficiently, since they serve contiguous ranges of accounts data.
### Simplicity
When inspecting the database, we can readily know what account a value belongs to; the key _is_ the hashed ethereum address, and can be readily copy-pasted into tools such as https://etherscan.io for inspection.
## Foregoing leaf hashes
Let's take a look at our example sub-tree again:
```graphviz
digraph {
graph [];
node [shape=record];
nodef0 [label="{Branch|Tree path: f3|{0|<child1>1|2|3|<child4>4|5|6|7|8|9|a|b|c|d|e|f}}"];
nodef03 [label="{Leaf|Tree path: f31}"];
nodef04 [label="{Leaf|Tree path: f34}"];
nodef0:child1 -> nodef03;
nodef0:child4 -> nodef04;
}
```
The database representation is:
<pre>
f3 ⟶ [ (boolean flags denoting the presence of each child),
hash: <font color="#f88">0x6be68666ce6ca5a1baf2779d77f56ec76be68666ce6ca5a1baf2779d77f56ec7</font> ]
f31 ⟶ [ nil, 0x528b,
<font color="#f88">0xd8ade416ad3b176182e25a41b00d69a4b599c45b6ca461672f0f3eff6541cb37</font> ]
f34 ⟶ [ nil, 0x60cac43b2f4f000042e503b1c4d9018cc45483ecaa5af88e5532697c4dfdc606,
<font color="#f88">0xfc52e45fbf2a477a6f22cf959269d9facf806ad820684f606f478b1e542fa27b</font> ]
</pre>
Let's look at leaf node `f31`. Its payload is just `[ nil, 0x528b ]` but it also stores the merkle hash of that payload, `0xd8ade416ad3b176182e25a41b00d69a4b599c45b6ca461672f0f3eff6541cb37`
That hash is several times bigger than the leaf payload. It would make sense to not store the hash at all, and just compute it on-demand. This makes sense for all leaf nodes.
## Foregoing most hashes
Let's assume our example tree is actually a subtree in a much larger tree:
```graphviz
digraph {
graph [];
node [shape=record width="4.4"];
root [label="{Branch|Tree path: 7ff6149|{<child0>0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|<childf>f}}"];
node0 [label="{Leaf|Tree path: 7ff61490}"];
nodef [label="{Extension|Tree path: 7ff6149f}"];
nodef0 [label="{Branch (tree root)|Tree path: 7ff6149f3|{0|<child1>1|2|3|<child4>4|5|6|7|8|9|a|b|c|d|e|f}}"];
nodef03 [label="{Leaf|Tree path: 7ff6149f31}"];
nodef04 [label="{Leaf|Tree path: 7ff6149f34}"];
root:child0 -> node0;
root:childf -> nodef;
nodef -> nodef0;
nodef0:child1 -> nodef03;
nodef0:child4 -> nodef04;
}
```
We can easily forego storing the hashes of the leaves, but could we forego storing hashes in higher up nodes? Clearly we need to store _some_ hashes; we couldn't hash the whole Ethereum world state every time we need the root hash.
Assuming the tree is around ~10 levels deep, we could decide, for example, to not hash nodes at the 2 bottom levels, i.e. levels 9 and 10. Meaning, to compute the hash of a node on level 9, we need to read its children and hash them first. This could amount to a total of 17 hash operations.
If we'd decide to start from level 8 instead, we would have to hash a _sub-tree_ of up to 273 nodes. At level 7 it would be up to 4369 nodes, etc. The higher up we go, the better the storage savings, but the more I/O we use for reading nodes, and the more CPU we use to hash them.
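The node counts above are just geometric sums over a full 16-ary sub-tree:

```python
def full_subtree_nodes(levels_below):
    """Nodes in a full 16-ary sub-tree: the node itself plus
    `levels_below` completely full levels underneath it."""
    return sum(16 ** i for i in range(levels_below + 1))

# 1 level below -> 17 nodes, 2 levels -> 273, 3 levels -> 4369
```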
Out of these two drawbacks, the I/O is more of an issue; fetching 273 nodes from the database one by one by traversing the sub-tree top-to-bottom would be a lot costlier than hashing them.
But wait, remember we can [bulk-read whole subtrees](#Fast-tree-enumeration-and-bulk-loading-subtrees)? In the example above, we could bulk read the whole subtree by fetching all nodes in the range of `7ff6149` to (and not including) `7ff6150`.
Since RocksDB reads all nodes from one contiguous region, the overheads are minimal and the throughput is very high. It probably costs less to read a bulk of nodes than to hash them.
So how many levels at the bottom of the tree should not have hashes? <span style="color: #880">We need to get some stats about the density of the bottom levels of the tree. A rough guess would be that nodes at level 7 have around ~500 descendants (rarely getting past level 10)</span>, and that would probably be the sweet spot between storage savings and CPU overhead.
If we'd go with that, **we'd be storing roughly 470× fewer hashes in total!**
However, picking the right level at which not to store hashes can be tricky. As the Ethereum state grows, the tree will get deeper. For networks other than `mainnet`, the tree might be shallower. The tree might also not be perfectly balanced, so we might end up having no-hash sub-trees that are either too small or too big.
## Foregoing keys & reducing DB overheads
Until now we've ignored the fact that we need to be able to identify whether a node is a branch, an extension or a leaf. Let's use the following _type enum_:
0. Nil (no node)
1. Branch
2. Extension
3. Leaf
With the hashes overhead mostly out of the way, let's look at the key ⟶ value storage representation of our example sub-tree (above). I added a _type enum_ to each node, highlighted in <font color="#5c5">green</font>. Keys are highlighted in <font color="#7bf">blue</font>.
<pre>
<font color="#7bf">7ff6149</font> ⟶
[ <font color="#5c5">1</font>, (boolean flags denoting the presence of each child) ]
<font color="#7bf">7ff61490</font> ⟶
[ <font color="#5c5">3</font>,
0xd538189e68464913ba78b5f493fd53397ab7798479002d356aeb3332873f38,
0x22ef116dd16e260b637a1a0a9d060a3a ]
<font color="#7bf">7ff6149f</font> ⟶
[ <font color="#5c5">2</font>, 0x3805459bbc1fc201514433d61fe17a63fd7fe99eb9a0cdf05b2e0a4869581f ]
<font color="#7bf">7ff6149f3</font> ⟶
[ <font color="#5c5">1</font>, (boolean flags denoting the presence of each child) ]
<font color="#7bf">7ff6149f31</font> ⟶
[ <font color="#5c5">3</font>, nil, 0x528b ]
<font color="#7bf">7ff6149f34</font> ⟶
[ <font color="#5c5">3</font>, nil, 0x60cac43b2f4f000042e503b1c4d9018cc45483ecaa5af88e5532697c4dfdc606 ]
</pre>
Note that the overhead of keys is not insignificant. Could we eliminate them as well? Why, yes we could. We could pack data into _blobs_. For example:
<pre>
<font color="#666"> 1</font> <font color="#7bf">7ff6149</font> ⟶ [
<font color="#666"> 2</font> <font color="#5c5">1</font>,
<font color="#666"> 3</font> [ <font color="#5c5">3000000000000002</font> ],
<font color="#666"> 4</font> [ 0xd538189e68464913ba78b5f493fd53397ab7798479002d356aeb3332873f38,
<font color="#666"> 5</font> 0x22ef116dd16e260b637a1a0a9d060a3a ],
<font color="#666"> 6</font> [ 0x3805459bbc1fc201514433d61fe17a63fd7fe99eb9a0cdf05b2e0a4869581f ]
<font color="#666"> 7</font> [ <font color="#5c5">0300300000000000</font> ],
<font color="#666"> 8</font> [ nil, 0x528b ],
<font color="#666"> 9</font> [ nil, 0x60cac43b2f4f000042e503b1c4d9018cc45483ecaa5af88e5532697c4dfdc606 ]
<font color="#666">10</font> ]
</pre>
In this representation the nodes' payload is serialized recursively. This can be done using RLP, but more space savings could be obtained by using a serialization that's aware of the layout of branch, extension and leaf nodes.
Let's examine the lines above one by one:
1. The _blob_ is identified by the tree path of the root node stored in the blob
2. We need to know the type of the root node. It may not necessarily be a branch. In this case it is. This allows us to process the next entry correctly during deserialization.
3. This is the encoding of branch node `7ff6149`. It contains a leaf at offset `0` and an extension node at offset `f`. The rest are nils. This information is encoded as an array of 16 _type enum_ elements (<font color="#5c5">3000000000000002</font>). Note that the enum fits in 2 bits, hence a branch node's child types can be packed into 4 bytes; in that case it would be `0xC0000002`.
4. This and the following line are the payload of the leaf `7ff61490`
5. (continuation of the leaf payload above)
6. That's the payload of the extension node `7ff6149f`. Note that extension nodes always have one child, hence the next entry is its child. The child is always a branch.
7. Branch `7ff6149f3` contains leaf nodes at offsets `1` and `4`
8. That's the payload of leaf node `7ff6149f31`
9. That's the payload of leaf node `7ff6149f34`
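The 2-bit packing mentioned in step 3 can be sketched as follows (placing child #0 in the most significant bits is an assumption on our part, chosen to reproduce the `0xC0000002` example):

```python
# Pack a branch's 16 child type enums (0=nil, 1=branch, 2=extension,
# 3=leaf) into a 4-byte word, 2 bits per child, child #0 first.

def pack_child_types(child_types):
    word = 0
    for t in child_types:
        word = (word << 2) | t
    return word.to_bytes(4, "big")

# A leaf at offset 0 and an extension at offset f:
# pack_child_types([3] + [0] * 14 + [2]).hex() == "c0000002"
```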
By storing several nodes in a single blob, we eliminate the storage overhead of their tree paths. But even more importantly, we eliminate overheads caused by the database. For example, the database needs to store the length of keys and values. It may pad / align keys and values for better throughput. It may have indexes. By storing fewer entries, we incur less overhead.
Note that using blobs is only viable for full Ethereum nodes. Archival nodes need to store the Ethereum state at every point in time. To illustrate the problem, imagine that the payload of the Ethereum storage slot represented by the leaf node `7ff6149f34` is modified. The change is highlighted in <font color="#ea5">orange</font>:
<pre>
<font color="#666"> 1</font> <font color="#ea5">7ff6149'</font> ⟶ [
<font color="#666"> 2</font> <font color="#5c5">1</font>,
<font color="#666"> 3</font> [ <font color="#5c5">3000000000000002</font> ],
<font color="#666"> 4</font> [ 0xd538189e68464913ba78b5f493fd53397ab7798479002d356aeb3332873f38,
<font color="#666"> 5</font> 0x22ef116dd16e260b637a1a0a9d060a3a ],
<font color="#666"> 6</font> [ 0x3805459bbc1fc201514433d61fe17a63fd7fe99eb9a0cdf05b2e0a4869581f ]
<font color="#666"> 7</font> [ <font color="#5c5">0300300000000000</font> ],
<font color="#666"> 8</font> [ nil, 0x528b ],
<font color="#666"> 9</font> [ nil, 0x60cac43b2f4f000042e503b1c4d9018c<font color="#ea5">40e5712a5e5f6aca</font>5532697c4dfdc606 ]
<font color="#666">10</font> ]
</pre>
To hold both world states, we'd have to store the whole modified blob alongside the original. This is wasteful, since most data in the blob has not changed. Furthermore, the two blobs would need distinct identifiers.
## Introducing bottom blobs
How do we generate these blobs? What happens when nodes are created, modified and removed from certain positions in the tree? How do we adjust the blobs and maintain reasonable binary sizes?
Let's start by assuming we have a tree that has blobs at its bottom layer; the layers above it are not held in blobs. Meaning, every upper, _stand-alone_ node has its own entry in the database, that includes its hash. Let's also assume for this discussion that we wish our blobs to be around 50kb to 100kb big.
Initially, we start with an empty tree. Meaning, one blob holding inside it one branch node with zero children.
```mermaid
flowchart
n[(Root blob)]
```
As we add nodes to the tree, or update leaves' payloads with bigger ones, they're stored inside the blob, which grows larger.
```mermaid
flowchart
n[(Root blob\n..........\n..........\n....\n)]
```
Once the blob exceeds the upper limit (e.g. 100kb), we take the 16 (non-nil) children of the root node within that blob and store them as independent blobs. We then store that root node independently, replacing the original blob. In this example, let's assume a split factor of 4 instead of 16, for simplicity:
```mermaid
flowchart
n(Root node)
n --> n0[(Path: 0\n.....)]
n --> n1[("Path: 1\n.")]
n --> n2[(Path: 2\n........\n........\n.....)]
n --> n3[(Path: 3\n......)]
```
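The split rule can be sketched as follows. The blob layout and sizes here are purely illustrative, not a proposed on-disk format:

```python
SPLIT_LIMIT = 100_000  # ~100 KB upper bound per blob

def maybe_split(blob):
    """blob = {"path": ..., "size": ..., "children": {nibble: sub}}.
    If the blob is too big, its root's (non-nil) children become blobs
    of their own, and the root is stored as a stand-alone node."""
    if blob["size"] <= SPLIT_LIMIT:
        return [blob]  # unchanged
    standalone_root = {"path": blob["path"], "size": 0, "children": {}}
    child_blobs = [{"path": blob["path"] + nibble, **sub}
                   for nibble, sub in blob["children"].items()]
    return [standalone_root] + child_blobs
```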
As the tree grows, blobs split and are pushed downwards. The tree portion above the blobs gets deeper.
```mermaid
flowchart
n(Root node)
n --> n0[(Path: 0\n.....)]
n --> n1[("Path: 1\n.")]
n --> n2(Path: 2)
n --> n3[(Path: 3\n......)]
n2 --> n10[(Path: 20\n..)]
n2 --> n11[(Path: 21\n........\n..)]
n2 --> n12[(Path: 22\n....)]
n2 --> n13[(Path: 23\n.)]
```
Stand-alone nodes in the upper portion of the tree have a logical size equal to their own payload plus the sizes of all their descendants, whether those are stored in blobs or not, i.e. the size of the whole sub-tree.
When the size of a stand-alone node drops below the lower limit (e.g. 50kb), we merge it along with its descendants into one blob, and delete its descendants.
The reason we have different thresholds for splitting blobs or merging them is to prevent frequent split/merge events. For example, if we were to use a single threshold, e.g. 100kb, and some blob would reach that size, it would trigger a split, but then if we were to delete a value from that sub-tree it might get immediately merged again.
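The two thresholds act as hysteresis; a sketch of the decision rule:

```python
SPLIT_AT = 100_000  # split a blob that grows past ~100 KB
MERGE_AT = 50_000   # merge a stand-alone sub-tree shrinking below ~50 KB

def blob_action(subtree_size, is_blob):
    """The gap between the two limits prevents a sub-tree hovering
    around one threshold from splitting and merging repeatedly."""
    if is_blob and subtree_size > SPLIT_AT:
        return "split"
    if not is_blob and subtree_size < MERGE_AT:
        return "merge"
    return "keep"
```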
Earlier in this article, before discussing blobs, we contemplated reading sub-trees in bulk, but were hesitant as to the exact depth in the tree we should employ this, with the risk of reading too much data or too few. With blobs, this concern is removed; there are clear size boundaries, and the blobs adjust to the depth of the tree dynamically.
Furthermore, operators running our software might pick a blob size that's right for them. A smaller size will lead to more storage overheads but also to quicker processing of transactions due to lower I/O, which might be beneficial for miners, for example.
## Keeping track of the size of sub-trees
You might have asked yourself how we intend to know the size of a sub-tree. One way would be to read all of it and sum it, but that's expensive, and we might need to evaluate that size frequently.
Hence, we could just store the size on upper nodes (above the blobs layer). Whenever a value is updated in the tree, we need to recompute the hashes up the tree and persist them, so we might as well update the sizes during that process.
For example, if we added a new leaf node that has a certain payload size of X, then we'll add that amount to the size of all upper, stand-alone ancestors of that leaf.
We do not need to store sizes for nodes persisted inside blobs, as they can be readily computed when the whole blob is read.
The overhead of storing these sizes is small. First off, it's only needed for upper nodes which are a small subset of the whole logical tree. Second, if we use a variable-length integer encoding (e.g. 7-bit), then the root node might need, say, 5-6 bytes to encode its size, but nodes lower down the tree will only require ~3 bytes, and they're the vast majority.
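A 7-bit variable-length encoding of the kind mentioned above (LEB128-style, least significant group first) can be sketched as:

```python
def encode_varint(n):
    """7 payload bits per byte; the high bit marks 'more bytes follow'."""
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)

# 3 bytes cover sizes up to 2**21 - 1 (~2 MB); even a multi-gigabyte
# root size needs only 5-6 bytes.
```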
## Introducing top blobs
(TBD)
## Value compression
(TBD)
## Code compression
(TBD)
## Unifying accounts with their code and storage
The Ethereum world state tree is a bit different from the other tree instances: leaves always encode an account, i.e. they contain the account's balance, nonce, the hash of the contract code, and the hash of the contract state. The contract code and state are stored *separately*, outside of the world state tree.
For efficiency, we'll store accounts along with their contract code and state. <span style="color: #880">This is assuming that particularly large contracts are rare and aren't much bigger than the intended size of our data blobs. Pending verification.</span>
## Verkle support
Up until now we've been discussing how to store state in a form compatible with the Merkle Patricia Tree (MPT). But a new structure, called a Verkle tree, is being introduced.
Here's how it compares to MPT:
* Branch nodes contain 256 children instead of 16
* There is no concept of "nibbles"; a path down the tree is a simple sequence of bytes.
* There are no leaf and extension nodes
* Instead, there's a leaves node that contains (up to) 256 values and a 'stem' (a partial path) that acts similar to an MPT extension node.
* Instead of merkle hashes, vector commitments are used. Leaves nodes hold 3x 32-byte commitments (usually called C, C1, C2). Branch nodes hold one 32-byte commitment.
* When the child of a branch node is modified, there is no need to obtain the commitments of all children in order to compute that branch's new commitment. Instead, the "delta" between the child's previous and current commitment can be "added" to the branch's own commitment.
* Commitments are much more expensive to compute than Merkle hashes.
* When updating commitments in the tree bottom-up, the CPU cost can be reduced by bulk-updating a whole depth layer in the tree at a time (e.g. updating all modified nodes at depth 6, then all the modified nodes at depth 5, etc).
* Contracts code and contracts storage are held in the world state instead of separately as in MPT.
* The path of an account in the tree is the Pedersen hash of its Ethereum address (padded with 0's to 32 bytes), not the Keccak hash.
If all goes well, sometime during 2024 Ethereum clients will switch to using MPT and Verkle trees side-by-side, such that:
* All write operations are done (logically) to the Verkle tree. Commitments are updated.
* Read operations are first attempted from the Verkle tree. If the value wasn't found, it's read from the MPT.
* The MPT is frozen henceforth and its root hash will no longer change.
Later on, state will be transferred from the frozen MPT to the Verkle tree, over a certain period of time (due to the large amount of data). Subsequently, the MPT can be dropped altogether by non-archival nodes.
All the storage optimizations mentioned in this document are valid for Verkle as well. However, the cost of computing commitments is far higher than that of Merkle hashes. Whereas we could conceivably read, say, a 1000-node subtree from a *bottom blob* and recompute its Merkle hashes with little impact, we can't do the same for vector commitments.
Therefore, we'll have to either employ smaller *bottom blobs*, or to store the commitments of the upper nodes inside the blob. <span style="color: #880">This will be finalized after we perform a POC and obtain some stats</span>.
Since Verkle Leaves nodes hold 256 values, and accounts are leaves, each account has 256 data slots available to it. Each data slot can hold up to 32 bytes.
Accounts are stored in a more free-form way in Verkle. The nonce and balance each occupy one data slot. There are additional header slots, and not all of them need to be present. Contract code is stored in slots 128-255; if that's not enough, it spills over into additional batches of 256 slots at pseudo-random locations in the tree (outside the range of hashed Ethereum addresses). The same goes for contract state: it starts out in slots 64-127 and spills over to other regions if it's bigger.
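The slot layout just described (slot numbers taken from this section and its diagram; the exact header assignments follow the Verkle EIP drafts and may still change) can be captured as a sketch:

```python
# Hedged sketch of the Verkle account slot layout described above.
VERSION_SLOT, BALANCE_SLOT, NONCE_SLOT = 0, 1, 2
CODE_HASH_SLOT, CODE_SIZE_SLOT = 3, 4
STORAGE_SLOTS = range(64, 128)   # first 64 storage slots live in-leaf
CODE_SLOTS = range(128, 256)     # first 128 code chunks live in-leaf

def code_chunk_slot(chunk_index):
    """Slot for a code chunk; chunks beyond the first 128 spill into
    separate 256-slot leaves elsewhere in the tree."""
    if chunk_index < 128:
        return CODE_SLOTS.start + chunk_index
    return None  # spilled: lives in another region of the tree
```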
Here's an example Verkle structure with two accounts. Account A only has a balance, and its address hashes to `0x0404040404040404040404040404040404040404040404040404040404040404`. Account B is a contract with a hefty amount of code and state that "spills over" into other regions of the tree. Its address hashes to `0xabababababababababababababababababababababababababababababababab`.
```graphviz
digraph {
graph [ranksep="1.5"]
node [shape=record]
root [label="{Branch (tree root)|{0x00|....|<child4>0x04|...|<childab>0xab|...|<childfe>0xfe|<childff>0xff}}"]
node4 [width=4 label="{Leaves (account A)|Stem: 0x040404040404040404040404040404\l\ 040404040404040404040404040404\l|{0x00\nVersion|0x01\nBalance}}"]
nodeab [label="{Leaves (account B)|Stem: 0xababababababababababababababab\l\ ababababababababababababababab\l|{0x00\nVersion|0x01\nBalance|0x02\nNonce|0x3\nCode\nhash|0x04\nCode\nsize|0x40 .. 0x7f\nStorage|0x80 .. 0xff\nCode}}"]
nodefe [label="{(branches)}"]
nodeff [label="{(branches)}"]
nodefe0 [label="{Leaves|{0x00 .. 0x92\nAdditional code of account B}}"]
nodeff0 [label="{Leaves|{0x00 .. 0x17\nAdditional storage of account B}}"]
root:child4 -> node4
root:childab -> nodeab
root:childfe -> nodefe
root:childff -> nodeff
nodefe -> nodefe0
nodeff -> nodeff0
}
```
One unfortunate consequence of the Verkle design is that an account cannot be read as one sub-tree; there might be code and state fragments stored elsewhere. Furthermore, these fragments cannot be readily mapped back to their original accounts, meaning we don't really know who these pieces of data "belong to".
## Compatibility with ETH Json-RPC APIs
The proposed storage scheme allows looking up accounts by their 32-byte hashed address, or by their 20-byte address (which we then hash ourselves). The same goes for storage slots. We do not support looking up contract code by the hash of the code, an account storage tree by its root hash, or an account by the hash of its payload + nonce + storage root hash + code hash. But none of that is required by the ETH APIs.
The `eth_getBalance`, `eth_getTransactionCount`, `eth_getCode` and `eth_call` APIs take 20-byte account address(es) in their original, pre-hash form. The `eth_getStorageAt` and `eth_getProof` APIs additionally take a storage slot ID, likewise pre-hash.
## Snap sync support
Snap-sync APIs serve batches of accounts and account states, ordered by their logical paths. This is exactly the way we intend to store data, so there's great synergy between the two.
While syncing, we'll be reading batches of consecutive accounts and writing them directly into blobs, meaning the write performance will be excellent. One minor complication is that snap sync serves accounts, state and code separately. And since we intend to store them together, we'll have to follow up every call to `GetAccountRange` with calls to `GetByteCodes` and `GetStorageRanges` and persist the accounts only when we have their states and bytecodes.
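The follow-up calls described above could be orchestrated roughly as follows. The peer and store objects are hypothetical in-memory stand-ins; the method names mirror the snap protocol messages:

```python
# Sketch: every GetAccountRange call is followed by GetByteCodes /
# GetStorageRanges so accounts are persisted together with their code
# and storage, as required by the blob layout.
from dataclasses import dataclass

@dataclass
class Account:
    address_hash: bytes
    code_hash: bytes = None    # None => no code
    has_storage: bool = False

def sync_account_range(peer, store, origin, limit):
    accounts = peer.get_account_range(origin, limit)        # GetAccountRange
    codes = peer.get_byte_codes(                            # GetByteCodes
        [a.code_hash for a in accounts if a.code_hash])
    storages = peer.get_storage_ranges(                     # GetStorageRanges
        [a.address_hash for a in accounts if a.has_storage])
    for a in accounts:
        # Persist only once code and storage are in hand, since we store
        # them inside the same blob as the account itself.
        store[a.address_hash] = (a,
                                 codes.get(a.code_hash),
                                 storages.get(a.address_hash))

class FakePeer:  # hypothetical in-memory peer for illustration
    def get_account_range(self, origin, limit):
        return [Account(b"\x01" * 32, code_hash=b"\xaa" * 32, has_storage=True)]
    def get_byte_codes(self, hashes):
        return {h: b"\x60\x00" for h in hashes}
    def get_storage_ranges(self, addr_hashes):
        return {h: {b"\x00" * 32: b"\x2a"} for h in addr_hashes}

store = {}
sync_account_range(FakePeer(), store, origin=b"\x00" * 32, limit=b"\xff" * 32)
```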
While serving snap-sync to other peers, the read performance will be excellent as well, since we'll typically load a blob and serve a range of accounts from within that blob, or maybe across two blobs.
## Storage performance guesstimates
(TBD)
## Naming
This type of tree is neither Merkle nor Verkle (it's both). More generally, it stores _attestations_. It's not (just) hexary. It is a Patricia tree, also called a Radix tree. And its main characteristic is that it uses _blobs_. Hence, Blobs & Attestations Radix Tree, or BART for short.
# Specification
## Encoding of MPT nodes
An Ethereum MPT is comprised of 3 types of nodes: leaf nodes, extension nodes and branch nodes.
They're typically serialized using RLP. RLP is a schema-less serialization method that doesn't assume anything about the structure of the data. We can attain smaller footprints by using a serialization that *is* aware of the schema.
Leaf nodes in the Ethereum MPT *world state* are always accounts, meaning they encode the balance and nonce, and the hash of contract code and state, which are stored separately. We choose to store MPT accounts as a distinct data type, including their code and state.
To support deserialization, we need to store the data type along with the payload. We encode the data type in the first two bits of the payload, as follows: 0 = Leaf, 1 = Account, 2 = Extension, 3 = Branch.
Here's how each of these data types is serialized:
### MPT Leaf
* Meta-data byte:
* Data type enum, 2 bits, '00'
* 6 bits: length of extension path nibbles (up to 64)
* Extension path. Last byte can be half-used (one out of two nibbles) in case the length above is odd.
* Length of payload, 7-bit encoding
* Payload
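The "7-bit encoding" referenced throughout these layouts is a variable-length integer with 7 payload bits per byte plus a continuation bit. The document doesn't pin down the byte order; this sketch assumes little-endian base-128 (LEB128-style):

```python
# Variable-length integer codec: 7 payload bits per byte, high bit set
# means more bytes follow. Byte order (little-endian here) is an
# assumption; the document only names the scheme "7-bit encoding".
def encode_7bit(n: int) -> bytes:
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_7bit(data: bytes) -> tuple:
    """Return (value, number of bytes consumed)."""
    value = shift = pos = 0
    while True:
        byte = data[pos]
        value |= (byte & 0x7F) << shift
        pos += 1
        if not byte & 0x80:
            return value, pos
        shift += 7

assert encode_7bit(300) == b"\xac\x02"
assert decode_7bit(encode_7bit(10**18))[0] == 10**18
```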
### MPT Account
* Meta-data byte:
* Data type enum, 2 bits, '01'
* 1 bit: nonce == 0
* 1 bit: balance == 0
* 1 bit: code size == 0
* 1 bit: storage size == 0
* 2 bits: for future use; currently 0's
* 1 byte: length of extension path nibbles (up to 64)
* Extension path. Last byte can be half-used (one out of two nibbles) in case the length above is odd.
* (optional) Nonce, 7-bit encoding
* (optional) Balance, 7-bit encoding
* (optional) length of code, 7-bit encoding
* (optional) Code
* (optional) size of storage blob, 7-bit encoding
* (optional) Storage tree, serialized as a [blob](#Encoding-of-blobs)
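A sketch of packing the Account meta-data byte from the field list above. The placement of each flag within the byte is an assumption; the document only fixes the field order:

```python
# Pack the MPT Account meta-data byte: type tag '01' in the two high
# bits, then four presence flags. Bit positions are an assumption.
def account_meta_byte(nonce, balance, code, storage) -> int:
    meta = 0b01 << 6                  # data type enum '01' = Account
    meta |= (nonce == 0) << 5         # nonce-is-zero flag
    meta |= (balance == 0) << 4       # balance-is-zero flag
    meta |= (len(code) == 0) << 3     # code-is-empty flag
    meta |= (len(storage) == 0) << 2  # storage-is-empty flag
    return meta                       # low 2 bits reserved, left as 0

# An EOA with a balance: nonce present, no code, no storage.
assert account_meta_byte(nonce=7, balance=10**18, code=b"", storage=b"") == 0b01001100
```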
### MPT Extension
* Meta-data byte:
* Data type enum, 2 bits, '10'
* 1 bit: whether the child hash and sub-tree size are stored
* 1 bit: whether the child payload is stored subsequently ([explanation](#Encoding-of-blobs))
* 4 bits: for future use; currently 0's
* 1 byte: length of extension path nibbles (up to 64)
* Extension path. Last byte can be half-used (one out of two nibbles) in case the length above is odd.
* (optional) 32 bytes, hash of child branch node
* (optional) 7-bit encoding of the sub-tree size
* (note: address of child Branch == extension's own key + first extension path nibble. this isn't serialized.)
### MPT Branch
* Meta-data byte:
* Data type enum, 2 bits, '11'
* 1 bit: whether the children hashes and sub-tree size are stored
* 1 bit: whether the children payloads are stored subsequently ([explanation](#Encoding-of-blobs))
* 4 bits: whether there's a non-nil child in segment offsets 0-3, 4-7, 8-11, 12-15
* 0-2 bytes: for each of the 4 bits above that are on, 4 bits denoting which children are non-nil within that segment.
* (optional) 32 bytes hashes for each non-nil child
* (optional) 7-bit encoding of the sub-tree size
* (note: the address of each child is deduced as the branch's own key + child offset. this is not serialized)
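The two-level child bitmap of a Branch node can be sketched as follows; bit ordering within the masks is an assumption:

```python
# Encode which of a branch's 16 children are non-nil: a 4-bit segment
# mask (part of the meta-data byte), then one 4-bit group per occupied
# segment, packed two groups per byte (0-2 bytes total).
def encode_child_bitmap(children_present: list) -> tuple:
    assert len(children_present) == 16
    segment_mask = 0
    groups = []
    for seg in range(4):
        bits = 0
        for i in range(4):
            if children_present[seg * 4 + i]:
                bits |= 1 << i
        if bits:
            segment_mask |= 1 << seg
            groups.append(bits)
    packed = bytearray()
    for i in range(0, len(groups), 2):
        hi = groups[i]
        lo = groups[i + 1] if i + 1 < len(groups) else 0
        packed.append((hi << 4) | lo)
    return segment_mask, bytes(packed)

present = [False] * 16
present[0x1] = present[0x4] = True     # children at offsets 1 and 4
mask, packed = encode_child_bitmap(present)
assert mask == 0b0011                  # segments 0-3 and 4-7 occupied
assert len(packed) == 1                # two 4-bit groups fit in one byte
```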
## Encoding of Verkle nodes
The type of the node can be determined by its first bit: 0 = Leaves, 1 = Branch. Note: in the future we might want a distinct node type for accounts, as a specialized kind of Leaves node.
### Verkle Leaves
* Meta-data byte:
* Data type enum, 1 bit, '0'
* 2 bits: for future use; currently 0's
* 1 bit: whether the commitments were stored
* 4 bits: whether there's one or more non-nil values somewhere in the offsets ranges of 0-63, 64-127, 128-191, 192-255. One bit for each range.
* for each of the 4 bits above that are on, 8 bits (1 byte) that denote which sub-ranges have a non-nil value, e.g. 1 bit for 0-7, 1 bit for 8-15, etc.
* for each of the bits above that are on, 8 bits (1 byte) that denote which concrete offsets have a non-nil value.
* for each non-nil value, 1 byte to denote the length of the value, then the value (up to 32 bytes long)
* 1 byte denoting the length of the stem (the remainder path between the parent node and this one -- same as an MPT extension node)
* up to 31 bytes for the stem
* If commitments are stored:
* 32 bytes for the C commitment
* (optional) 32 bytes for the C1 commitment
* (optional) 32 bytes for the C2 commitment
* note: the presence of C1/C2 can be deduced from the offsets of non-nil values
Note: the maximum serialized length of that record is 8613 bytes.
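The quoted maximum can be verified by summing the worst case of every field: all 256 values present at 32 bytes each, all bitmap levels fully populated, a maximal stem, and all three commitments stored:

```python
# Worked size check for the 8613-byte maximum quoted above.
META        = 1               # meta-data byte
L1_BITMAPS  = 4               # all 4 range bits on => 4 sub-range bytes
L2_BITMAPS  = 32              # all 32 sub-range bits on => 32 offset bytes
VALUES      = 256 * (1 + 32)  # length byte + up to 32 bytes, per value
STEM        = 1 + 31          # stem length byte + up to 31 stem bytes
COMMITMENTS = 3 * 32          # C, C1, C2
assert META + L1_BITMAPS + L2_BITMAPS + VALUES + STEM + COMMITMENTS == 8613
```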
### Verkle Branch
* Meta-data byte:
* Data type enum, 1 bit, '1'
* 1 bit: for future use; currently 0
* 1 bit: whether the commitment and sub-tree size are stored
* 1 bit: whether the children payloads are stored subsequently ([explanation](#Encoding-of-blobs))
* 4 bits: whether there's one or more non-nil child somewhere in the offsets ranges of 0-63, 64-127, 128-191, 192-255. One bit for each range.
* for each of the 4 bits above that are on, 8 bits (1 byte) that denote which sub-ranges have a non-nil child, e.g. 1 bit for 0-7, 1 bit for 8-15, etc.
* for each of the bits above that are on, 8 bits (1 byte) that denote which concrete offsets have a non-nil child.
* (optional) 32 bytes for the commitment
* (optional) 7-bit encoding of sub-tree size
## Encoding of blobs
A blob is a chunk of the tree that's stored together. It can contain a whole tree, or just the upper portion of a tree, or just a sub-tree extending to the leaves, or some portion of the tree in-between. It always starts from one node, that is the "root" node within that blob.
The binary format of a blob is simply a sequence of serialized [MPT nodes](#Encoding-of-MPT-nodes) or [Verkle nodes](#Encoding-of-Verkle-nodes), depth-first. For example, given this sub-tree:
```graphviz
digraph {
node [shape=record];
root [label="{(A) MPT Branch|{<child0>0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|<childf>f}}"];
node0 [label="(B) MPT Extension"];
nodef [label="(F) MPT Leaf"];
node00 [label="{(C) MPT Branch|{0|<child1>1|2|3|<child4>4|5|6|7|8|9|a|b|c|d|e|f}}"];
node003 [label="(D) MPT Leaf"];
node004 [label="(E) MPT Leaf"];
root:child0 -> node0;
root:childf -> nodef;
node0 -> node00;
node00:child1 -> node003;
node00:child4 -> node004;
}
```
...the blob would contain the following payloads: `(A)(B)(C)(D)(E)(F)`. Note how `(F)` is serialized last and not adjacent to `(B)`, due to the depth-first ordering. Also note that we don't need delimiters between the payloads; the length of each payload is known.
Nodes that can have one or more children (i.e. MPT or Verkle branches, and MPT extensions) store in their payload an indication whether these children's payloads are serialized after their own payload. Let's annotate the previous example:
`(A; child payloads? true)(B; child payload? true)(C; child payloads? true)(D)(E)(F)`
If we wanted our blob to only store nodes `(A)`, `(B)` and `(F)`, then its content would look like that:
`(A; child payloads? true)(B; child payload? false)(F)`
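The depth-first layout and the "child payloads stored?" flag can be sketched with toy nodes, where labels stand in for the real serialized payloads:

```python
# Depth-first blob serialization sketch. A node's flag records whether
# its children's payloads follow inside the blob (uniformly plural here,
# also for single-child extension nodes).
class Node:
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

def serialize_blob(node, in_blob):
    """`in_blob` is the set of node labels included in this blob."""
    if node.children:
        flag = all(c.label in in_blob for c in node.children)
        out = [f"({node.label}; child payloads? {str(flag).lower()})"]
    else:
        out = [f"({node.label})"]
    for child in node.children:
        if child.label in in_blob:
            out += serialize_blob(child, in_blob)
    return out

# The example sub-tree from above: A -> (B -> C -> (D, E), F).
D, E = Node("D"), Node("E")
C = Node("C", [D, E])
B = Node("B", [C])
F = Node("F")
A = Node("A", [B, F])

full = serialize_blob(A, {"A", "B", "C", "D", "E", "F"})
# depth-first: A, B, C, D, E, then F (not adjacent to B)
partial = serialize_blob(A, {"A", "B", "F"})
# B's child C is outside the blob, so B's flag is false
```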
## RocksDB structure
We store different data types in RocksDB. To prevent key collisions, each data type has its own 1-byte prefix (a "namespace").
Here's the structure of each data type. In the future we might add blocks, logs, etc, each with its own prefix.
### MPT world state, key has even-length nibbles (prefix: 0x00)
Paths in MPT trees are represented using 4-bit nibbles. For efficiency, we store two nibbles per byte. However, in case there's an odd number of nibbles in a path, we can't distinguish it from a similar path with one more nibble; the keys will collide. We use a prefix to tell them apart.
Keys in this namespace encode a sequence of nibbles (2 nibbles per byte) representing a path down the hexary tree. The number of nibbles is even. The value is an [MPT node](#Encoding-of-MPT-nodes), and the concrete type can be determined by the first two bits.
Note that the tree root has no path; its key is simply the byte 0x00.
### MPT world state, key has odd-length nibbles (prefix: 0x01)
Keys in this namespace encode a sequence of nibbles (2 nibbles per byte) representing a path down the hexary tree. The number of nibbles is odd, meaning the last half-byte should be ignored. The value is an [MPT node](#Encoding-of-MPT-nodes), and the concrete type can be determined by the first two bits.
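A sketch of the key scheme for both namespaces, with the prefix values given above:

```python
# Pack a nibble path into a RocksDB key: two nibbles per byte, with the
# namespace prefix disambiguating odd-length paths whose final half-byte
# is padding.
EVEN_PREFIX, ODD_PREFIX = 0x00, 0x01

def mpt_key(nibbles: list) -> bytes:
    if len(nibbles) % 2:                 # odd: pad last byte, use 0x01
        prefix, padded = ODD_PREFIX, nibbles + [0]
    else:                                # even: use 0x00
        prefix, padded = EVEN_PREFIX, nibbles
    packed = bytes((padded[i] << 4) | padded[i + 1]
                   for i in range(0, len(padded), 2))
    return bytes([prefix]) + packed

assert mpt_key([]) == b"\x00"            # tree root: just the prefix byte
assert mpt_key([0xd, 0x5]) == b"\x00\xd5"
assert mpt_key([0xd, 0x5, 0x3]) == b"\x01\xd5\x30"  # same bytes, odd namespace
```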
### Verkle world state (prefix: 0x02)
Keys in this namespace are paths down the Verkle tree structure. Each byte represents the offset at a certain depth. The value is a [Verkle node](#Encoding-of-Verkle-nodes), and the concrete type can be determined by the first bit.
# TBD & concerns to handle
### Requirements from Nimbus code
Go over the `CoreDbBaseFns` interface and the methods it exposes, directly and indirectly. Check which ones are in use and for what purpose. Verify that the proposed design can support them.
### Diffs list, rollback, chain oscillations
We persist the world state when it's finalized, i.e. ~75 blocks old. The state changes accumulated during that window must be cached in memory for the client to function (e.g. act as a validator).
Furthermore, we need to support competing block chains, i.e. a situation where blocks are produced in two or more chains concurrently, and are all being processed and validated by our client. This can be achieved by holding several cached world states, or making use of diff layers, or by quickly undoing changes from one chain and redoing changes from the other, for example.
Note that the chains cannot have diverged from a block earlier than the one that has been finalized and persisted.
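One possible shape for these in-memory diff layers, as a sketch: each non-finalized block gets its own change-set, reads scan the layers newest-first, and supporting competing chains then amounts to keeping one layer stack per chain head:

```python
# Layered lookups over the persisted state: each non-finalized block
# keeps an in-memory diff layer; reads scan layers newest-first before
# hitting the persisted base.
class DiffLayers:
    def __init__(self, persisted):
        self.persisted = persisted   # finalized, on-disk state
        self.layers = []             # one dict of changes per block

    def new_block(self):
        self.layers.append({})

    def put(self, key, value):
        self.layers[-1][key] = value

    def get(self, key):
        for layer in reversed(self.layers):  # newest block first
            if key in layer:
                return layer[key]
        return self.persisted.get(key)

state = DiffLayers(persisted={"a": 1})
state.new_block(); state.put("a", 2)
state.new_block(); state.put("b", 3)
assert state.get("a") == 2 and state.get("b") == 3
```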
### Network doesn't reach finalization for a long time
It may be possible for the network to not reach finality for a long time. During that time, the in-memory "diff layers" described above will grow in size. We need to handle that case such that the client can keep functioning as best it can. However, considering this is an exceedingly rare event, there's no need to handle it perfectly.
### Caching
It makes sense to cache parts of the Merkle and Verkle trees. For example, the upper layers that are accessed frequently.
### Serving the state of the last 128 blocks
Full nodes are expected to provide information/state from the past 128 blocks.
### Read/write performance worst case
A worst-case transaction that performs 5000 updates touching ~6000 blobs at maximum capacity might cause delays both when reading the data and when persisting it. This can slow down a miner / validator enough for its proposed block to be ignored, or for its attestation to arrive too late, losing profit. Investigate how this can be minimized or avoided.
### Real-world POC
* Install and run Erigon as a full node, sync latest state
* Export state
* Implement a basic version of _bottom blobs_ and store the imported world state. Compare storage requirements with Erigon.
### Tuning
* Analyze ethereum world state and get hard numbers (depth, sparsity etc)
* Tune parameters for optimal balance of storage, I/O performance and CPU usage.