[TOC]
# Merklized Data
This document is intended to provide a summary of Merkle tree technology employed by various blockchain projects.
## Introduction
To assess how Merkle trees are used in various blockchain projects, we consider the following questions:
- What data is Merklized?
- When are Merkle trees stored in-memory? When are Merkle trees persisted?
- What Merkle tree algorithms are employed? (E.g., dense or sparse, balanced or MMR, Patricia trees, Jellyfish trees, etc.)
*Note: This form of analysis may be considered a form of reverse engineering. We explore project documentation and source code to understand how Merkle trees are used in these contexts so that we may better understand modern applications and implementations of Merkle trees and how to apply these tools in the context of our own blockchain.*
Generally speaking, there are two main applications of Merklized data:
1. Calculating a global state hash
2. Calculating block metadata (headers)
Layer 1 blockchains that calculate a global state tree, such as Ethereum, use a Merkle tree to calculate a hash of the global state. Furthermore, Ethereum clients like Geth and Erigon store the Merkle tree data (leaves and internal nodes) on disk to facilitate this calculation. As per the Ethereum specification, the global state hash is the Merkle root calculated using a Patrica Merkle tree, where the leaf data comprises all Ethereum addresses (see [Notes on Ethereum](#Notes-on-Ethereum)). The global state hash must be included in every Ethereum account and block header.
For both Layer 1 and Layer 2 blockchains, block headers often contain Merkle roots to provide a summary of the block's data (a Merkle root can be used to provide a unique fingerprint of a data set instead of storing the data in its entirety) as well a verifiable proof of data authenticty (the Merkle root calculated from the block's raw data must match the expected Merkle root). More concretely, a block header may contain, in addition to a state root, Merkle roots of the block's transactions and of the block's transaction receipts.
A review of existing projects suggest the following patterns:
- The global state tree (if present) is persisted to disk. The global state tree, including all internal intermediate nodes, is saved to persistent memory since generating the state tree from leaf data is slow and expensive. When using a sparse Merkle tree algorithm, key-value data (such as Ethereum addresses and accounts) can be inserted or removed with random access, and the new Merkle root can be calculated with logarithmic time complexity.
- Block metadata trees, such as a block's transaction Merkle tree and receipts Merkle tree, are calculated in memory. These Merkle trees are small enough that they can be constructed on demand entirely from leaf data. Some implementations use densely stored Merkle trees (i.e. binary) since the data has deterministic ordering. Other implementations use key-value data stores, where the key is the data's hash, suggesting a sparse Merkle tree.
- Some projects, such as [Diem](#Diem), use a single database to store both ledger data and Merklized ledger data. Other projects, such as [Aptos](#Aptos) use separate databases for ledger data and Merklized ledger data.
## Fuel
### Merkle Data
This section aims to summarize Fuel's requirements for Merklized data and proposes how to achieve these requirements. (This section is an ongoing work in progress.)
Fuel's block header definition is undergoing formal specification, wtih a working draft [here](https://github.com/FuelLabs/fuel-specs/issues/284). The header definition currently includes the following Merkle roots:
- `blocksRoot`*: Root of the chain of blocks (BMT)
- `transactionsRoot`: Root of the block's transactions (BMT)
- `messagesRoot`*: Root of the block's message IDs (BMT)
**These names are not formalized and are used here for communication purposes only.*
`blocksRoot`: Unlike other chains that just include the hash of the previous block into the current block header, Fuel will use a Merkle mountain range of block IDs and store the current root on each header. This allows L1 proofs about data in previous block headers without needing to maintain the state of all headers in L1.
Because the `blocksRoot` spans all blocks currently in the chain, and not just the current block, we can build the blocks Merkle tree in persistent storage. By using persistent storage, we can incrementally build upon intermediate states of the blocks Merkle tree at each block, rather than building it from scratch. This is similar to how Ethereum clients incrementally build the global state root.
For the calculation of the `transactionsRoot` and `messagesRoot` stored in Fuel block headers, we can use ephermal in-memory binary Merkle trees. This is congruent with the pattern of Merkle tree applications described above.
Fuel will also Merklize:
- Contract state
- Contract balances
### Merkle Trees
Fuel currently implements three types of Merkle trees:
1. Sparse Merkle Tree (SMT)
2. Binary Merkle Tree (BMT)
3. Binary Merkle Sum Tree (BMST)
Merkle trees can be broadly classified into one of two memory categories: sprase or dense. These categories refer to the tree's interface for adding leaves: sparse trees model a sparse memory data structure, such as a set, to allow insertion and deletion with random access; dense trees model a dense memory data structure, such as a vector, and present an append-only interface.
The Sparse Merkle tree is an implementation of a sparse tree, while the Binary Merkle tree and Binary Merkle Sum tree are implementations of dense trees. Fuel's binary Merkle tree implementation supports Merkle mountain ranges (unbalanced dense trees) by using the algorithm defined in [RFC 6962](https://www.rfc-editor.org/rfc/rfc6962).
Merkle trees are agnostic of their underlying storage technology and communicate to an underlying data store through a `Storage` interface. The `Storage` interface can be implemented on ephemeral data stores, such as an in-memory hash map, or on persistent storage, such as a database. The lifetime of the underlying storage represents another axis of Merkle tree categories.
### Merkle Data and Merkle Trees
We can now map our data and Merkle roots to our memory and lifetime axes. This allows us to identify the appropriate Merkle tree implementation for each root.
| Root | Data | Memory category | Lifetime category | Merkle Tree |
| ---------------------- | ----------------- | --------------- | ----------------- | -------------- |
| `blocksRoot` | Block IDs | Dense | Persistent | Persistent BMT |
| `transactionsRoot` | Transaction IDs | Dense | In-memory | In-memory BMT |
| `messagesRoot` | Message IDs | Dense | In-memory | In-memory BMT |
| `contractStateRoot` | Contract state | Sparse | Persistent | Persistent SMT |
| `contractBalancesRoot` | Contract balances | Sparse | Persistent | Persistent SMT |
### Databases
This document proposes the use of two RocksDB databases for Fuel: the first for ledger data, and the second for Merklized ledger data.
## Ethereum
Ethereum is a Layer 1 blockchain. All of the Merkle trees in Ethereum's execution layer use a Merkle Patricia Trie.
There is one global state trie. The leaves in the global state trie comprise all Ethereum accounts. Therefore, a `path` is always `keccak256(ethereumAddress)` and a `value` is always `rlp(ethereumAccount)`[<sup>[1]</sup>](#ethereum_1). The global state trie is used to calculate the `stateRoot`.
More specifically an Ethereum account is a 4 item array of `[nonce, balance, storageRoot, codeHash]`. The account's `storageRoot` is the root of another Merkle tree formed by all of the account's contract data[<sup>[1]</sup>](#ethereum_1).
All global state changes (e.g., token transfers) are called transactions. Transaction receipts (or simply, receipts) record the transaction outcome. Transactions are batched together into blocks. A block additionally contains a header (a set of metadata) as well as a list of ommer (uncle) blocks.
### Block Headers
The structure of the block header is outlined in the Ethereum Yellow Paper[<sup>[2]</sup>](#ethereum_2):
- **parentHash**: The Keccak 256-bit hash of the parent block’s header, in its entirety
- **ommersHash**: The Keccak 256-bit hash of the ommers list portion of this block
- **beneficiary**: The 160-bit address to which all fees collected from the successful mining of this block be transferred
- **stateRoot**: The Keccak 256-bit hash of the root node of the state trie, after all transactions are executed and finalisations applied
- **transactionsRoot**: The Keccak 256-bit hash of the root node of the trie structure populated with each transaction in the transactions list portion of the block
- **receiptsRoot**: The Keccak 256-bit hash of the root node of the trie structure populated with the receipts of each transaction in the transactions list portion of the block
- **logsBloom**: The Bloom filter composed from indexable information (logger address and log topics) contained in each log entry from the receipt of each transaction in the transactions list
- **difficulty**: A scalar value corresponding to the difficulty level of this block. This can be calculated from the previous block’s difficulty level and the timestamp
- **number**: A scalar value equal to the number of ancestor blocks. The genesis block has a number of zero
- **gasLimit**: A scalar value equal to the current limit of gas expenditure per block
- **gasUsed**: A scalar value equal to the total gas used in transactions in this block
- **timestamp**: A scalar value equal to the reasonable output of Unix’s time() at this block’s inception
- **extraData**: An arbitrary byte array containing data relevant to this block. This must be 32 bytes or fewer
- **mixHash**: A 256-bit hash which, combined with the nonce, proves that a sufficient amount of computation has been carried out on this block
- **nonce**: A 64-bit value which, combined with the mixhash, proves that a sufficient amount of computation has been carried out on this block
A block's header contains three Merkle roots: the global `stateRoot` as well as the block's local `transactionsRoot` and `receiptsRoot`.
The transactions of a block form the leaves of the block's Transaction Merkle tree. The block's transaction Merkle tree calculates the `transactionsRoot`. Similarly, the transaction receipts form the leaves of the block's receipts Merkle tree, which is used to calculate the `receiptsRoot`.
#### References
<a id="ethereum_1">1.</a> [Ethereum Developer Resources](https://ethereum.org/en/developers/docs/data-structures-and-encoding/patricia-merkle-trie/#state-trie), Patricia Merkle Trees
<a id="ethereum_2">2.</a> [Ethereum Yellow Paper](https://ethereum.github.io/yellowpaper/paper.pdf), p. 5
## Projects
### Bitcoin
#### Repos
- https://github.com/bitcoin/bitcoin
#### Languages
- C++ (client)
- Python
#### Database
- *TODO*
Bitcoin is a Layer 1 blockchain.
Bitcoin blocks contain a block header. Block headers comprise 80 bytes and include the following metadata[<sup>[1]</sup>](#bitcoin_1):
- version
- previous block header hash
- Merkle root hash
- time
- nBits
- nonce
The Merkle root hash is the 32 byte root derived from the block's transaction Merkle tree. The transaction Merkle tree is a densly packed binary Merkle tree where all leaves represent the TXIDs of transactions in this block[<sup>[2]</sup>](#bitcoin_2).
A block's transaction Merkle is artificially balanced when calculating the root. If the number of transactions in the block does not land on a power of 2, then the last transactions are duplicated to pad the difference[<sup>[3]</sup>](#bitcoin_3). This is in contrast to the Merkle Mountain Range approach employed by other binary Merkle tree implementations.
*TODO: Is any Merklized data stored on disk? If so, what database technology is used?*
#### References
<a id="bitcoin_1">1.</a> [Bitcoin Developer](https://developer.bitcoin.org/reference/block_chain.html#block-headers), Block Headers
<a id="bitcoin_2">2.</a> [Bitcoin Developer](https://developer.bitcoin.org/reference/block_chain.html#merkle-trees), Merkle Trees
<a id="bitcoin_3">3.</a> [Bitcoin's Merkle Tree on GitHub](https://github.com/bitcoin/bitcoin/blob/master/src/consensus/merkle.cpp)
---
### Go Ethereum
#### Repos
- https://github.com/ethereum/go-ethereum
#### Languages
- Go
#### Database
- LevelDB
Go Ethereum (or Geth) is an Ethereum client written in Go. Geth uses LevelDB as its database solution. LevelDB is a key-value store.
In Geth, the following are persisted to disk:
- The global state tree
- Blocks
##### Global State Tree
The global state tree is persisted to disk for performance reasons. It is infeasibly expensive to recalculate the global state tree from leaf data (Ethereum accounts) alone every time a transaction is made.
##### Blocks
A block's constituent header and body (list of transactions, list of ommers) are persisted separately. For a block's header, the key is the hash of the header and the value is the encoded header. Similary, for a block's body, the key is the hash of the body and the value is the incoded body.
A block's internal Merkle tree data (transaction tree and receipts tree) is constructed and stored only in memory; this data is not persisted to disk.
*TODO: Does Geth use a single LevelDB instance for both the global state tree and block data? Or does it employ separate LevelDB instances?*
---
### Aptos
#### Repos
- https://github.com/aptos-labs/aptos-core
#### Languages
- Rust
#### Database
- RocksDB
Aptos is a Layer 1 blockchain. Aptos is a spiritual successor to [Diem](#Diem) and builds upon the Diem codebase.
Aptos uses RocksDB for its database solution. RocksDB is a key-value store built upon the LevelDB library.
Aptos uses two separate RocksDB databases for ledger data and Merklized ledger data. This is in contrast to Diem's design, where ledger data and Merklized ledger data are stored in a single RocksDB instance. Both RocksDB databases are created under the same directory; generally, this is the temporary directroy.
Aptos uses a persistent Merkle tree to generate Merkle roots for the global state root. Leaf data is comprised of Aptos state keys. Internal nodes are formed by joining leaves and other internal nodes. Node data is stored as a key-value pair: the key is the node key, and the value is the serialized node. Serialized node data is stored in the `"jellyfish_merkle_node"` column.
Aptos uses a Jellyfish Merkle tree, similar to Diem.
Merklized Data Database Column Families:
- `"default"`
- `"db_metadata"`
- `"jellyfish_merkle_node"`
- `"stale_node_index"`
- `"stale_node_index_cross_epoch"`
Merklized Node data is stored as a key-value pair: the key is the Node key, and the value is the serialized Node. Serialized Node data is stored in the `"jellyfish_merkle_node"` column.
`jellyfish_merkle_node`:
| Key | Value |
| -------- | --------------- |
| node_key | serialized_node |
---
### Diem
#### Repos
- https://github.com/diem/diem
#### Languages
- Rust
#### Database
- RocksDB
Diem is a Layer 1 blockchain and the predecessor to fellow Layer 1 blockchains Aptos and Sui.
Diem uses RocksDB as its underlying database solution. RocksDB is a key-value database. Diem uses a single RocksDB database for its ledger data and Merkelized ledger data.
Data is organized in column families:
- `"default"`
- `"epoch_by_version"`
- `"event_accumulator"`
- `"event_by_key"`
- `"event_by_version"`
- `"event"`
- `"jellyfish_merkle_node"`
- `"ledger_counters"`
- `"stale_node_index"`
- `"transaction"`
- `"transaction_accumulator"`
- `"transaction_by_account"`
- `"transaction_info"`
Diem authored the Jellyfish Merkle tree, a custom type of Merkle tree.
Diem uses a persistent Merkle tree to generate Merkle roots for the global state root. Leaf data is comprised of all Diem accounts. Internal nodes are formed by joining leaves and other internal nodes. Node data is stored as a key-value pair: the key is the node key, and the value is the serialized node. Serialized node data is stored in the `"jellyfish_merkle_node"` column.
`jellyfish_merkle_node`:
| Key | Value |
| -------- | --------------- |
| node_key | serialized_node |
---
### Erigon
#### Repos
- https://github.com/ledgerwatch/erigon
#### Languages
- Go
#### Database
- MDBX[<sup>[1]</sup>](#erigon_1)
Erigon is an implementation of Ethereum (execution client) written in Go. It is an alternative to Geth.
Erigon uses Go bindings for MDBX, a key-value database to store data. Data is organized in tables (or "buckets"). The list of main tables is given[<sup>[2]</sup>](#erigon_2):
- Headers
- Block Bodies
- Header Numbers
- Receipts
- PlainState
- History Of Accounts
- Change Sets
- HashedState
- IntermediateTrieHashes
- Tx Senders
This indicates that block headers and block bodies are stored separately - similar to Geth.
##### Global State
`HashedState` and `IntermediateTrieHashes`: these tables are used to calculate Merkle Trie root hash of the current global state. For the Genesis block, these tables are not populated, because the root hash is known in advance[<sup>[2]</sup>](#erigon_2).
Erigon stores the Ethereum state tree on disk[<sup>[3]</sup>](#erigon_3). The `PlainStateBucket` table stores state with keys before hashing. The `CurrentStateBucket` table stores the same data with hashed keys. (*TODO: `CurrentStateBucket` is not mentioned in the above list of main tables. Check the code to verify the most current list of buckets*)
Erigon's global Merkle state tree design is described in-depth here: https://github.com/ledgerwatch/erigon/blob/325c9cd9d8f1034baa218ef46a7f40ae48410904/docs/programmers_guide/guide.md#organising-ethereum-state-into-a-merkle-tree.
##### Block Headers
The `Headers` table stores block header records. For a given block, there are three types of block headers stored herein:
- Full Header
- Total Difficulty Header
- Canonical Header
The structure of a Full Header conforms to the Ethereum specification outlined in the Ethereum Yellow Paper. A block's Full Header in Erigon is equivalent to a block's Header in Geth.
For the purposes of this document, the definitions of the Total Difficulty Header and Canonical Hearder are out of scope.
A Full Header is keyed with the following structure:
```key: 8-byte big-endian block number + 32-byte block hash```
The Total Difficulty Header and Canonical Header are keyed similarly, except they are suffixed with `0x74` (ASCII code for `t`) and `0x6E` (ASCII code for `n`) respectively.
#### References
<a id="erigon_1">1.</a> [Erigon FAQ](https://github.com/ledgerwatch/erigon/blob/325c9cd9d8f1034baa218ef46a7f40ae48410904/docs/programmers_guide/db_faq.md)
<a id="erigon_2">2.</a> [Erigon DB Walkthrough](https://github.com/ledgerwatch/erigon/blob/325c9cd9d8f1034baa218ef46a7f40ae48410904/docs/programmers_guide/db_walkthrough.MD)
<a id="erigon_3">3.</a> [Erigon Guide, Organizing Ethereum State into a Merkle Tree](https://github.com/ledgerwatch/erigon/blob/325c9cd9d8f1034baa218ef46a7f40ae48410904/docs/programmers_guide/guide.md#organising-ethereum-state-into-a-merkle-tree)
---
### Hedera
TODO
---
### Sui
TODO