Efficient management of authenticated data has emerged as a critical research challenge in the blockchain space, particularly given the recent rise of parallel EVM.
This post reviews Merkle Patricia Trie-based authenticated storage (currently used in Ethereum)[1] and analyzes some of its potential alternatives.
A typical blockchain manages its state via a two-layered architecture in which an authenticated data structure (ADS) is built on top of a backend database.
A read to the state involves traversing the ADS from its root down to the target leaf node. During this traversal, the ADS must fetch the content of each node from the backend database (if not cached). Due to the LSM tree structure of the database, each query itself involves multiple disk accesses. Consequently, a single read operation often translates into many I/O operations. This amplification effect worsens as the state size increases.
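To make the amplification concrete, here is a minimal sketch (not go-ethereum's actual code) of such a read path. The `Node` and `Backend` types and the `ads_read` helper are hypothetical stand-ins for trie nodes and an LSM-backed key-value store.

```python
# Sketch: one logical state read fans out into one backend lookup per trie level,
# and each backend lookup may itself touch several on-disk sorted runs.
from dataclasses import dataclass, field

@dataclass
class Node:
    children: dict = field(default_factory=dict)  # nibble -> child node hash
    value: bytes | None = None

class Backend:
    """Stand-in for an LSM-based KV store keyed by node hash."""
    def __init__(self):
        self.kv = {}
    def get(self, node_hash):
        # In a real LSM store this single get may probe the memtable plus
        # several SSTables, i.e., multiple disk reads per trie hop.
        return self.kv.get(node_hash)

def ads_read(backend: Backend, root_hash: bytes, path_nibbles: list[int]):
    node = backend.get(root_hash)          # one backend lookup per trie level
    for nibble in path_nibbles:
        if node is None or nibble not in node.children:
            return None                    # key absent from the state
        node = backend.get(node.children[nibble])
    return None if node is None else node.value

# Tiny usage example with hypothetical hashes:
leaf = Node(value=b"balance")
root = Node(children={0xa: b"leaf-hash"})
backend = Backend()
backend.kv = {b"root-hash": root, b"leaf-hash": leaf}
assert ads_read(backend, b"root-hash", [0xa]) == b"balance"
```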
Figure. Two keys share the a7 prefix, so an extension node is created to save space and reduce redundancy. The same goes for a77d337 and a77d397, which share the same d3 prefix.
A Merkle Patricia Trie (MPT) is a form of ADS in which nodes are linked by hashes and the root node represents the commitment to the underlying data. The MPT aims to reduce redundancy and I/Os by compressing chains of single-child nodes into so-called extension nodes. Since a read must query every node along the path from the root down to the target leaf, extension nodes reduce the number of queries. They are most effective when the underlying data is sparse. For example, around 90% of Ethereum accounts sit at depth 8-9 in the MPT[2]. Without extension nodes, a single account read would require 64 database queries, one per nibble of the 32-byte key (assuming no cache).
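A quick back-of-the-envelope check of those numbers (assuming keccak256-derived 32-byte keys and no caching, as the post does):

```python
# Lookup counts per state read, with and without extension-node compression.
KEY_BYTES = 32                   # keccak256-derived trie keys
NIBBLES = KEY_BYTES * 2          # 64 nibbles -> 64 levels without compression
TYPICAL_DEPTH = 9                # ~90% of accounts sit at depth 8-9 in the MPT

print(f"queries per read without extension nodes: {NIBBLES}")
print(f"queries per read at a typical compressed depth: {TYPICAL_DEPTH}")
print(f"roughly {NIBBLES / TYPICAL_DEPTH:.0f}x fewer node fetches")
```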
Log-Structured Merge Trees (LSM trees) play a pivotal role as the foundational data structure underpinning widely adopted industrial key-value stores such as Google's LevelDB.
The core principle behind the LSM tree structure is to optimize write performance by initially buffering ingestions in memory, thereby avoiding the overhead of immediate random writes to disk. Once this buffer reaches capacity, the data is written to disks as immutable sorted runs (i.e., SSTables). Periodically, a background process merges multiple sorted runs, maintaining an organized structure that facilitates efficient data retrieval.
LSM trees employ a "log-first" strategy, where incoming writes are initially buffered in an in-memory data structure known as a memtable.
Upon reaching a predefined threshold, the memtable is flushed to disk as an immutable Sorted String Table (SSTable). SSTables are append-only files, storing key-value pairs in lexicographic order. This sequential write pattern aligns seamlessly with modern storage devices, optimizing write throughput and minimizing disk seek overhead.
Reads in LSM trees involve a multi-step process. To perform a read operation, the system first checks the memtable. If the data is not found, it sequentially searches the SSTables on disk, starting from the newest to the oldest, until the data is located.
SSTables facilitate efficient range scans and point lookups. Each SSTable typically includes an index, such as a sparse index or a Bloom filter, to accelerate key searches. However, as the number of SSTables grows, read performance can degrade due to the need to scan multiple files.
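The following toy store sketches the write and read paths just described (a simplification, not LevelDB/RocksDB internals): writes are buffered in a memtable, full memtables are flushed as immutable sorted runs, and reads probe the memtable and then the runs from newest to oldest.

```python
class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []            # list of sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # buffered in memory, no random disk write
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Persist the memtable as an immutable run sorted by key (an "SSTable").
        run = sorted(self.memtable.items())
        self.sstables.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:      # 1) check the memtable
            return self.memtable[key]
        for run in self.sstables:     # 2) probe runs newest-to-oldest
            for k, v in run:          #    (real stores use sparse indexes and
                if k == key:          #     Bloom filters instead of linear scans)
                    return v
        return None

store = ToyLSM()
for i in range(10):
    store.put(f"key{i}", f"val{i}")
assert store.get("key3") == "val3"
```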
Compaction is the cornerstone of LSM tree performance, responsible for merging and consolidating SSTables to reclaim space, eliminate redundant data, and maintain read performance.
Leveling Compaction. Leveling compaction keeps a single sorted run per level. During compaction, the run at a lower level is merged with the run at the next, larger level to form a new, larger run. Keeping one run per level means queries have few runs to search, optimizing read performance and space amplification at the cost of higher write amplification.
Tiering Compaction. In contrast, tiering compaction allows up to a predefined number of sorted runs per level; runs are merged and pushed down to the next level only when this cap is reached. Deferring merges this way keeps the number of runs per level predictable and reduces write amplification, at the cost of read and space amplification. A sketch contrasting the two policies follows the table below.
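A simplified sketch of the two policies (real engines also account for run sizes, key ranges, and overlap); here `levels[i]` is the list of sorted runs at level i, each a list of (key, value) pairs ordered oldest to newest.

```python
def merge_runs(runs):
    # Runs are ordered oldest-to-newest, so later (newer) values overwrite older ones.
    merged = {}
    for run in runs:
        merged.update(dict(run))
    return sorted(merged.items())

def compact_leveling(levels, i):
    # Leveling: each level holds a single run; level i is folded into level i+1's run.
    levels[i + 1] = [merge_runs(levels[i + 1] + levels[i])]
    levels[i] = []

def compact_tiering(levels, i, max_runs=4):
    # Tiering: a level may hold up to max_runs runs; once full, they are merged
    # into one new run that is pushed down to level i+1.
    if len(levels[i]) >= max_runs:
        levels[i + 1].append(merge_runs(levels[i]))
        levels[i] = []
```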
Table. Different types of compaction.
Properties | Leveling | Tiering |
---|---|---|
Read Performance | Good | Poor |
Write Performance | Poor | Good |
Space Performance | Good | Poor |
Apart from the compaction policy, the eagerness of compaction creates a direct tension between read and write costs. Compacting more eagerly means queries have fewer runs to search, but it raises the amortized cost of writes, eroding the very write advantage the LSM tree was designed for.
We summarize a few problems in designing efficient state access, in blockchains generally and in Ethereum specifically.
In this section, we analyze a few attempts to design efficient authenticated data storage as well as improve Ethereum's MPT in the literature.
This is not an exhaustive list regarding state access.
ChainKV[3] provides a state-separation scheme that maintains non-state data and state data individually to exploit the different access patterns of the two data types. This separation splits the original large LSM tree into two independent zones. Read operations in each zone incur fewer SSTable lookups than in the original, larger tree. For writes, compaction is performed per zone, which incurs less write amplification.
Another component is the Prefix MPT, an optimized MPT-to-KV transformation scheme that carefully makes use of the advantages of MPT and the characteristics of LSM-based KV stores. Prefix MPT allows retaining the original in-memory MPT node structure (and verification) while adding more data (i.e., prefixes) when persisting nodes in KV stores. By choosing which type of prefixes to add, Prefix MPT can choose to support temporal locality or spatial locality. While it helps reduce disk I/Os in querying state data, Prefix MPT does not affect non-state data queries. Furthermore, when the proportion of state data decreases, the performance gain by Prefix MPT also decreases.
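The following is only our illustration of the prefixing idea, not ChainKV's exact encoding: a locality-preserving prefix is prepended to a node's hash before it is used as the KV key, so related nodes sort near each other in the LSM store (SHA-256 stands in for the trie's hash function).

```python
import hashlib

def kv_key_spatial(path_nibbles: list[int], node_bytes: bytes) -> bytes:
    # Spatial locality: nodes sharing a trie-path prefix get adjacent KV keys.
    return bytes(path_nibbles) + hashlib.sha256(node_bytes).digest()

def kv_key_temporal(block_number: int, node_bytes: bytes) -> bytes:
    # Temporal locality: nodes persisted in the same block get adjacent KV keys.
    return block_number.to_bytes(8, "big") + hashlib.sha256(node_bytes).digest()
```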
Layered MPT[4] works by adding two smaller intermediary in-memory MPTs on top of the existing on-disk MPT to reduce read and write amplification. The two small in-memory MPTs hold the data of the most frequently used accounts, such as USDT or Uniswap, so writes to and reads from these accounts are very fast. The on-disk MPT serves as a checkpoint for the state data.
To keep the two in-memory MPTs small enough to fit in memory, merge operations periodically flush their updates to the on-disk MPT.
LMPT also has a separate flat KV store for unauthenticated queries during block execution.
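A simplified sketch of the layered lookup and merge flow (our illustration; the tries are modeled as plain dicts and the flat KV store used during execution is omitted):

```python
class LayeredState:
    def __init__(self):
        self.hot_a = {}     # small in-memory trie (most recent updates)
        self.hot_b = {}     # second in-memory trie (awaiting merge)
        self.disk = {}      # large on-disk MPT / checkpoint

    def get(self, key):
        # Hot accounts hit the in-memory tries and never touch disk.
        for layer in (self.hot_a, self.hot_b, self.disk):
            if key in layer:
                return layer[key]
        return None

    def put(self, key, value):
        self.hot_a[key] = value

    def merge(self):
        # Periodic flush keeps the in-memory tries small enough to fit in RAM.
        self.disk.update(self.hot_b)
        self.hot_b, self.hot_a = self.hot_a, {}
```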
The RainBlock[5] architecture offers a compelling solution to the I/O challenges that plague current blockchain systems. The core of its innovation lies in decoupling storage from computation and introducing the DSM-Tree.
When executing transactions, the miners obtain needed data from multiple I/O Helpers and send the updates to multiple storage nodes. Each storage node maintains a shard of MPTs in memory. By pre-executing transactions, I/O Helpers can proactively fetch the necessary data and witnesses, reducing the latency associated with on-demand state accesses. This helps minimize I/O operations in the critical path of transaction processing, leading to significant throughput improvements without compromising the security or decentralization of the blockchain.
Data persistence is achieved through a combination of write-ahead logs and checkpoints. The process of navigating the Merkle tree is separated from hash computation; retrieving the next node is a simple pointer dereference.
However, while RainBlock reduces local storage I/O, it increases reliance on the network for state access. This makes the system vulnerable to network congestion and latency issues, which could impact transaction processing performance. Besides, decoupling storage and computation potentially increases the attack surface and makes the system more difficult to reason about and debug.
Merkelized LSM (mLSM)[6] replaces the two-layered architecture by maintaining multiple independent Merkle binary tries in an LSM way. mLSM envisions multiple levels, each containing several immutable Merkle trees. Each level also maintains a lookup cache and all tries are connected via a master root node.
Writes are batched in memory and written to level 0 as a new Merkle tree. Similar to LSM trees, mLSM also performs compaction when needed. Reads require inspecting multiple tries at multiple levels similar to an LSM-based database.
However, compaction in mLSM incurs its own overhead (thus affecting write amplification). Furthermore, compaction is usually non-deterministic, which is problematic for blockchains, where nodes must agree on the exact mLSM state at any point in time.
A Jellyfish Merkle Trie (JMT)[7] is quite similar to an MPT except for two aspects: nodes are keyed by version rather than by hash (a version concatenated with the node's nibble path serves as the database key), and extension nodes are removed, leaving only internal and leaf node types.
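A hedged sketch of the version-based keying idea (the general JMT-style scheme, not Diem's exact byte layout): nodes are addressed by (version, nibble path), so nodes written at the same version cluster together and stale versions can be pruned by version rather than compacted.

```python
def jmt_node_key(version: int, path_nibbles: list[int]) -> bytes:
    # Version prefix first, then the nibble path of the node.
    return version.to_bytes(8, "big") + bytes(path_nibbles)

# Two updates to the same path at different versions produce ordered keys:
k1 = jmt_node_key(100, [0xa, 0x7])
k2 = jmt_node_key(101, [0xa, 0x7])
assert k1 < k2   # newer versions sort after older ones in the KV store
```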
Nearly Optimized Merkle Tree (NOMT)[8] builds an optimized binary Merkle trie on top of RocksDB. The trie is cleverly laid out so that a rootless subtree of depth 6 fits into a single SSD page. Whenever a node is read, all sibling nodes (within the subtree) required for hash recalculation are fetched in the same I/O round trip.
NOMT also makes use of the internal parallelism of modern SSDs to prefetch data before execution.
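A quick check of the page-size arithmetic implied by that layout, assuming 32-byte node hashes and 4 KiB SSD pages (our assumptions, not quoted specifications):

```python
HASH_BYTES = 32
PAGE_BYTES = 4096
# A rootless binary subtree of depth 6 has nodes at depths 1..6: 2 + 4 + ... + 64.
nodes_in_rootless_subtree = sum(2**d for d in range(1, 7))    # = 126 nodes
print(nodes_in_rootless_subtree * HASH_BYTES)                 # 4032 bytes
assert nodes_in_rootless_subtree * HASH_BYTES <= PAGE_BYTES   # fits one SSD page
```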
Similar to JMT, LETUS[9] also adopts delta encoding and a version-based indexing scheme. In addition, LETUS breaks the two-layered architecture with a so-called DMM-Tree, which combines Merkle trees, delta encoding, and log-structured files to significantly reduce storage consumption and I/O amplification.
LETUS also groups related nodes (i.e., two-level 16-ary subtrees) into pages that align well with the SSD page size, which reduces I/Os when recomputing hashes on data updates.
Verkle Trees[10] (Verkles for short) are a critical step on the roadmap to stateless Ethereum clients. Verkles are similar to MPTs except that they replace hashes with vector commitments to commit to child nodes.
The key benefit of vector commitments is proof size. Instead of including all sibling hashes (i.e., 15 nodes per level in MPTs) along the path from the leaf to the root, Verkle proofs need only a constant-size element per level. This also allows Verkles to use a much larger branching factor than the 16 used in MPTs. As a result, Verkles provide significantly smaller proofs than MPTs.
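A rough, illustrative size comparison under assumed parameters (depth-8 hexary MPT with 32-byte hashes versus a depth-3 Verkle trie with branching factor 256, ~32-byte per-level commitments, and a single ~150-byte aggregated opening proof; actual numbers depend on the commitment scheme and tree shape):

```python
mpt_depth, verkle_depth = 8, 3
mpt_proof = mpt_depth * 15 * 32                  # 15 sibling hashes at every level
verkle_proof = verkle_depth * 32 + 150           # per-level commitments + one opening
print(f"MPT witness   : ~{mpt_proof} bytes")     # ~3840 bytes
print(f"Verkle witness: ~{verkle_proof} bytes")  # ~246 bytes
```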
However, Verkles rely on cryptographic vector commitments, which are complicated and computationally expensive (elliptic-curve operations versus hashing). Computation time, rather than I/O, therefore becomes the bottleneck.
LVMT[11] maintains multiple levels of AMTs[12] (built upon vector commitments), where each AMT has a 16-bit key space. An AMT is a cryptographic vector commitment scheme that can update its commitment in constant time; however, the constant factor is large in practice.
To reduce the update time, LVMT employs a version-based approach. Keys are assigned versions, and an update to a key simply increments the corresponding version within the AMT. Expensive elliptic-curve multiplications are thereby reduced to multiplying a precomputed elliptic-curve point by the version delta (i.e., the difference between the new and old version values), which amounts to a few point additions (when an update increases the version by 1, it is a single addition).
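A toy sketch of the version-bump trick (integers under addition stand in for elliptic-curve points; `precomputed[i]` plays the role of the precomputed curve point for slot i):

```python
precomputed = [1000 + i for i in range(16)]   # stand-ins for precomputed EC points

def update_commitment(commitment, slot, old_version, new_version):
    delta = new_version - old_version
    # Instead of recomputing the whole commitment, add delta * precomputed point;
    # when a write bumps the version by 1, this is a single group addition.
    return commitment + delta * precomputed[slot]

C = 0
C = update_commitment(C, slot=3, old_version=0, new_version=1)   # one "addition"
C = update_commitment(C, slot=3, old_version=1, new_version=4)   # delta = 3
```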
LVMT introduces a proof sharding technique in which each node is responsible for generating proofs for a specific shard (indicated by 4-bit key prefixes).
Table. A comparison of different state access designs.
Design | Layers | Indexing | Backend DB | Sharded | Building Blocks |
---|---|---|---|---|---|
MPT-Based | 2 | Hash-based | LSM-based DB | ❌ | MPT, LSM-based KV Stores |
Layered MPT | 2 | Hash-based | LSM-based DB | ❌ | Multiple MPTs, Flat KV Store |
ChainKV | 2 | Hash-based | LSM-based DB | ❌ | State vs Non-state data separation, Prefix MPT |
JMT-based | 2 | Version-based | LSM-based DB | ❌ | Version-based |
RainBlock | 2 | In-memory pointer-based | Checkpoints | ✅ | I/O Helpers, Sharded proofs, Storage decoupling from miners |
mLSM | 1 | Hash-based | LSM-based DB | ❌ | Multiple Merkle trees |
NOMT | 2 | Hash-based | LSM-based DB | ❌ | Page optimization |
LETUS | 1 | Version-based | Log-structured files | ❌ | Delta-encoding, Version-based, Page optimization |
Verkles | 2 | Hash-based | LSM-based DB | ❌ | Vector commitments |
LVMT | 2 | Version-based | LSM-based DB | ✅ | AMT, Version-based, Proof sharding |
Table. Pros and cons of state access designs.
Design | Pros | Cons |
---|---|---|
MPT-Based | - Ethereum compatibility. | - High amplification due to the two-layered architecture and hash-based indexing. - High data consumption. |
Layered MPT | - Fast reads from/writes to hot data. - Fast execution by using a flat key-value store. | - Potentially expensive queries of authenticated cold data due to the need to access multiple trees. - Expensive compaction between multiple MPTs, potentially translating into high write amplification. |
ChainKV | - Separation of state and non-state data exploiting their access patterns. - Prefix MPTs taking advantage of data locality (temporal and spatial). | - Low performance gain when the portion of state data is small. |
JMT-based | - Efficient (possibly zero) compaction due to version-based indexing. - Potentially low data consumption if delta-encoding is employed for keys. | - Slightly high read amplification due to the two-layered architecture. - Removal of extension nodes leading to higher data consumption and slightly higher read amplification. |
RainBlock | - Separation of execution nodes and storage nodes. - Trading local I/O for network I/O, benefiting from parallel I/O and in-memory storage. | - Potentially high transaction-processing latency due to remote data fetches. - Requirements on the number of I/O Helpers and storage nodes. - Potentially high failure-recovery time and space usage due to checkpoints. |
mLSM | - Removal of the two-layered architecture, significantly reducing I/O amplification. | - Additional overhead introduced by compaction. - Compaction non-determinism potentially leading to unwanted results. |
NOMT | - Efficient page optimization reducing I/O amplification. - Effective parallel prefetching due to encoding. | - Minimal support for historical data. - Less effective when keys have long prefixes (i.e., sparse data) due to the page design. |
LETUS | - Removal of the two-layered architecture, significantly reducing I/O amplification. - Low data consumption due to delta-encoding. - Efficient page optimization reducing I/O amplification. - Efficient compaction due to version-based indexing. | - Complexity in retrieving the latest data due to the need to replay delta changes. - Lack of range queries. |
Verkles | - Smaller proof sizes than MPT-based designs, enabling stateless clients. | - High complexity due to the use of expensive vector commitments. |
LVMT | - Constant-time commitment updates. - Low I/O amplification. - Unlimited scalability due to the multi-AMT architecture. | - High complexity due to the use of cryptographic AMTs. - Requirement of node collaboration to generate proofs. |
Choosing which aspects to optimize depends on the degree of compatibility each chain or rollup wants. For Ethereum, many take EVM compatibility to mean full compatibility with the Ethereum ecosystem, including the gas mechanism, EIPs, and tool reusability, while others mean only Solidity compatibility (and therefore have more flexibility in designing authenticated storage).
The research space is vast; let's discuss more!
Ethereum actually uses a modified version of MPTs but we use them interchangeably in this post. ↩︎
ChainKV: A Semantics-Aware Key-Value Store for Ethereum System ↩︎
LMPT: A Novel Authenticated Data Structure to Eliminate Storage Bottlenecks for High Performance Blockchains ↩︎
RainBlock: Faster Transaction Processing in Public Blockchains ↩︎
LETUS: A Log-Structured Efficient Trusted Universal BlockChain Storage ↩︎