Efficient management of authenticated data has emerged as a critical research challenge in the blockchain space, particularly given the recent rise of parallel EVM.
This post reviews Merkle Patricia Trie-based authenticated storage (currently used in Ethereum)[1] and analyzes some of its potential alternatives.
A typical blockchain manages its state via a two-layered architecture in which an authenticated data structure (ADS) is built on top of a backend database.
A read to the state involves traversing the ADS from its root down to the target leaf node. During this traversal, the ADS must fetch the content of each node from the backend database (if not cached). Due to the LSM tree structure of the database, each query itself involves multiple disk accesses. Consequently, a single read operation often translates into many I/O operations. This amplification effect worsens as the state size increases.
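To make the amplification concrete, here is a minimal sketch (not go-ethereum's actual code) of such a read path. The `Node` and `Backend` types and the `ads_read` helper are hypothetical stand-ins for trie nodes and an LSM-backed key-value store.

```python
# Sketch: one logical state read fans out into one backend lookup per trie level,
# and each backend lookup may itself touch several on-disk sorted runs.
from dataclasses import dataclass, field

@dataclass
class Node:
    children: dict = field(default_factory=dict)  # nibble -> child node hash
    value: bytes | None = None

class Backend:
    """Stand-in for an LSM-based KV store keyed by node hash."""
    def __init__(self):
        self.kv = {}
    def get(self, node_hash):
        # In a real LSM store this single get may probe the memtable plus
        # several SSTables, i.e., multiple disk reads per trie hop.
        return self.kv.get(node_hash)

def ads_read(backend: Backend, root_hash: bytes, path_nibbles: list[int]):
    node = backend.get(root_hash)          # one backend lookup per trie level
    for nibble in path_nibbles:
        if node is None or nibble not in node.children:
            return None                    # key absent from the state
        node = backend.get(node.children[nibble])
    return None if node is None else node.value

# Tiny usage example with hypothetical hashes:
leaf = Node(value=b"balance")
root = Node(children={0xa: b"leaf-hash"})
backend = Backend()
backend.kv = {b"root-hash": root, b"leaf-hash": leaf}
assert ads_read(backend, b"root-hash", [0xa]) == b"balance"
```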
Figure. Two keys share the a7 prefix, so an extension node is created to save space and reduce redundancy. The same goes for a77d337 and a77d397, which share the same d3 prefix.
A Merkle Patricia Trie (MPT) is a form of ADS in which nodes are linked by hashes and the root node represents the commitment to the underlying data. The MPT aims to reduce redundancy and I/Os by compressing chains of single-child nodes into so-called extension nodes. Since a read must query every node along the path from the root down to the target leaf, extension nodes reduce the number of queries. They are most effective when the underlying data is sparse. For example, around 90% of Ethereum accounts sit at depth 8-9 in the MPT[2]. Without extension nodes, a single account read would require 64 database queries, one per nibble of the 32-byte key (assuming no cache).
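A quick back-of-the-envelope check of those numbers (assuming keccak256-derived 32-byte keys and no caching, as the post does):

```python
# Lookup counts per state read, with and without extension-node compression.
KEY_BYTES = 32                   # keccak256-derived trie keys
NIBBLES = KEY_BYTES * 2          # 64 nibbles -> 64 levels without compression
TYPICAL_DEPTH = 9                # ~90% of accounts sit at depth 8-9 in the MPT

print(f"queries per read without extension nodes: {NIBBLES}")
print(f"queries per read at a typical compressed depth: {TYPICAL_DEPTH}")
print(f"roughly {NIBBLES / TYPICAL_DEPTH:.0f}x fewer node fetches")
```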
Log-Structured Merge Trees (LSM trees) play a pivotal role as the foundational data structure underpinning widely adopted industrial key-value stores such as Google's LevelDB.
The core principle behind the LSM tree structure is to optimize write performance by initially buffering ingestions in memory, thereby avoiding the overhead of immediate random writes to disk. Once this buffer reaches capacity, the data is written to disks as immutable sorted runs (i.e., SSTables). Periodically, a background process merges multiple sorted runs, maintaining an organized structure that facilitates efficient data retrieval.
LSM trees employ a "log-first" strategy, where incoming writes are initially buffered in an in-memory data structure known as a memtable.
Upon reaching a predefined threshold, the memtable is flushed to disk as an immutable Sorted String Table (SSTable). SSTables are append-only files, storing key-value pairs in lexicographic order. This sequential write pattern aligns seamlessly with modern storage devices, optimizing write throughput and minimizing disk seek overhead.
Reads in LSM trees involve a multi-step process. To perform a read operation, the system first checks the memtable. If the data is not found, it sequentially searches the SSTables on disk, starting from the newest to the oldest, until the data is located.
SSTables facilitate efficient range scans and point lookups. Each SSTable typically includes an index, such as a sparse index or a Bloom filter, to accelerate key searches. However, as the number of SSTables grows, read performance can degrade due to the need to scan multiple files.
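The following toy store sketches the write and read paths just described (a simplification, not LevelDB/RocksDB internals): writes are buffered in a memtable, full memtables are flushed as immutable sorted runs, and reads probe the memtable and then the runs from newest to oldest.

```python
class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []            # list of sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # buffered in memory, no random disk write
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Persist the memtable as an immutable run sorted by key (an "SSTable").
        run = sorted(self.memtable.items())
        self.sstables.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:      # 1) check the memtable
            return self.memtable[key]
        for run in self.sstables:     # 2) probe runs newest-to-oldest
            for k, v in run:          #    (real stores use sparse indexes and
                if k == key:          #     Bloom filters instead of linear scans)
                    return v
        return None

store = ToyLSM()
for i in range(10):
    store.put(f"key{i}", f"val{i}")
assert store.get("key3") == "val3"
```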
Compaction is the cornerstone of LSM tree performance, responsible for merging and consolidating SSTables to reclaim space, eliminate redundant data, and maintain read performance.
Leveling Compaction. Leveling compaction keeps a single sorted run per level. During compaction, the run at a lower level is merged with the run at the next, larger level to form a new, larger run. Keeping one run per level means queries have few runs to search, optimizing read performance and space amplification at the cost of higher write amplification.
Tiering Compaction. In contrast, tiering compaction allows up to a predefined number of sorted runs per level; runs are merged and pushed down to the next level only when this cap is reached. Deferring merges this way keeps the number of runs per level predictable and reduces write amplification, at the cost of read and space amplification. A sketch contrasting the two policies follows the table below.
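A simplified sketch of the two policies (real engines also account for run sizes, key ranges, and overlap); here `levels[i]` is the list of sorted runs at level i, each a list of (key, value) pairs ordered oldest to newest.

```python
def merge_runs(runs):
    # Runs are ordered oldest-to-newest, so later (newer) values overwrite older ones.
    merged = {}
    for run in runs:
        merged.update(dict(run))
    return sorted(merged.items())

def compact_leveling(levels, i):
    # Leveling: each level holds a single run; level i is folded into level i+1's run.
    levels[i + 1] = [merge_runs(levels[i + 1] + levels[i])]
    levels[i] = []

def compact_tiering(levels, i, max_runs=4):
    # Tiering: a level may hold up to max_runs runs; once full, they are merged
    # into one new run that is pushed down to level i+1.
    if len(levels[i]) >= max_runs:
        levels[i + 1].append(merge_runs(levels[i]))
        levels[i] = []
```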
Table. Different types of compaction.
Properties | Leveling | Tiering |
---|---|---|
Read Performance | Good | Poor |
Write Performance | Poor | Good |
Space Performance | Good | Poor |
Apart from the compaction policy, the eagerness of compaction creates a direct tension between read and write costs. Compacting more eagerly means queries have fewer runs to search, but it raises the amortized cost of writes, eroding the very write advantage the LSM tree was designed for.
We summarize a few problems in designing efficient state access, in blockchains generally and in Ethereum specifically.
In this section, we analyze a few attempts to design efficient authenticated data storage as well as improve Ethereum's MPT in the literature.
This is not an exhaustive list regarding state access.
ChainKV[3] provides a state-separation scheme that maintains non-state data and state data individually to exploit the different access patterns of the two data types. This separation splits the original large LSM tree into two independent zones. Read operations in each zone incur fewer SSTable lookups than in the original, larger tree. For writes, compaction is performed per zone, which incurs less write amplification.
Another component is the Prefix MPT, an optimized MPT-to-KV transformation scheme that carefully makes use of the advantages of MPT and the characteristics of LSM-based KV stores. Prefix MPT allows retaining the original in-memory MPT node structure (and verification) while adding more data (i.e., prefixes) when persisting nodes in KV stores. By choosing which type of prefixes to add, Prefix MPT can choose to support temporal locality or spatial locality. While it helps reduce disk I/Os in querying state data, Prefix MPT does not affect non-state data queries. Furthermore, when the proportion of state data decreases, the performance gain by Prefix MPT also decreases.
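The following is only our illustration of the prefixing idea, not ChainKV's exact encoding: a locality-preserving prefix is prepended to a node's hash before it is used as the KV key, so related nodes sort near each other in the LSM store (SHA-256 stands in for the trie's hash function).

```python
import hashlib

def kv_key_spatial(path_nibbles: list[int], node_bytes: bytes) -> bytes:
    # Spatial locality: nodes sharing a trie-path prefix get adjacent KV keys.
    return bytes(path_nibbles) + hashlib.sha256(node_bytes).digest()

def kv_key_temporal(block_number: int, node_bytes: bytes) -> bytes:
    # Temporal locality: nodes persisted in the same block get adjacent KV keys.
    return block_number.to_bytes(8, "big") + hashlib.sha256(node_bytes).digest()
```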
Layered MPT[4] works by adding two smaller intermediary in-memory MPTs on top of the existing on-disk MPT to reduce read and write amplification. The two small in-memory MPTs hold the data of the most frequently used accounts, such as USDT or Uniswap, so writes to and reads from these accounts are very fast. The on-disk MPT serves as a checkpoint for the state data.
To keep the two in-memory MPTs small enough to fit in memory, merge operations periodically flush their updates to the on-disk MPT.
LMPT also has a separate flat KV store for unauthenticated queries during block execution.
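A simplified sketch of the layered lookup and merge flow (our illustration; the tries are modeled as plain dicts and the flat KV store used during execution is omitted):

```python
class LayeredState:
    def __init__(self):
        self.hot_a = {}     # small in-memory trie (most recent updates)
        self.hot_b = {}     # second in-memory trie (awaiting merge)
        self.disk = {}      # large on-disk MPT / checkpoint

    def get(self, key):
        # Hot accounts hit the in-memory tries and never touch disk.
        for layer in (self.hot_a, self.hot_b, self.disk):
            if key in layer:
                return layer[key]
        return None

    def put(self, key, value):
        self.hot_a[key] = value

    def merge(self):
        # Periodic flush keeps the in-memory tries small enough to fit in RAM.
        self.disk.update(self.hot_b)
        self.hot_b, self.hot_a = self.hot_a, {}
```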
The RainBlock[5] architecture offers a compelling solution to the I/O challenges that plague current blockchain systems. The core of its innovation lies in decoupling storage from computation and introducing the DSM-Tree.
When executing transactions, the miners obtain needed data from multiple I/O Helpers and send the updates to multiple storage nodes. Each storage node maintains a shard of MPTs in memory. By pre-executing transactions, I/O Helpers can proactively fetch the necessary data and witnesses, reducing the latency associated with on-demand state accesses. This helps minimize I/O operations in the critical path of transaction processing, leading to significant throughput improvements without compromising the security or decentralization of the blockchain.
Data persistence is achieved through a combination of write-ahead logs and checkpoints. The process of navigating the Merkle tree is separated from hash computation; retrieving the next node is a simple pointer dereference.
However, while RainBlock reduces local storage I/O, it increases reliance on the network for state access. This makes the system vulnerable to network congestion and latency issues, which could impact transaction processing performance. Besides, decoupling storage and computation potentially increases the attack surface and makes the system more difficult to reason about and debug.
Merkelized LSM (mLSM)[6] replaces the two-layered architecture by maintaining multiple independent Merkle binary tries in an LSM way. mLSM envisions multiple levels, each containing several immutable Merkle trees. Each level also maintains a lookup cache and all tries are connected via a master root node.
Writes are batched in memory and written to level 0 as a new Merkle tree. Similar to LSM trees, mLSM also performs compaction when needed. Reads require inspecting multiple tries at multiple levels similar to an LSM-based database.
However, compaction in mLSM incurs its own overhead (thus affecting write amplification). Furthermore, compaction is usually non-deterministic, which is problematic for blockchains, where nodes must agree on the exact mLSM state at any point in time.
A Jellyfish Merkle Trie (JMT)[7] is quite similar to an MPT except for two aspects: nodes are keyed by version rather than by hash (a version concatenated with the node's nibble path serves as the database key), and extension nodes are removed, leaving only internal and leaf node types.
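A hedged sketch of the version-based keying idea (the general JMT-style scheme, not Diem's exact byte layout): nodes are addressed by (version, nibble path), so nodes written at the same version cluster together and stale versions can be pruned by version rather than compacted.

```python
def jmt_node_key(version: int, path_nibbles: list[int]) -> bytes:
    # Version prefix first, then the nibble path of the node.
    return version.to_bytes(8, "big") + bytes(path_nibbles)

# Two updates to the same path at different versions produce ordered keys:
k1 = jmt_node_key(100, [0xa, 0x7])
k2 = jmt_node_key(101, [0xa, 0x7])
assert k1 < k2   # newer versions sort after older ones in the KV store
```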
Nearly Optimized Merkle Tree (NOMT)[8] builds an optimized binary Merkle trie on top of RocksDB. The trie is cleverly laid out so that a rootless subtree of depth 6 fits into a single SSD page. Whenever a node is read, all sibling nodes (within the subtree) required for hash recalculation are fetched in the same I/O round trip.
NOMT also makes use of the internal parallelism of modern SSDs to prefetch data before execution.
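A quick check of the page-size arithmetic implied by that layout, assuming 32-byte node hashes and 4 KiB SSD pages (our assumptions, not quoted specifications):

```python
HASH_BYTES = 32
PAGE_BYTES = 4096
# A rootless binary subtree of depth 6 has nodes at depths 1..6: 2 + 4 + ... + 64.
nodes_in_rootless_subtree = sum(2**d for d in range(1, 7))    # = 126 nodes
print(nodes_in_rootless_subtree * HASH_BYTES)                 # 4032 bytes
assert nodes_in_rootless_subtree * HASH_BYTES <= PAGE_BYTES   # fits one SSD page
```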
Similar to JMT, LETUS[9] also adopts delta encoding and a version-based indexing scheme. In addition, LETUS breaks the two-layered architecture with a so-called DMM-Tree, which combines Merkle trees, delta encoding, and log-structured files to significantly reduce storage consumption and I/O amplification.
LETUS also groups related nodes (i.e., two-level 16-ary subtrees) into pages that align well with the SSD page size, which reduces I/Os when recomputing hashes on data updates.
Verkle Trees[10] (Verkles for short) are a critical step on the roadmap to stateless Ethereum clients. Verkles are similar to MPTs except that they replace hashes with vector commitments to commit to child nodes.
The key benefit of vector commitments is proof size. Instead of including all sibling hashes (i.e., 15 nodes per level in MPTs) along the path from the leaf to the root, Verkle proofs need only a constant-size element per level. This also allows Verkles to use a much larger branching factor than the 16 used in MPTs. As a result, Verkles provide significantly smaller proofs than MPTs.
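A rough, illustrative size comparison under assumed parameters (depth-8 hexary MPT with 32-byte hashes versus a depth-3 Verkle trie with branching factor 256, ~32-byte per-level commitments, and a single ~150-byte aggregated opening proof; actual numbers depend on the commitment scheme and tree shape):

```python
mpt_depth, verkle_depth = 8, 3
mpt_proof = mpt_depth * 15 * 32                  # 15 sibling hashes at every level
verkle_proof = verkle_depth * 32 + 150           # per-level commitments + one opening
print(f"MPT witness   : ~{mpt_proof} bytes")     # ~3840 bytes
print(f"Verkle witness: ~{verkle_proof} bytes")  # ~246 bytes
```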
However, Verkles rely on cryptographic vector commitments, which are complicated and computationally expensive (elliptic-curve operations versus hashing). Computation time, rather than I/O, therefore becomes the bottleneck.
LVMT[11] maintains multiple levels of AMTs[12] (built upon vector commitments), where each AMT has a 16-bit key space. An AMT is a cryptographic vector commitment scheme that can update its commitment in constant time; however, the constant factor is large in practice.
To reduce the update time, LVMT employs a version-based approach. Keys are assigned versions, and an update to a key simply increments the corresponding version within the AMT. Expensive elliptic-curve multiplications are thereby reduced to multiplying a precomputed elliptic-curve point by the version delta (i.e., the difference between the new and old version values), which amounts to a few point additions (when an update increases the version by 1, it is a single addition).
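A toy sketch of the version-bump trick (integers under addition stand in for elliptic-curve points; `precomputed[i]` plays the role of the precomputed curve point for slot i):

```python
precomputed = [1000 + i for i in range(16)]   # stand-ins for precomputed EC points

def update_commitment(commitment, slot, old_version, new_version):
    delta = new_version - old_version
    # Instead of recomputing the whole commitment, add delta * precomputed point;
    # when a write bumps the version by 1, this is a single group addition.
    return commitment + delta * precomputed[slot]

C = 0
C = update_commitment(C, slot=3, old_version=0, new_version=1)   # one "addition"
C = update_commitment(C, slot=3, old_version=1, new_version=4)   # delta = 3
```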
LVMT introduces a proof sharding technique in which each node is responsible for generating proofs for a specific shard (indicated by 4-bit key prefixes).
Table. A comparison of different state access designs.
Design | Layers | Indexing | Backend DB | Sharded | Building Blocks |
---|---|---|---|---|---|
MPT-Based | 2 | Hash-based | LSM-based DB | ❌ | MPT, LSM-based KV Stores |
Layered MPT | 2 | Hash-based | LSM-based DB | ❌ | Multiple MPTs, Flat KV Store |
ChainKV | 2 | Hash-based | LSM-based DB | ❌ | State vs Non-state data separation, Prefix MPT |
JMT-based | 2 | Version-based | LSM-based DB | ❌ | Version-based |
RainBlock | 2 | In-memory pointer-based | Checkpoints | ✅ | I/O Helpers, Sharded proofs, Storage decoupling from miners |
mLSM | 1 | Hash-based | LSM-based DB | ❌ | Multiple Merkle trees |
NOMT | 2 | Hash-based | LSM-based DB | ❌ | Page optimization |
LETUS | 1 | Version-based | Log-structured files | ❌ | Delta-encoding, Version-based, Page optimization |
Verkles | 2 | Hash-based | LSM-based DB | ❌ | Vector commitments |
LVMT | 2 | Version-based | LSM-based DB | ✅ | AMT, Version-based, Proof sharding |
Table. Pros and cons of state access designs.
Design | Pros | Cons |
---|---|---|
MPT-Based | - Ethereum compatibility. | - High amplification due to the two-layered architecture and hash-based indexing. - High data consumption. |
Layered MPT | - Fast reads from/writes to hot data. - Fast execution by using a flat key-value store. | - Potentially expensive queries of authenticated cold data due to the need to access multiple trees. - Expensive compaction between multiple MPTs, potentially translating into high write amplification. |
ChainKV | - Separation of state and non-state data exploiting their access patterns. - Prefix MPTs taking advantage of data locality (temporal and spatial). | - Low performance gain when the portion of state data is small. |
JMT-based | - Efficient (possibly zero) compaction due to version-based indexing. - Potentially low data consumption if delta-encoding is employed for keys. | - Slightly high read amplification due to the two-layered architecture. - Removal of extension nodes leading to higher data consumption and slightly higher read amplification. |
RainBlock | - Separation of execution nodes and storage nodes. - Trading local I/O for network I/O, benefiting from parallel I/O and in-memory storage. | - Potentially high transaction-processing latency due to remote data fetches. - Requirements on the number of I/O Helpers and storage nodes. - Potentially high failure-recovery time and space usage due to checkpoints. |
mLSM | - Removal of the two-layered architecture, significantly reducing I/O amplification. | - Additional overhead introduced by compaction. - Compaction non-determinism potentially leading to unwanted results. |
NOMT | - Efficient page optimization reducing I/O amplification. - Effective parallel prefetching due to encoding. | - Minimal support for historical data. - Less effective when keys have long prefixes (i.e., sparse data) due to the page design. |
LETUS | - Removal of the two-layered architecture, significantly reducing I/O amplification. - Low data consumption due to delta-encoding. - Efficient page optimization reducing I/O amplification. - Efficient compaction due to version-based indexing. | - Complexity in retrieving the latest data due to the need to replay delta changes. - Lack of range queries. |
Verkles | - Smaller proof sizes than MPT-based designs, enabling stateless clients. | - High complexity due to the use of expensive vector commitments. |
LVMT | - Constant-time commitment updates. - Low I/O amplification. - Unlimited scalability due to the multi-AMT architecture. | - High complexity due to the use of cryptographic AMTs. - Requirement of node collaboration to generate proofs. |
Choosing which aspects to optimize depends on the degree of compatibility each chain or rollup wants. For Ethereum, many take EVM compatibility to mean full compatibility with the Ethereum ecosystem, including the gas mechanism, EIPs, and tool reusability, while others mean only Solidity compatibility (and therefore have more flexibility in designing authenticated storage).
The research space is vast; let's discuss more!
Ethereum actually uses a modified version of MPTs but we use them interchangeably in this post. ↩︎
ChainKV: A Semantics-Aware Key-Value Store for Ethereum System ↩︎
LMPT: A Novel Authenticated Data Structure to Eliminate Storage Bottlenecks for High Performance Blockchains ↩︎
RainBlock: Faster Transaction Processing in Public Blockchains ↩︎
LETUS: A Log-Structured Efficient Trusted Universal BlockChain Storage ↩︎