tl;dr: Efficiently maintaining authenticated state on disk is a research problem many blockchains are facing. Succinctly, Merkle trees do not fit in memory and maintaining them on-disk is rife with performance penalties. What can be done?
This blog discusses a few potential directions:
You can read ahead to better understand this problem and its potential solution(s).
Most blockchain systems maintain a large, append-only history of transactions. These transactions get executed, resulting in the blockchain's state, which is typically abstracted as a key-value store (KVS).
This KVS is often authenticated, often via Merkle trees, for three reasons.
Unfortunately, the KVS representing the state can become prohibitively-large in production blockchains: e.g., tens to hundreds of billions of key-value pairs.
Furthermore, authenticating such a large key-value store introduces its own overhead.
For example, building a perfect binary Merkle tree over ~8 billion key-value pairs will require ~256 GiB of memory to store the internal nodes of the tree[11]. Note that this is extra storage beyond the storage required for the key-value pairs themselves and we are not even accounting for the hashes of the leaves in the tree.
While some of these 256 GiB can be stored in RAM instead of on-disk, ultimately, a decentralized blockchain must assume most nodes do not have 256 GiB of RAM.
This motivates why such an authenticated key value store (AKVS)[12] must be stored on disk.
Certain choices of trees do not suffice for implementing an AKVS.
For example, simply appending key-value pairs left-to-right to a Merkle tree (e.g., a history tree [13]) does not give an authenticated KVS construction. This is because, in such a Merkle tree, one cannot create a succinct Merkle proof that a key's value is the most recent value.
Specifically, the naive solution of giving a Merkle path to the leaf with the latest value for that key is easily attacked: an attacker can give a different Merkle path to a leaf with a less recent value. From the verifier's perspective, there is no way to tell the recency of the value.
Suppose one has enough space to store such an AKVS on disk. The next challenge will be efficiently updating the AKVS when proposing & validating blocks.
First, let us consider block proposal. To propose a block, a block proposer needs to include the new digest of the AKVS so as to reflect the changes to the state after that block’s transactions have been applied.
In a Merkle prefix tree, the most popular data structure for a blockchain AKVS, computing the new digest involves expensive disk I/O:
Overall, block proposal requires reads and writes in an AKVS of size key-value pairs after a sequence of updates.
Second, let us consider block validation. Here, things look very similar: a validator must verify that the new digest included in the proposed block is correct. For this, the validator must re-execute steps (1) - (2) from above and check it obtained the same digest as included in the block. Lastly, the validator must execute step (3) from above in order for the next block’s validation to proceed against the new digest.
Therefore, block validation incurs the same disk cost.
In practice, large block sizes help reduce the impact of the factor in the expression above. But, nonetheless, this cost remains problematic.
Many blockchain systems implement their AKVS as a Merkle prefix tree and then store the nodes in a database (DB) such as LevelDB or RocksDB. This further amplifies the cost of reading (and writing) Merkle nodes from (and to) disk, depending on the peculiarities of the DB layer.
For example, in RocksDB, background compaction of levels will slow down I/O significantly. Furthermore, RocksDB stores a significant amount of metadata, which increases the size of a RocksDB-backed AKVS, which we estimated above[11:1].
Below, we summarize the main sources of overhead when persisting Merkle trees:
On-disk authenticated state management is a problem almost every decentralized blockchain with an efficient, fault-tolerant state synchronization protocol faces.
This problem could use more attention from the research community, whether in terms of systematizing prior related research, or in terms of generating new ideas.
Let's discuss!
PS: There are many more papers worth referecing in this post. I hope to slowly update it with them. Here's a few I ran into that I hope to look into:
An interesting line of related work uses enclaves such as Intel SGX to implement an on-disk AKVS[20][21][22][23][24]. It could be useful to leverage ideas from these works.
Big thanks to Alden Hu, Sital Kedia, Aaron Gao, Josh Lind and Zekun Li for schooling me in the performance bottlenecks of production blockchains and helping me refine some of these ideas.
Big thank you to Soujanya Ponnapalli, the coauthor of the Merkelized LSM[10:2] approach and RainBlock[9:2] for very useful discussions on why this approach could be better in practice.
Verkle Trees: V(ery short M)erkle Trees; by John Kuszmaul; in MIT PRIMES Conference '18; 2018; https://math.mit.edu/research/highschool/primes/materials/2018/Kuszmaul.pdf ↩︎ ↩︎
Streaming Authenticated Data Structures; by Papamanthou, Charalampos and Shi, Elaine and Tamassia, Roberto and Yi, Ke; in EUROCRYPT 2013; 2013 ↩︎ ↩︎
Incremental Cryptography: The Case of Hashing and Signing; by Bellare, Mihir and Goldreich, Oded and Goldwasser, Shafi; in Advances in Cryptology –- CRYPTO '94; 1994 ↩︎ ↩︎
A New Paradigm for Collision-Free Hashing: Incrementality at Reduced Cost; by Bellare, Mihir and Micciancio, Daniele; in EUROCRYPT '97; 1997 ↩︎ ↩︎
Improving Authenticated Dynamic Dictionaries, with Applications to Cryptocurrencies; by Reyzin, Leonid and Meshkov, Dmitry and Chepurnoy, Alexander and Ivanov, Sasha; in FC'17; 2017 ↩︎ ↩︎
Edrax: A Cryptocurrency with Stateless Transaction Validation; by Alexander Chepurnoy and Charalampos Papamanthou and Yupeng Zhang; 2018; https://eprint.iacr.org/2018/968 ↩︎ ↩︎
Batching Techniques for Accumulators with Applications to IOPs and Stateless Blockchains, by Boneh, Dan and Bünz, Benedikt and Fisch, Ben, in CRYPTO'19, 2019 ↩︎ ↩︎
Pointproofs: Aggregating Proofs for Multiple Vector Commitments, by Gorbunov, Sergey and Reyzin, Leonid and Wee, Hoeteck and Zhang, Zhenfei, in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020, [URL] ↩︎ ↩︎
RainBlock: Faster Transaction Processing in Public Blockchains; by Soujanya Ponnapalli and Aashaka Shah and Souvik Banerjee and Dahlia Malkhi and Amy Tai and Vijay Chidambaram and Michael Wei; in 2021 USENIX Annual Technical Conference (USENIX ATC 21); 2021; https://www.usenix.org/conference/atc21/presentation/ponnapalli ↩︎ ↩︎ ↩︎
mLSM: Making Authenticated Storage Faster in Ethereum; by Pandian Raju and Soujanya Ponnapalli and Evan Kaminsky and Gilad Oved and Zachary Keener and Vijay Chidambaram and Ittai Abraham; in 10th {USENIX} Workshop on Hot Topics in Storage and File Systems (HotStorage 18); 2018; https://www.usenix.org/conference/hotstorage18/presentation/raju ↩︎ ↩︎ ↩︎
We’ll use the more abstract AKVS terminology, and refer to concrete constructions such as Merkle prefix trees or Merkle binary search trees when necessary. ↩︎
Efficient Data Structures for Tamper-Evident Logging; by Crosby, Scott A. and Wallach, Dan S.; in Proceedings of the 18th Conference on USENIX Security Symposium; 2009 ↩︎
Persistent Authenticated Dictionaries and Their Applications; by Anagnostopoulos, Aris and Goodrich, Michael T. and Tamassia, Roberto; in Information Security; 2001 ↩︎
Dynamic Merkle B-tree with Efficient Proofs; by Chase Smith and Alex Rusnak; 2020 ↩︎
Secure Untrusted Data Repository (SUNDR); by Li, Jinyuan and Krohn, Maxwell and Mazières, David and Shasha, Dennis and Shasha, Dennis; in Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6; 2004; http://dl.acm.org/citation.cfm?id=1251254.1251263 ↩︎
https://stackoverflow.com/questions/49744812/embedded-key-value-db-vs-just-storing-one-file-per-key ↩︎
Performance Comparison of Operations in the File System and in Embedded Key-Value Databases; by Jesse Hines and Nicholas Cunningham and German Alferez; 2020; https://knowledge.e.southern.edu/cgi/viewcontent.cgi?article=1110&context=crd ↩︎
Jellyfish Merkle Tree; by Zhenhuan Gao and Yuxuan Hu and Qinfan Wu; 2021; https://developers.diem.com/papers/jellyfish-merkle-tree/2021-01-14.pdf ↩︎
FastVer: Making Data Integrity a Commodity, by Arvind Arasu and Badrish Chandramouli and Johannes Gehrke and Esha Ghosh and Donald Kossmann and Jonathan Protzenko and Ravi Ramamurthy and Tahina Ramananandro and Aseem Rastogi and Srinath Setty and Nikhil Swamy and Alexander van Renen and Min Xu, in Proceedings of the 2021 International Conference on Management of Data, 2021, [URL] ↩︎
Concerto, by Arvind Arasu and Ken Eguro and Raghav Kaushik and Donald Kossmann and Pingfan Meng and Vineet Pandey and Ravi Ramamurthy, in Proceedings of the 2017 {ACM} International Conference on Management of Data, 2017, [URL] ↩︎
Speicher: Securing LSM-Based Key-Value Stores Using Shielded Execution, by Bailleu, Maurice and Thalheim, Jörg and Bhatotia, Pramod and Fetzer, Christof and Honda, Michio and Vaswani, Kapil, in Proceedings of the 17th USENIX Conference on File and Storage Technologies, 2019 ↩︎
Authenticated key-value stores with hardware enclaves, by Kai Li and Yuzhe Tang and Qi Zhang and Jianliang Xu and Ju Chen, in Proceedings of the 22nd International Middleware Conference: Industrial Track, 2021, [URL] ↩︎
Authenticated Key-Value Stores with Hardware Enclaves, by Yuzhe Tang and Ju Chen and Kai Li and Jianliang Xu and Qi Zhang, 2019 ↩︎