
Call for research: Fast, on-disk authenticated key-value stores for state management

tl;dr: Efficiently maintaining authenticated state on disk is a research problem many blockchains are facing. Succinctly, Merkle trees do not fit in memory and maintaining them on-disk is rife with performance penalties. What can be done?

This blog discusses a few potential directions:

  • We could investigate on-disk, high-arity Verkle trees[1].
  • We could make the problem obsolete by designing homomorphic authenticated data structures[2] whose digest can be updated directly without going to disk.
  • We could investigate new state synchronization designs that rely not on Merkle trees, but on other cryptographic primitives such as multiset incremental hash functions[3],[4].
  • We could borrow ideas from stateless validation[5],[6],[7],[8],[9].
  • We could design new databases optimized for storing Merkle trees.
  • We could authenticate a database like RocksDB directly, instead of storing a Merkle tree in the database[10].

You can read ahead to better understand this problem and its potential solution(s).

Motivation

Most blockchain systems maintain a large, append-only history of transactions. These transactions get executed, resulting in the blockchain's state, which is typically abstracted as a key-value store (KVS).

This KVS is often authenticated, often via Merkle trees, for three reasons.

  • First, to allow new nodes to efficiently state-sync when joining the network for the first time. (We call this state synchronization.)
  • Second, to allow slow nodes or offline-for-a-while nodes to easily catch up with state updates they have missed. This also falls into state synchronization.
  • Third, to cryptographically convince light clients that a particular key has a particular value in the state (e.g., Alice has 5 coins, or memory location x has value y in smart contract c).

Blockchain state does not fit in memory

Unfortunately, the KVS representing the state can become prohibitively large in production blockchains: e.g., tens to hundreds of billions of key-value pairs.

Furthermore, authenticating such a large key-value store introduces its own overhead.

For example, building a perfect binary Merkle tree over ~8 billion key-value pairs will require ~256 GiB of memory to store the internal nodes of the tree[11]. Note that this is extra storage beyond the storage required for the key-value pairs themselves and we are not even accounting for the hashes of the leaves in the tree.
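The footnote's estimate is easy to sanity-check (assuming ~2^33 ≈ 8.6 billion leaves and 32-byte hashes):

```python
# Back-of-the-envelope check of the 256 GiB estimate.
leaves = 2**33                       # ~8.6 billion key-value pairs
internal_nodes = leaves - 1          # a perfect binary tree with n leaves has n - 1 internal nodes
bytes_total = internal_nodes * 32    # 32 bytes per hash
print(bytes_total / 2**30)           # ~256 GiB
```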

While some of these 256 GiB can be stored in RAM instead of on-disk, ultimately, a decentralized blockchain must assume most nodes do not have 256 GiB of RAM.

This motivates why such an authenticated key-value store (AKVS)[12] must be stored on disk.

Not just any kind of Merkle tree

Certain choices of trees do not suffice for implementing an AKVS.

For example, simply appending key-value pairs left-to-right to a Merkle tree (e.g., a history tree [13]) does not give an authenticated KVS construction. This is because, in such a Merkle tree, one cannot create a succinct Merkle proof that a key's value is the most recent value.

Specifically, the naive solution of giving a Merkle path to the leaf with the latest value for that key is easily attacked: an attacker can give a different Merkle path to a leaf with a less recent value. From the verifier's perspective, there is no way to tell the recency of the value.
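The attack can be made concrete with a small sketch (a toy SHA-256 Merkle tree over an append-only log; the key names and encodings are illustrative): both a stale leaf and the latest leaf for the same key verify against the same root, so a Merkle path alone proves membership, not recency.

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    # build the tree bottom-up (assumes the number of leaves is a power of two)
    level = [h(l) for l in leaves]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, idx):
    # collect (sibling hash, sibling-is-left) pairs along the root path
    level = [h(l) for l in leaves]
    proof = []
    while len(level) > 1:
        sib = idx ^ 1
        proof.append((level[sib], sib < idx))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        idx //= 2
    return proof

def verify(root, leaf, proof):
    acc = h(leaf)
    for sib, sib_is_left in proof:
        acc = h(sib + acc) if sib_is_left else h(acc + sib)
    return acc == root

# append-only log: the key "alice" is written twice
log = [b"alice=10", b"bob=7", b"alice=5", b"carol=3"]
root = merkle_root(log)

# both the stale value (index 0) and the latest value (index 2) verify:
assert verify(root, b"alice=10", merkle_proof(log, 0))
assert verify(root, b"alice=5",  merkle_proof(log, 2))
```

A verifier handed the first proof has no way to tell that a fresher write for `alice` exists elsewhere in the tree, which is exactly why a positional (e.g., prefix-tree) layout is needed.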

Updating on-disk Merkle prefix trees is too slow

Suppose one has enough space to store such an AKVS on disk. The next challenge will be efficiently updating the AKVS when proposing & validating blocks.

First, let us consider block proposal. To propose a block, a block proposer needs to include the new digest of the AKVS so as to reflect the changes to the state after that block’s transactions have been applied.

In a Merkle prefix tree, the most popular data structure for a blockchain AKVS, computing the new digest involves expensive disk I/O:

  1. First, Merkle proofs for all keys whose values have changed must be read from disk. (Naturally, this can be optimized to avoid reading nodes twice.)
  2. Second, these proofs (together with the new values of the keys) are used to compute the new Merkle root (i.e., the new digest) for the updated state.
  3. Third, all Merkle nodes computed in step (2) above must be written back to disk. This ensures Merkle proofs remain up-to-date.
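These three steps can be sketched with a toy cost model (a hypothetical node-counting function, not a real database): each updated leaf touches one root path of about log2(n) nodes, and deduplicating shared path prefixes, as noted in step (1), reduces the count when updated keys are close together.

```python
import math
import random

def nodes_on_root_paths(n_leaves: int, updated_idxs) -> int:
    # count the distinct tree nodes that must be read (step 1) and
    # re-hashed and written back (steps 2-3) for a batch of leaf updates;
    # nodes on shared path prefixes are counted once
    depth = int(math.log2(n_leaves))
    touched = set()
    for idx in updated_idxs:
        for lvl in range(depth + 1):  # level 0 = leaf hashes, level `depth` = root
            touched.add((lvl, idx))
            idx //= 2
    return len(touched)

n, k = 2**20, 100
random.seed(42)
scattered = random.sample(range(n), k)  # typical case: keys spread across the tree
clustered = list(range(k))              # best case: adjacent keys share prefixes

print(nodes_on_root_paths(n, scattered))  # close to k * (log2(n) + 1)
print(nodes_on_root_paths(n, clustered))  # far smaller: paths overlap heavily
```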

Overall, block proposal requires O(k log n) reads and writes in an AKVS of n key-value pairs after a sequence of k updates.

Second, let us consider block validation. Here, things look very similar: a validator must verify that the new digest included in the proposed block is correct. For this, the validator must re-execute steps (1) - (2) from above and check it obtained the same digest as included in the block. Lastly, the validator must execute step (3) from above in order for the next block’s validation to proceed against the new digest.

Therefore, block validation incurs the same O(k log n) disk cost.

In practice, large block sizes help reduce the impact of the log n factor in the O(k log n) expression above. But, nonetheless, this O(k log n) cost remains problematic.

Overheads of databases when persisting Merkle trees

Many blockchain systems implement their AKVS as a Merkle prefix tree and then store the nodes in a database (DB) such as LevelDB or RocksDB. This further amplifies the cost of reading (and writing) Merkle nodes from (and to) disk, depending on the peculiarities of the DB layer.

For example, in RocksDB, background compaction of levels will slow down I/O significantly. Furthermore, RocksDB stores a significant amount of metadata, further increasing the on-disk size of a RocksDB-backed AKVS beyond the estimate above[11:1].

Challenges and opportunities

Below, we summarize the main sources of overhead when persisting Merkle trees:

  1. For a perfect binary tree over n key-value pairs, we must persist n − 1 internal nodes of 32 bytes each.
  2. Updating the Merkle prefix tree requires a logarithmic number of disk reads and writes (as detailed above).
  3. We are (more-or-less) randomly reading & writing paths in the tree.
  4. The database layer itself is not necessarily optimized for storing Merkle trees.
  5. The file system might further slow down the database layer.
  6. Maintaining past versions of the authenticated data structure[14] introduces its own overheads. For example, pruning must be employed to delete old versions, which generates even more I/O.

Cryptographic-research opportunities

  1. Can we use high-arity Verkle trees[1:1]? This would decrease the number of internal nodes that must be stored in the DB. The reduced I/O could be worth the higher computation.
    1. Using non-pairing curves with inner-product arguments (IPAs) could drastically improve performance.
    2. A key challenge here would be computing proofs fast for light clients.
    3. Another would be figuring out the performance impact on state synchronization.
  2. Can we design more efficient streaming authenticated data structures[2:1], specifically for key-value stores? This would preclude the need to read Merkle proofs (or any other auxiliary information) from disk in order to compute the new digest when proposing or validating a block.
    1. A key challenge here would be to co-design a state synchronization mechanism for this
  3. << Your idea here >>?

Database research opportunities

  1. Given that the only performant AKVS we know of are Merkle-based, can we design an efficient DB layer optimized for storing & updating Merkle trees?
  2. Can we authenticate the DB layer itself, instead of storing an AKVS in the DB? Some relevant previous work includes:
    1. Raju et al.[10:1] investigate authenticating log-structured merge tree-based DBs such as RocksDB.
    2. Smith and Rusnak[15] investigate authenticating B-trees.
    3. Li et al.[16] authenticate a file system.
  3. Can we investigate the overhead of the file-system itself on the database layer[17],[18]?
  4. Can we optimize the layout of the data in our database? For example, previous work investigates storing a binary Merkle tree as a 16-ary Merkle tree on disk[19].
  5. << Your idea here >>?

System design opportunities

  1. Are Merkle trees even necessary for state-synchronization? (Let us ignore they might be necessary for light clients for a second.)
  2. Can ideas from stateless validation[5:1][6:1][7:1][8:1][9:1] be leveraged?
  3. To what extent can incremental multi-set hash functions[3:1],[4:1] be used for state synchronization?
  4. Can we efficiently co-design the AKVS together with the state synchronization protocol?
  5. << Your idea here >>?
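As a toy illustration of item (3) above, here is a lattice-style incremental multiset hash in the spirit of [3],[4] (toy parameters: 16 lanes of 16 bits, far too small to be secure; real constructions such as LtHash use many more, wider lanes): the digest of the entire state can be updated in O(1) work per key-value change, with no tree traversal and no disk reads.

```python
import hashlib

LANES = 16   # toy parameters: 16 lanes of 16 bits each;
MOD = 2**16  # real constructions use far larger vectors for security

def item_hash(key: bytes, value: bytes) -> list:
    # hash one key-value pair into a vector of small integers
    d = hashlib.blake2b(key + b"\x00" + value, digest_size=2 * LANES).digest()
    return [int.from_bytes(d[2*i : 2*i + 2], "little") for i in range(LANES)]

def ms_add(digest: list, item: list) -> list:
    return [(a + b) % MOD for a, b in zip(digest, item)]

def ms_sub(digest: list, item: list) -> list:
    return [(a - b) % MOD for a, b in zip(digest, item)]

# the state digest is updated in O(1) per change -- no tree, no disk I/O
state = [0] * LANES
state = ms_add(state, item_hash(b"alice", b"10"))
state = ms_add(state, item_hash(b"bob", b"7"))

# overwrite alice's value: subtract the old pair, add the new one
state = ms_sub(state, item_hash(b"alice", b"10"))
state = ms_add(state, item_hash(b"alice", b"5"))

# order-independence: any insertion order of the same pairs gives the same digest
other = ms_add(ms_add([0] * LANES, item_hash(b"bob", b"7")),
               item_hash(b"alice", b"5"))
assert state == other
```

The catch, of course, is that such a digest supports no succinct membership proofs by itself, which is why it is a candidate for state synchronization rather than a drop-in AKVS replacement.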

Conclusion

On-disk authenticated state management is a problem almost every decentralized blockchain with an efficient, fault-tolerant state synchronization protocol faces.

This problem could use more attention from the research community, whether in terms of systematizing prior related research, or in terms of generating new ideas.

Let's discuss!

PS: There are many more papers worth referencing in this post. I hope to slowly update it with them. Here are a few I ran into that I hope to look into:

An interesting line of related work uses enclaves such as Intel SGX to implement an on-disk AKVS[20][21][22][23][24]. It could be useful to leverage ideas from these works.

Acknowledgements

Big thanks to Alden Hu, Sital Kedia, Aaron Gao, Josh Lind and Zekun Li for schooling me in the performance bottlenecks of production blockchains and helping me refine some of these ideas.

Big thank you to Soujanya Ponnapalli, the coauthor of the Merkelized LSM[10:2] approach and RainBlock[9:2] for very useful discussions on why this approach could be better in practice.

References


  1. Verkle Trees: V(ery short M)erkle Trees; by John Kuszmaul; in MIT PRIMES Conference '18; 2018; https://math.mit.edu/research/highschool/primes/materials/2018/Kuszmaul.pdf ↩︎ ↩︎

  2. Streaming Authenticated Data Structures; by Papamanthou, Charalampos and Shi, Elaine and Tamassia, Roberto and Yi, Ke; in EUROCRYPT 2013; 2013 ↩︎ ↩︎

  3. Incremental Cryptography: The Case of Hashing and Signing; by Bellare, Mihir and Goldreich, Oded and Goldwasser, Shafi; in Advances in Cryptology - CRYPTO '94; 1994 ↩︎ ↩︎

  4. A New Paradigm for Collision-Free Hashing: Incrementality at Reduced Cost; by Bellare, Mihir and Micciancio, Daniele; in EUROCRYPT '97; 1997 ↩︎ ↩︎

  5. Improving Authenticated Dynamic Dictionaries, with Applications to Cryptocurrencies; by Reyzin, Leonid and Meshkov, Dmitry and Chepurnoy, Alexander and Ivanov, Sasha; in FC'17; 2017 ↩︎ ↩︎

  6. Edrax: A Cryptocurrency with Stateless Transaction Validation; by Alexander Chepurnoy and Charalampos Papamanthou and Yupeng Zhang; 2018; https://eprint.iacr.org/2018/968 ↩︎ ↩︎

  7. Batching Techniques for Accumulators with Applications to IOPs and Stateless Blockchains, by Boneh, Dan and Bünz, Benedikt and Fisch, Ben, in CRYPTO'19, 2019 ↩︎ ↩︎

  8. Pointproofs: Aggregating Proofs for Multiple Vector Commitments, by Gorbunov, Sergey and Reyzin, Leonid and Wee, Hoeteck and Zhang, Zhenfei, in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020, [URL] ↩︎ ↩︎

  9. RainBlock: Faster Transaction Processing in Public Blockchains; by Soujanya Ponnapalli and Aashaka Shah and Souvik Banerjee and Dahlia Malkhi and Amy Tai and Vijay Chidambaram and Michael Wei; in 2021 USENIX Annual Technical Conference (USENIX ATC 21); 2021; https://www.usenix.org/conference/atc21/presentation/ponnapalli ↩︎ ↩︎ ↩︎

  10. mLSM: Making Authenticated Storage Faster in Ethereum; by Pandian Raju and Soujanya Ponnapalli and Evan Kaminsky and Gilad Oved and Zachary Keener and Vijay Chidambaram and Ittai Abraham; in 10th {USENIX} Workshop on Hot Topics in Storage and File Systems (HotStorage 18); 2018; https://www.usenix.org/conference/hotstorage18/presentation/raju ↩︎ ↩︎ ↩︎

  11. ~2^33 internal nodes × 32 bytes per node = 256 GiB ↩︎ ↩︎

  12. We’ll use the more abstract AKVS terminology, and refer to concrete constructions such as Merkle prefix trees or Merkle binary search trees when necessary. ↩︎

  13. Efficient Data Structures for Tamper-Evident Logging; by Crosby, Scott A. and Wallach, Dan S.; in Proceedings of the 18th Conference on USENIX Security Symposium; 2009 ↩︎

  14. Persistent Authenticated Dictionaries and Their Applications; by Anagnostopoulos, Aris and Goodrich, Michael T. and Tamassia, Roberto; in Information Security; 2001 ↩︎

  15. Dynamic Merkle B-tree with Efficient Proofs; by Chase Smith and Alex Rusnak; 2020 ↩︎

  16. Secure Untrusted Data Repository (SUNDR); by Li, Jinyuan and Krohn, Maxwell and Mazières, David and Shasha, Dennis; in Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6; 2004; http://dl.acm.org/citation.cfm?id=1251254.1251263 ↩︎

  17. https://stackoverflow.com/questions/49744812/embedded-key-value-db-vs-just-storing-one-file-per-key ↩︎

  18. Performance Comparison of Operations in the File System and in Embedded Key-Value Databases; by Jesse Hines and Nicholas Cunningham and German Alferez; 2020; https://knowledge.e.southern.edu/cgi/viewcontent.cgi?article=1110&context=crd ↩︎

  19. Jellyfish Merkle Tree; by Zhenhuan Gao and Yuxuan Hu and Qinfan Wu; 2021; https://developers.diem.com/papers/jellyfish-merkle-tree/2021-01-14.pdf ↩︎

  20. FastVer: Making Data Integrity a Commodity, by Arvind Arasu and Badrish Chandramouli and Johannes Gehrke and Esha Ghosh and Donald Kossmann and Jonathan Protzenko and Ravi Ramamurthy and Tahina Ramananandro and Aseem Rastogi and Srinath Setty and Nikhil Swamy and Alexander van Renen and Min Xu, in Proceedings of the 2021 International Conference on Management of Data, 2021, [URL] ↩︎

  21. Concerto, by Arvind Arasu and Ken Eguro and Raghav Kaushik and Donald Kossmann and Pingfan Meng and Vineet Pandey and Ravi Ramamurthy, in Proceedings of the 2017 {ACM} International Conference on Management of Data, 2017, [URL] ↩︎

  22. Speicher: Securing LSM-Based Key-Value Stores Using Shielded Execution, by Bailleu, Maurice and Thalheim, Jörg and Bhatotia, Pramod and Fetzer, Christof and Honda, Michio and Vaswani, Kapil, in Proceedings of the 17th USENIX Conference on File and Storage Technologies, 2019 ↩︎

  23. Authenticated key-value stores with hardware enclaves, by Kai Li and Yuzhe Tang and Qi Zhang and Jianliang Xu and Ju Chen, in Proceedings of the 22nd International Middleware Conference: Industrial Track, 2021, [URL] ↩︎

  24. Authenticated Key-Value Stores with Hardware Enclaves, by Yuzhe Tang and Ju Chen and Kai Li and Jianliang Xu and Qi Zhang, 2019 ↩︎