# Cosmos-SDK Versioned Store ## Context We propose an storage format that can serve the historical state queries while being more space efficient than iavl tree, the design is inspired by erigon[^1]. ## DB Structure ### Plain State > It's similar to fast node index of current IAVL tree. - `key -> value`, the latest state. ### History Index The set of block numbers that modified the key, encoded as a roaring bitmap. - `key -> set of block numbers` ### Change Sets Record the prior value for each modification ([alternative: store after value](#Store-After-Value)). - `(block number, key) -> value` ## Iterating Historical Versions Iterating the latest state is simple, just iterate the plain state like a normal kv db. The algorithm of iteration for a historical version would look like this: ```python it_plain = plain.iterator(start_key) it_history = history.iterator(start_key) def next(): k, v = it_plain.next() hk, bm = it_history.next() # stop iteration if both iterators finished if k is None and hk is None: raise StopIteration if hk is None or hk > k: # k doesn't exist in history index, it's available from genesis and not modified. # use latest state directly. return k, v else: # lookup the value of hk in changesets target, found = bm.seek(block) # seek returns the smallest number that >= block if not found and hk == k: # there's no change after the target block, use latest state return k, v return hk, changesets.get((target, hk)) ``` ## Bitmap Chunking To prevent the history bitmap of some hot keys become too big, we can split it into multiple chunks. So it becomes `(key, chunk_id) -> bitmap`, where `chunk_id` is either the maximum value in the chunk, or `^uint64(0)` if it's the last chunk. The difficulty for cosmos-sdk is our keys have variable length, simply append surfix to the key will confuse the iteration, it can be solved with some special features of certain db backends, like dupsort of lmdb/mdbx [^2]. For the other db backends, it can't support bitmap chunking. The other thing is roaring bitmap is mmap-able, so we can take advantage of mmap support of some db backends. ## Store After Value An alternative design is to store the after value in change set. The change set will includes all the states, and the plain state becomes an optional cache similar to the fast node index in iavl tree. It becomes easier to distribute the state through CDN, we only need to distribute the change set, nodes can rebuild the history index and latest state locally. We can also chunking the change set by block ranges, and upload to CDN chunk by chunk. Another observation is that in this way the change set becomes a direct equavalence to the file streamer output. ## Equivalence to IAVL Tree Assuming the iavl tree is modified in the order of key in each version, we can rebuild the IAVL tree from this versioned storage by replaying the modifications version by version. ## Aggregation It's possible to aggregate a range of blocks to a single version, dropping the intermidiate changes, it can save disk space, but won't be able to reconstruct the IAVL tree. ## Alternatives - Experimental rocksdb user-defined timestamp[^3] ## References [^1]: https://github.com/ledgerwatch/erigon/blob/devel/docs/programmers_guide/db_walkthrough.MD#table-history-of-accounts [^2]: https://github.com/ledgerwatch/erigon/blob/devel/docs/programmers_guide/dupsort.md [^3]: https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-(Experimental)