# IAVL Archive Storage

## Changelog

- 2022-11-21: First draft
- 2022-11-22: Decision on node value compression strategy:
  - Use the recsplit MPHF algorithm[^1], which seems to be more space efficient and has more implementations[^2].
  - Put the MPHF at the center of the design.

## MPHF Table

An MPHF table is a succinct hash table for a static set of keys: build and save an MPHF for the key set, then store the values in a data file, contiguous in the same order as the hashed values. If the values have variable length, build a monotonically increasing offset table (ordered integers compress better). Items can then be referenced with the hash value instead of the key itself, and a lookup by key takes `O(1)` time.

```
key -> perfect hashed value -> offset table -> seek and read in data file
```

Two use cases:

- Query a node by hash, and deduplicate the children hashes in branch nodes.
- Deduplicate the key field in both branch and leaf nodes. TODO

## IAVL Tree Contents

- Nodes, `hash -> node`
- Roots, `version -> hash`
- Orphan records, fast nodes, metadata

This proposal migrates the first two parts into a new format; the third part is not touched. The new format is optimized for archive nodes and is composed of a set of read-only custom file formats. The basic idea is to dump the node values into a file contiguously, in the order of the node hashes, with some extra structures for queries and compression.

## Node Compression

Compression is very important to reduce the db size. After some experiments, we identified two important redundancies in the current format:

- Branch nodes contain two children hashes, which duplicate the node hashes already stored as keys in the kv db. They can be replaced with node indexes, and the hashes are instead fetched from `node_hash.dat`.
- A branch node's key field is duplicated with one of the leaf nodes. We also find that even among leaf nodes the key fields are greatly duplicated, so it's worthwhile to store the deduplicated keys separately and keep only an index in the node.

After applying the above deduplications, a branch node contains only several integers, which take only a dozen bytes on average when encoded with varint, so no further compression is needed. For leaf nodes, dictionary compression is effective because their number is relatively small (the leaf node to branch node ratio is 1:14 in our testnet's auth store) and the value field is highly compressible.

The compressed node values are appended to a file, and the file offsets are recorded in another file.

## Archive Format

### `node.dat`

Serialized and compressed node values.

### `node_hash.dat`

Node hashes dumped from the kv iteration. The items have fixed length and are ordered, so we can binary search to locate a node index by hash.

### `node_offset.dat`

Node values have variable lengths, so we need an offset table to access them; it is a set of monotonically increasing numbers. The experiment below shows around 2 bytes per node when compressed with a roaring bitmap, which is mmap-able and supports random access; that seems good enough.

```
>>> nnodes = 285064512
>>> avg_size = 50
>>> m = roaring64.BitMap64()
>>> for i in range(nnodes):
...     m.add(i * avg_size)
...
>>> len(m.serialize()) / nnodes
2.0061037131131916
```

### `root.dat`

Contiguous root node indexes; the version number is the array index.
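Putting `node_hash.dat`, `node_offset.dat`, and `node.dat` together, the read path without the optional MPHF (next subsection) looks roughly like the sketch below. For illustration only, it assumes `node_hash.dat` is available as a byte slice of sorted fixed 32-byte hashes and that the offset table has already been decoded into an in-memory list of `N + 1` integers; the function names are made up for this sketch.

```
HASH_SIZE = 32  # fixed-length node hashes in node_hash.dat


def find_node_index(node_hash, hashes):
    """Binary search the sorted, fixed-length entries of node_hash.dat."""
    lo, hi = 0, len(hashes) // HASH_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        if hashes[mid * HASH_SIZE:(mid + 1) * HASH_SIZE] < node_hash:
            lo = mid + 1
        else:
            hi = mid
    if lo * HASH_SIZE < len(hashes) and hashes[lo * HASH_SIZE:(lo + 1) * HASH_SIZE] == node_hash:
        return lo
    return None


def read_node(node_hash, hashes, offsets, node_dat):
    """hash -> node index -> offset table -> seek and read in node.dat.

    `offsets` is assumed to hold N + 1 monotonically increasing byte offsets,
    so node i occupies node.dat[offsets[i]:offsets[i + 1]].  The returned
    bytes are still compressed; decoding is up to the caller.
    """
    idx = find_node_index(node_hash, hashes)
    if idx is None:
        return None
    node_dat.seek(offsets[idx])
    return node_dat.read(offsets[idx + 1] - offsets[idx])


# usage sketch:
#   hashes = an mmap or bytes object over node_hash.dat
#   offsets = node_offset.dat decoded into a list of N + 1 integers
#   with open("node.dat", "rb") as node_dat:
#       raw = read_node(some_hash, hashes, offsets, node_dat)
```

With the MPHF enabled, `find_node_index` would be replaced by an MPHF lookup plus one read from `MPHF_index.dat`, turning the binary search into a constant number of reads.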
### MPHF

> Optional, we don't need this if binary search on the node hashes is fast enough.

An MPHF (Minimal Perfect Hash Function)[^1] provides a way to query nodes faster than binary search: it maps the node hashes directly to unique integers, which are further mapped to node offsets. Assuming the hash function is fully loaded into ram, it takes at worst three reads to fetch a node value by hash.

- `MPHF.dat`: The MPHF built from the node hashes. The theoretical lower bound of the size is 1.44 bits per key; practical implementations can achieve around 3 bits per key, which is perfectly ok to load fully into ram.
- `MPHF_index.dat`: Converts the hash number to a node index.

## Memory Footprint

`MPHF.dat` can be fully loaded into ram on startup, and all the other files can be mmap-ed into the address space, so we don't need extra caches other than the OS page cache. The offset table might need to be fully loaded into ram if compression is applied.

## Size Estimation

Say there are `N` nodes and `V` versions.

- `node.dat`: the branch node size is dramatically reduced, a preliminary experiment shows 13 bytes on average; leaf nodes get a pretty good compression rate with dictionary compression (zstd), more experiments are needed for accurate numbers.
- `node_offset.dat`: `N * 2`, assuming compression with a roaring bitmap.
- `node_hash.dat`: `N * 32`, no compression at all.
- `MPHF.dat`: practical implementations[^1] can do 3 bits per key, which gives us `N * 3 / 8`.
- `MPHF_index.dat`: out-of-order small integers, assuming `N * 2` with compression.

So for `N=1,000,000,000`, it takes 30G to store the node hashes, 1.8G to store the node offsets, and 2.2G for the MPHF files, 34G in total besides the node values.

## Build

Building is done offline.

- Iterate the nodes and dump the hashes to `node_hash.dat`; the kv db makes sure they are ordered. Also output the leaf node indexes to `leaf_bitmap.dat`, encoded with a roaring bitmap.
- Use `node_hash.dat` and `leaf_bitmap.dat` to sample leaf nodes randomly and build the compression dictionary (110k), saved to `compression_dict.dat` (a sketch of this step is given at the end of this document). Memory usage: the size of the samples.
- Iterate the nodes again, compress the node values, and append them to `node.dat`, also appending the file offsets to `node_offset.dat`. Memory usage: the size of the offset bitmap, hundreds of megabytes.
  - For branch nodes, convert the children hashes to node indexes by binary-searching `node_hash.dat` (TODO: optimization?), and convert the key field to a leaf node index (TODO).
  - For leaf nodes, compress with `compression_dict.dat`.
- Build and save the `MPHF` for `node_hash.dat`. Memory usage: depends on the implementation.
- Iterate `node_hash.dat`, look up the `MPHF` hash value, and build the sorted list `[(MPHF hash value, node index)]` in memory, then iterate the list and append each `node index` to `MPHF_index.dat`. Memory usage: `16 * N` plus the overhead of the sorted tree, should be several hundred megabytes; if memory is not enough, this can be done with an on-disk kv db.
- Delete the iavl entries from the kv store and do a full compaction.

## Consequences

### Advantages

- Smaller db size.
- Faster queries.
- Lower memory footprint.
- Non-breaking and relatively easy migration.
- The new format can be read by multiple processes concurrently.

### Disadvantages

- Building could take some time.
- Needs to be built offline.
- Two versions of the iavl db format coexist, which requires extra routing logic.
- Pruning requires re-running the build process.

[^1]: https://github.com/thomasmueller/minperf/blob/master/src/test/java/org/minperf/simple/recsplit.md
[^2]: https://github.com/ledgerwatch/erigon-lib/tree/main/recsplit
[^3]: https://github.com/ledgerwatch/erigon-lib/tree/main/recsplit/eliasfano32
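As a companion to the dictionary-building step in the Build section, here is a minimal sketch of training and using a shared zstd dictionary for leaf node values. It uses the Python `zstandard` bindings purely for illustration (the real build tool would presumably be written in Go); the sample count and helper names are assumptions, while the ~110k dictionary size and the `compression_dict.dat` file name come from the proposal.

```
import random

import zstandard

DICT_SIZE = 110 * 1024   # ~110k dictionary, as targeted in the Build section
SAMPLE_COUNT = 100_000   # illustrative sample count, not from the proposal


def train_leaf_dictionary(leaf_values):
    """Train a shared zstd dictionary from randomly sampled leaf node values.

    `leaf_values` stands in for the leaf nodes selected via node_hash.dat
    and leaf_bitmap.dat during the offline build.
    """
    samples = random.sample(leaf_values, min(SAMPLE_COUNT, len(leaf_values)))
    return zstandard.train_dictionary(DICT_SIZE, samples)


def compress_leaf(value, dict_data, level=19):
    """Compress a single serialized leaf node with the shared dictionary."""
    cctx = zstandard.ZstdCompressor(level=level, dict_data=dict_data)
    return cctx.compress(value)


# usage sketch:
#   dict_data = train_leaf_dictionary(leaf_values)
#   with open("compression_dict.dat", "wb") as f:
#       f.write(dict_data.as_bytes())
#   blob = compress_leaf(leaf_value, dict_data)
```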