Here's a futuristic idea for a merkle tree.
We commit to an unordered binary merkle tree, where items are inserted into the tree based on a
deterministic ordering. Leaf node preimages provide a kind of run-length encoding for
proving nonexistence of keys, and deleted leaf nodes are overwritten to form a linked list of
free positions that future insertions can fill.
Leaf hashes are `H(key ++ value hash ++ distance)`, where distance is the next key (in lexicographical order) minus the key. Deleted leaves form a linked list.
Internal node preimages begin with `00`, leaves begin with `01`, and tombstones (deleted leaves) begin with `11`.
The root hash is `H(left child ++ right child ++ metadata)`, where metadata is:
To prove existence, you open the commitment for a key and value.
To prove nonexistence, you open the commitment for the closest previous key, key', and check that key' < key < key' + distance. Alternatively, open the metadata and prove that the key is less than the first key in the tree.
To prove a deletion, you first prove existence and then open the commitment for the previous key. Check that prev_key + prev_distance = key_to_delete. If so, extend the previous key's distance by the deleted key's distance, and overwrite the deleted node with the position of the last previously deleted item, or a placeholder value if none exists. This forms the linked list of places to fill in. Update the metadata.
To prove an insertion, you first prove nonexistence. Then, you prove the insertion point. If the metadata has a deleted item location, that location is used. Otherwise, the insertion splits the first leaf from the left at level `floor(log2(n))`, where n is the number of items in the tree.
If this is not the new first key, update the previous key with its new distance and set the distance for the new key to `previous distance - (new key - previous key)`. Hash up and update the metadata.
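As a sanity check on this distance bookkeeping, here is a minimal sketch (hypothetical `Leaf` type, `u64` keys instead of full-width keys, value hashes omitted) of the arithmetic for inserting between two existing keys:

```rust
#[derive(Debug, Clone)]
struct Leaf {
    key: u64,
    // distance to the next key in lexicographical order
    distance: u64,
}

/// Insert `new_key` strictly between `prev` and its successor, returning the
/// updated predecessor and the freshly created leaf.
fn insert_between(prev: &Leaf, new_key: u64) -> (Leaf, Leaf) {
    // nonexistence precondition: prev.key < new_key < prev.key + prev.distance
    assert!(prev.key < new_key && new_key < prev.key + prev.distance);

    let new_leaf = Leaf {
        key: new_key,
        // the new key inherits the remainder of the predecessor's range
        distance: prev.distance - (new_key - prev.key),
    };
    let updated_prev = Leaf {
        key: prev.key,
        // the predecessor's range now ends at the new key
        distance: new_key - prev.key,
    };
    (updated_prev, new_leaf)
}

fn main() {
    let prev = Leaf { key: 100, distance: 50 }; // successor at 150
    let (p, n) = insert_between(&prev, 120);
    assert_eq!(p.distance, 20); // 100 -> 120
    assert_eq!(n.distance, 30); // 120 -> 150
    println!("{:?} {:?}", p, n);
}
```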
upside:
downsides: needs a successor and predecessor function.
Rough idea, may be improved upon. One idea is to not use a linked list of tombstones but just move the rightmost leaf at `ceil(log2(n))` to the deleted point. But then both deletions and insertions need 2 merkle paths.
I propose we build a "flat" b-tree style index, with an associated data store area, and we only persist bottom level branches and data. In memory, we will add additional search data structures on top of this, with O(log(n)) space consumption.
The main benefit of this flat index is that we avoid IOPs for writing higher-up branch nodes.
We assume that:
The database will consist of:
We will perform copy on write modifications to data / index pages, but only copy the first time a page is updated after a primary swap.
The index storage area is completely unordered, and meant to be read in sequentially. In memory, we keep a 32 bit integer for each branch node indicating its location on disk.
The branch node format is based on the previous log entry, but with a couple modifications:
Splitting Nodes
Take a new page from the free list or allocate one. Get a new page ID.
Merging
Similar to a regular BTree node merge.
This requires a deletion. To avoid the case where loading the index reads a deleted node, we first write a tombstone with the same ID to a fresh page, and then clear the old node page after the next sync. A tombstone is a page containing a full node header with the ID, sequence number, empty prefix, and empty separator key length, followed by all zeros.
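For illustration, a minimal sketch of building such a tombstone page, assuming a hypothetical header layout (node ID, sequence number, prefix length, separator key length); the field widths here are illustrative, not a spec:

```rust
const PAGE_SIZE: usize = 4096;

/// Build a tombstone page: a full node header carrying the node ID and
/// sequence number, an empty prefix and empty separator key length,
/// followed by zeros up to the page boundary.
fn tombstone_page(node_id: u64, sequence_number: u64) -> [u8; PAGE_SIZE] {
    let mut page = [0u8; PAGE_SIZE];
    page[0..8].copy_from_slice(&node_id.to_le_bytes());
    page[8..16].copy_from_slice(&sequence_number.to_le_bytes());
    // prefix length = 0, separator key length = 0
    page[16] = 0;
    page[17] = 0;
    // the remainder of the page is already zeroed
    page
}

fn main() {
    let page = tombstone_page(42, 7);
    assert_eq!(&page[0..8], &42u64.to_le_bytes());
    assert!(page[18..].iter().all(|&b| b == 0));
}
```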
Loading the index
We read in the entire index to memory sequentially. If there are two copies of a node with the same ID, we take the one with the higher sequence number unless it is greater than the sequence number in the manifest. In that case, we would take the lower of the two.
If we read a node and its tombstone, we skip the node if the tombstone sequence number is less than or equal to the manifest's sequence number and take the node otherwise. If we read a tombstone without a corresponding node, we ignore it.
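A small sketch of this selection rule, with hypothetical types: prefer the higher sequence number unless that copy postdates the manifest (i.e. belongs to an unfinished sync), in which case take the lower; tombstones suppress a node only when the manifest covers them.

```rust
#[derive(Clone, Debug)]
struct NodeCopy {
    sequence_number: u64,
    // ... node payload elided
}

/// Pick which of two on-disk copies of the same node ID to keep.
fn pick_copy(a: NodeCopy, b: NodeCopy, manifest_seq: u64) -> NodeCopy {
    let (hi, lo) = if a.sequence_number >= b.sequence_number { (a, b) } else { (b, a) };
    if hi.sequence_number > manifest_seq {
        // The higher copy was written after the last completed sync; ignore it.
        lo
    } else {
        hi
    }
}

/// A tombstone only suppresses its node if the manifest covers it.
fn tombstone_applies(tombstone_seq: u64, manifest_seq: u64) -> bool {
    tombstone_seq <= manifest_seq
}

fn main() {
    let kept = pick_copy(
        NodeCopy { sequence_number: 9 },
        NodeCopy { sequence_number: 10 },
        9, // manifest sequence number
    );
    assert_eq!(kept.sequence_number, 9); // 10 is past the manifest, so take 9
    assert!(tombstone_applies(8, 9));
}
```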
This is a sequential read of about 15GB in the absolute worst case. That is, we expect an index over billions of items to give us a maximal branching factor of around 750-800, and a worst-case branching factor of 375-400 if all branches are half-full. With all branches half-full, this occupies about 10GB, plus some scratch. Most drives can read several gigabytes per second, so we're discussing at most a few seconds of read time on startup.
Coming back to BTrees.
The main problem with radix trees is that they are generally not robust against long shared prefixes. Even if we salt, it's reasonably likely to get at least a few clusters with prefixes much longer than expected.
A BTree with a bit-level prefix extraction would probably get us the fanout we need.
Something (logically) like a branch node layout of
n: u16 // item count
prefix_len: u8 // bits
separator_len: u8 // bits
prefix: var_bits
padding // to next whole byte
separators: [var_bits] // n * separator_len bits
padding // to next whole byte
node_pointers: [u32] // (n + 1) * 32 bits
You can get fanouts with this approach of around 750, assuming your prefix is something like 28 bits with a separator length of another 8-12, and assuming 4K pages (32768 bits). Memory usage for indexing very large numbers of keys might land between 5-10 GB.
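A rough back-of-the-envelope check of that fanout claim, assuming the illustrative layout above, a 28-bit prefix, 12-bit separators, 32-bit pointers, and 4K pages (byte padding ignored):

```rust
fn main() {
    let page_bits = 4096 * 8; // 32768
    let header_bits = 16 + 8 + 8; // n, prefix_len, separator_len
    let prefix_bits = 28;
    let separator_bits = 12;
    let pointer_bits = 32;

    // n separators + (n + 1) pointers must fit in what's left of the page:
    // n * separator_bits + (n + 1) * pointer_bits <= page_bits - header - prefix
    let budget = page_bits - header_bits - prefix_bits - pointer_bits;
    let n = budget / (separator_bits + pointer_bits);
    println!("approximate fanout: {n}"); // ~742 with these numbers, in line with the ~750 estimate
}
```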
Probably only the bottom level of BTree branches needs to be stored on disk. Though this does need to be reconciled with shadow paging - perhaps storing two copies and having each branch additionally carry a positional index and a 48 bit sequence number. Load all the branches on startup. If you load two branches with the same positional index, you take the one with the higher sequence number, unless the sequence number is greater than the primary sequence number in the manifest, in which case you take the lower.
The positional index could be determined based off of splits and merges. For example, you start with a single branch at position `[0]`. This can be split into `[00]` and `[01]`, with the `[00]` node occupying the slot that was previously held by `[0]`. Those in turn can be split into `[000]`, `[001]`, `[010]`, `[011]`. And so on. Something like that. The downside is that it only accommodates power-of-two numbers of branch nodes, whereas we expect to have fewer.
Looking further into this idea.
We are discussing two key pieces:
Note that if (2) is a B-Tree based index this is just a B+Tree.
Under our assumptions from the previous entry, 4TB of pages would mean 1 billion 4K pages, and we would need 4GB just to store the pointers. 8K pages gives us 8TB of pages, enough for the foreseeable future.
Basically, we have a certain fixed number of leaf pages we want to index and the question is just how we do that. We are essentially considering options like some kind of logically compressed radix tree, but we can't afford to keep lots of memory pointers.
Let's say you keep a structure that looks like this: `b` entries, each a `(u8, [NodePointer])` encoding. The `u8` is a packed representation of the depths of 4 sub-trees, each with 2 bits. I.e. an encoding of `0x00` would be followed by an array of 4 node pointers, but an encoding of `0b01001011` would be followed by 2 + 1 + 4 + 8 = 15 node pointers, with the first 2 going to the first sub-tree, the next one going to the second sub-tree, and so on. The idea is that you start with a base radix of `b` and then allow for different depths to emerge organically.
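To make the packed encoding concrete, here is a minimal sketch (hypothetical function name, first sub-tree in the high bits) of decoding the `u8` into per-sub-tree pointer counts, assuming each 2-bit field gives the depth (0-3) of one sub-tree and hence `2^depth` pointers:

```rust
/// Decode the packed `u8` into the number of node pointers owned by each of
/// the four sub-trees. A 2-bit field with value `d` means that sub-tree has
/// depth `d`, i.e. `2^d` node pointers.
fn subtree_pointer_counts(encoding: u8) -> [usize; 4] {
    let mut counts = [0usize; 4];
    for i in 0..4 {
        let depth = (encoding >> (6 - 2 * i)) & 0b11;
        counts[i] = 1 << depth;
    }
    counts
}

fn main() {
    // 0b01001011: depths 1, 0, 2, 3 -> 2 + 1 + 4 + 8 = 15 pointers total
    let counts = subtree_pointer_counts(0b0100_1011);
    assert_eq!(counts, [2, 1, 4, 8]);
    assert_eq!(counts.iter().sum::<usize>(), 15);

    // 0x00: four sub-trees of depth 0 -> 4 pointers total
    assert_eq!(subtree_pointer_counts(0x00).iter().sum::<usize>(), 4);
}
```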
`b` might be
Also, we should have a good story for blob storage: this proposal expects that values are short. Values larger than a certain size should be spilled out to overflow pages, ideally contiguous ones. This is necessary to support things like smart contract code in particular.
IOPS
I am looking into approaches for indexing all the data. Here is one approach, possibly of interest: just keep (effectively) a giant sorted `Vec<(KeySeparator, PagePointer)>` in memory and perform a binary search over it to find the likely pages to search. Record changes made to this index to a sequential file, and occasionally flush the whole thing to disk. This is the dumbest index possible, but it may make sense for us based on our assumptions.
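A minimal sketch of what that flat index and its lookup might look like; `KeySeparator` and `PagePointer` are hypothetical aliases, and the real separator width would be chosen per the discussion below:

```rust
// Hypothetical aliases for illustration.
type KeySeparator = [u8; 4];
type PagePointer = u32;

/// Find the page likely to contain `key`: binary search for the last
/// separator <= the key's 4-byte prefix.
fn candidate_page(index: &[(KeySeparator, PagePointer)], key: &[u8]) -> Option<PagePointer> {
    let mut prefix = [0u8; 4];
    prefix.copy_from_slice(&key[..4]);
    // partition_point gives the index of the first entry with separator > prefix.
    let idx = index.partition_point(|(sep, _)| *sep <= prefix);
    idx.checked_sub(1).map(|i| index[i].1)
}

fn main() {
    // The index must be kept sorted by separator.
    let index = vec![
        ([0x00, 0x00, 0x00, 0x00], 0u32),
        ([0x40, 0x00, 0x00, 0x00], 1),
        ([0x80, 0x00, 0x00, 0x00], 2),
        ([0xc0, 0x00, 0x00, 0x00], 3),
    ];
    assert_eq!(candidate_page(&index, &[0x45, 0xaa, 0x00, 0x00, 0x01]), Some(1));
}
```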
Let's start from the constraints we have:
Let's work backwards from the amount of data we expect our database to use. That is, something like 3TB.
If we divide this up into 4K pages, that means we have 3 * 2^40 / 2^12 = 3 * 2^28 = ~800 million pages to index. With 8K pages, we have half that: ~400 million.
Storing 4-byte page pointers for each of these alone would take 3 or 1.5 GB of memory, respectively.
We also want some index over these which we can search in memory.
If we additionally store a 4 byte separator with each page pointer, we would double our memory usage to 6 or 3 GB, respectively. Even if we store 2^36 items, we would have on average only 2^(36 - 32) = 2^4 = 16 items sharing the same 32 bit prefix. This likely all fits into one page.
The main observation here is that if there are multiple pages all containing keys which are greater than K but less than K+1, we can just fetch them all.
Alternatively: this is something like a radix tree with 256 children. The key difference is that radix trees only support a single pointer per child; we want a multi-value-per-leaf construction rather than a single-value-per-leaf one.
Note that (2^8)^4 * 4 = 2^32 * 4 = 16GB; that is, a radix tree differentiating the first 4 bytes of keys takes 16GB of memory.
What we could do is have value pages form a linked list and have the radix tree point to the first page containing a key with a given prefix. Then, loading the whole thing into memory is as simple as iterating the radix tree, loading that first page, following the links, and loading those page IDs into memory.
Furthermore, we don't really need to worry about the radix tree being packed efficiently into pages because we don't expect to fetch it on the fly. But I do think we could come up with some way of storing it in a relatively compact way and just read the whole thing into memory.
Related to the previous post. I think if we modify Substrate to note when it's writing through a shared prefix we can do this even better.
I've maybe cracked one of the key issues with variable-length keys.
With fixed-length keys, we keep the 32-byte key path and leaf hash underneath each leaf. But with variable length keys, we can't keep the full key there as it will likely not fit. The property that keeping the full key gives us is that we can split leaves when inserting into the merkle trie.
If we add an additional requirement that variable length keys are always common prefixes followed by a hash, we can just keep the 32 bytes that follow the common prefix underneath the leaf as a discriminant instead of the full key. This tracks with Substrate storage APIs.
There is one issue: when there is only one key with the common prefix and the key isn't yet located at the common prefix. We can flag such cases with a marker value of e.g. all 1 bits. In that case, we'd have to look up the key with a btree seek. But if the common prefix is really that small, the extra round-trip won't impact performance meaningfully.
There are other solutions to explore in this design space, like making the key path equal to the hash of the key plus the first 16 bytes following the key.
I am looking into B-Tree designs for Key-Value Storage.
One insight is that, while leaf pages should always be 4K, the branch pages (so long as they are reasonably expected to be in memory) can be much larger and even of variable size.
The benefit here is that we can update branch pages on disk only occasionally, and always with sequential writes.
Consider an approach where we have 4K leaves and branches which grow up to 2MB. Branches are stored all together. With each GB of RAM we can store at least 512 such branches. We expect an extremely shallow tree, of depth 2 or 3 at most. We use a WAL to keep track of all key-value changes and compact summaries of branch node changes. We have a leaf-page store that stores all leaves plus a single meta page. We have a kind of copy-on-write semantic for leaves, where leaves which have been updated since the last time the branches were fully written to disk are copied. Leaves are referred to by their 32-bit index in this store, and a free-list head pointer is kept in the meta page.
The way an update works is that we just write to the WAL and then apply the random leaf writes to the leaf store. Once the WAL is big enough, we write all the branch nodes out to disk sequentially, fsync, increment the sequence number, and fsync again. After this is done, we can clean up old WALs and delete any old copy of the branch nodes. This approach lets us take advantage of fast sequential write speeds for WALs and all branch node updates, while ensuring that reads are still fast.
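The commit sequence described above might look roughly like this; `Wal`, `LeafStore`, and `BranchStore` are hypothetical handles, and the flush threshold is an arbitrary placeholder:

```rust
use std::io;

// Hypothetical store handles for illustration only.
trait Wal {
    fn append(&mut self, changes: &[u8]) -> io::Result<()>;
    fn size(&self) -> u64;
}
trait LeafStore {
    fn write_leaf(&mut self, index: u32, page: &[u8; 4096]) -> io::Result<()>;
}
trait BranchStore {
    fn write_all_branches(&mut self) -> io::Result<()>;
    fn fsync(&self) -> io::Result<()>;
    fn bump_sequence_number(&mut self) -> io::Result<()>;
}

const WAL_FLUSH_THRESHOLD: u64 = 256 * 1024 * 1024; // placeholder

fn commit(
    wal: &mut dyn Wal,
    leaves: &mut dyn LeafStore,
    branches: &mut dyn BranchStore,
    changes: &[u8],
    dirty_leaves: &[(u32, [u8; 4096])],
) -> io::Result<()> {
    // 1. Record the key-value changes (and compact branch summaries) in the WAL.
    wal.append(changes)?;

    // 2. Apply the random leaf writes to the leaf store (copy-on-write copies
    //    are made for leaves not yet copied since the last branch dump).
    for (index, page) in dirty_leaves {
        leaves.write_leaf(*index, page)?;
    }

    // 3. Once the WAL is big enough, dump all branch nodes sequentially,
    //    fsync, bump the sequence number, and fsync again. Old WALs and old
    //    branch copies can then be reclaimed.
    if wal.size() >= WAL_FLUSH_THRESHOLD {
        branches.write_all_branches()?;
        branches.fsync()?;
        branches.bump_sequence_number()?;
        branches.fsync()?;
    }
    Ok(())
}
```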
update: pep mentioned we could just keep all the changes in memory and should look into some way of amortizing the disk write costs to avoid huge batch writes. we should be able to do this with some kind of mapping table like the BW Tree has.
We discussed putting the values into the node-pages and whether this is worth it.
We have decided on a hash-table for pages and a BTree (mdbx) for key-value storage.
To discuss: keeping hash-table in sync with mdbx
To discuss: write speed measurements for mdbx
To discuss: synchronous state root is a current requirement but may be a bad idea.
We compared two approaches. The first is hash-db for node pages + mdbx for KV. The second is a B-Tree for all the items. We compared on a few metrics:
| | Approach 1 | Approach 2 |
|---|---|---|
| implementation complexity | ✅ | |
| read latency | ✅ | |
| read bandwidth | ✅ | |
| write bandwidth | ✅ | |
| cache amplification | ✅ | |
| user complexity | ✅ | |
| keyspace iteration | - | - |
| compact page format support | ? | ? |
| bigger pages | - | - |
| index update CPU parallelism | ✅ | |
| uncertainty | ✅ | |
We discussed cache amplification with SpaceJam.
When considering read performance vs a btree, as in the last few entries, the question is "how much memory do I need to amortize my disk accesses to 1" - this is cache amplification.
For a btree, this depends on the branch-to-value node ratio. That is, a btree needs enough cache to hold all of its branches.
For our hash-map, it depends on the number of pages we keep.
The graph linked below shows that when the branch-to-value ratio is less than the number of items we can store in each block of the hash-map, the hash-map is more cache-efficient.
https://www.desmos.com/calculator/a12l9y66rc
We can also play around with using less memory to cache blocks which are nowhere near full. E.g. the expected amount of memory per block is based on the load factor. A map which is lightly allocated should take much less memory to store, by storing only the first
Of course, we would need to balance this with memory efficiency.
Revisiting yesterday's value hash-table discussion.
It seemed that triangular probing was fairly weak without in-memory metadata enabling skips, as probe distances could become large and lead to enormous numbers of SSD page queries in the long tail.
We can consider blocked cuckoo hashing as a possible algorithm. This is discussed in https://web.stanford.edu/class/cs166/lectures/11/Slides11.pdf (slide 242). To briefly describe here:
This provides useful properties:
Probing a key requires at most 2 (parallelizable) SSD page fetches.
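A rough sketch of the lookup path under blocked cuckoo hashing, with hypothetical sizes and `DefaultHasher` standing in for the two real hash functions:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const BUCKETS_PER_BLOCK: usize = 107; // one 4K SSD page per block

fn block_index(key: &[u8], seed: u64, num_blocks: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    key.hash(&mut hasher);
    hasher.finish() % num_blocks
}

/// Each key may live in one of two blocks, so a lookup needs at most two
/// (parallelizable) SSD page fetches.
fn candidate_blocks(key: &[u8], num_blocks: u64) -> (u64, u64) {
    let b1 = block_index(key, 0x01, num_blocks); // "h1"
    let b2 = block_index(key, 0x02, num_blocks); // "h2"
    (b1, b2)
}

fn main() {
    let (b1, b2) = candidate_blocks(b"some-key", 1 << 20);
    println!("probe blocks {b1} and {b2}, each holding {BUCKETS_PER_BLOCK} buckets");
}
```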
As always with cuckoo hashing, insertions can lead to long chains. However, note that the probability of a full block at a given load factor can be computed with the Poisson distribution by finding P(M >= N), where M is Poisson-distributed with mean equal to the expected number of items per block (e.g. 0.8 * N at load factor 0.8) and N is the number of buckets per block. The probability that a block has at least one free slot is P(M < N). See https://stattrek.com/online-calculator/poisson
Here is a table for N=107 (as below)
| Load Factor | P(block full) | P(block full)^2 |
|---|---|---|
| <=0.7 | ~0 | ~0 |
| 0.75 | 0.0025 | ~0 |
| 0.8 | 0.0141 | 0.0002 |
| 0.9 | 0.149 | 0.022 |
P(block full)^2 is approximately our probability of needing to evict an item at all, i.e. the two chosen blocks for a key are both full. We will successfully move the evicted item to the next block with probability 1 - P(block full), so the probability of needing to do an eviction and that eviction requiring more than a single extra round-trip is P(block full)^3.
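A small sketch that should roughly reproduce the table above: model the number of items mapped to a block as Poisson with mean `load factor * N` and compute the tail probability for `N = 107`:

```rust
/// P(X >= n) for X ~ Poisson(lambda), computed from the PMF directly.
fn poisson_tail(lambda: f64, n: u32) -> f64 {
    let mut pmf = (-lambda).exp(); // P(X = 0)
    let mut cdf = pmf;
    for k in 1..n {
        pmf *= lambda / k as f64;
        cdf += pmf;
    }
    1.0 - cdf
}

fn main() {
    let n = 107u32; // buckets per block (one 4K page)
    for load in [0.7, 0.75, 0.8, 0.9] {
        let p_full = poisson_tail(load * n as f64, n);
        println!(
            "load {:.2}: P(block full) ~ {:.4}, P(both blocks full) ~ {:.4}",
            load,
            p_full,
            p_full * p_full
        );
    }
}
```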
This approach is also not vulnerable to attacks where user input drives worst-case write performance. Because the database chooses the item to evict randomly, write performance is independent of user input.
Today we discussed the possibility of using a hash-table for the flat key/value storage, as well as for the pages. We believe this should be faster than LSM and BTree architectures, and this is not just hubris to think we can optimize beyond mature software stacks. We don't need MVCC, rollbacks, multiple writers, or other similar features which DBs often provide, so we can tailor to our use-case.
The hash-table for node-pages remains the same.
The key-value hash-table would involve 3 parts:
A bucket may be, for example, something like 38 bytes:
At 38 bytes, we could store 107 full buckets per 4096 byte page, with 30 bytes left over for metadata within the page.
Unlike the node-page hash-table, we may not be able to keep meta-bits in memory to avoid probing. The reason is that we expect to have many more values than node-pages. Perhaps keeping one bit of meta-information in memory per page is tractable, but with billions of KV pairs we would still require over 1Gi resident memory size.
However, we have seen from our experiments in triangular probing that even without in-memory metadata, at a load factor of 0.9, the mean number of probes required is ~10 for absent keys and ~3 for present keys. With 107 buckets per page, it is likely that a single SSD page read will contain the relevant bucket, as T(10) = 45 from the initial bucket. This gives an expectation that we would need to query only a single page from the SSD to find the location of the value. As long as a bucket requires 20 probes or fewer (T(20) = 210), we would never need to load more than 2 pages from the SSD per read, though in that case we would also never load fewer.
After that, we would need to query only a single page from the drive to read the value information.
It is worth noting that the 99th percentile absent key probes reaches T(20) at a load factor of 0.7, and this may open up some possibility of sequencer DoS with timing attacks determining the slowest keys. We could recommend that sequencers resize their database fairly eagerly to mitigate this.
Because absent key probes are our main issue, we may look into bloom filters or alternative probing algorithms which tend to preserve bucket locality better than triangular probing.
The details of the "heap" allocator are currently uncertain.
I dug in to compare to mdbx with 4096-byte pages, for example.
The page overhead is 160 bytes, leaving 3936 bytes for nodes. Each node has 8 bytes of header information, plus the key material (in a branch). For 32-byte keys, a node is 40 bytes as a result. This gives an expected dense branching factor for our workload in mdbx of 98. The first 3 layers encompass ~1M pages and ~3.5Gi of space and are plausibly all resident in the OS page cache. log_98(1,000,000,000) is ~4.5, meaning that in MDBX we'd also expect to do ~1.5 additional SSD page loads on average per key with 1 billion keys. log_98(10B) is 5.022. Furthermore, this is also the worst-case performance, unlike our hash-table.
In Berlin. Created a separate log to discuss design decisions: https://hackmd.io/wL-w03mWSMCpu9_C1JKa0A
Looking into crash recovery.
Modern filesystems provide some crash recovery guarantees, but can experience failures. Specifically, when we refer to crashes, we have two kinds in mind:
In a process crash, anything which has been `flush()`ed will eventually be written to disk, as it has been passed to kernel space already. This can lead to some loss of data, but we won't deal with issues like partial writes, atomicity, etc.
In a kernel crash (panic, loss of power, etc.), we can encounter further inconsistencies. This paper (https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf) contains some information on the atomicity and ordering of operations on common filesystems. These were determined by analysis of the block-level execution traces emitted by the filesystem.
Almost all modern filesystems provide same-file append/append ordering and multi-block prefix append atomicity. This includes ext4 in journal, nodelalloc, and ordered (default) modes, xfs, and reiserfs. It does not include ext4 in writeback mode.
Suggestion: use these properties to provide crash tolerance on these filesystems, and disclaim support for ext4 data=writeback.
Same-file append/append ordering means that two writes appending to the same file will never be reordered after a crash. Multi-block prefix append atomicity means that a large append call will always result in some continuous prefix of the data being written.
And notably, single-block writes have not been found to be atomic after a crash.
I am not 100% sure that this source can be trusted.
Basically: sequential WALs will be our bread-and-butter for crash tolerance, even when committing to our hash-table.
Here is a possible strategy for taking advantages of these properties to preserve crash-tolerance in our DB:
This approach is crash-resilient on modern filesystems if the above source is to be trusted. When recovering from a crash, there are three possibilities:
The multi-block prefix append atomicity property lets us distinguish whether Step (2) completed.
It is worth noting that this strategy does incur some additional write amplification, as well as read amplification (though only in the event of a crash). The write amplification is especially significant when moving data from one bucket to another: if writing to a bucket requires X writes in expectation, then we perform 2X writes - the first X sequentially (step (2)) and the second X randomly (step (3)).
At high load factors, this write amplification can be quite significant. The plots in https://github.com/rphmeier/hash-psl-simulation will give an idea of how much data must be copied per commit at different load factors. As an example, when inserting a new value at a load factor of 0.8 into a robin-hood table, we expect 9 additional writes on top of the write for the inserted value.
Expanding on the bitfield idea for Robin Hood Hashing.
We can keep a bitfield of N bits per bucket, where it has the following semantics:
- 0 means that the bucket is empty
- 1 through 2^N - 2 means that the bucket is full with exactly that PSL
- 2^N - 1 means that the bucket is full, and the PSL is 2^N - 1 or greater.

E.g. with 3 bits, a value of 1 means the bucket is full with a PSL of 1 (the key stored in the bucket hashed there originally), a value of 4 means the bucket is full with a PSL of 4, and a value of 7 means the bucket is full with a PSL of 7 or greater.
This can help somewhat with lookups: e.g. if we hash to a bucket B and the bucket is full with a non-1 PSL, we know it's not our value. Or if it is full, but the next bucket B+1 has a PSL != 2, it's also not our value. This reduces the amount of disk probing we might have to do within clusters.
It also helps with insertions and deletions, in knowing which buckets we might have to load. I want to simulate PSLs for the occupancy of buckets at different levels of load so we can get a reasonable estimate on bits needed.
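A sketch of the lookup-filtering idea using the in-memory PSL field, assuming the encoding above (0 = empty, exact PSLs up to the saturated value) and linear probing as in the B / B+1 example:

```rust
const PSL_BITS: u32 = 3;
const PSL_SATURATED: u8 = (1 << PSL_BITS) - 1; // 7: "PSL is 7 or greater"

/// Using only the in-memory bitfield, decide whether the bucket at `home + i`
/// could possibly hold a key whose home bucket is `home`. If not, we can skip
/// loading that bucket's SSD page.
fn could_hold_key(psl_field: &[u8], home: usize, i: usize) -> bool {
    let entry = psl_field[home + i];
    if entry == 0 {
        return false; // empty bucket
    }
    let expected_psl = (i + 1) as u8; // a key stored i slots past home has PSL i + 1
    if expected_psl >= PSL_SATURATED {
        // Beyond the saturation point we can't distinguish exact PSLs.
        entry == PSL_SATURATED
    } else {
        entry == expected_psl
    }
}

fn main() {
    // One u8 per bucket here for clarity; on disk this would be packed to
    // PSL_BITS bits per bucket.
    let psl = [1u8, 2, 1, 0, 3, 7, 0, 0];
    // For a key whose home bucket is 0: buckets 0 (PSL 1) and 1 (PSL 2) are
    // plausible, but bucket 2 has PSL 1 != 3, so it cannot hold our key.
    assert!(could_hold_key(&psl, 0, 0));
    assert!(could_hold_key(&psl, 0, 1));
    assert!(!could_hold_key(&psl, 0, 2));
}
```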
The memory usage also seems acceptable. If we have e.g. 3 bits per bucket and 2^31 (2Gi) buckets, we would keep a bitfield of 3 * 2^28 bytes, or 768MiB. Storing 1Gi pages at 50% occupancy, we could have billions of values and the pages themselves would occupy 4TiB of storage space. Any node willing to dedicate terabytes of storage space to pages seems likely to have a couple GiB of RAM to spare.
This bitfield does need to be kept on disk. As we know, writing a single bit to the SSD requires writing the whole page. Therefore, we should tend to batch updates to the hash-table, for example, when committing a write-ahead-log. Some analysis of the write amplification effects of storing this bitfield:
- 1 / (4096 / N) write amplification per page (N bits per bucket in the bitfield).

With Robin Hood Hashing's expected PSL of 3 at 80% occupancy, we can likely filter out many of our disk queries with such a bitmap, improving access times, while trading off some memory and some write amplification.
I am looking into different hash table families for our implementation.
We have some requirements:
One useful property of our workload that we can take into account, is that whenever we write a page, we know with certainty whether it is fresh or whether it is being updated. Another is that whenever we delete a page, we must have read it already and are certain of its location in the hash-table and the probes required.
I considered two potential algorithms.
One is Cuckoo Hashing.
- n_items * fingerprint_size. Our fingerprint size may need to be 64 or more bits.

Second is Robin Hood Hashing.
On regular linear/quadratic hashing: these are known to perform worse in practice than Robin Hood Hashing.
I did not look deeply into Extendible Hashing but I do plan to.
I have been looking over this SILT paper which contains some very useful insights.
1. Use append-only storage to optimize for Sequential Writes
I think this would be something very useful to integrate into our hash-table design. The idea would be to still keep the hash-table, but store it along with a series of append-only files which make commits fast. Whenever we initially write a page to disk, we write it to append-only storage. We would also keep information about pages which have recently been written to the append-only store in RAM so we can look them up from their respective position in that store. For a lookup, it's still a random read, but we commit data to disk faster.
We would periodically need to clear out old append-only files and update the hash-table. But crucially, we only need to write the most stale pages to the hash-table. Ones which have been written more recently to the append-only storage could just stay resident there.
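A sketch of that write path with hypothetical types: fresh pages are appended to a log, and an in-RAM map remembers their offsets so reads can find them before they are migrated into the hash-table:

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Seek, SeekFrom, Write};

const PAGE_SIZE: usize = 4096;
type PageId = u64;

struct AppendStore {
    log: File,
    // In-RAM index of pages resident in the append-only log: page ID -> byte offset.
    recent: HashMap<PageId, u64>,
}

impl AppendStore {
    /// Write a page to the append-only log and remember its offset in RAM.
    fn write_page(&mut self, id: PageId, page: &[u8; PAGE_SIZE]) -> io::Result<()> {
        let offset = self.log.seek(SeekFrom::End(0))?;
        self.log.write_all(page)?;
        self.recent.insert(id, offset);
        Ok(())
    }

    /// Pages still resident in the log are found via RAM; anything else falls
    /// back to a probe of the on-disk hash-table (not shown).
    fn log_offset(&self, id: PageId) -> Option<u64> {
        self.recent.get(&id).copied()
    }
}
```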
2. Keep indexes in RAM to reduce disk lookups
We want to have a hierarchy of page-stores based on how hot the data is. Stores at the top of the hierarchy use more RAM but are faster to query. The hierarchy can look like this:
We technically don't need to keep any bits in RAM per page for the hash-table and can still get O(1) but the SILT paper shows that we can keep a few bits in RAM per page and reduce the likelihood of collisions substantially.
Revisit architecture now that we have a working prototype.
1. Further specializing update logic
We currently use a cursor for performing updates. We can improve update logic quite a lot by specializing the routine for updates.
The first issue we have now is in allocating new pages. When we encounter a fresh page, we currently query the cache, which falls back to the disk store, which finds no page, and only after performing I/O do we allocate. This is all happening in a critical section. In fact, during update we already know exactly which pages must be allocated fresh: any pages beyond one of the visited terminals. Update currently spends around 10-14% of its time allocating pages in a randw workload and 16-20% of its time performing I/O, for 26-34% in total.
Recommendation: build specialized update logic that can allocate fresh pages without querying the store
A second issue is that we query pages during update logic from the broader shared page cache. These pages have already been fetched as part of the warm-up procedure and could be provided directly to the update logic. This would also eliminate the need for any tracking of dirtied pages. Update currently spends around 6-10% of its time querying the page cache and tracking dirtied pages.
Recommendation: have warm-up workers efficiently prepare the set of pages for use in update
Lastly, the update operation is embarrassingly parallel and even potentially pipelined.
All this together, it would seem we can reduce the overhead of the update operation by around 30-40% on a single thread, and then massively parallelize and pipeline the remaining 70% of work. Parallelization should work perfectly at every level up to the root. After that, updates will primarily be bottlenecked on hashing.
2. Background Flushing and Cache Redesign
Writing pages to disk is a heavy I/O operation and is a major bottleneck in NOMT. In a 25k random write operation, over 30,000 pages are updated, for 117MB of total I/O. Preparing the database transaction and performing the write together consume around 60% of the time it takes to commit_and_prove.
Even under the unrealistic assumption that writing would take a single round-trip, with 1M IOPS it would take over 29ms to finish flushing the writes of all these pages. Flushing in the background, and even flushing only occasionally, would remove this I/O from the critical path. With our RocksDB backend, the total commit_and_prove time for the 25k randw workload is 826ms. It is not easy to distinguish how much of this time is spent waiting on page fetches to conclude vs. the commit itself.
Flushing should be removed from the critical path. The user should be able to receive the state root and witness very quickly and have the option to wait for a WAL to be populated with the key-value pairs of the commit.
Instead, I recommend that we establish the following architecture:
We maintain one active page overlay, which is eventually frozen after it reaches a certain size or enough time has passed. Once it is frozen, a background task flushes the contents to disk and reclaims the memory.
This is inspired by the design of RocksDB, which keeps one active "memtable" along with several frozen ones which are being flushed to disk.
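A rough shape of the overlay mechanism, with hypothetical types and thresholds; error handling and the actual flush are elided:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

type PageId = u64;
type Page = Vec<u8>;

const FREEZE_THRESHOLD: usize = 10_000; // placeholder: pages buffered before freezing

struct Overlays {
    active: HashMap<PageId, Page>,
    // Frozen overlays are immutable and being flushed in the background;
    // reads consult active first, then frozen (newest first), then disk.
    frozen: Vec<Arc<HashMap<PageId, Page>>>,
}

impl Overlays {
    fn write(&mut self, id: PageId, page: Page) {
        self.active.insert(id, page);
        if self.active.len() >= FREEZE_THRESHOLD {
            let frozen = Arc::new(std::mem::take(&mut self.active));
            self.frozen.push(frozen.clone());
            // A background task flushes the frozen overlay; once flushed it is
            // dropped from `frozen`, reclaiming the memory (removal elided).
            thread::spawn(move || flush_to_disk(&frozen));
        }
    }
}

fn flush_to_disk(_overlay: &HashMap<PageId, Page>) {
    // write pages to the store and fsync (elided)
}
```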
In the RocksDB backend, I recommend that we maintain one page overlay and simply commit to RocksDB in the background. We allow the user to wait upon this operation, but also to receive the witness and new state root quickly. It is also not clear whether a dedicated page cache is beneficial in the RocksDB backend, given that RocksDB maintains its own caches. We will need to measure to say for sure.
3. Lock-Free Data Structures