NOMT Log

07-03

Here's a futuristic idea for a merkle tree.

We commit to an unordered binary merkle tree, where items are inserted into the tree based on a
deterministic ordering. Leaf node preimages provide a kind of run-length encoding for
proving nonexistence of keys, and deleted leaf nodes are overwritten with tombstones that form a
linked list of free positions to be filled by future insertions.

Leaf hashes are H(key ++ value hash ++ distance), where distance is the next key
(in lexicographic order) minus the key. Deleted leaves form a linked list.

Internal nodes begin with 00, leaves begin with 01, and tombstones (deleted leaves) begin with 11.
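
As a concrete sketch of the leaf encoding (assuming the blake3 crate for H, 32-byte keys, and that the 2-bit node-type tag simply overwrites the top bits of the hash output - the tagging mechanism here is an assumption, not settled design):

```rust
// Sketch only: assumes the blake3 crate for H, 32-byte keys, and that the
// 2-bit node-type tag overwrites the two most significant bits of the output.
fn leaf_hash(key: &[u8; 32], value_hash: &[u8; 32], distance: &[u8; 32]) -> [u8; 32] {
    let mut preimage = Vec::with_capacity(96);
    preimage.extend_from_slice(key);
    preimage.extend_from_slice(value_hash);
    preimage.extend_from_slice(distance); // distance = next key - key
    let mut h = *blake3::hash(&preimage).as_bytes();
    h[0] = (h[0] & 0b0011_1111) | 0b0100_0000; // '01' tag marks a leaf
    h
}
```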

The root hash is H(left child ++ right child ++ metadata) where metadata is:

  1. the position of the last deleted item, or a placeholder value if no tombstones exist.
  2. the first key in the tree
  3. the number of keys in the tree

To prove existence, you open the commitment for a key and value.
To prove nonexistence, you open the commitment for the closest previous key, key', and
check that key' < key < key' + distance. Or, if the key sorts before everything in the tree, open the
metadata and prove that the key is less than the first key.
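
The nonexistence check is then just an interval test on the opened leaf. A sketch, using the primitive-types crate's U256 as a stand-in for 256-bit key arithmetic (an assumption, not the real codebase):

```rust
// Sketch: interval test for nonexistence, with U256 standing in for keys.
use primitive_types::U256;

fn proves_nonexistence(prev_key: U256, prev_distance: U256, absent_key: U256) -> bool {
    // prev_key is the closest key below absent_key, and prev_key + prev_distance
    // is the next key in the tree, so the open interval between them holds no keys.
    prev_key < absent_key && absent_key < prev_key + prev_distance
}

// Edge case: if the key sorts before everything in the tree, open the metadata
// and show it is less than the first key.
fn proves_nonexistence_before_first(first_key: U256, absent_key: U256) -> bool {
    absent_key < first_key
}
```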

To prove a deletion, you first prove existence and then open the commitment for the previous key.
Check that prev_key + prev_distance = key_to_delete. If so, overwrite the deleted node with the
metadata's last deleted position, or a placeholder value if no tombstone exists yet. This extends the
linked list of positions to fill in. Update the metadata.

To prove an insertion, you first prove nonexistence. Then, you prove the insertion point. If the
metadata records a deleted item location, that location is used. Otherwise, the insertion splits the
first leaf from the left at level floor(log2(n)), where n is the number of items in the tree.
If the new key is not the new first key, update the previous key with its new distance (new key - previous key)
and set the distance for the new key to previous distance - (new key - previous key). Hash up and update the metadata.
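
The distance bookkeeping on insert, sketched with the same U256 stand-in as above:

```rust
// Sketch: splitting the gap after `prev_key` when inserting `new_key`.
use primitive_types::U256;

struct LeafUpdate {
    key: U256,
    distance: U256,
}

fn split_gap(prev_key: U256, prev_distance: U256, new_key: U256) -> (LeafUpdate, LeafUpdate) {
    // The old gap (prev_key, prev_key + prev_distance) is cut in two by new_key.
    let updated_prev = LeafUpdate { key: prev_key, distance: new_key - prev_key };
    let new_leaf = LeafUpdate {
        key: new_key,
        distance: prev_distance - (new_key - prev_key),
    };
    (updated_prev, new_leaf)
}
```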

upside:

  • extremely tight bound on tree depth makes it possible to prove existence or nonexistence for any value
    in one disk access. With all hashes at e.g. 2^27 in memory (4GB assuming 2^5 bytes per hash), you can reach everything
    in one disk access for up to 2^33 items if each page gives 6 layers.
  • space amplification can be tightly limited due to predictable tree structure
  • worst case behavior is tightly bounded, broadly speaking.
  • dense tree node pages which can be allocated dynamically.
  • may be adaptable to variable length keys, or at least longer keys.

downsides:

  1. main database needs to store locations for keys (<64 bits) and support successor and predecessor functions.
  2. insertions require a nonexistence proof as well as the path to the position they are inserted.
    this is heavier when there are deleted items to fill in.
  3. tree operations are non-commutative.
  4. leaf hashes may not be 64 bytes (i.e. not hash-compression-friendly).

rough idea, may be improved upon. one idea is to not use a linked list of tombstones but just move
the rightmost leaf at ceil(log2(n)) to the deleted point. but then both deletions and insertions need 2 merkle paths.

07-01

I propose we build a "flat" b-tree style index, with an associated data store area, and we only persist bottom level branches and data. In memory, we will add additional search data structures on top of this, with O(log(n)) space consumption.

The main benefit of this flat index is that we avoid IOPs for writing higher-up branch nodes.

We assume that:

  1. Keys are fixed length, have high entropy, and are small (32/48 bytes).
  2. Data are small with some user-defined threshold for spilling over to overflow areas. Contract storage, for example, is always 32 bytes, but an Ethereum-style account structure is around 110 bytes and the code can be up to 24KB.
  3. Data can be stored along with preimages.

The database will consist of:

  1. Data storage area (measured in TB)
  2. Index storage area (measured in GB)
  3. Manifest containing sequence number, free lists for the storage areas, and next branch ID number. Two copies of each for the purposes of shadow paging.
  4. Overflow data storage area (for large data)

We will perform copy on write modifications to data / index pages, but only copy the first time a page is updated after a primary swap.

The index storage area is completely unordered, and meant to be read in sequentially. In memory, we keep a 32 bit integer for each branch node indicating its location on disk.

The branch node format is based on the previous log entry, but with a couple modifications:

  1. We store a unique 64 bit ID in the header, incrementing sequentially.
  2. We store a 64 bit sequence number in the header, based on the number of commits to the database.

Splitting Nodes

Take a new page from the free list or allocate one. Get a new page ID.

Merging

Similar to a regular BTree node merge.

This requires a deletion. To avoid the case where loading the index reads a deleted node, we first write a tombstone with the same ID to a fresh page, and then clear the old node page after the next sync. A tombstone is a page containing a full node header with the ID, sequence number, empty prefix, and empty separator key length, followed by all zeros.

Loading the index

We read in the entire index to memory sequentially. If there are two copies of a node with the same ID, we take the one with the higher sequence number unless it is greater than the sequence number in the manifest. In that case, we would take the lower of the two.

If we read a node and its tombstone, we skip the node if the tombstone sequence number is less than or equal to the manifest's sequence number and take the node otherwise. If we read a tombstone without a corresponding node, we ignore it.
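
A sketch of the reconciliation rule for two copies of the same node ID (hypothetical types; tombstones are handled analogously per the paragraph above):

```rust
// Sketch: choosing between two on-disk copies of the same branch node ID.
struct NodeCopy {
    id: u64,
    seq: u64,
    // ... rest of the node header and body
}

fn pick<'a>(a: &'a NodeCopy, b: &'a NodeCopy, manifest_seq: u64) -> &'a NodeCopy {
    debug_assert_eq!(a.id, b.id);
    // A copy with a sequence number beyond the manifest belongs to an
    // unfinished commit, so prefer the other one.
    let committed = |c: &NodeCopy| c.seq <= manifest_seq;
    match (committed(a), committed(b)) {
        (true, false) => a,
        (false, true) => b,
        // Otherwise take the higher sequence number among committed copies,
        // or the lower if (unexpectedly) both are beyond the manifest.
        (true, true) => if a.seq >= b.seq { a } else { b },
        (false, false) => if a.seq <= b.seq { a } else { b },
    }
}
```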

This is a sequential read of about 15GB in the absolute worst case. That is, we expect an index over billions of items to give us a maximal branching factor of around 750-800, and a worst-case branching factor of 375-400 if all branches are half-full. This occupies about 10GB if all the branches are half-full. Plus some scratch. Most drives can read several gigabytes per second, so we're discussing at most a few seconds of read time on startup.

06-29

Coming back to BTrees.

The main problem with radix trees is that they are generally not robust against long shared prefixes. Even if we salt, it's reasonably likely to get at least a few clusters with prefixes much longer than expected.

A BTree with a bit-level prefix extraction would probably get us the fanout we need.

Something (logically) like a branch node layout of

n: u16 // item count
prefix_len: u8 // bits
separator_len: u8 // bits
prefix: var_bits
padding // to next whole byte
separators: [var_bits] // n * separator_len
padding // to next whole byte
node_pointers: [u32] // (n + 1) * 32

You can get fanouts of around 750 with this approach, assuming your prefix is something like 28 bits with a separator length of another 8-12 bits, and assuming 4K pages (32768 bits). Memory usage for indexing very large numbers of keys might land between 5-10 GB.
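
A quick sketch of the fanout arithmetic behind those numbers (4K pages, field widths as in the layout above; the byte-alignment padding is ignored):

```rust
// Sketch: how many separators fit in one 4K branch page with this layout,
// ignoring the byte-alignment padding.
fn branch_fanout(prefix_len_bits: usize, separator_len_bits: usize) -> usize {
    let page_bits = 4096 * 8;
    // fixed header: n (16) + prefix_len (8) + separator_len (8) + the prefix itself
    let header_bits = 16 + 8 + 8 + prefix_len_bits;
    // each separator costs separator_len bits and each of the n + 1 child
    // pointers costs 32 bits; reserve the extra pointer up front.
    (page_bits - header_bits - 32) / (separator_len_bits + 32)
}

// branch_fanout(28, 12) == 742, branch_fanout(28, 8) == 816
```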

Probably only the bottom level of BTree branches needs to be stored on disk. Though this does need to be reconciled with shadow paging - perhaps storing two copies and having each branch additionally carry a positional index and a 48 bit sequence number. Load all the branches on startup. If you load two branches with the same positional index, take the one with the higher sequence number, unless that sequence number is greater than the primary sequence number in the manifest, in which case take the lower.

The positional index could be determined based off of splits and merges. For example, you start with a single branch with position 0. This can be split into [00] and [01], with the [00] node occupying the slot that was previously held by [0]. Those in turn can be split into [000], [001], [010], [011]. And so on. Something like that. Downside is that it only accommodates power-of-two numbers of branch nodes, whereas we expect to have fewer.

06-27

Looking further into this idea.

We are discussing two key pieces:

  1. All key-value data, logically sorted in data pages.
  2. Some index into those data pages.

Note that if (2) is a B-Tree based index this is just a B+Tree.
With our assumptions from the previous entry, 4TB of data split into 4K pages gives us 1 billion pages, and we would need 4GB just to store the pointers. With 8K pages, the same number of pointers covers 8TB, which is enough for the foreseeable future.

Basically, we have a certain fixed number of leaf pages we want to index and the question is just how we do that. We are essentially considering options like some kind of logically compressed radix tree, but we can't afford to keep lots of memory pointers.

Let's say you keep a structure that looks like this:

  • a "base arity" of a radix tree b
  • repeated (u8, [NodePointer]) encoding
    • the u8 is a packed representation of the depths of 4 sub-trees, each depth encoded with 2 bits. I.e. an encoding of 0b00000000 would be followed by an array of 4 node pointers, but an encoding of 0b01001011 would be followed by 2 + 1 + 4 + 8 = 15 node pointers, with the first 2 going to the first sub-tree, the next one going to the second sub-tree, and so on (see the sketch after this list).
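
A sketch of decoding that packed byte (the bit layout - four 2-bit depths, most significant pair first - is an assumption consistent with the example above):

```rust
// Sketch: unpack the per-subtree depths from the u8 and compute how many
// node pointers follow it. Two bits per sub-tree, four sub-trees per byte.
fn pointer_count(packed_depths: u8) -> usize {
    (0..4)
        .map(|i| {
            let depth = (packed_depths >> (6 - 2 * i)) & 0b11; // sub-tree i's depth
            1usize << depth // a depth-d sub-tree contributes 2^d pointers
        })
        .sum()
}

// pointer_count(0b00000000) == 4
// pointer_count(0b01001011) == 2 + 1 + 4 + 8 == 15
```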

The idea is you start with a base radix of b and then allow for different depths to emerge organically.

b might be

Also

We should have a good story for blob storage: this proposal expects that values are short. Values larger than a certain size should be spilled out to overflow pages, ideally contiguous. This is necessary to support stuff like smart contract code in particular.

IOPS

06-26

I am looking into approaches for indexing all the data. Here is one approach, possibly of interest:
just keep (effectively) a giant sorted Vec<(KeySeparator, PagePointer)> in memory and perform a binary search over it to find the likely pages to search. Record changes made to this index to a sequential file, and occasionally flush the whole thing to disk. This is the dumbest index possible, but it may make sense for us based on our assumptions.

Let's start from the constraints we have:

  1. We expect to run on a consumer SSD, probably 4TB max.
  2. We want to index all the data in memory on a regular node, with e.g. 16GB of memory, such that we typically require only one round-trip to index it all.
  3. We need fast random point lookups but don't care much about ordered iteration.
  4. We expect to handle a few tens of billions of items at most, or something between the scale of 2^34 and 2^36.
  5. We expect that keys are hashes and have high entropy in their initial bits. Even use-cases like preimages, separated storage for pallets/contracts/etc. only require fixed-length keys. We assume that iteration, as necessary, would be the responsibility of the commitment data structure.
  6. We allow for arbitrary-length values but generally expect these to be short, with a few larger pieces like smart contract code.

Let's work backwards from the amount of data we expect our database to use. That is, something like 3TB.
If we divide this up into 4K pages, that means we have 3 * 2^40 / 2^12 = 3 * 2^28 = ~800 million pages to index. With 8K pages, we have half that: ~400 million.

Storing 4-byte page pointers for each of these alone would take 3 or 1.5 GB of memory, respectively.
We also want some index over these which we can search in memory.

If we additionally store a 4 byte separator with each page pointer, we would double our memory usage to 6 or 3 GB, respectively. Even if we store 2^36 items, we would have on average only 2^(36 - 32) = 2^4 = 16 items sharing the same 32 bit prefix. This likely all fits into one page.

The main observation here is that if there are multiple pages all containing keys which are greater than K but less than K+1, we can just fetch them all.
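
A sketch of the lookup path under this scheme (assuming the 4-byte separator is the 32-bit prefix of the smallest key in each page, which is one plausible reading of the estimate above):

```rust
// Sketch: find every page that may hold `key`, given the sorted in-memory
// index of (separator, page pointer) pairs.
type KeySeparator = [u8; 4];
type PagePointer = u32;

fn candidate_pages(index: &[(KeySeparator, PagePointer)], key: &[u8; 32]) -> Vec<PagePointer> {
    let sep: KeySeparator = key[..4].try_into().unwrap();
    // Pages whose smallest key shares the lookup key's 32-bit prefix...
    let lo = index.partition_point(|(s, _)| *s < sep);
    let hi = index.partition_point(|(s, _)| *s <= sep);
    // ...plus the page just before them, whose range may extend into the
    // prefix. All of these can be fetched from disk in parallel.
    let start = lo.saturating_sub(1);
    index[start..hi].iter().map(|(_, p)| *p).collect()
}
```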


Alternatively. This is something like a radix tree with 256 children. The key difference is that radix trees only support a single pointer per child. We don't want a single-value per leaf construction but a multi-value-per-leaf construction.

Note that (2^8)^4 * 4 = 2^32 * 4 = 16GB, that is, a radix tree differentiating the first 4 bytes of keys takes 16GB of memory.

What we could do is have value pages form a linked list and have the radix tree point to the first page holding a key with a given prefix. Then, loading the whole thing into memory is as simple as iterating the radix tree, loading each first page, following the links, and recording the page IDs encountered along the way.

Furthermore, we don't really need to worry about the radix tree being packed efficiently into pages because we don't expect to fetch it on the fly. But I do think we could come up with some way of storing it in a relatively compact way and just read the whole thing into memory.

06-21

Related to the previous post. I think if we modify Substrate to note when it's writing through a shared prefix we can do this even better.

06-16

I've maybe cracked one of the key issues with variable-length keys.

With fixed-length keys, we keep the 32-byte key path and leaf hash underneath each leaf. But with variable length keys, we can't keep the full key there as it will likely not fit. The property that this gives us is that we can split leaves when inserting into the merkle trie.

If we add an additional requirement that variable length keys are always common prefixes followed by a hash, we can just keep the 32 bytes that follow the common prefix underneath the leaf as a discriminant instead of the full key. This tracks with Substrate storage APIs.

There is one issue: when there is only one key with the common prefix and the key isn't yet located at the common prefix. We can flag such cases with a marker value of e.g. all 1 bits. In that case, we'd have to look up the key with a btree seek. But if the common prefix is really that small, the extra round-trip won't impact performance meaningfully.
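
A sketch of what would be stored under the leaf in this scheme (the helper and flag names are hypothetical):

```rust
// Sketch: the 32 bytes kept under a leaf for a key of the form
// `common_prefix ++ 32-byte hash`.
const UNRESOLVED_MARKER: [u8; 32] = [0xff; 32];

fn leaf_discriminant(suffix_after_prefix: &[u8; 32], leaf_sits_at_common_prefix: bool) -> [u8; 32] {
    if leaf_sits_at_common_prefix {
        // Normal case: the 32 bytes following the common prefix are enough to
        // split this leaf when another key with the same prefix is inserted.
        *suffix_after_prefix
    } else {
        // Lone key not yet located at its common prefix: store the all-ones
        // marker and fall back to a btree seek to recover the full key.
        UNRESOLVED_MARKER
    }
}
```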

There are other solutions to explore in this design space, like making the key path equal to the hash of the key plus the first 16 bytes following the key.

06-13

I am looking into B-Tree designs for Key-Value Storage.
One insight is that, while leaf pages should always be 4K, the branch pages (so long as they are reasonably expected to be in memory) can be much larger and even of variable size.

The benefit here is that we can update branch pages on disk only occasionally, and always with sequential writes.

Consider an approach where we have 4K leaves and branches which grow up to 2MB. Branches are stored all together. With each GB of RAM we can store at least 512 such branches. We expect an extremely shallow tree, of depth 2 or 3 at most. We use a WAL to keep track of all key-value changes and compact summaries of branch node changes. We have a leaf-page store that holds all leaves plus a single meta page. We have a kind of copy-on-write semantic for leaves, where leaves which have been updated since the last time the branches were fully written to disk are copied. Leaves are referred to by their 32-bit index in this store and a free-list head pointer is kept in the meta page.

The way an update works is that we just write to the WAL and then apply the random leaf writes to the leaf store. Once the WAL is big enough, we write all the branch nodes out to disk sequentially, fsync, increment the sequence number, and fsync again. After this is done, we can clean up old WALs and delete any old copy of the branch nodes. This approach lets us take advantage of fast sequential write speeds for WALs and all branch node updates, while ensuring that reads are still fast.
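
A sketch of that commit sequence (hypothetical file layout; the point is the ordering of writes and syncs, not the exact types):

```rust
// Sketch of the commit path described above: WAL first, random leaf writes
// second, periodic sequential branch flush with fsyncs around the sequence bump.
use std::fs::File;
use std::io::{Result, Seek, SeekFrom, Write};

struct Store {
    wal: File,             // append-only, sequential
    leaf_store: File,      // 4K leaf pages, random access, copy-on-write
    branch_snapshot: File, // all branch nodes, rewritten sequentially
    meta: File,            // holds the sequence number
    sequence_number: u64,
}

impl Store {
    fn commit(&mut self, kv_log_entry: &[u8], leaf_writes: &[(u64, [u8; 4096])]) -> Result<()> {
        // 1. Log key-value changes + compact branch-change summary sequentially.
        self.wal.write_all(kv_log_entry)?;
        self.wal.sync_data()?;
        // 2. Apply the random leaf-page writes (pages copied-on-write since the
        //    last full branch flush).
        for (index, page) in leaf_writes {
            self.leaf_store.seek(SeekFrom::Start(*index * 4096))?;
            self.leaf_store.write_all(page)?;
        }
        Ok(())
    }

    fn flush_branches(&mut self, all_branches: &[u8]) -> Result<()> {
        // 3. When the WAL is big enough: write every branch node sequentially,
        //    fsync, bump the sequence number, fsync again. Old WALs and old
        //    leaf copies can then be reclaimed.
        self.branch_snapshot.write_all(all_branches)?;
        self.branch_snapshot.sync_data()?;
        self.sequence_number += 1;
        self.meta.seek(SeekFrom::Start(0))?;
        self.meta.write_all(&self.sequence_number.to_le_bytes())?;
        self.meta.sync_data()?;
        Ok(())
    }
}
```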

update: pep mentioned we could just keep all the changes in memory and should look into some way of amortizing the disk write costs to avoid huge batch writes. we should be able to do this with some kind of mapping table like the BW Tree has.

06-11

We discussed putting the values into the node-pages and whether this is worth it.

  • benefit: we get small values with a single round-trip
  • detriment: complexity (node relocation), larger I/O queue usage.

We have decided on a hash-table for pages and a BTree (mdbx) for key-value storage.

To discuss: keeping hash-table in sync with mdbx

  • we use a joint WAL for this. it contains the changes to key/value pairs as well as the changes to pages.
  • during a crash, it may be that MDBX falls back to the previous state, or it may be that MDBX has concluded its commit while the hash-table has not completed it. In both cases, we roll forward to the next state by applying the WAL
  • mdbx does copy-on-write so this sucks

To discuss: write speed measurements for mdbx
To discuss: synchronous state root is a current requirement but may be a bad idea.

We compared two approaches. The first is hash-db for node pages + mdbx for KV. The second is a B-Tree for all the items. We compared on a few metrics:

                              Approach 1    Approach 2
implementation complexity
read latency
read bandwidth
write bandwidth
cache amplification
user complexity
keyspace iteration            -             -
compact page format support   ?             ?
bigger pages                  -             -
index update CPU parallelism
uncertainty

06-10

We discussed cache amplification with SpaceJam.

When considering read performance vs a btree, as in the last few entries, the question is "how much memory do I need to amortize my disk accesses to 1" - this is cache amplification.

For a btree, this depends on the branch-to-value node ratio. That is, btrees require cache amplification to store all branches.

For our hash-map, it depends on the number of pages we keep.

The graph linked below shows that when the btree's branch-to-value ratio is less than the number of items we can store in each block of the hash-map, the hash-map is more cache-efficient.

https://www.desmos.com/calculator/a12l9y66rc

We can also play around with using less memory to cache blocks which are nowhere near full. e.g. the expected amount of memory per block is based on the load factor. a map which is lightly allocated should take much less memory to store, by storing only the first λ+ϵ items from each bucket, where λ is the expected amount of items per block and ϵ is a safety margin of extra padding for buckets that have more than expected items.

Of course, we would need to balance this with memory efficiency.

06-09

Revisiting yesterday's value hash-table discussion.

It seemed that triangular probing was fairly weak without in-memory metadata enabling skips, as probe distances could become large and lead to enormous amounts of SSD page queries in the long tail.

We can consider blocked cuckoo hashing as a possible algorithm. This is discussed in https://web.stanford.edu/class/cs166/lectures/11/Slides11.pdf (slide 242). To briefly describe here:

  1. Each key has 2 blocks randomly chosen with 2 independent hash functions.
  2. Each block is a full SSD page, which contains buckets for N items.
  3. During an insertion, we insert into the first free bucket in either of those blocks or randomly choose an element to displace from either one.

This provides useful properties:

  1. worst-case reads: 2 (parallelizable SSD page fetches). 1 in expectation at reasonable load factors.
  2. worst-case write performance that cannot be influenced by user input.

Probing a key requires at most 2 (parallelizable) SSD page fetches.
As always with cuckoo hashing, insertions can lead to long chains. However, note that the probability of a full block at a given load factor can be computed with the Poisson distribution by finding P(M >= N), where M is Poisson-distributed with mean equal to the expected number of items per block (e.g. 0.8 * N at load factor 0.8) and N is the number of buckets per block. The probability that a block has at least one free slot is P(M < N). See https://stattrek.com/online-calculator/poisson

Here is a table for N=107 (as below)

Load Factor    P(block full)    P(block full)^2
<= 0.7         ~0               ~0
0.75           0.0025           ~0
0.8            0.0141           0.0002
0.9            0.149            0.022

P(block full)^2 is approximately our probability of needing to evict an item at all, i.e. the two chosen blocks for a key are both full. We will successfully move the evicted item to the next block with 1 - P(block full) probability, so the probability of needing to do an eviction and that eviction requiring more than a single extra round-trip is P(block full)^3.
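
A small sketch reproducing the P(block full) column above (exact Poisson tail, N = 107 buckets per block, mean = load factor * N):

```rust
// Sketch: P(M >= n) for M ~ Poisson(mean), summing the PMF iteratively
// (term_k = term_{k-1} * mean / k) so nothing overflows.
fn poisson_tail_at_least(n: u32, mean: f64) -> f64 {
    let mut term = (-mean).exp(); // P(M = 0)
    let mut cdf = term;
    for k in 1..n {
        term *= mean / k as f64;
        cdf += term;
    }
    1.0 - cdf
}

fn main() {
    for load in [0.7, 0.75, 0.8, 0.9] {
        let p = poisson_tail_at_least(107, load * 107.0);
        println!("load {load}: P(block full) = {p:.4}, squared = {:.6}", p * p);
    }
}
```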

This approach is also not vulnerable to attacks where user input drives worst-case write performance. Because the database chooses an item to evict randomly, write performance is independent from user input.

06-08

Today we discussed the possibility of using a hash-table for the flat key/value storage, as well as for the pages. We believe this should be faster than LSM and BTree architectures, and this is not just hubris to think we can optimize beyond mature software stacks. We don't need MVCC, rollbacks, multiple writers, and other similar features which DBs often do, so we can tailor to our use-case.

The hash-table for node-pages remains the same.

The key-value hash-table would involve 3 parts:

  1. a "heap" where values are actually stored.
  2. an allocator for placing values into the heap
  3. a fixed number of buckets, which are small and store the location in the heap where the value is kept.

A bucket may be, for example, something like 38 bytes:

  • 32 bytes of key information
  • 6 bytes of value heap location information

At 38 bytes, we could store 107 full buckets per 4096 byte page, with 30 bytes left over for metadata within the page.

Unlike the node-page hash-table, we may not be able to keep meta-bits in memory to avoid probing. The reason is that we expect to have many more values than node-pages. Perhaps keeping one bit of meta-information in memory per bucket is tractable, but with billions of KV pairs we would still require over 1Gi of resident memory.

However, we have seen from our experiments in triangular probing that even without in-memory metadata, at a load factor of 0.9, the mean number of probes required for absent keys is ~10 and ~3 for present keys. With 107 buckets per page, it is likely that a single SSD page read will contain the relevant bucket, as T(10) = 55 from the initial bucket. This gives an expectation that we would need to query only a single page from the SSD to find the location of the value. As long as a bucket requires 20 probes or fewer (T(20) = 210), we would never need to load more than 2 pages from the SSD per read (and of course never fewer than 1).
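
For reference, a sketch of the triangular probe offsets behind those T(n) figures (107 buckets per 4K page):

```rust
// Sketch: bucket offsets visited by triangular probing, relative to the bucket
// the key hashes to. With 107 buckets per 4K page, T(20) = 210 keeps the whole
// sequence within two pages of a page-aligned starting bucket.
fn triangular_offsets(max_probes: u64) -> impl Iterator<Item = u64> {
    (0..=max_probes).map(|i| i * (i + 1) / 2) // 0, 1, 3, 6, 10, ..., T(max_probes)
}

fn main() {
    let offsets: Vec<u64> = triangular_offsets(20).collect();
    println!("{offsets:?}"); // ends at T(20) = 210
}
```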

After that, we would need to query only a single page from the drive to read the value information.

It is worth noting that the 99th percentile absent key probes reaches T(20) at a load factor of 0.7, and this may open up some possibility of sequencer DoS with timing attacks determining the slowest keys. We could recommend that sequencers resize their database fairly eagerly to mitigate this.

Because absent key probes are our main issue, we may look into bloom filters or alternative probing algorithms which tend to preserve bucket locality better than triangular probing.

The details of the "heap" allocator are currently uncertain.

I dug in to compare to mdbx with 4096-byte pages, for example.
The page overhead is 160 bytes, leaving 3936 bytes for nodes. Each node has 8 bytes of header information, plus the key material (in a branch). For 32-byte keys, a node is 40 bytes as a result. This gives an expected dense branching factor for our workload in mdbx of 98. The first 3 layers encompass ~1M pages and ~3.5Gi of space and are plausibly all resident in the OS page cache. log_98(1,000,000,000) is ~4.5, meaning that in MDBX we'd also expect to do ~1.5 additional SSD page loads on average per key with 1 billion keys. log_98(10B) is 5.022. Furthermore, this is also the worst-case performance, unlike our hash-table.

06-07

In Berlin. Created a separate log to discuss design decisions: https://hackmd.io/wL-w03mWSMCpu9_C1JKa0A

05-29

Looking into crash recovery.

Modern filesystems provide some crash recovery guarantees, but can experience failures. Specifically, when we refer to crashes, we have two kinds in mind:

  1. Process Crash
  2. Kernel Crash

In a process crash, anything which has been flush()ed will eventually be written to disk, as it has been passed to kernel space already. This can lead to some loss of data, but we won't deal with issues like partial writes, atomicity, etc.

In a kernel crash (panic, loss of power, etc.), we can encounter further inconsistencies. This paper (https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf) contains some information on the atomicity and ordering of operations on common filesystems. These were determined by analysis of the block-level execution traces emitted by the filesystem.

Almost all modern filesystems provide same-file append/append ordering and multi-block prefix append atomicity. This includes ext4 in journal, nodelalloc, and ordered (default) modes, xfs, and reiserfs. This does not include ext4 in writeback mode.

Suggestion: use these properties to provide crash tolerance on these filesystems and disclaim support for ext4 data=writeback.

Same-file append/append ordering means that two writes appending to the same file will never be reordered after a crash. Multi-block prefix append atomicity means that a large append call will always result in some continuous prefix of the data being written.

And notably, single-block writes have not been found to be atomic after a crash.

I am not 100% sure that this source can be trusted

Basically: sequential WALs will be our bread-and-butter for crash tolerance, even when committing to our hash-table.

Here is a possible strategy for taking advantages of these properties to preserve crash-tolerance in our DB:

  1. Prepare the commit operation, only reading from the hash-table.
  2. Serialize the commit operation to disk, making a copy of any buckets which are planned to be overwritten
  3. Apply the commit to the hash-table. When data is being moved from one bucket to another, read only from the serialized commit from (2)
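
A sketch of how those three steps might map onto file operations (hypothetical layout; step (2) relies on the prefix append atomicity discussed above to make incomplete commits detectable):

```rust
// Sketch of the three steps as file operations. Step (2) is a single large
// append; prefix append atomicity plus the length prefix lets recovery tell
// whether it completed.
use std::fs::File;
use std::io::{Result, Write};

// Planned bucket writes, plus copies of every bucket about to be overwritten.
struct CommitRecord {
    bytes: Vec<u8>,
}

fn apply_to_table(_table: &mut File, _record: &CommitRecord) -> Result<()> {
    // Step (3): random writes of the planned buckets. Data being moved between
    // buckets is read from the record, never from the (half-updated) table.
    Ok(())
}

fn commit(wal: &mut File, table: &mut File, record: &CommitRecord) -> Result<()> {
    // Step (2): serialize the commit in one append.
    wal.write_all(&(record.bytes.len() as u64).to_le_bytes())?;
    wal.write_all(&record.bytes)?;
    wal.sync_data()?;
    // Step (3): apply. If we crash here, recovery simply re-runs this step
    // from the serialized record.
    apply_to_table(table, record)?;
    table.sync_data()?;
    Ok(())
}
```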

This approach is crash-resilient on modern filesystems if the above source is to be trusted. When recovering from a crash, there are three possibilities:

  1. Steps (2) and (3) both completed successfully.
  2. Step (2) completed, but step (3) did not. In this case, we can just run step (3) again.
  3. Step (2) did not complete. The commit has been lost, but data integrity from the previous commit has been maintained.

The multi-block prefix append atomicity property lets us distinguish whether Step (2) completed.

It is worth noting that this strategy does incur some additional write amplification, and read amplification in the event of a crash. The write amplification is especially pronounced when moving data from one bucket to another: if writing to a bucket requires X writes in expectation, then we perform 2X writes - the first X sequentially (step (2)) and the second X randomly (step (3)).

At high load factors, this write amplification can be quite significant. The plots in https://github.com/rphmeier/hash-psl-simulation will give an idea of how much data must be copied per commit at different load factors. As an example, when inserting a new value at a load factor of 0.8 into a robin-hood table, we expect 9 additional writes on top of the write for the inserted value.

05-24

Expanding on the bitfield idea for Robin Hood Hashing.

We can keep a bitfield of N bits per bucket, where it has the following semantics:

  1. All 0 means that the bucket is empty
  2. All 1 means that the bucket is full, and the PSL is 2^N - 1 or greater.
  3. Any other value V means that the bucket if full, and the PSL is exactly V.

e.g. with 3 bits, a value of 1 means the bucket is full with a PSL of 1 (the key stored in the bucket hashed there originally)
or a value of 4 means the bucket is full with a PSL of 4. or a value of 7 means the bucket is full with 7 or greater PSL.
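
A sketch of the packed bitfield with N = 3 bits per bucket, as in the example:

```rust
// Sketch: 3-bit-per-bucket PSL metadata, packed into a byte vector.
// 0 = empty, 7 = full with PSL >= 7, otherwise the exact PSL.
const BITS: usize = 3;

struct PslBitfield {
    bits: Vec<u8>,
}

impl PslBitfield {
    fn new(buckets: usize) -> Self {
        Self { bits: vec![0u8; (buckets * BITS + 7) / 8] }
    }

    fn get(&self, bucket: usize) -> u8 {
        let bit = bucket * BITS;
        // The 3-bit value may straddle a byte boundary, so read two bytes.
        let lo = self.bits[bit / 8] as u16;
        let hi = *self.bits.get(bit / 8 + 1).unwrap_or(&0) as u16;
        (((hi << 8 | lo) >> (bit % 8)) & 0b111) as u8
    }

    fn set(&mut self, bucket: usize, value: u8) {
        debug_assert!(value < 8);
        let bit = bucket * BITS;
        let (byte, shift) = (bit / 8, bit % 8);
        let mut word = (*self.bits.get(byte + 1).unwrap_or(&0) as u16) << 8 | self.bits[byte] as u16;
        word = (word & !(0b111 << shift)) | ((value as u16) << shift);
        self.bits[byte] = word as u8;
        if let Some(b) = self.bits.get_mut(byte + 1) {
            *b = (word >> 8) as u8;
        }
    }
}
```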

This can help somewhat with lookups: e.g. if we hash to a bucket B and the bucket is full with a non-1 PSL, we know it's not our value. Or if it is full, but the next bucket B+1 has a PSL != 2, it's also not our value. This reduces the amount of disk probing we might have to do within clusters.

It also helps with insertions and deletions, in knowing which buckets we might have to load. I want to simulate PSLs for the occupancy of buckets at different levels of load so we can get a reasonable estimate on bits needed.

The memory usage also seems acceptable. If we have e.g. 3 bits per bucket and 2^31 (2Gi) buckets, we would keep a bitfield of 3 * 2^28 bytes, or 768MiB. Storing 1Gi pages at 50% occupancy, we could have billions of values and the pages themselves would occupy 4TiB of storage space. Any node willing to dedicate terabytes of storage space to pages seems likely to have a couple GiB of RAM to spare.

This bitfield does need to be kept on disk. As we know, writing a single bit to the SSD requires writing the whole page. Therefore, we should tend to batch updates to the hash-table, for example, when committing a write-ahead-log. Some analysis of the write amplification effects of storing this bitfield:

  • The worst-case write amplification is where we write 1 bitfield-page per page. We would have an additional 1 added to our write amplification per page in this setting. This would be common with many small updates to the hash-store.
  • The best case is where we fully update a bitfield-page as part of a batch, adding only 1 / (32768 / N) write amplification per page (N bits per bucket in the bitfield, 32768 bits per 4096-byte bitfield page).
  • With updates that are relatively small against the total size of the store (e.g. 100Ki page update vs 1Gi buckets), our write amplification is likely closer to the worst-case than the best-case.

With Robin Hood Hashing's expected PSL of 3 at 80% occupancy, we can likely use such a bitmap to filter out many of our disk queries and improve access times, while trading off some memory and some write amplification.

05-23

I am looking into different hash table families for our implementation.

We have some requirements:

  1. We optimize for page read times more than page writes. We keep in mind both the average and worst-case probe sequence lengths (PSL) as a result.
  2. We optimize for having in-memory data structures that make queries, on average, faster.
  3. We would like to avoid any need for urgent re-sizing of the hash table, preferring to slowly degrade performance.

One useful property of our workload that we can take into account is that whenever we write a page, we know with certainty whether it is fresh or whether it is being updated. Another is that whenever we delete a page, we must have read it already and are certain of its location in the hash-table and the probes required.

I considered two potential algorithms.

One is Cuckoo Hashing.

  1. Cuckoo Hashing has an expected PSL of 2 and a worst-case of 2. Insertion and deletion can be bounded to expected-worst-case time, except in the case where the table is full (see point 3)
  2. An in-memory Cuckoo Filter for pages can make queries faster, on average, as it can be tuned to a low false-positive rate. However, it's quite memory intensive for our use-case and will not scale well to billions of entries. The amount of memory needed is given by n_items * fingerprint_size. Our fingerprint size may need to be 64 or more bits.
  3. Cuckoo Hashing can lead to urgent re-sizing of the hash table. Although it performs well at less than a 50% occupancy rate, it degrades quickly at higher occupancy rates.

Second is Robin Hood Hashing.

  1. Robin Hood Hashing gives an expected PSL of 1.5 at occupancy 50% and 3 at occupancy 80%. Its worst-case PSL is O(n), similar to linear probing, but in practice its reordering keeps PSLs and their variance low.
  2. Robin Hood Hashing for lookups can be somewhat optimized with a giant in-memory occupancy bit-map of all buckets - with 1 bit per possible page, we can tell with certainty when a page is not in the table. For example, 2^30 pages would only require 2^30 bits or 128MB of memory. Due to our "fresh or not" property, it also tells us how many pages we might need to read to perform an insertion or deletion, allowing us to capitalize on faster sequential read properties in the case that probing sequences are long.
  3. Robin Hood Hashing performance degrades more slowly at occupancies over 50%, giving ample time to perform a resizing in the background.

On regular linear/quadratic hashing: these are known to perform worse in practice than Robin Hood Hashing.

I did not look deeply into Extendible Hashing but I do plan to.

05-05

I have been looking over this SILT paper which contains some very useful insights.

1. Use append-only storage to optimize for Sequential Writes

I think this would be something very useful to integrate into our hash-table design. The idea would be to still keep the hash-table, but store it along with a series of append-only files which make commits fast. Whenever we initially write a page to disk, we write it to append-only storage. We would also keep information about pages which have recently been written to the append-only store in RAM so we can look them up from their respective position in that store. For a lookup, it's still a random read, but we commit data to disk faster.

We would periodically need to clear out old append-only files and update the hash-table. But crucially, we only need to write the most stale pages to the hash-table. Ones which have been written more recently to the append-only storage could just stay resident there.

2. Keep indexes in RAM to reduce disk lookups

We want to have a hierarchy of page-stores based on how hot the data is. Stores at the top of the hierarchy use more RAM but are faster to query. The hierarchy can look like this:

  1. RAM-based cache: 0 disk lookups, 4096 bytes in RAM per page. For the hottest few thousand pages.
  2. Append-only store: 1 disk lookup, a few bytes in RAM per page. For the next hottest few tens of thousands of pages.
  3. Hash-table: O(1) disk lookups, a few bits in RAM per page. For the rest of the pages.

We technically don't need to keep any bits in RAM per page for the hash-table and can still get O(1) but the SILT paper shows that we can keep a few bits in RAM per page and reduce the likelihood of collisions substantially.

04-22

Revisit architecture now that we have a working prototype.

1. Further specializing update logic

We currently use a cursor for performing updates. We can improve update logic quite a lot by specializing the routine for updates.

The first issue we have now is in allocating new pages. When we encounter a fresh page, we currently query the cache, which falls back to the disk store, which finds no page, and then only after performing I/O do we then allocate. This is all happening in a critical section. In fact, during update we already know exactly which pages must be allocated fresh: any pages beyond one of the visited terminals. Update currently spends around 10-14% of its time allocating pages in a randw workload and 16-20% of its time performing I/O for 26-34% in total.
Recommendation: build specialized update logic that can allocate fresh pages without querying the store

A second issue is that we query pages during update logic from the broader shared page cache. These pages have already been fetched as part of the warm-up procedure and could be provided directly to the update logic. This would also eliminate the need for any tracking of dirtied pages. Update currently spends around 6-10% of its time querying the page cache and tracking dirtied pages.
Recommendation: have warm-up workers efficiently prepare the set of pages for use in update

Lastly, the update operation is embarrassingly parallel and even potentially pipelined.

  1. Even before all warm-ups have completed, it is possible to edit terminal nodes for pages which have been loaded and allocate new pages below them.
  2. Any terminal nodes which are edited in non-overlapping pages can be updated in parallel.
  3. All sub-trees which expand into newly-allocated pages will not contend over those pages with other threads.
  4. All compact_up operations eventually contend while moving up towards the root, but the page tree is extremely wide and should allow changes to propagate in parallel.

All this together, it would seem we can reduce the overhead of the update operation by around 30-40% on a single thread, and then massively parallelize and pipeline the remaining 70% of work. Parallelization should work perfectly at every level up to the root. After that, updates will primarily be bottlenecked on hashing.

2. Background Flushing and Cache Redesign

Writing pages to disk is a heavy I/O operation and is a major bottleneck in NOMT. In a 25k random write operation, over 30,000 pages are updated, for 117MB of total I/O. Preparing the database transaction and performing the write together consume around 60% of the time it takes to commit_and_prove.

Even under the unrealistic assumption that writing would take a single round-trip, with 1M IOPS it would take over 29ms to finish flushing the writes of all these pages. Flushing in the background, and even flushing only occasionally, would remove this I/O from the critical path. With our RocksDB backend, the total commit_and_prove time for 25k randw workload is 826ms. It is not easy to distinguish how much of this time is spent waiting on page fetches to conclude vs. the commit itself.

Flushing should be removed from the critical path. The user should be able to receive the state root and witness very quickly and have the option to wait for a WAL to be populated with the key-value pairs of the commit.

Instead, I recommend that we establish the following architecture:

  1. The page cache, which always reflects the on-disk pages.
  2. Page overlay(s), which contain changed pages.
  3. KV change lists

We maintain one active page overlay, which is eventually frozen after it reaches a certain size or enough time has passed. Once it is frozen, a background task flushes the contents to disk and reclaims the memory.

This is inspired by the design of RocksDB, which keeps one active "memtable" along with several frozen ones which are being flushed to disk.

In the RocksDB backend, I recommend that we maintain one page overlay and simply commit to RocksDB in the background. We allow the user to wait upon this operation, but also to receive the witness and new state root quickly. It is also not clear whether a dedicated page cache is beneficial in the RocksDB backend, given that RocksDB maintains its own caches. We will need to measure to say for sure.

3. Lock-Free Data Structures