IAVL fixes

Locality

Problem

TL;DR - we currently have no data locality. Every entry in the database is keyed by Hash(human readable key), so every read is a new I/O read. Every Get also walks the tree. So we currently do 20+ SSD accesses per Get, instead of 1 SSD access in the worst case and a RAM/L3 access in the average case. And for iteration, we do SSD reads instead of L3 cache reads (which prefetching would give us). Diagram for performance drawdown: scale

Solution

Basically, let's index the data twice: once by hash (so we don't have to think much about safety; Set works exactly as before), and a second time by the raw key string. Then make all gets / iteration operate on the raw keys.

Fixing this should improve gets to 1 SSD access in the worst case (a 20x improvement), and a RAM / L3 cache read in the average case (a 100x+ improvement).

This should also change iteration from one SSD file open/read per next() call to RAM/L2 reads (L2 potentially due to prefetching), again a 100x+ improvement.

IAVL Versioning

Currently IAVL versions every historical change, so imagine version numbers being synonymous with block heights. This is a second thing we need to think about when building this second data index.

Currently IAVL stores "live keys" as roughly LIVE_KEY_PREFIX/HASH in the db, and the stored node object records when it was last written to (its version). Orphaned nodes are stored as ORPHAN_KEY_PREFIX/VERSION_GOT_DELETED_AT/VERSION_GOT_WRITTEN_AT/KEY.

Orphaned reads of a key k currently work by specifying a version you want to read at. You first walk the tree for key k, and when you get to the leaf you check whether it was written at or before the version you are querying at. If so, you're done; otherwise, query the orphan db.
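
A rough sketch of that read path, with the two helper parameters standing in for the real iavl internals (they are purely illustrative, not actual iavl functions):

    // getVersioned sketches the current versioned-read fallback described above.
    func getVersioned(
        key []byte,
        queryVersion int64,
        // getLiveLeaf walks the live tree and returns (value, versionWrittenAt, found).
        getLiveLeaf func(key []byte) (value []byte, version int64, found bool),
        // getOrphaned looks the key up under the orphan prefix for the given version.
        getOrphaned func(key []byte, version int64) (value []byte, found bool),
    ) ([]byte, bool) {
        if value, version, found := getLiveLeaf(key); found && version <= queryVersion {
            // The live leaf was last written at or before the queried version,
            // so it is exactly the value that was visible at that version.
            return value, true
        }
        // Otherwise the value visible at queryVersion was orphaned by a later
        // write or delete, so it has to come from the orphan index.
        return getOrphaned(key, queryVersion)
    }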

Achieving the goal without getting bogged down in work

Let's build our second index for querying "semi-live" state, i.e. only re-index live state, not history. We could do this for history too, but that would take more architecting. Since we can make our value struct store the version, just like the current method does, we should see no degradation in performance for the things we care most about.

To keep things simple (though not the ideal design), we should re-use the existing nodeDB and just use a key prefix in there.

We do this by making a key in the nodeDB of the form FAST_KEY_READ_PREFIX/KEY_YOU_GET_WITH. The stored value is then (value, version written at). If the version written at is > {tree version}, then after reading, fall back to the old (slower) read methods.
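
As a rough sketch of that read path (the prefix, the encoding, and the helper names are illustrative assumptions, not a final layout):

    import (
        "encoding/binary"

        dbm "github.com/tendermint/tm-db"
    )

    // fastKeyPrefix is an assumed prefix for the fast index inside the
    // existing nodeDB; the real prefix is still to be decided.
    var fastKeyPrefix = []byte("f/")

    func fastKey(rawKey []byte) []byte {
        return append(append([]byte{}, fastKeyPrefix...), rawKey...)
    }

    // getFast tries the fast index first. ok == false means the caller should
    // fall back to the old (slower) hash-keyed read path.
    func getFast(db dbm.DB, key []byte, treeVersion int64) (value []byte, ok bool, err error) {
        raw, err := db.Get(fastKey(key))
        if err != nil || len(raw) < 8 {
            return nil, false, err
        }
        // Assumed encoding: 8-byte big-endian "version written at", then the value.
        writtenAt := int64(binary.BigEndian.Uint64(raw[:8]))
        if writtenAt > treeVersion {
            // Written at a newer version than the tree we are reading; fall back.
            return nil, false, nil
        }
        return raw[8:], true, nil
    }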

My understanding is that this should be fine with respect to mutexes for now, and should remain fine after we remove the ABCI mutex on app execution. The only issue with this plan would arise if we try to engineer queries to also happen while we're committing. I think that's fine to ignore for now. (We do need to be able to process queries while executing, though.)

TODOs for making base operations faster

FastNode

The current node struct has many extra fields. We should make the node for this faster cache simpler, e.g.:

type FastNode struct {
    versionLastUpdatedAt int64  // version at which this key was last written
    value                []byte // raw value stored under the key
    // TODO: Look into if this would help with proof stuff.
    leafHash []byte
}
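
A minimal sketch of how a FastNode could be serialized for the nodeDB, matching the assumed 8-byte-version-prefix layout used in the read sketch above (this layout is an assumption, not a settled format):

    import "encoding/binary"

    // encode serializes a FastNode as: 8-byte big-endian version, then the raw
    // value. leafHash is omitted until we decide whether it helps with proofs.
    func (n *FastNode) encode() []byte {
        buf := make([]byte, 8+len(n.value))
        binary.BigEndian.PutUint64(buf[:8], uint64(n.versionLastUpdatedAt))
        copy(buf[8:], n.value)
        return buf
    }

    func decodeFastNode(raw []byte) *FastNode {
        return &FastNode{
            versionLastUpdatedAt: int64(binary.BigEndian.Uint64(raw[:8])),
            value:                append([]byte{}, raw[8:]...),
        }
    }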

Three options regarding a deleted flag:

  • No migration logic: when querying the fast node index for live state, if a key is not there, we don't know whether it exists or not, so we still have to query against the old logic as well.
    • This seems probably fine.
  • With migration logic: if a key is not in the fast node index, we can guarantee it is not in live state.
    • But what if you make a query against history?
    • With a deleted flag: if the query height > the deleted-at height, there is no need to go to the old IAVL logic.
      • This is a lot of extra state, because deleted keys still have to exist in leveldb. This may make traversal worse as well; not super sure how the leveldb structure behaves here.
    • Without a deleted flag: you can't know whether the key existed at that height or not, and must use the old IAVL logic.

It seems like we probably shouldn't have a deleted flag. Most of the value is in live state, and anyway you don't query deleted things that often.

IAVL ops

  • IAVL.Get(key) - Index by key directly, maintain locality.
  • IAVL.Traverse(key_start, key_end) - Iterate over the store by keys directly; incredibly cheap due to locality.
    • New comments:
    • For iterating over live state, we can just iterate over the leveldb keys in the fast cache directly (see the sketch after this list). We may need to expose new functions for this; there is no reason to walk the entire tree.
    • Locality is preserved by leveldb doing its job correctly, since the keys you iterate over maintain locality.
    • We should ensure that the TM-DB implementation of iterators on leveldb also does sane things. (It does not for memdb, for instance.)
    • Old comments / getting significant speed improvements for proofs:
    • Basically, build a trie for how we store data on disk.
    • Do this by changing GetLeftNode / GetRightNode to fetch nodes via some method that preserves data locality.
    • This method says: suppose my current node has hash/index FOO; then there is a pointer to the left child at key "FOO/LEFT" and to the right child at "FOO/RIGHT".
    • These will then already be in the "cache" that was populated when the current node was fetched.
  • IAVL.Set(key, value) - Do the old logic, but also add the resulting value to the fast store.
  • Make a migration for the data store - see the relevant section.
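
A minimal sketch of iterating live state straight off the fast index, reusing fastKey / fastKeyPrefix from the earlier sketch (this is not a final API, and nil bounds are not handled for simplicity):

    import dbm "github.com/tendermint/tm-db"

    // iterateFast walks [start, end) over raw application keys using the fast
    // index, so ordering and locality come from leveldb itself rather than
    // from walking the IAVL tree.
    func iterateFast(db dbm.DB, start, end []byte, fn func(key, value []byte) (stop bool)) error {
        itr, err := db.Iterator(fastKey(start), fastKey(end))
        if err != nil {
            return err
        }
        defer itr.Close()
        for ; itr.Valid(); itr.Next() {
            key := itr.Key()[len(fastKeyPrefix):] // strip the fast-index prefix
            value := itr.Value()[8:]              // strip the assumed 8-byte version header
            if fn(key, value) {
                break
            }
        }
        return itr.Error()
    }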

If building a migration command is annoying, then we just store the value twice and double all write I/O size.

No migration logic for v1

Let's just not do any migration logic. Anything that gets written will be available to the fast method. All queries will check the fast method first; if that fails, fall back to the prior slow method.

We can get gets and sets working without it; migration mainly helps with traverse.

Migration methods:

  • Lazy migration
    • Iteration can't be made faster.
      • This makes it a non-option.
    • But gets & sets are just leagues better already.
  • All-at-once migration
    • Maybe a command we expose on the binaries? We basically build a function in the golang code that iterates the relevant sub-tree and puts all leaves into the new faster method.
      • It's a question of whether we do it as soon as you use the new IAVL version, or only when triggered by a command. It seems sensible to me to automate it, if we add logging.
    • All-at-once migration will be necessary if we end up doing any deeper data structure changes for proof speed improvements.

UPDATE: We have decided to do the all-at-once migration, and it happens automatically.
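
A rough sketch of what that one-shot migration could look like (the helper names are hypothetical; the real version would live inside iavl and reuse its own tree traversal, and FastNode, encode, and fastKey are from the sketches above):

    import dbm "github.com/tendermint/tm-db"

    // migrateToFastIndex walks every live leaf once and writes it into the fast
    // index, so subsequent gets and iteration can skip the tree walk. The
    // iterateLiveLeaves parameter is a stand-in for walking the live IAVL
    // leaves in key order.
    func migrateToFastIndex(
        db dbm.DB,
        iterateLiveLeaves func(fn func(key, value []byte, version int64) error) error,
    ) error {
        batch := db.NewBatch()
        defer batch.Close()

        err := iterateLiveLeaves(func(key, value []byte, version int64) error {
            n := &FastNode{versionLastUpdatedAt: version, value: value}
            return batch.Set(fastKey(key), n.encode())
        })
        if err != nil {
            return err
        }
        return batch.Write()
    }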

IAVL Cache [x] - https://github.com/cosmos/cosmos-sdk/pull/10561

Literally just increase the IAVL cache size parameter 100x. It's ridiculously small at the moment (32KB).
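
For reference, the cache size is just the parameter passed when the store constructs its tree; a minimal sketch (the concrete number is illustrative, not the agreed-on value):

    import (
        "github.com/cosmos/iavl"
        dbm "github.com/tendermint/tm-db"
    )

    // newTree constructs the store's tree with a much larger node cache.
    func newTree(db dbm.DB) (*iavl.MutableTree, error) {
        const cacheSize = 500000 // illustrative: ~100x larger than the current tiny default
        return iavl.NewMutableTree(db, cacheSize)
    }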

https://github.com/cosmos/cosmos-sdk/issues/1714

LevelDB Compaction [x] - https://github.com/marbar3778/tm-db/pull/1

Whenever we prune, we should force a compaction to take place. https://twitter.com/valnodes/status/1435465250803945472

I suggest we add a method to the TM-DB interface for "ForceCompaction", and whenever we prune in tendermint block storage and IAVL state storage, just call that.
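
A minimal sketch of what that could look like for the goleveldb backend (the method name "ForceCompaction" is this proposal's, not an existing tm-db API):

    import (
        "github.com/syndtr/goleveldb/leveldb"
        "github.com/syndtr/goleveldb/leveldb/util"
    )

    // ForceCompaction compacts the entire goleveldb key range, reclaiming the
    // space left behind by pruned keys. A zero-value util.Range means "everything".
    func ForceCompaction(db *leveldb.DB) error {
        return db.CompactRange(util.Range{Start: nil, Limit: nil})
    }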
