TL;DR - we make no use of data locality. Every entry in the database is keyed by Hash(human readable key), so every read is a fresh I/O operation. Also, every Get walks the tree, so we currently do 20+ SSD accesses per Get, instead of 1 SSD access in the worst case and a RAM/L3 access in the average case. And for iteration, we do SSD reads instead of L3 cache reads (which prefetching would give us). Diagram for performance drawdown: (image not included)
Basically, let's index the data twice: once by hash (so we don't have to think much about safety; Set works exactly as before), and a second time by the raw key string. Then make all Gets / iteration operate on the raw key strings.
Fixing this should improve Gets to 1 SSD access in the worst case (a 20x improvement), and to a RAM read / L3 cache read in the average case (a 100x+ improvement).
This should also change iteration from one SSD file open/read per next() call to RAM/L2 reads (L2 potentially thanks to prefetching), again a 100x+ improvement.
Currently IAVL versions every historical change, so think of version numbers as synonymous with block heights. This is the second thing we need to think about when building this second data index.
Currently IAVL has "Live keys" as roughly LIVE_KEY_PREFIX/HASH
in the db, and the node object stored maintains when it was last written to. (Its version) Orphaned nodes are stored as ORPHAN_KEY_PREFIX/VERSION_GOT_DELETED_AT/VERSION_GOT_WRITTEN_AT/KEY
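To make that layout concrete, here is a rough sketch of how such keys might be composed. The `n/` and `o/` prefixes and the string encoding are illustrative stand-ins, not the exact IAVL key format.

```go
// Sketch only: key layouts as described above, with illustrative prefixes;
// the real IAVL key format uses fixed-width binary encodings.
package sketch

import "fmt"

// liveNodeKey returns the key a live node is stored under: LIVE_KEY_PREFIX/HASH.
func liveNodeKey(hash []byte) []byte {
	return append([]byte("n/"), hash...)
}

// orphanKey returns the key an orphaned node is stored under:
// ORPHAN_KEY_PREFIX/VERSION_GOT_DELETED_AT/VERSION_GOT_WRITTEN_AT/KEY.
func orphanKey(deletedAt, writtenAt int64, nodeHash []byte) []byte {
	return []byte(fmt.Sprintf("o/%d/%d/%x", deletedAt, writtenAt, nodeHash))
}
```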
Reads of a key k at an old version currently work by specifying the version you want to read at. You first walk the tree for key k, and when you reach the leaf you check whether it was written at a version earlier than the one you are querying at. If so, you are done; otherwise you query the orphan db.
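A sketch of that read path in Go, following the description above; all of the types and helpers here (`tree`, `walkToLeaf`, `readOrphanAt`) are hypothetical stand-ins, not the actual IAVL API.

```go
// Sketch of the versioned read path described above.
package sketch

import "errors"

type leafNode struct {
	version int64  // version this leaf was last written at
	value   []byte
}

type tree struct{}

// walkToLeaf and readOrphanAt stand in for the real tree walk and orphan-DB lookup.
func (t *tree) walkToLeaf(key []byte) (*leafNode, error)               { return nil, errors.New("stub") }
func (t *tree) readOrphanAt(key []byte, version int64) ([]byte, error) { return nil, errors.New("stub") }

// getAtVersion reads key as of queryVersion: walk the live tree first, and only
// fall back to the orphan DB when the live leaf is newer than the query version.
func (t *tree) getAtVersion(key []byte, queryVersion int64) ([]byte, error) {
	leaf, err := t.walkToLeaf(key)
	if err != nil {
		return nil, err
	}
	if leaf.version <= queryVersion {
		return leaf.value, nil // live leaf predates the query version; done
	}
	return t.readOrphanAt(key, queryVersion) // otherwise query the orphan db
}
```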
Let's build our second index for querying "semi-live" state, i.e. only re-index live state, not history. We could do this for history as well, but that would take more architecting. Since we can make our value struct store the version, just like the current method does, we should see no degradation in performance for the things we care most about.
To keep things simple (though not ideal design), we should re-use the existing nodeDB and just use a key prefix within it.
We do this by making a key in the nodeDB of the form `FAST_KEY_READ_PREFIX/KEY_YOU_GET_WITH`. The value is then `(value, version written at)`. If the version written at is greater than the tree version being read, then after reading, fall back to the old (slower) read methods.
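A minimal sketch of that fast read, assuming an `f/` prefix standing in for FAST_KEY_READ_PREFIX; `getFastNode` and `slowGet` are placeholder stubs for the nodeDB lookup and the existing read path, not existing IAVL functions.

```go
// Sketch of the proposed fast read path.
package sketch

var fastKeyPrefix = []byte("f/") // stands in for FAST_KEY_READ_PREFIX

type fastEntry struct {
	value            []byte
	versionWrittenAt int64
}

// get tries the fast index first; if the fast entry was written after the tree
// version being read, it may not reflect that version, so fall back to the slow path.
func get(key []byte, treeVersion int64) ([]byte, error) {
	entry, found := getFastNode(append(fastKeyPrefix, key...))
	if found && entry.versionWrittenAt <= treeVersion {
		return entry.value, nil
	}
	return slowGet(key, treeVersion) // old hash-keyed tree walk
}

// getFastNode and slowGet are stubs for the nodeDB lookup and the existing read methods.
func getFastNode(dbKey []byte) (fastEntry, bool)        { return fastEntry{}, false }
func slowGet(key []byte, version int64) ([]byte, error) { return nil, nil }
```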
My understanding is that this should be fine with respect to mutexes for now, and should remain fine after we remove the ABCI mutex on app execution. The only issue with this plan would arise if we try to engineer queries to also happen while we're committing. I think that is fine to ignore for now. (We do need to be able to process queries while executing, though.)
The current node struct has many extra fields. We should make the node for this faster cache simpler, e.g.:
```go
type FastNode struct {
	versionLastUpdatedAt int64
	value                []byte
	// TODO: Look into if this would help with proof stuff.
	leafHash []byte
}
```
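For comparison, the existing node struct also carries the key, child hashes/pointers, height, and size, so an entry like this should keep the fast index's per-key overhead and write amplification relatively small.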
Three options regarding a deleted flag:

1. Don't have one. Seems like maybe we shouldn't have a deleted flag: most of the value is in live state, and you don't query deleted things that often anyway.
2. If building a migration command is annoying, then we just store the value twice and double all write I/O size.
3. Just don't do any migration logic. Anything that gets written to will be available to the fast method. All queries check the fast method first; if that misses, fall back to the prior slow method.

We can get Gets and Sets working without it. It would help with traversal.
Migration methods:
UPDATE: We have decided to do an all-at-once migration, and it happens automatically.
Literally just increase the IAVL cache size parameter 100x. It's ridiculously small (32KB at the moment).
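For illustration, this is roughly where that parameter gets threaded through when constructing a tree. The two-argument constructor matches the iavl releases current at the time of writing (newer releases may take extra options), and the concrete number below is just the 100x suggestion, not a tuned value.

```go
// Sketch: pass a much larger cache size when constructing the mutable tree.
package sketch

import (
	"github.com/cosmos/iavl"
	dbm "github.com/tendermint/tm-db"
)

func newTree(db dbm.DB) (*iavl.MutableTree, error) {
	const cacheSize = 1_000_000 // illustrative: ~100x a small default, counted in nodes cached
	return iavl.NewMutableTree(db, cacheSize)
}
```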
https://github.com/cosmos/cosmos-sdk/issues/1714
Whenever we prune, we should force a compaction to take place. https://twitter.com/valnodes/status/1435465250803945472
I suggest we add a method to the TM-DB interface, "ForceCompaction", and whenever we prune in Tendermint block storage and IAVL state storage, just call that.
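A sketch of what that interface addition might look like; ForceCompaction is not an existing tm-db method, this is only the shape of the proposal.

```go
// Hypothetical sketch of the suggested TM-DB interface addition.
package sketch

import dbm "github.com/tendermint/tm-db"

// CompactableDB extends the tm-db interface with an explicit compaction hook.
type CompactableDB interface {
	dbm.DB
	// ForceCompaction asks the underlying engine (e.g. goleveldb/rocksdb) to
	// compact its files; pruning code would call this after deleting old versions.
	ForceCompaction() error
}

// pruneAndCompact deletes old versions (elided) and then forces a compaction.
func pruneAndCompact(db CompactableDB) error {
	// ... prune old heights / IAVL versions here ...
	return db.ForceCompaction()
}
```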