
Rethinking Gossamer state trie

This document aims to showcase the differences between Gossamer's implementation of the state trie and the implementation made by Parity, and to discuss improvements for Gossamer that address existing challenges and issues.

Everything described below is based on the Polkadot spec.

State Storage Trie

Definition

Each Polkadot host implements a hash table storage where keys are used to access each data entry.
Keys and data can be of arbitrary size, but the length limits are determined by the algorithm used to hash the keys and encode the values.

Structure

In order to ensure the integrity of the state of the system, the stored data needs to be re-arranged and hashed in a radix tree, which hereafter we refer to as the State Trie or just Trie. This rearrangement is necessary to be able to compute the Merkle hash of the whole or part of the state storage, consistently and efficiently at any given time.
The trie is used to compute the Merkle root of the state, whose purpose is to authenticate the validity of the state database

Each key-value pair identifies a unique node in the trie.

Encoding

Each node is encoded using the following structure:

\(\text{Node Header} \;||\; \text{Partial Key} \;||\; \text{Node Subvalue}\)

Node header:

The node header has at least 1 byte that specifies the node variant and the partial key length.
The available node variants are:

\[ v = \left\{\begin{array}{lll}
01 & \text{Leaf} & p_l = 2^6 \\
10 & \text{Branch Node with } k_N \notin \mathcal{K} & p_l = 2^6 \\
11 & \text{Branch Node with } k_N \in \mathcal{K} & p_l = 2^6 \\
001 & \text{Leaf containing a hashed subvalue} & p_l = 2^5 \\
0001 & \text{Branch containing a hashed subvalue} & p_l = 2^4 \\
0000\,0000 & \text{Empty} & p_l = 0 \\
0000\,0001 & \text{Reserved for compact encoding} & \\
\end{array}\right. \]
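
For example, following the table above, a leaf node whose partial key is 4 nibbles long is encoded with the single header byte

\[ \underbrace{01}_{\text{variant}}\ \underbrace{000100}_{\text{partial key length}\,=\,4} = \mathtt{0x44} \]

If the partial key length does not fit in the available \(p_l\) bits, the length field saturates and additional length bytes follow, as described in the spec.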

Partial key

The partial key is the node's own key, while the full key is the concatenation of the partial keys of the nodes above it and their child indexes, followed by the node's own partial key.

Eg:

[Image unavailable: example trie showing each node's partial key and the resulting full keys]

In the image above, each node contains its own partial key, while the full key of a child is the concatenation of the root's partial key, the child's index, and the child's own partial key. For example, if the root's partial key is the nibble sequence [1] and its child at index 5 has partial key [a, 3], that child's full key is [1, 5, a, 3].

Note: in the case of the root node, partial key == full key.

Node Subvalue

The node subvalue is the value of the node. If the node is a leaf, it only contains the SCALE-encoded stored value, but if the node is a branch, its subvalue is the SCALE-encoded concatenation of 3 components:

  • ChildrenBitmap: indicates which child indexes of this branch contain nodes
  • StoredValue: the SCALE-encoded value contained in this node; it only makes sense if the node header is 11 or 0001
  • Children: the concatenation of the SCALE-encoded hashes of the encoded child nodes

\[ sv_N := \left\{\begin{array}{l}
\text{StoredValue}_{SC} \\
\text{Enc}_{SC}\left(\text{ChildrenBitmap}(N) \;||\; \text{StoredValue}_{SC} \;||\; \text{Enc}_{SC}(H(N_{C_1})) \;||\; \dots \;||\; \text{Enc}_{SC}(H(N_{C_n}))\right)
\end{array}\right. \]


Given the previous definitions, I'm going to detail how Parity implemented their storage and compare it with Gossamer.

Parity's Trie

https://github.com/paritytech/trie

Description

They describe their implementation as:

A generic implementation of the Base-16 Modified Merkle Tree ("Trie") data structure, provided under the

And that's what it is, a DB-backed radix-16 Merkle trie that supports different hashing algorithms and configurations.

Some of its features include:

Generic encode and decode

Parity's trie defines a generic way to encode and decode nodes by implementing the following trait:

pub trait NodeCodec: Sized {
    const ESCAPE_HEADER: Option<u8> = None;
    type Error: Error;
    type HashOut: AsRef<[u8]>
        + AsMut<[u8]>
        + Default
        + MaybeDebug
        + PartialEq
        + Eq
        + hash::Hash
        + Send
        + Sync
        + Clone
        + Copy;

    fn hashed_null_node() -> Self::HashOut;

    fn decode_plan(data: &[u8]) -> Result<NodePlan, Self::Error>;

    fn decode<'a>(data: &'a [u8]) -> Result<Node<'a>, Self::Error> {
        Ok(Self::decode_plan(data)?.build(data))
    }

    fn is_empty_node(data: &[u8]) -> bool;

    fn empty_node() -> &'static [u8];
    
    fn leaf_node(partial: impl Iterator<Item = u8>, number_nibble: usize, value: Value) -> Vec<u8>;
    
    fn extension_node(
        partial: impl Iterator<Item = u8>,
        number_nibble: usize,
        child_ref: ChildReference<Self::HashOut>,
    ) -> Vec<u8>;
    
    fn branch_node(
        children: impl Iterator<Item = impl Borrow<Option<ChildReference<Self::HashOut>>>>,
        value: Option<Value>,
    ) -> Vec<u8>;
    
    fn branch_node_nibbled(
        partial: impl Iterator<Item = u8>,
        number_nibble: usize,
        children: impl Iterator<Item = impl Borrow<Option<ChildReference<Self::HashOut>>>>,
        value: Option<Value>,
    ) -> Vec<u8>;
}

Generic Hasher

They defined a trait that describes an object able to hash a slice of bytes.
It contains a single hash(bytes) method that returns a value of the associated Out type.

pub trait Hasher: Sync + Send {
	type Out: AsRef<[u8]>
		+ AsMut<[u8]>
		+ Default
		+ MaybeDebug
		+ core::cmp::Ord
		+ PartialEq
		+ Eq
		+ hash::Hash
		+ Send
		+ Sync
		+ Clone
		+ Copy;

	type StdHasher: Sync + Send + Default + hash::Hasher;

	const LENGTH: usize;

	fn hash(x: &[u8]) -> Self::Out;
}

Trie Layout

Given the previous definitions, they created a trait that holds the trie configuration and makes it possible to create any kind of radix-16 trie from it.

pub trait TrieLayout {
    const USE_EXTENSION: bool;
    const ALLOW_EMPTY: bool = false;
    const MAX_INLINE_VALUE: Option<u32>;

    type Hash: Hasher;
    type Codec: NodeCodec<HashOut = <Self::Hash as Hasher>::Out>;
}

HashDB

This trait is used to define a DB that will be used as the source of truth for the Trie.

pub trait HashDB<H: Hasher, T>: Send + Sync + AsHashDB<H, T> {
    fn get(&self, key: &H::Out, prefix: Prefix) -> Option<T>;
    fn contains(&self, key: &H::Out, prefix: Prefix) -> bool;
    fn insert(&mut self, prefix: Prefix, value: &[u8]) -> H::Out;
    fn emplace(&mut self, key: H::Out, prefix: Prefix, value: T);
    fn remove(&mut self, key: &H::Out, prefix: Prefix);
}

To make this possible, and to be able to rebuild the trie structure, what is stored in this database are the encoded trie nodes.

Since V1 of the trie also supports storing values separately, these values are also stored in this DB when the size of the value is greater than or equal to 32 bytes.

| Key | Value |
| --- | --- |
| Hash(encoded_node) | encoded_node |
| Hash(value) | value |

Note: to prevent collisions, we should prefix the keys with the node prefix:
key = node prefix + hash

So, to build the trie, it is only necessary to know the hash of the root node, hash(encoded_root), fetch that node from the DB, and then decode the rest of the nodes by following its child references.

TrieDB

It is a wrapper over a DB that allows creating a "virtual" trie, supporting the same operations as a conventional trie but using a DB as the source of truth.

pub struct TrieDB<'db, 'cache, L>
where
    L: TrieLayout,
{
    db: &'db dyn HashDBRef<L::Hash, DBValue>,
    root: &'db TrieHash<L>,
    cache: Option<core::cell::RefCell<&'cache mut dyn TrieCache<L::Codec>>>,
    recorder: Option<core::cell::RefCell<&'cache mut dyn TrieRecorder<TrieHash<L>>>>,
}

Since reading or writing directly on the DB would generate a lot of overhead, this structure takes care of accumulating write operations and applying them only when the commit() method is executed. It also uses a cache to optimize read times for both nodes and values.

Lazy loading

Given the structure above, it's easy to generate an algorithm that looks for a value in the DB knowing the root of the tree hash(encoded_root_node).

[Diagram unavailable: lookup algorithm that walks from the root hash down to the requested key, fetching each node from the DB]

If we have a cache, it's easy to add a check before each DB access to see whether the node or value is already cached and, if so, return it from there.

Gossamer Trie

Description

The type of trie used in Gossamer complies with the structure of a radix-16 Merkle trie, so it is compatible with the Substrate implementation. However, it has the disadvantage that the Gossamer trie is kept 100% in memory.

So basically, a Trie in Gossamer is defined by a root Node, where every node in the trie has the following structure:

type Node struct {
    PartialKey    []byte
    StorageValue  []byte
    IsHashedValue bool
    Children      []*Node
    Descendants   uint32

    // Fields used internally:
    Generation  uint64
    Dirty       bool
    MerkleValue []byte
}

This means that, unlike Parity, where the TrieDB struct acts as a wrapper over a DB, Gossamer keeps both the trie structure and its data loaded in memory at all times.

The only times it interacts with a database are when storing the trie using WriteDirty(), which encodes and stores the entire trie in the DB, and when loading it using Load(), which, given a DB and a root hash, reconstructs the entire trie by reading the data from the DB and creating the nodes accordingly.

Disadvantages

The main disadvantages of this approach are the memory consumption and CPU usage when loading the trie, but also the lack of flexibility and extensibility.

Memory consumption

Since the entire trie is always in memory, we must store not only the data necessary to maintain the trie structure (nodes, children) but also the values.

As a consequence, until a block is finalized we cannot prune nodes from the trie, so we normally keep more than one trie in memory at the same time, which further increases memory consumption.

Startup time

Since we must load the trie from our DB when starting a node using the last finalized block, it is necessary to go through the database executing the trie loading process, decoding the nodes, and loading them into memory.

This process is very slow, and as a consequence, if we want to start a node using a database snapshot, the startup time is proportional to the size of the snapshot.

Flexibility and extensibility

This trie does not support a generic configuration; currently, the hash algorithm it uses is "hardcoded" in the trie implementation.

The same happens with the trie encoding and decoding algorithm; it can be extended in terms of functionalities but it is difficult to create a different implementation and start using it for some tries and not for others.

Proposed changes

Since replicating Parity's trie implementation would necessitate almost replacing our entire codebase with their logic, and considering that all these changes are interconnected, this would require creating a single, large PR encompassing all the modifications. This approach could demand a significant amount of time not only from the implementer but also from the reviewers. Moreover, after developing the library, we would need to modify other parts of our code to start utilizing it. Therefore, I propose reusing as many parts of our existing code as possible, such as the encode and decode functions for nodes, merkle proof generation, and verification, while replacing the necessary components to add lazy loading. For example: loading the trie from the database, modifying trie and node structures, and introducing new structures.

Furthermore, to streamline the review process and deliver value in a short period, I propose implementing these changes incrementally, following this plan:

Phase 0

In order to be ready for the changes that follow, I believe we should do a small refactor first:

PR #1:

  • Move all trie-related code to a single package in /pkg/trie.

PR #2:

  • Create a Trie interface with the minimum necessary methods for a trie, such as: get, insert, remove, contains, and root (see the sketch after this list)
  • Rename the current Trie struct to InMemoryTrie.
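
As an illustration only, a minimal sketch of what that interface could look like (the method set and the Hash type are assumptions, not a final API):

// Trie is a hypothetical minimal interface that both the current
// in-memory trie and the future TrieDB could satisfy.
type Trie interface {
    Get(key []byte) ([]byte, error)
    Put(key, value []byte) error
    Delete(key []byte) error
    Contains(key []byte) (bool, error)
    Root() (Hash, error) // Merkle root of the trie
}

// Hash is a placeholder for the output type of the trie's hash algorithm.
type Hash [32]byte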

Phase 1

PR #3

To improve node handling, I propose replacing our current Node structure with an enum-oriented one, to differentiate node kinds more easily and in a more structured way.

Along with this, we would replace the current StorageValue []byte and IsHashedValue bool fields with an enum NodeValue as follows:

NodeValue:

type (
    // Value bytes as stored in a trie node
    Inline struct {
        Bytes []byte
    }
    // Containing a hash used to lookup in db for real value
    Hashed struct {
        Hash []byte
    }
)

Node:

type (
    // Leaf always contains values
    Leaf struct {
        PartialKey []byte
        Value      NodeValue
    }
    // Use pointer to describe optional child / value
    Branch struct {
        PartialKey []byte
        Children   [16]*Node
        Value      *NodeValue
    }
)
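
Since Go has no native enums, one possible way to model Node and NodeValue as closed sets of variants (purely illustrative) is a sealed interface with unexported marker methods:

// NodeValue is implemented only by Inline and Hashed.
type NodeValue interface{ isNodeValue() }

func (Inline) isNodeValue() {}
func (Hashed) isNodeValue() {}

// Node is implemented only by Leaf and Branch.
type Node interface{ isNode() }

func (Leaf) isNode()   {}
func (Branch) isNode() {}

This keeps the variant sets closed to the package while still allowing type switches wherever a Node or NodeValue needs to be inspected.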

Phase 2

In this phase we would add lazy loading, so we need to introduce the concept of a TrieDB that implements the Trie interface and uses a DB as its data source.

Consider the following naive implementation assuming that:

  1. Every write writes directly to the database.
  2. Every read reads directly from the database.

In this case, what we would need to do is the following:

Stop building the in-memory trie

We must modify our node enum so that the Children of a branch are no longer other nodes; since the encoded branch contains the SCALE-encoded hashes of its encoded children, we can replace them with a list of node hashes instead.

Branch struct {
    PartialKey []byte
    Children   [16]*Hash // Hash == Hash(encoded_node)
    Value      *NodeValue
}

Where Hash is the type of data returned by the hash algorithm we are using.

Lookup

We can create a Lookup structure responsible for traversing the trie looking for a node given its fullKey and returning the value, following the algorithm shown before:

[Diagram unavailable: lookup algorithm traversing the trie from the root hash, fetching each node from the DB]
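
A minimal sketch of such a Lookup, assuming the node variants from Phase 1, a simple key-value DB interface, and helper functions keyToNibbles and decodeNode (all names here are illustrative, not a final API):

import "bytes"

// DB is an assumed minimal key-value store holding
// Hash(encoded_node) -> encoded_node entries.
type DB interface {
    Get(key []byte) ([]byte, error)
}

// Lookup lazily traverses the trie, fetching each node from the DB
// only when it is needed.
type Lookup struct {
    db   DB
    root Hash // Hash(encoded_root_node)
}

func (l *Lookup) Get(fullKey []byte) ([]byte, error) {
    nibbles := keyToNibbles(fullKey) // assumed helper: bytes -> nibbles
    hash := l.root

    for {
        encoded, err := l.db.Get(hash[:])
        if err != nil {
            return nil, err
        }
        node, err := decodeNode(encoded) // assumed helper: decode per spec
        if err != nil {
            return nil, err
        }

        switch n := node.(type) {
        case Leaf:
            if bytes.Equal(nibbles, n.PartialKey) {
                return l.loadValue(n.Value)
            }
            return nil, nil // key not found
        case Branch:
            if !bytes.HasPrefix(nibbles, n.PartialKey) {
                return nil, nil // key not found
            }
            nibbles = nibbles[len(n.PartialKey):]
            if len(nibbles) == 0 {
                if n.Value == nil {
                    return nil, nil // branch exists but holds no value
                }
                return l.loadValue(*n.Value)
            }
            child := n.Children[nibbles[0]]
            if child == nil {
                return nil, nil // key not found
            }
            hash, nibbles = *child, nibbles[1:]
        }
    }
}

// loadValue resolves a NodeValue: inline values are returned directly,
// hashed values need one extra DB read.
func (l *Lookup) loadValue(v NodeValue) ([]byte, error) {
    switch val := v.(type) {
    case Inline:
        return val.Bytes, nil
    case Hashed:
        return l.db.Get(val.Hash)
    }
    return nil, nil
}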

TrieDB

Given that our concept of a trie no longer refers to a fully pre-built in-memory structure but rather to the way we read, traverse, and store the keys and values of our DB, we need to introduce the concept of a TrieDB: a wrapper over a DB, based on a rootHash, that uses a TrieLayout as a configuration holder, similar to Parity's implementation.

type TrieLayout interface {
    MaxInlineValue() *int
}

type TrieDB struct {
    db         DB
    root       Hash
    trieLayout TrieLayout
}

This structure will implement all the methods defined in the Trie interface introduced in Phase 0, so we can easily replace InMemoryTrie with TrieDB where our code uses it, and all the tests should keep working.

We can break down this implementation into different steps:

Step 1 - Read only TrieDB

To start with the minimum required logic, the first step is to create a read-only TrieDB capable of lazily loading nodes from the DB and compliant with the Trie interface, but without a specific implementation for the write methods (put, delete, etc.).

This is an easy way to introduce TrieDB without adding too much complexity.
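
As a tiny illustration of this step (names are assumptions), reads could delegate to the Lookup sketched above while the write methods simply report that they are unsupported:

import "errors"

// Get delegates to the lazy Lookup shown earlier.
func (t *TrieDB) Get(key []byte) ([]byte, error) {
    lookup := Lookup{db: t.db, root: t.root}
    return lookup.Get(key)
}

// Put is not implemented yet; write support arrives in Step 3.
func (t *TrieDB) Put(key, value []byte) error {
    return errors.New("TrieDB is read-only")
}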

Step 2 - Read only TrieDB with cache

Since the previous change reads and decodes nodes directly from the DB, it could be slow even on an SSD, so we have to introduce a caching mechanism to speed up read operations.

This should be a shared cache between different TrieDB instances.

We can easily add a cache as part of the TrieDB structure and use it when necessary, mainly modifying the Lookup structure but potentially also using it in methods where we need to know the trie structure to add, modify, or delete nodes.

type TrieDB struct {
    db         DB
    root       Hash
    trieLayout TrieLayout
    cache      TrieCache
}
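
A possible shape for this cache (illustrative only) is an interface that caches decoded nodes and values by their hash, so several TrieDB instances can share one instance:

// TrieCache is a hypothetical cache shared between TrieDB instances.
// Implementations should be safe for concurrent use.
type TrieCache interface {
    GetNode(hash []byte) Node // returns nil on cache miss
    SetNode(hash []byte, node Node)
    GetValue(hash []byte) []byte // returns nil on cache miss
    SetValue(hash []byte, value []byte)
}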

Step 3 - Add write capabilities to TrieDB

Now that the read operations have been tested and improved, we can focus on write operations. We have to add every put and delete method for key-values and child tries.

These writes will be made directly to the DB; we are going to improve this later.

Step 4 - Improve writings

To avoid performing every write directly against the database each time something changes, we are going to introduce a mechanism to accumulate changes and commit them only when needed.

To continue complying with the Trie interface, we must call this function in a method that is already defined in that interface.

I propose using the same strategy as Substrate and executing the commit() in the root() method.

func (tdb *TrieDB) Root() Hash {
    tdb.commit()
    return tdb.root
}
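
As a sketch of that accumulation mechanism (field and method names are assumptions), TrieDB could keep a pending change set and flush it in commit():

type TrieDB struct {
    db      DB
    root    Hash
    pending map[string][]byte // accumulated key -> value changes; nil marks a deletion
}

func (tdb *TrieDB) Put(key, value []byte) error {
    tdb.pending[string(key)] = value
    return nil
}

func (tdb *TrieDB) Delete(key []byte) error {
    tdb.pending[string(key)] = nil
    return nil
}

// commit applies the accumulated changes, writing the new or modified
// nodes to the DB and recomputing tdb.root (trie rewriting omitted here).
func (tdb *TrieDB) commit() {
    for key, value := range tdb.pending {
        tdb.apply([]byte(key), value) // assumed helper that rewrites the affected path
    }
    tdb.pending = make(map[string][]byte)
}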

Note: this TrieDB is not responsible for handling state transactions; we should solve that in a different layer on top of this one. That layer will accumulate and apply temporary "change sets", and when a transaction commit is executed, the resulting change set will be applied using this TrieDB API.

Phase 3

To finish the migration, we are going to change every place where we currently use in-memory tries to use the new TrieDB with lazy loading, and verify that everything still works.

Bonus track

Benchmarking

It would be great if we could measure the improvement in different aspects (a benchmark sketch follows this list):

  • Memory usage should be lower with lazy loading than with in-memory tries
  • Read and write performance shouldn't be considerably worse with the new implementation
  • Node startup time with lazy loading should be better than with in-memory tries
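
For read performance, a hypothetical Go benchmark could look like the following (newPopulatedTrieDB and testKeys are assumed test helpers, not existing code):

import "testing"

// BenchmarkTrieDBGet measures lookups against a pre-populated TrieDB.
func BenchmarkTrieDBGet(b *testing.B) {
    trie := newPopulatedTrieDB(b) // assumed helper: builds a TrieDB with test data
    keys := testKeys()            // assumed helper: returns the inserted keys

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := trie.Get(keys[i%len(keys)]); err != nil {
            b.Fatal(err)
        }
    }
}

An equivalent benchmark against InMemoryTrie would give the baseline to compare against.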

Use generic hash type

We can change our TrieDB and our structures to receive a generic type representing the hash algorithm's output type.

TrieDB[H]

type TrieDB[H Hash] struct {
    db         DB[H]
    root       H
    trieLayout TrieLayout[H]
    cache      TrieCache[H]
}

NodeValue[H]:

type (
    // Value bytes as stored in a trie node
    Inline struct {
        Bytes []byte
    }
    // Containing a hash used to lookup in db for real value
    Hashed[H Hash] struct {
        Hash H
    }
)

Node[H]:

type (
    // Leaf always contains values
    Leaf[H Hash] struct {
        PartialKey []byte
        Value      NodeValue[H]
    }
    // Use pointer to describe optional child / value
    Branch[H Hash] struct {
        PartialKey []byte
        Children   [16]*H
        Value      *NodeValue[H]
    }
)

Hasher interface

It would be great if we could add a hasher interface to create different hashers for the keys in our database. This will allow us to easily change the hash algorithm in the future.

type Hasher[H Hash] interface {
    Length() int
    Hash(value []byte) H
    FromBytes(value []byte) H
}
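
For instance, assuming a Hash256 type ([32]byte) satisfying the Hash constraint, a Blake2b-256 hasher could be built on top of golang.org/x/crypto/blake2b:

import "golang.org/x/crypto/blake2b"

// Hash256 is an assumed 32-byte hash type used as the generic parameter.
type Hash256 [32]byte

// Blake2b256Hasher implements Hasher[Hash256].
type Blake2b256Hasher struct{}

func (Blake2b256Hasher) Length() int { return 32 }

func (Blake2b256Hasher) Hash(value []byte) Hash256 {
    return Hash256(blake2b.Sum256(value))
}

func (Blake2b256Hasher) FromBytes(value []byte) Hash256 {
    var h Hash256
    copy(h[:], value)
    return h
}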

NodeCodec interface

We can implement an interface similar to Parity's NodeCodec trait to be able to use different encode and decode strategies in the future

type NodeCodec[H hashdb.HashOut] interface {
    HashedNullNode() H
    Hasher() hashdb.Hasher[H]
    EmptyNode() []byte
    LeafNode(partialKey nibble.NibbleSlice, numberNibble int, value Value) []byte
    BranchNodeNibbled(
        partialKey nibble.NibbleSlice,
        numberNibble int,
        children [16]ChildReference[H],
        value *Value,
    ) []byte
    Decode(data []byte) (Node, error)
}

TrieLayout changes

We can then add the generic Codec to our TrieLayout and use it in our TrieDB with a few changes to the codebase.

type TrieLayout[H Hash] interface {
    MaxInlineValue() *int
    Codec() NodeCodec[H]
}

Conclusions

With these changes, it should be sufficient to support lazy loading while maintaining compatibility with the rest of our code, without necessitating the replacement of 100% of our implementation.

Furthermore, this approach helps us deliver value across different iterations, as well as simplifying the coding and review process.