256+ trie. Definition

*Evaldas Drąsutis IOTA Foundation* # 256+ trie. Definition ## Introduction We describe a deterministic trie structure which can be built from any collection of key value pairs. We call it **256+ trie** because while each node has 256 children, it still has to commit to the vector of 257 values. The 257th is a commitment to the terminal value or its absence. The 256+ trie is represented as a collection of key/value pairs with variable length keys. ## Peliminaries *Strings* are sequences of bytes, possibly empty (zero-length). $s[i]$ is an integer value of *i*-th byte of the string $s$. Zero-length string is denoted as $''$. $'c'$ is a one byte -long string with the integer value $c$ of its only byte. *Key/value pair* is denoted as $[k, v]$, where *key* $k$ is a string, possibly empty, and *value* $v$ is a non-empty string. A *key/value collection* (also, a *collection*) is a set $S = \{[k_i,v_i]|i=0\dots n-1 \}$ where each key $k_i$ is **unique** in the collection, i.e. the collection can't contain two pairs with same key. We assume each collection is **ordered lexicographically** by its keys. Each key/value collection is usually constructed using sequence of operations $SET(S,k,v)$ which adds new or modifies existing key/value pair $[k,v]$. By definition, operation $SET(S, k, nil)$ removes key/value pair with key $k$ from the collection. It can be denoted as $DEL(S, k) = SET(S, k, nil)$ $GET(S, k)$ will retrieve value $v$ from key/value pair $[k,v]$ by the given key $k$ if it exists in the collection. Any key/value collection $S$ is a **deterministic** structure in a sense that $S$ does not depend on the sequence of operations how it was constructed, i.e. only final result (the *state*) matters. Appending string $s$ to the prefix $p$ (concatenation) is denoted as $p \oplus s$. $a\ominus p$ is the result of removing prefix $p$ from $a$, if $a$ has the prefix $p$, otherwise undefined. Always true the following: $a=(p \oplus a) \ominus p$ For a collection $S$, the $p\oplus S = \{[p\oplus k, v] | [k,v]\in S\}$ is a new collection of key/value pairs which is a result of prefixing each of its keys with $p$. Similarly, notation $S\ominus p$ means removing a common prefix $p$ from all keys, if it exists. We denote $prefix(S)$ as **longest common prefix** of all keys in the collection of key/value pairs $S$. $prefix(S)$ is an empty string if keys do not have common prefix. The following all is true: * $S = prefix(S) \oplus (S\ominus prefix(S)$ * if $S = prefix(S) \oplus S'$ then $prefix(S') = ''$ and $S' = S \ominus prefix(S)$. ## Trie Let's say we have an arbitrary collection of key/value pairs $S$, a *state*. We aim to build a *trie*, another collection of key/value pairs $trie(S)$. The $trie(S)$ is build deterministically, along changes in the $S$. There's 1-to-1 corrrespondence between ties and states. This fact can be derived from the way we construct $trie(S)$ described below. ### Tree and nodes The trie will represent a *tree* which will be used to calculate *commitment* to $S$. Each key/value pair in the trie represents a *node*, a vertice of the tree, which can be retrieved by its key: $[k, node]$. The trie will contain *nodes* as values of key/value pairs $[k, node]$, in the form of binary representation (serialized). Further on we will not draw a distinction between a node and its binary representation. The *root* node of the non-empty trie is always at the empty key $''$: i.e.: * $root(trie(S)) = GET(trie(S), '')$ * $rootPair(trie(S)) = ['', root(S)]$ Each node is a triplet: $node(PF, CHILDREN, C_T)$, where: * $PF$ is a string, called *path fragment* (see below) * $CHILDREN$ is a 256-long vector of commitments to its children $(C_0, C_1, \dots , C_{255})$. $C_j=nil$ means child is absent. * $C_T$ is a commitment to the terminal value $v$ if present, otherwise $nil$ (see below) The commitment $C_T=C(v)$ to the terminal value $v$ (a binary data) is a hash of it, usually adjusted to the modulus of a finite field. The *vector commitment* to the node $C(k) = C(GET(S, k)) = C([k, node]) = C(node) = C(V)$, where $V=(C_0, C_1, \dots , C_{255}, C_T)$, is a 257-long vector of values (in our implementation of the *verkle trie* $C(V)$ is a point on the eliptic curve). Any value $C_i$ and $C_T$ may be $nil$ if there's no commitment to the corresponding node or value. We will use $NIL$ for the 256-long vector with no commitments, i.e. $(nil, nil, \dots, nil)$ For the definition of the trie it is not important how exactly we calculate the commitment $C$ to the vector. **Commitment to the state** (a key/value collection) $S$ is a commitment to the root node of its trie: $$ C(S) = C(root(trie(S))) $$ ### Defining the trie for the state Let's say a collection $S$ is a state. We will define $trie(S)$ recursively, by splitting state into sub-states and defining *trie* for them. #### Base 1. Let's define $trie(\emptyset)=\emptyset$ 2. Let's say $S$ contains exactly one key/value pair: $S=\{[k,v]\}$. Then $trie(S)$ will also contain one node, which is also the root node: $$ trie(S) = \{['', node(k, NIL, C(v))]\} $$ The only node (a key/value pair) above will have the empty string $''$ as its key. $NIL$ above means the node commits to no children. The terminal committment $C_T=C(v)$ is a commitment to the value $v$, usually hash of it. #### Recursion Now let's say $S$ has more than one key/value pair. We define its trie by splitting the collection $S$ into non-intersecting sub-states and recursively defining trie for each of them as the children of the root node of $S$. The keys of the key value pairs remain encoded into the keys and path fragments of the trie nodes. *Step 1* We split the collection $S$ into non-intersecting collections the following way: 1. we take $prefix(S)$, the longest common prefix of all keys. It may or may not be an empty string 2. we remove the $prefix(S)$ from all keys in $S$. The result is a new collection $S\ominus prefix(S)$ 3. If the collection $S\ominus prefix(S)$ contains a pair with empty key $['', v_T]$, the terminal value v, we remove that key from the collection, so let's define $S^{sub} = \{[k, v] \in S\ominus prefix(S)|k \not= ''\}$. 4. we split $S^{sub}$ into non-intersecting collections by the value of the first byte of its key (it always exist). This results in a 256 sub-collections $S^{sub}_0,\dots,S^{sub}_{255}$. Each $S^{sub}_0$ has $'i'$ as it's first byte of its key (and a common prefix for all keys). Thus $S^{sub}_{42}$ will contain all key/value pairs of with value $42$ of its first byte. Now let's remove that first byte: $S'_i = S^{sub}_i \ominus 'i'$. We will have 256 collections $S'_0, \dots, S'_{255}$ Let's define $S'$, a collection split into 256 non-intersecting sub-collections: $$ S' = S'_0 \cup S'_1 \cup \dots \cup S'_{255} $$ *Step 2* We define the root of $trie(S)$ as: $$ root(S) = node(prefix(S), CC(S), C(v_T)) \\rootPair(S) = ['', root(S)] $$ where $$ CC(S) = (C_0(S), C_1(S), \dots, C_{255}(S)) \\C_i(S) = C(root(S'_i)) \\v = GET('', S) $$ So, the root node of $trie(S)$ will contain common prefix as a *path fragment*, vector of commitments to its children and a commitment to the terminal value at key $''$, if any. <img src="https://i.imgur.com/K18eHlM.jpg" width="500"> *Step 3* Lets define child tries. $$ CHILD_i(S) = (prefix(S)\oplus 'i') \oplus trie(S'_i), i=0 \dots 255 $$ So, the trie of the *i*-th child sub-trie has common pefix of $prefix(S)\oplus 'i'$, the result of appending $'i'$ to the end of the $prefix(S)$. By the very construction, the key is unique in the *trie*. *Step 4* Finally, $trie(S) = \{rootPair(S)\} \cup CHILD_0(S) \cup \dots \cup CHILD_{255}(S)$ ### Examples ### Example collection Let's say we have a collection of key pairs: | Key | Value | | -------- | ------- | | 'abc' | 'd1' | | 'abc1' | 'd2' | | 'abc123' | 'd3' | | 'abc134' | 'd4' | | 'abc2' | 'd5' | | 'abdefgh12' | 'd6' | | 'abdefgh23' | 'd7' | | 'abefgh23' | 'd8' | #### Keys An axample of recursively splitting keys of the colection: Common prefixes are colored, path fragments are bolded. <img src="https://i.imgur.com/ZF1OG2f.jpg" width="200"> #### Trie Commitmens to nodes and terminal values are bolded. | node # | Key | Path fragment | Child commitments | Terminal commitment | | -- | ------- | --- | --- | --- | | #0 | '' | 'ab' | 'c'-**#1** 'd'-**#6** 'e'-**#9** | $nil$| | #1 | 'abc' | '' | '1'-**#2** '2'-**#5** | **'d1'** | | #2 | 'abc1' | '' | '2'-**#3** '3'-**#4** | **'d2'** | | #3 | 'abc12' | '3' | $NIL$ | **'d3'** | | #4 | 'abc13' | '4' | $NIL$ | **'d4'** | | #5 | 'abc2' | '' | $NIL$ | **'d5'** | | #6 | 'abd' | 'efgh' | '1'-**#7** '2'-**#8** | $nil$ | | #7 | 'abdefgh1' | '2' | $NIL$ | **'d6'** | | #8 | 'abdefgh2' | '3' | $NIL$ | **'d7'** | | #9 | 'abe' | 'fgh23'| $NIL$ | **'d8'** | #### Tree <img src="https://i.imgur.com/CgzFYFs.png" width="800"> ## Operations on the trie $S' = SET(S, k, v)$ mutates the state $S$ by adding new key value pair $[k,v]$ to the state $S$ or modyfying its value. We want to define mutation of the $trie(SET(S, k, v))$ as efficient as possible. Same appplies to the opretation $DEL(S, k)$ ### Properties of the trie *(the assertions above are kind of obvious, can be proven by somebody else :) )* *Assertion 1. There's exactly one proof path for any terminal value in the state* $S$ For any key/value pair string $[k,v] \in S$, there's exactly one path of nodes, i.e. a sequence $proofPath(k, v) =(n_0, \dots , n_{p-1})$, $p>0$, where $n_i \in trie(S)$, $n_i = [k_i, node(PF_i, CHILDREN_i, C^T_i)]$ is a node in the trie with the property: * $C(root(S))=C(n_0)$ * $C(v) = C_{p-1}^T$ * for $i>0$ the following all is true: * $P_i=k_{i-1} \oplus PF_i$ is a prefix of $k$ * $CHILDREN_{i-1}[length(P_{i-1})] = C(n_i)$, i.e the commitment at the index $length(P_{i-1})$ (byte value in the key at the position next after the prefix $P_i$ ) is equal to the vector commitment to the correspoinding child node. So, the proof path is a sequence of nodes from the node corresponding to the key $k$ an each node contain chain of committments starting from the value $v$ up to the root of the trie where next commitment commits to the previous one. *Assertion 2.* For non-empty $s$, there's exactly one point of insertion $insert(S, s) \in trie(S)$ for any key/value pair $[s,v]$. If $S \not= \emptyset$, for any string $s$ let's define the *point of insertion* as: $insert(S, s)=[k, node(PF, CHILDREN, C_T)] \in trie(S)$ where exactly one of two is valid: * *option 1*. either $k\oplus PF =s$ * *option 2*. or $k$ is a prefix of $s$ and $k\oplus PF \not=s$ and $[s,v] \notin S$ In other words, an arbitrary string either already has a node in the trie where to put (or update existing) its terminal value, or otherwise the key is not presented in $S$. The node $insert(S,k)$ can be found by starting from the root and following the path along keys, *pathFragments* and child indices. The search ends in one of two: * Seach ends with the *option 1* at the node the commitment to the terminal value (it may be $nil$ in case of absence of commitment). In this case the nodes along the path proofs the existence or absence of the current terminal value in the state * otherwise it ends with *option 2* i.e. at the node where we can insert a new node. ### Insert/update operation Let's we want to insert a key/value pair $[k,v]$ into the collection (state) $S$. If $insert(S, s)=[k, node(PF, CHILDREN, C_T)] = INS$ satisfies condition of *option 1*, we only need to update existing $C_T$ with the new commitment $C(v)$ in the node and propagate this change along commitment back to the root. Otherwise, $k$ is prefix of $s$, however $k \oplus PF \not= s$. It also implies key $s$ is not in the state (otherwise we'd end up with *option 1*) and we want to insert the new key/value pair. *(The operation results into modifying node $INS$ and adding one or two new nodes to the trie. Then propagating commitments back to the root. We skip details here)* The insert/update operation: * require modification of 1 node * require adding 0, 1 or 2 new nodes (new keys) * does not require any removal of existing keys from the trie * the complexity of the operation is proportional to the depth of the tree, i.e. $O(log_{256}(card(S)))$ ### Delete operation Let's say we want to delete $[k,v] \in S$ and we to modify $trie(S)$ accordingly. Search along the tries gives us $INS=insert(S,k)=[k, node(PF, CHILDREN, C_T)]$ with the *option 1*. It means, to update the trie we need: * set $C_T=-nil$ (no terminal commitment) * if the vector $CHILDREN$ contains 0 commitments, we delete the node $INS$ and propagate update in commitments back up to the root * if the vector $CHILDREN$ contains exactly 1 commitment at index $i$, we: * retrieve the corresponding i*th* child $INS_i$ * put all commitments from $INS_i$ into $INS$ * update $PF$ with concatenation of $PF$ * the *pathFragment* from the $INS_i$. Then we delete $INS_i$. * otherwise we do not reorg the trie. Note that deleting a key/value pair may result in depetion of up to 1 node and does not require additional nodes. * the complexity of the operation is proportional to the depth of the tree, i.e. $O(log_{256}(card(S)))$ ## Implementation considerations The prototype implementation can be found in the [GitHub respository](https://github.com/lunfardo314/verkle). Some observations: * Assuming random enough tails of long keys, the tail is kept in the path fragment and keys of the trie should be quite short * Keys of the trie expected to be short due to the very wide tree. According to emulations, if keys are random, the average key length in the trie can be 3-4, in more realistic setting may be 5-6. * By its construction, it is deterministic * Fully balanced tree depth 4 is able to hold up to $2^{32} > 4 \times 10^{9}$ terminal values. * Path in the trie corresponding to a key/value pair is collected recursively by augmenting the key with path fragment and child index. * We can calulate commitments to sub-state, represented by some prefix of its keys. Even if the prefix does not exactly represent a ley of a trie node, the commitment to the sub-state is the commitment of the last node found by traversing along the key path. The node structure allows efficient encoding of it. One node requires: * 1-2 bytes for the lengh of the *pathFragment*. * bytes for *pathFragment*. It is expected to be at the average $keyLength / pathLength$, where $pathLength = O(log_{256}(card(S)))$. * most of commitment vectors are expected to be sparce, i.e. with just few non-nil values. Thus, the vector can be compressed very much by adding 256 one bit flags (32 bytes) * since a lot (90%+) of nodes will contain only terminal commitment, the children flags may be compressed even further * a node with empty pathFragment and the only terminal commitment (32 byte hash) will only require 34 bytes to encode node. Repeating patern of the node "empty pathFragment and the only terminal commitment" would allow further optimisation and compression of the trie without increasing computational complexity.