# Efficient record storage in the Holochain DHT
One of the aims of Holochain is to replace a conventional database as the primary storage for web applications. One thing that databases excel at is dealing with large numbers of records and returning them in smaller pages. For example, in MySQL a subset of the posts to a blog could be retrieved by calling:
`SELECT * FROM posts LIMIT 50,100;`
This returns up to 100 posts starting at offset 50. The table may contain millions of other posts which can be largely ignored.
## Naïve approach in Holochain
There are several problems with replicating similar behaviour in the Holochain DHT. At the moment the most common way of implementing a similar query is through anchors and links: a `posts` anchor (B, for base) acts as the base, and when a user creates a post (e, for entry) it is added as a link.
```graphviz
digraph {
B -> {e1, e2, e3}
}
```
This has two crippling problems as the number of posts becomes large:
- calling `get_links` on the anchor must retrieve all of the posts. This may require millions of network requests as well as unbounded memory usage (in the WASM runtime).
- Whichever agents happen to hold the anchor in their neighbourhood also have to store all of the links. This leads to 'hotspots' in the DHT where there are vastly different storage requirements for different hash neighbourhoods.
A real solution to this problem should meet the following requirements:
- No base should have to store more than $N$ links
- Can return pages of results of size $n < N$
- Must be able to retrieve all records by iterating over pages
- Must work under network partitioning
## Slightly less naïve (but still dysfunctional) approach
This candidate solution has been referred to as the 'bucket chain'. It splits the monolithic anchor into a growing chain of buckets, each of which should hold $n \leq N$ entries.
```graphviz
digraph R {
{ rank=same b1 b2 b3 }
B -> b1 -> b2 -> b3
b1 -> {e1, e2, e3}
b2 -> {e4, e5, e6}
b3 -> {e7, e8}
}
```
The processes for inserting and retrieving are as follows:
### Inserting
1. Find the base/anchor hash
1. Follow the link to the first bucket
1. Check whether a next bucket exists. If it does, follow the link and repeat this step.
1. We are now at the end of the bucket chain. Check the number of entries linked:
    - If $< N$, add a link to the new entry. Done.
    - Else commit a new bucket, link it from the current bucket, and link the new entry to the new bucket (see the sketch after this list).
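Below is a minimal sketch of this insertion walk in Rust. The `Dht` trait, the `Hash` type, the `"next"`/`"entry"` tags and `N = 50` are all illustrative stand-ins for whatever your HDK version provides (the method names are modelled on the HDK's `get_links`/`create_link`, but the signatures here are not the real API):

```rust
/// Content-address placeholder; a real zome would use the HDK's hash type.
#[derive(Clone, PartialEq, Eq)]
struct Hash(String);

/// Hypothetical stand-in for the DHT operations available to a zome.
trait Dht {
    /// All links with the given tag hanging off `base`.
    fn get_links(&self, base: &Hash, tag: &str) -> Vec<Hash>;
    fn create_link(&mut self, base: &Hash, target: &Hash, tag: &str);
    /// Commit a fresh, empty bucket entry and return its hash.
    fn commit_bucket(&mut self) -> Hash;
}

const N: usize = 50; // maximum entry links per bucket (illustrative)

fn insert(dht: &mut impl Dht, anchor: &Hash, entry: &Hash) {
    // Steps 1-3: walk `next` links from the anchor to the chain's tail.
    let mut tail = anchor.clone();
    let mut at_anchor = true;
    while let Some(next) = dht.get_links(&tail, "next").into_iter().next() {
        tail = next;
        at_anchor = false;
    }
    // Step 4: the anchor itself holds no entries, and a full tail bucket
    // means the chain must grow; in both cases commit a fresh bucket first.
    if at_anchor || dht.get_links(&tail, "entry").len() >= N {
        let fresh = dht.commit_bucket();
        dht.create_link(&tail, &fresh, "next");
        tail = fresh;
    }
    dht.create_link(&tail, entry, "entry");
}
```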
### Retrieving
1. Find the base/anchor hash
1. Follow the link to the next bucket and call `get_links` to retrieve its entries
1. Repeat step 2 until all entries have been retrieved (or as many as you want)

The retrieval process can also be modified to begin at a non-root bucket, which facilitates paging (see the sketch below).
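Reusing the hypothetical `Dht` interface from the sketch above, paged retrieval could look something like this; the cursor returned alongside each page is simply the hash of the next bucket, so a caller can resume from anywhere in the chain:

```rust
/// Collect entries from the bucket `start` onwards until at least `want`
/// have been gathered, returning them with a cursor to the next unread
/// bucket. Pages are assembled from whole buckets, so a page may exceed
/// `want` by up to N - 1 entries.
fn get_page(dht: &impl Dht, start: &Hash, want: usize) -> (Vec<Hash>, Option<Hash>) {
    let mut entries = Vec::new();
    let mut bucket = Some(start.clone());
    while let Some(b) = bucket {
        entries.extend(dht.get_links(&b, "entry"));
        let next = dht.get_links(&b, "next").into_iter().next();
        if entries.len() >= want {
            return (entries, next); // enough for this page; hand back a cursor
        }
        bucket = next;
    }
    (entries, None) // chain exhausted
}
```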
### Problems
While at first glance this may seem to solve the problem, it does not hold up in a multi-agent setting. There is no guarantee that, between checking how many entries are in a bucket and committing your own entry, someone else has not also added theirs. This would result in $N+1$ entries in a bucket, so this scheme does not meet the first requirement.
There is also a more malicious attack: an agent could spin up a new node while offline, add $N$ unique entries to the first bucket, and then rejoin the network. Upon rejoining, their entries would be merged with the existing ones. Repeating this procedure with new agents means there could be an unbounded number of entries on any given bucket.
It may seem that proper validation could prevent the above attack, but there is no way to discern an attacker from a valid network partition, so validation cannot help in this case. (Validation that makes use of DHT state, like the number of links on a base, is a bad idea anyway.) These issues mean this approach is largely useless in a multi-agent hash table.
## Agent-centric approach
As with all things in Holochain, the answer is agent centricity! It is not possible to perform deterministic validation on DHT state, but it is possible to validate based on the local chain. Therefore, to use validation to ensure that no more than $N$ entries can be linked to a given base, we need to enforce via validation that only one particular agent can link to that base. This means that all of the entries/`add_link`s are in this agent's local chain storage, and validation can use this to enforce a count.
In this scheme we have a single bucket chain per agent, and a query must traverse the tree of all bucket chains to retrieve all the entries. This data structure has the added bonus of being filterable by author. Let $Ab_i$ be the $i^{th}$ bucket belonging to agent $A$ and $Ae_i$ be the $i^{th}$ entry created by agent $A$.
```graphviz
digraph R {
{ rank=same Ab1 Ab2 Ab3 }
{ rank=same Bb1 Bb2 }
root -> {Ab1, Bb1}
Ab1 -> Ab2 -> Ab3
Ab1 -> {Ae1, Ae2, Ae3}
Ab2 -> {Ae4, Ae5, Ae6}
Ab3 -> {Ae7, Ae8}
Bb1 -> Bb2
Bb1 -> {Be1, Be2, Be3}
Bb2 -> {Be4}
}
```
The following validation rules ensure the data structure remains valid:
- An agent can only add links to their own buckets
- An agent cannot add more than $N$ links to a bucket
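A sketch of those two rules as a validation function, reusing `Hash` and `N` from the earlier sketches. It assumes the validator is handed the link being validated, the owner of the base bucket (recoverable if buckets are named deterministically, as discussed under Problems below) and the author's local chain; the types and signature are illustrative, not the real HDK validation callback:

```rust
#[derive(Clone, PartialEq, Eq)]
struct AgentId(String);

/// Illustrative stand-in for a link-creation record on the local chain.
struct AddLink {
    author: AgentId,
    base: Hash,  // the bucket being linked from
    tag: String, // "entry" or "next"
}

fn validate_add_link(link: &AddLink, base_owner: &AgentId, local_chain: &[AddLink]) -> bool {
    // Rule 1: an agent may only add links to their own buckets.
    if link.author != *base_owner {
        return false;
    }
    // Rule 2: counting only the author's own chain (which is deterministic,
    // unlike global DHT state), the bucket may hold at most N entry links.
    if link.tag == "entry" {
        let existing = local_chain
            .iter()
            .filter(|l| l.base == link.base && l.tag == "entry")
            .count();
        return existing < N;
    }
    true
}
```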
### Inserting
The insertion algorithm is the same as above, except that an agent must follow the validation rules: they may only ever append to their own bucket chain.
### Retrieving
Retrieval is also the same for a single agent's bucket chain. If all agents' entries are required then this must be repeated for each agent's bucket chain, as sketched below.
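Concretely, fetching everyone's entries is just the single-chain walk repeated per agent. This sketch reuses `get_page` from earlier and assumes a hypothetical `first_bucket` helper that derives agent $A$'s $Ab_1$; a deterministic way to build it is sketched in the next subsection:

```rust
/// Hypothetical: derive the hash of an agent's first bucket. This is a
/// placeholder; a deterministic construction is sketched below.
fn first_bucket(agent: &AgentId) -> Hash {
    Hash(format!("posts:{}:1", agent.0))
}

/// Gather every entry across all known agents' bucket chains.
fn get_all(dht: &impl Dht, agents: &[AgentId]) -> Vec<Hash> {
    agents
        .iter()
        .flat_map(|agent| {
            // usize::MAX disables paging, so each whole chain is walked.
            let (entries, _) = get_page(dht, &first_bucket(agent), usize::MAX);
            entries
        })
        .collect()
}
```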
### Problems
Let us evaluate how this approach meets the requirements defined at the start. The validation ensures that every base, except for the root, is guaranteed to hold no more than $N$ links. If the agent buckets are named in a consistent way then it is possible to jump to a particular bucket directly (in fact the between-bucket links are not required), which enables paging queries. Furthermore, it is guaranteed that all but the last bucket will not change between calls (a great property for pagination).
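For instance, a consistent naming scheme could hash the root anchor name, the agent ID, and the bucket index together, letting any client compute the address of agent $A$'s $i^{th}$ bucket and start a query there. This sketch uses the `sha2` crate purely for illustration; a real zome would derive the address with the conductor's own entry hashing:

```rust
use sha2::{Digest, Sha256}; // illustrative choice; add `sha2` to Cargo.toml

/// Deterministically derive the address of `agent`'s bucket number `index`
/// under the `root` anchor: no chain walk and no between-bucket links needed.
fn bucket_hash(root: &str, agent: &AgentId, index: u64) -> Hash {
    let mut hasher = Sha256::new();
    hasher.update(root.as_bytes());
    hasher.update(agent.0.as_bytes());
    hasher.update(index.to_le_bytes());
    Hash(format!("{:x}", hasher.finalize()))
}
```

Jumping to page $i$ for an agent is then a single `get_links` on `bucket_hash("posts", &agent, i)`.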
The third requirement, retrieving all records by iterating over pages, is also met (provided there is no network partition, a caveat that applies to everything in Holochain), although retrieval may be much less efficient if there are many agents with under-filled buckets. And finally, this approach is eventually consistent provided there is not more than one agent with the same ID (which would be a disaster anyway).
The main limitation of this approach is the root anchor. There is no limit to the number of agents that may join, so the root may become a hotspot in the DHT. This could be resolved by not including it and having some other way of discovering agent IDs, such as social network graphs or methods outside of the application. This is good practice for scalability anyway, and it follows that you then only need to retrieve data for the agents you are connected to.
## Conclusion
The above agent-centric bucket tree approach appears to provide a good solution to the data retrieval and pagination problem for Holochain. There may be something I have not considered, so any feedback is appreciated. It seems especially well suited to social-network-style applications, where the network itself can be used to discover agents without requiring a root agent anchor.