# qDHT: Using fenwick trees to compute and store aggregate data about spacetime regions

## Approach

### DHT topology: spacetime cylinder

We can think of the entire DHT topologically as a cylinder, with the space axis wrapping around itself and time extending along its length: the base is fixed at the beginning of the network, and the other circular edge extends forever into the future. All ops exist somewhere on this cylinder. For the purposes of talking about the necessary concepts, it helps to unravel this cylinder into a simple plane, but just remember that topologically, the ends of space wrap around:

![](https://hackmd.io/_uploads/rJjIaHe5t.png)

The X's in the figures are where ops are located in space and time. (ha, I just noticed the typo, "opology", which is also appropriate, but that should be "topology")

Our DHT storage arc is simply a 1-D space interval, which describes a lengthwise strip on the cylinder, or a vertical strip of spacetime (two strips if the arc spans the origin). Thus, the storage arc describes a large rectangular region of spacetime.

Our approach to gossip can be thought of like this:

- find the rectangle in spacetime specified by our storage arc,
- partition it into a number of rectangles of varying sizes,
- take the hash of each rectangle, which is a combination of the hashes of all of the ops contained in that region. Call this hash the "fingerprint" of a region.
- send all of these fingerprints, along with some extra data like the total size of ops contained in each region, to our gossip partner, receiving the same from them,
- compare the collection of fingerprints sent by our partner to our own (which may be of different shapes than the ones we sent),
- if the diff is too large, ask our partner to send fingerprints of smaller regions for the mismatched areas,
- repeat until the diff is small enough, or until it can be determined that there is no feasible way to reduce the diff, and then send the diff as a collection of ops.

![](https://hackmd.io/_uploads/ryOr6Bx9K.png)

In order to efficiently compute and look up hashes of arbitrary regions of spacetime, we need a suitable data structure. 2D Fenwick trees over the XOR operator are ideal for this.

### About Fenwick trees

A Fenwick tree is typically used to compute a cumulative sum over an array of values, starting from index 0. It can be extended to higher dimensions, computing cumulative sums over areas and volumes starting at index 0. This can be used to compute the sum of any arbitrary region via a linear combination of cumulative sums. For instance, `sum(20, 30)` is equal to `sum(0, 30) - sum(0, 19)`.

We can do the same thing with hashes if we consider that the XOR of two hashes acts as both a sum and a difference simultaneously. Given a properly constructed "cumulative XOR" tree, we can compute the XOR fingerprint of arbitrary regions of spacetime by taking the XOR of the largest region (all regions begin at `(0, 0)`) and then XORing in the extraneous regions to remove them from the picture, leaving the XOR of just the bounded region we want. In 2D, every region calculation involves 4 queries, where the largest region hash is XOR'd with the 3 smaller ones contained within it.

Note that our use of Fenwick trees *absolutely depends* on being able to use a reversible combining operator like XOR, so that we can subtract out regions to form the specific regions we want. If XOR or similar becomes intractable, we have to rethink all of this.
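To make the region arithmetic concrete, here is a minimal sketch of a dense 2D Fenwick tree over XOR (illustrative only: the name `XorFenwick2D` and its shape are invented here, and a practical version would be sparse, like the implementation linked below):

```
/// Illustrative only: a dense 2D Fenwick tree over XOR.
/// Fingerprints are 32-byte hashes, combined byte-wise.
type Fingerprint = [u8; 32];

fn xor_assign(a: &mut Fingerprint, b: &Fingerprint) {
    for i in 0..32 {
        a[i] ^= b[i];
    }
}

struct XorFenwick2D {
    n: usize,                    // extent of the space dimension
    m: usize,                    // extent of the time dimension
    tree: Vec<Vec<Fingerprint>>, // 1-indexed internally, per BIT convention
}

impl XorFenwick2D {
    fn new(n: usize, m: usize) -> Self {
        Self { n, m, tree: vec![vec![[0u8; 32]; m + 1]; n + 1] }
    }

    /// Fold an op's hash into the tree at (space, time), 0-indexed.
    fn add(&mut self, x: usize, y: usize, hash: &Fingerprint) {
        let mut i = x + 1;
        while i <= self.n {
            let mut j = y + 1;
            while j <= self.m {
                xor_assign(&mut self.tree[i][j], hash);
                j += j & j.wrapping_neg(); // advance by lowest set bit
            }
            i += i & i.wrapping_neg();
        }
    }

    /// XOR of everything in the rectangle from (0, 0) to (x, y), inclusive.
    fn prefix(&self, x: usize, y: usize) -> Fingerprint {
        let mut acc = [0u8; 32];
        let mut i = x + 1;
        while i > 0 {
            let mut j = y + 1;
            while j > 0 {
                xor_assign(&mut acc, &self.tree[i][j]);
                j -= j & j.wrapping_neg();
            }
            i -= i & i.wrapping_neg();
        }
        acc
    }

    /// Fingerprint of an arbitrary rectangle via the 4-query trick:
    /// since XOR is its own inverse, "subtracting" a sub-region is the
    /// same as XORing its prefix in again.
    fn region(&self, x0: usize, x1: usize, y0: usize, y1: usize) -> Fingerprint {
        let mut acc = self.prefix(x1, y1);
        if x0 > 0 { xor_assign(&mut acc, &self.prefix(x0 - 1, y1)); }
        if y0 > 0 { xor_assign(&mut acc, &self.prefix(x1, y0 - 1)); }
        if x0 > 0 && y0 > 0 { xor_assign(&mut acc, &self.prefix(x0 - 1, y0 - 1)); }
        acc
    }
}

fn main() {
    let mut t = XorFenwick2D::new(16, 16);
    let (a, b) = ([1u8; 32], [2u8; 32]);
    t.add(3, 4, &a);
    t.add(10, 12, &b);
    // Only op `a` falls within the rectangle x in [0, 8], y in [0, 8]:
    assert_eq!(t.region(0, 8, 0, 8), a);
}
```

Because XOR is its own inverse, the same operation serves as both the "sum" and the "difference" here, which is exactly the reversibility requirement described above.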
There is a basic 2D sparse Fenwick tree implementation here: https://github.com/holochain/sparse-fenwick-rs

### Quick note on XOR collisions

It's not clear how probable XOR collisions are under this scheme. If there is any doubt on our end that the probability is negligible, we can mitigate it in a few ways:

#### Time-based nonces

(Props to @freesig for this idea)

If there is a small probability of collisions, we can periodically reshuffle all the hashes by including some nonce. We can either include the nonce in each hash before we XOR, or we can simply make the "zero" value of each node the hash of a nonce rather than all zeroes.

The nonce can be time-based, derived from broad periods of time measured from the genesis epoch, for instance one nonce every 24 hours. When nodes are gossiping, they advertise which nonce they are using, either explicitly in the initial handshake, or implicitly via the max timestamp of the ops they are sharing. If the nonces do not match, this implies either griefing, or the edge condition where we are close to the time boundary and one node has crossed over while the other hasn't. In both cases, the nodes can simply decline to gossip with each other, knowing that if both are honest, both will soon be on the same side of the time boundary and can gossip again in the near future.

## Quantization

Another thing we need to be careful of is the level of quantization we use when talking about regions of spacetime. The 2D Fenwick tree corresponds to successive layers of subdivision of spacetime. Every layer of depth down the tree splits spacetime in half, either along the space or the time dimension. So, the more specific we want to be about our regions, the more nodes we have to store and compute in the tree.

Initial experiments show that if we want to store arbitrary regions all the way down to the precision of $2^{32}$, we need to store on the order of 200 tree nodes for every op hash we want to add, for large numbers of ops. This is infeasible to hold in memory. It's also completely unnecessary to store data at this level of detail.

One nice thing about sparse Fenwick trees is that you only need deep nodes to describe finer subdivisions of space, and sparse trees only create nodes that are needed, so if we only refer to regions down to a certain level of quantization, our tree is guaranteed to grow only to a certain depth. Moreover, if we decide we don't need to talk about regions above a certain size, we can also disallow the tree from storing nodes shallower than a certain level. By bounding the depth of stored nodes from both above and below, we can achieve a reasonable memory overhead.

## Choosing time and space intervals for regions

There are a lot of ways to split up a rectangle into smaller rectangles. Here are some criteria we use to decide how to do so:

### Time intervals

In the time dimension, we want to use larger intervals for older times, because we expect historical data to change less often than recent data, so we can typically send fingerprints that cover wide swaths of time, and ask for more precise fingerprints in the rare case that we encounter a mismatch.

The general principle is that **the time interval we choose for a given region should be inversely proportional to the probability of encountering new ops within that region during a gossip round**. Since this probability decreases over time, older regions grow longer in time.
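As a toy illustration of this principle (the doubling rule and the `min_window` parameter are invented for this example, not part of the design), history could be carved into windows that double in length as they recede into the past:

```
/// Illustrative only: carve the span [genesis, now) into time windows
/// that double in length as they recede into the past, so that older
/// regions cover exponentially wider swaths of time.
fn time_windows(genesis: u64, now: u64, min_window: u64) -> Vec<(u64, u64)> {
    assert!(min_window > 0);
    let mut windows = Vec::new();
    let mut end = now;
    let mut len = min_window;
    while end > genesis {
        let start = end.saturating_sub(len).max(genesis);
        windows.push((start, end)); // half-open window [start, end)
        end = start;
        len *= 2; // each older window spans twice as much time
    }
    windows.reverse(); // oldest first
    windows
}
```

With a 5-minute `min_window`, a full year of history collapses into only about 17 windows under this rule.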
The probability of change can also be a factor for recent data: if a network sees a surge in activity, the time window for recent data can be decreased, or, conversely, increased during a lull. Either way, the time window for historical data scales up as more time passes.

### Space intervals

The general principle here is that **the space interval for a given region should be chosen based on the density of data in that region**. By density we actually mean the size, in bytes, of all ops in a region. The number of ops is not important in defining the density, since any number of ops can be represented by a single 32-byte XOR fingerprint, but the cost of a fingerprint mismatch is that either the entirety of the data in that region must be sent, or another roundtrip must be completed to exchange more detailed fingerprints and reduce the size of the ops sent.

Note that we still do care about the number of ops in a region, because that can give us a hint as to whether it is useful to do the work of splitting the region into smaller pieces, which may involve a gossip roundtrip. A region with a large number of ops is more likely to be evenly splittable in half, whereas a region with a small number of ops is less likely to be splittable. A region with only 1 op is of course completely unsplittable. So, there is an interplay between size and count when dealing with regions that are large in size.

Another factor in choosing space intervals is that the storage arc must be completely covered in the space dimension by the selection of fingerprints. So, the arc size needs to be selected based on the density of data within it, especially around its edges. This means the arc needs to be quantized to some degree. The choice of arc size may have some influence in setting the space length of some of the regions around the edges, but I think it may work out such that the arc size is purely a function of the regions selected, and will grow in quantum leaps purely determined by the data density of the regions it is growing or shrinking into.

## What is sent during gossip

The description of a region is essentially thus:

```
/// A region of spacetime, as sent during gossip.
struct SpacetimeRegion {
    /// The XOR of the hashes of all ops in this region
    xor_fingerprint: [u8; 32],
    /// Where this region sits in the 2D Fenwick tree
    coords: SpacetimeCoords,
    /// The total size, in bytes, of all ops in this region
    size: u32,
    /// The number of ops in this region
    count: u32,
}

/// A coordinate along one dimension: `height` selects the level of
/// subdivision (i.e. the interval length), and `offset` selects which
/// interval at that level.
struct Coords {
    height: u32,
    offset: u32,
}

struct SpacetimeCoords {
    space: Coords,
    time: Coords,
}
```

The spacetime coordinates uniquely map a region of spacetime to a calculation on the 2D Fenwick tree. Without spelling it out here, there is a mapping between `SpacetimeCoords` and `((DhtLocation, DhtLocation), (Timestamp, Timestamp))`, i.e. any rectangular region of spacetime can be represented as a pair of coords. The coordinates need to be sent so that the recipient can compare the hash to their own hash at the same coordinates.

Note that the coordinates can be compressed to some degree. We could allocate some odd number of bits for each coordinate, perhaps even different numbers of bits for time and space. For instance, if the time quantum is 5 minutes, 24 bits lets us represent roughly 160 years, so we'd only need 5 bits for the height and 24 bits for the offset of the time coordinate, using only 29 bits instead of 64.
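A sketch of that packing (illustrative only: the function names are invented, and the field widths are just the ones from the example above, not any actual wire format):

```
/// Illustrative only: pack a time coordinate into 29 bits of a u32,
/// using 5 bits for the height and 24 bits for the offset.
fn pack_time_coord(height: u32, offset: u32) -> u32 {
    assert!(height < (1 << 5), "height must fit in 5 bits");
    assert!(offset < (1 << 24), "offset must fit in 24 bits");
    (height << 24) | offset
}

fn unpack_time_coord(packed: u32) -> (u32, u32) {
    (packed >> 24, packed & ((1 << 24) - 1))
}
```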
## "Time dilation"

We know that we want to send regions larger in the time dimension as we talk about ops further back in history. The most aggressive time dilation scheme we can perform is: starting from the right of the tree, select either 2 or 3 nodes at the deepest level. Then, walking left, select progressively higher nodes to represent larger swaths of time. By the time we reach the left end of the tree, we have selected the left child of the root node to represent the entire "older half" of history.

By this scheme, we can start to drop nodes off of our tree as time progresses. Our tree will always be deepest towards the right, and for most nodes in the tree, the right subtree will be deeper than the left subtree.

For instance, here's how a tree might look after the first 80 minutes of an app running. The numbered nodes are roughly the ones we would use to calculate our regions. This is a time tree, so we actually send multiple regions at these time intervals, at various other space intervals. I'm glossing over some other details, like the fact that the "5" nodes here are actually at the root of a tree that's potentially 32 levels deep, but it's easy to see that most of the rest of the tree is going to be null, and only gets filled in gradually as time goes on.

![](https://hackmd.io/_uploads/SyVL27-5F.png)

That formula might be a bit off too: $H$ is more like "the height of the highest node which has two non-null children". Let's write it out in $\LaTeX$, calling it $H'$ to be clear that it's a "special" notion of height. The closed form of this series (the number of nodes dropped by this scheme) is:

$$
\begin{align}
N_{dropped} &= \sum_{h=1}^{H'} \left( 2^h - 2 \right) \\
&= 2(2^{H'}-1) - 2H'
\end{align}
$$

and the inverse, the actual number of nodes in the tree, is found by subtracting this from the total number of "significant" nodes, $N_{total} = 2^{H'+1} - 1$:

$$
\begin{align}
N_{remaining} &= N_{total} - N_{dropped} \\
&= (2^{H'+1} - 1) - (2(2^{H'}-1) - 2H') \\
&= 2^{H'+1} - 2^{H'+1} + 2 - 1 + 2H' \\
&= 2H' + 1
\end{align}
$$

This is not the actual total number of nodes, because we need to include the nodes above the node at $H'$ which contain only one non-null child. There are $H_R - H'$ of these, where $H_R$ is the height of the root of the entire tree. So, the total number of nodes we need to store in the entire tree is only:

$$
H_R - H' + 2H' + 1 = H_R + H' + 1
$$

Not bad! Even for a full tree of depth 32, that's only 65 nodes. I'm shocked, actually!

## Bounded spatial resolution

To save further on the memory overhead of Fenwick trees, we may choose to limit both the minimum and maximum resolution of space intervals. The reason for this is that we expect most networks to have a relatively even distribution of data density, and space resolution is primarily a function of the density in a region. So, for a perfectly balanced network, we only need one space interval to measure out space, and it's no use computing and storing hashes of regions with smaller or larger space intervals.

In reality, there will be variations in data density, both within a single agent's view of the DHT, and between the views of two different agents who are gossiping with each other, so we will need to store a range of space intervals. However, there will be a reasonable upper and lower bound beyond which we can omit computation and storage.

As a rough first-pass optimization, we can just omit any nodes above a depth $D_0$ or below a depth $D_1$. The total number of nodes in a full binary tree is $n(d) = 2^d - 1$, where $d$ is the depth. The savings of cutting out nodes above the min depth is $S_1 = n(D_0 - 1)$. The savings of cutting out nodes below the max depth is $S_2 = 2^{D_1 + 1} \cdot n(D - D_1 - 1)$, where $D$ is the depth of the full tree.

This bounded "tree" is really a collection of trees of depth $D_1 - D_0$, of which there are $2^{D_0}$ many.
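As a quick sanity check (the function name is invented for illustration), we can count the nodes of this forest directly; it agrees with the closed form derived just below:

```
/// Illustrative only: the bounded "tree" is a forest of 2^D0 full
/// binary trees, each of depth D1 - D0, so we can count its nodes
/// directly.
fn bounded_tree_nodes(d0: u32, d1: u32) -> u64 {
    let num_trees = 1u64 << d0;                   // 2^D0 trees in the forest
    let nodes_per_tree = (1u64 << (d1 - d0)) - 1; // n(D1 - D0) nodes each
    num_trees * nodes_per_tree                    // = 2^D1 - 2^D0
}

fn main() {
    // e.g. min depth 4, max depth 12:
    assert_eq!(bounded_tree_nodes(4, 12), (1u64 << 12) - (1u64 << 4)); // 4080
}
```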
Therefore, the total number of nodes remaining in this bounded tree is:

$$
N = 2^{D_0} \cdot (2^{D_1 - D_0} - 1) = 2^{D_1} - 2^{D_0}
$$

Another way to say this is in terms of the depth below $D_0$ we wish to store, which would be written $H = D_1 - D_0$. In these terms,

$$
N = 2^{D_0 + H} - 2^{D_0} = 2^{D_0}(2^{H} - 1)
$$

This form makes it a little clearer how the number of nodes grows. Whenever we either expand the bounds by one level of depth, or shift an existing pair of bounds down a layer, we double the number of nodes we must store, if the tree is full (non-null) down to $D_1$.

However, in the sparse case, the more helpful thing to realize is this: for every op we add, each space tree only grows by $\Delta n$ nodes, where $0 \le \Delta n \le D_1 - D_0$. That is to say, the bounded region grows *at most* by a number of nodes equal to the height of the bounded region. Whether it grows by a smaller or larger number of nodes depends on how densely packed the bounded region already is.
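To see where that bound comes from, here is a toy sketch (the `BoundedSparseTree` shape is invented for illustration, and it only counts materialized nodes rather than storing fingerprints): inserting one op can materialize at most one new node per level between $D_0$ and $D_1$.

```
use std::collections::HashSet;

/// Illustrative only: a sparse, depth-bounded binary tree keyed by
/// (depth, index) paths. Inserting one op materializes at most one
/// node per level between d0 and d1.
struct BoundedSparseTree {
    d0: u32,
    d1: u32,
    nodes: HashSet<(u32, u64)>, // (depth, index within that depth)
}

impl BoundedSparseTree {
    fn new(d0: u32, d1: u32) -> Self {
        Self { d0, d1, nodes: HashSet::new() }
    }

    /// Insert an op at a leaf position; returns Δn, the number of newly
    /// created nodes, which is bounded by 0 ≤ Δn ≤ d1 - d0. The denser
    /// the tree already is, the more ancestors already exist, and the
    /// smaller Δn gets.
    fn insert(&mut self, leaf: u64) -> u32 {
        let mut created = 0;
        for depth in self.d0..self.d1 {
            // The ancestor of `leaf` at this depth.
            let index = leaf >> (self.d1 - depth);
            if self.nodes.insert((depth, index)) {
                created += 1;
            }
        }
        created
    }
}
```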