$$
\newcommand{\id}[1]{\text{id}\left(#1\right)}
\newcommand{\nei}[1]{\text{deg}\left(#1\right)}
\newcommand{\neimax}[1]{\text{deg}_{\text{max}}\left(#1\right)}
\newcommand{\neiout}[1]{\text{deg}_{\text{out}}\left(#1\right)}
\newcommand{\neimin}[1]{\text{deg}_{\text{min}}\left(#1\right)}
\newcommand{\neioutk}[2]{\text{deg}_{\text{out}}^{(#1)}\left(#2\right)}
\newcommand{\neimink}[2]{\text{deg}_{\text{min}}^{(#1)}\left(#2\right)}
\newcommand{\neimaxk}[2]{\text{deg}_{\text{max}}^{(#1)}\left(#2\right)}
\newcommand{\neirep}[0]{\delta}
\newcommand{\idlep}[1]{c_{\text{idle}}^{(#1)}}
$$
# The Anatomy of a Block Protocol
Data exchange in P2P networks typically happens through a "block exchange" protocol which, at its simplest, is a protocol that allows peers to request and provide data in blocks of well-defined sizes. Blocks might be uniquely identifiable within the network^[Though you might not necessarily be able to _look up_ individual blocks efficiently.], and a block request consists of one or more such identifiers, while responses consist of the block contents themselves.
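As a minimal illustration, the request/response structure could be captured as in the following Python sketch; the message types and field names are hypothetical, not part of any concrete wire protocol:

```python
from dataclasses import dataclass

# Hypothetical message types for a request/response block protocol.
# Block identifiers and the wire encoding are left abstract on purpose.

@dataclass(frozen=True)
class BlockRequest:
    """A request for one or more uniquely identified blocks."""
    block_ids: tuple[str, ...]

@dataclass(frozen=True)
class BlockResponse:
    """A response carrying the contents of the requested blocks."""
    blocks: dict[str, bytes]  # block id -> block contents

# Example: ask a peer for two blocks, receive their contents back.
request = BlockRequest(block_ids=("block-1", "block-2"))
response = BlockResponse(blocks={"block-1": b"...", "block-2": b"..."})
```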
Other types of block protocols which are not based on request/response do exist, like P2P pubsub protocols in which blocks are pushed to a set of subscribers, but we will not concern ourselves with those.
While the basic structure of a request/response block protocol might be simple, designing it so that it scales and works efficiently in a dynamic, unreliable, and open network with potentially malicious nodes is not. In this document, we attempt to characterize what "working" means for a block protocol, as well as define its building blocks.
## Assumptions and Definitions
To narrow the problem down, we set forth a few assumptions and auxiliary definitions.
**Content type and size.** We assume that all blocks are part of files. In our context, a file $F = \{b_1, \cdots, b_n\}$ can be represented as an ordered sequence of blocks. This is without loss of generality as a block can be represented as a single-block file $F_b = \{b\}$. We can assign a unique identifier to a file $\id{F}$ by, say, taking the Merkle root of the tree built over its blocks. A block can then be identified by a tuple $\left(\id{F}, i\right)$; $1 \leq i \leq n$. We further assume that files can, and will often be, large; i.e., over one gigabyte in size.
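A minimal sketch of this identification scheme, assuming SHA-256 and a naive binary Merkle tree that duplicates the last node on odd levels (the actual tree construction used by the network is an assumption here):

```python
import hashlib

def merkle_root(blocks: list[bytes]) -> bytes:
    """Merkle root over the blocks of a file, used as the file id."""
    level = [hashlib.sha256(b).digest() for b in blocks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate last node on odd levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]

# A block is then identified by (id(F), i), with 1 <= i <= n.
file_blocks = [b"block 1 contents", b"block 2 contents", b"block 3 contents"]
file_id = merkle_root(file_blocks)
block_id = (file_id, 2)  # the second block of F
```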
**Network.** We consider a temporally varying network $N = \{p_1, \cdots, p_m\}$ of $m$ peers (or, interchangeably, nodes). Peers in this network might join, leave, or crash; i.e., we work with a crash-recovery model. Every peer $p \in N$ keeps a set of neighbors $\nei{p} \subset N$, which corresponds to the peers $p$ keeps active connections to, up to a maximum of $\neimax{p}$ connections.
**Storage providers and clients.** Our network might contain a set $S_F$ of distinguished peers called the _storage set_ of $F$; a peer $p \in S_F$ is referred to as a _storage provider_ (SP). Such peers store $d$ equally-sized, disjoint partitions $U_1, \cdots, U_d$ of $F$ such that $\bigcup_{i=1}^d U_i = F$, and their only role is to serve data to peers that might be interested in $F$. We denote the set of such partitions by $U_F$. Peers in $S_F$ are equally assumed to crash, join, or leave the network, though in practice they are incentivized to be more stable than other peers.
The remaining peers in $L_F = N - S_F$ are referred to as _storage clients_ (SC). We assume all SCs wish to obtain all blocks in $F$. Moreover, we assume SCs are in general willing to serve the blocks they have obtained to other SCs, should those be requested.
**Downloads.** Each storage client $p \in L_F$ maintains a temporally varying set $C_t$ containing the blocks it has obtained at time $t$, and a pending set $P_t = F - C_t$ of blocks it still needs to acquire. When $C_t = F$ (or, equivalently, when $P_t = \emptyset$), we say that $p$ has _completed_ the download of $F$^[This is actually stronger than what we need in practice, but is good enough for our purposes here.]. The earliest time $t_c$ at which $P_{t_c} = \emptyset$ is $p$'s _completion time_.
If we let $t_i$ be the time instant at which $p$ joined $N$, then we can define the _download time at $p$_ to be $t_d = t_c - t_i$.
## Problem
Informally, a block protocol must ensure that all blocks in $F$ eventually reach all nodes in $L_F$. To make things simpler, we focus on a single block $b \in F$. Since we have a dynamic network in which not all nodes hold all blocks, we have two possible scenarios:
1. no source for $b$ is ever available;
2. some source for $b$ is available.
Case (1) is uninteresting, as we cannot solve the download problem in it. For case (2), our expectation is that the block will reach every node in $L_F$ within some finite amount of time $\tau$, provided that the source for $b$ remains available for long enough.
A protocol therefore _solves the download problem_ if and only if there is a block source $p_s$ and a _finite_ time bound $\tau > 0$ such that:
1. whenever $p_s$ is available over a time interval $\left[t, t + \tau\right]$,
2. $b$ is delivered to all correct nodes that remain in $L_F$ by $t + \tau$.
Note that $\tau$ might be a very large number. What this tells us is that the protocol is not allowed to get stuck; i.e., it cannot take an unbounded amount of time for the block to be delivered to all correct nodes when a block source is available in the network. If this holds for every block $b \in F$, then it follows that every node in $L_F$ will eventually complete its download.
## Downloads in Codex Networks
A necessary condition for a block to disseminate from a source (which could be either an SP or an SC) to an SC is that there exists a _stable temporal path_ in the dynamic overlay that allows $b$ to flow from source to SC.
A useful proxy to stable temporal paths is guaranteeing that our network overlay always forms a strongly connected graph; i.e., that it has no partitions.^[Of course this is not enough to _guarantee_ that such paths exist, but keeping the overlay _always_ strongly connected will usually ensure that such paths eventually form.] Existing literature^[Gossip-Based Peer Sampling: [https://dl.acm.org/doi/10.1145/1275517.1275520](https://dl.acm.org/doi/10.1145/1275517.1275520)] shows that dynamic random graphs are exceptionally well-suited in that regard: they remain connected even under extreme churn. The approximations built by Bittorrent^[Understanding the Properties of the BitTorrent Overlay: [https://inria.hal.science/inria-00162088/document](https://inria.hal.science/inria-00162088/document)] also exhibit good resilience, with losses of up to $40\%$ of nodes still leading to connected overlays.
This could make a case for building an overlay the way Bittorrent does, except our case is slightly different: the primary block sources in the network, the nodes in $S_F$, do not store the whole file, and will neither download nor share blocks that do not belong to their partition.
To see why this causes issues, it helps to look at a few different scenarios. In the following, we [simulate the construction of Bittorrent overlays](https://github.com/gmega/block.protocol/blob/a9dc2e2fefda524c69e8f6ee356a70428d6f5eef/R/overlay-game.R#L1) (a rough sketch of the construction follows the list below) as we vary:
1. the maximum number of outgoing connections $\neiout{p}$; i.e., the maximum number of connections that a node will actively attempt to establish at bootstrap;
2. the maximum total number of connections $\neimax{p}$ that a node is willing to take;
3. the relative sizes of $|S_F|$ and $|L_F|$.
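The linked simulation is written in R; a rough Python analogue of the overlay construction (illustrative only, not the actual simulation code) might look like:

```python
import random

def build_overlay(n_peers: int, deg_out: int, deg_max: int, seed: int = 42):
    """Bittorrent-style bootstrap: each peer actively opens up to
    deg_out connections to peers drawn from a tracker sample, while
    refusing new connections once it holds deg_max in total."""
    rng = random.Random(seed)
    neighbors = {p: set() for p in range(n_peers)}
    for p in range(n_peers):
        sample = [q for q in range(n_peers) if q != p]
        rng.shuffle(sample)  # stand-in for a fresh tracker sample
        opened = 0
        for q in sample:
            if opened == deg_out:
                break
            # connections are symmetric; both ends must have room
            if q not in neighbors[p] and len(neighbors[p]) < deg_max \
                    and len(neighbors[q]) < deg_max:
                neighbors[p].add(q)
                neighbors[q].add(p)
                opened += 1
    return neighbors

# E.g.: 100 peers, 5 outgoing connection attempts, at most 20 in total.
overlay = build_overlay(n_peers=100, deg_out=5, deg_max=20)
```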
Fig. 1 illustrates the benign case in which $|L_F| \gg |S_F|$. Note that we get a large connected random graph with the storage providers embedded into it. There are paths connecting every SC to every SP, so we can expect this overlay to allow downloads to complete successfully.
The overlay in Fig. 2 is already a problem: we have an SC $p$ that bootstraps but does not see one of the storage providers. This could happen because the SP was unreachable when $p$ bootstrapped, or because $\neiout{p} < |S_F|$.

**Figure 1.** An overlay in which $|L_F| \gg |S_F|$ with storage nodes (purple) embedded in the larger overlay.
<center>
<img src="https://hackmd.io/_uploads/Sy_XZOV6yl.png">
</center>
**Figure 2.** Pathological case when $|L_F| < |S_F|$ and $\neiout{p} < |S_F|$ with a storage node (purple) disconnected from the rest.
Finally, a more subtle pathological case is presented in Fig. 3. Let $G = (N, E)$ represent our network overlay, with $N$ as the vertices and $E$ the connections among them. Since we simplify the neighborhood relationship to be symmetric; i.e., if $a$ is a neighbor of $b$, then $b$ is a neighbor of $a$, we have that $G$ is undirected.
Now define the directed _block flow graph $G_b = (N, E_b)$_ as follows:
* the vertices in $G_b$ are the set of peers, as in $G$;
* there is a _directed edge_ from $p_i$ to $p_j$ if and only if:
    * $p_i$ and $p_j$ are neighbors in $G$;
    * $p_j$ is not a storage provider.
A block flow graph represents the set of paths over which a block can actually propagate: since storage providers only serve blocks from their own partition, blocks effectively cannot flow "into" a storage provider. Note that unless storage clients rewire their connections, an overlay like the one in Fig. 3 will not allow downloads to complete, as some blocks will simply never reach some of the storage clients.
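A small sketch of how $G_b$ can be derived from the overlay $G$ and the storage set $S_F$, under the reading above (no edges into storage providers); the adjacency-dict representation is an assumption for illustration:

```python
def block_flow_graph(neighbors: dict, storage_set: set) -> dict:
    """Derive the directed block flow graph G_b from an undirected
    overlay G: keep a directed edge p -> q for every undirected edge
    {p, q}, except when q is a storage provider (blocks never flow
    into an SP)."""
    edges = {p: set() for p in neighbors}
    for p, qs in neighbors.items():
        for q in qs:
            if q not in storage_set:
                edges[p].add(q)
    return edges
```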
<center>
<img src="https://hackmd.io/_uploads/r1A83dNTyg.png">
</center>
**Figure 3.** A case in which $|L_F| \sim |S_F|$, $\neiout{p} < |S_F|$, and the overlay is strongly connected, but the block flow graph is not.
<center>
<img src="https://hackmd.io/_uploads/BkzAn_Na1e.png">
</center>
**Figure 4.** A weakly connected block flow graph.
With that in mind, we can refine the connectivity condition into a sufficient condition for downloads to complete in a static Codex network, namely:
**Theorem 1.** Let $G = (N, E)$ be a Codex network and $G_b = (N, E_b)$ its block flow graph. Downloads can complete if the following hold:
1. every node in $S_F$ has at least one neighbor;
2. the nodes in $L_F$ form a strongly connected component in $G_b$.
**Proof.** Since SPs (nodes in $S_F$) can only have neighbors in $L_F$, as they do not connect to each other, condition (1) implies that every node $p_s \in S_F$; i.e., every SP, has at least one neighbor $p_l \in L_F$. But $L_F$ forms a strongly connected component, meaning there are directed paths from $p_l$ to every other node in $L_F$. It follows that there are paths from $p_s$ to every other node in $L_F$. $_\blacksquare$
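For concreteness, the two conditions of Theorem 1 can be checked mechanically on a static overlay. Below is a plain-Python sketch, assuming the adjacency-dict representation of the earlier sketches (repeated BFS is used for clarity; Kosaraju-style algorithms would also work):

```python
from collections import deque

def reachable(edges: dict, start, allowed: set) -> set:
    """Nodes in `allowed` reachable from `start` via directed edges."""
    seen, frontier = {start}, deque([start])
    while frontier:
        p = frontier.popleft()
        for q in edges.get(p, ()):
            if q in allowed and q not in seen:
                seen.add(q)
                frontier.append(q)
    return seen

def satisfies_theorem_1(edges: dict, storage_set: set, clients: set) -> bool:
    # Condition 1: every SP has at least one neighbor (in G_b, all of
    # an SP's edges point into L_F, so we check its outgoing edges).
    if any(not edges.get(p) for p in storage_set):
        return False
    # Condition 2: the clients form a strongly connected component of
    # G_b: some client reaches all clients, and all clients reach it
    # (the latter checked on the reversed graph).
    if not clients:
        return True
    root = next(iter(clients))
    if reachable(edges, root, clients) != clients:
        return False
    reversed_edges = {p: set() for p in edges}
    for p, qs in edges.items():
        for q in qs:
            reversed_edges[q].add(p)
    return reachable(reversed_edges, root, clients) == clients
```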
We will not attempt to characterize or prove this more formally but, for a dynamic network, the graph intuitively needs to be _temporally strongly connected_, with constraints on the stability of temporal connectivity that depend on the bandwidth assumptions made for each node, as well as on network characteristics (e.g., latency).
## Protocol
Knowing the above, the question becomes that of making sure our protocol generates temporally connected block flow graphs that satisfy the conditions above; namely, a (temporally) strongly connected component of storage clients to which storage providers are also (temporally) weakly connected.
### Blind Resampling
Resampling, or re-bootstrapping, seems like the obvious tool to explore. If a peer can replace some of its neighbors with new neighbors drawn from a fresh tracker sample, then we can ensure that, at a minimum, $G_b$ will eventually be temporally strongly connected.
We refer to the "resample when we run out of neighbors to download from" idea, which came up many times as we worked through this problem, as _blind resampling_.
Blind resampling can be costly and ineffective. To understand why, consider a storage client $p_l$ participating in a hypothetical network with $|S_F|$ storage nodes; i.e., $|N| = |S_F| + 1$. Assume that, at each resample, $p_l$ draws $k < |S_F|$ candidate neighbors and learns which blocks they have. It is clear that, in such a situation, $p_l$ will need to see each of the $|N| - 1$ storage nodes _at least once_ if it is to complete its download, regardless of other protocol details.
We can then ask ourselves: under blind resampling, how many sampling rounds, on average, will it take $p_l$ to see all of the nodes in $S_F$?
Let $D_{N,k}$ be a random variable representing the required number of draws of size $k$. We can approximate this problem as an instance of the coupon collector's problem with group drawings^[The collector's problem with group drawings: [https://doi.org/10.2307/1427566](https://doi.org/10.2307/1427566)], for which we can write the expected number of draws $D_{N,k}$ as:
$$
\mathbb{E}(D_{N,k}) = \sum_{s=1}^{N - 1}\frac{(-1)^{s+1}\binom{N - 1}{s}}{1-\binom{N-s-1}{k}/\binom{N - 1}{k}}
$$
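This expectation is straightforward to evaluate numerically. A small Python sketch, using exact rational arithmetic to sidestep the severe cancellation in the alternating sum (here $n = |N| - 1$ is the number of storage nodes):

```python
from fractions import Fraction
from math import comb

def expected_draws(n: int, k: int) -> float:
    """Expected number of draws of size k (k distinct coupons per
    round) until all n coupons have been seen at least once."""
    total = comb(n, k)
    acc = Fraction(0)
    for s in range(1, n + 1):
        # math.comb(n - s, k) is 0 when k > n - s, which handles the
        # boundary terms of the sum.
        term = comb(n, s) / (1 - Fraction(comb(n - s, k), total))
        acc += term if s % 2 == 1 else -term
    return float(acc)

# With |S_F| = 100 storage nodes (n = |N| - 1) and samples of size 10:
print(expected_draws(100, 10))
```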
Fig. 5 shows the growth in the expected number of samples required as a function of $|S_F|$ when $k = 10$. We estimate the overhead by fitting a linear model to that curve, which yields a slope of $0.69$ (though the curve is clearly not linear). This means that finding all storage nodes by sampling in a network with $|N| = |S_F| + 1$ nodes will require roughly $0.69 \cdot |N|$ sampling rounds of size $10$; i.e., we will need to contact about $6.9\cdot |N|$ nodes in total, which represents an inefficiency factor of $6.9$ over simply contacting the nodes sequentially.
Letting $a_k$ be the slope obtained with the method just described, we can in general express the total number of nodes we need to contact for a given setting of $k$ as $a_k \cdot k \cdot |N|$. We denote $\alpha_k = a_k \cdot k$ as the _amplification factor_ for samples of size $k$.
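The slope $a_k$ (and hence $\alpha_k$) can be estimated with an ordinary least-squares fit of the expectation against $|N|$. A sketch reusing `expected_draws` from the previous snippet; the range of network sizes below is an arbitrary choice, not necessarily the one used for the figures:

```python
def slope(sizes: list[int], k: int) -> float:
    """Ordinary least-squares slope of E(D_{N,k}) against |N|."""
    ys = [expected_draws(n - 1, k) for n in sizes]  # n - 1 storage nodes
    x_bar = sum(sizes) / len(sizes)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, ys))
    den = sum((x - x_bar) ** 2 for x in sizes)
    return num / den

a_k = slope(list(range(20, 201, 20)), k=10)  # the text reports ~0.69
alpha_k = a_k * 10  # nodes contacted ~ alpha_k * |N|
```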

**Figure 5.** Expected number of required draws of size $k = 10$ before all nodes in a network of size $|N|$ are observed.
We can then extend this analysis to larger network sizes and different values of $k$. The values for $a_k$ are shown in Fig. 6, whereas the values for $\alpha_k$ are shown in Fig. 7.

**Figure 6.** Number of sampling rounds of size $k$ required for a network of size $|N|$, as a fraction of $|N|$.
Perhaps unintuitively, efficiency appears to _decrease_ as we increase $k$, even if slightly. The number of required sampling rounds also decreases, though with clearly diminishing returns: using $k = 50$ does not buy much over $k = 30$, whereas going from $k = 10$ to $k = 20$ halves the expected number of sampling rounds.

**Figure 7.** Amplification factor $\alpha_k$ as a function of sample size ($k$).
**Decision predicate.** Resampling also requires a _decision predicate_; i.e., a local predicate that a peer should evaluate to decide when it is time to resample. For Bittorrent, this predicate is triggered when the number of neighbors for a peer drops below a minimum threshold: in practice, this is enough to keep a Bittorrent overlay connected.
In Codex, we cannot use that same predicate. Indeed, we need to ensure that:
1. storage providers always have at least one neighbor;
2. storage clients maintain a strongly connected component.
When there are not enough nodes to maintain those properties, we need to ensure that:
1. storage clients are connected to block sources for long enough;
2. if there are inevitable partitions, then a storage client must spend enough time in each partition.
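What such a predicate might look like locally is sketched below; the inputs and thresholds are assumptions for illustration, not a settled design:

```python
def should_resample(
    n_neighbors: int,
    min_neighbors: int,
    pending_blocks: set,
    neighbor_blocks: set,
) -> bool:
    """Local resampling predicate for a storage client: resample when
    the neighborhood is too small (Bittorrent-style floor) or when no
    current neighbor holds any block we still need (starvation)."""
    if n_neighbors < min_neighbors:
        return True
    return not (pending_blocks & neighbor_blocks)

# E.g.: well-connected, but neighbors hold nothing we need -> resample.
print(should_resample(8, 4, pending_blocks={3, 7}, neighbor_blocks={1, 2}))
```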
### Multiswarms
A simple idea which could allow us to avoid the cost of blind sampling while relying on a simple decision predicate is to let peers deliberately choose which slot they wish to download, and to create a logical swarm for each. Such _multiswarms_ might allow us to use connections efficiently while maintaining the properties required by Theorem 1.
**Disjoint swarms.** As a stepping stone towards multiswarms, consider a simpler construction that is easier to build and explain: one swarm per slot. Each peer determines how many slots it is willing to download at a time in the form of a parameter $l$. The peer then determines a set of _active slots_ it wants to download, $U_a \subseteq U_F$, such that $|U_a| = l$.
It then joins $l$ different swarms, one per slot. Let $\neimaxk{i}{p}$, $\neimink{i}{p}$, and $\neioutk{i}{p}$ represent the maximum, minimum, and outgoing number of connections that $p$ should maintain in the $i^{\text{th}}$ swarm. For disjoint swarms, we set:
* $\neimaxk{i}{p} = \neimax{p}/l$
* $\neimink{i}{p} = \neimin{p}/l$
* $\neioutk{i}{p} = \neiout{p}/l$
As the peer completes the download for a slot $i$, it removes it from $U_a$, leaves its swarm, and selects the next slot to download, restarting the process.
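In code, the per-swarm budgets of the disjoint scheme are just an even split of the node-wide limits; a trivial sketch (integer division shown; how remainders are distributed is left open):

```python
def disjoint_budgets(deg_max: int, deg_min: int, deg_out: int, l: int):
    """Evenly split the node-wide connection limits across l swarms."""
    return {
        "deg_max_per_swarm": deg_max // l,
        "deg_min_per_swarm": deg_min // l,
        "deg_out_per_swarm": deg_out // l,
    }

print(disjoint_budgets(deg_max=60, deg_min=12, deg_out=30, l=3))
# {'deg_max_per_swarm': 20, 'deg_min_per_swarm': 4, 'deg_out_per_swarm': 10}
```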
One of the main drawbacks of this approach is that, whenever $l < |U_F|$, the peer will end up holding more slots on disk than fit in its active slot set as slots get completed, but it cannot serve them: this rigid scheme does not allow a peer to serve such slots without overshooting $\neimax{p}$.
Disjoint swarms are also wasteful: since connections are reserved upfront for each swarm, spare capacity (undersubscription) in one of the swarms cannot be used to devote more resources to another. For instance, if we had $\neioutk{1}{p} = 5$ but the swarm for slot $1$ only had one node, we could allow the total number of connections in some other slot $j$ to grow beyond $\neimaxk{j}{p}$; i.e., we could allow oversubscription, provided that no new neighbors for slot $1$ showed up.
**Multiswarms.** We can make the protocol above more efficient by pooling the connection budget $\neimax{p}$, exploiting correlations between swarms, and allocating resources dynamically. We call this a _multiswarm_, as it behaves more like a merge of swarms than a group of independent ones.
In a multiswarm, we still have a parameter $l$ which represents the number of slots that a node _actively wants to obtain_. We then keep $l$ _neighbor pools_ $G =\{G_1, \cdots, G_l\}$, $G \subseteq 2^N$, which might or might not be disjoint; i.e., $G_i \cap G_j \neq \emptyset$ is allowed.
We define the _idle capacity_ $\idlep{p}$ for a node $p$ as:
$$\idlep{p} = \neimax{p} - \sum_{G_i \in G} |G_i|$$
Each neighbor pool $G_i$ might be in one of four states:
* **critical**, meaning that the pool has too few neighbors;
* **undersubscribed**, meaning that the pool has fewer neighbors than its target;
* **on-target**, meaning that the pool is operating within its target range;
* **oversubscribed**, meaning that the pool has more neighbors than needed.
<center>
<img src="https://hackmd.io/_uploads/BJkMQ-ha1l.png"/>
</center>
**Figure 8.** States for a multiswarm neighbor pool.
Whenever a slot $U_i$ joins the active set, the node will attempt to bring its corresponding pool on-target by opening $\neioutk{i}{p}$ connections to other nodes also sharing $U_i$. The node will then allow inbound connections to happen freely: if so many nodes are interested in $U_i$ that $p$ ends up with more than $\neimaxk{i}{p}$ neighbors in $U_i$'s pool, that is fine. $p$ will allow the pool to grow as long as there is idle capacity; i.e., as long as the total number of neighbors taken across all slots of $F$ does not exceed $\neimax{p}$.
**Capacity stealing.** The key to efficient allocation is capacity stealing. Say we have an oversubscribed pool $G_i$ and an undersubscribed pool $G_j$, and that the current idle capacity of $p$ is zero. Then, suppose some node $q$ attempts to connect to $p$ because it wants to download the slot corresponding to $G_j$. If $p$ had idle capacity, it would simply accept $q$. Since it does not, $p$ will find its most oversubscribed pool and drop a neighbor at random. It will then admit $q$ as a neighbor in pool $G_j$, respecting the overall neighbor limit.
In a nutshell, any pool that is not oversubscribed will cause neighbor slots to be stolen from oversubscribed pools. This will also happen if the node is attempting to replenish neighbors for a pool $G_i$ that is in critical state: $p$ will actively seek new neighbors to fill $G_i$ and if there is no idle capacity, $p$ will make room by dropping neighbors from oversubscribed pools.
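Putting pool states and the stealing rule together, a minimal sketch of the accept-or-steal logic; the pool targets, the random victim choice, and tie-breaking by most-oversubscribed pool are illustrative assumptions:

```python
import random

class Multiswarm:
    """Minimal sketch: neighbor pools sharing a connection budget
    deg_max; incoming peers may steal capacity from oversubscribed
    pools."""

    def __init__(self, deg_max: int, targets: dict[int, int]):
        self.deg_max = deg_max
        self.targets = targets                    # pool id -> target size
        self.pools = {i: set() for i in targets}  # pool id -> neighbors

    def idle_capacity(self) -> int:
        return self.deg_max - sum(len(g) for g in self.pools.values())

    def oversubscribed(self) -> list[int]:
        return [i for i, g in self.pools.items() if len(g) > self.targets[i]]

    def accept(self, peer, pool_id: int) -> bool:
        """Accept `peer` into `pool_id`, stealing capacity if needed."""
        if self.idle_capacity() > 0:
            self.pools[pool_id].add(peer)
            return True
        over = self.oversubscribed()
        if not over:
            return False  # no capacity anywhere: refuse the connection
        # Drop a random neighbor from the most oversubscribed pool.
        victim = max(over, key=lambda i: len(self.pools[i]) - self.targets[i])
        self.pools[victim].discard(random.choice(list(self.pools[victim])))
        self.pools[pool_id].add(peer)
        return True

ms = Multiswarm(deg_max=4, targets={1: 2, 2: 2})
for peer in ("a", "b", "c"):
    ms.accept(peer, pool_id=1)  # pool 1 becomes oversubscribed
ms.accept("d", pool_id=2)       # uses the last idle slot
ms.accept("e", pool_id=2)       # steals a slot from pool 1
```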
**Non-active slots.** Under this scheme, nodes are allowed to continue sharing slots that are not in their active set. The difference is that neighbors interested in those slots go into a special pool, the _idle pool_, which is always considered to be oversubscribed. In case the node needs to draw idle capacity, those are the first neighbors it will drop.
**Optimizations.** It is clear that if pools $G_i$ and $G_j$ are not disjoint, then we can reuse the existing connection to a peer $q \in G_i \cap G_j$ to download the two slots corresponding to each pool. We can account for this by adjusting the idle capacity so that such "shared" neighbors are counted only once:
$$\idlep{p} = \neimax{p} - \left|\bigcup_{G_i \in G} G_i\right|$$
Nodes should always strive to fill pools with _diverse_ neighbors, however, as that is what will ultimately guarantee the overlay remains resilient.
**The parameter $l$.** A pain point of both disjoint swarms and multiswarms is the need to select a parameter $l$. Intuitively, we should be able to get rid of it: we could simply select the maximum number of connections that the protocol is allowed to use; i.e., $\neimax{p}$, and figure out how large the active slot set can be based on $\neimaxk{i}{p}$; i.e., we would have $l = \neimax{p}/\neimaxk{i}{p}$. In the end, $l$ is just a proxy for setting $\neimaxk{i}{p}$.
Setting a good value for $\neimaxk{i}{p}$, however, is still something we do not know how to do. Simulations and empirical evidence will be our best friends here, meaning we should also gather enough data from the network to help us decide whether the values we are setting are good enough.