# Swarm Participant Upper Bounds of DHT
Description:
> We need to have a clear understanding of what the upper bounds of our proposed solution to replace bittorrent in status-go will be. In particular, replacing a single message archive torrent (index + ever-growing data) with an index CID and a CID for each message archive dataset will cause DHT strain. Every 7 days, a new archive and updated index will be uploaded, and replicated to all members of the swarm. What is the upper bound of community members and average number of communities per node that can be supported by the DHT for such a system?
Some more Context: [Guiliano's write-up](https://hackmd.io/tw3oYZ10S8Kg1EoFJdRLpw)
This boils down to answering/measuring two things:
- **What is the upper bound of community members?** meaning, how many members can a community contain without straining the DHT.
- **What is the average number of communities per-node?** meaning, given that we expect each community to contain x number of members, how many communities can a node be part of?
minor note here: the term "DHT strain" is a bit vague to me!
So how do we go about analyzing this? I like to think in steps, so let's follow these:
- Start with understanding the DHT traffic first, [Chrys's write-up](https://hackmd.io/tzafQHRgRfKcKAfQ-xBm3Q) categorizes these so we can start there.
- Identify the parameters or variables involved in each of these traffic categories, and the formulas to estimate traffic (for each category) based on these params/variables.
- Based on our use-case (the Status community archive), select which traffic is mostly expected and what ranges the variables fall in, plus measurements of msg sizes (to estimate bandwidth).
- Run some calculations (on expected variable ranges) to estimate the traffic + bandwidth.
- Based on these numbers, decide what acceptable traffic for a node is, and then estimate upper bounds.
## DHT Traffic
Let's look at the sources of traffic in DHTs. I will use some of the terms and symbols from [Chrys's write-up](https://hackmd.io/tzafQHRgRfKcKAfQ-xBm3Q) for consistency. I'm restating it here with more elaboration, mainly for my own understanding.
**Maintenance Traffic**
The traffic overhead to keep the overlay network healthy.
- Generated by: All DHT participants regardless of whether it stores or queries content.
- Purpose: maintaining the DHT network topology - each node periodically:
- Pings its neighbors (keep-alives).
- Refreshes its routing table.
- Handles churn (detecting failed peers & replacing them).
- Scaling: depends on the number of nodes `N`, and I think it should be proportional to $\log_2(N)$ (because routing tables are organized in k-buckets). Doesn't depend on the number of content pieces `C`.
Normally maintenance traffic is a small fraction of the overall traffic, as also pointed out by Chrys's calculation. We should still include this in our analysis, but I assume it won't play a major role.
**Content Traffic**
The traffic generated when nodes publish "I can provide content X". In Codex, as I understand it, this is `addProvider(nodeId/CID, SPR)`.
- Generated by: nodes storing content pieces that must be discoverable (provider nodes only).
- Purpose: "I have CID_X"
- Scaling: Proportional to:
- Number of providers `P`
- Content pieces `C`
- TTL/Refresh frequency `T`
- churn rate of providers `λ`.
This is the major source of traffic in our use-case, as pointed out by Guiliano; it is proportional to the number of members in the community and grows endlessly over time as more archive files are created.
**Query Traffic**
The traffic from nodes asking “Who has content X?”, in Codex this is `getProviders(nodeId/CID)`
- Generated by: Any node needing content, in our case, members needing the archive e.g. new members.
- Purpose: "Who has CID_X"
- Scaling: Proportional to average query rate per content piece `Q`
In a stable state there are no queries, aside from the weekly distribution from the control node, since every member has the archive. However, there are two additional cases in which nodes/community members would query the archive data (or parts of it): (1) when joining the community, a node requests the whole archive; (2) when a node goes offline for a while and then comes back, the request covers only parts of the archive, and how many parts (CIDs) depends on how long it has been offline. The first situation is probably easier to estimate than the second.
**What is the main source of traffic in our case?**
Since the majority of nodes in the Status archive use-case are seeders, the main cost is refreshing the DHT tracker (Content Traffic) + some maintenance cost (Maintenance Traffic) + query traffic (depending on the factors described above).
## Variables
Let's look at the variables involved in our use-case. Recall that the archive is generated by the control node, which is initially the only one publishing the content; then everyone in the community replicates it and becomes a provider as well. I'm ignoring the initial distribution part here, and will assume all members of a community store the archive. Note, I'm not saying all **nodes** here, I'm saying all **members**.
- `C` = number of content pieces to index (number of archive files per member/provider)
- `P` = average providers per content (number of community members)
- `Q` = average queries per content piece per second
- `N` = number of DHT nodes
- `K` = DHT replication factor (typically 3-20)
- `T` = TTL/update interval (provider refresh period, seconds)
- `λ` = provider churn rate (providers joining/leaving per unit time)
If we want to measure the DHT strain/traffic per community, we can split `C` into:
$C_{com} = H + 1$: number of archive files per community, where `H` is the community age in weeks.
$C_{all} = \sum_{i=1}^{k} (H_i + 1)$: number of archive files across all `k` communities a node is part of, where $H_i$ is the age of community $i$ in weeks.
Note: we add 1 above for the index file.
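To make the content-key count concrete, here is a minimal Python sketch of $C_{com}$ and $C_{all}$ (the community ages used in the example are hypothetical):
```python
# Minimal sketch of the content-key counts defined above.
# One archive CID per week per community, plus one index CID.

def content_keys_per_community(age_weeks: int) -> int:
    """C_com = H + 1: weekly archives plus the index file."""
    return age_weeks + 1

def content_keys_per_node(community_ages_weeks: list[int]) -> int:
    """C_all: sum of C_com over all k communities a node is part of."""
    return sum(content_keys_per_community(h) for h in community_ages_weeks)

# Hypothetical example: a node in 3 communities aged 52, 10 and 104 weeks.
print(content_keys_per_node([52, 10, 104]))  # -> 169
```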
## Some Calculations/Analysis
### Formulas
For each of the traffic categories, we can try to come up with formulas to measure it. We need to measure traffic per node, either as RPCs or as bandwidth (bytes). Let's go with bandwidth for now, knowing that:
$$
\text{RPC/s } \times \text{ msg size} = \text{bandwidth (bytes/s)}
$$
We can get some estimates of the sizes of these RPC messages from the Codex implementation.
For **maintenance traffic**:
Based on the [Codex DHT](https://github.com/codex-storage/nim-codex-dht) we can observe that DHT maintenance consists of:
(1) Keep-alive / ReValidation:
- We have two parameters: `RevalidateMin = 5s` and `RevalidateMax=10s`
- Ping & Pong: Every $\tau_{keepalive} = 7.5s$ on average (`(RevalidateMin+RevalidateMax)/2`), ping 1 peer.
- $S_{ping}$: ping msg size, measured value is 15 bytes
- $S_{pong}$: pong msg size, measured value is 35 bytes
(2) Refresh:
FindNodeMessage & NodesMessage:
- Every $\tau_{refresh} = 5 min$ perform one random lookup to find `k` closest nodes.
- Cost per-lookup: $C_{lookup} = \alpha \log_2(N)$
- i.e. send $C_{lookup}$ requests (FindNodeMessage) and receive $C_{lookup}$ responses (NodesMessage).
- $\alpha = 3$
- K = 16 neighbours
- FindNodeMessage ($S_{find}$): measured msg size is 20 bytes
- NodesMessage ($S_{nodes}$) = 5 + n × (record size + 1), where n is the number of SPRs in the response and the record size is ~300 bytes on average. Responses are capped at 16 SPRs in total, split into multiple messages of at most 3 SPRs each, so in the worst case (16 SPRs) we can expect $S_{nodes}$ to be around 4800 bytes.
As I see it we need to calculate two things: Keep-alives and refresh/lookups:
(1) Keep-alives (per-node)
$$
\text{Bandwidth}_{keepalive} =
\frac{S_{ping} + S_{pong}}{\tau_{keepalive}} (bytes/sec)
$$
Based on the measured values:
- Bandwidth per sec = 6.67 bytes/sec - **constant**
- Per day: 11,520 Ping & Pong msgs per node
- Bandwidth per day: ≈ 576,000 bytes/day
(2) Bucket refreshes (per-node)
$$
\text{Bandwidth}_{refresh} =
\frac{C_{lookup} \cdot (S_{find} + S_{nodes})}{\tau_{refresh}} (bytes/sec)
$$
Thus the maintenance bandwidth is:
$$
Bandwidth_{maintenance} \text{ (per-node)} =
\text{Bandwidth}_{keepalive} + \text{Bandwidth}_{refresh}
$$
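As a sanity check, here is a small Python sketch of the maintenance bandwidth using the measured values above (the 50k-node DHT size in the example is a hypothetical input):
```python
import math

# Per-node maintenance bandwidth (bytes/sec) from the formulas above,
# using the measured message sizes and intervals.
S_PING, S_PONG = 15, 35      # bytes
S_FIND, S_NODES = 20, 4800   # bytes (worst-case NodesMessage)
TAU_KEEPALIVE = 7.5          # seconds
TAU_REFRESH = 300            # seconds (5 min)
ALPHA = 3                    # lookup factor

def maintenance_bandwidth(n_nodes: int) -> float:
    keepalive = (S_PING + S_PONG) / TAU_KEEPALIVE
    lookup_cost = ALPHA * math.log2(n_nodes)                 # C_lookup
    refresh = lookup_cost * (S_FIND + S_NODES) / TAU_REFRESH
    return keepalive + refresh

# Hypothetical example: a 50k-node DHT -> roughly 760 bytes/sec per node.
print(round(maintenance_bandwidth(50_000), 1))
```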
For **content/advertise traffic**:
Every provider node that stores a content key (CID) must (re-)advertise it every time interval `T`. Each advertisement has a cost; let's use the term $RPC_{adv}$ for the cost per advertisement (in RPCs). The cost to advertise a single content key has two parts: (1) find the `K` closest nodes to the CID in the DHT keyspace, (2) send `addProvider` msgs to these `K` nodes. This is summarized in the following formula:
$$
RPC_{adv} \text{ (per-content, per-provider)} = C_{lookup} + K
$$
To measure the bandwidth, we need the msg sizes. We already defined the msg sizes for lookup as $S_{find}, S_{nodes}$; now we need to introduce the following symbol:
- $S_{AP}$ : AddProviderMsg, a request msg from advertiser to the `K` nodes. Based on our measurement this is 234 bytes.
I'm assuming here there is no "ack" response to `AddProvider`. Now the formula for bandwidth per-content per-provider is the following:
$$
Bandwidth_{adv}^{\text{ (content,provider)}} = \frac{C_{lookup} \times (S_{find} + S_{nodes}) + K \times S_{AP}}{T} \text{ (bytes/sec)}
$$
Then the per-provider bandwidth:
$$
Bandwidth_{adv}^{\text{ (provider)}} = C \times Bandwidth_{adv}^{\text{ (content,provider)}} \text{ (bytes/sec)}
$$
Again, in the above formulas I'm ignoring the churn rate and assuming a constant number of providers `P`.
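A minimal sketch of the advertise bandwidth per provider, assuming the measured message sizes above; the DHT size, number of content keys, and refresh interval in the example are hypothetical:
```python
import math

# Per-provider advertise bandwidth (bytes/sec), following the formulas above
# and ignoring churn as stated.
S_FIND, S_NODES, S_AP = 20, 4800, 234   # bytes
ALPHA, K = 3, 16

def advertise_bandwidth(n_nodes: int, content_keys: int, refresh_interval_s: float) -> float:
    lookup_cost = ALPHA * math.log2(n_nodes)                   # C_lookup
    per_content = lookup_cost * (S_FIND + S_NODES) + K * S_AP  # bytes per (re-)advertise
    return content_keys * per_content / refresh_interval_s

# Hypothetical example: 50k-node DHT, 150 content keys, 30-minute refresh interval.
print(round(advertise_bandwidth(50_000, 150, 30 * 60), 1))  # ~19 kB/s
```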
For **query traffic**:
The query traffic is highly dependent on the query rate `Q` which is somewhat difficult to estimate, but fortunately the formulas are quite simple as we will see next.
The number of RPCs required to query is the same as for advertisement: to find the providers of a specific content key, you first have to find the `K` closest nodes through a lookup, then send `getProviders` to these `K` nodes. Therefore:
$$
RPC_{query} \text{ (per-content)} = C_{lookup} + K
$$
To measure bandwidth we need msg sizes again, so let's introduce new symbols:
- $S_{GP}$ : GetProviders, a request to get the providers for a content key, measured = 33 bytes.
- $S_{PM}$ : ProvidersMessage, a response containing a list of `P` providers - approximately `5 + P*300` bytes. Note that each message carries at most 5 providers; if there are more, multiple msgs are sent. For simplicity we can just assume `P*305` bytes.
The per-content bandwidth (bytes/sec) formula is
$$
Bandwidth_{query}^{\text{content}} = Q \times \Big(C_{lookup} \times (S_{find} + S_{nodes}) + K \times (S_{GP} + S_{PM})\Big) \text{ (bytes/sec)}
$$
Then we can estimate the per-node bandwidth using:
$$
\text{Bandwidth}_{query} =
C \times Bandwidth_{query}^{\text{content}}
$$
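Here is a small sketch of these query-traffic formulas, assuming the simplified `305*P` ProvidersMessage size; the example inputs are hypothetical:
```python
import math

# Query bandwidth (bytes/sec), per content key and per node, from the formulas above.
S_FIND, S_NODES, S_GP = 20, 4800, 33   # bytes
ALPHA, K = 3, 16

def query_bandwidth_per_content(n_nodes: int, providers: int, query_rate: float) -> float:
    lookup_cost = ALPHA * math.log2(n_nodes)   # C_lookup
    s_pm = 305 * providers                     # simplified ProvidersMessage size
    return query_rate * (lookup_cost * (S_FIND + S_NODES) + K * (S_GP + s_pm))

def query_bandwidth_per_node(n_nodes: int, providers: int, query_rate: float,
                             content_keys: int) -> float:
    return content_keys * query_bandwidth_per_content(n_nodes, providers, query_rate)

# Hypothetical example: 50k nodes, 200 providers, 150 content keys,
# one query per content key per hour.
print(round(query_bandwidth_per_node(50_000, 200, 1 / 3600, 150), 1))
```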
Note that the formula for $\text{Bandwidth}_{query}$ above calculates the bandwidth per node/provider given the query rate `Q`, but this rate is not expected to be equal for all providers. There is no "symmetry" here, i.e. depending on which communities you are part of, this rate might be high or low. This depends on a few things:
- The content in the Status community archive use-case is the archive files, which I would assume aren't affected by "popularity", meaning we don't have popular content that can skew the query rate toward some files over others (e.g. popular videos/images). Instead, the relevant factors in this setting are the following:
- The members' churn rate (members going off-line then on-line)
- The number/rate of new members joining the community.
- The size of the community -> bigger community, more churn + more joiners.
- Although the content (archive files) is replicated to all members (per community), this doesn't change the fact that `getProviders` will have to be sent to the `K` nodes in the DHT keyspace closest to the archive CID, and these nodes will be hot-spots when you have communities affected by the factors above.
**Total Bandwidth**
Given the formulas above, we would like to calculate the expected bandwidth for a node/provider. This comes down to summing the three main bandwidth formulas above:
$$
\begin{aligned}
\text{Bandwidth}_{\text{total}}
&=Bandwidth_{maintenance} + Bandwidth_{adv} + Bandwidth_{query} \\
\text{Bandwidth}_{\text{total}}
&= \frac{S_{\text{ping}} + S_{\text{pong}}}{\tau_{\text{keepalive}}} + \frac{C_{\text{lookup}}\,(S_{\text{find}}+S_{\text{nodes}})}{\tau_{\text{refresh}}}
\\ &\quad + \frac{C}{T}\Big(C_{\text{lookup}}\,(S_{\text{find}}+S_{\text{nodes}}) + K\,S_{\text{AP}}\Big)
\\ &\quad + C\,Q \Big(C_{\text{lookup}}\,(S_{\text{find}}+S_{\text{nodes}}) + K\,(S_{\text{GP}}+S_{\text{PM}})\Big)
\end{aligned}
$$
expanded to:
$$
\begin{aligned}
\text{Bandwidth}_{\text{total}}
&= \frac{S_{\text{ping}} + S_{\text{pong}}}{\tau_{\text{keepalive}}} \\
&\quad + \frac{\alpha \log_{2}(N)\,(S_{\text{find}} + S_{\text{nodes}})}{\tau_{\text{refresh}}} \\
&\quad + \frac{C}{T}\Big(\alpha \log_{2}(N)\,(S_{\text{find}} + S_{\text{nodes}}) + K\,S_{\text{AP}}\Big) \\
&\quad + C\,Q \Big(\alpha \log_{2}(N)\,(S_{\text{find}} + S_{\text{nodes}}) + K\,(S_{\text{GP}} + S_{\text{PM}})\Big)
\end{aligned}
$$
### Expected range for variables
In the formula for total bandwidth, we can plug in our measured values for msg sizes and constants; these are stated again in the following table:
| Symbol | Meaning | Value |
|-------------------|--------------------------------------------|--------------------------------|
| `S_ping` | Ping message size | 15 bytes |
| `S_pong` | Pong message size | 35 bytes |
| `τ_keepalive` | Keepalive period | 7.5 s |
| `τ_refresh` | Bucket refresh period | 300 s (5 min) |
| `α` | Lookup factor | 3 |
| `K` | DHT replication factor (k-closest) | 16 |
| `S_find` | FIND_NODE request size | 20 bytes |
| `S_nodes` | NODES response (max) | ≈ 4800 bytes |
| `S_AP` | AddProvider request size | 234 bytes |
| `S_GP` | GetProviders request size | 33 bytes |
| `S_PM` | ProvidersMessage response size | 305*P bytes (P = providers)|
Plugging these numbers into the formula gives us:
$$
\begin{aligned}
\text{Bandwidth}_{\text{total}}
&= \frac{15 + 35}{7.5} \\
&\quad + \frac{3 \log_{2}(N)\,(20 + 4800)}{300} \\
&\quad + \frac{C}{T}\Big(3 \log_{2}(N)\,(20 + 4800) + 16 \cdot 234\Big) \\
&\quad + C\,Q \Big(3 \log_{2}(N)\,(20 + 4800) + 16\,(33 + 305P)\Big)
\end{aligned}
$$
Simplified:
$$
\begin{aligned}
\text{Bandwidth}_{\text{total}}
&= 6.67 \\
&\quad + 48.2 \,\log_{2}(N) \\
&\quad + \frac{C}{T}\,\big(14460 \,\log_{2}(N) + 3744\big) \\
&\quad + C\,Q\,\big(14460 \,\log_{2}(N) + 528 + 4880P\big)
\end{aligned}
$$
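The simplified formula can be evaluated directly; the following sketch plugs in the measured constants, with all example inputs (N, C, T, Q, P) being hypothetical placeholders for the ranges discussed next:
```python
import math

# Evaluates the simplified total-bandwidth formula above (bytes/sec).
# The numeric constants come from the measured message sizes/intervals;
# N, C, T, Q, P are the free variables.

def total_bandwidth(n_nodes: int, content_keys: int, refresh_interval_s: float,
                    query_rate: float, providers: int) -> float:
    log_n = math.log2(n_nodes)
    keepalive = 6.67
    refresh = 48.2 * log_n
    advertise = content_keys / refresh_interval_s * (14460 * log_n + 3744)
    query = content_keys * query_rate * (14460 * log_n + 528 + 4880 * providers)
    return keepalive + refresh + advertise + query

# Hypothetical example: 50k nodes, 150 content keys, 30-min refresh,
# one query per content key per hour, 200 online providers.
print(round(total_bandwidth(50_000, 150, 30 * 60, 1 / 3600, 200)))
```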
Now let's talk about expected ranges for each of the variables above ($N,P,C,T,Q$).
`N`, the number of nodes in the DHT. In our use-case this would be all Status (desktop?) nodes, i.e. all members of all communities.
This could range from 10K to 100K nodes depending on how popular Status is; we can maybe get some numbers?
`P`, the number of providers per content. This depends on the size of the community: the larger the community, the more providers there are, because we assume all community members replicate the data. We can estimate this to be in the range of 10-1000 members, but we can also assume that about 50%-70% of them are online at any one time, so `P` would be about half the expected number of members; we can then also ignore the churn rate and expect a constant number of providers. Note here that since all members are providers, for large communities the list of providers for a content key (CID) is long, so the response to `getProviders` would be a big list. Although if there is a limit to how many providers are in a response then this won't be a bandwidth issue, maybe a storage issue (for storing that long list)!?
`C`, the number of content pieces/keys. This variable in our setting increases with time (in weeks) and so we should model it as a function of time since the beginning of each community. Each community would have about 52 archive files/CIDs per year; multiply this by the number of communities to get `C`. See $C_{com}$ and $C_{all}$ described previously.
`T`, the provider refresh time interval. Looking at the BitTorrent docs, this value is about 15-30 min with a TTL of about 2 h, but in libp2p this value is much higher, around 22 h with a TTL of 48 h. In Codex, the blockExchange protocol advertises every 30 min with a TTL of 24 h. The range for `T` is then 30 min to 22 h.
`Q`, the query rate per content. I would expect this to be small in our setting, but it depends on how often nodes/community members go offline. In a stable state, every member has the archive and there are no queries. However, as stated earlier, there are two cases in which nodes/community members would query the archive data (or parts of it): (1) when joining the community, a node requests the whole archive; (2) when a node goes offline for a while and then comes back, the request covers only part of the archive, and how many parts (CIDs) depends on how long it has been offline. The first situation is easier to estimate than the second. We can try to express this as a function of the size of the community `P`. Let's give it a try here:
We can consider these factors:
- $\lambda_{\text{join}}$ : community-wide new join rate (joins per second)
- $\lambda_{\text{churn}}$ : per-provider churn rate (avg rejoins per second)
- $\theta_{\text{re}}$ : fraction of content keys a re-joining node must query.
We can use the above to calculate `Q`, which sums two parts:
(1) new joins: each new node that joins needs to fetch all content keys `C`, so the per-content rate is just $\lambda_{\text{join}}$.
(2) re-joins: each online provider has a probability of going offline and then re-joining, which we describe with $\lambda_{\text{churn}}$, and on re-join it only queries a fraction $\theta_{\text{re}}$ of the content keys.
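One possible way to combine these two parts into a single per-content query rate is sketched below. Note this is my own assumption of how the pieces fit together ($Q \approx \lambda_{\text{join}} + P \cdot \lambda_{\text{churn}} \cdot \theta_{\text{re}}$), not a measured model, and the example rates are hypothetical:
```python
# Assumption: per-content query rate Q combines the two parts as
#   Q ≈ λ_join + P_online * λ_churn * θ_re
# New joiners query every content key (rate λ_join per key); each of the
# P_online providers re-joins at rate λ_churn and then queries only a
# fraction θ_re of the keys.

def estimated_query_rate(join_rate: float, churn_rate: float,
                         refetch_fraction: float, online_providers: int) -> float:
    return join_rate + online_providers * churn_rate * refetch_fraction

# Hypothetical example: 5 joins/day, each provider re-joins once a week and
# refetches 10% of the content keys, 200 providers online.
q = estimated_query_rate(5 / 86_400, 1 / (7 * 86_400), 0.10, 200)
print(q)  # queries per content key per second
```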
Here is a table to summarize the variables:
| Variable | Expected Range | Notes |
| ----------------------------- | ---------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| $N$: DHT nodes | $10^4 - 10^5$ | All Status desktop/mobile nodes. |
| $P$: providers per content | \~5 - 500 | 10–1000 members per community × 50–70% online. |
| $C$: content keys | \~50–100 per community per year, global = sum across comms | 52 weekly archives + 1 index each year. Grows linearly with age. |
| $T$: refresh interval | 30min – 22 h | BitTorrent = 15–30 min, libp2p reprovides every \~22 h. |
| $Q$: query rate per content | ... |... |
### Example values and calculations
...
## Alternative Options:
If the analysis shows unacceptable numbers, we could look at alternatives. Here are some of the suggested ones. I list the ideas below and add some open questions (at least for me) to think about/research.
### (1) Move to one append-only archive file
We could simply use the same approach that Status took: every week, append to the ever-growing archive file and keep a single CID.
Questions:
- How does Codex handle file appends?
- How to append the file + Merkle root?
- How would the index work? How do we point to a specific 1-week archive file? Same as Status currently does?
- How would proving system work with growing/mutable (append-only) file?
### (2) Annual bundling
Bundling at a longer interval: instead of merging/appending every week, why not every few months or a year, e.g. annual bundling. I think this is needed anyway for the proof system, but we could separate the bundling for proofs from the bundling for DHT scaling.
Questions:
- Prior to merging, how do you prove all these small files? Are they floating free without proof or durability guarantees until merging?
- Same questions as before on the index.
- How do you prove merging/appending correctly?
- If everyone appends, how do they agree on the (root) hash?
### (3) Mutable Data CIDs
While looking at the BitTorrent DHT I found [BEP44](https://www.bittorrent.org/beps/bep_0044.html), which is an extension to support mutable data with a constant key/CID, as I understand it. I think the key is derived from the hash of the public key of the data publisher, not a hash of the content. This might be useful for the index file of the Status community archive use-case, since it is expected to be mutable and it is also published by a single party (the control node). There is also a mention of signatures and validation that the update is done by the publisher. Another thing is that they also have [BEP50](https://www.bittorrent.org/beps/bep_0050.html), which explains the use of pub/sub for broadcasting updates to the mutable file.
I was thinking at first that maybe we can use this to eliminate the one-CID-per-archive, but maybe not. Because with this, you would have a single mutable file and one CID, but then again this CID is for the whole thing, not chunks of it, so you would still have to deal with indexing so that members can fetch specific chunks (1-week archive files).
Not sure if this will improve efficiency or is worth exploring more, especially for a prototype, but it looks like an interesting approach.