# File per Archive vs One Big Blob
Status currently stores community messages in archival files covering $7$-day periods. Let $A = [a_1, \cdots, a_n]$ be the list of archives for a given Status community.
In the current version of Status, $A$ is published as a _single_ file within a torrent -- which we refer to as the _archival blob_ -- with the individual archival files appended one after the other inside the blob. The torrent also includes an _index file_, which specifies which block offset ranges within the blob correspond to which archival file.
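For concreteness, here is a minimal sketch of what such an index conceptually holds, written in Go with illustrative names (not the actual status-go types):

```go
// Sketch only: illustrative types, not the actual status-go implementation.
// Each entry maps one archival file to a (piece-aligned) byte range in the blob.
type BlobIndexEntry struct {
	From   int64 // start of the 7-day period covered by the archive (Unix seconds)
	To     int64 // end of the period (Unix seconds)
	Offset int64 // byte offset of the archive within the archival blob
	Length int64 // length of the archive in bytes
}

type BlobIndex struct {
	Entries []BlobIndexEntry
}
```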
One immediate question that comes to mind is _why_ the design is this complicated. Indeed, publishing a new archival file requires:
1. appending the archival file to the existing archival blob, while respecting piece alignment constraints;
2. updating the index file;
3. regenerating the torrent file and getting a new infohash;
4. unseeding the previous infohash;
5. re-seeding the new infohash and publishing it in the community over Waku.
Since we already have an index file, why can we not simply update the index and publish each archival file as a separate torrent, together with the new version of the index? The index would then contain tuples of the form $([t^{i}_0, t^{i}_1], c^i)$, where $[t^{i}_0, t^{i}_1]$ specifies the time interval covered by archive $a_i$, and $c^i$ its magnet link.
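A minimal sketch, again in Go with hypothetical names, of what this flat per-archive index could look like:

```go
// Sketch only: hypothetical index for the one-torrent-per-archive approach.
// Each entry pairs the time interval [t0, t1] covered by an archive with the
// magnet link c of the torrent that serves it.
type ArchiveRef struct {
	T0         int64  // start of the interval covered by the archive (Unix seconds)
	T1         int64  // end of the interval (Unix seconds)
	MagnetLink string // c^i: magnet link of the torrent containing the archive
}

type ArchiveIndex struct {
	Archives []ArchiveRef
}
```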
Such an approach would spare us the intricacies of appends, and would also be much easier to integrate with today's Codex[^1].
**The efficiency argument.** One objection to such an approach is that one would have to publish, and maintain, many more archives. Indeed, a single community alone would accumulate $52$ new archives per year (one per week) -- $52$ new archives that every community member must seed, for every community they participate in.
What is, however, the real cost of seeding many files?
**Archive access.** From our conversation with the Status team, all members of a community will proactively attempt to complete their archival files and seed them upon joining[^3].
This means that, in steady state, the percentage of nodes requiring access to historical archives at any given time is likely very low[^2]. If that is indeed the case, then such swarms are likely made up mostly of seeders.
Seeders in principle need not seek out connections to other peers, so the cost of seeding in a swarm with few leechers boils down to periodically reminding the tracker that we are still around.
For BitTorrent, this means one announcement every $5$ minutes for HTTP trackers, or every $15$ minutes for DHT trackers. For a community of $900$ members in which everyone is fully online, a DHT tracker would thus receive an average of $1$ request per second, per file. If we use one torrent per archival file, this becomes $k$ requests per second, where $k$ is the number of torrents, and $k$ increases by one every week from the day the community was created.
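Making the arithmetic explicit (assuming announcements are spread uniformly over the $15$-minute interval):

$$
\frac{900 \text{ announcements}}{15 \times 60 \text{ s}} = 1 \text{ request/s per torrent}, \qquad k \text{ torrents} \Rightarrow k \text{ requests/s}.
$$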
The overhead of using a torrent per archive is therefore not insignificant. It also grows linearly with the age of the community, which is not a desirable property. While it is true that the overhead of sharing one big torrent also grows with its size (bitfields, piece lists, and so on), that will likely take longer to become a bottleneck.
Finally, if we eventually support proofs of storage, larger (mutable) files will probably make more sense. Using one torrent per archive is therefore unlikely to be the best option.
### Interest Groups
There are several ways to address the problem above for Codex. One of them would be supporting appends and doing exactly what Status does with BitTorrent.
Another would be modifying Codex to support communities with an "interest group" primitive that works, in some sense, like IPFS sessions: nodes would advertise themselves as part of an interest group (identified by a CID), and the swarm-building algorithm would work as usual (with random bootstrap, and low and high caps on connections for the group).
New files published within the group, however, would cause a node to treat its group neighbors as its swarm neighbors -- the node would simply start requesting blocks from neighboring peers, as it would in a regular swarm, except that this would happen for every file published within the group.
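A rough sketch of such a primitive, in Go with invented names (none of this is an existing Codex API), to make the flow concrete:

```go
// Sketch only: hypothetical "interest group" primitive, not an existing Codex API.
type Cid string

type Peer struct{ Addr string }

// An InterestGroup is identified by a CID; members keep between minPeers and
// maxPeers connections to other members, bootstrapped at random.
type InterestGroup struct {
	ID        Cid
	minPeers  int
	maxPeers  int
	neighbors []Peer
}

// OnFilePublished is invoked when a new file (CID) is announced within the
// group: the node requests its blocks from existing group neighbors, exactly
// as it would from swarm neighbors for a regular file.
func (g *InterestGroup) OnFilePublished(file Cid) {
	for _, p := range g.neighbors {
		go requestBlocks(p, file) // same block-request flow as a regular swarm
	}
}

// requestBlocks stands in for the usual block-exchange logic.
func requestBlocks(p Peer, file Cid) {
	_ = p
	_ = file
}
```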
Learning the list of CIDs within a group would remain a problem, however, and likely some form of index file would still be required.
[^1]: Codex could support appends, but currently requires rebuilding the whole Merkle tree. Using something like a Merkle Mountain Range could make this cheaper.
[^2]: This is contingent on churn, and should be confirmed.
[^3]: