## Meeting notes - Community History Archive Indexing
Attendees: Sanaz, Pascal
### Meeting goal
The goal of the meeting was to revisit the two solution proposals for message archive bundling and indexing and to discuss their trade-offs, so that we eventually reach consensus on which solution will be used.
### Discussion
- **Recap**
- Community nodes that receive a magnet link with metadata for message archives should not need to re-download archive data they have already downloaded
- Example:
- Community owner publishes archive A via torrent, which is successfully downloaded by interested nodes
- Community owner publishes archive B (which includes archive A) via torrent
- Interested nodes receiving that torrent **need to be able to detect that the data of archive A was already downloaded** (even though this is a new torrent) and only download the additional data (archive B)
- There are two proposals on how to bundle message archives
- An append-only binary of message archives (John)
- An archive index with pointers to published archives (Pascal)
- Both solutions come with trade-offs; these are discussed below
- #### Append-only message archive binary
- A little bit of BitTorrent background is needed (super simplified)
- When torrents are created, the data in question is split into small pieces
- Each piece gets hashed with SHA-1 (producing a 20-byte hash)
- Pieces (hashes) are shared with the network, allowing nodes to figure out which pieces they need
- Because torrent clients rely on hashes for each piece to download, they can easily figure out which data they already have
- Example:
- Assume a file A to be shared, size 200 bytes
- The file is sliced into pieces of 100 bytes
- 200 bytes (total size) / 100 bytes (piece length) = 2 pieces
- SHA1(pieces[0]) = 0xABC
- SHA1(pieces[1]) = 0xDEF
- A torrent client trying to download that file will do so by requesting the pieces `0xABC` and `0xDEF` from the network
- If a data piece that hashes to `0xABC` already exists locally, the client only needs to request `0xDEF`
- For the sake of simplicity: this means, if 1 archive == 1 piece, clients could easily figure out which archives they have already downloaded (see the sketch below)
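- A minimal Go sketch of that piece-selection logic, reusing the example hashes above (`piecesToRequest` is an illustrative name, not a real torrent client API):
- ```go
  // Minimal sketch: given the piece hashes announced for a torrent and the
  // pieces a node already holds, only the missing ones need to be requested.
  package main

  import "fmt"

  // piecesToRequest returns the announced piece hashes the node doesn't hold yet.
  func piecesToRequest(announced []string, have map[string]bool) []string {
      var missing []string
      for _, hash := range announced {
          if !have[hash] {
              missing = append(missing, hash)
          }
      }
      return missing
  }

  func main() {
      announced := []string{"0xABC", "0xDEF"}       // piece hashes from the torrent metadata
      have := map[string]bool{"0xABC": true}        // piece already present locally
      fmt.Println(piecesToRequest(announced, have)) // -> [0xDEF]
  }
  ```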
- **Problem**
- There's no guarantee 1 archive == 1 piece
- There's also no guarantee that 1 archive == #n pieces, where #n is a predictable number
- Archives vary in size, meaning the number of pieces they are split into varies as well
- The last piece of data might not take up the whole piece length
- Example:
- Assume we have data of 10 bytes (this could be archive A)
- ```
[12 34 56 78 90 12 34 56 78 90]
```
- Assume the piece length is 8 bytes; the data would then be split into **2 pieces**
- ```
[12 34 56 78 90 12 34 56]
[78 90]
```
- ^ Assume piece 1 hashes to `0x123` and piece 2 hashes to `0x456`
- Assume **we add archive B** (also 10 bytes) to the previous archive
- ```
[12 34 56 78 90 12 34 56 78 90 \
11 22 33 44 55 66 77 88 99 00]
```
- Pieces are still going to be 8 bytes, so we get **3 pieces** (notice how data of archive B bleeds into piece 2)
- ```
[12 34 56 78 90 12 34 56]
[78 90 11 22 33 44 55 66]
[77 88 99 00]
```
- Piece 1 still hashes to `0x123`, but piece 2 no longer hashes to `0x456`; it hashes to something else because its data is different
- **This results in piece 2 being recognized as completely new data that needs to be downloaded, even though it may differ only slightly** (demonstrated in the sketch below)
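- The hash shift can be demonstrated with a small Go sketch using the example bytes above (the tiny `pieceLength` and the `pieceHashes` helper are illustrative; real torrents use much larger pieces):
- ```go
  // Sketch demonstrating the piece-boundary problem: appending archive B
  // changes the hash of piece 2 even though archive A's bytes are unchanged.
  package main

  import (
      "crypto/sha1"
      "fmt"
  )

  const pieceLength = 8 // tiny piece size to match the example above

  // pieceHashes splits data into pieceLength-sized chunks and hashes each one.
  func pieceHashes(data []byte) []string {
      var hashes []string
      for start := 0; start < len(data); start += pieceLength {
          end := start + pieceLength
          if end > len(data) {
              end = len(data)
          }
          hashes = append(hashes, fmt.Sprintf("%x", sha1.Sum(data[start:end])))
      }
      return hashes
  }

  func main() {
      archiveA := []byte{0x12, 0x34, 0x56, 0x78, 0x90, 0x12, 0x34, 0x56, 0x78, 0x90}
      archiveB := []byte{0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0x00}

      fmt.Println("A only:", pieceHashes(archiveA))
      // piece 1 keeps its hash, but archive B's bytes bleed into piece 2,
      // so piece 2 now hashes to a completely different value
      fmt.Println("A + B: ", pieceHashes(append(archiveA, archiveB...)))
  }
  ```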
- Because of the problem described above, Status nodes can't rely on torrent's piecing protocol to ensure data isn't redownloaded **unless we ensure pieces are always going to be the same for already downloaded data**
- To make this work we'd need to:
- Ensure every _last_ piece of an archive is filled up with some dummy data so the next archive isn't going to bleed into it
- Somehow add markers into the resulting pieces so nodes know where to split the bytes into archives (given that we added dummy data, the data is no longer just a sequence of archives; it's going to be something else that can't be recognized by Status nodes)
- This means we'll have to build our own encoding and parser semantics, which is quite complex (a padding sketch follows below)
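- A minimal sketch of the padding part, assuming zero bytes as dummy data (the helper name and filler choice are assumptions; a real format would also need the markers mentioned above):
- ```go
  // Sketch of the padding workaround: fill up the last piece of an archive so
  // the next appended archive starts on a fresh piece boundary.
  package main

  import "fmt"

  const pieceLength = 8 // matches the toy example above

  // padToPieceBoundary appends filler bytes until len(data) is a multiple of
  // pieceLength, so the next appended archive can't bleed into a shared piece.
  func padToPieceBoundary(data []byte) []byte {
      if rem := len(data) % pieceLength; rem != 0 {
          data = append(data, make([]byte, pieceLength-rem)...)
      }
      return data
  }

  func main() {
      archiveA := make([]byte, 10) // 10-byte archive as in the example
      fmt.Println(len(padToPieceBoundary(archiveA))) // 16: archive B would start on its own piece
  }
  ```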
- In addition, if we went with this, nodes lose the ability to selectively download archives
- Nodes might be able to only download the **last** archive if they recognize that previous archives were downloaded, but they can't download just a specific archive for a given time frame
- #### Message archive index
- This solution proposes that:
- Every individual archive is published as torrent to the torrent network
- In addition to that, an archive index is created which is also published to the network
- When nodes receive a message containing an archive index, they get metadata about published archives
- Looks something like
- ```
{
  "0x123": {
    "from": startTimestamp,
    "to": endTimestamp,
    "magnet_uri": ...
  },
  ...
}
```
- With an index like that, nodes can easily figure out which archives they already have and which are missing
- The key in the mapping is the hash of the `magnet_uri`
- Nodes store the hashes of the magnet_uris for archives they've already downloaded
- Furthermore, nodes not only have the ability to download just the last archive; they can download whatever they are interested in
- Since the metadata includes `from` and `to` fields, nodes can request archives for specific time ranges without downloading an entire binary containing all other archives (see the sketch below)
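- A rough Go sketch of what that lookup could look like; the types and field names are assumptions mirroring the JSON shape above, not a final wire format:
- ```go
  // Sketch of an archive index lookup: find archives that haven't been
  // downloaded yet and that overlap a requested time range.
  package main

  import "fmt"

  // ArchiveMetadata mirrors one index entry, keyed by the hash of its magnet_uri.
  type ArchiveMetadata struct {
      From      int64 // startTimestamp
      To        int64 // endTimestamp
      MagnetURI string
  }

  type ArchiveIndex map[string]ArchiveMetadata

  // missingInRange returns magnet URIs of archives the node hasn't downloaded
  // yet and whose time range overlaps [from, to].
  func missingInRange(index ArchiveIndex, downloaded map[string]bool, from, to int64) []string {
      var uris []string
      for hash, meta := range index {
          if !downloaded[hash] && meta.From <= to && meta.To >= from {
              uris = append(uris, meta.MagnetURI)
          }
      }
      return uris
  }

  func main() {
      index := ArchiveIndex{
          "0x123": {From: 0, To: 99, MagnetURI: "magnet:?xt=urn:btih:..."},
          "0x456": {From: 100, To: 199, MagnetURI: "magnet:?xt=urn:btih:..."},
      }
      downloaded := map[string]bool{"0x123": true}
      fmt.Println(missingInRange(index, downloaded, 0, 199)) // -> only 0x456's URI
  }
  ```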
- With the message archive index, we don't have to deal with the complexities introduced by the append-only archive binary approach
- Also, since archives are published individually, there's less risk of losing archive data entirely
### Conclusion
Based on the pros and cons discussed above, we've decided it is much more feasible to create a message archive index and give nodes the flexibility to download all, some, or none of the archives they are interested in.