## Meeting notes - Community History Archive Indexing

Attendees: Sanaz, Pascal

### Meeting goal

The goal of the meeting was to revisit the two solution proposals for message archive bundling and indexing, and to discuss their trade-offs so that we can reach consensus on which solution to use.

### Discussion

- **Recap**
  - Community nodes that receive a magnet link with metadata for message archives should not need to re-download archive data they have already downloaded
    - Example:
      - Community owner published archive A via torrent, which was successfully downloaded by interested nodes
      - Community owner publishes archive B (which includes archive A) via torrent
      - Interested nodes receiving that torrent **need to be able to detect that the data of archive A was already downloaded** (even though this is a new torrent) and only download the additional data (archive B)
  - There are two proposals on how to bundle message archives:
    - An append-only binary of message archives (John)
    - An archive index with pointers to published archives (Pascal)
  - Both solutions come with trade-offs, which are discussed below
- #### Append-only message archive binary
  - A little bit of BitTorrent background is needed (super simplified):
    - When torrents are created, the data in question is split into small pieces
    - Each piece is hashed with SHA1 (20 bytes)
    - The piece hashes are shared with the network, allowing nodes to figure out which pieces they need
  - Because torrent clients rely on per-piece hashes when downloading, they can easily figure out which data they already have
    - Example:
      - Assume a file A to be shared, size 200 bytes
      - The file is sliced into pieces of 100 bytes
      - 200 bytes (total size) / 100 bytes (piece length) = 2 pieces
      - SHA1(pieces[0]) = 0xABC
      - SHA1(pieces[1]) = 0xDEF
      - A torrent client trying to download that file will do so by requesting the pieces `0xABC` and `0xDEF` from the network
      - If a data piece that hashes to `0xABC` already exists locally, the client only needs to request `0xDEF`
    - For the sake of simplicity: this means that if 1 archive == 1 piece, clients could easily figure out which archives they have already downloaded
  - **Problem**
    - There's no guarantee that 1 archive == 1 piece
    - There's also no guarantee that 1 archive == #n pieces, where #n is a predictable number
    - Archives vary in size, so the number of pieces they are split into varies as well
    - The last piece of data might not take up the whole piece length
    - Example:
      - Assume we have data of 10 bytes (this could be archive A)
      - ```
        [12 34 56 78 90 12 34 56 78 90]
        ```
      - Assume the piece length is 8 bytes; the data would be split up into **2 pieces**
      - ```
        [12 34 56 78 90 12 34 56] [78 90]
        ```
      - Assume piece 1 hashes to `0x123` and piece 2 hashes to `0x456`
      - Assume **we add archive B** (also 10 bytes) to the previous archive
      - ```
        [12 34 56 78 90 12 34 56 78 90 \
         11 22 33 44 55 66 77 88 99 00]
        ```
      - Pieces are still going to be 8 bytes, so we now get **3 pieces** (notice how the data of archive B bleeds into piece 2)
      - ```
        [12 34 56 78 90 12 34 56] [78 90 11 22 33 44 55 66] [77 88 99 00]
        ```
      - Piece 1 still hashes to `0x123`, but piece 2 no longer hashes to `0x456`, because its underlying data has changed
      - **This results in piece 2 being recognized as completely new data that needs to be downloaded, even though it may differ only slightly**
    - Because of the problem described above, Status nodes can't rely on torrent's piecing protocol to ensure data isn't re-downloaded, **unless we ensure pieces are always going to be the same for already downloaded data** (the sketch after this section demonstrates the problem)
  - To make this work we'd need to:
    - Ensure every _last_ piece of an archive is filled up with some dummy data, so the next archive isn't going to bleed into it
    - Somehow add markers into the resulting pieces, so nodes know where to split the bytes into archives (given that we added dummy data, the data is no longer just a sequence of archives; it becomes something that Status nodes can't recognize without extra parsing)
    - This means we'd have to build our own encoding and parser semantics, which is quite complex
  - In addition, if we went with this, nodes lose the ability to selectively download archives
    - Nodes might be able to download only the **last** archive if they recognize that previous archives were downloaded, but they can't download just a specific archive for a given time frame
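To make the piece-bleed problem concrete, here's a minimal runnable sketch. It's in Go (assumed here because the Status node is implemented in Go; this is an illustration, not proposed implementation code) and reuses the byte values and the 8-byte piece length from the example above:

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// pieceHashes splits data into fixed-size pieces and SHA1-hashes each piece,
// mirroring (in very simplified form) how BitTorrent derives piece hashes.
func pieceHashes(data []byte, pieceLength int) [][20]byte {
	var hashes [][20]byte
	for start := 0; start < len(data); start += pieceLength {
		end := start + pieceLength
		if end > len(data) {
			end = len(data) // the last piece may be shorter than pieceLength
		}
		hashes = append(hashes, sha1.Sum(data[start:end]))
	}
	return hashes
}

func main() {
	archiveA := []byte{0x12, 0x34, 0x56, 0x78, 0x90, 0x12, 0x34, 0x56, 0x78, 0x90}
	archiveB := []byte{0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0x00}
	const pieceLength = 8

	// Piece hashes for archive A alone vs. for the appended binary A+B.
	before := pieceHashes(archiveA, pieceLength)
	after := pieceHashes(append(append([]byte{}, archiveA...), archiveB...), pieceLength)

	// Piece 1 survives, but piece 2 changes because archive B bled into it,
	// so a torrent client would treat it as entirely new data.
	fmt.Println("piece 1 unchanged:", before[0] == after[0]) // true
	fmt.Println("piece 2 unchanged:", before[1] == after[1]) // false
}
```

Running this prints `true` for piece 1 and `false` for piece 2, which is exactly the re-download problem described above.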
- #### Message archive index
  - This solution proposes that:
    - Every individual archive is published as a torrent to the torrent network
    - In addition, an archive index is created, which is also published to the network
  - When nodes receive a message containing an archive index, they get metadata about the published archives
    - It looks something like:
    - ```
      {
        "0x123": {
          from: startTimestamp,
          to: endTimestamp,
          magnet_uri: ...
        },
        ...
      }
      ```
  - With an index like that, nodes can easily figure out which archives they already have and which are missing (see the lookup sketch after the conclusion)
    - The key in the mapping is the hash of the `magnet_uri`
    - Nodes store the hashes of the magnet URIs whose archives they've already downloaded
  - Furthermore, nodes not only get the ability to download "just the last archive", they get to download whatever they are interested in
    - Since the metadata includes `from` and `to` fields, nodes can request archives for specific time ranges, without downloading an entire binary containing all other archives
  - With the message archive index we don't have to deal with the complexities introduced by the "append-only archive binary" approach
  - Also, since archives are published individually, there's less risk of losing archive data entirely

### Conclusion

Based on the pros and cons discussed above, we've decided it is much more feasible to create a message archive index and give nodes the flexibility to download all, some, or none of the archives they are interested in.
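For reference, here's a minimal sketch of the index lookup the chosen solution enables (same assumptions as the sketch above: Go, with illustrative names like `ArchiveMetadata` and `missingArchives` that are not an agreed-upon API). It diffs a received index against the set of magnet-URI hashes a node has stored locally:

```go
package main

import "fmt"

// ArchiveMetadata mirrors one entry of the proposed index: the covered time
// range plus the magnet URI of the corresponding archive torrent.
type ArchiveMetadata struct {
	From      int64 // startTimestamp
	To        int64 // endTimestamp
	MagnetURI string
}

// missingArchives returns the index entries whose key (the hash of the
// magnet_uri) is not in the node's set of already-downloaded archives.
func missingArchives(index map[string]ArchiveMetadata, downloaded map[string]bool) []ArchiveMetadata {
	var missing []ArchiveMetadata
	for hash, meta := range index {
		if !downloaded[hash] {
			missing = append(missing, meta)
		}
	}
	return missing
}

func main() {
	index := map[string]ArchiveMetadata{
		"0x123": {From: 1000, To: 2000, MagnetURI: "magnet:?xt=urn:btih:..."},
		"0x456": {From: 2000, To: 3000, MagnetURI: "magnet:?xt=urn:btih:..."},
	}
	// This node has already downloaded the archive behind "0x123".
	downloaded := map[string]bool{"0x123": true}

	// Only "0x456" is reported as missing; the From/To fields would
	// additionally let a node filter for a specific time range.
	for _, meta := range missingArchives(index, downloaded) {
		fmt.Printf("need archive %s (covers %d..%d)\n", meta.MagnetURI, meta.From, meta.To)
	}
}
```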