A minimal radicle peer-to-peer protocol
=======================================
> ⚠️ This document has been superseded by RIP#1
Assuming we have a point-to-point replication protocol (Git), an
identity system and a means to resolve conflicts, the final question we
need to answer is that of the network architecture and topology. The
guiding principles here are similar to those of other peer-to-peer networks,
namely *censorship resistance*, *fault tolerance* and *resilience
against attacks*. Most of the properties and characteristics of the system
follow from these three principles.
The design we propose is inspired by Secure Scuttlebutt, as well as the
Bitcoin network. One characteristic these protocols share is a very
simple, unstructured network topology paired with a very basic
wire protocol. This fits our need for a minimal protocol that will serve
the first phase of the project, before any kind of mainstream adoption.
The network design has to serve one core purpose: given a project
identifier, a user should be able to retrieve the full project associated
with that identifier. Given that there is no natural incentive to host
arbitrary projects, nodes on the network should be given the option to
host only the projects they care about. For example, a company that uses
the Rust programming language may want to host the Rust compiler project
on their node, to ensure its continued availability on the network, and
to "give back" to the community. Another example could be a service
that hosts projects for a fee, paid by the project maintainers. In both
cases, the choice of which projects to host on a node is essential.
Equally important is the ability *not* to host projects that are unethical
in the eyes of the node operator.
For these reasons, we cannot deterministically compute on which node(s)
a given project will end up, as is the case with DHTs. Nodes are able
to choose what they host, and therefore the network is fundamentally
*unstructured*. Though there might be a benefit to arranging the peer
topology in a certain fashion (eg. to reduce message amplification),
this doesn't bring much to the table in terms of data lookups, and
increases the protocol complexity.
With that said, let's talk about what data is being exchanged on the network,
and how.
Protocol
--------
In essence, to answer the question of *where can I find a given project?*,
nodes on the network need to build a routing table.
This table should map between project identifiers (URNs) and the network
addresses (IP) of nodes that have those projects.

    did:rad:jnrx6t…mfdyy    54.122.99.1, 89.2.23.67
    did:rad:gi59bk…l2jfi    66.12.193.8, 89.2.23.67, 12.43.212.9
    …

In other words, a map of type `HashMap<Urn, Vec<IpAddr>>`. Since the amount
of data this represents fits on commodity hardware - even for millions of
projects - we don't need to partition it as we would with a DHT (though
there may be advantages other than disk space to partitioning it).
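
As a rough sketch, the routing table could be modelled like this in Rust. The
`Urn` type here is a placeholder for the real project identifier type, and a
set is used instead of a `Vec` so that providers stay deduplicated.

    use std::collections::{HashMap, HashSet};
    use std::net::IpAddr;

    /// Placeholder for a project identifier; the real type would wrap the
    /// decoded `did:rad:` URN.
    type Urn = String;

    /// For each project, the set of node addresses believed to have it.
    #[derive(Default)]
    struct RoutingTable {
        entries: HashMap<Urn, HashSet<IpAddr>>,
    }

    impl RoutingTable {
        /// Record that `provider` claims to have `urn`.
        fn insert(&mut self, urn: Urn, provider: IpAddr) {
            self.entries.entry(urn).or_default().insert(provider);
        }

        /// All known providers for a project, if any.
        fn providers(&self, urn: &Urn) -> Option<&HashSet<IpAddr>> {
            self.entries.get(urn)
        }
    }
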
To build this table, every node connecting to the network announces
to its peers what projects it has. Nodes may also share with neighbors
what other neighbors have, and so on. In this manner, every node in
the network will converge towards the same routing table. This
"eventually consistent" model is not guaranteed to ever converge fully,
but since it's always possible for clients to ask multiple nodes for
a given piece of data to increase the chances of finding a project,
it doesn't *have* to be fully consistent. Changes to what projects
a node has, ie. additions or deletions, are announced to peers in
the same fashion, and this information is propagated through the
network.
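
To make the convergence story a bit more concrete, here is how a node might
apply a gossiped announcement to the `RoutingTable` sketched above. The
`ProjectAnnouncement` type and the forwarding rule (only re-gossip what
actually changed our view) are assumptions, not part of the protocol.

    /// A gossiped announcement: `origin` claims to have gained or lost the
    /// listed projects. Builds on the `RoutingTable` and `Urn` sketch above.
    struct ProjectAnnouncement {
        origin: std::net::IpAddr,
        added: Vec<Urn>,
        removed: Vec<Urn>,
    }

    impl RoutingTable {
        /// Apply an announcement; returns `true` if it changed our view,
        /// in which case we may forward it to our other peers.
        fn apply(&mut self, ann: &ProjectAnnouncement) -> bool {
            let mut changed = false;
            for urn in &ann.added {
                changed |= self
                    .entries
                    .entry(urn.clone())
                    .or_default()
                    .insert(ann.origin);
            }
            for urn in &ann.removed {
                if let Some(providers) = self.entries.get_mut(urn) {
                    changed |= providers.remove(&ann.origin);
                }
            }
            changed
        }
    }
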
### Overview
Before we dive into some of the details of how the above would work,
let's think about the different actions a node may want to take:
* Finding peers to connect to
* Announcing what projects it has to the network
* Announcing to the network that one of its projects was updated
* Fetching updates for one of its projects, from the network
* Asking a peer what projects it has
We can break these down into three *types* of data being exchanged:
* Peer data, eg. IP addresses
* Project metadata, eg. the location and hash of a project
* Project data, ie. source code and commits of a project
It goes without saying that for a node to fetch project data, it
needs to know the location of that project, and for it to know
the location of the project, it needs to be connected to at least
one peer. Thus, the above also represents a state machine:
connect to peers -> fetch project metadata -> fetch project data
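
Sketched as a (deliberately naive) per-project state type; only the happy
path is modelled here, and the state names are just one way to slice it.

    /// Coarse per-project lifecycle of a node; error handling, retries and
    /// timeouts are omitted.
    enum NodeState {
        /// No peers yet: connect to bootstrap nodes.
        Connecting,
        /// Connected: gossip until we learn who provides the project.
        FetchingMetadata,
        /// Providers known: fetch the project data from one of them.
        FetchingData,
        /// Project replicated: announce it as part of our own inventory.
        Synced,
    }
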
### Bootstrapping
When a node joins the network for the first time, it will try to download
the projects it is interested in but doesn't have.
Nodes are configured with a list of project ids that they are to seek out and
replicate on the network. These are called "tracked" projects. Since users may
also directly "push" projects onto a node (given they are permitted to), nodes
will also have projects that no other node has. This is how all projects start
out on the network.
To fetch a project that it is missing, our node, once it is aware of
the location of said project, will connect to the remote node (the "provider")
in question and attempt to `git-fetch` the relevant refs. Since it's possible
for more than one provider to exist for a project, our node may choose one at
random, or attempt to fetch from *all* providers, to make sure it has the
latest state of the project.
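
A rough sketch of that fetch step, shelling out to `git fetch` once per
provider. The `git://<ip>/<urn>` URL scheme, the local repository path and
the function name are illustrative assumptions, not part of the protocol.

    use std::net::IpAddr;
    use std::process::Command;

    type Urn = String;

    /// Try to fetch a project's refs from every known provider and return
    /// how many fetches succeeded. Fetching from all providers maximizes
    /// the chance of ending up with the latest state.
    fn fetch_from_providers(repo_path: &str, urn: &Urn, providers: &[IpAddr]) -> usize {
        providers
            .iter()
            .filter(|addr| {
                // Hypothetical URL scheme; the actual transport is left open.
                let url = format!("git://{addr}/{urn}");
                Command::new("git")
                    .args(["-C", repo_path, "fetch", url.as_str()])
                    .status()
                    .map(|status| status.success())
                    .unwrap_or(false)
            })
            .count()
    }
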
A node joining the network for the first time will not know of any peers. Hence,
we require that the node client software ship with a set of "bootstrap" node
IP addresses. These are trusted nodes that can aid the new node in the
bootstrap process, by communicating the address of other peers on the network.
Though this is perhaps enough for a minimal protocol, using DNS names instead
would be more flexible: the parties operating those names can update the
associated IP addresses as needed.
Once a node is connected to an initial set of nodes, it may ask for the
addresses of other nodes, to build its address book, and ensure there are
always peers to connect to.
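
A minimal sketch of what the bootstrap configuration could look like; the
hostname, address and port below are made up, and whether DNS names or raw
addresses are used is up to the operator.

    use std::net::{SocketAddr, ToSocketAddrs};

    /// Bootstrap peers, shipped with the client or supplied by the user.
    /// Entries may be raw `ip:port` pairs or DNS names (illustrative only).
    const BOOTSTRAP_PEERS: &[&str] = &["seed.example.org:8776", "203.0.113.7:8776"];

    /// Resolve the bootstrap list into concrete socket addresses, silently
    /// skipping entries that fail to resolve.
    fn bootstrap_addresses() -> Vec<SocketAddr> {
        BOOTSTRAP_PEERS
            .iter()
            .flat_map(|peer| peer.to_socket_addrs().into_iter().flatten())
            .collect()
    }
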
### Keeping up to date
When a user pushes project updates to a node, the other nodes need to catch
up to these changes.
Since users, when updating project repositories, will likely not push to *all*
nodes tracking their project, there has to be a mechanism for nodes to
keep up to date with the latest changes of a project. To achieve this, the
node(s) receiving the push should send a message to the providers it knows
about for that project, and upon receiving that message, those providers may
`git-fetch` from the initiating node. This may present challenges if a connection needs
to be established every time, and so we propose that nodes with shared
projects maintain connections with each other. This is perhaps the only
area where network structure is relevant, and a data-oriented topology
would make routing of messages much more efficient. Thus, we can say
that nodes should prioritize connecting to peers with similar interests,
once they find out what those interests are, ie. once their routing table is
fleshed out.
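
A sketch of how the receiving side of such an update message might react,
reusing the hypothetical `Urn` alias and `fetch_from_providers` helper from
the bootstrapping section; the message shape is an assumption.

    use std::collections::HashSet;
    use std::net::IpAddr;

    /// "Project `urn` changed, fetch from me", sent by a node that has just
    /// received a push. Field names are illustrative.
    struct UpdateAnnouncement {
        urn: Urn,
        origin: IpAddr,
    }

    /// If we track the announced project, fetch the new refs straight from
    /// the announcing node.
    fn on_update(tracked: &HashSet<Urn>, repo_path: &str, ann: &UpdateAnnouncement) {
        if tracked.contains(&ann.urn) {
            let fetched = fetch_from_providers(repo_path, &ann.urn, &[ann.origin]);
            if fetched == 0 {
                eprintln!("failed to fetch {} from {}", ann.urn, ann.origin);
            }
        }
    }
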
### Untracking
If a node tries to retrieve a project from a provider, and that provider
responds that they don't have the project, the node should remove that
provider from the list of providers for that project, in its local table.
A provider may also fail to respond to such a request, leaving the requester
hanging; this too can be reason enough to remove that provider from the
project's entry.
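
Continuing the `RoutingTable` sketch from above, that pruning rule could look
like this; what exactly counts as "not responding" (timeout, number of
retries) is left open.

    impl RoutingTable {
        /// Forget that `provider` has `urn`, eg. after it answered that it
        /// doesn't have the project or failed to answer at all. Drops the
        /// whole entry once no providers remain.
        fn prune_provider(&mut self, urn: &Urn, provider: &IpAddr) {
            if let Some(providers) = self.entries.get_mut(urn) {
                providers.remove(provider);
                if providers.is_empty() {
                    self.entries.remove(urn);
                }
            }
        }
    }
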
### Project tracking and branches
When we talk about project hosting and tracking, what exactly do we mean?
Which *branches* of a project are hosted? The simple answer is that
we host all branches published by the project *delegates* (typically
the maintainers). This is a safe choice, and works, but there is one
shortcoming: contributors who aren't delegates won't have their changes
(forks) replicated. This becomes a problem when a contributor wants
to propose a new patch to a project, and the maintainer has no way
of retrieving the patch or associated code, which lives on a branch.
In the past, we have solved this with a "tracking graph" -- a social
graph that represents trust in the network. Delegates choose which
users to track, and the replication system then chooses which forks
to replicate based on that information. For example, a node could decide
to track all forks from users one degree out from a project's delegates.
Another node could choose to track forks two degrees out, etc.
Though this works, it doesn't solve the issue of first-time contributors
who aren't known to the delegates. In a minimal protocol, we can default
to replicating all forks, but this can eventually lead to abuse.
We leave this as an open question for future iterations of the protocol.
### Incentivization & data availability
As alluded to in the beginning of this document, nodes may want to host
projects that they value or want to support. This incentive is non-monetary,
but has the potential outcome of providing data availability that is
correlated with project popularity. This in essence is a good thing,
and will solve the data availability problem for the most popular
projects, but what of the less popular projects?
Since nodes choose what they host, they can take payment for hosting
specific projects. This is close to the traditional pay-to-host model,
and the network should be designed for these types of nodes to exist
alongside other nodes. Since this business model benefits from a growing
radicle ecosystem, we can assume that these providers will also host
projects for free, under a more basic hosting plan than the one offered to
paying customers. Hence we have a potential solution for the long tail.
Finally, for users who don't want to rely on the network, self-hosting
is always an option. Since the node software and protocol are not
resource intensive, it's plausible that users will be able to host
their own projects on commodity hardware. To make it even more
viable, we can consider a "light" mode of operation where only a partial
routing table is maintained by a user node. Of course, a solution to
NAT traversal will have to be implemented.
### Misbehavior
We won't dive deep into what happens when a node misbehaves on the network,
as this is a minimal protocol, but it's worth mentioning that nodes can
and should disconnect from peers that exhibit any of the following behaviors:
* Sending badly formed messages
* Not respecting the protocol semantics
* Sending too much data
* Not responding to requests
* Having too high a latency
* etc.
Nodes may keep a *ban list* to avoid re-connecting to peers which have
exhibited the above behavior, though we should be careful not to ban
a node simply because it went offline.
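
A sketch of such a ban list; the ban duration and the split between
"misbehaving" and "merely offline" are assumptions.

    use std::collections::HashMap;
    use std::net::IpAddr;
    use std::time::{Duration, Instant};

    /// Why a peer was disconnected. Only protocol violations lead to a ban;
    /// a peer that simply went offline is not penalized.
    enum DisconnectReason {
        MalformedMessage,
        ProtocolViolation,
        Flooding,
        Unresponsive,
        WentOffline,
    }

    #[derive(Default)]
    struct BanList {
        banned_until: HashMap<IpAddr, Instant>,
    }

    impl BanList {
        fn record(&mut self, peer: IpAddr, reason: DisconnectReason) {
            // One hour is an arbitrary choice for this sketch.
            if !matches!(reason, DisconnectReason::WentOffline) {
                self.banned_until
                    .insert(peer, Instant::now() + Duration::from_secs(60 * 60));
            }
        }

        fn is_banned(&self, peer: &IpAddr) -> bool {
            self.banned_until
                .get(peer)
                .map_or(false, |until| *until > Instant::now())
        }
    }
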
### Wire protocol
We won't go into detail on the wire protocol, but it's worth enumerating
the message types that need to exist for nodes to carry out the protocol
(a rough sketch in code follows the list):
* `Hello(...)`: A handshake message with node information, eg. address, port
and protocol version.
* `GetInventory`: A message to ask a node for its project inventory.
* `Inventory((id, hash)[])`: A message announcing what projects a node has.
This should include the hash of each project as well.
* `GetAddresses`: A message to request peer addresses.
* `Addresses(addr[])`: A message to send peer addresses.
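
The same messages as a Rust enum sketch; the field types, and the hash type
in particular, are placeholders rather than decisions.

    use std::net::SocketAddr;

    /// Placeholders for the project identifier and project state hash.
    type Urn = String;
    type Hash = [u8; 32];

    enum Message {
        /// Handshake: sender address, port and protocol version.
        Hello { addr: SocketAddr, version: u32 },
        /// Ask a peer for its project inventory.
        GetInventory,
        /// Announce the projects a node has, with their current hashes.
        Inventory(Vec<(Urn, Hash)>),
        /// Ask a peer for addresses of other peers.
        GetAddresses,
        /// Respond with known peer addresses.
        Addresses(Vec<SocketAddr>),
    }
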
Note that there is no immediate need for a node to announce what projects it
is tracking. It can simply fetch the projects it is tracking from nodes,
and then announce them as inventory. A node shouldn't announce as inventory
a project it is tracking but doesn't have.
Nodes should also be *pruning* their routing table by comparing it to
received `Inventory` messages: if a provider no longer includes a project
in its inventory, that provider can be pruned from the project's routing
table entry.
This implies calling `GetInventory` on direct peers at a regular interval.
Some messages, such as `Inventory` may be forwarded to other
peers. In that case, the origin of the message will have to be included
so that nodes can update their routing tables accordingly. If we want to
prevent these messages from being forged, digital signatures can be used
(see the *Node Identity & Signatures* section below).
### Alternative wire protocol
We could propose a slightly different method of exchanging node information:
instead of `GetAddresses`, nodes could announce themselves to the network
via an `AnnounceNode` message signed by the node. These messages could be
gossiped, and would include the inventory. When their inventory changes, nodes
would re-announce themselves. These messages could then be buffered and
sent to new nodes connecting to the network.
Whether to announce the full inventory on node announcement, the remotes,
or simply the node address is an open question. In the latter case, inventory
messages would still be required.
Storage
-------
Briefly, storage for project data is taken care of by Git. In the `link` protocol,
we stored all Git refs and objects in a single "monorepo". Though this has
several advantages, there may be greater advantages in splitting the dataset
across projects: instead of one repo with all projects and all user forks,
we have one repo *per project*:
/<project>/git
/refs/namespaces/<remote>
/heads
/master
/tags
...
/refs/namespaces/<remote>
/heads
/master
/tags
...
This has numerous advantages, at least in theory:
* Private repositories become easier
* Git caching becomes more efficient, since the ODB is smaller
* Concurrency is simpler, since we can have locks at the repo level
* There is less strain on each ODB
Beyond the minimal protocol
---------------------------
Once a minimal protocol is live and working, we may consider the following
ideas and features.
### Node Identity & Signatures
When a message is routed between nodes on the network, to prevent forgery,
we may use digital signatures. Imagine that Eve is malicious and wants to
flood the network with misinformation; she can construct a message that says
that some node at a certain IP address hosts certain projects,
when that isn't true.
With digital signatures we prevent nodes from lying about each other, though
there is still the possibility of a node lying about itself. However, the lie
is then associated with the liar, and we can imagine future node reputation
systems that disincentivize such behavior.
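
For illustration, signing and verifying a message with Ed25519, assuming the
`ed25519-dalek` crate (with its `rand_core` feature enabled) and the `rand`
crate; the message encoding is a stand-in.

    use ed25519_dalek::{Signature, Signer, SigningKey, Verifier, VerifyingKey};
    use rand::rngs::OsRng;

    /// Sign the serialized bytes of a gossip message with the node's key.
    fn sign_message(key: &SigningKey, message: &[u8]) -> Signature {
        key.sign(message)
    }

    /// A relayed message is only accepted if the signature checks out
    /// against the public key of the node that claims to be its origin.
    fn verify_message(origin: &VerifyingKey, message: &[u8], sig: &Signature) -> bool {
        origin.verify(message, sig).is_ok()
    }

    fn main() {
        let key = SigningKey::generate(&mut OsRng);
        let message = b"inventory announcement bytes";
        let signature = sign_message(&key, message);
        assert!(verify_message(&key.verifying_key(), message, &signature));
    }
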
It's worth noting that the Bitcoin network uses no such signatures, and all
messages are to be handled as untrusted input. This could be "good enough"
for a first version of the protocol, but could introduce complexities later
on, once malicious actors find a way to exploit this. Since digital signatures
require the concept of node identity, introducing them does complicate the
protocol a fair bit.
### Transport encryption
Whether or not data should be encrypted as it is transferred between nodes
is an open question. Arguably, this would mainly benefit censorship
resistance, in case an ISP wanted to block certain content from being
transferred. The node's public encryption key could then be used as the
node identifier, and we could implement something like the Noise protocol[^0].
### Tor
If censorship resistance is paramount for a given node, it should be
able to leverage the Tor network and expose itself through a `.onion`
address. This is entirely opt-in, but clients should be able to
transparently connect to Tor nodes for content retrieval. Note that in
the case of Tor, node identity and transport encryption would be handled
by the Tor protocol.
### Routing nodes
It is conceivable that in the future, routing-only nodes would
exist. These nodes would maintain the routing table but wouldn't
track any projects.
### DoS mitigation
The most obvious attack on the above system is to flood the routing table
with garbage data. This deteriorates the service by causing most connections
and fetches to fail, since the majority of the routing table would be unusable.
There are several ways we can mitigate this problem:
* Expire routing table entries after a certain amount of time, if they haven't been refreshed (see the sketch after this list).
* For each routing entry, keep track of how many peers have communicated this
entry with us. Have shorter expiries for routing entries received only once.
* Require a small proof-of-work for new routing entries.
* Have a control plane which allows messages to be passed between trusted peers.
* Tag routing entries with their provenance, so that when a peer is identified
as malicious, all its routing entries can be removed.
* Use per-node stake certificates[^1].
* Use a trust graph: only accept routing entries from nodes you trust, and nodes
that they trust, up to N degrees out.
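
As an example of the first mitigation above, routing entries could carry the
time of the last announcement that refreshed them and be dropped once that
timestamp grows too old; the flat `(project, provider)` keying and the
unspecified timeout value are assumptions for this sketch.

    use std::collections::HashMap;
    use std::net::IpAddr;
    use std::time::{Duration, Instant};

    type Urn = String;

    /// Routing entries annotated with the time they were last refreshed by
    /// an announcement from the given provider.
    #[derive(Default)]
    struct ExpiringTable {
        entries: HashMap<(Urn, IpAddr), Instant>,
    }

    impl ExpiringTable {
        fn refresh(&mut self, urn: Urn, provider: IpAddr) {
            self.entries.insert((urn, provider), Instant::now());
        }

        /// Drop every entry that hasn't been refreshed within `max_age`.
        fn expire(&mut self, max_age: Duration) {
            let now = Instant::now();
            self.entries
                .retain(|_, refreshed| now.duration_since(*refreshed) <= max_age);
        }
    }
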
[^0]: https://noiseprotocol.org/noise.html
[^1]: https://lists.linuxfoundation.org/pipermail/lightning-dev/2020-November/002884.html
Alternative designs
-------------------
There are many potential alternatives to the protocol described here. One
of them, which has been tried before, is different in at least one fundamental
way: instead of keeping track of which nodes host which projects, we add
two wire messages: `Want(id)` and `Have(id)`. Then, nodes looking for a project
simply flood the network with the `Want` message, and nodes that have the
project respond with a `Have` message. We've opted not to follow this design
in the hopes that a well-connected network will offer better performance with
a routing table if it can be kept up to date. The designs however are not
mutually exclusive, and we can imagine adding these messages for the case
where a user cannot find a project it is looking for via the usual means.
We also talked about DHTs, which could be used to store the routing table
if partitioning was desired. This would work because the routing table
should be fairly static once built, as nodes that are hosting a project
will likely keep hosting that project. However, this does require a certain
uptime and availability from all nodes, which is hard to guarantee in a
permissionless network. The extra complexity of a DHT may also not be
warranted for a first version of the protocol.