A minimal radicle peer-to-peer protocol
=======================================
> ⚠️ This document has been superseded by RIP#1
Assuming we have a point-to-point replication protocol (Git), an
identity system and a means to resolve conflicts, the final question we
need to answer is that of the network architecture and topology. The
guiding principles here are similar to those of other peer-to-peer networks,
namely *censorship resistance*, *fault tolerance* and *resilience
against attacks*. Most of the properties and characteristics of the system
follow from these three principles.
The design we propose is inspired by Secure Scuttlebutt, as well as the
Bitcoin network. One characteristic these protocols share is a very
simple, unstructured network topology paired with a very basic
wire protocol. This fits our need for a minimal protocol that will serve
the first phase of the project, before any kind of mainstream adoption.
The network design has to serve one core purpose: given a project
identifier, a user should be able to retrieve the full project associated
with that identifier. Given that there is no natural incentive to host
arbitrary projects, nodes on the network should be given the option to
host only the projects they care about. For example, a company that uses
the Rust programming language may want to host the Rust compiler project
on their node, to ensure its continued availability on the network, and
to "give back" to the community. Another example could be a service
that hosts projects for a fee, paid by the project maintainers. In both
cases, the choice of which projects to host on a node is essential.
Equally important is the ability *not* to host projects that are unethical
in the eyes of the node operator.
For these reasons, we cannot deterministically compute on which node(s)
a given project will end up, as is the case with DHTs. Nodes are able
to choose what they host, and therefore the network is fundamentally
*unstructured*. Though there might be a benefit to arranging the peer
topology in a certain fashion (eg. to reduce message amplification),
this doesn't bring much to the table in terms of data lookups, and
increases the protocol complexity.
With that said, let's talk about what data is being exchanged on the network,
and how.
Protocol
--------
In essence, to answer the question of *where can I find a given project?*,
nodes on the network need to build a routing table.
This table should map between project identifiers (URNs) and the network
addresses (IP) of nodes that have those projects.

    did:rad:jnrx6t…mfdyy    54.122.99.1, 89.2.23.67
    did:rad:gi59bk…l2jfi    66.12.193.8, 89.2.23.67, 12.43.212.9
    …

In other words, a map of type `HashMap<Urn, Vec<IpAddr>>`. Since the amount
of data this represents fits on commodity hardware - even for millions of
projects - we don't need to partition it as we would with a DHT (though
there may be advantages other than disk space to partitioning it).
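
As a rough sketch, the routing table could be modelled like this in Rust. The
`Urn` type here is a placeholder for the real project identifier type, and a
set is used instead of a `Vec` so that providers stay deduplicated.

    use std::collections::{HashMap, HashSet};
    use std::net::IpAddr;

    /// Placeholder for a project identifier; the real type would wrap the
    /// decoded `did:rad:` URN.
    type Urn = String;

    /// For each project, the set of node addresses believed to have it.
    #[derive(Default)]
    struct RoutingTable {
        entries: HashMap<Urn, HashSet<IpAddr>>,
    }

    impl RoutingTable {
        /// Record that `provider` claims to have `urn`.
        fn insert(&mut self, urn: Urn, provider: IpAddr) {
            self.entries.entry(urn).or_default().insert(provider);
        }

        /// All known providers for a project, if any.
        fn providers(&self, urn: &Urn) -> Option<&HashSet<IpAddr>> {
            self.entries.get(urn)
        }
    }
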
To build this table, every node connecting to the network announces
to its peers what projects it has. Nodes may also share with neighbors
what other neighbors have, and so on. In this manner, every node in
the network will converge towards the same routing table. This
"eventually consistent" model is not guaranteed to ever converge fully,
but since it's always possible for clients to ask multiple nodes for
a given piece of data to increase the chances of finding a project,
it doesn't *have* to be fully consistent. Changes to what projects
a node has, ie. additions or deletions, are announced to peers in
the same fashion, and this information is propagated through the
network.
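
To make the convergence story a bit more concrete, here is how a node might
apply a gossiped announcement to the `RoutingTable` sketched above. The
`ProjectAnnouncement` type and the forwarding rule (only re-gossip what
actually changed our view) are assumptions, not part of the protocol.

    /// A gossiped announcement: `origin` claims to have gained or lost the
    /// listed projects. Builds on the `RoutingTable` and `Urn` sketch above.
    struct ProjectAnnouncement {
        origin: std::net::IpAddr,
        added: Vec<Urn>,
        removed: Vec<Urn>,
    }

    impl RoutingTable {
        /// Apply an announcement; returns `true` if it changed our view,
        /// in which case we may forward it to our other peers.
        fn apply(&mut self, ann: &ProjectAnnouncement) -> bool {
            let mut changed = false;
            for urn in &ann.added {
                changed |= self
                    .entries
                    .entry(urn.clone())
                    .or_default()
                    .insert(ann.origin);
            }
            for urn in &ann.removed {
                if let Some(providers) = self.entries.get_mut(urn) {
                    changed |= providers.remove(&ann.origin);
                }
            }
            changed
        }
    }
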
### Overview
Before we dive into some of the details of how the above would work,
let's think about the different actions a node may want to take:
* Finding peers to connect to
* Announcing what projects it has to the network
* Announcing to the network that one of its projects was updated
* Fetching updates for one of its projects, from the network
* Asking a peer what projects it has
We can break these down into three *types* of data being exchanged:
* Peer data, eg. IP addresses
* Project metadata, eg. the location and hash of a project
* Project data, ie. source code and commits of a project
It goes without saying that for a node to fetch project data, it
needs to know the location of that project, and for it to know
the location of the project, it needs to be connected to at least
one peer. Thus, the above also represents a state machine:
connect to peers -> fetch project metadata -> fetch project data
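
Sketched as a (deliberately naive) per-project state type; only the happy
path is modelled here, and the state names are just one way to slice it.

    /// Coarse per-project lifecycle of a node; error handling, retries and
    /// timeouts are omitted.
    enum NodeState {
        /// No peers yet: connect to bootstrap nodes.
        Connecting,
        /// Connected: gossip until we learn who provides the project.
        FetchingMetadata,
        /// Providers known: fetch the project data from one of them.
        FetchingData,
        /// Project replicated: announce it as part of our own inventory.
        Synced,
    }
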
### Bootstrapping
When a node joins the network for the first time, it will try to download
the projects it is interested in but doesn't have.
Nodes are configured with a list of project ids that they are to seek out and
replicate on the network. These are called "tracked" projects. Since users may
also directly "push" projects onto a node (given they are permitted to), nodes
will also have projects that no other node has. This is how all projects start
out on the network.
To fetch a project that it is missing, our node, once it is aware of
the location of said project, will connect to the remote node (the "provider")
in question and attempt to `git-fetch` the relevant refs. Since it's possible
for more than one provider to exist for a project, our node may choose one at
random, or attempt to fetch from *all* providers, to make sure it has the
latest state of the project.
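
A rough sketch of that fetch step, shelling out to `git fetch` once per
provider. The `git://<ip>/<urn>` URL scheme, the local repository path and
the function name are illustrative assumptions, not part of the protocol.

    use std::net::IpAddr;
    use std::process::Command;

    type Urn = String;

    /// Try to fetch a project's refs from every known provider and return
    /// how many fetches succeeded. Fetching from all providers maximizes
    /// the chance of ending up with the latest state.
    fn fetch_from_providers(repo_path: &str, urn: &Urn, providers: &[IpAddr]) -> usize {
        providers
            .iter()
            .filter(|addr| {
                // Hypothetical URL scheme; the actual transport is left open.
                let url = format!("git://{addr}/{urn}");
                Command::new("git")
                    .args(["-C", repo_path, "fetch", url.as_str()])
                    .status()
                    .map(|status| status.success())
                    .unwrap_or(false)
            })
            .count()
    }
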
A node joining the network for the first time will not know of any peers. Hence,
we require that the node client software ship with a set of "bootstrap" node
IP addresses. These are trusted nodes that can aid the new node in the
bootstrap process, by communicating the address of other peers on the network.
Though this is perhaps enough for a minimal protocol, using DNS names instead
would be more flexible: the parties operating those names can update the
associated IP addresses as needed.
Once a node is connected to an initial set of nodes, it may ask for the
addresses of other nodes, to build its address book, and ensure there are
always peers to connect to.
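
A minimal sketch of what the bootstrap configuration could look like; the
hostname, address and port below are made up, and whether DNS names or raw
addresses are used is up to the operator.

    use std::net::{SocketAddr, ToSocketAddrs};

    /// Bootstrap peers, shipped with the client or supplied by the user.
    /// Entries may be raw `ip:port` pairs or DNS names (illustrative only).
    const BOOTSTRAP_PEERS: &[&str] = &["seed.example.org:8776", "203.0.113.7:8776"];

    /// Resolve the bootstrap list into concrete socket addresses, silently
    /// skipping entries that fail to resolve.
    fn bootstrap_addresses() -> Vec<SocketAddr> {
        BOOTSTRAP_PEERS
            .iter()
            .flat_map(|peer| peer.to_socket_addrs().into_iter().flatten())
            .collect()
    }
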
### Keeping up to date
When a user pushes project updates to a node, the other nodes need to catch
up to these changes.
Since users, when updating project repositories, will likely not push to *all*
nodes tracking their project, there has to be a mechanism for nodes to
keep up to date with the latest changes of a project. To achieve this, the
node(s) receiving the push should send a message to the providers it knows
about for that project, and upon receiving that message, those providers may
`git-fetch` from the initiating node. This may present challenges if a connection needs
to be established every time, and so we propose that nodes with shared
projects maintain connections with each other. This is perhaps the only
area where network structure is relevant, and a data-oriented topology
would make routing of messages much more efficient. Thus, we can say
that nodes should prioritize connecting to peers with similar interests,
once they find out what those interests are, ie. once their routing table is
fleshed out.
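
A sketch of how the receiving side of such an update message might react,
reusing the hypothetical `Urn` alias and `fetch_from_providers` helper from
the bootstrapping section; the message shape is an assumption.

    use std::collections::HashSet;
    use std::net::IpAddr;

    /// "Project `urn` changed, fetch from me", sent by a node that has just
    /// received a push. Field names are illustrative.
    struct UpdateAnnouncement {
        urn: Urn,
        origin: IpAddr,
    }

    /// If we track the announced project, fetch the new refs straight from
    /// the announcing node.
    fn on_update(tracked: &HashSet<Urn>, repo_path: &str, ann: &UpdateAnnouncement) {
        if tracked.contains(&ann.urn) {
            let fetched = fetch_from_providers(repo_path, &ann.urn, &[ann.origin]);
            if fetched == 0 {
                eprintln!("failed to fetch {} from {}", ann.urn, ann.origin);
            }
        }
    }
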
### Untracking
If a node tries to retrieve a project from a provider, and that provider
responds that they don't have the project, the node should remove that
provider from the list of providers for that project, in its local table.
A provider may also fail to respond to such a request, leaving the requester
hanging; this too can be reason enough to remove that provider from the
project's entry.
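
Continuing the `RoutingTable` sketch from above, that pruning rule could look
like this; what exactly counts as "not responding" (timeout, number of
retries) is left open.

    impl RoutingTable {
        /// Forget that `provider` has `urn`, eg. after it answered that it
        /// doesn't have the project or failed to answer at all. Drops the
        /// whole entry once no providers remain.
        fn prune_provider(&mut self, urn: &Urn, provider: &IpAddr) {
            if let Some(providers) = self.entries.get_mut(urn) {
                providers.remove(provider);
                if providers.is_empty() {
                    self.entries.remove(urn);
                }
            }
        }
    }
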
### Project tracking and branches
When we talk about project hosting and tracking, what exactly do we mean?
Which *branches* of a project are hosted? The simple answer is that
we host all branches published by the project *delegates* (typically
the maintainers). This is a safe choice, and works, but there is one
shortcoming: contributors who aren't delegates won't have their changes
(forks) replicated. This becomes a problem when a contributor wants
to propose a new patch to a project, and the maintainer has no way
of retrieving the patch or associated code, which lives on a branch.
In the past, we have solved this with a "tracking graph" -- a social
graph that represents trust in the network. Delegates choose which
users to track, and the replication system then chooses which forks
to replicate based on that information. For example, a node could decide
to track all forks from users one degree out from a project's delegates.
Another node could choose to track forks two degrees out, etc.
Though this works, it doesn't solve the issue of first-time contributors
who aren't known to the delegates. In a minimal protocol, we can default
to replicating all forks, but this can eventually lead to abuse.
We leave this as an open question for future iterations of the protocol.
### Incentivization & data availability
As alluded to in the beginning of this document, nodes may want to host
projects that they value or want to support. This incentive is non-monetary,
but has the potential outcome of providing data availability that is
correlated with project popularity. This in essence is a good thing,
and will solve the data availability problem for the most popular
projects, but what of the less popular projects?
Since nodes choose what they host, they can take payment for hosting
specific projects. This is close to the traditional pay-to-host model,
and the network should be designed for these types of nodes to exist
alongside other nodes. Since this business model benefits from a growing
radicle ecosystem, we can assume that these providers will also host
projects for free, under a more basic hosting plan than the one offered to
paying customers. Hence we have a potential solution for the long tail.
Finally, for users who don't want to rely on the network, self-hosting
is always an option. Since the node software and protocol are not
resource intensive, it's plausible that users will be able to host
their own projects on commodity hardware. To make it even more
viable, we can consider a "light" mode of operation where only a partial
routing table is maintained by a user node. Of course, a solution to
NAT traversal will have to be implemented.
### Misbehavior
We won't dive deep into what happens when a node misbehaves on the network,
as this is a minimal protocol, but it's worth mentioning that nodes can
and should disconnect from peers that exhibit any of the following behaviors:
* Sending badly formed messages
* Not respecting the protocol semantics
* Sending too much data
* Not responding to requests
* Having too high a latency
* etc.
Nodes may keep a *ban list* to avoid re-connecting to peers which have
exhibited the above behavior, though we should be careful not to ban
a node simply because it went offline.
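
A sketch of such a ban list; the ban duration and the split between
"misbehaving" and "merely offline" are assumptions.

    use std::collections::HashMap;
    use std::net::IpAddr;
    use std::time::{Duration, Instant};

    /// Why a peer was disconnected. Only protocol violations lead to a ban;
    /// a peer that simply went offline is not penalized.
    enum DisconnectReason {
        MalformedMessage,
        ProtocolViolation,
        Flooding,
        Unresponsive,
        WentOffline,
    }

    #[derive(Default)]
    struct BanList {
        banned_until: HashMap<IpAddr, Instant>,
    }

    impl BanList {
        fn record(&mut self, peer: IpAddr, reason: DisconnectReason) {
            // One hour is an arbitrary choice for this sketch.
            if !matches!(reason, DisconnectReason::WentOffline) {
                self.banned_until
                    .insert(peer, Instant::now() + Duration::from_secs(60 * 60));
            }
        }

        fn is_banned(&self, peer: &IpAddr) -> bool {
            self.banned_until
                .get(peer)
                .map_or(false, |until| *until > Instant::now())
        }
    }
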
### Wire protocol
We won't go into detail on the wire protocol, but it's worth enumerating
the message types that need to exist for nodes to carry out the protocol
(a rough sketch in code follows the list):
* `Hello(...)`: A handshake message with node information, eg. address, port
and protocol version.
* `GetInventory`: A message to ask a node for its project inventory.
* `Inventory((id, hash)[])`: A message announcing what projects a node has.
This should include the hash of each project as well.
* `GetAddresses`: A message to request peer addresses.
* `Addresses(addr[])`: A message to send peer addresses.
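
The same messages as a Rust enum sketch; the field types, and the hash type
in particular, are placeholders rather than decisions.

    use std::net::SocketAddr;

    /// Placeholders for the project identifier and project state hash.
    type Urn = String;
    type Hash = [u8; 32];

    enum Message {
        /// Handshake: sender address, port and protocol version.
        Hello { addr: SocketAddr, version: u32 },
        /// Ask a peer for its project inventory.
        GetInventory,
        /// Announce the projects a node has, with their current hashes.
        Inventory(Vec<(Urn, Hash)>),
        /// Ask a peer for addresses of other peers.
        GetAddresses,
        /// Respond with known peer addresses.
        Addresses(Vec<SocketAddr>),
    }
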
Note that there is no immediate need for a node to announce what projects it
is tracking. It can simply fetch the projects it is tracking from nodes,
and then announce them as inventory. A node shouldn't announce as inventory
a project it is tracking but doesn't have.
Nodes should also be *pruning* their routing table by comparing it to
received `Inventory` messages: if a provider no longer includes a project
in its inventory, that provider can be pruned from the project's routing
table entry.
This implies calling `GetInventory` on direct peers at a regular interval.
Some messages, such as `Inventory` may be forwarded to other
peers. In that case, the origin of the message will have to be included
so that nodes can update their routing tables accordingly. If we want to
prevent these messages from being forged, digital signatures can be used
(see the *Node Identity & Signatures* section below).
### Alternative wire protocol
We could propose a slightly different method of exchanging node information:
instead of `GetAddresses`, nodes could announce themselves to the network
via an `AnnounceNode` message signed by the node. These messages could be
gossiped, and would include the inventory. When their inventory changes, nodes
would re-announce themselves. These messages could then be buffered and
sent to new nodes connecting to the network.
Whether to announce the full inventory on node announcement, the remotes,
or simply the node address is an open question. In the latter case, inventory
messages would still be required.
Storage
-------
Briefly, storage for project data is taken care of by Git. In the `link` protocol,
we stored all Git refs and objects in a single "monorepo". Though this has
several advantages, there may be greater advantages in splitting the dataset
across projects: instead of one repo with all projects and all user forks,
we have one repo *per project*:
/<project>/git
/refs/namespaces/<remote>
/heads
/master
/tags
...
/refs/namespaces/<remote>
/heads
/master
/tags
...
This has numerous advantages, at least in theory:
* Private repositories become easier
* Git caching becomes more efficient, since the ODB is smaller
* Concurrency is simpler, since we can have locks at the repo level
* There is less strain on each ODB
Beyond the minimal protocol
---------------------------
Once a minimal protocol is live and working, we may consider the following
ideas and features.
### Node Identity & Signatures
When a message is routed between nodes on the network, to prevent forgery,
we may use digital signatures. Imagine that Eve is malicious and wants to
flood the network with misinformation; she can construct a message that says
that some node at a certain IP address hosts certain projects,
when that isn't true.
With digital signatures we prevent nodes from lying about each other, though
there is still the possibility of a node lying about itself. However, the lie
is then associated with the liar, and we can imagine future node reputation
systems that disincentivize such behavior.
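
For illustration, signing and verifying a message with Ed25519, assuming the
`ed25519-dalek` crate (with its `rand_core` feature enabled) and the `rand`
crate; the message encoding is a stand-in.

    use ed25519_dalek::{Signature, Signer, SigningKey, Verifier, VerifyingKey};
    use rand::rngs::OsRng;

    /// Sign the serialized bytes of a gossip message with the node's key.
    fn sign_message(key: &SigningKey, message: &[u8]) -> Signature {
        key.sign(message)
    }

    /// A relayed message is only accepted if the signature checks out
    /// against the public key of the node that claims to be its origin.
    fn verify_message(origin: &VerifyingKey, message: &[u8], sig: &Signature) -> bool {
        origin.verify(message, sig).is_ok()
    }

    fn main() {
        let key = SigningKey::generate(&mut OsRng);
        let message = b"inventory announcement bytes";
        let signature = sign_message(&key, message);
        assert!(verify_message(&key.verifying_key(), message, &signature));
    }
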
It's worth noting that the Bitcoin network uses no such signatures, and all
messages are to be handled as untrusted input. This could be "good enough"
for a first version of the protocol, but could introduce complexities later
on, once malicious actors find a way to exploit this. Since digital signatures
require the concept of node identity, introducing them does complicate the
protocol a fair bit.
### Transport encryption
Whether or not data should be encrypted as it is transferred between nodes
is an open question. Arguably, this would mainly benefit censorship
resistance, in case an ISP wanted to block certain content from being
transferred. The node's public encryption key could then be used as the
node identifier, and we could implement something like the Noise protocol[^0].
### Tor
If censorship resistance is paramount for a given node, it should be
able to leverage the Tor network and expose itself through a `.onion`
address. This is entirely opt-in, but clients should be able to
transparently connect to Tor nodes for content retrieval. Note that in
the case of Tor, node identity and transport encryption would be handled
by the Tor protocol.
### Routing nodes
It is conceivable that in the future, routing-only nodes would
exist. These nodes would maintain the routing table but wouldn't
track any projects.
### DoS mitigation
The most obvious attack on the above system is to flood the routing table
with garbage data. This deteriorates the service by causing most connections
and fetches to fail, since the majority of the routing table would be unusable.
There are several ways we can mitigate this problem:
* Expire routing table entries after a certain amount of time, if they haven't been refreshed (see the sketch after this list).
* For each routing entry, keep track of how many peers have communicated this
entry with us. Have shorter expiries for routing entries received only once.
* Require a small proof-of-work for new routing entries.
* Have a control plane which allows messages to be passed between trusted peers.
* Tag routing entries with their provenance, so that when a peer is identified
as malicious, all its routing entries can be removed.
* Use per-node stake certificates[^1].
* Use a trust graph: only accept routing entries from nodes you trust, and nodes
that they trust, up to N degrees out.
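
As an example of the first mitigation above, routing entries could carry the
time of the last announcement that refreshed them and be dropped once that
timestamp grows too old; the flat `(project, provider)` keying and the
unspecified timeout value are assumptions for this sketch.

    use std::collections::HashMap;
    use std::net::IpAddr;
    use std::time::{Duration, Instant};

    type Urn = String;

    /// Routing entries annotated with the time they were last refreshed by
    /// an announcement from the given provider.
    #[derive(Default)]
    struct ExpiringTable {
        entries: HashMap<(Urn, IpAddr), Instant>,
    }

    impl ExpiringTable {
        fn refresh(&mut self, urn: Urn, provider: IpAddr) {
            self.entries.insert((urn, provider), Instant::now());
        }

        /// Drop every entry that hasn't been refreshed within `max_age`.
        fn expire(&mut self, max_age: Duration) {
            let now = Instant::now();
            self.entries
                .retain(|_, refreshed| now.duration_since(*refreshed) <= max_age);
        }
    }
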
[^0]: https://noiseprotocol.org/noise.html
[^1]: https://lists.linuxfoundation.org/pipermail/lightning-dev/2020-November/002884.html
Alternative designs
-------------------
There are many potential alternatives to the protocol described here. One
of them, which has been tried before, is different in at least one fundamental
way: instead of keeping track of which nodes host which projects, we add
two wire messages: `Want(id)` and `Have(id)`. Then, nodes looking for a project
simply flood the network with the `Want` message, and nodes that have the
project respond with a `Have` message. We've opted not to follow this design
in the hopes that a well-connected network will offer better performance with
a routing table if it can be kept up to date. The designs however are not
mutually exclusive, and we can imagine adding these messages for the case
where a user cannot find a project it is looking for via the usual means.
We also talked about DHTs, which could be used to store the routing table
if partitioning was desired. This would work because the routing table
should be fairly static once built, as nodes that are hosting a project
will likely keep hosting that project. However, this does require a certain
uptime and availability from all nodes, which is hard to guarantee in a
permissionless network. The extra complexity of a DHT may also not be
warranted for a first version of the protocol.