A minimal radicle identity protocol
===================================
> ⚠️ This document has been superseded by RIP#1
Solving the problem of p2p code collaboration requires a solution to three
separate problems:
1. Replication, or how to publish and retrieve data from the network.
2. Identity, or how to identify and verify data and its provenance on the network.
3. Conflict-resolution, or how to reconcile diverging changes to a given document.
Let's dive into these three topics and talk about what a minimal solution
might look like.
Replication
-----------
Replication is perhaps the best understood of these three problems, but
depending on the desired network topology, it can get very complex. For this
minimal protocol, we'll focus on replicating data amongst known but untrusted
peers. This means we will intentionally avoid the problem of peer discovery,
as well as locating nodes that have the data we want.
Since replication and storage are closely related, our solution to replication
should also include a solution to storage. Our solution must also excel at
replicating source code, since that is the primary type of data on the network.
For this reason, we choose to use `git` as the underlying protocol.
As shown by the current `link` protocol, it's possible to use `git` for storage
and replication. For storage, the *namespaces* feature is used with all data
stored in a single bare repository, and for replication, one of the `git`
protocols (eg. "smart" or "dumb" HTTP[^0]) can be used, over some transport.
On the question of authentication, `git` natively supports SSH as a transport,
and therefore SSH-compatible keys can be used to authenticate writes. This,
however, is limiting for reads, and therefore any node serving radicle content
would also want to set up an HTTPS-based git server.
Since radicle doesn't offer the ability for multiple users to write to the
same repository and branches, the question of authentication at the replication
layer boils down to: *who should be allowed to publish code to a given node?*
A node could be configured to either accept *all* users, or only a specific
set of users, which with SSH authentication would be represented as a list of
keys. More complex setups (eg. a tracking graph) are beyond the scope of this
design.
To prevent users from overwriting each other's git objects, we namespace
users according to their public keys, as we do currently. For example:

    project/
      remotes/
        hyn9diwfnytah…wdxqu1zo645pe/
          heads/
            master
            dev
        hyb5to4rshftx…qqe7f8wrpawxa/
          heads/
            master
            alice/feature-1

Each sub-tree represents a complete git tree that can be checked out into a
working copy and replicated individually.
To implement the above, we can turn to `libgit2`, or in the worst case use
`git` sub-processes. However, we'll have to prevent users from writing
to arbitrary refs, which can potentially be done with either a
custom `git-receive-pack` or a `pre-receive` hook in the bare repository.
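
As a rough illustration of the `pre-receive` approach, the check could look like the sketch below. Git feeds a `pre-receive` hook one line per updated ref on standard input, in the form `<old> <new> <ref>`. Everything else here is an assumption made for the example: the `refs/remotes/<key>/` layout, the function names, and the idea that the SSH layer hands the pusher's key identifier to the hook (for instance through an environment variable).

```python
import sys


def allowed_prefix(pusher_key: str) -> str:
    # Hypothetical layout: a user may only write under their own key.
    return f"refs/remotes/{pusher_key}/"


def rejected_refs(pusher_key, updates):
    """Return the refs in `updates` that fall outside the pusher's namespace.

    `updates` is an iterable of pre-receive input lines: "<old> <new> <ref>".
    """
    prefix = allowed_prefix(pusher_key)
    rejected = []
    for line in updates:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _old, _new, ref = parts
        if not ref.startswith(prefix):
            rejected.append(ref)
    return rejected


def main(pusher_key: str) -> int:
    # A non-zero exit status makes git refuse the whole push.
    bad = rejected_refs(pusher_key, sys.stdin)
    for ref in bad:
        print(f"rejected: {ref} is outside your namespace", file=sys.stderr)
    return 1 if bad else 0
```

A real hook would call `main` with the authenticated key and pass its return value to `sys.exit`.
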
To conclude, replication requires:
* Each user to have an SSH-compatible radicle key.
* Each node to run an SSH-enabled Git server as well as an HTTPS one.
* Each node to configure which keys should be allowed to publish data.
* Each node and user to have a local bare repository storing all radicle data.
An implementation of the Git-based storage is available as part of the
`link` protocol.
<!-- TODO: How are forks resolved? -->
<!-- TODO: User pushes to some nodes, then force pushes to others? -->
Identity
--------
The problem of identity is the problem of trust, and can be stated as follows:
given a unique identifier received out of band, and a method to ask any node
for data associated with this identifier, how can we ensure that the data
given to us by that node is legitimate? In other words, if we know the identifier
is correct, how can we also know that the data retrieved with it hasn't been
tampered with, and is indeed the correct data?
For static files, this problem has a simple solution: use a hash
of the file as the identifier. But for things like code repositories that
change over time, we need to always be able to verify the latest state with
the given stable identifier, so the content hash won't work, as it'll quickly
get out of date.
A very similar problem to this has been solved in blockchains: given a
genesis hash (the hash of the very first block), and a set of rules defining
what a valid change is (state transition function), it's possible to
verify any given state by verifying all changes (blocks) starting from the
original state (genesis) to the given state.
With this same method, we can choose an identifier that represents an identity's
initial state, and given a state transition function and all intermediate states,
verify all changes up to the latest one. Let's see how that can work in practice.
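
The general idea can be sketched in a few lines. This is only an illustration under stated assumptions: the identifier is taken to be the SHA-256 hash of the initial state, states are opaque byte strings, and `is_valid_transition` is a hypothetical stand-in for whatever rules the protocol ends up defining.

```python
import hashlib
from typing import Callable, Sequence


def identifier(initial_state: bytes) -> str:
    # Assumption: the stable identifier is the hash of the initial state.
    return hashlib.sha256(initial_state).hexdigest()


def verify_history(
    ident: str,
    states: Sequence[bytes],
    is_valid_transition: Callable[[bytes, bytes], bool],
) -> bool:
    """Check that `states` starts at the state named by `ident` and that
    every step from one state to the next is allowed by the rules."""
    if not states or identifier(states[0]) != ident:
        return False
    return all(
        is_valid_transition(prev, nxt)
        for prev, nxt in zip(states, states[1:])
    )
```

The identifier never changes, yet any later state can be checked against it as long as the intermediate states are available.
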
### Repositories & Projects
Since anyone can fork a repository and add their own changes to it, it's
necessary to have a way to know whether the changes are legitimate or not,
when downloading a repository from a peer, since there is no trusted server
from which to download an "official" copy.
In radicle, *projects*, which are essentially repositories with an associated
cryptographic identity, are the solution to this: they provide a "root of trust"
and a verifiable history that allows anyone to check whether the repository
they downloaded is the one they asked for.
To implement this verification process, we need a few things:
1. A document which encodes the rules for validating the project history
2. A stable identifier that points to the initial version of that document
3. A state transition function that can be used to compute the state of the project
as well as validate it, using the document and identifier as input
The simplest way to implement the above could be:
1. A plain text file (eg. TOML) inside the code repository, under `.radicle/project.toml`,
that encodes the metadata required for validation.
2. A SHA256 hash of the initial version of the `project.toml` as the stable identifier.
3. A CLI executable, eg. `rad-verify` that validates the project's history, taking
the SHA256 identifier and `project.toml` as inputs.
Let's dig into these a little more. The initial contents of `project.toml`
would have to look something like this:

    [project]
    name = "radicle-cli"
    description = "Radicle's CLI"
    parent = "6202914650b7fe1cde97bcb0f81f0bcdb184405f"

    [[delegate]]
    name = "cloudhead"
    id = "did:key:z6MkhaXgBZDvotDkL5257faiztiGiC2QtKLGpbnnEGta2doK"

The `name` and `description` fields are simply metadata that isn't used for
verification purposes. The `parent` is the repository's `HEAD` commit hash
*before* the identity document is committed to the repository. Essentially
it's the last commit that is unverified and "taken for granted". All
commits following that shall be verified in the verification process.
The `delegate` section contains one key per delegate. The keys are named
for ease of use, but the name isn't used for verification. The initial
document has one key, which is the key of the user who initialized the project.
Thus, to verify a project with a single delegate:
1. The repository is cloned from an arbitrary node, given the project identifier.
2. The initial version of the `.radicle/project.toml` file is checked out
and hashed. The hash is verified against the given project identifier.
3. The commit history is traversed, starting from the `parent` commit, as
specified in `.radicle/project.toml`.
4. For each commit following `parent`, the commit signature is verified
against the delegate key.
5. If the commit modifies the delegate set, the next commits are verified
using this new set.
6. Once `HEAD` is reached, and if there are no errors, the verification is
successful.
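
The steps above can be sketched as follows. This is a simplified model rather than an implementation of `rad-verify`: commits are plain records instead of git objects, the `Commit` fields are hypothetical, and signature checking is delegated to a caller-supplied `verify_sig` function, since the Python standard library has no Ed25519 support.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Commit:
    oid: str
    signer: str        # identifier of the key that signed the commit
    signature: bytes
    # Set only when the commit changes the delegates in project.toml.
    new_delegates: Optional[FrozenSet[str]] = None


def verify_project(
    project_id: str,
    initial_doc: bytes,
    initial_delegates: Iterable[str],
    commits: Iterable[Commit],
    verify_sig: Callable[[str, Commit], bool],
) -> bool:
    # Step 2: the initial document must hash to the project identifier.
    if hashlib.sha256(initial_doc).hexdigest() != project_id:
        return False
    delegates = set(initial_delegates)
    # Steps 3-4: walk the commits after `parent`, oldest first.
    for commit in commits:
        if commit.signer not in delegates:
            return False
        if not verify_sig(commit.signer, commit):
            return False
        # Step 5: a new delegate set applies to the commits that follow.
        if commit.new_delegates is not None:
            delegates = set(commit.new_delegates)
    # Step 6: the end of the history was reached without errors.
    return True
```

Note that a commit which changes the delegate set is itself checked against the old set; only later commits are checked against the new one.
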
In the case where there is more than one delegate, the question is whether
changes need more than one signature to be valid. If not, the above process
works as-is. If yes, commit trailers can be used to include more than one
signature per commit.
> To aggregate signatures before they are added to commits, COBs can be used
> via a familiar patch-based workflow (see the section on conflict resolution
> below).
Project maintainers can configure how many signatures are needed for a change
to be valid via a `threshold` field on a file glob, for example:

    [file.".radicle/project.toml"]
    threshold = 2

    [file."*"]
    threshold = 1

This means that two signatures are required for changes to the
`.radicle/project.toml` file, while only one signature is required to modify
other files.
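
A minimal sketch of how such thresholds might be evaluated is shown below. The text does not say how overlapping globs are resolved, so this example assumes the strictest matching rule wins, and that a commit touching several files needs the highest threshold among them. The function names and the `rules` mapping are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable

# Mirrors the example configuration: glob pattern -> threshold.
RULES = {".radicle/project.toml": 2, "*": 1}


def required_signatures(changed_paths: Iterable[str], rules: Dict[str, int]) -> int:
    """Highest threshold among all rules matching any changed path.

    Returns 0 when no paths are given.
    """
    required = 0
    for path in changed_paths:
        # Note: with fnmatch, "*" also matches "/" in paths.
        matching = [t for pattern, t in rules.items() if fnmatchcase(path, pattern)]
        if not matching:
            raise ValueError(f"no threshold rule matches {path!r}")
        required = max(required, max(matching))
    return required


def change_is_valid(changed_paths, signers, delegates, rules) -> bool:
    # Only signatures from current delegates count, and each counts once.
    valid_signers = set(signers) & set(delegates)
    return len(valid_signers) >= required_signatures(changed_paths, rules)
```

With these rules, a commit that touches both `src/main.rs` and `.radicle/project.toml` needs two delegate signatures.
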
With the above system, we're able to verify that *every commit* from the commit
at which the radicle project was initialized was signed either by the creator
of the project, or by a key that was trusted by the creator of the project, either
directly or indirectly via the delegation scheme. No matter where this
repository is retrieved from, if the identifier is trusted, the history can
be verified.
As an identifier format for projects, we may use DID for future
interoperability, eg. `did:rad:hnrkmx6trm4bu19bwa4apbxj8ftw8f7amfdyy`.
### Signed refs
There is one issue we haven't touched upon in the above section: how does a
user publish a branch of a project if they *haven't* authored or signed
its latest commit?
Suppose that Alice wants to share some code from project "Acme" with Bob,
but she isn't and never was a delegate of project Acme. However, since
they intend to collaborate on some feature of Acme, Alice wants to share
the branch head from which they should start their work. So, she
publishes a branch, call it `feature/1`, which for now points to the
same commit as `master`. Since she hasn't signed that branch's head,
there's no way for Bob to verify that Alice indeed intended to publish
that branch, unless it is served directly by Alice. If there is an
intermediary, Bob will have no way of verifying that the intermediary
hasn't changed the head pointed to by `feature/1`.
Hence, Alice needs a way to sign that head out-of-band, without adding
any commits to the branch. For this, we turn to *signed refs*, which are
signatures over some user's branch heads (and possibly also tags and other
refs). This allows Bob, given Alice's public key, to verify the branches
she published, regardless of whether she signed any commits.
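
One way to picture this is as a signature over a canonical listing of a user's refs. The sketch below assumes a payload format of one `<oid> <refname>` line per ref, sorted by ref name; that format and the function names are illustrative only. The signature check is passed in as `verify`, because the Python standard library does not provide Ed25519.

```python
from typing import Callable, Dict


def refs_payload(refs: Dict[str, str]) -> bytes:
    """Serialize a ref-name -> commit-id mapping deterministically.

    Sorting by ref name ensures that signer and verifier produce
    the same bytes regardless of the order refs were collected in.
    """
    lines = [f"{oid} {name}\n" for name, oid in sorted(refs.items())]
    return "".join(lines).encode("utf-8")


def verify_signed_refs(
    refs: Dict[str, str],
    signature: bytes,
    verify: Callable[[bytes, bytes], bool],
) -> bool:
    # `verify(payload, signature)` is a stand-in for a real check
    # against the publishing user's public key.
    return verify(refs_payload(refs), signature)
```

If an intermediary moves `feature/1` to a different commit, the payload changes and Alice's signature no longer matches.
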
We may use these signed refs for all branches published (even by delegates),
though it's worth emphasizing that these signatures only establish user
provenance, not canonicity at the project level. In other words, to verify
that a branch head is "official" as established by the project delegates,
Bob would still have to traverse the project history to first see *who*
are the delegates. Once computed, Bob can verify that the signed refs
of some supposed "canonical" branch come from one of the delegates.
### Users
As seen in the above section, user identifiers are used as delegates to projects.
We choose the `DID` standard for future-compatibility, but for now only support
the `key` method, which makes things easier for verification.
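
Part of what makes the `key` method easy to verify is that the public key is embedded in the identifier itself. In the `did:key` format, the part after `did:key:` is a multibase string (`z` meaning base58btc) that encodes a multicodec prefix followed by the raw key; for Ed25519 public keys the prefix bytes are `0xed 0x01`. The sketch below extracts the key using only the standard library; the function names are chosen for this example.

```python
_B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
_ED25519_PUB = b"\xed\x01"  # multicodec prefix for an Ed25519 public key


def b58decode(text: str) -> bytes:
    n = 0
    for ch in text:
        n = n * 58 + _B58.index(ch)  # raises ValueError on invalid characters
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    # Each leading "1" encodes one leading zero byte.
    leading_zeros = len(text) - len(text.lstrip("1"))
    return b"\x00" * leading_zeros + body


def ed25519_key_from_did(did: str) -> bytes:
    """Return the 32-byte Ed25519 public key embedded in a did:key."""
    prefix = "did:key:z"
    if not did.startswith(prefix):
        raise ValueError("not a base58btc did:key identifier")
    decoded = b58decode(did[len(prefix):])
    if len(decoded) != 34 or decoded[:2] != _ED25519_PUB:
        raise ValueError("not an Ed25519 did:key")
    return decoded[2:]
```

No network lookup or registry is involved: the identifier alone is enough to obtain the key used for signature checks.
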
Most systems designed today use the `ed25519` curve, but this poses an interesting
problem for us: can we sign messages using one of the multitude of software
and hardware signers available today? For example, Brave, the web browser,
has a built-in "Web3" wallet with signing capabilities. If we can harness
this for signing payloads, we'd make things simple for a lot of users.
One approach seen in the wild is to generate an `ed25519` private key using
entropy generated by a Web3 wallet, often by signing a well-known message.
The way this works is as follows:
1. User signs the message `Authenticating as <address> into Radicle`
2. The signature is then converted to a 32 byte array (eg. via hashing)
3. The 32 byte array is used as entropy for deterministically generating a
brand new `ed25519` private key. This is a one-way function. A public key
is then derived from the private key through usual means.
4. The user now has an `ed25519` key pair that can be used to sign radicle
payloads. This key can be added as a delegate to projects, and can always
be re-generated from the Web3 wallet.
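
Steps 1 and 2 can be sketched as below. SHA-256 is an assumption here (the text only says "eg. via hashing"), and the function names are made up for the example. Step 3 is left out because it needs an Ed25519 library, which the Python standard library does not include; the 32-byte seed returned here is what such a library would take as input.

```python
import hashlib


def login_message(address: str) -> bytes:
    # The well-known message the wallet is asked to sign (step 1).
    return f"Authenticating as {address} into Radicle".encode("utf-8")


def seed_from_signature(wallet_signature: bytes) -> bytes:
    """Reduce a wallet signature to 32 bytes of key material (step 2).

    Assumption: SHA-256 is used as the hash function.
    """
    return hashlib.sha256(wallet_signature).digest()
```

Re-generating the same key later relies on the wallet producing the same signature for the same message every time, which is worth confirming for any given wallet before depending on it.
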
Now, this doesn't actually associate the Ethereum address with the key. In
other words, given the `ed25519` key, there is no way of finding out the
address of the wallet that generated it. This can be considered a "feature",
but in cases where a user wants to associate an Ethereum address with their
key, they'll have to do this separately.
One way this can be done is via a simple message which would be signed with the
`ed25519` key and would say something like:

    Associating my Ethereum address <address> with my Radicle identity

This would act as a proof, and optionally could contain an embedded signature
from the Web3 wallet as well. This proof can then be embedded where it's
required for verification.
With that out of the way, we can rely on `ed25519` keys, which are also supported
by SSH, and thus the conditions required by replication are fulfilled: users
can use their radicle key to authenticate to Git servers.
In the future, if users desire more extensive user profiles associated with their
projects, they can opt for other DID methods, such as `3` which stands for
Ceramic's `3ID`[^1].
Conflict resolution
-------------------
The last problem we need to solve is how to reconcile potentially diverging
histories. This can naturally occur for example when multiple people are
commenting on an issue or patch locally and then syncing their changes
asynchronously.
One common solution to this is to use some form of CRDT[^3]. We opt to use
Automerge[^4] as the CRDT, and to allow for the histories to be replicated,
we choose a Git encoding. In other words, Automerge histories are represented
as Git objects (commits) and we thus get replication for free.
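
To give an intuition for why a CRDT removes the need for manual conflict resolution, here is a toy example using a grow-only set, one of the simplest CRDTs. It is far simpler than Automerge and does not use Automerge's API; the comment tuples are invented for the illustration.

```python
from typing import FrozenSet, Tuple

Comment = Tuple[str, str]  # (author, text), hypothetical shape


def merge(a: FrozenSet[Comment], b: FrozenSet[Comment]) -> FrozenSet[Comment]:
    # Set union is commutative, associative, and idempotent, so replicas
    # converge no matter in which order or how often they sync.
    return a | b


base = frozenset({("alice", "Opened the issue")})
bob_local = base | {("bob", "I can reproduce this")}
carol_local = base | {("carol", "Fix incoming")}

# Syncing in either order yields the same state.
assert merge(bob_local, carol_local) == merge(carol_local, bob_local)
```

Automerge applies the same convergence principle to much richer documents that also support edits and deletions.
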
An implementation of this already exists and is called *Collaborative Objects*
or "COBs".
[^0]: https://git-scm.com/book/en/v2/Git-on-the-Server-The-Protocols
[^1]: https://developers.ceramic.network/docs/advanced/standards/accounts/3id-did/
[^3]: https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type
[^4]: https://automerge.org/