A minimal radicle identity protocol
===================================

> ⚠️ This document has been superseded by RIP#1

Solving the problem of p2p code collaboration requires a solution to three
separate problems:

1. Replication, or how to publish and retrieve data from the network.
2. Identity, or how to identify and verify data and its provenance on the
   network.
3. Conflict-resolution, or how to reconcile diverging changes to a given
   document.

Let's dive into these three topics and talk about what a minimal solution
might look like.

Replication
-----------
Replication is perhaps the most well understood of these three problems, but
depending on the desired network topology, can get very complex. For this
minimal protocol, we'll focus on replicating data amongst known but untrusted
peers. This means we will intentionally avoid the problem of peer discovery, as
well as locating nodes that have the data we want.

Since replication and storage are closely related, our solution to replication
should also include a solution to storage. Our solution must also excel at
replicating source code, since that is the primary type of data on the network.
For this reason, we choose to use `git` as the underlying protocol.

As shown by the current `link` protocol, it's possible to use `git` for storage
and replication. For storage, the *namespaces* feature is used with all data
stored in a single bare repository, and for replication, one of the `git`
protocols (eg. "smart" or "dumb" HTTP[^0]) can be used, over some transport.

On the question of authentication, `git` natively supports SSH as a transport,
and therefore, SSH-compatible keys can be used to authenticate writes. This
however is limiting for reads, and therefore any node serving radicle content
would also want to set up an HTTPS-based git daemon.

Since radicle doesn't offer the ability for multiple users to write to the same
repository and branches, the question of authentication at the replication
layer boils down to: *who should be allowed to publish code to a given node?*
A node could be configured to either accept *all* users, or only a specific set
of users, which with SSH authentication would be represented as a list of keys.
More complex setups (eg. a tracking graph) are beyond the scope of this design.

To prevent users from overwriting each other's git objects, we namespace users
according to their public keys, as we do currently. For example:

    project/
      remotes/
        hyn9diwfnytah…wdxqu1zo645pe/
          heads/
            master
            dev
        hyb5to4rshftx…qqe7f8wrpawxa/
          heads/
            master
            alice/feature-1

Each sub-tree represents a complete git tree that can be checked out into a
working copy and replicated individually.

To implement the above, we can turn to `libgit2`, or in the worst case use
`git` sub-processes. However, we'll have to prevent users from writing to
arbitrary refs, which can potentially be done with either a custom
`git-receive-pack` or a `pre-receive` hook in the bare repository.

To conclude, replication requires:

* Each user to have an SSH-compatible radicle key.
* Each node to run an SSH-enabled Git server as well as an HTTPS one.
* Each node to configure which keys should be allowed to publish data.
* Each node and user to have a local bare repository storing all radicle data.

An implementation of the Git-based storage is available as part of the `link`
protocol.

Identity
--------
The problem of identity is the problem of trust, and can be stated as such:
given a unique identifier received out of band, and a method to ask any node
for data associated with this identifier, how can we ensure that the data given
to us by that node is legitimate? In other words, if we know the identifier is
correct, how can we also know that the data retrieved with it hasn't been
tampered with, and is indeed the correct data?

For static files, this problem has a simple solution: use a hash of the file as
the identifier. But for things like code repositories that change over time, we
need to always be able to verify the latest state with the given stable
identifier, so the content hash won't work, as it'll quickly get out of date.

A very similar problem to this has been solved in blockchains: given a genesis
hash (the hash of the very first block), and a set of rules defining what a
valid change is (state transition function), it's possible to verify any given
state by verifying all changes (blocks) starting from the original state
(genesis) to the given state.

With this same method, we can choose an identifier that represents an
identity's initial state, and given a state transition function and all
intermediate states, verify all changes up to the latest one. Let's see how
that can work in practice.

### Repositories & Projects

Since anyone can fork a repository and add their own changes to it, it's
necessary to have a way to know whether the changes are legitimate or not, when
downloading a repository from a peer, since there is no trusted server from
which to download an "official" copy.

In radicle, *projects*, which are essentially repositories with an associated
cryptographic identity, are the solution to this: they provide a "root of
trust" and a verifiable history that allows anyone to check whether the
repository they downloaded is the one they asked for.

To implement this verification process, we need a few things:

1. A document which encodes the rules for validating the project history
2. A stable identifier that points to the initial version of that document
3. A state transition function that can be used to compute the state of the
   project as well as validate it, using the document and identifier as input

The simplest way to implement the above could be:

1. A plain text file (eg. TOML) inside the code repository, under
   `.radicle/project.toml`, that encodes the metadata required for validation.
2. A SHA256 hash of the initial version of the `project.toml` as the stable
   identifier.
3. A CLI executable, eg. `rad-verify` that validates the project's history,
   taking the SHA256 identifier and `project.toml` as inputs.

Let's dig into these a little more. The initial contents of `project.toml`
would have to look something like this:

    [project]
    name = "radicle-cli"
    description = "Radicle's CLI"
    parent = "6202914650b7fe1cde97bcb0f81f0bcdb184405f"

    [[delegate]]
    name = "cloudhead"
    id = "did:key:z6MkhaXgBZDvotDkL5257faiztiGiC2QtKLGpbnnEGta2doK"

The `name` and `description` fields are simply metadata that isn't used for
verification purposes.

The `parent` is the repository's `HEAD` commit hash *before* the identity
document is committed to the repository. Essentially it's the last commit that
is unverified and "taken for granted". All commits following that shall be
verified in the verification process.

The `delegate` section contains one key per delegate. The keys are named for
ease of use, but the name isn't used for verification. The initial document has
one key, which is the key of the user who initialized the project.

Thus, to verify a project with a single delegate:

1. The repository is cloned from an arbitrary node, given the project
   identifier.
2. The initial version of the `.radicle/project.toml` file is checked out and
   hashed. The hash is verified against the given project identifier.
3. The commit history is traversed, starting from the `parent` commit, as
   specified in `.radicle/project.toml`.
4. For each commit following `parent`, the commit signature is verified against
   the delegate key.
5. If the commit modifies the delegate set, the next commits are verified using
   this new set.
6. Once `HEAD` is reached, and if there are no errors, the verification is
   successful.
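
The verification steps above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not the behaviour of
`rad-verify`: the names `Commit`, `project_id` and `verify_history` are
hypothetical, the history is modelled as an in-memory list rather than read
from Git, and the cryptographic signature check is assumed to have already
happened (each commit simply records which key produced a valid signature).

```python
import hashlib
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Sequence


@dataclass(frozen=True)
class Commit:
    """Hypothetical, simplified model of a commit following `parent`."""
    # Key whose signature over this commit is assumed to have been checked.
    signer: str
    # New delegate set, if this commit modifies the delegate section.
    new_delegates: Optional[FrozenSet[str]] = None


def project_id(initial_doc: bytes) -> str:
    """SHA256 of the initial version of project.toml, as a hex string."""
    return hashlib.sha256(initial_doc).hexdigest()


def verify_history(
    expected_id: str,
    initial_doc: bytes,
    initial_delegates: Iterable[str],
    commits: Sequence[Commit],
) -> bool:
    """Walk the commits from `parent` (exclusive) to `HEAD` (inclusive)."""
    # Step 2: the initial document must hash to the identifier we were given.
    if project_id(initial_doc) != expected_id:
        return False

    delegates = frozenset(initial_delegates)
    for commit in commits:
        # Step 4: the commit must be signed by a current delegate.
        if commit.signer not in delegates:
            return False
        # Step 5: later commits are checked against the updated set.
        if commit.new_delegates is not None:
            delegates = commit.new_delegates

    # Step 6: HEAD reached without errors.
    return True


if __name__ == "__main__":
    doc = b'[project]\nname = "demo"\n'
    history = [
        Commit("alice"),
        Commit("alice", frozenset({"bob"})),  # alice hands over to bob
        Commit("bob"),
    ]
    print(verify_history(project_id(doc), doc, {"alice"}, history))
```

Note that a commit which changes the delegate set is itself checked against the
*previous* set; only the commits after it are checked against the new one.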

In the case where there is more than one delegate, the question is whether
changes need more than one signature to be valid. If not, the above process
works as-is. If yes, commit trailers can be used to include more than one
signature per commit.

> To aggregate signatures before they are added to commits, COBs can be used
> via a familiar patch-based workflow (see below section on conflict
> resolution).

Project maintainers can configure how many signatures are needed for a change
to be valid via a `threshold` field on a file glob, for example:

    [file.".radicle/project.toml"]
    threshold = 2

    [file."*"]
    threshold = 1

This means that two signatures are required for changes to the
`.radicle/project.toml` file, while only one signature is required to modify
other files.

With the above system, we're able to verify that *every commit* from the commit
at which the radicle project was initialized was either signed by the creator
of the project, or a key that was trusted by the creator of the project, either
directly or indirectly via the delegation scheme. No matter where this
repository is retrieved from, if the identifier is trusted, the history can be
verified.

As an identifier format for projects, we may use DID for future
interoperability, eg. `did:rad:hnrkmx6trm4bu19bwa4apbxj8ftw8f7amfdyy`.

### Signed refs

There is one issue we haven't touched upon in the above section: how does a
user publish a branch of a project if they *haven't* authored or signed its
latest commit?

Suppose that Alice wants to share some code from project "Acme" with Bob, but
she isn't and never was a delegate of project Acme. However, since they intend
to collaborate on some feature of Acme, Alice wants to share the branch head
from which they should start their work. So, she publishes a branch, call it
`feature/1`, which for now points to the same commit as `master`.
Since she hasn't signed that branch's head, there's no way for Bob to verify
that Alice indeed intended to publish that branch, unless it is served directly
by Alice. If there is an intermediary, Bob will have no way of verifying that
the intermediary hasn't changed the head pointed to by `feature/1`. Hence,
Alice needs a way to sign that head out-of-band, without adding any commits to
the branch.

For this, we turn to *signed refs*, which are signatures over some user's
branch heads (and possibly also tags and other refs). This allows Bob, given
Alice's public key, to verify the branches she published, regardless of whether
she signed any commits.

We may use these signed refs for all branches published (even by delegates),
though it's worth emphasizing that these signatures only establish user
provenance, not canonicity at the project level. In other words, to verify that
a branch head is "official" as established by the project delegates, Bob would
still have to traverse the project history to first see *who* the delegates
are. Once computed, Bob can verify that the signed refs of some supposed
"canonical" branch come from one of the delegates.

### Users

As seen in the above section, user identifiers are used as delegates to
projects. We choose the `DID` standard for future-compatibility, but for now
only support the `key` method, which makes things easier for verification.

Most systems designed today use the `ed25519` curve, but this poses an
interesting problem for us: can we sign messages using one of the multitude of
software and hardware signers available today? For example, Brave, the web
browser, has a built-in "Web3" wallet with signing capabilities. If we can
harness this for signing payloads, we'd make things simple for a lot of users.

One approach seen in the wild is to generate an `ed25519` private key using
entropy generated by a Web3 wallet, often by signing a well-known message. The
way this works is as follows:

1. User signs the message `Authenticating as <address> into Radicle`
2. The signature is then converted to a 32 byte array (eg. via hashing)
3. The 32 byte array is used as entropy for deterministically generating a
   brand new `ed25519` private key. This is a one-way function. A public key is
   then derived from the private key through usual means.
4. The user now has an `ed25519` key pair that can be used to sign radicle
   payloads. This key can be added as a delegate to projects, and can always be
   re-generated from the Web3 wallet.

Now, this doesn't actually associate the Ethereum address with the key. In
other words, given the `ed25519` key, there is no way of finding out the
address of the wallet that generated it. This can be considered a "feature",
but in cases where a user wants to associate an Ethereum address with their
key, they'll have to do this separately. One way this can be done is via a
simple message which would be signed with the `ed25519` key and would say
something like:

    Associating my Ethereum address <address> with my Radicle identity

This would act as a proof, and optionally could contain an embedded signature
from the Web3 wallet as well. This proof can then be embedded where it's
required for verification.

With that out of the way, we can rely on `ed25519` keys, which are also
supported by SSH, and thus the conditions required by replication are
fulfilled: users can use their radicle key to authenticate to Git servers.

In the future, if users desire more extensive user profiles associated with
their projects, they can opt for other DID methods, such as `3` which stands
for Ceramic's `3ID`[^1].

Conflict resolution
-------------------
The last problem we need to solve is how to reconcile potentially diverging
histories. This can naturally occur for example when multiple people are
commenting on an issue or patch locally and then syncing their changes
asynchronously. One common solution to this is to use some form of CRDT[^3].
We opt to use Automerge[^4] as the CRDT, and to allow for the histories to be
replicated, we choose a Git encoding. In other words, Automerge histories are
represented as Git objects (commits) and we thus get replication for free. An
implementation of this already exists and is called *Collaborative Objects* or
"COBs".

[^0]: https://git-scm.com/book/en/v2/Git-on-the-Server-The-Protocols
[^1]: https://developers.ceramic.network/docs/advanced/standards/accounts/3id-did/
[^3]: https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type
[^4]: https://automerge.org/