# radicle-ci design
## radicle-ci architecture
## Discussion
- [Generic discussion on Zulip Stream](https://radicle.zulipchat.com/#narrow/stream/369277-heartwood/topic/radicle.20ci.20proof-of-concept)
- [Protocol definition announcement on Zulip Stream]()
### Good points
Here are the value propositions brought up so far:
1. Native integration with Radicle: as cloudhead mentions, existing solutions were not designed with radicle in mind.
2. Distributing the CI workload: as Vincenzo puts it, "every node can decide to be a radicle-node able to run a CI".
3. Consensus about CI workload results: as Vincenzo puts it, "we could also potentially integrate the CI result inside the protocol itself, and gossiping the result for a specific refs, and aggregate the result in some way".
There are several points in favor of running this, such as:
1. Other CI systems cannot run on a tiny machine like a Raspberry Pi 2/3;
2. Other CI systems cannot be integrated inside the protocol itself without writing some ad hoc code;
3. Other CI systems must be installed as separate services on the system, which I do not like because we already have a node and an httpd server;
4. You cannot support every architecture without duplicating the CI server; in my proof of concept, a node can advertise the kind of architecture or tests that it wants to support and run just those;
5. Other CI systems are based on Docker or similar, so you cannot build and sign the binary on a deployed machine because you must trust the Docker image (this is why Bitcoin Core has no automated CI release).
However, there are some cases where a CI can help a lot with parallel computing workloads, and this is not something this design rules out: in my design (today I will open a draft patch) you can run different parts of the pipeline on different machines; it is trivial.
The workload problem is solved by the p2p layer itself, by allowing a node to reject a request to run CI on a given ref, as sketched below.
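A minimal sketch of such a local acceptance policy, with entirely hypothetical types and thresholds, could look like this:

```rust
/// Hypothetical local policy: a node is free to reject a CI request,
/// so the network load balances itself without central coordination.
struct CiPolicy {
    max_concurrent_jobs: usize,
    supported_archs: Vec<String>,
}

/// Hypothetical request to run CI, derived from an announced ref.
struct CiRequest {
    repository: String, // stand-in for a RepositoryId
    target_arch: String,
}

impl CiPolicy {
    /// Accept the request only if we have spare capacity and we
    /// support the target architecture.
    fn accepts(&self, req: &CiRequest, running_jobs: usize) -> bool {
        running_jobs < self.max_concurrent_jobs
            && self.supported_archs.iter().any(|a| a == &req.target_arch)
    }
}

fn main() {
    let policy = CiPolicy {
        max_concurrent_jobs: 2,
        supported_archs: vec!["arm".into()],
    };
    let req = CiRequest {
        repository: "rad:example".into(),
        target_arch: "arm".into(),
    };
    println!("request for {}", req.repository);
    assert!(policy.accepts(&req, 1)); // capacity left: accept
    assert!(!policy.accepts(&req, 2)); // at capacity: reject
}
```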
### Canonical Result?
>ultimately, it seems to me, that this is about a "canonical CI job result" (in the same way we talk about canonical refs, etc.)
(vincent): Mh, yes and no: a canonical CI result makes sense for the same machine, but it does not carry over to a different machine. E.g., currently we do not build on Debian, so I expect that the job run on Arch Linux will return `true`, but we should also have the possibility to look at what happens inside the CI result of the node with Debian and not dismiss the Debian failure as just a flaky test.
### Docker Runner
- It is possible to implement a small client for the Docker API with which you can create a container and retrieve the container image.
- Example: [Dockworker](https://github.com/Idein/dockworker)
#### Design Point
- How to build a more complex Docker image that requires importing different Docker images (similar to the `use: <action name>` feature in GitHub Actions)? One possible shape is sketched below.
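As a sketch only, under the assumption that a workflow step is either a command or an imported action (all types below are hypothetical), the runner could resolve imports recursively before building the final container:

```rust
/// Hypothetical workflow model where a step can import another action,
/// similar to GitHub's `use: <action name>`.
enum Step {
    /// Run a shell command inside the container.
    Run(String),
    /// Import a published action, which expands to its own steps.
    Use(String),
}

struct Workflow {
    image: String, // base Docker image
    steps: Vec<Step>,
}

/// Resolve imported actions into a flat list of commands. A real runner
/// would fetch the action's definition; here we use a fixed lookup.
fn resolve(workflow: &Workflow, lookup: &dyn Fn(&str) -> Vec<String>) -> Vec<String> {
    workflow
        .steps
        .iter()
        .flat_map(|step| match step {
            Step::Run(cmd) => vec![cmd.clone()],
            Step::Use(action) => lookup(action),
        })
        .collect()
}

fn main() {
    let workflow = Workflow {
        image: "rust:latest".into(),
        steps: vec![
            Step::Use("setup-rust".into()),
            Step::Run("cargo test".into()),
        ],
    };
    let lookup = |name: &str| match name {
        "setup-rust" => vec!["rustup default stable".into()],
        _ => vec![],
    };
    println!("image: {}", workflow.image);
    for cmd in resolve(&workflow, &lookup) {
        println!("run: {cmd}");
    }
}
```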
### Patch Set
- PoC: [Radicle Patch](https://app.radicle.xyz/seeds/seed.radicle.xyz/rad:z3gqcJUoA1n9HaHKufZs5FCSGazv5/patches/b6dc670ac86ea97a2df63a3bab03871b6db06ada?tab=activity)
## Protocol Design
<pre>
Author: Vincenzo Palazzo <vincenzopalazzo@member.fsf.org>
Status: draft
Created: 2023-06-15
Version: 0
</pre>
### Announcing the CI feature for a specific repository
Running a CI (by either importing a standard one or implementing one from scratch) requires a way to communicate the interest and availability of a node to run a set of workflows defined inside the repository and communicate back the result of the workflow.
```rust
/// Announces the repositories a node hosts, together with the feature
/// bitfield (e.g. CI) the node offers for each of them.
struct InventoryAnnouncement {
    node: NodeId,
    inventory: Vec<(RepositoryId, Bitfield)>,
    timestamp: Timestamp,
    signature: Signature,
}
```
The `Bitfield` is a minimal encoding of the features that the node offers for the repository.
We could also consider a structure similar to Lightning, where features are divided into `even` and `odd` bits to mark them as `required` (even) or `optional` (odd).
> Imagine if we also came up with a solution like GitHub Pages, where the site is hosted under `https://app.radicle.xyz/seeds/rad.hedwing.dev/<repository-id>/pages/index.html`. In this case, we could have an additional optional field to allow a node to serve a site as well. However, this is left for future work.
Now that we are able to announce a node's information within the inventory by specifying its features, we define the `Continuous Integration` feature as bit `1` of the bitfield, leaving bit `0` as a placeholder for a future evolution toward `odd`/`even` features.
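A minimal sketch of this encoding, with hypothetical helper names; the only parts fixed by the proposal are that bit `1` means CI and bit `0` is reserved:

```rust
/// Hypothetical feature bits, following the Lightning-style convention
/// mentioned above: even bits mean `required`, odd bits mean `optional`.
const FEATURE_RESERVED: u64 = 0; // placeholder for future evolution
const FEATURE_CI: u64 = 1; // Continuous Integration

#[derive(Clone, Copy, Debug, Default)]
struct Bitfield(u64);

impl Bitfield {
    fn set(&mut self, bit: u64) {
        self.0 |= 1 << bit;
    }

    fn contains(&self, bit: u64) -> bool {
        self.0 & (1 << bit) != 0
    }

    /// Under the odd/even convention, even bits are required.
    fn is_required(bit: u64) -> bool {
        bit % 2 == 0
    }
}

fn main() {
    let mut features = Bitfield::default();
    features.set(FEATURE_CI);
    assert!(features.contains(FEATURE_CI));
    assert!(!Bitfield::is_required(FEATURE_CI)); // bit 1 is odd: optional
    assert!(Bitfield::is_required(FEATURE_RESERVED)); // bit 0 is even: required
}
```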
In conclusion, I propose adding the feature to the inventory message and not to the `NodeAnnouncement`, because these features relate to the repository and not to the node itself: for example, a node wants to run a CI for a specific repository. So putting the CI feature information inside the `NodeAnnouncement` looks like a violation of the separation of concerns.
However, I can imagine adding a CI feature to the `NodeAnnouncement` as well, in the case of a DNS seed bootstrap protocol (similar to Bitcoin's) where a node connects to suitable peers by filtering on features. In this case, a node may be interested in connecting with other nodes offering a specific feature. But this is left for future work or discussion.
### Final Comments
- **Status**: Rejected
Currently, all the refs are propagated to the peers, so the change to `InventoryAnnouncement` is not needed at the moment.
### Advertise the need to execute the CI workflow
Once all the nodes are able to express their interest in features by repository ID, the next step is defining how to propagate the request to run the CI and how to collect the results.
> As discussed in previous sections, we will have some work to do in defining the meaning of a **canonical CI result**, because a canonical result is difficult to define for CI: it is possible to hide CI failures under non-canonical results (e.g., flaky tests).
As defined in RIP 1, an update to a repository is announced via `RefsAnnouncement` and then propagated through the network. So, initially, we could try to use the `RefsAnnouncement` for patches or commits as a request to run the CI on all the nodes that have the CI feature bit enabled.
However, while this simplifies the design of the feature, it could complicate the `radicle-cob` code, which would then need to manage all the CI refs. Another solution is to manage all the CI workflows, in an initial phase, with a separate protocol and separate messages; then, once we are sure that this protocol works, we can fold it into the Heartwood protocol. This is left as a design-meeting discussion.
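If we went the separate-protocol route, the initial messages could be as simple as the following sketch (all names hypothetical, payloads deliberately minimal):

```rust
/// Hypothetical ad hoc messages for a CI protocol living alongside
/// Heartwood before being folded into it.
enum CiMessage {
    /// Ask peers with the CI feature enabled to run the workflows of a
    /// repository at a given commit.
    RunRequest {
        repository: String, // stand-in for a RepositoryId
        commit: String,     // stand-in for an Oid
    },
    /// Report the outcome of a run back to the network.
    RunResult {
        repository: String,
        commit: String,
        success: bool,
    },
}

fn main() {
    let msg = CiMessage::RunResult {
        repository: "rad:example".into(),
        commit: "deadbeef".into(),
        success: true,
    };
    match msg {
        CiMessage::RunRequest { repository, commit } => {
            println!("run CI for {repository} at {commit}");
        }
        CiMessage::RunResult { repository, commit, success } => {
            println!("CI for {repository} at {commit}: success = {success}");
        }
    }
}
```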
### Final Comments
- **Status**: Move to the next step
Using Announce Refs to communicate information about the CI runs is a good approach. We can include additional meta information about the runner (the node that runs the code), such as `{ "architecture": "arm", "os": "arch-linux", "core": 1, "kind-runner": "native", "jobs": [ "clippy", "tests" ] }`.
Furthermore, we can expose an endpoint via HTTPS or WSS to retrieve logs and stream real-time events.
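As a minimal sketch, assuming `serde`/`serde_json` for encoding, the runner metadata above could be modeled like this:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical runner metadata attached to a CI ref announcement,
/// mirroring the JSON example above.
#[derive(Debug, Serialize, Deserialize)]
struct RunnerInfo {
    architecture: String,
    os: String,
    core: u32,
    #[serde(rename = "kind-runner")]
    kind_runner: String,
    jobs: Vec<String>,
}

fn main() -> Result<(), serde_json::Error> {
    let info = RunnerInfo {
        architecture: "arm".into(),
        os: "arch-linux".into(),
        core: 1,
        kind_runner: "native".into(),
        jobs: vec!["clippy".into(), "tests".into()],
    };
    println!("{}", serde_json::to_string_pretty(&info)?);
    Ok(())
}
```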
### Collecting the CI workflow results
When a node finishes running a CI workflow, we need a way to communicate the result back, and there are two ways to do this:
1. An ad hoc message:
   - Pro: easy to deploy and easy to remove in the future in favor of a better proposal;
   - Con: a result is just a signed ref, no different from an issue comment or a patch review, so adding a new message would raise unanswered questions like "Why not add a new message for xyz?" and "What are the rules for when to add a new message?"
2. A CI result is just another announce refs, like issue comments and patch reviews.
### Final Comments
- **Status**: Move to the next step
Using the Announce Refs is a good experiment to conduct, and it would be beneficial to add a new COB for CI.
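As a purely illustrative sketch, with a hypothetical field set, such a CI-result COB payload could look like:

```rust
/// Hypothetical payload for a CI-result COB. Like an issue comment or a
/// patch review, it would live under a signed ref of the announcing node.
struct CiResult {
    commit: String,   // stand-in for the Oid the workflow ran against
    workflow: String, // e.g. "tests" or "clippy"
    success: bool,
    runner: String,   // stand-in for the NodeId of the node that ran it
}

fn main() {
    let result = CiResult {
        commit: "deadbeef".into(),
        workflow: "tests".into(),
        success: true,
        runner: "node-id".into(),
    };
    println!(
        "{} at {} on {}: {}",
        result.workflow,
        result.commit,
        result.runner,
        if result.success { "passed" } else { "failed" }
    );
}
```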
## Conclusion
The CI workflow in Radicle can be integrated with any kind of CI. However, an ad hoc solution will be easier to deploy and will allow the development team to focus entirely on Radicle problems rather than on implementing the Radicle-specific features that a particular CI implementation would require.
In addition, this document completely ignores the tracking policy defined in RIP 1, as it is assumed that the nodes that run the CI for a specific repository are part of the trusted peers or the repository group.
## Open questions
- DoS attack:
  - Open 100 patches at the same time against each node that is hosting a repository.
  - A possible solution is rate-limiting the announce refs that trigger the CI, or simply overriding the previous run with a new one for a specific commit or patch.
- How do we aggregate the CI results?
  - Define a good notion of canonical result that allows for different combinations of device architecture: for example, the canonical result could be computed per node for each specific architecture and OS (see the sketch after this list).
- Do we want to filter the nodes that receive the announce refs on the sender side or on the receiver side? Here's an example:
  - Alice wants to run a workflow on an ARM device to make sure that the software runs on other architectures as well.
  - Alice creates the workflow and marks it as `target: arm`.
  - In this case, should the announce refs to run the CI workflow be broadcast by Alice, with every receiver that is not on an ARM device ignoring it? Or should Alice filter and send the announce refs only to nodes with a specific architecture? In the latter case, the architecture of the node (and maybe the OS) would need to be a node feature advertised inside the `NodeAnnouncement` message.
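To make the aggregation question above concrete, here is a minimal sketch (hypothetical types) that groups results per `(architecture, os)` pair, so that each platform gets its own canonical verdict instead of a single global boolean:

```rust
use std::collections::HashMap;

/// Hypothetical per-node CI result, tagged with the runner's platform.
struct NodeResult {
    architecture: String,
    os: String,
    success: bool,
}

/// Group results per (architecture, os) and report a platform as passing
/// only if every node on that platform passed.
fn aggregate(results: &[NodeResult]) -> HashMap<(String, String), bool> {
    let mut verdicts: HashMap<(String, String), bool> = HashMap::new();
    for r in results {
        let key = (r.architecture.clone(), r.os.clone());
        let entry = verdicts.entry(key).or_insert(true);
        *entry = *entry && r.success;
    }
    verdicts
}

fn main() {
    let results = vec![
        NodeResult { architecture: "x86_64".into(), os: "arch-linux".into(), success: true },
        NodeResult { architecture: "x86_64".into(), os: "debian".into(), success: false },
        NodeResult { architecture: "arm".into(), os: "debian".into(), success: true },
    ];
    for ((arch, os), ok) in aggregate(&results) {
        println!("{arch}/{os}: {}", if ok { "pass" } else { "fail" });
    }
}
```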
## Discussion
Please review the status of each subsection of the proposal where a status is mentioned.
The discussion is now locked!