# Orb Supervisor Paradigm (Draft)
Ensure Orb functionality is split out and siloed. The goal is to minimize impact on signups by preventing
failures in less important parts of the orb stack from cascading to the critical signup path.
## Motivation
The monolithic approach of orb-core allows crashes in areas that aren't in the critical path of a signup to nevertheless impact a signup. In the short to medium term, on-orb signup complexity will increase drastically with the addition of signup extensions, fraud detection models, and multiple gabor identifiers. There is a growing need to clearly distinguish between what is *critical* to pushing a signup as successful, and what is not.
### Process coordination
In addition, there is an immediate need for both interprocess communication and execution coordination between the _`update-agent`_\* (responsible for OTA updates) and orb-core. These processes need to be able to coordinate to avoid CAN line saturation or restarts in the middle of an update. When a security-critical update is pushed, we need to ensure that signups are restricted and, if an orb has just powered on, that orb-core isn't and will not be started until the security update has been applied.
### Faster UX development
Restructuring the on-orb stack will allow writing an independent UX-server that listens to events from orb-core and, based on those, decides to play sounds or change LEDs. Because feedback loops and iteration cycles for UX will likely be much shorter than for orb-core development, this means that we can spin up a team that can work without affecting orb-core.
### Open sourcing
Open-sourcing our on-orb software is on the roadmap for 2022 and is something we publicly committed to. Releasing the orb-core mono-repo would likely not be feasible before a full audit, given the largely interconnected nature of all the pieces therein. As we look to bolster hiring and find goodwill with skeptics of the project, open-sourcing may turn out to be a critical piece in those processes.
### Integration testing
Integration-testing on-and-off the orb is made tricky by the inclusion of many `bindgen` C-bindings for various libraries in orb-core. To do an integration test of various pieces of orb-core (which might have mock-able inputs), one needs to bring in a number of dependencies and pre-requisites which may not have a direct impact or bearing on the piece to be tested. Specifically, there isn't anything that intrinsically couples the presence of real cameras or a real face to the testing of the signup UX flow. Rather, this is linked simply by project structure.
> Comment: (Necessary?)
Siloing functionality and opening the door for rapid integration testing/prototyping could prove to be powerful assets in the coming sprints as we attempt to stabilize and "professionalize" the signup flow. The use cases around splitting functionality are largely to do with UX and stability, but by introducing standalone processes as first-class projects with well-maintained support we might see new ideas and extensions come into existence in the future.
## Guide-level explanation
### Siloed components
A siloed component connects to and registers itself on the IPC bus, then subscribes to any signals/events it is interested in.
```rust
use std::time::{Duration, Instant};

/// Receiving "signup finished" events.
///
/// This is all pseudo-code intended to strike a balance between the mentioned IPC bus solutions
/// to avoid biasing strongly one way or another. This API _does not exist_ as written below.
fn main() -> Result<()> {
    // ...
    let subscriber = Bus::connect(Type::SUB, "org.worldcoin.bus1")?;
    subscriber.subscribe(vec![Topics::SIGNUP])?;
    let start = Instant::now();
    // Listen for roughly ten seconds, then exit.
    loop {
        // Decode the raw bus message; receive and decode errors are propagated.
        let event = Event::try_from(subscriber.recv()?)?;
        match event {
            Event::SignupFinished(true) => println!("signup finished successfully!"),
            Event::SignupFinished(false) => println!("signup failed :("),
            _ => {}
        }
        if start.elapsed() >= Duration::from_secs(10) {
            break;
        }
    }
    Ok(())
}
```
#### Event publishing
We use a bus to support easy extensions without significant modifications to existing processes and projects. This allows for greater composability when testing, and when writing for the first time.
On the sending side, we want to ensure that we only notify when there has been a state transition:
```rust
/// Though variable across actual IPC bus implementations, in our example the publisher is not
/// required to register process-specific events. This means that, in theory, any process could
/// publish any event.
///
/// We consider this approach reasonable in a highly-collaborative environment but it should be
/// understood that such a compromise may not scale with team-size nor outside the organization.
async fn main() -> Result<()> {
    let publisher = Bus::connect(Type::PUB, "org.worldcoin.bus1")?;
    select! {
        result = core::run_signup() => {
            // Publish as soon as the outcome is known, before any cleanup.
            publisher.publish(Topics::SIGNUP, Event::SignupFinished(result.is_ok()))?;
            // ... do real post-signup cleanup
        }
    }
    Ok(())
}
```
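The "only notify on a state transition" rule can be sketched without committing to a bus implementation. The example below is a minimal illustration that uses a standard-library channel as a stand-in for the bus; `TransitionPublisher`, `SignupState`, and the `SignupStarted` event are hypothetical names introduced here, not part of any existing API.

```rust
use std::sync::mpsc::{channel, Sender};

/// Hypothetical event type; only the signup outcome is carried.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    SignupStarted,
    SignupFinished(bool),
}

/// Coarse state tracked by the publishing process (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SignupState {
    Idle,
    InProgress,
    Finished { success: bool },
}

/// Wraps a sender (standing in for the bus) and publishes only on state changes.
struct TransitionPublisher {
    last: SignupState,
    tx: Sender<Event>,
}

impl TransitionPublisher {
    fn new(tx: Sender<Event>) -> Self {
        Self { last: SignupState::Idle, tx }
    }

    /// Records `next` and returns `true` if an event was published.
    fn observe(&mut self, next: SignupState) -> bool {
        if next == self.last {
            return false; // no transition, stay quiet
        }
        self.last = next;
        let event = match next {
            // Assumption: returning to idle is not announced in this sketch.
            SignupState::Idle => return false,
            SignupState::InProgress => Event::SignupStarted,
            SignupState::Finished { success } => Event::SignupFinished(success),
        };
        // A send error only means nobody is listening; the publisher carries on.
        self.tx.send(event).is_ok()
    }
}

fn main() {
    let (tx, rx) = channel();
    let mut publisher = TransitionPublisher::new(tx);
    for state in [
        SignupState::InProgress,
        SignupState::InProgress, // repeated observation, no event
        SignupState::Finished { success: true },
    ] {
        publisher.observe(state);
    }
    drop(publisher); // closes the channel so the iterator below terminates
    let events: Vec<Event> = rx.iter().collect();
    println!("{:?}", events);
}
```

Keeping the last observed state inside the publisher means callers can report state as often as they like without flooding subscribers.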
#### Event design
By notifying as early into the state change as possible, we give the other components a better chance to respond to the state change in a timely fashion. This reflects a key point: *events should convey the minimal information necessary to act upon them*.
With careful event design, we discourage remote procedure calls and avoid strict dependence on foreign APIs. An example of this is the intentional decision to **not** include "LedState" events, wherein another process might publish an "LedState" event and have the `led-server` subscribe to such a topic and perform operations accordingly. Doing so would violate the separation of active "Messages" (which this RFC makes no attempt to include or define) from "passive 'Events'".
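To illustrate the distinction, here is a minimal sketch in which publishers only state what happened and the LED process alone decides what to show. The event set, the `LedPattern` variants, and `pattern_for` are hypothetical and chosen purely for illustration.

```rust
/// Passive events: statements about what happened, with minimal payload (hypothetical set).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    SignupStarted,
    SignupFinished(bool),
    UpdateStarted,
}

/// LED patterns are private to the LED process; no other process names them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LedPattern {
    Spinner,
    GreenPulse,
    RedPulse,
    SlowBlue,
}

/// The subscriber, not the publisher, decides how an event is presented.
fn pattern_for(event: Event) -> LedPattern {
    match event {
        Event::SignupStarted => LedPattern::Spinner,
        Event::SignupFinished(true) => LedPattern::GreenPulse,
        Event::SignupFinished(false) => LedPattern::RedPulse,
        Event::UpdateStarted => LedPattern::SlowBlue,
    }
}

fn main() {
    let events = [
        Event::SignupStarted,
        Event::SignupFinished(true),
        Event::UpdateStarted,
    ];
    for event in events {
        println!("{:?} -> {:?}", event, pattern_for(event));
    }
}
```

Because the mapping lives entirely in the subscriber, changing a pattern never requires touching orb-core or `update-agent`.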
### Systemd
We use systemd to manage standalone processes and register their initialization. Most services will depend only on the presence of the event bus. Referring back to the first example for a simple "signup finished" listener program, the service file might look something like:
```
[Unit]
Description=Worldcoin Signup Finished-listener
After=worldcoin-supervisor.service
Requires=worldcoin-supervisor.service
[Service]
Type=simple
ExecStart=/usr/local/bin/signup-finished-listener
SyslogIdentifier=worldcoin-signup-finished-listener
Restart=on-failure
[Install]
WantedBy=multi-user.target
```
#### Let's add a party LED mode
- I want to implement a special UI control if someone taps the button 5 times
  - pull in `zbus` + our `can-rs`
  - open receive loop to receive `button pressed` message from MCU
    - (isotp) open listen-mode ISO-TP stream on broadcast address
    - (canfd) open `FrameStream` with the broadcast filter
  - connect to the supervisor bus (maybe the system bus?)
    - (dbus) call the LED interface methods
    - (generic bus) publish a `set LEDs (medium priority)` message
- **behind the scenes**
  - (isotp) the `isotp-courier` process manages the flow-control for receiving content
  - (all bus systems) the orb-supervisor initializes the bus to connect to at startup
  - the LED UX process receives the message on the bus and, if not in the middle of a higher-priority task, will execute the `set LEDs` instruction
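The tap-counting part of this walkthrough is independent of the bus and CAN choices, so it can be sketched on its own. The following is a rough illustration only: `TapDetector` is a hypothetical helper, and the two-second window is an assumption (the text above only says "5 times").

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Detects `required` button presses within a sliding `window`.
/// Both values are illustrative assumptions.
struct TapDetector {
    required: usize,
    window: Duration,
    presses: VecDeque<Instant>,
}

impl TapDetector {
    fn new(required: usize, window: Duration) -> Self {
        Self { required, window, presses: VecDeque::new() }
    }

    /// Feed one `button pressed` timestamp; returns `true` when the pattern completes.
    fn on_press(&mut self, at: Instant) -> bool {
        self.presses.push_back(at);
        // Drop presses that have fallen out of the window.
        while let Some(&oldest) = self.presses.front() {
            if at.saturating_duration_since(oldest) > self.window {
                self.presses.pop_front();
            } else {
                break;
            }
        }
        if self.presses.len() >= self.required {
            self.presses.clear(); // start over so a sixth tap does not retrigger
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut detector = TapDetector::new(5, Duration::from_secs(2));
    let start = Instant::now();
    for i in 0..5u64 {
        // Simulated presses 200 ms apart; a real process would take these from the MCU stream.
        let fired = detector.on_press(start + Duration::from_millis(200 * i));
        println!("press {}: party mode = {}", i + 1, fired);
    }
}
```

When `on_press` returns `true`, the process would then publish its `set LEDs (medium priority)` message or call the LED interface, depending on the bus chosen.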
## Reference-level explanation
Major orb-core restructuring is necessary to move UX handling over to the event bus. There are two main support components to enable this extraction:
- IPC event bus library
- CAN library stabilization
With the supporting components mentioned above, the audio and LED management are split into their own processes, started and managed by `systemd`. These both subscribe to `update-agent` and `orb-core` events. Strict adherence to the notion of "passive 'Events'" hands off arbitration of priority entirely to the implementing processes.
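Since priority arbitration is left to each implementing process, an LED process needs some local rule for deciding whether a new task may replace the running one. One possible rule is sketched below; the `Priority` levels, `Task`, and `LedArbiter` are hypothetical, and letting equal-priority tasks replace each other is an assumption rather than a decided policy.

```rust
/// Priorities local to the LED process (hypothetical levels, ordered low to high).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Priority {
    Low,
    Medium,
    High,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Task {
    name: &'static str,
    priority: Priority,
}

/// Tracks the task currently driving the LEDs.
struct LedArbiter {
    current: Option<Task>,
}

impl LedArbiter {
    fn new() -> Self {
        Self { current: None }
    }

    /// Accepts `task` unless a strictly higher-priority task is running.
    fn offer(&mut self, task: Task) -> bool {
        match self.current {
            Some(running) if running.priority > task.priority => false,
            _ => {
                self.current = Some(task);
                true
            }
        }
    }

    /// Marks the running task as done so that anything may run next.
    fn finish(&mut self) {
        self.current = None;
    }
}

fn main() {
    let mut arbiter = LedArbiter::new();
    let offers = [
        Task { name: "update-in-progress", priority: Priority::High },
        Task { name: "party-mode", priority: Priority::Medium },
    ];
    for task in offers {
        let accepted = arbiter.offer(task);
        println!("{} accepted: {}", task.name, accepted);
    }
    arbiter.finish();
}
```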
The `orb-supervisor` is concerned with three tasks:
- Dbus private-bus registration
- `update-agent` <-> `orb-core` execution coordination
- `update-agent` & `orb-core` shutdown coordination
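The coordination tasks can be pictured as a small decision function inside the supervisor. The sketch below is one possible reading of the requirements in the Motivation section (hold orb-core at boot while a security-critical update is pending, restrict signups if it is already running, and avoid restarts mid-update). All type and function names are hypothetical, and the handling of non-critical updates is an assumption.

```rust
/// What the supervisor knows about `update-agent` (hypothetical states).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UpdateState {
    NoUpdate,
    InProgress { security_critical: bool },
}

/// What the supervisor knows about `orb-core` (hypothetical states).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CoreState {
    Stopped,
    Running,
}

/// Action the supervisor takes in response (hypothetical set).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Decision {
    StartCore,
    HoldCore,
    RestrictSignups,
    NoAction,
}

/// Execution coordination between `update-agent` and `orb-core`.
fn decide(update: UpdateState, core: CoreState) -> Decision {
    match (update, core) {
        // Just powered on with a security update pending: do not start orb-core yet.
        (UpdateState::InProgress { security_critical: true }, CoreState::Stopped) => {
            Decision::HoldCore
        }
        // Already running: keep the process alive but stop accepting signups.
        (UpdateState::InProgress { security_critical: true }, CoreState::Running) => {
            Decision::RestrictSignups
        }
        // Assumption: non-critical updates do not block orb-core from starting.
        (_, CoreState::Stopped) => Decision::StartCore,
        (_, CoreState::Running) => Decision::NoAction,
    }
}

/// Shutdown coordination: never restart while any update is being applied.
fn may_shut_down(update: UpdateState) -> bool {
    update == UpdateState::NoUpdate
}

fn main() {
    let pending = UpdateState::InProgress { security_critical: true };
    println!("{:?}", decide(pending, CoreState::Stopped));
    println!("{:?}", decide(pending, CoreState::Running));
    println!("{:?}", decide(UpdateState::NoUpdate, CoreState::Stopped));
    println!("may shut down: {}", may_shut_down(pending));
}
```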
#### Dbus for IPC event bus
Dbus supports pub/sub events through the "signals" interface. Processes register themselves and their signals with the *worldcoin private bus* initialized by `zbus`. Using Dbus reduces the dependency complexity and could have huge gains in implementing IPC outside of the worldcoin private bus, extending into systemd state querying, NetworkManager communication, and more.
#### CAN communication + ISO-TP
CAN communications remain almost entirely unchanged. The only exception is a good-tenant expectation when working with the *broadcast* ISO-TP stream. All processes that engage with the *broadcast* ISO-TP stream do so in *LISTEN_MODE*, which disables the sending of Flow Control frames when receiving. This works in combination with an `isotp-manager` which holds open a non-*LISTEN_MODE* ISO-TP stream on the *broadcast* IDs, so that only one stream sends the Flow Control frames\*.
## Drawbacks
When using a daemon-based bus, "Event" transmission is guaranteed to be slower than a similar IPC solution without a centralized bus. The hope and intention of introducing events at the same time as we split apart functionality was to lessen this impact. This means that, for example, highly complicated LED patterns can still be used because there isn't a round-trip IPC action for every "Set LED" message. That being said, we may see this break down as more work is done in this system.
This IPC implementation does not cover all of the current IPC needs. As an example, the execution of PyCUDA-related tasks (Python execution of ML models) is time-sensitive and does not fit in this paradigm. We generally consider this okay, and encourage processes with latency-sensitive aspects to pull them directly into their immediate execution context. However, this means that we will need to maintain different IPC systems at the same time.
## Rationale and Alternatives
- Choosing Dbus for our main IPC event bus library allows us to leverage software that has been thoroughly battle-tested, and does not require us to re-implement an IPC library on top of Unix Domain Sockets or work with less actively maintained libraries. There have also been recent efforts to shed some of the legacy baggage of Dbus, and ChromeOS has published their "Dbus best practices" document. Finally, the `zbus` Rust bindings for Dbus are actively maintained and very intentionally thought-out to make the huge complexity of Dbus' features manageable.
**However**
- Dbus suffers from even more overhead than other IPC solutions due to its self-proclaimed aim to be a batteries-included, all-in-one Linux IPC solution. **The main alternative to Dbus, investigated in the writing of this RFC, is ZeroMQ**. ZeroMQ is extremely light-weight, similarly well-maintained, and the Rust bindings are similarly reasonable (although we would have to implement the serialization and deserialization to strings ourselves).
- Dbus' API, even with `zbus`, is very complicated when compared with other IPC solutions. The reliance on procedural macros would make auditing Dbus-reliant processes very difficult (?). Generally, the design is reminiscent of Java Spring.
- Splitting apart "functionality-isolated" components both at the process level and at the code level addresses many of the pain-points and motivations described in the "Motivation" section. However, there is already work being done on the process-isolation front within orb-core, as well as already-written examples of independent binary generation. While open-sourcing may still be affected, the solutions put forth in this RFC are not **strictly necessary** to lessen the impact of, for example, the audio device failing during a signup.
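To give a sense of the extra work the ZeroMQ alternative above would involve, here is a minimal sketch of hand-written string serialization for events using only the standard library. The wire format (`signup.finished true`), the `Event` variants, and `ParseEventError` are assumptions made for illustration; no ZeroMQ calls are shown.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical event set, kept small for the example.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    SignupStarted,
    SignupFinished(bool),
}

/// Serialization: one topic-like word, optionally followed by a payload.
impl fmt::Display for Event {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Event::SignupStarted => write!(f, "signup.started"),
            Event::SignupFinished(ok) => write!(f, "signup.finished {}", ok),
        }
    }
}

/// Returned when a received string is not a known event; carries the raw input.
#[derive(Debug, PartialEq, Eq)]
struct ParseEventError(String);

/// Deserialization: the inverse of `Display`, rejecting anything unexpected.
impl FromStr for Event {
    type Err = ParseEventError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let mut parts = s.split_whitespace();
        let event = match (parts.next(), parts.next()) {
            (Some("signup.started"), None) => Event::SignupStarted,
            (Some("signup.finished"), Some(flag)) => {
                let ok = flag
                    .parse::<bool>()
                    .map_err(|_| ParseEventError(s.to_owned()))?;
                Event::SignupFinished(ok)
            }
            _ => return Err(ParseEventError(s.to_owned())),
        };
        // Trailing tokens mean the message is malformed.
        if parts.next().is_some() {
            return Err(ParseEventError(s.to_owned()));
        }
        Ok(event)
    }
}

fn main() {
    let wire = Event::SignupFinished(true).to_string();
    let decoded: Result<Event, _> = wire.parse();
    println!("{} -> {:?}", wire, decoded);
}
```

Every new event variant would need matching arms on both sides, which is the maintenance cost that Dbus' typed signals would otherwise absorb.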
## Notable Mentions
- dbus-broker(1)
- cargo-generate
- zbus
- nng-rs
- nng ipc (https://nng.nanomsg.org/man/tip/nng_ipc.7.html)
- grpc latency (https://www.mpi-hd.mpg.de/personalhomes/fwerner/research/2021/09/grpc-for-ipc/)
- chromeos dbus best practices (https://chromium.googlesource.com/chromiumos/docs/+/2efe4b73ea2109870480a3d6148024686faf1e6e/dbus_best_practices.md#Avoid-depending-heavily-on-D_Bus_specific-concepts)