# Backstage High-level Design RFC

Backstage is an Actor model framework which seeks to benefit from Rust features to provide a safe and easy-to-use foundation upon which scalable and distributed applications can be built. Backstage differs from other actor frameworks in that it enforces strict supervision rules, like `Scopes`, to maintain a tree of `Actors` and data.

## Terms

### Aggregation List

* Runtime: Actor's context;
* Scope: Actor's global scope;
* ScopeId: Actor's scope `unique identifier`;
* Registry: Global structure which stores the active scopes;
* Actor: Event loop `state machine`;
* Service: Current status and some stats of the actor;
* Resource: Accessible `object`;
* Supervisor: Actor supervising other actors/children;
* else?

## Features

### Scopes

Scopes, in the context of an asynchronous framework, define async lifetimes for data contained within them. This also relates to the concept of supervision, because parent scopes must wait for their inner scopes to complete before exiting.

#### Implementation Details

A framework with built-in scoping gives the user assurances that their data is sound and will obey a set of rules:

1. Data visible in any parent scope will be visible to all child scopes.
2. Data can be added to the current scope, or to any scope above the current one.
3. Data owned by a scope is removed when that scope exits.
4. A scope cannot exit until all of its child scopes have exited.

#### Examples

##### `actor-refactor` branch

```rust=
// Creates the initial root scope and runtime
RuntimeScope::<ActorRegistry>::launch(|scope| {
    async move {
        // Spawn a system into this scope
        scope
            .spawn_system_unsupervised(Launcher, Arc::new(RwLock::new(LauncherAPI)))
            .await?;
        // Spawn a task into this scope
        scope
            .spawn_task(|task_scope| {
                async move {
                    for i in 0..10 {
                        // Query for the Launcher system from the scope
                        task_scope
                            .system::<Launcher>()
                            .await
                            .unwrap() // The system may not exist, so an Option is returned
                            .state // Access the shared state of the system
                            .read() // We only need a read lock
                            .await
                            // Here we invoke the public API provided by the Launcher struct
                            .send_to_hello_world(HelloWorldEvent::Print(format!("foo {}", i)), task_scope)
                            .await
                            .unwrap();
                        tokio::time::sleep(Duration::from_secs(1)).await;
                    }
                    Ok(())
                }
                .boxed()
            })
            .await;
        Ok(())
    }
    .boxed()
})
.await?;
```

##### `nightly` branch

```rust=
```

#### Open Questions

- Tokio tasks are not scoped and likely won't ever allow this. Do scopes make sense even if the scoping is "faked"?
- Should actors be indexed by type? This allows simple rules for fetching actors and enforces pools when using multiple actors of the same type in a scope, but it may be too limiting.

### Actors

Backstage is a framework built on the fundamental concept of an `Actor`: in essence, an asynchronously running event loop. An `Actor` may have internal state, defined by its struct, which is locally borrowed while the actor runs. Actors may also access outside `Resources` and depend on other `Actors` and `Resources`.
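Before the branch-specific examples, it may help to see the idea stripped of framework details. The sketch below is illustration only, using plain `tokio` rather than any Backstage API, and every name in it is hypothetical: an actor is a typed event channel plus a loop that exclusively borrows its own state.

```rust=
use tokio::sync::mpsc;

// Hypothetical event type for this sketch.
enum PrinterEvent {
    Print(String),
}

// Hypothetical actor state: only the event loop below ever touches it.
struct Printer {
    printed: usize,
}

impl Printer {
    // The "actor": an async event loop draining a typed channel.
    async fn run(mut self, mut inbox: mpsc::UnboundedReceiver<PrinterEvent>) {
        while let Some(event) = inbox.recv().await {
            match event {
                PrinterEvent::Print(s) => {
                    self.printed += 1;
                    println!("{} (message #{})", s, self.printed);
                }
            }
        }
        // Channel closed: the actor exits and its state is dropped.
    }
}
```

The `actor-refactor` example below layers channels, dependencies, supervision, and status reporting on top of this same basic loop.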
#### Examples

##### `actor-refactor` branch

```rust=
#[derive(Debug)]
pub struct HelloWorld {
    name: String,
    num: u32,
}

// Using the build proc macro to allow the builder pattern for this struct
#[build]
#[derive(Debug, Clone)]
pub fn build_hello_world(name: String, num: Option<u32>) -> HelloWorld {
    HelloWorld {
        name,
        num: num.unwrap_or_default(),
    }
}

#[async_trait]
impl Actor for HelloWorld {
    type Dependencies = ();
    type Event = HelloWorldEvent;
    type Channel = UnboundedTokioChannel<Self::Event>;

    async fn init<Reg: 'static + RegistryAccess + Send + Sync, Sup: EventDriven>(
        &mut self,
        _rt: &mut ActorScopedRuntime<Self, Reg, Sup>,
    ) -> Result<(), ActorError> {
        Ok(())
    }

    async fn run<Reg: 'static + RegistryAccess + Send + Sync, Sup: EventDriven>(
        &mut self,
        rt: &mut ActorScopedRuntime<Self, Reg, Sup>,
        _deps: Self::Dependencies,
    ) -> Result<(), ActorError>
    where
        Self: Sized,
        Sup::Event: SupervisorEvent,
        <Sup::Event as SupervisorEvent>::Children: From<PhantomData<Self>>,
    {
        rt.update_status(format!("Running {}", self.num)).await.ok();
        while let Some(evt) = rt.next_event().await {
            match evt {
                HelloWorldEvent::Print(s) => {
                    info!("HelloWorld {} printing: {}", self.num, s);
                    if rand::random() {
                        panic!("Random panic attack!")
                    }
                }
            }
        }
        rt.update_status(ServiceStatus::Stopped).await.ok();
        Ok(())
    }
}
```

##### `nightly` branch

```rust=
```

### Routers

A scope-local, dynamic router at the `Actor` level. Backstage intends to fulfill the requirements of any `actor model` based system, so `Actors` must be able to send events to each other using their `Pid/ScopeId`. The router allows developers to `send` and `route` supported events to any live `Actor`, identified by its `ScopeId`, from any scope in the system. Each entry in the `router` is a `trait object` for the supported event type, and entries can be dynamically added or removed. The router can be used for different purposes, such as dynamic dependencies or any `user defined` pub/sub application, but most importantly it exposes the Actor's `PublicAPI` to the `front-end`, etc.

### Resources

Data which is shared between multiple actors is called a `Resource`. This data can be requested from the central storage solution or depended on by `Actors`.

#### Examples

##### `actor-refactor` branch

```rust=
pub struct NecessaryResource {
    counter: usize,
}
...
rt.add_resource(Arc::new(RwLock::new(NecessaryResource { counter: 0 }))).await;
...
// Will link the resource so that this actor will be shut down if it ever gets removed
let linked_counter: anyhow::Result<Res<Arc<RwLock<NecessaryResource>>>> = rt
    .link_data::<Res<Arc<RwLock<NecessaryResource>>>>()
    .await
    .map_err(|_| anyhow!("Counter never became available!"));
...
// Will return a result indicating whether the resource currently exists
let opt_counter: anyhow::Result<Res<Arc<RwLock<NecessaryResource>>>> = rt
    .query_data::<Res<Arc<RwLock<NecessaryResource>>>>()
    .await
    .map_err(|_| anyhow!("Counter does not currently exist!"));
...
// Will wait for the resource to exist, similarly to linking,
// but will not shut down the actor if it is removed later
let awaited_counter: anyhow::Result<Res<Arc<RwLock<NecessaryResource>>>> = rt
    .request_data::<Res<Arc<RwLock<NecessaryResource>>>>()
    .await
    .map_err(|_| anyhow!("Counter never became available!"));
```

##### `nightly` branch

```rust=
```

### Systems

`Systems` are simply pairs of `Actors` and `Resources` which define the `Actor's` shared state. These may be conceptually useful for users who want to spawn actors that have some private state fields and some public state fields.
#### Examples

##### `actor-refactor` branch

```rust=
struct Launcher;
struct LauncherAPI;

impl LauncherAPI {
    pub async fn send_to_hello_world<Reg: 'static + RegistryAccess + Send + Sync>(
        &self,
        event: HelloWorldEvent,
        rt: &mut RuntimeScope<Reg>,
    ) -> anyhow::Result<()> {
        rt.send_actor_event::<Launcher>(LauncherEvents::HelloWorld(event)).await
    }
}

#[supervise(HelloWorld)]
#[derive(Debug)]
pub enum LauncherEvents {
    HelloWorld(HelloWorldEvent),
    Shutdown { using_ctrl_c: bool },
}

#[async_trait]
impl Actor for Launcher {
    ...
}

impl System for Launcher {
    type State = Arc<RwLock<LauncherAPI>>;
}
...
rt.spawn_system_unsupervised(Launcher, Arc::new(RwLock::new(LauncherAPI))).await?;
```

##### `nightly` branch

```rust=
```

### Pools

It often makes sense to group actors of the same type together into a pool (especially with scopes that limit actors by type). These actors can then be sent messages according to a routing metric defined by that pool. Some examples are:

- Randomly
- By a `Hash` key
- By the frequency of usage of actors in the pool (LRU)

#### Examples

##### `actor-refactor` branch

```rust=
let builder = HelloWorldBuilder::new().name("Hello World".to_string());
let my_handle = rt.handle();
// Start by initializing all the actors in the pool
let mut initialized = Vec::new();
for i in 0..5 {
    initialized.push(
        rt.init_into_pool_keyed::<MapPool<HelloWorld, i32>>(i as i32, builder.clone().num(i).build())
            .await?,
    );
}
// Next, spawn them all at once
for init in initialized {
    init.spawn(rt).await;
}

// An alternate way to do the same thing
// Spawn the pool
let mut pool = rt.spawn_pool::<MapPool<HelloWorld, i32>>().await;
// Init all the actors into it
for i in 5..10 {
    pool.init_keyed(i as i32, builder.clone().num(i).build()).await?;
}
// Finalize the pool
pool.spawn_all().await;
...
match evt {
    LauncherEvents::HelloWorld(event) => {
        info!("Received event for HelloWorld");
        if let Some(pool) = rt.pool::<MapPool<HelloWorld, i32>>().await {
            pool.send(&i, event).await.expect("Failed to pass along message!");
            i += 1;
        }
    }
    ...
}
```

##### `nightly` branch

```rust=
```

### Supervision Tree

Actors can spawn other actors as children, thus forming a supervision tree. This relates to the concept of scopes, but is technically separate, as it does not force the same scoping rules. Instead, supervision consists of actors notifying parents and children in various ways:

- Status changes
- Reporting exit
- Reporting errors

These notifications can be handled in various ways. Each has pros and cons, and likely the best solution is to use whichever is most appropriate for the situation.

#### Examples

##### `actor-refactor` branch

This branch's supervision model is currently managed by the `supervise` proc-macro. This macro generates the enums `Children` and `ChildStates`, which can be matched on within the status-change and exit-report data. This solution is temporary and not user-friendly.

```rust=
#[supervise(HelloWorld)]
#[derive(Debug)]
pub enum LauncherEvents {
    HelloWorld(HelloWorldEvent),
    Shutdown { using_ctrl_c: bool },
}
...
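// `evt` is handled inside the Launcher's event loop, with `rt` as its scoped runtime.
// Note that the `ReportExit` and `StatusChange` variants matched below do not appear
// in the enum definition above; they are presumably generated on `LauncherEvents`
// by the `#[supervise]` macro.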
match evt {
    LauncherEvents::ReportExit(res) => match res {
        Ok(s) => {
            info!("{} {} has shutdown!", s.state.name, s.state.num);
        }
        Err(mut e) => {
            info!("{} {} has shutdown unexpectedly!", e.state.name, e.state.num);
            match e.error.request() {
                ActorRequest::Restart => {
                    info!("Restarting {} {}", e.state.name, e.state.num);
                    let i = e.state.num as i32;
                    rt.spawn_into_pool_keyed::<MapPool<HelloWorld, i32>>(i, e.state).await?;
                }
                ActorRequest::Reschedule(dur) => {
                    let handle = rt.handle();
                    e.error = ActorError::RuntimeError(ActorRequest::Restart);
                    let evt = Self::Event::report_err(e);
                    tokio::spawn(async move {
                        tokio::time::sleep(dur).await;
                        handle.send(evt).ok();
                    });
                }
                ActorRequest::Finish => (),
                ActorRequest::Panic => panic!("Received request to panic....so I did"),
            }
        }
    },
    LauncherEvents::StatusChange(s) => {
        info!(
            "{} status changed ({} -> {})!",
            s.service.name(),
            s.prev_status,
            s.service.status()
        );
    }
    ...
}
```

##### `nightly` branch

```rust=
```

#### Direct Supervision

The parent actor receives these messages directly from its children and processes them in its event loop as it would any other message. This allows the parent to make decisions based on its current state.

##### Pros

- The parent's state can be used to make decisions about notification handling

##### Cons

- The parent's event loop must process these events, which may cause slowdowns with a large number of supervised children

#### One-to-one Indirect Supervision

The parent actor spawns a supervising actor for each child. The sole purpose of this intermediary actor is to handle notifications from that single child.

##### Pros

- Allows the parent to only handle events unrelated to supervision
- Supervisor actor prefabs can be provided to users for various strategies

##### Cons

- Cannot make decisions based on the parent's state unless it is part of a shared resource
- More complicated for users to implement
- Creates an extra layer in the tree which may complicate the user experience (i.e. dashboard)
- Potentially doubles the number of total tokio tasks in the system

#### One-to-many Indirect Supervision

The parent actor spawns a single supervising actor which serves as the intermediary actor for all of its children.

##### Pros

- Allows the parent to only handle events unrelated to supervision
- Supervisor actor prefabs can be provided to users for various strategies

##### Cons

- Cannot make decisions based on the parent's state unless it is part of a shared resource
- More complicated for users to implement
- Creates an extra layer in the tree which may complicate the user experience (i.e. dashboard)
- Increases the number of total tokio tasks in the system
- Spawning into a running scope from the parent may be complicated and involve messaging

#### Open Questions

- Is there any real use case for indirect supervision?

### Metrics

Built-in metrics for the framework would allow users to gather important data about applications using backstage. Data that should be gathered includes:

- Event channel size for each actor
- Status of each actor
- Supervision tree

Applications could take advantage of our built-in metrics component and register their own `prometheus` metric `collector` endpoints on the fly, which would be automatically unregistered once the actor reaches its end of life.

#### Examples

##### `actor-refactor` branch

```rust=
```

##### `nightly` branch

```rust=
```

#### Open Questions

- How should users request metrics?
  - I would say through an isolated prometheus server, or one unified websocket/prometheus endpoint; we can use warp for this (Louay).
- What should be used to gather metrics (prometheus / tokio console / other)?
  - Would keep prometheus (industry standard) for application metrics (Louay);
  - Maybe keep tokio console for debugging (Louay);
- What other metrics can we collect?

### Dependencies

Actors often depend on other actors in order to function, so there must be a way to link actors together in various ways:

- Hard dependency: The actor cannot run without it, and will shut down with an error
  - The actor can either wait for the dependency to exist or exit immediately
- Soft dependency: The actor can run without the dependency, but functions differently when it is available
  - The actor may retry fetching the dependency
- Cyclic dependency: Two actors may depend on each other simultaneously
  - This must not deadlock either actor

#### Examples

##### `actor-refactor` branch

```rust=
impl Actor for Example {
    type Dependencies = Pool<MapPool<Other1, Key>>;
    ...
}
...
impl Actor for Example {
    type Dependencies = (Act<Other1>, Res<Arc<RwLock<SomeResource>>>);
    ...
}
...
impl Actor for Example {
    type Dependencies = ();
    ...
    async fn run<Reg: 'static + RegistryAccess + Send + Sync, Sup: EventDriven>(
        &mut self,
        rt: &mut ActorScopedRuntime<Self, Reg, Sup>,
        _: Self::Dependencies,
    ) -> Result<(), ActorError>
    where
        Self: Sized,
        Sup::Event: SupervisorEvent,
        <Sup::Event as SupervisorEvent>::Children: From<PhantomData<Self>>,
    {
        // Functionally identical to:
        // type Dependencies = (Res<Arc<RwLock<NecessaryResource>>>, Act<HelloWorld>)
        let (counter, hello_world) = rt
            .link_data::<(Res<Arc<RwLock<NecessaryResource>>>, Act<HelloWorld>)>()
            .await?;
        ...
    }
}
```

##### `nightly` branch

```rust=
```

#### Open Questions

- Can we generate a graph (graphviz, like the one below) using actor dependencies?
  - Could this include runtime dependencies?
  - Could this include parent/child relationships?
  - Could this add shapes based on the type of relationship?
- Do we need a dynamic graph?

```dot
digraph Scylla {
    splines="true";

    scylla [shape="square"]
    cluster
    listener
    "node" [shape="doublecircle"]
    stage [shape="doublecircle"]
    websocket [shape="doublecircle"]
    ring [shape="square"]

    websocket -> scylla [dir=both arrowhead=dot arrowtail=odot]
    scylla -> { cluster, listener }
    cluster -> "node" [label="on add" arrowhead=diamond]
    listener -> websocket [label="on connection" arrowhead=diamond]
    ring -> reporter [arrowhead=dot]
    "node" -> stage [label="N=num_shards" arrowhead=diamond]

    subgraph cluster_stage {
        stage
        receiver
        reporter [shape="doublecircle"]
        sender
        send_recv [shape=point]

        stage -> reporter [label="N=num_reporters" arrowhead=diamond]
        stage -> send_recv [dir=none label="on scylla connection"]
        send_recv -> { sender, receiver }
        sender -> reporter [dir=both arrowhead=dot arrowtail=odot]
        receiver -> reporter [arrowhead=dot]
    }

    { rank=same; scylla; ring }
}
```

### Branch Interconnectivity

Sometimes actors from discrete branches need to access each other's data. If that data is scoped, it may not be a simple matter to collect it. One option is to add shareable data to the highest appropriate scope. For instance, a scylla app creates many children in a tree below it, but outside actors may wish to access the cluster specifically. Thus, the cluster may wish to add some data to the top-most level of scylla if it should be publicly accessible.

However, some data is inherent to the framework, such as the status of the actor. Actors who wish to know the status of the cluster may need a way to query that information without having access to the scope ID of the actor.
This may simply be built into the framework. We may want to automatically add a subscribable broadcast channel for each actor at its scope, which can be used by the parent or any other interested actor.

#### Examples

##### `actor-refactor` branch

```rust=
```

##### `nightly` branch

```rust=
```

#### Scope / Tree Pathing

It may be useful to allow users to query a scope ID / data / metadata from a scope using a file-system-like path:

`/scylla/cluster/node[address=127.0.0.1:9042]/stage[shard_id=3]`

Actors spawned in pools would need to be differentiated via some key(s), as the nodes and stages are above.

#### Examples

##### `actor-refactor` branch

```rust=
```

##### `nightly` branch

```rust=
```

### Prefabs

Many use-cases for the framework are generic enough that prefabricated actors, or sets of actors, can be provided to users. These prefabs should be essentially plug-and-play. One example is a simple websocket actor, which can be built and spawned given only the listen address and will forward incoming messages to its supervisor.

#### Examples

##### `actor-refactor` branch

```rust=
```

##### `nightly` branch

```rust=
```
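Neither branch currently ships such a prefab, so the sketch below is purely illustrative of the plug-and-play goal. It assumes only `tokio`, `tokio-tungstenite`, `futures`, and `anyhow`; the `SupervisorHandle` alias and the `websocket_prefab` function are hypothetical stand-ins for whatever handle and builder types the framework would actually provide.

```rust=
use futures::StreamExt;
use std::net::SocketAddr;
use tokio::net::TcpListener;
use tokio::sync::mpsc::UnboundedSender;
use tokio_tungstenite::tungstenite::Message;

// Stand-in for a real supervisor handle provided by the framework.
type SupervisorHandle = UnboundedSender<Message>;

/// Listens on `addr` and forwards every incoming websocket message to the supervisor.
async fn websocket_prefab(addr: SocketAddr, supervisor: SupervisorHandle) -> anyhow::Result<()> {
    let listener = TcpListener::bind(addr).await?;
    loop {
        let (stream, _peer) = listener.accept().await?;
        let supervisor = supervisor.clone();
        // One task per connection; a real prefab would likely spawn these as child actors.
        tokio::spawn(async move {
            if let Ok(mut ws) = tokio_tungstenite::accept_async(stream).await {
                while let Some(Ok(msg)) = ws.next().await {
                    if supervisor.send(msg).is_err() {
                        break; // The supervisor is gone; stop forwarding.
                    }
                }
            }
        });
    }
}
```

From the user's point of view, spawning the prefab would then amount to handing it a listen address and a supervisor handle, with shutdown and status reporting handled by the surrounding scope.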