Restructuing `Staker` implementation with modular architecture.

# Restructuing `Staker` implementation with modular architecture. PR: https://github.com/ReamLabs/ream/pull/679 ## Motivation In `3sf-mini` [implementation](https://github.com/ethereum/research/tree/master/3sf-mini) written by Vitalik, there is a `Staker` which is basically a node implementation. We already migrated our prior work([rust-3sf](https://github.com/ReamLabs/rust-3sf)) into our codebase in the [PR #672](https://github.com/ReamLabs/ream/pull/672), so we [do have](https://github.com/ReamLabs/ream/blob/9c5f3103477c8faf5b54c16b33ac5356aec8a231/crates/common/chain/lean/src/staker.rs#L21-L33) `Staker`. However, I was thinking of how our `lean_node` will be structured, and finally concluded that we need to separate the full responsiblities of `Staker` into multiple services. The reasons are: 1. `Staker` is overfitted to the "mock" p2p network. For example, in the mock network, it manually [increases its time step](https://github.com/ethereum/research/blob/d225a6775a9b184b5c1fd6c830cc58a375d9535f/3sf-mini/p2p.py#L218), which is not the case in our real-world devnet. 2. As a team, we have to separate the tasks, but one single file implementation of node is not suitable for dev collaboration. 3. `Staker` doesn't have functionality to sign the data. Also it is assumed that a `Staker` can only have one validator, which is not practical in the real world use case. So I propose to use the architecture that I will explain in a detail in this write-up. All of the core logics will definitely live in our codebase, but they will be scattered in each service with its interest. Each service has its own interest, so testing/developing for each service is viable in this architecture. I don't want a new architecture to be super complex, so minimized the components so that it is easy to understand. Feel free to give some feedback and concerns! ## Anatomy of `Staker` > You can see full Rust implementation of `Staker` in [this link](https://github.com/ReamLabs/ream/blob/9c5f3103477c8faf5b54c16b33ac5356aec8a231/crates/common/chain/lean/src/staker.rs). ```rust! pub struct Staker { pub validator_id: u64, pub chain: HashMap<B256, Block>, pub post_states: HashMap<B256, LeanState>, pub known_votes: Vec<Vote>, pub new_votes: Vec<Vote>, pub dependencies: HashMap<B256, Vec<QueueItem>>, pub genesis_hash: B256, pub num_validators: u64, pub safe_target: B256, pub head: B256, } ``` We can divide its members like: 1. Validator(`validator_id`) 2. Chain configuration(`genesis_hash`, `num_validators`) 3. Vote management(`known_votes`, `new_votes`) 4. P2P(`dependencies`) 5. State transition(`chain`, `post_states`, `safe_target`, `head`) Modification of `Staker` is triggerred either by 1) every `tick` (called every second) and 2) `receive` function. Let's get into each function. ```rust! /// Called every second pub fn tick(&mut self) -> anyhow::Result<()> { let current_slot = self.get_current_slot()?; let time_in_slot = { let network = self .network .lock() .map_err(|err| anyhow::anyhow!("Failed to acquire network lock: {err:?}"))?; network.time % SLOT_DURATION }; // t=0: propose a block if time_in_slot == 0 { if current_slot % self.num_validators == self.validator_id { // View merge mechanism: a node accepts attestations that it received // <= 1/4 before slot start, or attestations in the latest block self.accept_new_votes()?; self.propose_block()?; } // t=1/4: vote } else if time_in_slot == SLOT_DURATION / 4 { self.vote()?; // t=2/4: compute the safe target (this must be done here to ensure // that, assuming network latency assumptions are satisfied, anything that // one honest node receives by this time, every honest node will receive by // the general attestation deadline) } else if time_in_slot == SLOT_DURATION * 2 / 4 { self.safe_target = self.compute_safe_target()?; // Deadline to accept attestations except for those included in a block } else if time_in_slot == SLOT_DURATION * 3 / 4 { self.accept_new_votes()?; } Ok(()) } ``` `tick` is responsible for two things: 1. Perform validator duty (See `t=0` and `t=1/4`). 2. Update the state (See `t=2/4` and `t=3/4`) ```rust! /// Called by the p2p network fn receive(&mut self, queue_item: QueueItem) -> anyhow::Result<()> { match queue_item { QueueItem::BlockItem(block) => { let block_hash = block.tree_hash_root(); // If the block is already known, ignore it if self.chain.contains_key(&block_hash) { return Ok(()); } match self.post_states.get(&block.parent) { Some(parent_state) => { let state = process_block(parent_state, &block)?; for vote in &block.votes { if !self.known_votes.contains(vote) { self.known_votes.push(vote.clone()); } } self.chain.insert(block_hash, block); self.post_states.insert(block_hash, state); self.recompute_head()?; // Once we have received a block, also process all of its dependencies if let Some(queue_items) = self.dependencies.remove(&block_hash) { for item in queue_items { self.receive(item)?; } } } None => { // If we have not yet seen the block's parent, ignore for now, // process later once we actually see the parent self.dependencies .entry(block.parent) .or_default() .push(QueueItem::BlockItem(block)); } } } QueueItem::VoteItem(vote) => { let is_known_vote = self.known_votes.contains(&vote.data); let is_new_vote = self.new_votes.contains(&vote.data); if is_known_vote || is_new_vote { // Do nothing } else if self.chain.contains_key(&vote.data.head) { self.new_votes.push(vote.data); } else { self.dependencies .entry(vote.data.head) .or_default() .push(QueueItem::VoteItem(vote)); } } } Ok(()) } ``` `receive` handles message from the network as well as the **node itself**. ([See what `vote` does after building the vote.](https://github.com/ReamLabs/ream/blob/9c5f3103477c8faf5b54c16b33ac5356aec8a231/crates/common/chain/lean/src/staker.rs#L288)) Block (`QueueItem::BlockItem`) is only from the network, while Vote (`QueueItem::VoteItem`) can come from either network and node itself. ## Details for the architecture > [!Note] > The below diagram is AI-generated, but I think my AI did a good job, so I added it. ```mermaid graph TD subgraph Lean Node direction LR subgraph Services direction TB VS[LeanValidatorService] NS[LeanNetworkService] CS[LeanChainService] OS["Other Services (e.g., RPC, DB)"] end subgraph Shared State LC["LeanChain (Arc<RwLock<T>>)"] end %% Service Interactions with Shared State VS <-->|Reads/Writes| LC NS <-->|Reads/Writes| LC CS <-->|Reads/Writes| LC OS <-->|Reads/Writes| LC end %% External Triggers & Peers Clock[Clock / Timer ] P2P[External P2P Network] User[User / CLI] %% Data Flow & Triggers Clock -.->|t=0, t=1/4 Propose, Vote| VS Clock -.->|t=2/4, t=3/4 Update Safe Target, Accept Votes| CS User -- Load Keystores --> VS P2P <==>|Gossip: Blocks/Votes| NS %% Inter-Service Communication (MPSC Channels) NS -- "New Block/Vote (from peers)" --> CS VS -- "Own New Block/Vote (to broadcast)" --> NS VS -- "Own New Vote (for state update)" --> CS ``` ### `LeanChain` ```rust! /// [LeanChain] represents the state that the Lean node should maintain. /// /// Most of the fields are based on the Python implementation of [`Staker`](https://github.com/ethereum/research/blob/d225a6775a9b184b5c1fd6c830cc58a375d9535f/3sf-mini/p2p.py#L15-L42), /// but doesn't include `validator_id` as a node should manage multiple validators. #[derive(Clone, Debug)] pub struct LeanChain { pub chain: HashMap<B256, Block>, pub post_states: HashMap<B256, LeanState>, pub known_votes: Vec<Vote>, pub new_votes: Vec<Vote>, pub genesis_hash: B256, pub num_validators: u64, pub safe_target: B256, pub head: B256, } ``` `LeanChain` is a struct that holds *everything* that the node should be aware of. It excludes `validator_id` and `dependencies` in `Staker`. I could have suggested a better design but at this moment, I was trying to preserve our existing functions and code. ```rust! // Initialize the lean chain with genesis block and state. let (genesis_block, genesis_state) = lean_genesis::setup_genesis(); let lean_chain = Arc::new(RwLock::new(LeanChain::new(genesis_block, genesis_state))); // Initialize the services that will run in the lean node. let chain_service = LeanChainService::new(lean_chain.clone()).await; let network_service = LeanNetworkService::new(lean_chain.clone()).await; let validator_service = LeanValidatorService::new(lean_chain.clone(), Vec::new()).await; ``` `LeanChain` is wrapped with synchronization primitives, and provided to every services that needed. Each service should obtain read/write lock before accessing/modifying it. See the [rationale](#Why-shared-state) for choosing this way. ### Services ```rust! let (chain_sender, chain_receiver) = mpsc::unbounded_channel::<LeanChainServiceMessage>(); ``` Services will use `tokio::mpsc::channel` as a core communicating method. Each service will have one or two receivers(`rx`s), and also senders(`tx`s) to the target service. #### `LeanChainService` `LeanChainService` is responsible **for updating the `LeanChain` state**. `LeanChain` is updated when: 1. Every third (t=2/4) and fourth (t=3/4) ticks. This is triggered by itself. 2. Receiving new blocks or votes from the network. This is triggered by receiving a message from other services. It also has `dependencies` as a member of the service, which plays like a queue in original 3sf-mini implementation. #### `ValidatorService` `ValidatorService` is responsible for the following: 1. Perform validator duties in right tick. 2. Sign with its keystores. Keystores should be parsed from a file (maybe YAML) that is provided by the command line option. Currently we don't have any standard for saving the keystore, but I think it is not infeasible to implement this feature: just ship it first and follow the standard after. #### `NetworkService` `NetworkService` is responsible for the following: 1. Peer management. (At the last interop call (6th of Aug), we are aligned with connecting only with static peers for PQ devnet.) 2. Gossiping blocks and votes. It **only receives and forwards** the consensus types into other services. No decision making about received data in this service. #### Other services We might need different services if needed. For example, if we must persist the data in DB, we can spawn up the service for DB management. RPC service can also be added. ## Rationale ### Why shared state? I considered two ways to manage the state of node in two ways. One is using the well-known [Actor Model](https://en.wikipedia.org/wiki/Actor_model), and other is using shared state. (FYI: Currently, the Beacon node implementation of Ream follows the latter approach.) Actor Model can be represented as this statement: "Every components communicate with message, and the *actor* handles it." So, in our case, `LeanChainService` only holds `LeanChain` and no other services refer to it. If a service wants to fetch the head from the state, the service should send a message to `LeanChainService` that contains its intent. Actor Model is one of the best practice for designing scalable, distributed application, as well as it also removes some synchronization issues like deadlock. However, I think introducing this type of architecture can be **too verbose** compared to what we want to achieve. We need to define the events in `enum LeanNodeEvent` if we would add a new feature, which requires quite a lot of boilerplate code. Also, creating message and delivering it via channel can be one of the *possible* performance bottleneck. In conclusion, despite of the advantages that Actor Model provides, I would say we can choose the shared state model regarding our limited resources. ## Appendix ### Where do `Staker` methods move? `Staker` had those methods: - `latest_justified_hash` ([Link](https://github.com/ReamLabs/ream/blob/8f75c223d3b433c62cae033b6794dda57575ca71/crates/common/chain/lean/src/staker.rs#L71)) -> Moved to one of the methods `LeanChain`. ([Link](https://github.com/syjn99/ream/blob/96af999e76f21f0de74f9d3fdcb8da65582a6093/crates/common/chain/lean/src/lean_chain.rs)) - `latest_finalized_hash` ([Link](https://github.com/ReamLabs/ream/blob/8f75c223d3b433c62cae033b6794dda57575ca71/crates/common/chain/lean/src/staker.rs#L75C12-L75C33)) -> Moved to one of the methods `LeanChain`. ([Link](https://github.com/syjn99/ream/blob/96af999e76f21f0de74f9d3fdcb8da65582a6093/crates/common/chain/lean/src/lean_chain.rs)) - `compute_safe_target` ([Link](https://github.com/ReamLabs/ream/blob/8f75c223d3b433c62cae033b6794dda57575ca71/crates/common/chain/lean/src/staker.rs#L82)) -> Moved to one of the methods `LeanChain`. ([Link](https://github.com/syjn99/ream/blob/96af999e76f21f0de74f9d3fdcb8da65582a6093/crates/common/chain/lean/src/lean_chain.rs)) - `accept_new_votes` ([Link](https://github.com/ReamLabs/ream/blob/8f75c223d3b433c62cae033b6794dda57575ca71/crates/common/chain/lean/src/staker.rs#L96)) -> Moved to one of the methods `LeanChain`. ([Link](https://github.com/syjn99/ream/blob/96af999e76f21f0de74f9d3fdcb8da65582a6093/crates/common/chain/lean/src/lean_chain.rs)) - `recompute_head` ([Link](https://github.com/ReamLabs/ream/blob/8f75c223d3b433c62cae033b6794dda57575ca71/crates/common/chain/lean/src/staker.rs#L108)) -> Moved to one of the methods `LeanChain`. ([Link](https://github.com/syjn99/ream/blob/96af999e76f21f0de74f9d3fdcb8da65582a6093/crates/common/chain/lean/src/lean_chain.rs)) - `tick` ([Link](https://github.com/ReamLabs/ream/blob/8f75c223d3b433c62cae033b6794dda57575ca71/crates/common/chain/lean/src/staker.rs#L116)) -> Separated into two service (`ValidatorSercive` ([Link](https://github.com/syjn99/ream/blob/96af999e76f21f0de74f9d3fdcb8da65582a6093/crates/common/validator/lean/src/service.rs#L92-L153)) and `LeanChainService` ([Link](https://github.com/syjn99/ream/blob/96af999e76f21f0de74f9d3fdcb8da65582a6093/crates/common/chain/lean/src/service.rs#L79-L97))) to fit each interest. First and second ticks are for validator duty, and third and fourth ticks are for updating the state in right timing. - `get_current_slot` ([Link](https://github.com/ReamLabs/ream/blob/8f75c223d3b433c62cae033b6794dda57575ca71/crates/common/chain/lean/src/staker.rs#L151)) -> Calculating slot can be achieved by the global network spec. So I separated [this function as `slot.rs`](https://github.com/syjn99/ream/blob/96af999e76f21f0de74f9d3fdcb8da65582a6093/crates/common/chain/lean/src/slot.rs), not a method in particular struct. - `propose_block` ([Link](https://github.com/ReamLabs/ream/blob/8f75c223d3b433c62cae033b6794dda57575ca71/crates/common/chain/lean/src/staker.rs#L160)) -> Moved to one of the methods `LeanChain`. ([Link](https://github.com/syjn99/ream/blob/96af999e76f21f0de74f9d3fdcb8da65582a6093/crates/common/chain/lean/src/lean_chain.rs)) - `vote` ([Link](https://github.com/ReamLabs/ream/blob/8f75c223d3b433c62cae033b6794dda57575ca71/crates/common/chain/lean/src/staker.rs#L225)) - Logics for building a vote from the current state is in [`build_vote` method](https://github.com/syjn99/ream/blob/96af999e76f21f0de74f9d3fdcb8da65582a6093/crates/common/chain/lean/src/lean_chain.rs#L148). - `ValidatorService` [requests](https://github.com/syjn99/ream/blob/96af999e76f21f0de74f9d3fdcb8da65582a6093/crates/common/validator/lean/src/service.rs#L128) to `LeanChainService` for building a vote, [signs it](https://github.com/syjn99/ream/blob/96af999e76f21f0de74f9d3fdcb8da65582a6093/crates/common/validator/lean/src/service.rs#L146), and then [submit it again](https://github.com/syjn99/ream/blob/96af999e76f21f0de74f9d3fdcb8da65582a6093/crates/common/validator/lean/src/service.rs#L138-L144) to `LeanChainService` to handle it. We need to implement to submit the **signed** vote into the network later. - `receive` ([Link](https://github.com/ReamLabs/ream/blob/8f75c223d3b433c62cae033b6794dda57575ca71/crates/common/chain/lean/src/staker.rs#L299)) - All logics are [handled in `LeanChainService`](https://github.com/syjn99/ream/blob/96af999e76f21f0de74f9d3fdcb8da65582a6093/crates/common/chain/lean/src/service.rs#L109-L120) by receiving messages [from the channel](https://github.com/syjn99/ream/blob/96af999e76f21f0de74f9d3fdcb8da65582a6093/crates/common/chain/lean/src/service.rs#L98-L100).