# Sync from arbitrary state

###### tags: `weak subjectivity`, `sync`

**Author(s):** Victor Farazdagi (Prysmatic Labs)

*Last Updated: Apr 19, 2021*

[TOC]

## 0. Overview

This document describes the steps necessary to implement syncing from an arbitrary initial state. The primary aim is to allow starting a beacon node from any given state, as if it were a genesis state.

Once the design outlined in this document is implemented, we can make sure that such a sync works with the `--weak-subjectivity-checkpoint block_root:epoch_number` flag, i.e. syncing is only allowed if, in addition to the initial state, a weak subjectivity checkpoint is provided (and the state is within the weak subjectivity period). That will allow us to sync from an arbitrary state without worrying about long-range attacks.

## 1. Distributing initial state

We need methods to specify the initial state when running a beacon node. Ideally, we should support several such methods.

Here is the list (in priority order) of possible initial state distribution methods:
1. Load state from a compressed local or remote file.
1. Embed the state itself right into binaries.
1. Embed URLs of trusted parties into binaries, so that the initial state can be fetched when needed.

:::info
**One more way of state distribution (highly experimental)**

We can consider one more way: allow downloading the state for a given weak subjectivity checkpoint. We can expose `/beacon/ws-checkpoints/:block_root:epoch_number/state` and allow downloading the state for a given checkpoint. So, when only the `--weak-subjectivity-checkpoint` flag is provided, the state is downloaded from our trusted beacon node. Therefore, a user can copy/paste checkpoint info from our GitHub and start a node, which will be able to sync from that checkpoint (by first downloading the corresponding state).

This method will certainly NOT be implemented during the first iteration, but is something to think about.
:::

All methods together provide a very robust way of loading the initial state:

1. It starts with checking whether a path to a file containing the compressed state is specified.

```flow
st=>start: Start node
e=>end: Run node
subCheckEmbeddedState=>subroutine: Check embedded state
opLoadState=>operation: Load state
opIgnoreState=>operation: Ignore state
opDownloadState=>operation: Download state
opSaveState=>operation: Save and use state
condEmptyDB=>condition: Empty DB?
condHasInitState=>condition: Initial state provided?
condIsLocalFile=>condition: Is local file?
condIsWithinWS=>condition: Within WS period?
condDownloadIsOK=>condition: Downloaded OK?

st->condEmptyDB
condEmptyDB(yes)->condHasInitState
condEmptyDB(no)->opIgnoreState->e
condHasInitState(yes)->condIsLocalFile
condHasInitState(no)->subCheckEmbeddedState
condIsLocalFile(yes)->opLoadState
condIsLocalFile(no)->opDownloadState
opDownloadState(right)->condDownloadIsOK
opLoadState->condIsWithinWS
condDownloadIsOK(yes)->opLoadState
condDownloadIsOK(no)->subCheckEmbeddedState
condIsWithinWS(yes)->opSaveState(right)->e
condIsWithinWS(no)->subCheckEmbeddedState
```

2. If the initial state is not specified as a path to a local file or a URL to a remote file, or it cannot be used (stale state, failed download, etc.), the node checks whether there is an embedded state.

```flow
st=>start: Check embedded state
e=>end: Run node
subCheckEmbeddesURLs=>subroutine: Check embedded URLs
opLoadState=>operation: Load state
opIgnoreState=>operation: Ignore state
opDownloadState=>operation: Download state
opSaveState=>operation: Save and use state
condEmbeddedStateProvided=>condition: Has embedded state?
condIsWithinWS=>condition: Within WS period?

st(right)->condEmbeddedStateProvided
condEmbeddedStateProvided(yes)->opLoadState
condEmbeddedStateProvided(no)->subCheckEmbeddesURLs
opLoadState(right)->condIsWithinWS
opIgnoreState->e
condIsWithinWS(yes)->opSaveState(right)->e
condIsWithinWS(no)->subCheckEmbeddesURLs
```

3. Finally, the node checks whether it can obtain a state from embedded trusted 3rd party URLs.

```flow
st=>start: Check embedded URLs
e=>end: Run node
opLoadState=>operation: Load state
opIgnoreState=>operation: Ignore state
opDownloadState=>operation: Download state
opSaveState=>operation: Save and use state
condIsWithinWS=>condition: Within WS period?
condHasEmbeddedURLs=>condition: Has non-checked embedded URLs?
condDownloadIsOK=>condition: Downloaded OK?

st(right)->condHasEmbeddedURLs
condHasEmbeddedURLs(yes, right)->opDownloadState
condHasEmbeddedURLs(no)->opIgnoreState
condIsWithinWS(yes)->opSaveState(left)->e
condIsWithinWS(no)->opIgnoreState
condDownloadIsOK(yes)->opLoadState
condDownloadIsOK(no)->condHasEmbeddedURLs
opDownloadState(right)->condDownloadIsOK
opLoadState->condIsWithinWS
opIgnoreState->e
```

### 1.0 Invariants

Regardless of how the initial state was obtained, there are a number of invariants on how it will be processed/used:

#### 1.0.1 Initial state is used for node initialization

The initial state is for node **initialization**, so it will be used only if the node is started for the first time, or if the node is started with the `--clear-db` flag. If the database is non-empty, the provided initial state is ignored and a warning is emitted.

*Note: this invariant makes it easy to reason about initial states. Otherwise, if a node is restarted with another initial state, we have to decide how to populate several gaps `genesis -> ..gap.. -> state1 -> ..gap.. -> state2`. It is way easier to consider the initial state as the starting point of an empty node, so we are responsible for making sure it can sync from that point, and backfill blocks from genesis to that point.*

**Action Items:**
- [ ] Make sure that we do not allow overwriting the state of a non-empty database. Emit a warning if the `--initial-state` flag is used on a non-empty DB.

#### 1.0.2 Protecting from long-range attacks

When providing a node with an initial state, the state must be within the weak subjectivity period (otherwise we are dealing with a stale state).
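
As a rough illustration of the staleness condition, the check reduces to comparing the current epoch against the state's epoch plus the weak subjectivity period. The sketch below is an assumption-laden simplification: `withinWSPeriod` is a hypothetical name, and the period is passed in as a plain number, whereas a real node would derive it from the state's validator set.

```go
package main

import "fmt"

// withinWSPeriod reports whether a state taken at stateEpoch is still usable
// at currentEpoch. wsPeriod (in epochs) is an input here by assumption; a real
// node would compute it from the state itself.
func withinWSPeriod(stateEpoch, currentEpoch, wsPeriod uint64) bool {
	if currentEpoch < stateEpoch {
		// Assumption of this sketch: a state that claims to be from the
		// future is rejected rather than trusted.
		return false
	}
	// Subtract instead of adding to avoid uint64 overflow.
	return currentEpoch-stateEpoch <= wsPeriod
}

func main() {
	// The period of 256 epochs is an arbitrary illustrative value.
	fmt.Println(withinWSPeriod(1000, 1200, 256)) // state is 200 epochs old
	fmt.Println(withinWSPeriod(1000, 1300, 256)) // state is 300 epochs old
}
```

A state that fails this check would be ignored, and the node would fall through to the next source of initial state.
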
To ensure security, users are expected to provide the `--weak-subjectivity-checkpoint block_root:epoch_number` param as well. The `IsWithinWeakSubjectivityPeriod()` helper will be used to check that the provided checkpoint is not stale itself; if it is, the initial state will be ignored.

*Note: the weak subjectivity checkpoint will be useful in any case (stale or not) to make sure that its root is present in our DB: if it is not, the node must terminate.*

**Action items:**
- [ ] Make sure that `--initial-state=PATH` requires `--weak-subjectivity-checkpoint=block_root:epoch_number` to be provided as well. Consider alternative ways of specifying the weak subjectivity checkpoint (embedded, for example).
- [ ] Rely on `IsWithinWeakSubjectivityPeriod()` to make sure that the checkpoint is not stale (i.e. that our node is not too far beyond it).
- [ ] Assert that the checkpoint and the input state have the same root and epoch.

### 1.1 Load state from a compressed local or remote file

This method allows providing a path to a compressed state file (either as a path to a local file or as a URL to a remote one).

When users want to provide the initial state as a local file, the state must be fetched from a trusted 3rd party and saved locally. Then, a node will be started with the `--initial-state=/path/to/state.ssz.snappy` flag (or we can use the `ssz_snappy` extension, for consistency with how ETH2 specs outputs are generated, see [ethereum/eth2.0-specs/pull/2097](https://github.com/ethereum/eth2.0-specs/pull/2097)).

Passing a URL to a remote state file is very similar (`--initial-state=https://raw.githubusercontent.com/prysmaticlabs/prysm/develop/states/finalized.ssz.snappy`), but the node itself is responsible for downloading the state.

**Action items:**
- [ ] Add `--initial-state=PATH` flag.
- [ ] Make sure that `--initial-state` accepts a URL, and can download, parse and load the state from that URL.
- [ ] Add decompressing, parsing and loading of the provided `*.ssz.snappy` file into `BeaconState`.
  This can be in the form of a new helper method `LoadInitialStateFromSSZ(compressedState []byte) error`, which will be responsible for parsing and loading the state into the DB. *Note: we can re-use code from how we parse and load the embedded genesis state*.

**Expected outcomes:**
- [ ] Make sure that the loaded `BeaconState` is ready to be used in node syncing (both regular and init-sync).

### 1.2 Embed the state itself right into binaries

On releases, the latest weak subjectivity checkpoint and its state should be embedded right into the binary. The node will be able to sync from that point (quickly), without requiring users to obtain the initial state themselves.

**Action items:**
- [ ] Update node code to use a pre-defined embedded state:
  ```golang
  import _ "embed" // required for the go:embed directive

  //go:embed initial.ssz.snappy
  var initialStateRawSSZCompressed []byte
  ```
- [ ] Make sure that the embedded state is used only if it is present, not stale, and no initial state has been supplied via the CLI `--initial-state` argument.
- [ ] Ideally, we should have a CLI script that helps us assemble the binary and save the most recent weak subjectivity checkpoint and the corresponding state.
- [ ] Consider whether weak subjectivity checkpoint data can be embedded as well.

**Expected outcomes:**
- [ ] On releases, the binary will contain an updated embedded state, which will be used as the initial state on raw databases (when no initial state is provided via CLI arguments).

### 1.3 Embed URLs of trusted parties into binaries

This is basically a follow-up on the `--initial-state=URL` functionality that adds more usability and robustness to the method. Consider the following embedded data:

```golang
// downloadableState describes a state file and where it can be fetched from.
type downloadableState struct {
	hash string   // expected hash of the state file
	urls []string // trusted locations to try
}

var embeddedInitialState = downloadableState{
	hash: "9bef...0155",
	urls: []string{
		"https://prysmaticlabs.com/uploads/states/v0.7.1.ssz.snappy",
		"https://beaconscan.com/ws_checkpoint/state.ssz",
		// some more URLs ...
	},
}
```

If the initial state is not provided as a CLI argument (or it is stale, unavailable for download, etc.), the node looks for an embedded state. If that is not available (or stale), the node finally goes through the list of providers in an attempt to obtain the state.

**Action items:**
- [ ] Implement and test functionality where the node is able to traverse the list of embedded URLs and obtain the initial state from one of them.
- [ ] Consider using meta-links (e.g. `https://foo.bar/state/latest-finalized.ssz.snappy`, where the hash cannot be known in advance, so file integrity cannot be guaranteed).

**Expected outcomes:**
- [ ] Ability to robustly download the state from one of the trusted state providers.

## 2. Syncing

### 2.0 Overview

We assume that the chain has started, and that the obtained beacon state is valid. Ideally, Prysm should be able to proceed from a given state w/o any major refactoring, i.e. core components should rely on a given state and be able to progress in the absence of previous states (blocks).

### 2.1 Changes required to the existing components

Syncing from a state that doesn't have historical blocks/states between it and genesis requires updating some of our existing components. The list of components below is incomplete: as we proceed with the implementation, more packages that assume the presence of historical blocks/states will certainly turn up.

#### 2.1.1 `sync` and `initial-sync` packages

- [ ] Either obtain the block for the given state, or make sure that block processing works for successive slots w/o requiring the parent block to exist in the DB. At the moment, we will get "parent block not found in DB" when trying to build on top of the state.
- [ ] Ability to restart a node that has been started with an initial state.
  For this, the provided initial state must be saved in the DB (or some meta-data, at least). On restart, while it will not be re-saved (as the DB is no longer empty), the system should know that the node was started from an arbitrary state and that some extra operations are required (like back-filling previous blocks, for instance). *Note: we can implement this by storing `InitialStateCheckpoint` in the database. Whenever it exists, we know the root and slot of the provided state.*
- [ ] When an initial state is provided, in addition to saving it, we need to make sure that the head block is also updated. Again, it means we either obtain the block for a given state, or update our code in such a way that `block_root` is enough when a head block is expected.
- [ ] Similarly for justified and finalized checkpoints: our provided state is considered finalized, so when saving it, we need to make sure that the checkpoints are updated accordingly.

#### 2.1.2 `stategen` package

- [ ] Double check that the block of the base state is not required when regenerating states. That is, an arbitrary state should be enough to apply more blocks on top of that base state.

#### 2.1.3 `rpc` package

- [ ] Make sure that both `BeaconBlocksByRange` and `BeaconBlocksByRoot` work correctly when queried for non-existent historical blocks (that is, we have blocks close to head, but there is a gap between genesis and the initial state).
- [ ] The RPC module (not only by range and by root requests) should be checked as a whole: what happens if a non-existent state is requested? Terence suggested emitting an error with a good message (e.g. "Error: node started syncing in epoch X, unable to retrieve block in epoch X - Y").
- [ ] It is probably worth exploring whether we want to keep peers with a finalized epoch lower than that of the provided initial state, and if yes, what portion of those peers to keep. Such peers will be useful for back-filling historical blocks, but they are not that useful for progressing.

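
The descriptive error suggested above for requests that fall into the gap could look roughly like the following sketch. `checkBlockAvailable`, `errBlockNotAvailable` and `earliestAvailableSlot` are hypothetical names, not existing Prysm APIs; the only assumption is that the node remembers the slot of the earliest block it can serve (initially, the slot of the provided state).

```go
package main

import (
	"errors"
	"fmt"
)

// slotsPerEpoch is the mainnet preset value.
const slotsPerEpoch = 32

// errBlockNotAvailable marks requests that fall into the gap between genesis
// and the earliest block the node can serve.
var errBlockNotAvailable = errors.New("historical block not available")

// checkBlockAvailable returns nil when requestedSlot can be served, and a
// descriptive error when it lies before earliestAvailableSlot.
func checkBlockAvailable(requestedSlot, earliestAvailableSlot uint64) error {
	if requestedSlot >= earliestAvailableSlot {
		return nil
	}
	return fmt.Errorf(
		"%w: node started syncing in epoch %d, unable to retrieve block in epoch %d",
		errBlockNotAvailable,
		earliestAvailableSlot/slotsPerEpoch,
		requestedSlot/slotsPerEpoch,
	)
}

func main() {
	// The node was started from a state at slot 3200 (epoch 100).
	fmt.Println(checkBlockAvailable(3200, 3200))
	fmt.Println(checkBlockAvailable(64, 3200))
}
```

Wrapping a sentinel error lets callers distinguish "not back-filled yet" from genuine lookup failures with `errors.Is`, while still giving users a readable message.
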
#### 2.1.4 `blockchain` package

- [ ] In `chain_info.go` we define the (heavily used) `HeadRoot()` and `HeadBlock()` methods. When starting from an arbitrary state we do not have the full block available (but have the block header, if necessary), so we need to check how our block-less heads will behave.

### 2.2 New components

Ideally, no new components should be required for implementing the sync from an arbitrary base state. There are some places that may rely on historical states/blocks, and we need to update those, but other than that our existing code should handle the sync w/o any issues.

## 3. Back-filling historical data

### 3.0 Overview

When a node is started from an arbitrary initial state, we have the following data layout:

```mermaid
graph LR
    genesis[Genesis] -->|...historical states...| init-state[Arbitrary Initial State] --> |...non-synced states...| canonical-state[Head of canonical chain]
```

All our current components (with minor refactoring) will be able to proceed and build on top of the arbitrary initial state:

```mermaid
graph LR
    genesis[Genesis] -->|...historical states...| init-state[Arbitrary Initial State]
    subgraph one[Sync to the head]
    init-state --> |...non-synced states...| canonical-state[Head of canonical chain]
    end
```

Now, the tricky part: syncing back historical blocks (and then regenerating historical states), all this w/o any interference with the normal sync of un-synced states.

```mermaid
graph LR
    subgraph two[Sync historical blocks]
    genesis[Genesis] -->|...historical states...| init-state[Arbitrary Initial State]
    end
    init-state --> |...non-synced states...| canonical-state[Head of canonical chain]
```

Now, since the arbitrary state will be within the weak subjectivity period, and will contain the block header of a finalized block (the whole point of being w/i the WS period is to make sure that if there is some other finalized block, then 1/3 of validators will get slashed), we can go back in history and find the immediately preceding finalized block, then the one before it, and so forth. This is possible because each finalized block has enough information about its parent block, which can then be retrieved.

After the back-filling procedure, the beacon DB should be in exactly the same state as if the sync had started from genesis, w/o any intermediary initial state (assuming that there weren't any forks, since when going back we already know the canonical chain).

Note: all the historical data, like `block_roots`, `state_roots`, and `historical_roots`, is already there within the provided initial state even before any back-filling takes place, and **will not** be overwritten by the history back-filling procedure, as we are going backwards.

### 3.1 Changes required to the existing components

Ideally, no changes to the existing components will be required (when it comes to the back-filling procedure). Our sync process should have been updated to start from the arbitrary state, and should not be interfered with by the back-filling procedure at all.

### 3.2 New components required

In order for the back-filling procedure to have as little impact on other running components as possible, we should introduce a separate component: `historical-sync` or `backward-sync`. If possible, that new component should rely on the very same abstractions that `initial-sync` is using (queue, FSMs), as they have proved to be very robust when it comes to syncing from peers with possibly incomplete data.

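
A minimal sketch of the backward linking such a component would perform: starting from a root that is already known to be canonical, a batch of blocks is scanned from the highest slot down, and a block is kept only if its root equals the currently expected root. The `block` type and the `linkBackwards` function are hypothetical simplifications for illustration (roots are plain strings here, not 32-byte hashes), not existing Prysm code.

```go
package main

import "fmt"

// block is a simplified stand-in for a beacon block.
type block struct {
	Slot       uint64
	Root       string
	ParentRoot string
}

// linkBackwards scans batch (sorted by ascending slot) from the end and keeps
// only the blocks that chain back from expectedRoot. It returns the kept
// blocks in descending slot order, plus the root to expect in the next
// (older) batch.
func linkBackwards(batch []block, expectedRoot string) ([]block, string) {
	var kept []block
	for i := len(batch) - 1; i >= 0; i-- {
		b := batch[i]
		if b.Root != expectedRoot {
			continue // not on the chain we are following, skip it
		}
		kept = append(kept, b)
		expectedRoot = b.ParentRoot
	}
	return kept, expectedRoot
}

func main() {
	batch := []block{
		{Slot: 5, Root: "a", ParentRoot: "z"},
		{Slot: 6, Root: "x", ParentRoot: "a"}, // orphaned block, nothing points to it
		{Slot: 7, Root: "b", ParentRoot: "a"},
		{Slot: 8, Root: "c", ParentRoot: "b"},
	}
	kept, next := linkBackwards(batch, "c")
	for _, b := range kept {
		fmt.Println(b.Slot, b.Root)
	}
	fmt.Println("next expected root:", next)
}
```

Returning the next expected root makes the function easy to chain across batches: the value returned for one range becomes the starting point for the preceding range.
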
When going back, we have the advantage of knowing which blocks to include and which shouldn't go into the DB. So, starting from our initial state's `parent_root` (from the state's known block header), we can request a block with the given root, then the block with the `parent_root` of that block, and so forth. That is, we start from the state's known finalized block, get back to the previous finalized block, and so on, up until genesis:

```mermaid
graph RL
    subgraph b4["(n-4)"th block]
    further[...]
    end
    subgraph b3["(n-3)"th block]
    b3_root[root]
    b3_parent_root[parent_root] -->further
    end
    subgraph b2["(n-2)"th block]
    b2_parent_root[parent_root]-->b3_root
    b2_root[root]
    end
    subgraph b1["(n-1)"th block]
    b1_root[root]
    b1_parent_root[parent_root]-->b2_root
    end
    subgraph state
    block_header[block_header.parent_root]-->b1_root
    end
```

Of course, pulling blocks one by one is very inefficient, so we will still rely on fetching blocks by range, but will further filter those batches to make sure that only the chain of finalized blocks is processed (in a backwards fashion). That is, the system will save a block only if its root matches the root expected by the previously saved **finalized** block (that block's `parent_root`).

Once all blocks are pulled into our database, it is time to regenerate all the intermediary states. The last state generated will be compared with the provided initial state, and the two are expected to match!

**Action items:**
- [ ] Introduce a new `historical-sync` service.
- [ ] We need to track the earliest known finalized checkpoint (as we will be pulling its parent block when going backwards in history), so probably introduce `earliestFinalizedCheckpoint`.
- [ ] Adapt the FSMs queue to proceed in the backward direction.
- [ ] Allow quickly filtering out non-matching blocks from block batches (we will pull data using blocks by range requests, but then will make sure that we save data in the backwards direction, checking `block(n).parent_root == block(n-1).root`).
- [ ] Make sure that the whole back-filling routine runs in complete isolation: the only external effects are that block data is updated in the database, and that the various checkpoints keeping track of the earliest unknown block are updated.
- [ ] If those checkpoints are updated, then the node should survive restarts w/o any issue.
- [ ] Make sure that back-filling always ends either at the genesis block or in a terminal error if the genesis block is unreachable (that is, we arrive at the 0th slot, and our earliest block doesn't match the genesis).
- [ ] Make sure that as soon as some block range is pushed into the database, `BeaconBlocksByRange` can return those blocks via RPC.
- [ ] Once all blocks are available (and all of them finalized), we can regenerate historical states.

## 4. References

- [Weak Subjectivity in Eth2.0](https://notes.ethereum.org/@adiasg/weak-subjectvity-eth2) by Aditya Asgaonkar
- [Shipping With Genesis States](https://hackmd.io/@prysmaticlabs/shipping_with_genesis) by Preston Van Loon
- [Weak Subjectivity: implementation roadmap](https://hackmd.io/V9vY48R6QHCN0whHbPjk9g?view) by Victor Farazdagi