# Prysm Checkpoint Sync Options

## Background

In order to implement the capability to sync from a weak subjectivity checkpoint, we need to extend our current "sync from genesis state" support to work with arbitrary states. However, states after genesis pose an additional challenge, because we can no longer derive the initial block from a priori knowledge (genesis uses a zeroed-out block root; other states will not). There are 3 obvious solutions to this problem:

1. Carefully audit and modify Prysm so that the block is unnecessary. This was suggested as a possibility by Victor in his [WSS design document](https://hackmd.io/oxLcrRxwRCGQNYowPWefcg#21-Changes-required-to-the-existing-components). In short, if we have the block root and have explicit trust in the state, in theory we should be able to build on top of that. This seems like a risky change because it potentially changes invariants in the core processing of the blockchain. It's the option I am least comfortable with, so I haven't explored it in much depth.
2. Distribute a serialized copy of the BeaconBlock used to derive the checkpoint state, in addition to the checkpoint BeaconState itself. Handling two separate files that are intrinsically related has some UX issues, but there are some obvious workarounds. This option would likely result in the smallest possible change to the codebase.
3. Fetch the block from the network. This could be preferable from a user-experience perspective, as it more closely aligns with the experience of specifying a genesis state. However, it runs up against numerous assumptions baked into the current implementation and will require some deeper changes to accommodate.

## Other implementations

Teku has a working implementation of checkpoint sync, and Lighthouse seems close to releasing it -- it was pushed back by Altair work and other higher-priority bugfixes, but the PR linked below was updated on 9/14, saying it's ready for review.
### Teku

The `--initial-state` flag can be used to specify either a genesis state OR a checkpoint state. This seems like a nice UX, reducing cli flag clutter (and with this approach it is intuitive that (genesis | checkpoint) states are mutually exclusive). Teku only needs the block root to initialize; they don't need to fetch the full block over the network.

* [cli flag documentation](https://docs.teku.consensys.net/en/latest/Reference/CLI/CLI-Syntax/#initial-state)
* [HowTo](https://docs.teku.consensys.net/en/latest/HowTo/Get-Started/Checkpoint-Start/)

### Lighthouse

Lighthouse requires the `SignedBeaconBlock` to be provided alongside the `BeaconState` in 2 separate flags, `--checkpoint-state` and `--checkpoint-block`. Also of interest: Lighthouse did not attempt to implement reverse syncing; instead they kick off a backfill sync from genesis.

* [Issue](https://github.com/sigp/lighthouse/issues/1891)
* [PR](https://github.com/sigp/lighthouse/pull/2244)

## Back to Prysm - Option 1: No initial block, deal with it

I'm scared and I need a grownup. Don't want to spend significant effort investigating this unless someone wants to champion it.

## Option 2: Distribute BeaconState + SignedBeaconBlock

aka the Lighthouse approach.
One idea to make this a bit more user-friendly is to distribute the combined state + block, along with other helpful metadata, as a single protobuf message, like:

```
message SyncData {
  // ssz-encoded BeaconState
  required bytes state = 1;
  // ssz-encoded BeaconBlock
  required bytes block = 2;
  // ssz-encoded Checkpoint (epoch+root)
  required bytes checkpoint = 3;
}

message SignedSyncData {
  // embedded SyncData message
  required SyncData checkpoint = 1;
  // signature of the above, made w/ prysm-owned keys
  required bytes signature = 2;
}
```

One thing I like about this idea is that, since it embeds the checkpoint (i.e. epoch+root), the same file could be used in place of the current format of specifying those values as a separate flag (`--weak-subjectivity-checkpoint`, with an awkward microformat of `block_root:epoch_number` for the value). This encourages the user to keep the checkpoint sync file around and reduces user error, maybe. For example: `prysm.sh --sync-from-checkpoint checkpoint.snappy --verify-checkpoint checkpoint.snappy`. After the first run the user would drop the `--sync-from-checkpoint` flag but keep `--verify-checkpoint` (or we can just allow the flag to stick around). We could also distribute the genesis state in the same format and combine the genesis + checkpoint sync flags, like Teku.

## Option 3: Fetch the block from the network

The rest of this document will break down the complexity of this approach and give some options for implementing it if we so choose.

### Database initialization

Before any of the services are initialized, we process the genesis state, if specified on the command line, which includes saving the state to the database. Note that we have special cases in the code that look for blocks with a zero-value root, in order to recognize the bootstrap block from the genesis state as "finalized".
This will be different in checkpoint sync, as we will need to ensure that the included block is saved to the database and marked finalized before dependent components (like the `blockchain` service) are able to start.

### Event Feeds (StateFeed)

node.go initializes 3 different instances of [event.Feed](https://github.com/prysmaticlabs/prysm/blob/develop/shared/event/feed.go#L36):

* OperationFeed
* BlockFeed
* StateFeed

I won't cover OperationFeed and BlockFeed, as these seem to mostly be used in a pubsub fashion to distribute notifications of events to listeners via the rpc package (ie [StreamEvents](https://github.com/prysmaticlabs/prysm/blob/develop/beacon-chain/rpc/eth/events/events.go#L46)). StateFeed, on the other hand, is worthy of some closer examination, because it is used to control the order in which different services progress through their initialization/start phases. To start with, here is a diagram showing the bird's-eye view of the major StateFeed events.

![](https://i.imgur.com/Xmikn0B.png)

**powchain - ChainStarted**

This event only comes into play in the pre-genesis state.

// TODO: I am confused by the conditions that get us into this code path, trying to clarify by asking a question in discord.

In the pre-genesis state, the `blockchain` Service blocks in the `Start` method until this event is received. This event is fired by `powchain` when the powchain package has confirmed that the system time has reached genesis and there are enough deposits to begin blockchain processing.

**blockchain - Initialized**

`p2p`, `initial-sync` and `sync` services all have methods to block until this event is received:

* `p2p`: awaitStateInitialized()
* `initial-sync`: waitForStateInitialization
* `sync`: registerHandlers (blocks until Initialized, then registers RPC handlers)

note: `sync` and `initial-sync` actually start listening for `Initialized` in goroutines started in their constructor methods.
I'm not exactly sure why they subscribe in the constructor rather than in `Start`; I'm guessing it has something to do with the initialization order in `node.go`.

I think that p2p needs to wait for blockchain to start because the networking documents specify that the ENR record for eth2 clients needs to contain `ENRForkID`, which requires data broadcast via the Initialized event (genesis time and the genesis validators root). I think the reasoning follows from the single responsibility principle, where powchain is responsible for the sources of these values (`ChainStartData`). It's not as clear to me why sync is doing this. It does use the startTime to sleep in the event that genesis is in the future. initial-sync, on the other hand, waits on genesis time, which it passes down into `roundRobinSync`. This parameter is used throughout initial sync to compute slot positions (eg `core.SlotsSince`).

**initial-sync - Synced (+ lowkey Service.Synced)**

In order to allow `initial-sync` to run (in the cases where it has not been disabled via various conditionals), `sync` blocks until it receives the `Synced` event from `initial-sync`. When `initial-sync` fires this event, it also flips a `synced` boolean that the service owns and exposes via the `Synced()` method. The `sync` package has a dozen or so code paths that check the value of this boolean, in addition to blocking on the event. `rpc` also checks this value in methods that shouldn't finish until `initial-sync` is complete.

### Checkpoint Sync (The Problem)

1) When a checkpoint state is specified as a cli flag, prysm knows the block root of the last block integrated into the state, but does not have the block itself.
2) So there is no way for the `startDB` initialization code to set an appropriate value for the finalized block (so no `BeaconDB.FinalizedCheckpoint`).
3) `blockchain/Service.Start()` relies on the `BeaconDB` to retrieve the `FinalizedCheckpoint` so that `stateGen` can load `StateByRoot`.
So blockchain can't start until after checkpoint sync has fetched the block and written it to the db.
4) checkpoint sync can't fetch the block until p2p is online. But p2p blocks on startup, waiting for blockchain to send `StateFeed.Initialized`.

So when checkpoint sync is used, rather than blockchain owning the initial state validation, we need to let checkpoint sync unblock p2p. But we don't really need to take over all of blockchain's jobs (like processing attestations, fork choice, etc.), just the validation that happens before it kicks everything off.

### Considered Solutions

**Implement in initial-sync**

My first attempt was to add this as an `initial-sync` capability. `initial-sync` is pretty complex and is designed to run after `blockchain`, so there is a mismatch in the order of operations, and it would add more responsibilities to an area of the code that seems difficult to extend.

**Move fetching of the initial block into blockchain**

This is an approach I tried early on. The deal-breaker was that attempting to import the p2p support for fetching the block from within the blockchain service resulted in a circular import. It also feels like this approach goes against the existing separation of concerns in the codebase, as fetching blocks is usually the responsibility of the sync package.

**Yet Another Service - StateValidator**

We could refactor the checks that the blockchain service does into a new service, which would fire a sequence of events as different validation thresholds are passed.

1) ValidGenesis - asserts that genesis has passed, and includes genesis time and validators root (this allows p2p to configure itself, at which point it would be usable by checkpoint sync to fetch a block). Checkpoint sync would be unblocked by this state transition; it would fetch the initial block, validate it, save the block, save the state summary, and mark the block as finalized.
   -- Note: this is different from the usual genesis flow, in which the genesis block is not marked finalized. The blockchain startup code relies on knowledge of the fact that a) the `FinalizedCheckpoint` db method will return the zero value for Checkpoint, and b) the zero value of Checkpoint has a `.Root` attribute equal to `params.BeaconConfig().ZeroHash`. So in a nutshell, genesis is a special case in the current blockchain code (presumably because we don't want to store the zeroed-out block in the db?).

2) ValidFinalizedBlock - asserts that we have a justified and finalized checkpoint and head block. At this point blockchain has what it needs to start up fork choice. blockchain can then fire the usual Initialized signal (pending verification of the weak subjectivity root if the flag is present).

**Initialization refactor**

This is potentially a more extreme option, and if we're interested in this direction I'll expand more on a design. Currently beacon-chain leaves initialization to the individual services, calling their constructors, then Start(), and letting them figure things out amongst themselves. This comes at the cost of the event-based initialization documented above. Moving the initialization flow into node.go could make the whole process more visible and easier to understand. For instance, by letting one service Start before another service is initialized, we could ensure that the state which blockchain or sync require is present from within the beacon node service, reducing the amount of startup validation performed by services like blockchain. We could also move more of the flag-based choices about whether a given service should start (like we can see in the initial-sync startup code) into one scope.

**A little of column a, column b**

In order to avoid making a mess of the BeaconNode code, we can implement a StateValidator-type package which is used by BeaconNode, but leave BeaconNode as the only consumer of the event.
In this design BeaconNode would be in charge of explicitly providing values to the services (things like a p2p object, or the genesis timestamp), with a looser coupling between the StateValidator logic and the initialization of services.

# notes from review

- we are going with option #2 - distributing the BeaconState and SignedBeaconBlock as separate ssz-encoded files. This drastically reduces the effort to deliver this feature by avoiding large structural changes
- this also means that any code health changes for service initialization will get prioritized on their own merits when we think it's appropriate
- we have user feedback that a tool to fetch the state+block from a running beacon-node via the api would be valuable, esp for large operators. we'll include that as part of the scope for the first release. we think this is doable via existing rpcs
- we want to look closely at the teku and lighthouse implementations to align with them. they both support url methods (going to follow up on this in code-and-design), so we should work together with them to find the best approach

# Unpacking the effort

We'll break up the work into 2 phases: checkpoint sync and backfill.

## Checkpoint Sync

### tool support (2 days)

cli tool that can:

- query a running beacon node for the checkpoint epoch:root
- request the state at that epoch boundary
- request the block incorporated into that state
- write the results to the local filesystem
- build a new cmd subpackage for this tool

### checkpoint sync from file (5 days)

1. initialize db from state + block
   - a lot of the wiring to accomplish this is already in place through code written as part of exploring this concept, up to the point where I went down the block-fetching rabbit hole. finish that partial implementation and improve test coverage.
   - more DB support should be added for keeping track of what state was used to initialize the db
2.
develop a testing strategy / end-to-end test that understands checkpoint sync
   - most of the scope in this work item is here

### changes to compensate for missing history (1 day)

1. RPC methods should return an error response when blocks are requested that fall between genesis and the synced checkpoint. look at other clients to determine the right response format. Work with Radek on this.

For the first pass we can stub out the backfill database methods so that we can retrieve the backfilled slot number (which will always be 0) until we actually implement backfill.

## Backfill

Need to do more research to determine how to approach this.