# node sync algorithm research
### Filecoin: Lotus
https://github.com/filecoin-project/lotus/blob/eede19fb0b2d3a7726f35392d0028b9e2ff56ce9/chain/sync_manager.go
`syncManager` implements the interface:
```
// SyncManager manages the chain synchronization process, both at bootstrap time
// and during ongoing operation.
//
// It receives candidate chain heads in the form of tipsets from peers,
// and schedules them onto sync workers, deduplicating processing for
// already-active syncs.
type SyncManager interface {
	// Start starts the SyncManager.
	Start()
	// Stop stops the SyncManager.
	Stop()
	// SetPeerHead informs the SyncManager that the supplied peer reported the
	// supplied tipset.
	SetPeerHead(ctx context.Context, p peer.ID, ts *types.TipSet)
	// State retrieves the state of the sync workers.
	State() []SyncerStateSnapshot
}
```
the `collectHeaders` function does the heavy lifting, requesting headers between `known` and `incoming`: https://github.com/filecoin-project/lotus/blob/d7076778e2666e88f2855bd83569142768120185/chain/sync.go#L1256
the network-level request happens here:
https://github.com/filecoin-project/lotus/blob/d7076778e2666e88f2855bd83569142768120185/chain/exchange/client.go#L296
`syncManager.doSync` is set to https://github.com/filecoin-project/lotus/blob/d7076778e2666e88f2855bd83569142768120185/chain/sync.go#L534
flow generally seems to be:
- syncManager receives a new peer head; if it's higher than ours (and we haven't started syncing yet), spawn a new sync worker that calls `doSync`
- the target for this worker is returned by `selectInitialSyncTarget`
- seems this function is called once for bootstrapping, and returns the highest head seen as of that point
- if we already started syncing, then `addSyncTarget` is called instead
- if there is no worker for it then a worker is started, otherwise it may be queued if it's deemed to be a heavy enough chain
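a rough Go sketch of this flow; `syncScheduler` and friends are made-up names for illustration, not Lotus's actual types:
```
type head struct{ height uint64 }

type syncScheduler struct {
	bootstrapped  bool
	initialTarget *head
	activeWorkers int
	maxWorkers    int
	queue         []*head
}

func (s *syncScheduler) handlePeerHead(h *head, ourHeight uint64) {
	if h.height <= ourHeight {
		return // not ahead of us; nothing to do
	}
	if !s.bootstrapped {
		// before bootstrap, just remember the highest head seen so far;
		// selectInitialSyncTarget returns it once syncing starts
		if s.initialTarget == nil || h.height > s.initialTarget.height {
			s.initialTarget = h
		}
		return
	}
	// already syncing: either start a new worker or queue the target,
	// mirroring addSyncTarget
	if s.activeWorkers < s.maxWorkers {
		s.activeWorkers++
		go s.doSync(h)
	} else {
		s.queue = append(s.queue, h)
	}
}

func (s *syncScheduler) doSync(h *head) { /* request and import blocks up to h */ }
```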
### Substrate
https://github.com/paritytech/substrate/blob/4c3a55e7ca5c4c85c1eb53fd82ed71029d952510/client/network/src/protocol/sync.rs
substrate has the following constants:
```
/// Maximum blocks to request in a single packet.
const MAX_BLOCKS_TO_REQUEST: usize = 128;
/// Maximum blocks to store in the import queue.
const MAX_IMPORTING_BLOCKS: usize = 2048;
/// Maximum blocks to download ahead of any gap.
const MAX_DOWNLOAD_AHEAD: u32 = 2048;
/// Maximum blocks to look backwards. The gap is the difference between the highest block and the
/// common block of a node.
const MAX_BLOCKS_TO_LOOK_BACKWARDS: u32 = MAX_DOWNLOAD_AHEAD / 2;
/// Maximum number of concurrent block announce validations.
///
/// If the queue reaches the maximum, we drop any new block
/// announcements.
const MAX_CONCURRENT_BLOCK_ANNOUNCE_VALIDATIONS: usize = 256;
/// Maximum number of concurrent block announce validations per peer.
///
/// See [`MAX_CONCURRENT_BLOCK_ANNOUNCE_VALIDATIONS`] for more information.
const MAX_CONCURRENT_BLOCK_ANNOUNCE_VALIDATIONS_PER_PEER: usize = 4;
```
we should implement `MAX_CONCURRENT_BLOCK_ANNOUNCE_VALIDATIONS`; currently there is no limit on concurrent block announce validations.
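if we add this, a minimal Go sketch of the bound, using a buffered channel as a semaphore (names are ours, not substrate's):
```
// maxConcurrentBlockAnnounceValidations mirrors substrate's limit of 256;
// the buffered channel acts as a semaphore, and announcements are dropped
// when it is full, as substrate does.
const maxConcurrentBlockAnnounceValidations = 256

var validationSlots = make(chan struct{}, maxConcurrentBlockAnnounceValidations)

func handleBlockAnnounce(validate func()) {
	select {
	case validationSlots <- struct{}{}: // acquire a slot
		go func() {
			defer func() { <-validationSlots }() // release the slot
			validate()
		}()
	default:
		// queue is full: drop the announcement
	}
}
```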
substrate:
- tracks the best reported hash/number for each peer
- has a hash map of currently pending (outgoing?) requests, keyed by peer ID
- light sync only requests headers and justifications and doesn't execute blocks (maybe we can implement this later)
- looks like substrate handles syncing forks specifically (`set_sync_fork_request`)
- also has a separate justification request queue; only one justification is requested at a time
- responses are handled by also passing in the corresponding request
```
// A chain is classified as downloading if the provided best block is
// more than `MAJOR_SYNC_BLOCKS` behind the best queued block.
```
`MAJOR_SYNC_BLOCKS = 5`
on new peer:
- validate handshake message
- if their best block is >5 ahead of ours, add the peer to the sync peers set
- if their best hash is unknown (`BlockStatus::Unknown`), perform an "ancestor search" to find the highest common ancestor
- if their best hash is known (`BlockStatus::InChainPruned`), add to sync peers set and add to pending requests
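a Go sketch of this decision flow; the names are ours, mirroring the `BlockStatus` variants above, not actual substrate or gossamer APIs:
```
type blockStatus int

const (
	blockStatusUnknown blockStatus = iota
	blockStatusInChainPruned
)

const majorSyncBlocks = 5 // substrate's MAJOR_SYNC_BLOCKS

// onNewPeer sketches the new-peer handling described above.
func onNewPeer(peerBest, ourBest uint64, status blockStatus) {
	if peerBest <= ourBest+majorSyncBlocks {
		return // peer is not far enough ahead to use as a sync source
	}
	switch status {
	case blockStatusUnknown:
		// best hash unknown to us: run an ancestor search first to find
		// the highest common block, then request from there
	case blockStatusInChainPruned:
		// best hash known: add the peer to the sync set and queue requests
	}
}
```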
on blocks processed:
- calls `on_blocks_processed` which may update the request queue
- may also add new justification request
on justification processed:
- may update pending requests
ancestor search:
- binary search for the highest common ancestor between our chain and the peer's (a minimal sketch follows)
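a minimal Go sketch of that binary search, assuming a hypothetical `hasPeerBlock` callback that requests the peer's hash at a given height and compares it against our chain:
```
// findHighestCommon binary-searches block numbers for the highest block
// shared by both chains. hasPeerBlock(n) should fetch the peer's hash at
// height n and report whether it matches ours.
func findHighestCommon(ourBest uint64, hasPeerBlock func(number uint64) bool) uint64 {
	lo, hi := uint64(0), ourBest
	for lo < hi {
		mid := (lo + hi + 1) / 2
		if hasPeerBlock(mid) {
			lo = mid // chains still agree at mid; search higher
		} else {
			hi = mid - 1 // diverged at or before mid; search lower
		}
	}
	return lo // genesis (0) is always common
}
```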
requesting on forks:
- sync is aware of blocktree, so deletes requests for forks if it's found that they are pruned
https://github.com/paritytech/substrate/blob/026a8491694783dd95c9c3d2f917b11cf609cb49/primitives/consensus/common/src/lib.rs#L48
https://github.com/paritytech/substrate/blob/d2a43d47ab7339e70c3448b4553073584cfd61ab/client/network/src/protocol/sync/extra_requests.rs
### smoldot (alternative client for substrate chains)
smoldot is essentially a "lighter" version of substrate.
from the implementer:
> In case that helps: in smoldot I split the syncing strategy in two, with the initial syncing and head-of-chain syncing in different modules. The initial syncing simply downloads chunks of 128 blocks at a time (without keeping track of forks, since we're supposed to be on the finalized chain) and verifies them, while the head of chain syncing downloads all forks.
> The head of chain syncing is basically a data structure containing:
> - Blocks ready to be verified (whose header and body are known, for example)
> - Blocks that we know exist but aren't ready to be verified, for example because only their header is known or because their parent hash isn't a known block. When we connect to a peer and they say in their handshake that their best block is block hash H, we also insert H in there.
> - On-going network requests that target not-ready-to-be-verified yet blocks.
>
> When a network request finishes, we update the data structure and potentially turn not-ready-to-be-verified into ready-to-be-verified blocks.

> The syncing algorithm also needs to track which blocks are known to which peers. In smoldot this is simply a glorified `BTreeSet<(PeerId, BlockHash)>`. This code is definitely one of the hardest challenges when implementing a Polkadot client though.

on the "blocks that we know exist but aren't ready to be verified" list:
> This list needs to be bounded, ideally, to avoid attacks where someone sends millions of block announces that we can't verify
additionally, the ancestor search is also used for fork syncing at the head of the chain.
thus, gossamer should implement 2 different modes of syncing - bootstrap and "idle" or near-head.
https://github.com/paritytech/smoldot/blob/main/src/sync/all_forks/disjoint.rs
https://github.com/paritytech/smoldot/blob/main/src/sync/all.rs
https://github.com/paritytech/smoldot/blob/main/src/sync/optimistic/verification_queue.rs
### Gossamer sync algorithm design
#### Requirements
1. modular - needs to be able to support multiple algorithms in the future, ie `full`, `light`, and `warp`
2. isolated from the networking layer - network package should only deal with sending and receiving messages (and potentially scoring, TBD)
3. needs to be able to handle requests for block [headers, bodies] as well as justifications
4. needs to handle 2 different modes of sync - bootstrap and near-head
#### interface for syncer within network package
```
type Syncer interface {
// HandleBlockAnnounceHandshake updates our view of the peer's state by potentially updating their best block and hash
HandleBlockAnnounceHandshake(from peer.ID, msg *BlockAnnounceHandshake) error
// HandleBlockAnnounce handles a block announcement from the given peer, returning whether it should be propagated
HandleBlockAnnounce(from peer.ID, msg *BlockAnnounceMessage) (propagate bool, err error)
}
```
Note that `HandleBlockAnnounceHandshake` is of type `network.HandshakeValidator` and `HandleBlockAnnounce` is of type `network.NotificationsMessageHandler`
#### interface for network within syncer package
```
type Network interface {
// DoBlockRequest sends a request to the given peer. If a response is received within a certain time period, it is returned, otherwise an error is returned.
DoBlockRequest(to peer.ID, req *BlockRequestMessage) (*BlockResponseMessage, error)
// BanPeer disconnects and blocks the given peer from reconnecting
// Should only be used in extreme circumstances
BanPeer(p peer.ID) // TODO: do we want to add banning now, or wait until scoring?
}
```
#### syncer design
similar to Lotus, we will have a `chainSync` module, which keeps track of the current sync workers and handles their success or failure.
a worker will consist of:
- start hash/number to begin request from
- target hash/number
- ID
- data to request
- direction of request
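as a rough Go sketch (field names and types are placeholders, not a final API):
```
type Hash [32]byte // stand-in for the real common.Hash type

// worker describes a single dispatched sync request.
type worker struct {
	id           uint64
	startHash    Hash
	startNumber  uint64
	targetHash   Hash
	targetNumber uint64
	requestData  byte // bitfield: header, body and/or justification
	direction    byte // ascending or descending
}
```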
also similar to both, we will keep track of the latest known state for each peer, ie. their best block hash and number. this is obtained via the `BlockAnnounceHandshake` message. we can use this info to determine who to request blocks from
we will also have a `chainProcessor` module which is similar to the current `syncer.Service` in that it processes the blocks once they are ready. this module should have some queue of blocks that the `chainSync` writes to, which the `chainProcessor` continually reads from and processes.
the `chainSync` module should ensure the blocks are ordered and have no gaps, ie. that no blocks are missing from a response. since a worker requests ordered 128-block chunks, it only needs to confirm each response is complete before moving on to the next chunk
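a minimal sketch of the handoff, assuming the queue is a buffered channel (the channel-vs-data-structure question is revisited under block processing below) and reusing the `Hash` placeholder from the worker sketch:
```
type header struct {
	parentHash Hash
	number     uint64
}

type blockData struct {
	hash   Hash
	header *header
	body   []byte
}

// chainSync validates responses and pushes ready blocks; chainProcessor
// drains the queue and imports them, like the current syncer.Service.
type chainProcessor struct {
	readyBlocks chan *blockData // written by chainSync
}

func (p *chainProcessor) processLoop() {
	for bd := range p.readyBlocks {
		// execute and import the block
		_ = bd
	}
}
```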
#### bootstrap mode
Bootstrap mode simply requests blocks in chunks of 128 and ensures they are ordered for processing (ie. in ascending order, without gaps). Since the chain at this point is already finalized and we receive justifications along with the blocks, we don't need to track forks during bootstrap.
beginning bootstrap:
- should be started once we connect to some number of peers (5?) and receive their best blocks
- once we have their best blocks, we pick the highest as our target
- dispatch a worker with start=0 and target=highest seen
  - TODO: should we have parallel workers, or just 1 for bootstrap?
- worker will request from peers who report that they have the blocks we are currently looking for
- on response, pre-validate blocks and push the blocks into the `chainProcessor` queue if validated
- if the blocks aren't validated, then return an error and downscore the peer
- on failure, dispatch another worker with updated `start` and retry
- updated `start` should be our best block (or should it be the end of the validation queue?)
- probably go with best block for now, duplicate blocks are better than potentially missing
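a sketch of the bootstrap trigger under these rules; `minSyncPeers`, `dispatchWorker`, and the map keyed by peer ID are all placeholders:
```
const minSyncPeers = 5 // tentative threshold discussed above

// maybeStartBootstrap picks the highest reported best block as the target
// once enough peers have completed the handshake.
func maybeStartBootstrap(peerBest map[string]uint64, ourBest uint64) {
	if len(peerBest) < minSyncPeers {
		return // wait for more handshakes
	}
	var target uint64
	for _, best := range peerBest {
		if best > target {
			target = best
		}
	}
	if target > ourBest {
		// start from our best block, end at the highest seen head
		dispatchWorker(ourBest+1, target)
	}
}

func dispatchWorker(start, target uint64) { /* spawn a bootstrap worker */ }
```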
response validation:
- `chainSync` should perform some response pre-validation before pushing to the processor
- it should check that:
- the response is not empty
- the response contains all the expected fields (header and body for bootstrap)
- each block has the correct parent, ie the response constitutes a valid chain
- once this is done, then the blocks can be pushed for processing
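these checks as a sketch, reusing the placeholder types from above:
```
import "errors"

// preValidate runs the checks above before blocks are pushed to the
// processor queue.
func preValidate(resp []*blockData) error {
	if len(resp) == 0 {
		return errors.New("empty response")
	}
	for i, bd := range resp {
		// bootstrap requests headers and bodies, so both must be present
		if bd.header == nil || bd.body == nil {
			return errors.New("incomplete block data")
		}
		// each block must reference the previous block as its parent
		if i > 0 && bd.header.parentHash != resp[i-1].hash {
			return errors.New("response does not form a connected chain")
		}
	}
	return nil
}
```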
block processing:
- essentially the same as now
- only difference is the queue
- should it be a channel, or a full data structure?
- depends on whether we need to know what's in the queue
- idle mode might need to know what's in the queue, so we can determine whether we have the parent block yet
- or, we can just put the block in the queue only if we already imported the parent (might actually be nicer)
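a sketch of the "only enqueue once the parent is imported" option from the last bullet, with `hasBlock` as a hypothetical database lookup:
```
// maybeEnqueue pushes a block to the ready queue only if its parent has
// already been imported; otherwise the caller keeps it pending.
func maybeEnqueue(bd *blockData, hasBlock func(Hash) bool, ready chan<- *blockData) bool {
	if !hasBlock(bd.header.parentHash) {
		return false // parent not imported yet; keep the block pending
	}
	ready <- bd
	return true
}
```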
note: substrate/smoldot track all requests that have been sent out but whose responses have not yet been handled; do we need this, or will tracking workers be sufficient?
#### idle mode
Idle mode (or near-head mode) requires tracking of forks.
We need to download every fork we see. this *may* require "ancestor search" to find the root of the fork. (we can also use a naive approach where we just request the blocks we think might be on that fork, perhaps in descending order)
Need to track:
- blocks ready to be verified (ie. we have header and body and parent block)
- blocks we know of but can't be verified yet (ie. only have header, have hash from a handshake, haven't imported parent block yet)
- what data structure to use for this?
  - smoldot uses a "disjoint blocks" data structure, which is a hash map of (number, hash) -> block + additional data
- maybe we can use a tree structure for this, or hashmap as well
workers will be dispatched to target the blocks we know of but can't yet be verified
- when we get the data needed, we move the blocks over to the ready queue
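a sketch of such a structure along smoldot's lines, keyed by (number, hash) and also folding in the per-peer block knowledge (the glorified `BTreeSet<(PeerId, BlockHash)>` from earlier):
```
type blockKey struct {
	number uint64
	hash   Hash
}

// pendingBlock holds whatever data we have for a block we can't verify yet.
type pendingBlock struct {
	header       *header
	body         []byte
	knownToPeers map[string]struct{} // peer IDs that announced this block
}

// disjointBlocks must be bounded to avoid the block-announce spam attack
// noted in the smoldot section.
type disjointBlocks struct {
	blocks map[blockKey]*pendingBlock
}

// ready reports whether a block can move to the verification queue:
// header and body are known and the parent has been imported.
func (d *disjointBlocks) ready(k blockKey, parentImported bool) bool {
	b, ok := d.blocks[k]
	return ok && b.header != nil && b.body != nil && parentImported
}
```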
process for syncing a fork:
eg. our best block is number 100 with hash `0xab..cd`. a peer reports a block 100 with hash `0xef..12`. the peer is on a fork, so we need to request blocks in descending order from `0xef..12`
dispatch a worker with the following:
```
startHash = 0xef..12
startNumber = 100
direction = DIR_DESCENDING
targetHash = ?
targetNumber = ?
```
when we receive the response, hopefully the fork is not >128 blocks long, in which case we can traverse the results to find the common ancestor of the two chains (see the sketch below)
if the fork is >128 blocks long, we will need to dispatch another worker with `startHash` updated to the lowest block received in the previous response
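a sketch of scanning the descending response for the fork point, with `hasBlock` again a hypothetical lookup into our own chain:
```
// findForkPoint walks a descending response looking for the first block
// whose parent is already on our chain; that parent is the common ancestor.
func findForkPoint(resp []*blockData, hasBlock func(Hash) bool) (Hash, bool) {
	for _, bd := range resp { // resp is in descending order
		if hasBlock(bd.header.parentHash) {
			return bd.header.parentHash, true
		}
	}
	// fork is longer than this response: dispatch another descending
	// worker starting 128 blocks further back
	return Hash{}, false
}
```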
#### bootstrap module
#### idle module