# Slasher Backfilling Requirements and Challenges
**By: Raul Jordan**
[toc]
## Objective
To define the requirements for a slasher backfilling algorithm and approaches towards a sound implementation in Prysm.
## Background
Our newly optimized slasher runs _after_ initial sync to build up a history of data for attesters and proposers. That is, slasher is able to detect anything that conflicts with the history it has accumulated over time.
The downside is that if slasher does not have a recorded history for a certain validator event, it will not be able to detect a slashable event. For example, if a validator submits an attestation at epoch `N`, a slasher only begins synchronizing a node after that point, and then at some epoch `N + k` that same validator commits a slashable offense, slasher will **NOT** be able to catch the offense.
To solve this problem, we have two potential solutions:
(1) We run slasher during chain initial sync, or
(2) We perform a "backfilling" routine to populate slasher's history after initial sync is complete
Approach (1) ends up putting a lot of pressure on the beacon node during initial sync, lowering sync speed to below 1 block per second. This is unacceptable, as it would seriously harm the user experience of node operators. Additionally, slasher does **NOT** need all the data since genesis. At any given epoch `N`, slasher only needs the history from epoch `N - WEAK_SUBJECTIVITY_PERIOD` onward. In practice, this amounts to a sliding window of 4096 epochs of history. Given that many more epochs have elapsed since genesis, it does not make sense to gather all of this data during initial sync only to throw most of it away soon afterwards.
Approach (2) ends up making more sense, as we have more control over how to do a slow backfilling procedure for historical data in the background without worrying about handling bursts of data all at once during initial sync. After initial sync is complete, slasher can begin fetching blocks from the last 4096 epochs and applying them through the slasher algorithm. The speed and pace of this can be bounded (see the sketch below) until there is no more data to apply. At that point, slasher will just continue listening for newly received events from the network to keep trying to detect slashable conditions. Every time the node completes a sync, slasher must backfill if more than one epoch has elapsed since its recorded history.
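For instance, the pace could be bounded with a simple ticker. This is a minimal sketch only: the one-epoch-per-second rate and the `processEpoch` helper are hypothetical, and the actual bound would be a tuning decision:
```go
// backfillPaced processes one epoch at a time, at most one per second,
// so backfilling does not compete with normal beacon node operation.
// The rate and the processEpoch helper are hypothetical.
func backfillPaced(start, end types.Epoch) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for epoch := start; epoch < end; epoch++ {
		<-ticker.C // wait for the next tick before processing another epoch
		processEpoch(epoch)
	}
}
```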
## Open Question
The biggest question is: **"do we even need to perform data backfilling?"** For the sake of correctness, **yes**, slasher should perform some amount of backfilling. There should exist a slasher implementation in Ethereum that can detect slashable offenses even on old data, for the security of the network.
## Algorithm
After initial sync, a beacon node with the `--slasher` flag will be at some epoch, `N`. Starting at epoch `N`, we perform the following, **blocking** routine:
Let `epochOnClock(genesisTime time.Time)` be the current epoch based on the genesis time and current unix timestamp.
1. Fetch the highest epoch slasher should use to stop backfilling. At the start, this will be the epoch `N` we have just synced to
2. Compute the `WEAK_SUBJECTIVITY_PERIOD` for the number of epochs slasher should care about
3. Compute the lookback number of epochs, set to `N - WEAK_SUBJECTIVITY_PERIOD`
4. Starting at the lookback epoch, fetch all blocks. For each block, fetch each attestation and convert it to indexed form for its epoch. Pass the processed data into slasher and wait for it to complete its indexing of the data
5. Continue for each epoch until we reach epoch `N`. At that point, check the difference between `N` and `epochOnClock(genesisTime)`. If the difference is > 0, set the lookback epoch to `N` and set `N` to `epochOnClock(genesisTime)`. If there is no difference, exit
6. Go back to step 1
```go
func main() {
	// Complete initial sync of the chain.
	beacon.initialSync()
	// Begin a slasher backfilling routine.
	wssPeriod := computeWeakSubjectivityPeriod()
	genesisTime := getGenesisTime()
	// The max epoch we have synced to
	// (could be different from the clock epoch).
	maxEpoch := beacon.headEpoch()
	// The lowest epoch we look back to.
	lookbackEpoch := maxEpoch.Sub(wssPeriod)
	for {
		// Backfill from the lookback epoch up to the max epoch.
		slasher.backfill(lookbackEpoch, maxEpoch)
		// After backfilling, set the lowest epoch for the next
		// round to the max epoch we have just backfilled to.
		lookbackEpoch = maxEpoch
		// If there is no difference between the max epoch we have
		// backfilled and the current epoch on the clock, we can
		// exit the loop.
		currentEpoch := epochOnClock(genesisTime)
		if diff(maxEpoch, currentEpoch) == 0 {
			break
		}
		// Otherwise, more epochs have passed on the clock while we
		// were backfilling: set the max epoch to the current clock
		// epoch and backfill again to catch up.
		maxEpoch = currentEpoch
	}
	// Begin the slasher listening routine.
	slasher.listenForEvents()
}
func backfill(start, end types.Epoch) {
	for i := start; i < end; i++ {
		blocks := getBlocksAtEpoch(i)
		extractedHeaders, extractedAttestations := preprocessBlocksData(blocks)
		slasher.detectSlashableBlocks(extractedHeaders)
		slasher.detectSlashableAttestations(extractedAttestations)
	}
}
```
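The `epochOnClock` helper referenced above could look as follows — a minimal sketch assuming mainnet timing parameters (12-second slots, 32 slots per epoch):
```go
// epochOnClock returns the current epoch based on the genesis time and
// the current unix timestamp. SECONDS_PER_SLOT = 12 and
// SLOTS_PER_EPOCH = 32 are the mainnet values assumed here.
func epochOnClock(genesisTime time.Time) types.Epoch {
	secondsSinceGenesis := uint64(time.Since(genesisTime).Seconds())
	slotsSinceGenesis := secondsSinceGenesis / 12
	return types.Epoch(slotsSinceGenesis / 32)
}
```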
## Problems
The algorithm is fairly straightforward, but it quickly runs into some serious bottlenecks when run in a real environment. Let's think about the numbers:
(1) 4096 epochs' worth of blocks, each epoch containing 32 slots, each block containing on average 80-100 attestations (let's stick to the higher end). We end up with 4096 × 32 × 100 ≈ 13 million attestations
(2) Before detecting slashable blocks and attestations from raw data, we need to transform the data into the format slasher needs. For attestations, this involves converting them into an **IndexedAttestation** format, requiring retrieval of the committee from a historical state
(3) The processing from the prior step is already expensive enough, but trying to detect on 13 million attestations at once would be overkill for slasher, despite its efficient use of disk space
We could try doing the operations above in batches, and we have evidence from running the algorithm above on a mainnet beacon node, implemented in https://github.com/prysmaticlabs/prysm/pull/9211. Fetching 4096 epochs' worth of blocks took 1m1s on a machine running the recommended [Prysm hardware specs](https://github.com/prysmaticlabs/prysm/pull/9211). The second step, however, took over an hour to preprocess a batch of just 50 epochs' worth of attestations. The expensive routine is the retrieval of historical states, as Prysm only stores certain checkpoint states and has to regenerate requested states on the fly. 50 epochs of data is still 50 × 32 × 100 = 160,000 attestations, many of them unique. The process took so long that the amount of time required for step (3) was not even measured. However, we anticipate step (2) is the serious bottleneck.
## Suggested Approach
The really expensive part of converting an attestation into indexed form is the retrieval of the attesting committee indices for any given epoch. To do this, we need the following (sketched in code after this list):
1. Retrieve the epoch target state of an attestation
2. Retrieve the seed used for shuffling
3. Compute the committees
4. Convert an attestation to indexed using the committees
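A minimal sketch of that pipeline, where `getTargetState`, `shufflingSeed`, `computeCommittee`, and `toIndexed` are hypothetical helpers standing in for Prysm's actual state and committee APIs:
```go
// Hypothetical helpers stand in for Prysm's state retrieval, seed, and
// committee computation APIs; this only illustrates the four steps.
func convertToIndexed(ctx context.Context, att *ethpb.Attestation) (*ethpb.IndexedAttestation, error) {
	// 1. Retrieve the epoch target state of the attestation.
	targetState, err := getTargetState(ctx, att.Data.Target)
	if err != nil {
		return nil, err
	}
	// 2. Retrieve the seed used for shuffling in the target epoch.
	seed, err := shufflingSeed(targetState, att.Data.Target.Epoch)
	if err != nil {
		return nil, err
	}
	// 3. Compute the committee for the attestation's slot and index.
	committee := computeCommittee(targetState, seed, att.Data.Slot, att.Data.CommitteeIndex)
	// 4. Convert the attestation to indexed form using the committee.
	return toIndexed(att, committee), nil
}
```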
To make this approach faster, a beacon node with slasher enabled could store the shuffling seeds in the database during initial sync. The algorithm for backfilling could then be a lot faster. Additionally, we should perform backfilling in batches to prevent putting too much pressure on BoltDB. Our database does not like long-lived transactions, and would suffer serious performance problems if we tried to open a single transaction to detect and store 13 million attestations.
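A sketch of such batching, where the batch size is a hypothetical tuning parameter; each `slasher.backfill` call would open and commit its own short-lived transactions rather than one transaction spanning the entire history:
```go
// epochsPerBatch is a hypothetical tuning parameter; small batches keep
// each BoltDB transaction short-lived.
const epochsPerBatch = 8

func backfillInBatches(start, end types.Epoch) {
	for batchStart := start; batchStart < end; batchStart += epochsPerBatch {
		batchEnd := batchStart + epochsPerBatch
		if batchEnd > end {
			batchEnd = end
		}
		// Each batch is processed, detected on, and committed before
		// the next one begins.
		slasher.backfill(batchStart, batchEnd)
	}
}
```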
## Testing
To test the approach, we suggest first adding the changes for storing the seed for each epoch, then syncing a node. After sync is done, we can measure how slow backfilling is and optimize accordingly. Most of the algorithm and its tests are included in https://github.com/prysmaticlabs/prysm/pull/9211