# Trailing-Edge Slot Processing

There is an issue on mainnet and Pyrmont where blocks from slots where `slot % SLOTS_PER_EPOCH == 0` are arriving late. Due to a quirk in fork choice, validators are building their block atop the late block, even though they attested to an empty slot. The result is that validators infrequently but consistently miss the head/target on their attestations because they built a conflicting block.

This document examines the shape and size of the problem on the Pyrmont testnet. It also details a change that can reduce block propagation times.

*Note: all data in this document is from Pyrmont. Mainnet analysis can follow later.*

## Identifying the Problem

This graph represents the delay between when a block should have been created and when it was received. For example, if a block with slot `1` was received half-way through slot `1`, then the delay would be `6 seconds` (half a slot).

**<GRAPH OF SLOT DELAY>**

Here we can see delays of up to 9 seconds, with typical values between 0.5 and 2 seconds.

We will start to see the issue when blocks arrive between 3 and 12 seconds late. Unaggregated attestations are broadcast at 3 seconds into the slot; if the block hasn't been seen (or verified) by this point, the validators will attest to emptiness. If the block is then received within 12 seconds (one slot), the next producer will build atop it. This creates the conditions for our issue: attesting to empty slots but not building atop them.

## Determining the Cause

Broadly, there are two potential reasons for this delay:

1. Validators are producing blocks late (due to bad clocks, inadequate resources, etc.).
1. Propagation across the network is slow.

(1) is the most obvious and simple reason. Whilst I think it's possible for clients to produce blocks faster, I'm seeing maximum block production times of ~500ms for the Sigma Prime Lighthouse Pyrmont nodes. This makes me think that (2) is at play as well. For various reasons, this article will focus on (2) and leave (1) out of scope.

Moving forward with (2), slow propagation across the network is likely because nodes are taking a long time to verify the block prior to propagation. This seems likely because we're not seeing a similar lag on attestations, an indication that the gossip network is functioning well. To help confirm this, let us look at the time it takes SigP Lighthouse nodes on Pyrmont to verify a block.

**<GRAPH OF BLOCK VERIFICATION DELAY>**

Here we can see that verifying gossip blocks can take up to 50ms, and that there is a strong correlation between the blocks that were seen the latest and those that took the longest to verify prior to propagation.

Whilst I haven't done research to uncover the direct link between slow verification and slow propagation, considering that a block must travel many hops (and therefore pass many slow verifications), it seems reasonable that a per-node delay of 50ms might snowball into a multi-second delay across the network. Especially considering that some low-powered nodes might take longer than 50ms to verify.

From this point, we will assume that slow verification of blocks prior to propagation is a significant factor in block propagation delays.
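To make the "snowball" reasoning concrete, here is a rough back-of-the-envelope model. The hop counts and per-hop figures below are illustrative assumptions, not measurements: if each node on a gossip path verifies the block before forwarding it, the arrival delay at the far edge of the network grows with the number of hops.

```rust
/// Back-of-the-envelope model: each gossip hop adds its verification time
/// (plus some transmission latency) before forwarding the block.
/// All numbers here are illustrative assumptions, not measurements.
fn arrival_delay_ms(hops: u64, verify_ms: u64, transmit_ms: u64) -> u64 {
    hops * (verify_ms + transmit_ms)
}

fn main() {
    // A path of well-resourced nodes: 6 hops, 50ms verification + 100ms transmission each.
    println!("fast path: {} ms", arrival_delay_ms(6, 50, 100)); // 900 ms

    // A path through slower, low-powered nodes: 8 hops at 300ms verification.
    println!("slow path: {} ms", arrival_delay_ms(8, 300, 100)); // 3200 ms
}
```

Under these assumptions, shaving tens of milliseconds off each node's pre-propagation work compounds into a network-wide improvement measured in seconds.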
## Addressing the Cause

Lighthouse has been testing a fix to this issue by modifying the procedure for block verification/import. Here's the existing workflow for Lighthouse importing a block:

1. Obtain the parent block/state from our block-processing cache (or DB).
1. Perform `per_slot_processing` once.
1. Perform `per_slot_processing` for each skipped slot between the parent and `block`.
1. Verify the block signature and other items.
1. Republish the block on the gossip network.
1. Perform more checks on `block`.
1. Import the block/state to the database.
1. Add the block/state to our block-processing cache.

Here's how we modified it:

1. Obtain the parent block/state from our block-processing cache (or DB).
1. ~~Perform `per_slot_processing` once.~~
1. Perform `per_slot_processing` for each skipped slot between the parent and `block`.
1. Verify the block signature and other items.
1. Republish the block on the gossip network.
1. Perform more checks on `block`.
1. Import the block/state to the database.
1. **Perform `per_slot_processing` once.**
1. Add the block/state to our block-processing cache.

The effect is moving the mandatory `per_slot_processing` function call to *after* the block has been propagated and imported into the database. This function can take tens of milliseconds in Lighthouse and [Nimbus](https://twitter.com/jcksie/status/1353655937253441537?s=20) (I'm not sure about other clients, but I suspect it's similar).

In other words, we push the long-running `per_slot_processing` function to the *trailing edge* of block imports, rather than the present *leading-edge* approach. This means we perform this function in the comfortable <12 seconds *after* the block has been propagated, rather than the critical (and compounding) <3 seconds beforehand. There is no fundamental additional cost to this change, simply a change in the order of operations (I wouldn't call it a "simple one-liner", though).

*Note: I believe Teku have already implemented a pre-emptive call to `per_slot_processing` as a special case for the head block. Instead, I propose a change to the existing order of operations.*

## Analysing the Fix
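Before turning to the analysis, here is a minimal, self-contained sketch of the reordering described in the previous section. The types, function bodies and slot numbers are stand-ins invented for illustration (this is not Lighthouse's actual API); only the order of operations reflects the change described above.

```rust
// A minimal sketch of the trailing-edge ordering; everything here is a
// stand-in for illustration, not Lighthouse's real types or functions.

struct BeaconState {
    slot: u64,
}

/// Stand-in for `per_slot_processing`: advances the state by one slot.
/// In a real client this call can take tens of milliseconds.
fn per_slot_processing(state: &mut BeaconState) {
    state.slot += 1;
}

/// Trailing-edge ordering: the `per_slot_processing` call that used to run
/// *before* signature verification now runs *after* the block has been
/// republished and imported.
fn import_block(mut state: BeaconState, block_slot: u64) {
    // Advance the parent state through any skipped slots up to the block's slot.
    while state.slot < block_slot {
        per_slot_processing(&mut state);
    }

    // Verify the signature, republish on gossip, run the remaining checks and
    // import to the database (all elided in this sketch).
    println!("republished + imported block at slot {block_slot}");

    // Trailing edge: perform `per_slot_processing` once, readying the state
    // for the next slot's block.
    per_slot_processing(&mut state);

    // Cache the advanced state for the next import.
    println!("cached state at slot {}", state.slot);
}

fn main() {
    let parent_state = BeaconState { slot: 31 };
    import_block(parent_state, 32);
}
```

The cost of the slot processing is unchanged; it has simply been moved out of the latency-critical window between receiving a block and republishing it.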