There is an issue on mainnet and Pyrmont where blocks from slots where `slot % SLOTS_PER_EPOCH == 0` are arriving late. Due to a quirk in fork choice, validators are building their blocks atop the late block, even though they attested to an empty slot at that position. The result is that validators infrequently but consistently miss the head/target on their attestations, because the next proposer built a block that conflicts with those attestations.
This document examines the shape and size of the problem on the Pyrmont testnet. It also details a change that can reduce block propagation times.
Note: all data in this document is from Pyrmont. Mainnet analysis can follow later.
This graph represents the delay between when a block should have been created and when it was received. For example, if a block for slot 1 was received half-way through slot 1, then the delay would be 6 seconds (half a slot).
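For concreteness, here is a minimal sketch of that delay calculation, assuming 12-second slots and measuring time from genesis (the function names are illustrative, not Lighthouse's actual API):

```rust
use std::time::Duration;

/// Pyrmont/mainnet slot duration.
const SECONDS_PER_SLOT: u64 = 12;

/// The time at which `slot` should begin, measured from genesis.
fn slot_start(slot: u64) -> Duration {
    Duration::from_secs(slot * SECONDS_PER_SLOT)
}

/// Delay between when the block should have been created (the start of
/// its slot) and when it was actually received.
fn block_delay(slot: u64, received_since_genesis: Duration) -> Duration {
    received_since_genesis.saturating_sub(slot_start(slot))
}

fn main() {
    // A block for slot 1, received half-way through slot 1 (18s after
    // genesis), has a delay of 6 seconds: half a slot.
    assert_eq!(
        block_delay(1, Duration::from_secs(18)),
        Duration::from_secs(6)
    );
}
```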
<GRAPH OF SLOT DELAY>
Here we can see delays of up to 9 seconds, with typical values between 0.5 and 2 seconds.
We will start to see the issue when blocks arrive between 3 and 12 seconds late. Unaggregated attestations are broadcast at 3 seconds into the slot; if the block hasn't been seen (or verified) by this point, the validators will attest to emptiness. If the block is then received within 12 seconds (one slot), the next producer will build atop it. This creates the conditions for our issue: validators attest to an empty slot, but the next block is built atop the late block rather than the head they attested to.
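Expressed as a sketch, the problematic window is a simple predicate on the delay computed above (the 3-second attestation deadline and 12-second slot are the figures used throughout this document):

```rust
use std::time::Duration;

/// Unaggregated attestations are broadcast 3 seconds into the slot.
const ATTESTATION_DEADLINE: Duration = Duration::from_secs(3);
/// One full slot; any later and the next producer won't build atop it.
const SLOT_DURATION: Duration = Duration::from_secs(12);

/// True when a block arrives late enough that validators have already
/// attested to an empty slot, yet early enough that the next producer
/// will still build atop it.
fn in_problem_window(delay: Duration) -> bool {
    delay > ATTESTATION_DEADLINE && delay < SLOT_DURATION
}

fn main() {
    assert!(!in_problem_window(Duration::from_secs(1))); // healthy
    assert!(in_problem_window(Duration::from_secs(6))); // our issue
    assert!(!in_problem_window(Duration::from_secs(14))); // too late: not built upon
}
```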
Broadly, there are two potential reasons for this delay:

(1) Blocks are being produced late.
(2) Blocks are propagating slowly across the network.
(1) is the most obvious and simple reason. Whilst I think it's possible for clients to produce blocks faster, I'm seeing maximum block production times of ~500ms for the Sigma Prime Lighthouse Pyrmont nodes. This makes me think that (2) is at play as well. For various reasons this article will focus on (2) and leave (1) out of scope.
Moving forward with (2), slow propagation across the network is most likely caused by nodes taking a long time to verify the block prior to propagation. This hypothesis is supported by the fact that we're not seeing a similar lag on attestations, an indication that the gossip network itself is functioning well.
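To make the hypothesis concrete: in gossipsub a node only forwards a block it has validated, so verification time sits directly on the propagation path. A sketch of that pattern (illustrative names, not Lighthouse's actual gossip code):

```rust
/// Illustrative sketch; these are not Lighthouse's actual types or
/// function names.
struct SignedBeaconBlock;

/// Expensive gossip verification, which may include state processing.
fn verify_for_gossip(_block: &SignedBeaconBlock) -> Result<(), ()> {
    Ok(())
}

/// Forward the block to our gossip peers.
fn forward_to_peers(_block: &SignedBeaconBlock) {}

/// Peers further along the mesh cannot see the block until this
/// returns, so verification time is paid again at every hop.
fn on_gossip_block(block: SignedBeaconBlock) {
    if verify_for_gossip(&block).is_ok() {
        forward_to_peers(&block);
    }
}

fn main() {
    on_gossip_block(SignedBeaconBlock);
}
```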
To help confirm this, let us look at the time it takes SigP Lighthouse nodes on Pyrmont to verify a block.
<GRAPH OF BLOCK VERIFICATION DELAY>
Here we can see that verifying gossip blocks can take up to 50ms, and that there is a strong correlation between the blocks that arrived latest and those that took the longest to verify prior to propagation.
Whilst I haven't done the research to establish a direct link between slow verification and slow propagation, considering that a block must travel many hops (and therefore pass through many slow verifications), it seems reasonable that a per-node delay of 50ms might snowball into a multi-second delay across the network, especially since some low-powered nodes might take significantly longer than 50ms to verify.
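A rough back-of-the-envelope model of that snowballing (the hop count and link latency here are assumptions, not measurements):

```rust
use std::time::Duration;

/// Rough model: each hop verifies the block before forwarding it, so
/// every hop adds (verification time + link latency) to the total.
fn propagation_delay(hops: u32, verify: Duration, link_latency: Duration) -> Duration {
    (verify + link_latency) * hops
}

fn main() {
    // 50ms of verification plus 100ms of link latency, compounded over
    // 10 hops, is already 1.5s; slower nodes on the path push this
    // towards the multi-second delays seen in the graph above.
    let delay = propagation_delay(
        10,
        Duration::from_millis(50),
        Duration::from_millis(100),
    );
    assert_eq!(delay, Duration::from_millis(1500));
}
```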
From this point, we will assume that slow verification of blocks prior to propagation is a significant factor in block propagation delays.
Lighthouse has been testing a fix to this issue by modifying the procedure for block verification/import.
Here's the existing workflow for Lighthouse importing a block:

1. Run `per_slot_processing` once.
2. Run `per_slot_processing` once for each skipped slot between the parent and block.
3. Verify, propagate and import the block.

Here's how we modified it:
1. Run `per_slot_processing` once (usually a no-op now, since step 4 of the previous import has already advanced the state).
2. Run `per_slot_processing` once for each skipped slot between the parent and block.
3. Verify, propagate and import the block.
4. Run `per_slot_processing` once, ready for the next slot.

The effect is moving the mandatory `per_slot_processing` function call to after the block has been propagated and imported into the database. This function can take tens of milliseconds in Lighthouse and Nimbus (I'm not sure about other clients, but I suspect it's similar).
In other words, we push the long-running `per_slot_processing` function to the trailing edge of block imports, rather than the present leading-edge approach. This means we perform this function in the comfortable <12 seconds after the block has been propagated, rather than in the critical (and compounding) <3 seconds beforehand. There is no fundamental additional cost to this change, simply a change in the order of operations (I wouldn't call it a "simple one-liner", though).
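A sketch of the reordering (illustrative types and names, not Lighthouse's actual implementation):

```rust
/// Illustrative sketch; these are not Lighthouse's actual types or
/// function names.
struct BeaconState {
    slot: u64,
}
struct SignedBeaconBlock {
    slot: u64,
}

/// The mandatory (and long-running) slot advance.
fn per_slot_processing(state: &mut BeaconState) {
    state.slot += 1;
}

fn propagate(_block: &SignedBeaconBlock) {}
fn verify_and_import(_block: &SignedBeaconBlock, _state: &BeaconState) {}

/// Existing, leading-edge order: the slot advance sits on the critical
/// path, delaying propagation at every hop.
fn import_leading_edge(block: &SignedBeaconBlock, state: &mut BeaconState) {
    while state.slot < block.slot {
        per_slot_processing(state); // blocks propagation
    }
    propagate(block);
    verify_and_import(block, state);
}

/// Modified, trailing-edge order: the state was already advanced after
/// the previous import, so the loop below only runs for skipped slots.
fn import_trailing_edge(block: &SignedBeaconBlock, state: &mut BeaconState) {
    while state.slot < block.slot {
        per_slot_processing(state); // usually a no-op
    }
    propagate(block);
    verify_and_import(block, state);
    // Pre-emptively advance into the next slot, in the comfortable
    // <12s window after propagation rather than the critical <3s before.
    per_slot_processing(state);
}

fn main() {
    // Leading edge: the advance happens before the block is propagated.
    let mut state = BeaconState { slot: 0 };
    import_leading_edge(&SignedBeaconBlock { slot: 1 }, &mut state);
    assert_eq!(state.slot, 1);

    // Trailing edge: the state (already advanced to slot 1 by the
    // previous import) is immediately ready, and ends up pre-advanced
    // to slot 2 for the next block.
    let mut state = BeaconState { slot: 1 };
    import_trailing_edge(&SignedBeaconBlock { slot: 1 }, &mut state);
    assert_eq!(state.slot, 2);
}
```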
Note: I believe Teku have already implemented a pre-emptive call to `per_slot_processing` as a special case for the head block. Instead, I propose a change to the existing order of operations.