There is an issue on mainnet and Pyrmont where blocks from slots where `slot % SLOTS_PER_EPOCH == 0` are arriving late. Due to a quirk in fork choice, validators are building their blocks atop the late block, even though they attested to an empty slot at that position. The result is that validators infrequently but consistently miss the head/target on their attestations, because the next proposer built a block that conflicts with those attestations.
This document examines the shape and size of the problem on the Pyrmont testnet. It also details a change that can reduce block propagation times.
Note: all data in this document is from Pyrmont. Mainnet analysis can follow later.
This graph represents the delay between when a block should have been created and when it was received. For example, if a block for slot 1 was received half-way through slot 1, then the delay would be 6 seconds (half a slot).
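For concreteness, here is a minimal sketch of that delay calculation, assuming 12-second slots and measuring time from genesis (the function names are illustrative, not Lighthouse's actual API):

```rust
use std::time::Duration;

/// Pyrmont/mainnet slot duration.
const SECONDS_PER_SLOT: u64 = 12;

/// The time at which `slot` should begin, measured from genesis.
fn slot_start(slot: u64) -> Duration {
    Duration::from_secs(slot * SECONDS_PER_SLOT)
}

/// Delay between when the block should have been created (the start of
/// its slot) and when it was actually received.
fn block_delay(slot: u64, received_since_genesis: Duration) -> Duration {
    received_since_genesis.saturating_sub(slot_start(slot))
}

fn main() {
    // A block for slot 1, received half-way through slot 1 (18s after
    // genesis), has a delay of 6 seconds: half a slot.
    assert_eq!(
        block_delay(1, Duration::from_secs(18)),
        Duration::from_secs(6)
    );
}
```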
<GRAPH OF SLOT DELAY>
Here we can see delays of up to 9 seconds, with typical values between 0.5 and 2 seconds.
We will start to see the issue when blocks arrive between 3 and 12 seconds late. Unaggregated attestations are broadcast at 3 seconds into the slot; if the block hasn't been seen (or verified) by this point, the validators will attest to emptiness. If the block is then received within 12 seconds (one slot), the next producer will build atop it. This creates the conditions for our issue: validators attest to an empty slot, but the next block is built atop the late block rather than the head they attested to.
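Expressed as a sketch, the problematic window is a simple predicate on the delay computed above (the 3-second attestation deadline and 12-second slot are the figures used throughout this document):

```rust
use std::time::Duration;

/// Unaggregated attestations are broadcast 3 seconds into the slot.
const ATTESTATION_DEADLINE: Duration = Duration::from_secs(3);
/// One full slot; any later and the next producer won't build atop it.
const SLOT_DURATION: Duration = Duration::from_secs(12);

/// True when a block arrives late enough that validators have already
/// attested to an empty slot, yet early enough that the next producer
/// will still build atop it.
fn in_problem_window(delay: Duration) -> bool {
    delay > ATTESTATION_DEADLINE && delay < SLOT_DURATION
}

fn main() {
    assert!(!in_problem_window(Duration::from_secs(1))); // healthy
    assert!(in_problem_window(Duration::from_secs(6))); // our issue
    assert!(!in_problem_window(Duration::from_secs(14))); // too late: not built upon
}
```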
Broadly, there are two potential reasons for this delay:

(1) Blocks are being produced late.
(2) Blocks are propagating slowly across the network.
(1) is the most obvious and simple reason. Whilst I think it's possible for clients to produce blocks faster, I'm seeing maximum block production times of ~500ms for the Sigma Prime Lighthouse Pyrmont nodes. This makes me think that (2) is at play as well. For various reasons this article will focus on (2) and leave (1) out of scope.
Moving forward with (2), slow propagation across the network is most likely caused by nodes taking a long time to verify the block prior to propagation. This hypothesis is supported by the fact that we're not seeing a similar lag on attestations, an indication that the gossip network itself is functioning well.
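To make the hypothesis concrete: in gossipsub a node only forwards a block it has validated, so verification time sits directly on the propagation path. A sketch of that pattern (illustrative names, not Lighthouse's actual gossip code):

```rust
/// Illustrative sketch; these are not Lighthouse's actual types or
/// function names.
struct SignedBeaconBlock;

/// Expensive gossip verification, which may include state processing.
fn verify_for_gossip(_block: &SignedBeaconBlock) -> Result<(), ()> {
    Ok(())
}

/// Forward the block to our gossip peers.
fn forward_to_peers(_block: &SignedBeaconBlock) {}

/// Peers further along the mesh cannot see the block until this
/// returns, so verification time is paid again at every hop.
fn on_gossip_block(block: SignedBeaconBlock) {
    if verify_for_gossip(&block).is_ok() {
        forward_to_peers(&block);
    }
}

fn main() {
    on_gossip_block(SignedBeaconBlock);
}
```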
To help confirm this, let us look at the time it takes SigP Lighthouse nodes on Pyrmont to verify a block.
<GRAPH OF BLOCK VERIFICATION DELAY>
Here we can see that verifying gossip blocks can take up to 50ms, and that there is a strong correlation between the blocks that arrived latest and those that took the longest to verify prior to propagation.
Whilst I haven't done the research to establish a direct link between slow verification and slow propagation, considering that a block must travel many hops (and therefore pass through many slow verifications), it seems reasonable that a per-node delay of 50ms might snowball into a multi-second delay across the network, especially since some low-powered nodes might take significantly longer than 50ms to verify.
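A rough back-of-the-envelope model of that snowballing (the hop count and link latency here are assumptions, not measurements):

```rust
use std::time::Duration;

/// Rough model: each hop verifies the block before forwarding it, so
/// every hop adds (verification time + link latency) to the total.
fn propagation_delay(hops: u32, verify: Duration, link_latency: Duration) -> Duration {
    (verify + link_latency) * hops
}

fn main() {
    // 50ms of verification plus 100ms of link latency, compounded over
    // 10 hops, is already 1.5s; slower nodes on the path push this
    // towards the multi-second delays seen in the graph above.
    let delay = propagation_delay(
        10,
        Duration::from_millis(50),
        Duration::from_millis(100),
    );
    assert_eq!(delay, Duration::from_millis(1500));
}
```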
From this point, we will assume that slow verification of blocks prior to propagation is a significant factor in block propagation delays.
Lighthouse has been testing a fix to this issue by modifying the procedure for block verification/import.
Here's the existing workflow for Lighthouse importing a block:

1. Run `per_slot_processing` once.
2. Run `per_slot_processing` once for each skipped slot between the parent and block.
3. Verify, propagate and import the block.

Here's how we modified it:
1. Run `per_slot_processing` once (usually a no-op now, since step 4 of the previous import has already advanced the state).
2. Run `per_slot_processing` once for each skipped slot between the parent and block.
3. Verify, propagate and import the block.
4. Run `per_slot_processing` once, ready for the next slot.

The effect is moving the mandatory `per_slot_processing` function call to after the block has been propagated and imported into the database. This function can take tens of milliseconds in Lighthouse and Nimbus (I'm not sure about other clients, but I suspect it's similar).
In other words, we push the long-running `per_slot_processing` function to the trailing edge of block imports, rather than the present leading-edge approach. This means we perform this function in the comfortable <12 seconds after the block has been propagated, rather than in the critical (and compounding) <3 seconds beforehand. There is no fundamental additional cost to this change, simply a change in the order of operations (I wouldn't call it a "simple one-liner", though).
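A sketch of the reordering (illustrative types and names, not Lighthouse's actual implementation):

```rust
/// Illustrative sketch; these are not Lighthouse's actual types or
/// function names.
struct BeaconState {
    slot: u64,
}
struct SignedBeaconBlock {
    slot: u64,
}

/// The mandatory (and long-running) slot advance.
fn per_slot_processing(state: &mut BeaconState) {
    state.slot += 1;
}

fn propagate(_block: &SignedBeaconBlock) {}
fn verify_and_import(_block: &SignedBeaconBlock, _state: &BeaconState) {}

/// Existing, leading-edge order: the slot advance sits on the critical
/// path, delaying propagation at every hop.
fn import_leading_edge(block: &SignedBeaconBlock, state: &mut BeaconState) {
    while state.slot < block.slot {
        per_slot_processing(state); // blocks propagation
    }
    propagate(block);
    verify_and_import(block, state);
}

/// Modified, trailing-edge order: the state was already advanced after
/// the previous import, so the loop below only runs for skipped slots.
fn import_trailing_edge(block: &SignedBeaconBlock, state: &mut BeaconState) {
    while state.slot < block.slot {
        per_slot_processing(state); // usually a no-op
    }
    propagate(block);
    verify_and_import(block, state);
    // Pre-emptively advance into the next slot, in the comfortable
    // <12s window after propagation rather than the critical <3s before.
    per_slot_processing(state);
}

fn main() {
    // Leading edge: the advance happens before the block is propagated.
    let mut state = BeaconState { slot: 0 };
    import_leading_edge(&SignedBeaconBlock { slot: 1 }, &mut state);
    assert_eq!(state.slot, 1);

    // Trailing edge: the state (already advanced to slot 1 by the
    // previous import) is immediately ready, and ends up pre-advanced
    // to slot 2 for the next block.
    let mut state = BeaconState { slot: 1 };
    import_trailing_edge(&SignedBeaconBlock { slot: 1 }, &mut state);
    assert_eq!(state.slot, 2);
}
```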
Note: I believe Teku have already implemented a pre-emptive call to `per_slot_processing` as a special case for the head block. Instead, I propose a change to the existing order of operations.