Reducing Prysm Proposer Slot 0 propose time by 800ms

The background

Every Ethereum slot is 12s, and the attestation cut-off sits at the 4s mark of the slot. Attesters vote on the head at that 4s mark, so a newly proposed block that is not seen before the cut-off is highly likely to get reorged for lack of attestation votes.
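
As a concrete illustration of that timing constraint, here is a small Go sketch (illustrative constants and helper, not Prysm code) checking whether a block was seen before the attestation cut-off:

package main

import (
    "fmt"
    "time"
)

const (
    slotDuration      = 12 * time.Second // one Ethereum slot
    attestationCutoff = 4 * time.Second  // attesters vote on the head at this mark
)

// arrivedBeforeCutoff reports whether a block first seen at `seen` made the
// attestation cut-off for a slot that started at `slotStart`.
func arrivedBeforeCutoff(slotStart, seen time.Time) bool {
    return seen.Sub(slotStart) < attestationCutoff
}

func main() {
    slotStart := time.Now()
    // Broadcast 1.5s into the slot: comfortably makes the cut-off.
    fmt.Println(arrivedBeforeCutoff(slotStart, slotStart.Add(1500*time.Millisecond))) // true
    // Broadcast 4.5s into the slot: attesters have already voted, high reorg risk.
    fmt.Println(arrivedBeforeCutoff(slotStart, slotStart.Add(4500*time.Millisecond))) // false
}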

For the block proposal, a validator client does the following:

  1. The validator client calls the GetBlock RPC endpoint
  2. The beacon node builds and returns the block
  3. The validator signs the block and calls the ProposeBlock RPC endpoint
  4. The beacon node broadcasts the block to the rest of the network

(1) should happen at the zero second of the slot. (2) is the critical path: the beacon node builds the block by packing consensus objects (i.e. attestations, deposits, exits, etc.) and getting the execution payload from either the local EL client or the highest-bidding builder via mev-boost. At the 90th percentile, we have seen (2) take one to two seconds to complete. The rest, (3) and (4), should be fast. Without mev-boost, a validator should see the block broadcast under 1s into the slot. With mev-boost, a validator should see the block broadcast between 1s and 2s, depending on network latency.
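
Roughly, steps (1) through (4) look like the sketch below from the validator client's point of view; the type and method names (BeaconNodeClient, GetBlock, ProposeBlock) are illustrative placeholders rather than Prysm's actual gRPC API:

package sketch

import (
    "context"
    "log"
    "time"
)

// Illustrative placeholder types and interface; not Prysm's actual gRPC API.
type BeaconBlock struct{ Slot uint64 }

type SignedBeaconBlock struct {
    Block     BeaconBlock
    Signature [96]byte
}

type BeaconNodeClient interface {
    GetBlock(ctx context.Context, slot uint64) (*BeaconBlock, error) // steps (1)+(2)
    ProposeBlock(ctx context.Context, b *SignedBeaconBlock) error    // steps (3)+(4)
}

func propose(ctx context.Context, c BeaconNodeClient, slot uint64, slotStart time.Time,
    sign func(*BeaconBlock) [96]byte) error {

    // (1) Ideally this call fires at the very start of the slot.
    blk, err := c.GetBlock(ctx, slot) // (2) beacon node packs attestations, deposits, exits, and an execution payload
    if err != nil {
        return err
    }
    log.Printf("block built %s into the slot", time.Since(slotStart))

    // (3) Sign locally, then (4) hand the signed block back for broadcast.
    return c.ProposeBlock(ctx, &SignedBeaconBlock{Block: *blk, Signature: sign(blk)})
}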

The discovery

Two weeks ago, I wanted to improve validator logging by recording when block building begins and ends. Think of it as the start and end times for (2).

Pull request: https://github.com/prysmaticlabs/prysm/pull/12452/

Example:

{"message":"Begin building block","prefix":"rpc/validator","severity":"INFO","sinceSlotStartTime":107853265,"slot":5792614}
{"message":"Finished building block","prefix":"rpc/validator","severity":"INFO","sinceSlotStartTime":297280711,"slot":5792614,"validator":54335}

From the logs above, the validator begins building the block about 107ms into the slot and finishes about 297ms in (the sinceSlotStartTime values are durations in nanoseconds).
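
A minimal sketch of how such a field can be produced, assuming the duration is measured with Go's time.Since (the helper below is illustrative, not the PR's actual code):

package sketch

import (
    "log"
    "time"
)

// logBuildWindow wraps block building with the two log lines shown above.
// time.Since(...).Nanoseconds() is why 107853265 reads as ~107ms.
func logBuildWindow(slotStart time.Time, slot uint64, build func() error) error {
    log.Printf(`{"message":"Begin building block","slot":%d,"sinceSlotStartTime":%d}`, slot, time.Since(slotStart).Nanoseconds())
    err := build()
    log.Printf(`{"message":"Finished building block","slot":%d,"sinceSlotStartTime":%d}`, slot, time.Since(slotStart).Nanoseconds())
    return err
}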

As we said earlier, a validator should call GetBlock at the start of the slot. In an ideal world, the Begin building block sinceSlotStartTime should be as close to 0 as possible. So why isn't it 0? It could be local RPC latency between the beacon node and the validator client, or it could be that the validator is performing "other tasks" beforehand. After reading the code, I discovered that the validator client checks its exit status and its current epoch assignment before calling GetBlock.
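
In sketch form, the ordering I found looks roughly like this (placeholder names, not Prysm's actual functions); the important part is that the duties lookup sits on the critical path before GetBlock:

package sketch

import "context"

// Illustrative sketch of the ordering observed in the validator client;
// the interface and method names are placeholders, not Prysm's actual code.
type proposer interface {
    checkExitStatus(ctx context.Context) (exited bool, err error)
    currentEpochAssignments(ctx context.Context) (any, error)
    getSignAndPropose(ctx context.Context, slot uint64) error
}

func submitProposal(ctx context.Context, v proposer, slot uint64) error {
    exited, err := v.checkExitStatus(ctx)
    if err != nil {
        return err
    }
    if exited {
        return nil // never propose with an exited key
    }
    // This duties lookup happens before GetBlock. If the beacon node has to
    // compute a cold epoch shuffling to answer it, the whole proposal is
    // delayed by that much.
    if _, err := v.currentEpochAssignments(ctx); err != nil {
        return err
    }
    return v.getSignAndPropose(ctx, slot) // steps (1)-(4) from earlier
}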

It quickly dawned on me that getting the current epoch assignment is not a cheap call, and that this significantly affects slot 0 proposer performance.

The bottleneck

Let's take a step back. The validator client is just a dumb signer. It requests blocks and attestations from the beacon node to sign. It is not aware of when it needs to make those requests, so it calls GetDuties (once every epoch) to learn its attester slots and proposer slots. The proposer slots for the current epoch are safely known at the start of the epoch, at slot 0. With a caveat, the proposer slots of the next epoch are semi-safely known at the start of the current epoch as well: semi-safely, because a next-epoch proposer assignment can still change if that proposer's effective balance falls below the threshold, for example due to slashing. Here is the spec definition:

effective_balance = state.validators[candidate_index].effective_balance
if effective_balance * MAX_RANDOM_BYTE >= MAX_EFFECTIVE_BALANCE * random_byte:
    return candidate_index
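
In other words, a candidate is kept with probability proportional to its effective balance relative to MAX_EFFECTIVE_BALANCE, so an assignment announced a full epoch ahead is not guaranteed to survive. A tiny Go illustration of that check (a sketch, not Prysm's implementation):

package sketch

// Spec constants written out as plain numbers: MAX_RANDOM_BYTE is 2^8 - 1
// and MAX_EFFECTIVE_BALANCE is 32 ETH in Gwei.
const (
    maxRandomByte       uint64 = 255
    maxEffectiveBalance uint64 = 32_000_000_000
)

// accepted mirrors the spec check above. A candidate with a full 32 ETH
// effective balance always passes; one whose balance has dropped (e.g. after
// a slashing) passes proportionally less often, so its previously announced
// proposer assignment can be re-rolled.
func accepted(effectiveBalanceGwei uint64, randomByte byte) bool {
    return effectiveBalanceGwei*maxRandomByte >= maxEffectiveBalance*uint64(randomByte)
}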

Now you may ask, why do we care about the next epoch's proposer? We care because the beacon node uses ForkchoiceUpdated + PayloadAttributes in the Engine API to signal its intent to the EL client for local block construction. For a slot 0 proposal, this must be done at slot 31 of the previous epoch rather than at slot 0 itself, to give the EL client enough time to construct a profitable execution block. Here is the spec definition
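
Roughly, that Engine API signal looks like the sketch below; the struct fields follow the Engine API's ForkchoiceStateV1 and PayloadAttributesV2 shapes, but the Go types and call helper are illustrative, not Prysm's code or the spec's definition:

package sketch

// Shapes follow the Engine API's ForkchoiceStateV1 and PayloadAttributesV2;
// the Go types and the helper below are illustrative, not Prysm's code.
type ForkchoiceState struct {
    HeadBlockHash      [32]byte
    SafeBlockHash      [32]byte
    FinalizedBlockHash [32]byte
}

type PayloadAttributes struct {
    Timestamp             uint64 // timestamp of the slot being built (slot 0 of the next epoch)
    PrevRandao            [32]byte
    SuggestedFeeRecipient [20]byte
    // Withdrawals omitted for brevity.
}

// signalBlockConstruction is what the beacon node conceptually does at slot 31:
// tell the EL client, via forkchoiceUpdated with payload attributes attached,
// to start building an execution payload for the validator's upcoming slot,
// so a profitable payload is ready by slot 0.
func signalBlockConstruction(callEngineAPI func(method string, params ...any) error,
    fc ForkchoiceState, attrs PayloadAttributes) error {
    return callEngineAPI("engine_forkchoiceUpdatedV2", fc, attrs)
}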

Another thing to note: after processing the block at slot 31, client implementations have a look-ahead optimization that advances the state to the next epoch, so that work does not have to happen when the slot 0 block arrives. Advancing the state to the next epoch involves precomputing the shuffling cache for the attester committees and the proposer. This can easily save up to 500ms when the slot 0 block arrives.
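
Conceptually, the look-ahead is something like the following sketch, where processSlots and warmShufflingCaches stand in for the real state-transition and cache code:

package sketch

const slotsPerEpoch = 32

// Illustrative look-ahead sketch; the function parameters stand in for the
// real state-transition and cache code.
type beaconState interface {
    Copy() beaconState
    Slot() uint64
}

// precomputeNextEpoch runs after the slot 31 block has been processed.
// Advancing a copy of the state across the epoch boundary performs epoch
// processing, and the committee/proposer shuffling computed there is stored
// in caches, which is the up-to-500ms saved when the slot 0 block arrives.
func precomputeNextEpoch(st beaconState,
    processSlots func(beaconState, uint64) (beaconState, error),
    warmShufflingCaches func(beaconState)) error {

    nextEpochStart := (st.Slot()/slotsPerEpoch + 1) * slotsPerEpoch
    advanced, err := processSlots(st.Copy(), nextEpochStart)
    if err != nil {
        return err
    }
    warmShufflingCaches(advanced) // attester committees + proposer index
    return nil
}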

Now here is what Prysm missed:

  • At the end of epoch 1, the Prysm beacon node caches epoch 2's shuffling result
  • The Prysm beacon node does not attempt to cache epoch 3's shuffling result
  • The Prysm validator calls GetDuties, which returns epoch 2's and epoch 3's shuffling results. Epoch 2 is warm in the cache, but epoch 3 is cold. GetDuties computes epoch 3's shuffling on the fly, which adds an additional ~500ms of latency to the call (roughly as in the sketch below)
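
The warm-versus-cold difference inside GetDuties can be pictured with this sketch (a hypothetical cache interface, not Prysm's):

package sketch

// Hypothetical cache interface, used only to illustrate the warm vs. cold paths.
type shufflingCache interface {
    Get(epoch uint64) (shuffling []uint64, ok bool)
    Put(epoch uint64, shuffling []uint64)
}

// shufflingForDuties returns an epoch's shuffling, computing it inline on a
// cache miss. In the scenario above, epoch 2 is a hit while epoch 3 is a
// miss, and that miss (~500ms of shuffling work) lands inside the GetDuties
// call the validator makes right before proposing at slot 0.
func shufflingForDuties(c shufflingCache, epoch uint64, compute func(uint64) []uint64) []uint64 {
    if s, ok := c.Get(epoch); ok {
        return s // warm path
    }
    s := compute(epoch) // cold path
    c.Put(epoch, s)
    return s
}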

That is why a Prysm validator proposing at slot 0 calls GetBlock about 500ms late into the slot, which eats into the 4s attestation cut-off.

The fix

After exploring many solutions, including extending the shuffling cache to hold two epochs' worth of data, I settled on the simplest one for now: call GetProposerIndex on a beacon state whose slot is set into epoch 3, right after caching epoch 2's shuffling result. This warms the shuffling cache for epoch 3, so the GetDuties call at epoch 2 slot 0 is fast; we verified it now takes only 50ms, a reduction from 500ms to 50ms. Note there is still some inefficiency, such as the slot 31 case still being missed, which will be addressed in a subsequent PR.
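
A sketch of the shape of the fix, with illustrative types and helper names (the actual change is in the linked PR):

package sketch

const slotsPerEpoch = 32

// Illustrative types and helper; the real change lives in the linked PR.
type beaconState interface {
    Copy() beaconState
    Slot() uint64
    SetSlot(slot uint64)
}

// warmNextEpochShuffling runs right after epoch 2's shuffling has been cached.
// Computing a proposer index against a state copy whose slot is pushed one
// epoch further (into epoch 3 in the example) forces epoch 3's shuffling to
// be computed and cached now, instead of inside the GetDuties call at
// epoch 2 slot 0.
func warmNextEpochShuffling(st beaconState,
    getProposerIndex func(beaconState) (uint64, error)) error {

    copied := st.Copy()
    copied.SetSlot(copied.Slot() + slotsPerEpoch)
    _, err := getProposerIndex(copied) // side effect: fills the shuffling cache
    return err
}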
