Kiln testnet block proposal failure

Date: March 15, 2022

Incident Summary

Shortly after the terminal total difficulty (TTD) was reached (~3/15 3:00 PM UTC), Prysm nodes proposed bad blocks with the following error:

{"error":"could not process block: could not verify new payload: could not validate execution payload from execution engine: could not validate block hash: ","message":"Could not handle p2p pubsub","prefix":"sync","severity":"ERROR","topic":"/eth2/e7acb210/beacon_block/ssz_snappy"}

A similar error was observed over the wire:

ERROR sync: Could not handle p2p pubsub error=could not validate block hash:
could not validate execution payload from execution engine

Impact

Data shows that no beacon blocks came from the validators run by the EF and Prysmatic Labs. The affected client combinations were Prysm with Geth and Prysm with Nethermind. The missing blocks accounted for ~15-20% of all blocks. Attestation participation was not affected, although the valid blocks were fuller because they had to carry the attestations that the missing blocks would have included.

Root causes

The Prysm beacon node used the wrong endianness when marshalling and unmarshalling the base_fee_per_gas field of the execution_payload object. Today, the execution layer uses big-endian encoding and the consensus layer uses little-endian encoding. Because Prysm did not unmarshal execution_payload back to its original form, the execution layer client correctly rejected the malformed payload, returning INVALID_BLOCK_HASH from the engine_newPayloadV1 endpoint.

Resolution

The issue was identified by comparing the execution_payload before and after unmarshalling. At the same time, Mario Vega and MariusVanDerWijden noticed a similar pattern.

Upon discovery, the endianness bug was quickly patched. We then tested the patch on local and cluster setups before building the Docker image and releasing it to everyone else.

Detection

After TTD was reached, the Kiln testnet block explorer began reporting missing Prysm blocks, and the error logs on the cluster nodes confirmed it. The issue did not show up in the previous devnets because, as Marius pointed out, the base fee there was 7, a single-byte value that reads the same regardless of endianness. It also did not show up in the unit tests because 7 was used as the input value.
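As a rough illustration of why an input of 7 could not catch the bug, the Go sketch below checks which values would reveal a missing byte swap. The helper exposesBug is hypothetical, values are modelled as uint64 for brevity, and the sketch assumes the mistake amounts to skipping the swap on the value's minimal big-endian bytes.

```go
package main

import (
	"bytes"
	"fmt"
	"math/big"
)

// exposesBug reports whether a test input would reveal a missing byte swap.
// A value hides the bug when its minimal big-endian bytes read the same
// forwards and backwards, which is true of every single-byte value.
func exposesBug(v uint64) bool {
	be := new(big.Int).SetUint64(v).Bytes() // minimal big-endian encoding
	le := make([]byte, len(be))
	for i, b := range be {
		le[len(be)-1-i] = b
	}
	return !bytes.Equal(be, le)
}

func main() {
	// 7 mirrors the devnet and unit-test value; the others are illustrative.
	for _, v := range []uint64{7, 255, 256, 1_000_000_000} {
		fmt.Printf("base fee %d exposes bug: %v\n", v, exposesBug(v))
	}
}
```

Any test input wider than one byte and not byte-symmetric, such as 256, would have flagged the problem.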

Action items

Action Item	Type	Owner	Relevant Link
Fix base fee per gas endianness	Code change	Terence
Test the fix in cluster	Testing	Terence
Update e2e test to include tx generator	Testing	Nishant
Add differential fuzzing for Engine API round trip	Testing	Nishant
Release Docker image	Release	Terence	https://gcr.io/prysmaticlabs/prysm/beacon-chain:kiln-3ea8b7
Post-mortem	Documentation	Terence

Lessons Learned

What went wrong

Prysm proposers were unable to propose blocks.

Where we got lucky

Client diversity. Even with Prysm proposers down, the chain remained relatively healthy.
Community support. People like Mario and Marius were around to help with the debugging. (Thanks!)
Great tooling. We added a tool that lets a Prysm beacon node fake-propose every slot for faster triage.
