
Kiln testnet block proposal failure

Date: March 15, 2022

Incident Summary

Shortly after the terminal total difficulty (TTD) was reached (~3:00 PM UTC on March 15), a Prysm node proposed a bad block, which produced the following error:

{"error":"could not process block: could not verify new payload: could not validate execution payload from execution engine: could not validate block hash: ","message":"Could not handle p2p pubsub","prefix":"sync","severity":"ERROR","topic":"/eth2/e7acb210/beacon_block/ssz_snappy"}

A similar error was observed over the wire:

ERROR sync: Could not handle p2p pubsub error=could not validate block hash:
could not validate execution payload from execution engine

Impact

Data shows that no beacon blocks came from the validators run by the EF and Prysmatic Labs. The affected client combinations were Prysm - Geth and Prysm - Nethermind. The missing blocks account for ~15-20% of the total blocks. There was no impact on attestation participation, although the valid blocks were fuller because they had to include the attestations that the missing blocks would have carried.

Root causes

The Prysm beacon node used the wrong endianness to marshal and unmarshal the base_fee_per_gas field of the execution_payload object. Today, the execution layer uses big-endian encoding and the consensus layer uses little-endian encoding. Because Prysm did not restore execution_payload to its original form when converting it back, the execution layer client correctly rejected the malformed payload, returning INVALID_BLOCK_HASH from the engine_newPayloadV1 endpoint.
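The byte-order conversion described above can be sketched in Go as follows. This is a minimal illustration, not Prysm's actual code: the function names `BaseFeeLEToHex` and `HexToBaseFeeLE` are hypothetical, and the sketch assumes the consensus layer holds the base fee as 32 little-endian bytes while the engine API carries it as a 0x-prefixed big-endian hex quantity.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// reverse returns a copy of b with the byte order flipped.
func reverse(b []byte) []byte {
	out := make([]byte, len(b))
	for i, v := range b {
		out[len(b)-1-i] = v
	}
	return out
}

// BaseFeeLEToHex (hypothetical name) converts a 32-byte little-endian base
// fee into a 0x-prefixed big-endian hex quantity.
func BaseFeeLEToHex(le [32]byte) string {
	// big.Int.SetBytes expects big-endian input, so flip the bytes first.
	n := new(big.Int).SetBytes(reverse(le[:]))
	return "0x" + n.Text(16)
}

// HexToBaseFeeLE (hypothetical name) parses a 0x-prefixed big-endian hex
// quantity and returns the 32-byte little-endian form.
func HexToBaseFeeLE(s string) ([32]byte, error) {
	var le [32]byte
	if !strings.HasPrefix(s, "0x") {
		return le, fmt.Errorf("missing 0x prefix: %q", s)
	}
	n, ok := new(big.Int).SetString(s[2:], 16)
	if !ok || n.Sign() < 0 || n.BitLen() > 256 {
		return le, fmt.Errorf("invalid base fee: %q", s)
	}
	// n.Bytes() is minimal big-endian; reversing puts the least-significant
	// byte first, and the remaining bytes of le stay zero.
	copy(le[:], reverse(n.Bytes()))
	return le, nil
}

func main() {
	le, err := HexToBaseFeeLE("0x3b9aca00") // 1 gwei
	if err != nil {
		panic(err)
	}
	fmt.Printf("%x\n", le[:4])      // prints 00ca9a3b
	fmt.Println(BaseFeeLEToHex(le)) // prints 0x3b9aca00
}
```

Omitting either `reverse` call yields a different base fee on the other side of the API, and therefore a different block hash, for any value whose bytes are not symmetric.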

Resolution

The issue was identified by comparing execution_payload before and after unmarshalling. At the same time, Mario Vega and MariusVanDerWijden also noticed a similar pattern:

[Image not available]

Upon discovery, the endianness bug was quickly patched. We then tested the patch on local and cluster setups before building the Docker image and releasing it to everyone else.

Detection

After TTD was reached, the Kiln testnet block explorer began reporting missing Prysm blocks, and the error logs on the cluster nodes confirmed it. This issue did not show up in the previous devnets because, as Marius pointed out, the base fee there was 7, which is the same value regardless of endianness. It also did not show up in the unit tests because 7 was used as the input value.
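A short Go sketch of why 7 could not expose the bug: a value that fits in a single byte has a minimal encoding that reads the same in both byte orders, so a test using it passes whether or not the conversion is correct. The helper `OrderSensitive` is a hypothetical name introduced for illustration, and the sketch does not reproduce Prysm's marshaling code.

```go
package main

import (
	"bytes"
	"fmt"
	"math/big"
)

// minimalBytes returns the shortest big-endian byte string for v.
func minimalBytes(v uint64) []byte {
	return new(big.Int).SetUint64(v).Bytes()
}

// reversed returns a copy of b in the opposite byte order.
func reversed(b []byte) []byte {
	out := make([]byte, len(b))
	for i, x := range b {
		out[len(b)-1-i] = x
	}
	return out
}

// OrderSensitive (hypothetical helper) reports whether flipping the byte
// order changes the minimal encoding of v, that is, whether v is a useful
// test input for catching a byte-order mistake.
func OrderSensitive(v uint64) bool {
	b := minimalBytes(v)
	return !bytes.Equal(b, reversed(b))
}

func main() {
	for _, v := range []uint64{7, 255, 256, 1_000_000_000} {
		fmt.Printf("base fee %d: order-sensitive = %v\n", v, OrderSensitive(v))
	}
	// Output:
	// base fee 7: order-sensitive = false
	// base fee 255: order-sensitive = false
	// base fee 256: order-sensitive = true
	// base fee 1000000000: order-sensitive = true
}
```

Byte-palindromic multi-byte values such as 0x0101 are also insensitive, so test inputs are best chosen with distinct bytes, for example 1 gwei (0x3b9aca00).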

Action items

| Action Item | Type | Owner | Relevant Link |
| --- | --- | --- | --- |
| Fix base fee per gas endianness | Code change | Terence | |
| Test the fix in cluster | Testing | Terence | |
| Update e2e test to include tx generator | Testing | Nishant | |
| Add differential fuzzing for engine api round trip | Testing | Nishant | |
| Release docker image | Release | Terence | https://gcr.io/prysmaticlabs/prysm/beacon-chain:kiln-3ea8b7 |
| Post mortem | Documentation | Terence | |

Lessons Learned

What went wrong

  • Prysm proposers were unable to propose blocks

Where we got lucky

  • Client diversity. Even with Prysm proposers down, the chain was relatively healthy
  • Community support. People like Mario and Marius were around to help with the debugging. (Thanks!)
  • Great tooling. We added a tool that lets a Prysm beacon node fake-propose every slot for faster triage