
Issue

On Prysm nodes, we're seeing attestations from Lighthouse being ignored like this:

time="2023-08-02 16:25:04" level=debug msg="Gossip message was ignored" agent="Lighthouse/v4.3.0-dfcb336/x86_64-linux" error="epoch 3 root 0x094f99ab02d5b0d71f54fe3a2309d360eeb6ccca96f91d72c7f8b350533d836b: not a checkpoint in forkchoice" gossip score=21.22931150205812 multiaddress="/ip4/144.126.197.69/tcp/59008" peer id=16Uiu2HAm9GwUAyubQR85fa4RDaTmmqp6QKniDXVHoboKrx8dRBMc prefix=sync topic="/eth2/eec15ee2/beacon_attestation_63/ssz_snappy"

From @potuz:

@sproul could you check if Lighthouse is struggling or in a very long fork? we have a block in 64 which is a direct descendant of 63 and Lighthouse keeps using 63 as a checkpoint
but yeah, whatever this is, the processing time of these blocks shows we can't realistically handle these networks on these computers at least. 63 took like 9 seconds to process

It seems like Lighthouse is using an old root for the target checkpoint in attestations.
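For reference, the target checkpoint is the `target` field of `AttestationData` in the phase0 consensus spec:

```python
class Checkpoint(Container):
    epoch: Epoch
    root: Root

class AttestationData(Container):
    slot: Slot
    index: CommitteeIndex
    # LMD GHOST vote
    beacon_block_root: Root
    # FFG vote
    source: Checkpoint
    target: Checkpoint  # the (epoch, root) pair Prysm is rejecting
```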

Investigation

I checked a few log files and found similar occurrences at other slot numbers on the Lighthouse/Besu & Prysm/Geth nodes:

time="2023-08-02 16:24:21" level=debug msg="Gossip message was ignored" agent="Lighthouse/v4.3.0-dfcb336/x86_64-linux" error="epoch 3 root 0xae769c6e065ecca568214148c2d4676d623d3ca24d7db008c0061910475b1f22: not a checkpoint in forkchoice" gossip score=0.6617806177346196 multiaddress="/ip4/68.183.113.156/tcp/9000" peer id=16Uiu2HAmMs63ZdcqCtf7dbQH7JUXkXhEsDeeLHPTmQEhxFDRYqne prefix=sync topic="/eth2/eec15ee2/beacon_attestation_52/ssz_snappy"

The checkpoint root 0xae769c6 in this attestation points to the block at slot 91.

The Lighthouse logs from around the same time show that the block at slot 94 had already been imported as the head block:

Aug 02 16:23:51.436 DEBG Fork choice updated head                old_head: 0xae769c6e065ecca568214148c2d4676d623d3ca24d7db008c0061910475b1f22, old_head_weight: Some(4013024000000000), new_head: 0x02ba7c03aacd27bad2e9dd3a910bebff8dfb5988efb84eae20ec684d93e62bec, new_head_weight: Some(840000000000000), service: beacon

<more logs..>

Aug 02 16:23:54.001 INFO Synced                                  slot: 94, block: 0x02ba…2bec, epoch: 2, finalized_epoch: 0, finalized_root: 0x5096…9387, exec_hash: 0xaa94…2f26 (verified), peers: 81, service: slot_notifier

And then this was logged when preparing the attestation:

Aug 02 16:24:16.004 DEBG Attester cache miss                     request_slot: 96, head_state_slot: 94, beacon_block_root: 0x02ba7c03aacd27bad2e9dd3a910bebff8dfb5988efb84eae20ec684d93e62bec, service: beacon
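The cache miss itself is expected here: the head state is still at slot 94, so Lighthouse has to advance it through the empty slots 95–96 before it can build attestation data for slot 96. A minimal sketch of that fallback, assuming spec-style slot processing (`process_slots` is the consensus-spec function; `get_attestation_data` is a hypothetical helper, not Lighthouse's actual code):

```python
def attestation_data_on_cache_miss(head_root, head_state, request_slot):
    state = head_state.copy()
    if state.slot < request_slot:
        # Advance through empty slots (95 and 96 here); no blocks are
        # applied, so `head_root` remains the block being voted for.
        process_slots(state, request_slot)
    return get_attestation_data(state, head_root)  # hypothetical helper
```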

So we know that for this attestation:

  • request slot = 96, request epoch = 3
  • head state slot = 94
  • target checkpoint should have epoch 3 and root 0x02ba7c0 (94), per the spec logic sketched after this list
  • however, the Prysm node is seeing an older root 0xae769c6 (91) in the target checkpoint, which I can't really explain from looking at the code.
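The expected target follows from the honest-validator spec: the target root is the block root at the start slot of the attestation epoch, and when that start slot is at or ahead of the head state (96 ≥ 94), the head block itself is the epoch boundary block. A worked sketch with this attestation's numbers (paraphrasing the spec's get_attestation_data, with SLOTS_PER_EPOCH = 32):

```python
SLOTS_PER_EPOCH = 32

def expected_target(head_root, head_state_slot, request_slot, block_root_at_slot):
    epoch = request_slot // SLOTS_PER_EPOCH    # 96 // 32 = 3
    start_slot = epoch * SLOTS_PER_EPOCH       # 96
    if start_slot >= head_state_slot:          # 96 >= 94: the boundary slot is
        root = head_root                       # empty, so the head block
    else:                                      # (0x02ba.., slot 94) is the
        root = block_root_at_slot(start_slot)  # epoch boundary block
    return (epoch, root)                       # expected: (3, 0x02ba..)
```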

Throughout this period we're also seeing a large number of invalid attestations from the network, mostly rejected as "Past Slot", which makes me wonder whether message delivery was severely delayed:

Aug 02 16:24:16.171 DEBG Invalid attestation from network        type: "aggregated", peer_id: 16Uiu2HAm7ZKCLyx3KmtyZLuTofaWDjjqqGPko7qhkCJWRBpRRYw7, block: 0x29a25932ed6eaea731f3375f1668e9a4a986824ce14a584320387776beb88fa6, reason: PastSlot { attestation_slot: Slot(45), earliest_permissible_slot: Slot(64) }
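The "Past Slot" rejections come from the gossip propagation window: the p2p spec only allows an attestation to propagate while `attestation.data.slot + ATTESTATION_PROPAGATION_SLOT_RANGE >= current_slot`, with `ATTESTATION_PROPAGATION_SLOT_RANGE = 32`. Plugging in the numbers from the log above:

```python
ATTESTATION_PROPAGATION_SLOT_RANGE = 32

earliest_permissible_slot = 64  # from the log
current_slot = earliest_permissible_slot + ATTESTATION_PROPAGATION_SLOT_RANGE  # ~96
attestation_slot = 45

# The attestation is ~51 slots old (~10 minutes, assuming 12 s slots) and
# 19 slots past the propagation window, hence the PastSlot rejection.
assert attestation_slot < earliest_permissible_slot
```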

Additional Info

  • Lighthouse Peer ID: 16Uiu2HAmV9PVEounMWpf3TWNsvrKi2tbkNHEZSyeEBDFKhn7MkXf (Prysm disconnected this peer as faulty because of the invalid attestations it sent)
  • Prysm Peer ID: 16Uiu2HAmKE7dpAckeUxzaGta9HDgp5FMDGjMxrz7gDozC1o1eabm