## Homestaker self-produced block reorg

#### TLDR

I think that https://github.com/ethereum/execution-apis/pull/559 can help a lot the homestakers who build blocks locally, at the very moment they need bandwidth the most. In that scenario there is no relay helping with publishing over the CL p2p network. Since a homestaker building locally selects blobs from the public mempool, remote nodes should be able to fetch those blobs quite effectively from their own local EL mempool. Moreover, in such conditions, disabling flood publish should in theory reduce latency by reducing the number of blobs being sent concurrently over the network.

## Real event (Teku CL client)

### homestaker bandwidth

The user described their internet connection as 180 Mbit/s down and 30 Mbit/s up, and reported usually never seeing more than 15 Mbit/s from the entire node.

### block building

This is what actually happened on mainnet for the block at slot `10010852`:

```text!
2024-09-21 17:30:47.048-04:00 | ValidatorApiChannel-8 | INFO | ValidatorApiHandler | Creating unsigned block for slot 10010852
2024-09-21 17:30:47.916-04:00 | OkHttp http://127.0.0.1:18550/... | INFO | ExecutionBuilderModule | Falling back to locally produced execution payload and blobs bundle (Block Number 20801650, Block Hash = 0xa66f58f52eca775f01c201ace90c619fc3006f4cb38ab4b184f22e1d41b42930, Blobs = 6, Fallback Reason = builder_header_not_available)
```

Block production started; no builder bid was returned by the relay (HTTP status 204), so the locally produced block was chosen. It turned out that the local EL went for a **6-blob** block.

```text!
Imported #20,801,650 / 75 tx / 16 ws / 6 blobs / base fee 6.82 gwei / 4,606,716 (15.4%) gas / (0xa66f58f52eca775f01c201ace90c619fc3006f4cb38ab4b184f22e1d41b42930) in 0.122s. Peers: 20
```

Besu produced a light payload: 75 transactions consuming 15.4% of the gas target.

### block/blobs publishing

```text!
Block Publishing *** Slot: 10010852 start 1103ms, blob_sidecars_prepared +87ms, block_import_completed +297ms, block_and_blob_sidecars_publishing_initiated +1ms, complete +20ms total: 405ms
```

Block and blob publishing was initiated at 1103ms (production start, measured into the slot) + 87ms + 297ms + 1ms = **1.488 seconds into the slot**.

The current Teku release performs:

- block and blob publishing in parallel;
- with flood publish enabled (the upcoming release will support disabling flood publish, and IDONTWANT).

### Nodes importing 10010852

On the other side, here is what we see (I checked several Teku nodes and they show similar logs):

```text!
Late Block Import *** Block: 294f24c16fe436e9008eca8179c5bc5f348466c8a4d4eb3b005d408ee938dc50 (10010852) Proposer: 171876 Result: failed_data_availability_check_not_available Timings: arrival 1794ms, gossip_validation +3ms, pre-state_retrieved +4ms, processed +189ms, execution_payload_result_received +0ms, data_availability_checked +3812ms, begin_importing +0ms, completed +0ms
```

Unpacking, this means:

- overall, the block import attempt failed due to unavailable blobs (`failed_data_availability_check_not_available`);
- the block arrived at `1794ms` into the slot;
- considering that publishing started at 1488ms, the block took ~300ms to reach the node;
- from block arrival, the BN waited 3ms + 4ms + 189ms + 0ms + 3812ms (until `~5.8s` into the slot) before giving up.

#### Reorg

Later, the node logs an actual reorg when importing `...853`, signaling that, in the end, all blobs for `...852` did arrive and the block was imported before `...853` was received.

```text!
Reorg Event *** New Head: 2fb5a00288f373e8915ea57627eb5a41702d94aba686e34c94aaaaac85c1f61e (10010853), Previous Head: 294f24c16fe436e9008eca8179c5bc5f348466c8a4d4eb3b005d408ee938dc50 (10010852), Common Ancestor: c0651611f00ada6b6a346620b36836cca08ba545059940d5a80a418671d004da (10010851)
```

#### full sequence

```text!
2024-09-21 23:30:51.152 INFO - Slot Event *** Slot: 10010852, Block: ... empty, Justified: 312838, Finalized: 312837, Peers: 99
2024-09-21 23:30:52.802 WARN - Late Block Import *** Block: 294f24c16fe436e9008eca8179c5bc5f348466c8a4d4eb3b005d408ee938dc50 (10010852) Proposer: 171876 Result: failed_data_availability_check_not_available Timings: arrival 1794ms, gossip_validation +3ms, pre-state_retrieved +4ms, processed +189ms, execution_payload_result_received +0ms, data_availability_checked +3812ms, begin_importing +0ms, completed +0ms
2024-09-21 23:31:01.028 INFO - Reorg Event *** New Head: 2fb5a00288f373e8915ea57627eb5a41702d94aba686e34c94aaaaac85c1f61e (10010853), Previous Head: 294f24c16fe436e9008eca8179c5bc5f348466c8a4d4eb3b005d408ee938dc50 (10010852), Common Ancestor: c0651611f00ada6b6a346620b36836cca08ba545059940d5a80a418671d004da (10010851)
2024-09-21 23:31:03.161 INFO - Slot Event *** Slot: 10010853, Block: 2fb5a00288f373e8915ea57627eb5a41702d94aba686e34c94aaaaac85c1f61e, Justified: 312838, Finalized: 312837, Peers: 100
```

#### What could have happened with `engine_getBlobsV1`

Instead of waiting out the availability window (`data_availability_checked +3812ms`) and giving up, the receiving node, having already received the block, could have queried its local EL for the corresponding blobs from the mempool. The call most probably would have succeeded, the block could have been imported by ~2s into the slot, and the node could have attested to it.
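To make the idea concrete, here is a minimal sketch (not Teku code; the function name and endpoint are illustrative, and Engine API JWT authentication is omitted): the CL already knows the blob versioned hashes from the block's `blob_kzg_commitments`, so a single `engine_getBlobsV1` call can ask the local EL for all of them at once.

```typescript
// Hypothetical sketch of the getBlobs fallback. Only the JSON-RPC method
// name and its request/response shape come from the execution-apis PR;
// everything else (function name, URL, missing JWT auth) is illustrative.
interface BlobAndProofV1 {
  blob: string;  // hex-encoded blob data
  proof: string; // hex-encoded KZG proof
}

async function fetchBlobsFromLocalEl(
  engineUrl: string,        // e.g. "http://127.0.0.1:8551" (default Engine API port)
  versionedHashes: string[] // derived from the block's blob_kzg_commitments
): Promise<Array<BlobAndProofV1 | null>> {
  const res = await fetch(engineUrl, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "engine_getBlobsV1",
      params: [versionedHashes],
    }),
  });
  const { result } = await res.json();
  // Per the spec, an entry is null when that blob is not in the EL mempool,
  // so the CL can keep waiting on gossip only for the missing indices.
  return result;
}
```

In this incident all six blobs came from the public mempool (the block was built locally), so the lookups would most probably all have hit, and the data availability check could have completed around ~2s into the slot instead of timing out at ~5.8s.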
### Final considerations

As said at the beginning, I believe that if the majority of the network adopts `engine_getBlobsV1`, it could effectively help homestakers. In an upcoming release, Teku will also allow disabling flood publishing, which should have a positive impact too.
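On the flood-publish point, the sketch below shows what the toggle looks like in a gossipsub implementation that exposes it (`@chainsafe/libp2p-gossipsub`; purely illustrative, since Teku is Java and its option will be named differently). With flood publish disabled, self-published messages go only to mesh peers instead of to every peer subscribed to the topic, which is where the upload-bandwidth saving for a block plus 6 blob sidecars comes from.

```typescript
import { createLibp2p } from "libp2p";
import { gossipsub } from "@chainsafe/libp2p-gossipsub";

// Illustration only (not Teku): with floodPublish disabled, self-published
// messages are sent to mesh peers only, instead of to every peer subscribed
// to the topic, so the block and its 6 blob sidecars compete for far less
// upload bandwidth at the same instant. A real node configuration would
// also need transports, connection encryption, peer discovery, etc.
const node = await createLibp2p({
  services: {
    pubsub: gossipsub({
      floodPublish: false,
    }),
  },
});
```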