# Teku Mainnet Data Availability Incident (2026-01-13)

## Summary

On 2026-01-13, around 13:00 UTC, some Teku mainnet nodes began struggling to follow the chain, printing "empty" slot events and eventually falling into sync mode. The issue was caused by `LevelDB` transaction stalls under a sudden increase in blob usage. The incident resolved when blob usage returned to previous levels.

![image](https://hackmd.io/_uploads/rk0Fu3uUWg.png)

## Impact

A subset of Teku nodes running high-throughput "supernode" configurations were affected. No meaningful network-wide participation drop was observed. Affected users experienced degraded attestation and block production performance, resulting in lost staking rewards.

## Log Evidence

Affected nodes displayed empty slot events followed by late block imports with extended data availability checks:

```
2026-01-13 13:03:15.001 INFO - Slot Event *** Slot: 13457114, Block: ... empty, Justified: 420533, Finalized: 420532, Peers: 64
2026-01-13 13:03:27.104 INFO - Slot Event *** Slot: 13457115, Block: ... empty, Justified: 420533, Finalized: 420532, Peers: 64
2026-01-13 13:03:39.098 INFO - Slot Event *** Slot: 13457116, Block: ... empty, Justified: 420533, Finalized: 420532, Peers: 64
2026-01-13 13:03:51.097 INFO - Slot Event *** Slot: 13457117, Block: ... empty, Justified: 420533, Finalized: 420532, Peers: 64
2026-01-13 13:03:52.736 WARN - Late Block Import *** Block: 8a1b073d... (13457114) Proposer: 399523 Result: success Timings: arrival 1819ms, gossip_validation +3ms, pre-state_retrieved +3ms, processed +182ms, execution_payload_result_received +0ms, data_availability_checked +39665ms, begin_importing +55ms, transaction_prepared +0ms, transaction_committed +0ms, completed +9ms
```

The `data_availability_checked +39665ms` timing indicates the block import waited roughly 40 seconds for data column sidecars to become available.

As the situation worsened, nodes began discarding gossip messages:

```
2026-01-13 13:03:42.106 WARN - Discarding gossip message for topic /eth2/8c9f62fe/beacon_attestation_11/ssz_snappy because the executor queue is full
2026-01-13 13:03:42.572 ERROR - Unexpected rejected execution due to full task queue in p2p-async-scheduler-0
```

## Root Cause

A 2.5x increase in blob traffic, as the network approached target blob usage, pushed `LevelDB` past a throughput threshold, causing at least one transaction to stall for 30-60+ seconds.

### Background: Teku's Async Write Design

Teku implements asynchronous writes for data availability checks: the client does not wait for data column sidecars to be fully committed to disk before signaling that the associated beacon block can be imported. Column sidecar transactions are routed via `SidecarUpdateChannel`, which has an associated queue limited to 500 elements.
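The pattern at the heart of this design, asynchronous while the queue has capacity and blocking once it is full, is what the cascade below hinges on. The sketch that follows is a minimal illustration of that pattern under stated assumptions, not Teku's actual `SidecarUpdateChannel` implementation: the class name, method names, and single-writer loop are hypothetical, and only the 500-element bound is taken from this report.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;

/**
 * Minimal sketch of an "async until full" write channel. Producers normally
 * enqueue work and return immediately; once the bounded queue is full,
 * enqueueing blocks the caller until the writer thread drains an element.
 * A single stalled DB transaction therefore turns every producer into a
 * blocked thread.
 */
public final class BoundedWriteChannel {
  // Bound taken from the report: SidecarUpdateChannel queue limited to 500 elements.
  private final BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(500);

  BoundedWriteChannel() {
    Thread writer = new Thread(() -> {
      while (true) {
        try {
          queue.take().run(); // a stalled DB transaction stalls the drain here
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
    }, "db-writer");
    writer.setDaemon(true);
    writer.start();
  }

  /** Enqueue a column write; blocks the caller only when the queue is full. */
  CompletableFuture<Void> writeAsync(Runnable dbTransaction) {
    CompletableFuture<Void> done = new CompletableFuture<>();
    try {
      queue.put(() -> {            // put() blocks once 500 elements are pending
        dbTransaction.run();
        done.complete(null);
      });
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      done.completeExceptionally(e);
    }
    return done;
  }

  public static void main(String[] args) {
    BoundedWriteChannel channel = new BoundedWriteChannel();
    // Normal case: the caller gets a future back immediately and the write
    // completes on the writer thread.
    channel.writeAsync(() -> { /* fast DB transaction */ }).join();
    System.out.println("fast write completed asynchronously");
  }
}
```

The design holds up as long as the writer drains the queue faster than producers fill it; the cascade below describes what happened when a single stalled `LevelDB` transaction stopped it from doing so.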
### Failure Cascade

![image](https://hackmd.io/_uploads/Sk65u3dLWg.png)

When `LevelDB` transactions began stalling, the following cascade occurred:

**1. Queue Saturation**

For full custody nodes writing 128 columns per slot, the 500-element `SidecarUpdateChannel` queue fills in roughly 4 slots (500 / 128 ≈ 3.9 slots, about 48 seconds at 12 seconds per slot). This means affected nodes had transactions stalling for at least that duration.

**2. Async Writes Became Blocking**

Once the 500-element limit was reached, `SidecarUpdateChannel` became blocking. Every thread attempting to write a column was blocked until the slow transaction completed, defeating the purpose of Teku's asynchronous write design.

**3. Data Availability Stall (First Effect)**

A stalling thread could not signal "data is available," preventing the current block from being imported. This caused the empty slot log line printed 4 seconds into the slot. When the transaction finally completed, the block could be imported, hence the `data_availability_checked +39665ms` on block 13457114.

**4. P2P Thread Pool Exhaustion (Second Effect)**

While waiting on the transaction, nodes continued receiving columns from gossip subnets. Even though block import was skipped (the parent block was not yet imported), Teku optimistically writes column sidecars ahead of import. As the transaction continued to stall, more P2P threads piled up waiting on the queue. Eventually the thread pool was exhausted and the task queue filled (40k max), causing nodes to discard gossip messages (a minimal sketch of this executor saturation follows the cascade).

**5. Chain Following Failure (Third Effect)**

With its P2P threads exhausted, a node could no longer handle status message exchanges in time and began losing peers. Nodes fell into sync mode, attempting to catch up with the head. For long-range sync (>200 slots), this could trigger another `LevelDB` issue causing large off-heap memory allocations, resulting in OOM kills by the kernel.
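The gossip-discard warnings shown in the log evidence are the visible symptom of step 4: a bounded scheduler whose worker threads are all parked on the stalled write path and whose task queue has filled up. The snippet below reproduces that failure shape with a plain `java.util.concurrent.ThreadPoolExecutor`; the pool size and the task body are illustrative assumptions, only the 40k queue bound is taken from this report, and none of it reflects Teku's actual `p2p-async-scheduler` implementation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Shows how a bounded executor starts rejecting work once all of its threads
 * are blocked (here: simulating threads stuck behind a stalled DB write) and
 * its task queue has filled up.
 */
public final class ExecutorSaturationDemo {
  public static void main(String[] args) {
    // Illustrative sizes; only the 40k task queue limit comes from the report.
    ThreadPoolExecutor p2pScheduler =
        new ThreadPoolExecutor(4, 4, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(40_000),
            new ThreadPoolExecutor.AbortPolicy()); // rejects when the queue is full

    CountDownLatch stalledDbWrite = new CountDownLatch(1); // never released: simulates the stall

    int submitted = 0;
    try {
      // Gossip keeps arriving while every worker waits on the stalled write.
      while (true) {
        p2pScheduler.execute(() -> {
          try {
            stalledDbWrite.await(); // all 4 workers park here
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
        submitted++;
      }
    } catch (RejectedExecutionException e) {
      // Equivalent of "Discarding gossip message ... because the executor queue is full".
      System.out.println("Rejected after " + submitted + " tasks (4 running + 40000 queued).");
    } finally {
      p2pScheduler.shutdownNow();
    }
  }
}
```

Running it prints the rejection after 4 running plus 40,000 queued tasks, mirroring the `Unexpected rejected execution due to full task queue` error seen on affected nodes.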
### Metrics Observed on All Affected Nodes

- `LevelDB` in-flight transactions stuck at 3 for over a minute
- `SidecarUpdateChannel` queue maxed out at 500
- P2P executor queues maxed out at 40k

### Attempted Mitigation

We implemented a batching version of `SidecarUpdateChannel` to reduce transaction frequency and improve write efficiency. However, testing showed this did not improve `LevelDB` performance under load: the issue is fundamental to `LevelDB`'s handling of high data throughput rather than to the transaction batching strategy.

## Reproduction

We reproduced the issue by:

1. Artificially delaying a `LevelDB` column transaction by 1 minute, which triggered similar behavior within a few slots
2. Modifying our DB layer to write columns at 3x size (simulating the 2.5x mainnet spike), which recreated the exact user-reported behavior on a `LevelDB` node

To confirm that `RocksDB` performs better under the same conditions, we ran the 3x experiment starting from a checkpoint sync. This forced the node to handle not only 3x column size for incoming data but also full custody backfill with the same 3x multiplier applied, an even more demanding scenario than the original incident. `RocksDB` handled this load without issue.

## Timeline (UTC, Approximate)

| Time | Event |
|------|-------|
| 13:00 | Incident begins; affected nodes start logging empty slots |
| 13:30 | First user report on Discord |
| 14:30 | Internal war room created after multiple reports |
| 16:30 | Confirmed EF was intentionally pushing blob usage toward target levels |
| 18:30 | EF activity stopped; affected nodes recovered |

## Remediation

Affected users were advised to migrate to `RocksDB`:

1. Stop the Teku service (beacon node)
2. Add `--Xdata-storage-create-db-version=6` to your configuration file
3. Delete the `db` folder and the `db.version` file in your data directory
4. Start Teku again

The next release, `26.2.0`, will ship with `RocksDB` as the default. Starting from that version, the `--Xdata-storage-create-db-version=6` flag will no longer be needed.

## Lessons Learned

- **Fleet configuration drift**: We were not running supernodes on mainnet, and our internal nodes had migrated to `RocksDB` while most users were still on the default `LevelDB`. Internal nodes should mirror the default user configuration as closely as possible. Divergence should be avoided; if unavoidable, it must be short-lived.
- **`LevelDB` deprecation overdue**: We had planned for some time to migrate from the unmaintained `LevelDB` to `RocksDB` but kept delaying the work. This incident confirms the urgency.
- **Shadowfork gap**: Neither the mainnet shadowforks run by ethPandaOps nor testnets such as Hoodi surfaced this issue before Fusaka. An investigation into why is ongoing.
- **Network activity awareness**: We were unaware that the EF was intentionally pushing blob usage toward target levels. Communication with the EF should be improved so we are not caught unprepared by similar events.