# Post Mortem: Teku Mainnet SSZ Serialization Failure (2026-03-05)

## Summary

On March 5, 2026, a bug in Teku prevented nodes from saving their internal state to disk. The bug was triggered when the total number of validators in Ethereum's registry crossed a specific threshold (~2.22 million), causing an internal calculation to overflow. A fix was released in Teku v26.3.0 within hours.

**What happened:** Teku could not write its database due to a number overflow in the serialization code. Logs showed two phases of errors. First, a burst of rapid retries for approximately 1 minute:

```
ERROR - Storage update failed, retrying.
java.util.concurrent.CompletionException: java.lang.NegativeArraySizeException: -215782466
```

After the retry timeout, ongoing errors appeared once per slot (~every 12 seconds) for the duration of the incident:

```
ERROR - StoreTransaction: UNKNOWN ERROR
```

**What was NOT affected:**

- Nodes did **not** crash -- they kept running throughout the incident.
- Attestations were produced correctly and included on-chain -- **as far as we know, no validator rewards were lost**.
- **As far as we know**, block proposals, consensus participation, and network connectivity were unaffected.
- The Ethereum network itself was not impacted.

**What WAS affected:**

- Teku could not persist state to its local database while the bug was active.
- Grafana dashboards showed "attestation included" dropping to near-zero. This was a **misleading metric** -- the monitoring system could not read from the broken local database, but the actual on-chain performance was normal.
- After upgrading to v26.3.0, nodes needed to forward-sync from the last persisted state to catch up to the chain head. During this sync period, validator duties were paused until the node reached the head.

## Impact

The bug affected the **serialization path only** -- writing beacon state to the local database. Critically, **nodes did not crash**.
The in-memory beacon state, fork choice, and consensus processing remained correct throughout the incident. Nodes continued performing validator duties (attesting, proposing, processing blocks) normally. Justified and finalized checkpoints kept advancing without interruption.

The consequence was that **no new state was persisted to disk** while the issue was active. The `RetryingStorageUpdateChannel` retried the failed storage write for approximately 1 minute before giving up on each update. Subsequent per-slot `StoreTransaction` commits also failed with the same serialization error, but the node remained operational.

**As far as we know, there was no actual attestation reward loss.** Attestations were produced correctly and included on-chain by block proposers. However, the "attestation included" metric on Grafana dashboards dropped to near-zero, which initially caused alarm. This was a **monitoring artifact**: Teku's internal attestation performance tracker computes inclusion by querying the local store for blocks. Since the store wasn't being updated due to the serialization failure, the tracker couldn't find the blocks containing our attestations and incorrectly reported them as not included. Actual attestation performance was unaffected.

After upgrading to v26.3.0, nodes needed to forward-sync from the last successfully persisted state to catch up to the chain head. For nodes where this gap was large or sync was slow, a fresh checkpoint sync was recommended as a faster alternative.

## Timeline

All times are CET (UTC+1).
| Time | Event |
|---|---|
| 16:16 | First `NegativeArraySizeException: -215782466` errors appear in Teku logs on mainnet |
| 16:17 | `RetryingStorageUpdateChannel` gives up after 1 minute of retries; `StoreTransaction` errors continue every slot |
| 17:33 | Issue noticed internally by the Teku team; investigation begins |
| 18:00 | First public announcement on Teku Discord |
| 18:52 | Root cause identified; second Discord announcement with critical guidance |
| ~19:00 | Fix merged and release process started |
| 21:14 | Teku v26.3.0 released and announced on Discord with upgrade and recovery instructions |
| 21:22 | Internal nodes upgraded to v26.3.0; forward sync begins |
| ~21:59 | Attestation performance metric returns to 100% on Grafana (~37 minutes after upgrade) |

## Communication

### 18:00 CET -- Initial Acknowledgement (Discord)

> **Mainnet Issue Investigation**
>
> The Teku team is aware of an issue impacting some nodes on mainnet and is actively investigating.
>
> No action is needed from you at this time. We will provide an update here as soon as we have more information. Thank you for your patience.

### 18:52 CET -- Root Cause Identified, Critical Guidance (Discord)

> **Update on Mainnet issue: Fix in Progress and CRITICAL User Guidance**
>
> We have identified the root cause of the mainnet issue. A fix is being prepared, and we will release a new version shortly.
>
> While you wait, it is critical that you follow this guidance:
>
> Just leave Teku running. It is okay if it appears to be struggling or metrics look off.
> Most importantly, **DO NOT delete your database or attempt to checkpoint sync.**
> Taking either of these actions will not fix the issue and will cause significant, unnecessary downtime for your node. The fastest path to recovery is to wait for the new release.
>
> We will post here as soon as it's available with upgrade instructions. Thank you for your cooperation.
The guidance to avoid deleting the database was critical: since the bug was in serialization, re-downloading and re-serializing the state (via checkpoint sync) would have hit the exact same error, causing unnecessary downtime.

### 21:14 CET -- Release Announcement (Discord)

> Teku v26.3.0 is now available.
>
> This is a mandatory update containing the fix for the recent mainnet beacon state serialization issue, as well as performance improvements and library updates.
>
> After you upgrade, your node will need to sync to the current head of the chain. While you can let it sync on its own, if you notice that this process is slow, we recommend you upgrade and then perform a checkpoint sync:
>
> 1. Stop your Teku node.
> 2. Upgrade your Teku image or binary to v26.3.0.
> 3. Delete your beacon node database (usually the `db` folder inside your Teku data directory).
> 4. Make sure your configuration includes a checkpoint sync URL (`--initial-state` option).
> 5. Restart Teku.
>
> A quick note: We know we previously advised against deleting your database. That was to protect your node while the issue was active. With this new patched version, this procedure is now safe.

## Root Cause

The bug was in the SSZ (Simple Serialize) library's size calculation for lists of fixed-size elements. In `AbstractSszListSchema.getSszVariablePartSize()`, the serialized size of a list was computed in **bits** first, then converted to bytes:

```java
return bitsCeilToBytes(length * getSszElementBitSize());
//                     ^^^^^^   ^^^^^^^^^^^^^^^^^^^^^^
//                      int   *    int  →  int overflow!
```

For the `validators` list, each `Validator` struct is 121 bytes = **968 bits**. The multiplication `length * 968` overflows a 32-bit signed integer when:

```
length > Integer.MAX_VALUE / 968 = 2,147,483,647 / 968 = 2,218,474
```

The Ethereum beacon state's validator registry is **append-only** -- validators that exit are marked as withdrawn but remain in the list.
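The overflow condition can be checked directly in plain Java. This is a standalone sketch using the 968-bit element size from above, not code from Teku:

```java
public class OverflowThresholdDemo {
    public static void main(String[] args) {
        final int elementBits = 968; // 121-byte Validator struct = 968 bits

        // Largest list length whose bit count still fits in a signed 32-bit int:
        int maxSafeLength = Integer.MAX_VALUE / elementBits;
        System.out.println(maxSafeLength); // 2218474

        // At that length the int product is still positive...
        System.out.println(maxSafeLength * elementBits);       // 2147482832

        // ...but one more validator wraps the product negative:
        System.out.println((maxSafeLength + 1) * elementBits); // -2147483496
    }
}
```

Passing a negative value like this to `new byte[size]` is exactly what raises `NegativeArraySizeException`.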
Following the Pectra upgrade (May 2025), which raised the maximum effective balance from 32 ETH to 2048 ETH, large-scale validator consolidation occurred on mainnet. This created approximately 1.26 million exited validators (`withdrawal_done` status) alongside ~953,000 active validators, pushing the total registry size to **2,219,655** -- just 1,181 entries past the maximum safe size of 2,218,474. The overflowed negative size was passed to `new byte[size]`, causing the `NegativeArraySizeException`.

## Why This Wasn't Caught Earlier

The overflow threshold of 2,218,475 validators was specific to mainnet conditions:

| Network | Validator Registry Size | Overflow Threshold | At Risk? |
|---|---|---|---|
| Mainnet | **2,219,655** | 2,218,475 | **Yes** |
| Hoodi testnet | 1,282,359 | 2,218,475 | No |

The Hoodi testnet, the primary testnet for staking and infrastructure testing, has a validator registry about 42% smaller than what is needed to trigger the overflow.

The bit-level size calculation was a pre-existing pattern in the SSZ library that had worked correctly for over 5 years -- it only became a problem when the mainnet validator registry grew large enough, due to post-Pectra consolidation inflating the append-only list with exited entries.

Additionally, the overflow specifically affected **serialization** (writing state to storage), not deserialization or state transition, meaning it would not surface in consensus spec reference tests, which focus on state validity rather than serialization mechanics.

## Fix

The fix promotes the intermediate multiplication to `long` before the bit-to-byte conversion:

```java
// Before (overflows int with large registries):
return bitsCeilToBytes(length * getSszElementBitSize());

// After (uses long arithmetic for the intermediate calculation):
return Math.toIntExact(bitsCeilToBytes((long) length * getSszElementBitSize()));
```

A `long` overload of `bitsCeilToBytes()` already existed in the codebase but was not being used at this call site.
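The widened arithmetic can be sanity-checked with the mainnet registry size. Here `bitsCeilToBytes` is a local stand-in assumed to round a bit count up to whole bytes, not the actual Teku helper:

```java
public class WidenedSizeDemo {
    // Assumed behavior of the long overload: round a bit count up to whole bytes.
    static long bitsCeilToBytes(long bits) {
        return (bits + 7) / 8;
    }

    public static void main(String[] args) {
        final int elementBits = 968;  // 121-byte Validator struct
        final int length = 2_219_655; // mainnet registry size on 2026-03-05

        // Old path: int * int wraps negative before the conversion ever runs.
        System.out.println(length * elementBits); // -2146341256

        // Fixed path: widen to long first, then verify the byte count fits an int.
        int size = Math.toIntExact(bitsCeilToBytes((long) length * elementBits));
        System.out.println(size);                 // 268578255
    }
}
```

`Math.toIntExact` throws `ArithmeticException` if the final byte count ever exceeds `Integer.MAX_VALUE`, turning any future overflow into an explicit failure rather than a silent wrap.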
The final byte-count result (~266 MB for 2.2M validators) fits comfortably in an `int`, so only the intermediate bit-count multiplication needed widening.

## Resolution

Users were instructed to upgrade to Teku v26.3.0, released as a mandatory update on 2026-03-05. After upgrading, nodes would forward-sync from their last persisted state. For nodes where forward sync was too slow, a fresh checkpoint sync was recommended.

## Lessons Learned

1. **Append-only data structures require long-term capacity planning.** The validator registry never shrinks, and Pectra consolidation roughly doubled its size by adding exited entries alongside new consolidated validators. Arithmetic overflow thresholds for such structures should be evaluated proactively against mainnet growth projections.

2. **Testnet coverage gaps for mainnet-scale data.** Testnets with significantly smaller state sizes cannot catch overflow bugs tied to mainnet-scale data. Targeted unit tests with boundary values based on mainnet projections should be added for critical serialization paths.

3. **Bit-level arithmetic is overflow-prone.** Computing sizes in bits (8x the byte value) significantly reduces the overflow headroom. Intermediate calculations should use `long` when input ranges can approach `Integer.MAX_VALUE / 8`.

4. **Monitoring artifacts can mislead during incidents.** The "attestation included" metric on Grafana dropped to near-zero during the incident, causing alarm about validator performance losses. In reality, attestations were correct and included on-chain -- the metric was wrong because the performance tracker depends on the local store, which was not being updated. Future work should consider tracking attestation inclusion independently of local storage state, or clearly flagging when the store is unhealthy so that derived metrics are not trusted.

5. **Release urgency categories need refinement.** Our existing release categories -- Optional, Recommended, and Mandatory -- did not adequately convey the urgency of this situation. While "Mandatory" indicates a required upgrade, it does not communicate time sensitivity. This release needed to be installed immediately, as every additional slot without the fix meant more unpersisted state and longer recovery sync times. A new **Critical** category should be introduced for releases that require immediate action to prevent ongoing or escalating impact.
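The boundary-value testing called for in lesson 2 might look like the following sketch. The size function is a simplified stand-in mirroring the widened arithmetic, not Teku's test suite:

```java
public class BoundaryValueSketch {
    static final int VALIDATOR_BITS = 968; // 121-byte Validator struct

    // Simplified size calculation with the widened intermediate, as in the fix.
    static int listSizeBytes(int length) {
        return Math.toIntExact(((long) length * VALIDATOR_BITS + 7) / 8);
    }

    public static void main(String[] args) {
        // Last length that was safe under pure int math:
        int threshold = Integer.MAX_VALUE / VALIDATOR_BITS; // 2,218,474

        // Exercise the exact boundary and registry sizes well beyond current mainnet:
        for (int length : new int[] {threshold, threshold + 1, 3_000_000, 10_000_000}) {
            int size = listSizeBytes(length);
            if (size <= 0) {
                throw new AssertionError("non-positive size at length " + length);
            }
        }
        System.out.println("boundary checks passed");
    }
}
```

Pinning the test lengths to mainnet growth projections, rather than current registry sizes, is what would have caught this bug before the threshold was crossed in production.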