# Lodestar Holesky Rescue Retrospective
## Hard fork transition
### What happened to Lodestar validators during the Holesky transition?
The Holesky hard fork was scheduled for epoch `115968` at 2025-02-24 21:55:12 UTC. During the transition, some Lodestar genesis validators (indices 695000-794999) and testing validators running `v1.27.0` failed to sync through the hard fork. Some subsequently began missing block proposals due to an isolated edge case stemming from a [withdrawal credentials bug](https://github.com/ChainSafe/lodestar/pull/7478), which caused a state mismatch that corrupted the database. It was fixed in `v1.27.1` (released approximately 5 hours before the fork). Although it was determined that both `v1.27.0` and `v1.27.1` would be compatible for Pectra, if your node was stuck post fork, the recommendation was to wipe the database, upgrade to `v1.27.1`, and resync.
We also had trouble reproducing and debugging this consensus problem (which also appeared on Gnosis devnets) that was fixed in `v1.27.1`. The difficulty came from collecting the pre-state SSZ object alongside a valid and an invalid post-state SSZ object for a then-unknown condition that occurred only very infrequently, and only on some nodes. By default, none of these objects are persisted, so we had to figure out how to collect them across our fleet to isolate the root cause.
:::info
The issue led us to introduce a pull request that automatically persists the relevant information for future incident analysis: https://github.com/ChainSafe/lodestar/pull/7482.
:::
We confirmed that some nodes running the `v1.27.1` hotfix successfully transitioned through the fork. This prompted us to force a checkpoint sync to the pre-Electra finalized state after upgrading our nodes to `v1.27.1`. Our goal was to get all of our genesis validators attesting to support finality. Combined, our genesis and testnet infrastructure validators contribute about 6% of eligible ETH, so it was critical that we acted quickly to help push finalization of the first Electra epoch. All of this immediate action took place before the community realized there was an issue on the execution clients.
### Execution Layer Issue
While upgrading all of our beacon nodes to `v1.27.1`, other client teams pointed out a large number of invalid blocks being seen. As noted in the [Ethereum Foundation's Root Cause Analysis](https://github.com/ethereum/pm/blob/master/Pectra/holesky-postmortem.md#root-cause-analysis), some execution clients did not configure the deposit contract address correctly, and the chain split with the majority of execution clients agreeing with the problematic block.
For Lodestar, all of our Holesky validators were set up with Geth as the execution client. As the chain forked, our validators agreed with the majority view of execution clients, treating the beacon block at slot `3711006` with root `0x2db899881ed8546476d0b92c6aa9110bea9a4cd0dbeb5519eb0ea69575f1f359` as valid, even though it contained the problematic execution payload. This also meant all of our validators had submitted attestations helping to justify the bad chain containing this beacon block. It was later determined that the minority fork, consisting of Erigon and Reth execution clients, had correctly rejected the problematic block.
:::info
Since the incident, ChainSafe has planned to run an equal distribution of all five execution clients to provide better redundancy and a faster turnaround if any execution client experiences issues.
:::
### Justification of the problematic block
It was noted at 22:44:00 UTC that the chain was very close to finalizing, with 65% of eligible ETH voting for the bad fork. This was without all of Lodestar's beacon nodes properly in sync due to the `v1.27.0` disruption in our fleet, which may have helped prevent finalizing the bad chain by keeping 5-6% of eligible ETH offline post fork while we resolved the situation.
Execution client teams confirmed their findings on what was causing the chain to fork. By that time we had already restarted and confirmed that our validators were on the bad fork of the chain, which had already been justified at epoch `115968` with 71% of eligible ETH. To prevent finalizing the bad fork, we decided to shut down all of our validators at 23:24:00 UTC. Shoutout to Michael Sproul (Lighthouse) for his foresight in taking down the Lighthouse validators early to prevent finalization.
Unfortunately, based on the information we had at the time, we had pushed to quickly deploy `v1.27.1`, which was meant to enable block proposals and attestations post hard fork. By the time we realized there was a bug and a chain fork, all of our genesis validators had already committed attestations that would become slashable surround votes if we tried to recover the valid minority chain.
:::info
We learned that during hard fork transitions, although it is important to finalize the chain promptly for post-fork stability, it is just as important to verify _what_ is being finalized. Invalid blocks on the network post fork should be promptly investigated, and we should _understand the problem before taking action._ Although rare, we now appreciate the repercussions of running only majority clients, which can pose major issues if the majority client(s) is/are incorrect.
:::
### Patching execution clients
In order to produce blocks on the correct, minority chain we needed to upgrade to the patched execution clients. ChainSafe started patching execution clients with the respective Holesky rescue branches at 2025-02-25 00:05:12 UTC. Nodes were restarted using the pre-Electra finalized checkpoint [served by EthPandaOps](https://checkpoint-sync.holesky.ethpandaops.io/). Because the problematic block was already in the database, nodes running the rescue images got stuck, so a database wipe was necessary.
Our internal plan was to patch, resync and stabilize our beacon nodes until we could determine how to best move forward. At this time, none of our validators were able to produce blocks and help maintain liveness for the minority fork. At the time we did not have any validators running Erigon or Reth execution clients in our infrastructure.
By the end of the night, the core dev community rallied around the priority of stabilizing block production on the "correct" minority chain. Once liveness was adequately restored, we could address attestations and potential slashing issues afterwards.
## Rescuing Holesky
### Syncing and Fork Confirmation Issues
Lodestar was consistently struggling to sync to the correct chain due to the justified epoch and the majority of non-upgraded peers continuing to build on the bad chain. We had three methods of determining if we were on the correct chain:
1. Verify the bad beacon block was not in the canonical chain via the `/eth/v2/beacon/blocks/3711917` API endpoint (see the sketch below)
2. Check that the state root matched those on the correct chain, which at the time was a small minority of beacon nodes running their consensus clients with Erigon or Reth
3. [Using a script](https://gist.github.com/samcm/e2da294dab77e93ad0ee0e815580294f) developed by Sam at EthPandaOps to check block roots against the beacon node for the correct fork
:::info
Using the beacon endpoint `/eth/v1/debug/fork_choice`, you can also see how the invalid block was processed in fork choice. It was imported optimistically, which caused Lodestar problems when pivoting to the minority chain.
:::
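To make these checks concrete, here is a minimal sketch (not part of the actual rescue tooling) that queries a local Lodestar REST API on its default port using the standard `/eth/v1/beacon/blocks/{block_id}/root` and `/eth/v1/debug/fork_choice` endpoints. The slot and root are the problematic values referenced earlier; adjust them for your own incident.

```typescript
// Minimal sketch (Node 18+, global fetch). BEACON_URL, BAD_ROOT and BAD_SLOT are
// placeholders taken from this incident.
const BEACON_URL = "http://localhost:9596";
const BAD_ROOT = "0x2db899881ed8546476d0b92c6aa9110bea9a4cd0dbeb5519eb0ea69575f1f359";
const BAD_SLOT = 3711006;

async function checkFork(): Promise<void> {
  // 1. Is the bad block part of our canonical chain?
  const rootRes = await fetch(`${BEACON_URL}/eth/v1/beacon/blocks/${BAD_SLOT}/root`);
  if (rootRes.ok) {
    const {data} = (await rootRes.json()) as {data: {root: string}};
    console.log(data.root === BAD_ROOT ? "On the bad fork" : "Different canonical block at that slot");
  } else {
    console.log("No canonical block at that slot on this node");
  }

  // 2. Is the bad block still present in fork choice (e.g. imported optimistically)?
  const fcRes = await fetch(`${BEACON_URL}/eth/v1/debug/fork_choice`);
  const fc = (await fcRes.json()) as {fork_choice_nodes: {block_root: string; validity: string}[]};
  const badNode = fc.fork_choice_nodes.find((n) => n.block_root === BAD_ROOT);
  console.log(badNode ? `Bad block in fork choice (validity=${badNode.validity})` : "Bad block not in fork choice");
}

checkFork().catch(console.error);
```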
There were multiple viable forks which added exponential work to all consensus clients. This environment created a resource-heavy load on every node on the network. Dapplion's Devcon SEA presentation on ["How long non-finality could kill Ethereum"](https://youtu.be/z2jafwPFLaQ?si=4W413UGUfEYGpEFX) accurately predicted many of the problems.
:::info
In a chaotic forking environment, it becomes very hard to know which fork is the correct chain, and finding healthy peers becomes increasingly difficult, made worse by unhealthy nodes continuing to build on the wrong chain. _Shutting down unhealthy or faulty nodes on the wrong chain_ may improve the ability for CLs to properly sync new, healthy beacon nodes on the correct chain.
:::
The most important thing we can do in this environment is to **continue to build blocks to restore liveness on the correct chain.** This means we need the ability to spin up healthy nodes that can be synced to the **same, correct canonical chain** so the connected validators can produce the desired blocks. One of the hardest problems was making sure we could point Lodestar at healthy peers so we could sync to the minority fork the community rallied around.
:::info
To prevent Lodestar from re-syncing to the incorrect chain, we developed a feature that checks for blacklisted blocks and disallows processing them; any subsequent blocks building on the invalid chain are also disallowed: https://github.com/ChainSafe/lodestar/pull/7498. Although there is an argument that this feature could be abused, the worst case is a self-inflicted minority fork for the node operator who misuses it.
In addition, [a follow up pull request](https://github.com/ChainSafe/lodestar/pull/7580) to this feature adds a new endpoint, `/eth/v1/lodestar/blacklisted_blocks`, which returns the root/slot of any blacklisted blocks. The CLI flag `--chain.blacklistedBlocks 0x12345abcdef` allows you to list 0x-prefixed block root hexes for blocks that should not be processed.
:::
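For operators using this feature, below is a minimal sketch of inspecting the blacklist over the new endpoint; the response shape (an array of root/slot entries under `data`) is an assumption based on the description above.

```typescript
// Sketch: list the blocks a Lodestar node is currently refusing to process.
// The exact response shape of the endpoint may differ from this assumption.
const BEACON_URL = "http://localhost:9596";

async function listBlacklistedBlocks(): Promise<void> {
  const res = await fetch(`${BEACON_URL}/eth/v1/lodestar/blacklisted_blocks`);
  const {data} = (await res.json()) as {data: {root: string; slot: string}[]};
  for (const {root, slot} of data) {
    console.log(`blacklisted: slot=${slot} root=${root}`);
  }
}

listBlacklistedBlocks().catch(console.error);
```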
To find healthy peers in a chaotic environment, communication is essential to share trusted peer identifiers of healthy nodes, while ensuring not to overload them. Healthy nodes bombarded with requests may run into performance issues, taking down vital peers in an already fragile environment. Sharing healthy `enr:` addresses for consensus clients and deliberately using them as bootnodes was the early strategy to sync up more healthy beacon nodes. Documents [like this](https://hackmd.io/@_iAz6KERTsWIHHNF-wMxAA/r1XlYyickx) were shared by the community as a trusted Schelling point for coordination. In our `holesky-rescue` branch, we added some of these Holesky bootnodes by default to the image/branch build. This also led us to suggest that consensus clients implement an endpoint for adding a "trusted" peer which can bypass some peer scoring heuristics: https://github.com/ChainSafe/lodestar/issues/7559
:::info
In Lodestar, you can use the `--bootnodes` flag to add a list of ENR bootnodes and set `--network.connectToDiscv5Bootnodes true` to attempt direct libp2p peer connection to these nodes.
Note that if you want to use Lodestar as a static peer, you must enable `--persistNetworkIdentity` to ensure your ENR (your network identity) remains the same throughout restarts. Otherwise, it will change.
We currently do not support the trusted peer feature.
:::
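As a small example of sharing peer identifiers, the sketch below queries a healthy node's standard `/eth/v1/node/identity` endpoint to retrieve the ENR and libp2p addresses that could be handed to other operators as `--bootnodes` values; the node URL is a placeholder.

```typescript
// Sketch: pull the identity of a healthy node so its ENR / multiaddrs can be shared
// as a bootnode or static peer. HEALTHY_NODE_URL is a placeholder for a node you control.
const HEALTHY_NODE_URL = "http://healthy-node.internal:9596";

async function shareIdentity(): Promise<void> {
  const res = await fetch(`${HEALTHY_NODE_URL}/eth/v1/node/identity`);
  const {data} = (await res.json()) as {
    data: {peer_id: string; enr: string; p2p_addresses: string[]};
  };
  console.log(`ENR to share (e.g. as a --bootnodes value): ${data.enr}`);
  console.log(`libp2p addresses: ${data.p2p_addresses.join(", ")}`);
}

shareIdentity().catch(console.error);
```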
Lodestar currently struggles at times with syncing in a network where we may only have one or two useful peers. As noted in https://github.com/ChainSafe/lodestar/issues/7558, we may inadvertently end up banned by peers for not adhering to their rate limits due to overwhelming them with data requests. *The overwhelming data requests still need to be investigated.*
### Resource usage in a non-finality environment
One of the primary weaknesses of consensus clients in a non-finality environment is resource usage. Specifically, out-of-memory (OOM) crashes take down healthy nodes, which makes it harder for others to sync up to the head of the chain. In addition, if a node needs to resync from a previously finalized checkpoint, it becomes a much harder hill to climb.
:::info
Handling memory usage in this environment is essential and timing is critical. Fixing memory handling issues and figuring out how to boot from unfinalized checkpoint states will help us sync healthy nodes faster and keep them stable under duress.
:::
Lodestar has a fixed heap memory limit set by the Node.js environment. Our images and builds set this at 8GB (8192 MB), which is sufficient for a stable network. When we cannot prune unfinalized states to disk, the buildup of memory can cause out-of-memory (OOM) errors which crash the node. In our `holesky-rescue` branch, we increased the default heap limit in the Node.js environment to `--max-old-space-size=32768` to allow for 32GB of heap memory.
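As an illustration of why the heap limit matters, the following sketch (purely operator-side monitoring, not Lodestar code) logs heap usage against whatever limit the Node.js process was started with, e.g. `--max-old-space-size=32768`.

```typescript
// Sketch: periodically log V8 heap usage against the configured limit so an operator
// can see OOM pressure building during long non-finality.
import v8 from "node:v8";

setInterval(() => {
  const {heap_size_limit, used_heap_size} = v8.getHeapStatistics();
  const usedGb = used_heap_size / 1024 ** 3;
  const limitGb = heap_size_limit / 1024 ** 3;
  const pct = ((used_heap_size / heap_size_limit) * 100).toFixed(1);
  console.log(`heap: ${usedGb.toFixed(2)}GB / ${limitGb.toFixed(2)}GB (${pct}%)`);
}, 60_000);
```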
While other teams were trying to sync Lodestar to the minority fork head, they discovered that Lodestar crashed as it got close to the head because we could not properly prune checkpoint states. We covered this issue in https://github.com/ChainSafe/lodestar/issues/7495.
The following fixes were implemented to correctly handle pruning checkpoint states:
https://github.com/ChainSafe/lodestar/pull/7497
https://github.com/ChainSafe/lodestar/pull/7505

Before the fixes were implemented, we would constantly OOM as memory grew indefinitely until it exceeded the max limit. On the machine in question, the max was set at 60GB of memory. After these fixes, Lodestar generally held memory fairly constant at around 10-12GB for most of the non-finality period.
### Reduce storage by pruning checkpoint states
Besides memory, another bottleneck during non-finality is disk storage requirements. For the Holesky network, checkpoint states can reach as much as ~250MB per state, which consumes about 56GB per day and is another reason why beacon nodes fail to remain in sync during long non-finality periods. With https://github.com/ChainSafe/lodestar/pull/7510, we are able to set `DEFAULT_MAX_CP_STATE_ON_DISK` to persist a maximum of `n` epochs on disk to help with storage bloat.
:::info
This type of behaviour is generally _not safe_. If you set `DEFAULT_MAX_CP_STATE_ON_DISK` to `100` (as we did by default for the `holesky-rescue` branch), the node may not be able to process blocks from other potential forks. This behaviour is disabled by default and may be added as an opt-in feature in the future.
Storage improvements are currently being developed, such as using [binary diff states](https://github.com/ChainSafe/lodestar/issues/7535) and/or [ERA files](https://github.com/ChainSafe/lodestar/issues/7048). These should better address storage requirements without resorting to unsafe features during chain emergencies.
:::
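To make the idea concrete, here is a tiny illustrative sketch of the "keep at most N checkpoint states on disk" behaviour described above; it is not Lodestar's actual implementation, only the selection logic.

```typescript
// Illustrative sketch of pruning checkpoint states down to a maximum count on disk.
// States are assumed to be keyed by epoch; the caller would delete the returned epochs.
function pruneCheckpointStates(persistedEpochs: number[], maxOnDisk: number): number[] {
  // Keep only the newest `maxOnDisk` epochs; return the epochs to delete from disk.
  const sorted = [...persistedEpochs].sort((a, b) => a - b);
  return sorted.slice(0, Math.max(0, sorted.length - maxOnDisk));
}

// Example: with a max of 2, the oldest epoch is selected for pruning.
console.log(pruneCheckpointStates([115960, 115961, 115962], 2)); // => [115960]
```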
### Using checkpointState to jumpstart sync
As the non-finality period persisted, it became increasingly difficult over time for new nodes to sync to the head of the correct fork. In Lodestar, you can set an arbitrary checkpoint state to initiate sync via `--checkpointState /path/to/file/or/url`. We were able to get an SSZ state from Serenia (a trusted node operator) that was saved from a synced beacon node on the correct chain. This was a huge timesaver for jumpstarting other beacon nodes. By loading this checkpoint, we could minimize the time required to catch up to the head of the chain and avoid syncing from the last finalized checkpoint.
### Using local checkpoint states
During the Holesky rescue, one of the features that proved useful in a non-finality environment was the ability to jumpstart nodes from an unfinalized checkpoint. In addition, you need a way to persist the checkpoint state locally for future use and the ability to share that state to help other nodes sync.
With https://github.com/ChainSafe/lodestar/pull/7509, Lodestar is able to use a local or remote unfinalized checkpoint state to reduce the syncing time required when a node is restarted. This allowed new nodes to quickly sync to head, especially as the non-finality period dragged on. In combination with https://github.com/ChainSafe/lodestar/pull/7541, we implemented an API endpoint (`eth/v1/lodestar/persisted_checkpoint_state`) to return a specific, or the latest safe, checkpoint state. This feature will help bring healthy nodes up faster in a turbulent network. Node operators in this environment generally refrain from restarting synced beacon nodes because of the difficulty of resyncing. Some important settings require an application restart to take effect, so this feature should make node operators more comfortable restarting nodes in future non-finality environments.
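As a rough sketch of how this could be combined with `--checkpointState`, the example below downloads a state from the new endpoint on a healthy node and writes it to disk. The node URL is a placeholder, and the assumption that the endpoint can return raw SSZ bytes via `Accept: application/octet-stream` is based on how other beacon state endpoints behave; verify against the actual implementation.

```typescript
// Sketch: save an unfinalized checkpoint state from a healthy Lodestar node so another
// node can be started with --checkpointState /path/to/unfinalized-state.ssz.
import {writeFile} from "node:fs/promises";

const HEALTHY_NODE_URL = "http://healthy-node.internal:9596";

async function downloadCheckpointState(outPath: string): Promise<void> {
  const res = await fetch(`${HEALTHY_NODE_URL}/eth/v1/lodestar/persisted_checkpoint_state`, {
    headers: {accept: "application/octet-stream"}, // request raw SSZ bytes (assumed to be supported)
  });
  if (!res.ok) throw new Error(`failed to fetch state: ${res.status}`);
  await writeFile(outPath, Buffer.from(await res.arrayBuffer()));
  console.log(`saved checkpoint state to ${outPath}`);
}

downloadCheckpointState("./unfinalized-state.ssz").catch(console.error);
```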
### Slot import tolerance adjustments
When trying to sync the correct chain, the hardest parts were the early periods of the rescue when no blocks were being produced and there were hundreds of consecutive skipped slots. We increased `slotImportTolerance` from its default of `32` because there were periods of more than one epoch in which we would not see a single block. This check is important because it prevents Lodestar from building on a head that is older than 32 slots, which could cause massive reorgs. However, Holesky was suffering from such serious liveness issues that we would not see a block for more than 32 slots, at which point the node would fall back into syncing mode, preventing it from producing blocks. Setting this higher by default for the early periods of the rescue helped ensure we could contribute to block production, even though new blocks were rarely seen.
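Conceptually, the tolerance works roughly like the sketch below (illustrative logic only, not Lodestar's actual code): the node only considers itself synced, and therefore willing to propose, if its head is within the tolerance of the current clock slot.

```typescript
// Illustrative sketch of the slotImportTolerance idea.
function isNodeSynced(clockSlot: number, headSlot: number, slotImportTolerance: number): boolean {
  return clockSlot - headSlot <= slotImportTolerance;
}

// With the default tolerance of 32, a gap of 40 empty slots drops the node back into
// syncing mode even though it is on the correct head; a larger tolerance keeps it proposing.
console.log(isNodeSynced(3712040, 3712000, 32));  // false => falls back to syncing
console.log(isNodeSynced(3712040, 3712000, 128)); // true  => keeps proposing
```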
### Sync stalls and peer churning
Due to the turbulent network, continuous rate limiting of "bad peers" made it very difficult to find good peers that would serve us the data we needed. Lodestar would at times hit its max target peers, yet none of them were viable for advancing our node. Although we experimented with churning a random subset of connected peers when we were stuck, we ultimately implemented a heuristic to start pruning peers when the node was starved. In this state, it will attempt to prune additional peers while prioritizing keeping peers who are reportedly far ahead of our node: https://github.com/ChainSafe/lodestar/pull/7508
:::info
This PR also enhanced our `/eth/v1/lodestar/peers` endpoint to return more details about the status of our connected peers. This includes data such as their `head_slot` which can help analyze the health of the network by identifying how many of our connected peers are actually synced to the head.
Example:
```json
{
  "peer_id": "17Uiu2HAmPcVbHnDgeFBTKcFtpaLfLfUJaEasdbpsBeT7nbxmmDeh",
  "enr": "",
  "last_seen_p2p_address": "/ip4/123.456.789.000/tcp/4096/p2p/17Uiu2HAmPcVbHnDgeFBTKcFtpaLfLfUJaEasdbpsBeT7nbxmmDeh",
  "direction": "inbound",
  "state": "connected",
  "agent_version": "Prysm/v5.0.3/2c6e028600d4ad5659f0d25d8911714aa54f9c25",
  "status": {
    "fork_digest": "0x019e21ad",
    "finalized_root": "0xd0285a8ec914f53cf15e9cce336a07c4f11c666e2addec620592d5ff3640ed34",
    "finalized_epoch": "115967",
    "head_root": "0x669de4cb66b8444684e59ada8f3cd4729ddb9d4e3d2d7b1d361b59fe0ba96aa0",
    "head_slot": "3775735"
  },
  "metadata": {
    "seq_number": "2",
    "attnets": "0x0000000000000600",
    "syncnets": "0x08"
  },
  "agent_client": "Prysm",
  "last_received_msg_unix_ts_ms": 1741211227122,
  "last_status_unix_ts_ms": 1741211227062,
  "connected_unix_ts_ms": 1741211226993
}
```
:::
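Building on that endpoint, here is a small sketch of how one might count connected peers that are actually near our head; it assumes the response shape shown in the example above plus the standard `/eth/v1/beacon/headers/head` endpoint.

```typescript
// Sketch: estimate how many connected peers are within a few slots of our own head.
const BEACON_URL = "http://localhost:9596";

async function countPeersNearHead(maxLagSlots = 32): Promise<void> {
  // Our own head slot from the standard headers endpoint.
  const headRes = await fetch(`${BEACON_URL}/eth/v1/beacon/headers/head`);
  const headJson = (await headRes.json()) as {data: {header: {message: {slot: string}}}};
  const ourHead = Number(headJson.data.header.message.slot);

  // Connected peers and their reported head_slot from the Lodestar peers endpoint.
  const peersRes = await fetch(`${BEACON_URL}/eth/v1/lodestar/peers`);
  const {data: peers} = (await peersRes.json()) as {
    data: {peer_id: string; state: string; status?: {head_slot: string}}[];
  };

  const connected = peers.filter((p) => p.state === "connected");
  const nearHead = connected.filter((p) => p.status && ourHead - Number(p.status.head_slot) <= maxLagSlots);
  console.log(`${nearHead.length}/${connected.length} connected peers within ${maxLagSlots} slots of our head`);
}

countPeersNearHead().catch(console.error);
```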
### Pectra attestation bug
As Lodestar continued to build a quickly syncable Holesky fork to help rescue the network, we [discovered an attestation bug](https://github.com/ChainSafe/lodestar/issues/7546) specific to Electra which led to no aggregate inclusions and to us producing very poor aggregated attestations ourselves. This made our validator performance very poor while trying to rescue the Holesky chain, and we were not as effective at helping to finalize it. The [fix](https://github.com/ChainSafe/lodestar/pull/7543), which uses the correct subnet for validating gossip attestations, was merged to the `holesky-rescue` branch shortly after Holesky finalized.

This devastating validator performance bug should have been caught during our Pectra testing. Sadly, it was not seen due to our lack of internal testing visibility on devnets.
:::info
ChainSafe now runs devnet nodes with the same standardized Lodestar metrics used by our testing fleet. Developer ergonomics matter, and having familiar tooling, profiling, and dashboards is key to proper node monitoring. This allows us to use our internal observability metrics and dashboards to visually inspect performance on EthPandaOps devnets and catch issues before they reach public testnets.
:::
### Other Considerations for Future Implementation
Lighthouse included a `disable-attesting` feature which is used to avoid flooding the beacon node while syncing and removes an additional overhead of dealing with attestation requests: https://github.com/sigp/lighthouse/pull/7046
A leader implementation for disaster recovery should be brainstormed. What would it look like if the community needed to socially agree to nudge mainnet down a mutually agreed upon path? Perhaps consensus clients should have the ability to ingest an anchor state, on any fork, in case of emergency.
As an alternative to blacklisting bad block roots, we are attempting to make a "pessimistic sync" work via https://github.com/ChainSafe/lodestar/pull/7511.
## Summary
Overall, Holesky was a great exercise and produced valuable fixes for all client teams. We believe that **non-finality devnets should be prioritized as part of hard fork readiness testing** so we have more opportunities to test the unhappy cases of network turbulence. This experience forced us to ask very important questions, even though this may not be the same emergency response plan we would consider for mainnet. There are many uncontrollable aspects of mainnet where a rescue reminiscent of Holesky would not be feasible. However, some of the features and fixes developed from this incident may be useful if we ever experience chain splits or finalize bad blocks in the future.