Lodestar Holesky Rescue Retrospective

Hard fork transition

What happened to Lodestar validators during the Holesky transition?

The Holesky hard fork was scheduled for epoch 115968 at 2025-02-24 21:55:12 UTC. During the transition, some Lodestar genesis validators (indices 695000-794999) and testing validators running v1.27.0 failed to sync through the hard fork. Subsequently, some started to miss block proposals due to an isolated edge case: a withdrawal credentials bug caused a state mismatch which corrupted the database. It was fixed in v1.27.1, released approximately 5 hours before the fork. Although it was determined that both v1.27.0 and v1.27.1 would be compatible with Pectra, if your node was stuck post-fork, the recommendation was to wipe the database and upgrade to v1.27.1 to properly resync.

We also had trouble reproducing and debugging this consensus problem (which also appeared on Gnosis devnets) that was fixed in v1.27.1. The debugging difficulty came from collecting the pre-state SSZ object alongside valid and invalid post-state SSZ objects for a then-unknown condition that occurred only infrequently, and only on some nodes. By default, none of these objects are persisted, so we had to figure out how to collect them across our fleet to isolate the root cause.

The issue led us to introduce a pull request that automatically persists the relevant information for future incident analysis: https://github.com/ChainSafe/lodestar/pull/7482.

We confirmed that some nodes running the v1.27.1 hotfix successfully transitioned through the fork. This prompted us to force a checkpoint sync to the pre-Electra finalized state after upgrading our nodes to v1.27.1. Our goal was to get all of our genesis validators attesting to support finality. Combined, our genesis and testnet infrastructure validators contribute about 6% of eligible ETH, so it was critical to do this quickly to help push finalization of the first Electra epoch. All of this immediate action took place before the community realized there was an issue with the execution clients.

Execution Layer Issue

While we were upgrading all of our beacon nodes to v1.27.1, other client teams pointed out a large number of invalid blocks being seen. As noted in the Ethereum Foundation's Root Cause Analysis, execution clients did not configure the deposit contract address correctly and the chain split, with the majority of execution clients agreeing with the problematic block.

For Lodestar, all of our Holesky validators were set up with Geth as the execution client. As the chain forked, our validators agreed with the majority view of execution clients, treating the beacon block at slot 3711006 with root 0x2db899881ed8546476d0b92c6aa9110bea9a4cd0dbeb5519eb0ea69575f1f359 as valid even though it contained the problematic execution payload. This also meant all of our validators had submitted attestations to justify the bad chain containing this beacon block. It was determined later that the minority fork, consisting of Erigon and Reth execution clients, had correctly rejected the problematic block.

Since the incident, ChainSafe plans to run an equal distribution of all five execution clients to have better redundancy and a faster turnaround if any execution client experiences issues.

Justification of the problematic block

It was noted at 22:44:00 UTC that the chain was very close to finalizing, with 65% of eligible ETH voting for the bad fork. At that point, not all of Lodestar's beacon nodes were properly in sync due to the v1.27.0 disruption in our fleet. This may have helped prevent finalizing the bad chain, as it kept 5-6% of eligible ETH offline post-fork while we resolved the situation.

Execution client teams confirmed their findings on what was causing the chain to fork. By that time we had already restarted and confirmed that our validators were on the bad fork of the chain. The bad fork had already been justified at epoch 115968 with 71% of eligible ETH. To prevent finalizing it, we decided to shut down all of our validators at 23:24:00 UTC. Shoutout to Michael Sproul (Lighthouse) for his foresight in taking down the Lighthouse validators early to prevent finalization.

Unfortunately, we pushed ahead and quickly deployed v1.27.1 based on the information we had at the time. This was meant to enable block proposals and attestations post hard fork. By the time we realized there was a bug and a chain fork, all of our genesis validators had already committed to what would become slashable surround votes if we were to try and recover the valid minority chain.

We learned that during hard fork transitions, although it is important to finalize the chain promptly for post-fork stability, it is equally important to verify what is being finalized. Invalid blocks on the network post-fork should be promptly investigated and understood before taking action. Although such failures are rare, we understand the repercussions of running only majority clients, which can pose major issues if the majority client(s) are incorrect.

Patching execution clients

In order to produce blocks on the correct, minority chain we needed to upgrade to the patched execution clients. ChainSafe started patching execution clients with the respective Holesky rescue branches at 2025-02-25 00:05:12 UTC. Nodes were restarted using the pre-Electra finalized checkpoint served by EthPandaOps. Because the problematic block was already in the database, the rescue images got stuck, so a database wipe was necessary.

Our internal plan was to patch, resync and stabilize our beacon nodes until we could determine how to best move forward. At this time, none of our validators were able to produce blocks and help maintain liveness for the minority fork. At the time we did not have any validators running Erigon or Reth execution clients in our infrastructure.

By the end of the night, the core dev community rallied around the priority of getting block production stabilized on the "correct" minority chain. Once liveness was adequately restored, we could address attestations and potential slashing issues afterwards.

Rescuing Holesky

Syncing and Fork Confirmation Issues

Lodestar was consistently struggling to sync to the correct chain due to the justified epoch and the majority of non-upgraded peers continuing to build on the bad chain. We had three methods of determining whether we were on the correct chain:

  1. Verify the bad beacon block was not in the canonical chain via the /eth/v2/beacon/blocks/3711917 API endpoint
  2. Check that the state root matched those of nodes on the correct chain, which was a small minority of beacon nodes running their consensus clients with Erigon or Reth
  3. Use a script developed by Sam at EthPandaOps to check block roots against the beacon node for the correct fork

Using the beacon endpoint /eth/v1/debug/fork_choice, you can also see how the invalid block was processed in fork choice. This led to the conclusion that the problematic block had been imported optimistically, which caused Lodestar problems when pivoting to the minority chain.
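
To make these checks concrete, here is a minimal TypeScript sketch (assuming Node.js 18+ for the global fetch and a beacon API reachable at BEACON_URL; the default Lodestar REST port 9596 is an assumption) that queries the standard block-root and fork choice endpoints mentioned above:

    // Sketch: compare the canonical block root at a given slot against roots reported by
    // nodes known to be on the correct chain, and inspect fork choice for the bad root.
    const BEACON_URL = process.env.BEACON_URL ?? "http://localhost:9596";
    const SLOT_TO_CHECK = "3711917"; // slot checked during the incident (method 1 above)
    const BAD_ROOT = "0x2db899881ed8546476d0b92c6aa9110bea9a4cd0dbeb5519eb0ea69575f1f359";

    async function main(): Promise<void> {
      // Canonical block root at the slot, for comparison with healthy nodes.
      const rootRes = await fetch(`${BEACON_URL}/eth/v1/beacon/blocks/${SLOT_TO_CHECK}/root`);
      if (rootRes.ok) {
        const {data} = await rootRes.json();
        console.log(`canonical block root at slot ${SLOT_TO_CHECK}:`, data.root);
      } else {
        console.log(`no canonical block at slot ${SLOT_TO_CHECK} (status ${rootRes.status})`);
      }

      // Fork choice view: is the known-bad root present, and how was it imported
      // (the "validity" field shows e.g. optimistic import)?
      const fcRes = await fetch(`${BEACON_URL}/eth/v1/debug/fork_choice`);
      const fc = await fcRes.json();
      const badNode = fc.fork_choice_nodes.find((n: {block_root: string}) => n.block_root === BAD_ROOT);
      console.log("fork choice entry for the bad block:", badNode ?? "not present");
    }

    main().catch(console.error);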

There were multiple viable forks which added exponential work to all consensus clients. This environment created a resource-heavy load on every node on the network. Dapplion's Devcon SEA presentation on "How long non-finality could kill Ethereum" accurately predicted many of the problems.

In a chaotic forking environment, it becomes very hard to know which fork is the correct chain, and finding healthy peers becomes increasingly difficult, made worse by unhealthy nodes continuing to build the wrong chain. Shutting down unhealthy or faulty nodes on the wrong chain may improve the ability of CLs to properly sync new, healthy beacon nodes onto the correct chain.

The most important thing we can do in this environment is to continue building blocks to restore liveness on the correct chain. This means we need the ability to spin up healthy nodes that can be synced to the same, correct canonical chain so the connected validators can produce the desired blocks. One of the hardest problems was making sure we could point Lodestar at healthy peers so we could sync to the minority fork the community rallied around.

To prevent Lodestar from re-syncing to the incorrect chain, we developed a feature that checks for blacklisted blocks and disallows processing them. Any descendant blocks of the invalid chain are also disallowed via https://github.com/ChainSafe/lodestar/pull/7498. Although there is an argument that this feature could be abused, the worst case of improper use is forking yourself onto your own minority chain, which only harms the node operator who misconfigures it.

In addition, a follow-up pull request to this feature adds a new endpoint /eth/v1/lodestar/blacklisted_blocks to return the root and slot of any blacklisted blocks. The CLI flag --chain.blacklistedBlocks 0x12345abcdef allows you to list 0x-prefixed block roots that should not be processed.
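
For illustration, a small TypeScript sketch that queries the blacklist endpoint of a node started with --chain.blacklistedBlocks; the exact response shape (a "data" array of root/slot entries) is an assumption based on the description above:

    // Print the blocks this node refuses to process.
    const BEACON_URL = process.env.BEACON_URL ?? "http://localhost:9596";

    async function printBlacklistedBlocks(): Promise<void> {
      const res = await fetch(`${BEACON_URL}/eth/v1/lodestar/blacklisted_blocks`);
      if (!res.ok) throw new Error(`request failed: ${res.status}`);
      const {data} = await res.json();
      for (const entry of data) {
        console.log(`blacklisted block root=${entry.root} slot=${entry.slot}`);
      }
    }

    printBlacklistedBlocks().catch(console.error);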

To find healthy peers in a chaotic environment, communication is essential for sharing trusted peer identifiers of healthy nodes, while ensuring they are not overloaded. Healthy nodes bombarded with too many requests may run into performance issues, taking down vital healthy peers in a fragile environment. Sharing healthy enr: addresses for consensus clients and purposefully using them as bootnodes was the early strategy for syncing up more healthy beacon nodes. Documents like this were shared by the community to provide a trusted Schelling point for coordination. In our holesky-rescue branch, we added some of these Holesky bootnodes by default to the image/branch build. This also led us to suggest that consensus clients implement an endpoint for adding a "trusted" peer which can bypass some peer scoring heuristics: https://github.com/ChainSafe/lodestar/issues/7559

In Lodestar, you can use the --bootnodes flag to add a list of ENR bootnodes and set --network.connectToDiscv5Bootnodes true to attempt direct libp2p peer connections to these nodes.

Note that if you want to use Lodestar as a static peer, you must enable --persistNetworkIdentity to ensure your ENR (your network identity) remains the same throughout restarts. Otherwise, it will change.
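
To share a node as a static peer or bootnode, its ENR and peer id can be read from the standard node identity endpoint. A minimal sketch, assuming a beacon API at BEACON_URL:

    // Retrieve this node's ENR and peer id so they can be shared with other operators.
    // Run the node with --persistNetworkIdentity so the shared ENR stays valid across restarts.
    const BEACON_URL = process.env.BEACON_URL ?? "http://localhost:9596";

    async function printIdentity(): Promise<void> {
      const res = await fetch(`${BEACON_URL}/eth/v1/node/identity`);
      const {data} = await res.json();
      console.log("peer_id:", data.peer_id);
      console.log("enr:", data.enr);
      console.log("p2p_addresses:", data.p2p_addresses);
    }

    printIdentity().catch(console.error);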

We currently do not support the trusted peer feature.

Lodestar currently struggles at times with syncing in a network where we may only have one or two useful peers. As noted in https://github.com/ChainSafe/lodestar/issues/7558, we may inadvertently end up banned for exceeding peers' rate limits with overwhelming data requests. The cause of these overwhelming data requests still needs to be investigated.

Resource usage in a non-finality environment

One of the primary weaknesses of consensus clients in a non-finality environment is resource usage. Specifically, out-of-memory (OOM) crashes take down healthy nodes, which makes it harder for others to sync up to the head of the chain. In addition, if a node needs to resync from a previously finalized checkpoint, it becomes a much harder hill to climb.

Handling memory usage in this environment is essential and timing is critical. Fixing memory handling issues and figuring out how to boot from unfinalized checkpoint states will help us sync healthy nodes faster and keep them stable under duress.

Lodestar has a fixed heap memory limit set by the Node.js environment. Our images and builds set this at 8GB (8192MB), which is sufficient for a stable network. When we cannot prune unfinalized states to disk, the resulting buildup of memory can cause out-of-memory (OOM) errors which crash the node. In our holesky-rescue branch, we increased the default heap limit in the Node.js environment with --max-old-space-size=32768 to allow for 32GB of heap memory.
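
As a quick way to verify what heap limit a given --max-old-space-size or NODE_OPTIONS setting actually yields, here is a small sketch using Node's built-in v8 module (run it with the same flags or environment you pass to Lodestar):

    // Print the effective V8 heap limit; e.g. run with
    // NODE_OPTIONS=--max-old-space-size=32768 to confirm the ~32GB limit from holesky-rescue.
    import v8 from "node:v8";

    const heapLimitGb = v8.getHeapStatistics().heap_size_limit / 1024 ** 3;
    console.log(`effective heap limit: ~${heapLimitGb.toFixed(1)} GB`);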

While other teams were trying to use Lodestar to sync to the minority fork head, they discovered that Lodestar crashes as it gets close to the head because we could not properly prune checkpoint states. We covered this issue in https://github.com/ChainSafe/lodestar/issues/7495.

The following fixes were implemented to correctly handle pruning checkpoint states:
https://github.com/ChainSafe/lodestar/pull/7497
https://github.com/ChainSafe/lodestar/pull/7505

Before the fixes were implemented, we would constantly OOM as memory grew indefinitely until it exceeded the max limit. In the case of this device, the max was set at 60GB of memory. After these fixes, Lodestar generally held memory fairly constant at around 10-12GB for most of the non-finality period.

Reduce storage by pruning checkpoint states

Besides memory, another bottleneck during non-finality is disk storage requirements. For the Holesky network, checkpoint states can reach as much as ~250MB per state, which consumes about 56GB per day and is another reason why beacon nodes fail to remain in sync during long non-finality periods. With https://github.com/ChainSafe/lodestar/pull/7510, we are able to set DEFAULT_MAX_CP_STATE_ON_DISK to persist a maximum of n epochs on disk to help with storage bloat.
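
For reference, the arithmetic behind the ~56GB/day figure, assuming one persisted checkpoint state per epoch and Holesky's timing of 12-second slots with 32 slots per epoch:

    // Rough storage math for persisted checkpoint states during non-finality on Holesky.
    const SECONDS_PER_SLOT = 12;
    const SLOTS_PER_EPOCH = 32;
    const STATE_SIZE_MB = 250; // ~250MB per Holesky state, per the figure above

    const epochsPerDay = (24 * 60 * 60) / (SECONDS_PER_SLOT * SLOTS_PER_EPOCH); // 225
    const gbPerDay = (epochsPerDay * STATE_SIZE_MB) / 1000; // ~56GB per day
    console.log(`${epochsPerDay} epochs/day -> ~${gbPerDay.toFixed(0)}GB of checkpoint states per day`);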

Limiting persisted checkpoint states this way is generally not safe. If you set DEFAULT_MAX_CP_STATE_ON_DISK to 100 (as we did by default for the holesky-rescue branch), the node may not be able to process blocks from other potential forks. The limit is therefore disabled by default and may be exposed as a configurable feature in the future.

Storage improvements, such as binary diff states and/or ERA files, are currently being developed. These should better address storage requirements without resorting to unsafe features during chain emergencies.

Using checkpointState to jumpstart sync

As the non-finality period persisted, it became increasingly difficult for new nodes to sync to the head of the correct fork. In Lodestar, you can set an arbitrary checkpoint state to initiate the sync via --checkpointState /path/to/file/or/url. We were able to get an SSZ state from Serenia (a trusted node operator) that was saved from a synced beacon node on the correct chain. This was a huge timesaver for jumpstarting other beacon nodes. By loading this checkpoint, we minimized the time required to catch up to the head of the chain and avoided syncing from the last finalized checkpoint.

Using local checkpoint states

During the Holesky rescue, one of the features that proved useful in a non-finality environment was the ability to jumpstart nodes from an unfinalized checkpoint. In addition, you need a way to persist the checkpoint state locally for future use, and the ability to share the state to help other nodes sync.

With https://github.com/ChainSafe/lodestar/pull/7509, Lodestar is able to use a local or remote unfinalized checkpoint state to reduce the syncing time required when a node is restarted. This allowed new nodes to quickly sync to head, especially as the non-finality period dragged on. In combination with https://github.com/ChainSafe/lodestar/pull/7541, we implemented an API endpoint (eth/v1/lodestar/persisted_checkpoint_state) to return a specific, or the latest safe, checkpoint state. This feature will help bring healthy nodes up faster in a turbulent network. Node operators in this environment generally refrain from restarting synced beacon nodes because of the difficulty of resyncing, and some important settings require an application restart to take effect, so this feature will help node operators feel more comfortable restarting nodes in future non-finality environments.
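
A minimal sketch of how this could be combined with --checkpointState, assuming the endpoint serves SSZ bytes when requested with Accept: application/octet-stream (the exact content negotiation is an assumption):

    // Download the latest safe checkpoint state from a healthy Lodestar node and save it,
    // so a fresh node can be jumpstarted with --checkpointState instead of syncing from
    // the last finalized checkpoint.
    import {writeFile} from "node:fs/promises";

    const HEALTHY_NODE = process.env.HEALTHY_NODE ?? "http://localhost:9596";

    async function downloadCheckpointState(outPath: string): Promise<void> {
      const res = await fetch(`${HEALTHY_NODE}/eth/v1/lodestar/persisted_checkpoint_state`, {
        headers: {Accept: "application/octet-stream"},
      });
      if (!res.ok) throw new Error(`request failed: ${res.status}`);
      const bytes = Buffer.from(await res.arrayBuffer());
      await writeFile(outPath, bytes);
      console.log(`wrote ${bytes.length} bytes to ${outPath}`);
      // Pass the file (or the URL itself) to a new node via --checkpointState.
    }

    downloadCheckpointState("holesky_unfinalized_state.ssz").catch(console.error);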

Slot import tolerance adjustments

When trying to sync the correct chain, the hardest parts were the early periods of the rescue, when no blocks were being produced and there were consecutive skipped slots in the hundreds. Our slotImportTolerance was increased from the default of 32 because there were periods of more than one epoch in which we would not see a single block. This setting is important because it prevents Lodestar from building on a head that is older than 32 slots, which could cause massive reorgs. However, Holesky was suffering from very serious liveness issues where we wouldn't see a block for more than 32 slots, so the node would fall back into syncing mode, preventing it from producing blocks. Setting this higher by default for the early periods of the rescue helped ensure we could contribute to block production, even though new blocks were rarely seen.
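
A simplified illustration of the behaviour described above (this is not Lodestar's actual implementation, and the raised value of 512 is just an example):

    // If the wall-clock slot is more than slotImportTolerance ahead of our head,
    // the node considers itself out of sync and stops producing blocks.
    function isNodeSynced(clockSlot: number, headSlot: number, slotImportTolerance = 32): boolean {
      return clockSlot - headSlot <= slotImportTolerance;
    }

    // During the rescue, gaps of well over 32 slots without any block were common,
    // so a fully caught-up node would still flip back into syncing mode with the default.
    console.log(isNodeSynced(3711100, 3711000));      // false: 100 skipped slots, default tolerance
    console.log(isNodeSynced(3711100, 3711000, 512)); // true: raised tolerance (example value)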

Sync stalls and peer churning

Due to the turbulent network, continuous rate limiting of "bad peers" made it very difficult to find good peers that would serve us the data we needed. Lodestar would at times hit the max target peer count, yet none of those peers were viable for advancing our node. Although we experimented with churning a random subset of connected peers when we were stuck, we ultimately implemented a heuristic that starts pruning peers when the node is starved. In this state, it attempts to prune additional peers while prioritizing keeping peers that report being far ahead of our node: https://github.com/ChainSafe/lodestar/pull/7508

This PR also enhanced our /eth/v1/lodestar/peers endpoint to return more details about the status of our connected peers, including data such as their head_slot, which can help analyze the health of the network by identifying how many of our connected peers are actually synced to the head.

Example:

    {
      "peer_id": "17Uiu2HAmPcVbHnDgeFBTKcFtpaLfLfUJaEasdbpsBeT7nbxmmDeh",
      "enr": "",
      "last_seen_p2p_address": "/ip4/123.456.789.000/tcp/4096/p2p/17Uiu2HAmPcVbHnDgeFBTKcFtpaLfLfUJaEasdbpsBeT7nbxmmDeh",
      "direction": "inbound",
      "state": "connected",
      "agent_version": "Prysm/v5.0.3/2c6e028600d4ad5659f0d25d8911714aa54f9c25",
      "status": {
        "fork_digest": "0x019e21ad",
        "finalized_root": "0xd0285a8ec914f53cf15e9cce336a07c4f11c666e2addec620592d5ff3640ed34",
        "finalized_epoch": "115967",
        "head_root": "0x669de4cb66b8444684e59ada8f3cd4729ddb9d4e3d2d7b1d361b59fe0ba96aa0",
        "head_slot": "3775735"
      },
      "metadata": {
        "seq_number": "2",
        "attnets": "0x0000000000000600",
        "syncnets": "0x08"
      },
      "agent_client": "Prysm",
      "last_received_msg_unix_ts_ms": 1741211227122,
      "last_status_unix_ts_ms": 1741211227062,
      "connected_unix_ts_ms": 1741211226993
    },
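
As an example of how this richer peer data can be used, here is a short TypeScript sketch (field names follow the example above; the aggregation itself is illustrative) that counts how many connected peers report a head_slot close to our own:

    // Summarize connected peers by how close their reported head_slot is to our head.
    const BEACON_URL = process.env.BEACON_URL ?? "http://localhost:9596";

    type LodestarPeer = {state: string; status?: {head_slot: string}};

    async function summarizePeers(): Promise<void> {
      // Our own head slot from the standard headers endpoint.
      const headRes = await fetch(`${BEACON_URL}/eth/v1/beacon/headers/head`);
      const ourHead = Number((await headRes.json()).data.header.message.slot);

      // Enriched peer list from the Lodestar-specific endpoint.
      const peersRes = await fetch(`${BEACON_URL}/eth/v1/lodestar/peers`);
      const peers: LodestarPeer[] = (await peersRes.json()).data;

      const connected = peers.filter((p) => p.state === "connected");
      const nearHead = connected.filter((p) => p.status && ourHead - Number(p.status.head_slot) <= 32);
      console.log(`${nearHead.length}/${connected.length} connected peers within 32 slots of our head (${ourHead})`);
    }

    summarizePeers().catch(console.error);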

Pectra attestation bug

As Lodestar continued to build a quickly syncable Holesky fork to help rescue the network, we discovered an attestation bug specific to Electra which led to no aggregate inclusions and to our nodes producing very poor aggregated attestations. This made our validator performance very poor while trying to rescue the Holesky chain, and we were less effective at helping to finalize it. The fix, which uses the correct subnet for validating gossip attestations, was merged to the holesky-rescue branch shortly after Holesky finalized.

This devastating validator performance bug should have been caught during our Pectra testing. Unfortunately, it was missed due to our lack of internal testing visibility on devnets.

ChainSafe now runs devnet nodes with the same standardized Lodestar metrics used by our testing fleet. Developer ergonomics matter, and having familiar tooling, profiling and dashboards is key to proper node monitoring. This allows us to use our internal observational metrics and dashboards to visually inspect performance issues on EthPandaOps devnets before they occur on public testnets.

Other Considerations for Future Implementation

Lighthouse included a disable-attesting feature which is used to avoid flooding the beacon node while syncing and removes an additional overhead of dealing with attestation requests: https://github.com/sigp/lighthouse/pull/7046

A leader implementation for disaster recovery should be brainstormed. What would it look like if the community needed to socially agree to nudge mainnet down a mutually agreed upon path? Perhaps consensus clients should have the ability to ingest an anchor state, on any fork, in case of emergency.

As an alternative to blacklisting bad block roots, we are attempting to make a "pessimistic sync" work via https://github.com/ChainSafe/lodestar/pull/7511.

Summary

Overall, Holesky was a great exercise and provided valuable fixes for all client teams. We believe that non-finality devnets should be prioritized as part of hard fork readiness testing so we have more opportunities to test the unhappy cases of network turbulence. This experience forced us to ask very important questions, even though this may not be the same emergency response plan we would consider for mainnet. There are many uncontrollable aspects of mainnet where a rescue reminiscent of Holesky would not be feasible. However, some of the features and fixes developed from this incident may prove useful if we ever experience chain splits or the finalization of bad blocks in the future.