---
tags: newineth2
description: The latest update on Ethereum 2.0 development
image: https://benjaminion.xyz/f/favicon-96x96.png
---

<style>
  a {text-decoration: underline;}
  a {color: #0000ee;}
  a:visited {color: #551a8b;}
</style>

# What's New in Eth2 - 17 August 2020

![My avatar](https://benjaminion.xyz/f/ms-icon-144x144.png =32x32) Ben Edgington ([PegaSys](https://pegasys.tech/), [ConsenSys](https://consensys.net/) — but views expressed are all my own)

Edition 49 at [eth2.news](https://eth2.news/)

# Medalla meltdown special edition

I'm producing this as an off-schedule one-off just to explain some of the events over the weekend around the Eth2 Medalla testnet.

We started Medalla almost two weeks ago, on the 4th of August, as a large, public, multiclient testnet running the release-candidate Eth2 spec. You can read more about that in the [last edition](https://hackmd.io/@benjaminion/wnie2_200808).

The testnet ran wonderfully for 10 days, although participation was a little lower than hoped for: only 70-80% of validators were regularly or constantly online. But that's OK; it's designed to cope.

However, on Friday evening, I was watching the dashboard when participation fell off a cliff. In the space of a few minutes, the number of active validators dropped from around 22,000 to about 5,000: up to 80% of the network had vanished.

This is a quick round-up of the event, the aftermath, and what next.

:::info
Update: Prysmatic Labs has now published a very thorough [walk-through](https://medium.com/prysmatic-labs/eth2-medalla-testnet-incident-f7fbc3cc934a) of the incident.
:::

## What happened???

It turned out that every Prysm validator in the network had suddenly disappeared. Since Prysm has been the most widely used client, the effect was pretty severe.

The Prysmatic team is maintaining an excellent [incident report](https://docs.google.com/document/d/11RmitNRui10LcLCyoXY6B1INCZZKq30gEU6BEg3EWfk/edit#) that contains all the details and their responses. The below is just some highlights and my commentary.

The problem was related to clock sync. Prysm clients were configured to use Cloudflare's [Roughtime](https://blog.cloudflare.com/roughtime/) to work out the time. It's not entirely clear (to me) how it happened, but apparently Roughtime started serving a time four hours into the future, and did so for over an hour. As a result, all Prysm clients suddenly found themselves four hours in the future, and merrily set about making blocks and attestations for a chain that didn't exist yet.

In itself, this was not catastrophic. The remaining clients were able to keep building the original chain, albeit with many missing blocks, and ignoring the flood of attestations from the future. Gradually, Prysm nodes started coming back as their clocks readjusted, and participation began to rise. Yay!

It wasn't until a couple of hours later that all hell broke loose. Four hours after the original incident, two things happened. First, all the attestations from the future that the Prysm clients had produced started to become valid. Second, the Prysm nodes that had rejoined the network vanished again as their slashing protection kicked in to stop them making any contradictory attestations.

The result of these two things together was chaos. The beacon chain exploded into a forest of branches as the remaining clients struggled to process the crazy information they were receiving. (Update: Raul from the Prysmatic team tells me that this was all exacerbated by a critical bug in Prysm's initial fix.)
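To get a sense of why a four-hour clock skew is so disruptive, here is a minimal sketch of how a beacon node derives the current slot from its clock. It's in Python, using the spec's mainnet constants (which Medalla also uses); the function is purely illustrative, not Prysm's actual code. With 12-second slots, a clock that is four hours fast puts a node 1,200 slots, roughly 37 epochs, ahead of everyone else, so every block and attestation it produces belongs to a chain that doesn't exist yet.

```python
# Sketch: deriving the current slot from wall-clock time.
# Constants match the Medalla/mainnet configuration; the genesis timestamp and
# function name are illustrative, not taken from any particular client.

import time

SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32

def current_slot(genesis_time: int, now: int) -> int:
    """Slot number implied by the wall-clock time `now` (seconds since the Unix epoch)."""
    if now < genesis_time:
        return 0
    return (now - genesis_time) // SECONDS_PER_SLOT

# Suppose the node's clock suddenly runs four hours fast, as happened when
# Roughtime served a bad time:
genesis_time = 1596546000          # ~ Medalla genesis, 4 Aug 2020, 13:00 UTC (illustrative)
true_now = int(time.time())
skewed_now = true_now + 4 * 60 * 60

skew_in_slots = current_slot(genesis_time, skewed_now) - current_slot(genesis_time, true_now)
print(skew_in_slots)                      # 1200 slots ahead of the real head...
print(skew_in_slots // SLOTS_PER_EPOCH)   # ...which is about 37 epochs into the future
```

And since the rest of the network only starts to take those attestations into account once its own clocks reach the slots they refer to, the damage was deferred by roughly four hours, which matches the delayed chaos described above.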
For a while, processing all of this was manageable. But, over the next 24 hours or so, the memory and CPU requirements of navigating this increasingly complex mess of forks started to become overwhelming. I spotted a Lighthouse client using 30GB of memory (about 100 times what it typically uses), and Teku was having trouble even with 12GB of Java heap while maxing out the processor.

This was all over the weekend, remember. Kudos to the client teams who have worked hard, and continue to work hard, on optimising memory use and efficiency so their nodes can handle the network chaos.

The current status is that the network is slowly recovering. User reports vary, but there are versions of Prysm and Lighthouse just out that are fairly efficiently able to find the right chain head and continue building the beacon chain. [Eth2Stats](https://eth2stats.io/medalla-testnet) is currently showing some instances of Lighthouse, Prysm, and Teku nodes at or near the chain head. We're doing some more work on Teku to allow it to keep up while using substantially lower resources.

## What didn't happen

There was no consensus failure between clients, in the sense that, when they can reach it, all clients agree on the state at the head of the chain. This is great, meaning that the beacon chain is not fundamentally broken, and there's no need for any kind of hard fork[^fn1].

[^fn1]: Contra this [absurd article](https://www.trustnodes.com/2020/08/17/ethereum-2-0-testnet-now-has-to-hardfork) in "Trust"nodes.

## Lessons

Here are some of my immediate thoughts on lessons learnt. Of course we will be taking time to reflect on this incident properly; the below is just me shooting from the hip.

### The importance of time

Mass reliance on third-party time services is a serious vulnerability for the network. As it happens, Alex Vlasov of the ConsenSys TX/RX research team has written extensively about time synchronisation and its importance in Ethereum&nbsp;2.0. By and large his work has flown under the radar, but perhaps this is a good opportunity to pay it some more attention. Here is his [list of articles and ethresear.ch posts](https://github.com/harmony-dev/beacon-chain-java/wiki/Research-documents-(ENG)#time-synchronization).

### The value of diversity

In an ideal world, we would have four or more independent clients, each with less than, say, a 30% share of the network. In those circumstances, a client could go down for a while and we would barely notice.

Even if we can't achieve that ideal, reducing the network dominance of a single client will lead to a more robust network. If 50% of the validators had vanished rather than 80%, it would have been a lot easier to recover. This is because, when clients drop off, block production, attestation inclusion, gossip effectiveness, peering, and syncing all suffer, with knock-on effects on the performance of the remaining validators.

### The usefulness of options

Some stakers were able to switch their signing keys to hot-standby nodes running other clients. This is definitely a good safety net, although huge care needs to be taken to avoid getting slashed: the new validators may not know the voting history of the old ones and could make contradictory votes. (Update: we are working on a common format for slashing-protection info so that it can be moved between clients along with the keys.)
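To make the slashing risk concrete, here is a minimal sketch of the kind of check a validator client's slashing protection performs before signing an attestation, based on the attester-slashing conditions in the spec. The record format and function names are my own invention for illustration, not any client's actual code; a common interchange format would essentially carry this voting history so that it can follow the keys from one client to another.

```python
# Sketch: slashing-protection checks before signing a new attestation.
# The data structures are invented for illustration; real clients also track
# block-proposal history, omitted here for brevity.

from dataclasses import dataclass
from typing import List

@dataclass
class SignedVote:
    source_epoch: int  # FFG source checkpoint epoch
    target_epoch: int  # FFG target checkpoint epoch

def is_safe_to_sign(history: List[SignedVote], new: SignedVote) -> bool:
    for prev in history:
        # Conservatively refuse to sign twice for the same target epoch:
        # a second, different attestation would be a slashable double vote.
        if new.target_epoch == prev.target_epoch:
            return False
        # Refuse surround votes: the new vote surrounds a previous one, or vice versa.
        if new.source_epoch < prev.source_epoch and new.target_epoch > prev.target_epoch:
            return False
        if prev.source_epoch < new.source_epoch and prev.target_epoch > new.target_epoch:
            return False
    return True

# A hot-standby validator that starts with an empty history has nothing to
# check against, hence the risk of contradicting votes already made elsewhere.
history = [SignedVote(source_epoch=20, target_epoch=21)]
print(is_safe_to_sign(history, SignedVote(20, 21)))  # False: double vote
print(is_safe_to_sign(history, SignedVote(19, 22)))  # False: surround vote
print(is_safe_to_sign(history, SignedVote(21, 22)))  # True: safe to sign
```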
In future, once the [new API](https://github.com/ethereum/eth2.0-APIs/) is finished, it should be possible to switch the validator clients themselves, not just the keys, between different beacon nodes. For example, a Prysm validator client could simply be detached from a Prysm beacon node and attached instead to a Teku beacon node. This would help to solve the slashing problem above.

### The responsibilities of stakers

Eth2 is not yet "set and forget". For some time yet, those staking will need to pay attention, read the forums, provide feedback to devs, and update their clients at short notice. I am strongly supportive of individuals getting involved and running their own validators, but it's best to be aware of the commitment ahead of time.

### The hazards of haste

Why do things always break on Friday evenings? :joy:

Despite the time, the Prysmatic team's response was impressive. See the [incident report](https://docs.google.com/document/d/11RmitNRui10LcLCyoXY6B1INCZZKq30gEU6BEg3EWfk/edit#) for details. Nothing of what I say below is intended to reflect badly on the Prysmatic team: they did a genuinely tremendous job. These are observations, and lessons for my own team to learn, for when we end up in a similar situation.

When there are so many users, all losing "money" (even if only testnet money), and the network is stressed, it's only natural to want to respond extremely quickly. But sometimes it really is better to slow down to speed up.

Two things, in particular, might have been avoided. First, the initial fix published, `alpha.21`, had a flaw that resulted in users being asked to roll back about 17 hours later. According to Raul from Prysmatic, this flaw compounded much of the ensuing network chaos. Second, while managing the situation, the Prysmatic team inadvertently deleted the database of anti-slashing records for their own 1024 validators, resulting in most, if not all, of them being slashed. It could have happened to any of us.

The lesson here is about taking time, even when under great pressure. And we all---dev teams and users alike---need to remember this. As we seek to recover the network, we're taking time to do it right.

### The joys of chaos

Finally, if this hadn't happened, we would have had to make something like it happen. There's no point having a testnet that doesn't test anything. Being constantly in the happy flow is simply not realistic.

It turns out that this was a superb test! It's pretty much the worst kind of shock the network could suffer (apart from doing it repeatedly, maybe), and likely more severe than anything we might have designed ourselves. Breaking the testnet like this is exactly what we need to make our clients bullet-proof.

I got quoted in an [article in The Block](https://www.theblockcrypto.com/daily/74616/medalla-eth-2-dress-rehearsal) last week:

> In an email to The Block, PegaSys engineer Ben Edgington wrote that Medalla “is the first testnet that has the scale, scope, and configuration of the planned Mainnet.”
>
> “This is the first large-scale trial of something that previously has been only a spec on a screen, or just a toy network. There are many aspects of the peer-to-peer network that we need to test and optimise,” he wrote. “So far, things are looking good, but we need more time, more scale, and **some more network stress** until we can be really sure.”

Honestly, [be careful what you wish for](https://twitter.com/benjaminion_xyz/status/1294909664652595201) :grin:

## What next?
At the moment, we're all working on our clients to make them robust against the extreme conditions we are seeing. This is looking promising, and is excellent news for the future. It is very likely that we can recover Medalla into a workable state over the next few days, albeit with some damage to everyone's balances, and some slashed validators.

If, after all this, we find that participation never recovers (it might all have been too much for some), then, even if we could keep the network running, we might think about restarting from scratch with a fresh deposit contract (it might be good to rehearse genesis again). But that's just a backup option at this stage.

Viva Medalla!

* * *

[![[Twitter]](https://benjaminion.xyz/newineth2/img/twitter.svg =40x40)](https://twitter.com/benjaminion_xyz) Follow me on [Twitter](https://twitter.com/benjaminion_xyz) to hear when the next edition is out 🙌.

[![[RSS]](https://benjaminion.xyz/newineth2/img/rss.svg =32x32)](https://benjaminion.xyz/newineth2/rss_feed.xml) We also have an [RSS feed](https://benjaminion.xyz/newineth2/rss_feed.xml).