For the past ~5 months, Lighthouse has been implementing Doppelganger Protection. I believe either Dankrad or Superphiz came up with this concept; you can find a loose description here: https://github.com/sigp/lighthouse/issues/2069. The Lighthouse implementation was merged here: https://github.com/sigp/lighthouse/pull/2230.
Doppelganger Protection (DP) is a nice feature, but it's turned out to be rather difficult to implement. The difficulty stems from two major points: it sits in very critical code paths in the validator client, and failures in DP can easily result in full liveness failures on the network.
I (@paulhauner) created this document to help share some of the edge-cases we found along the way.
The first edge-case is starting before genesis. This one is pretty obvious; I think most teams are familiar with it.
If you wait some constant `N` epochs after start-up when the VC is starting before genesis, then DP is pointless. All nodes will wait `N` epochs (missing all blocks) and then start attesting. No one will detect a doppelganger and you'll miss out on `N` epochs of validator duties. We've seen some clients miss the first few epochs on testnets because of this.
Mitigation: randomise `N` rather than using the same constant for every node. You'll still get a liveness failure, but you might detect doppelgangers.
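A minimal sketch of that randomised wait, in Rust; the function name and the use of the `rand` crate are illustrative assumptions, not Lighthouse's actual implementation:

```rust
use rand::Rng;

/// Illustrative only: pick a per-node doppelganger wait in 1..=max_wait_epochs
/// instead of a constant, so that not every validator goes silent and then
/// starts signing at exactly the same post-genesis epoch.
fn doppelganger_wait_epochs(max_wait_epochs: u64) -> u64 {
    rand::thread_rng().gen_range(1..=max_wait_epochs)
}
```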
Consider that the current epoch is some epoch `e`. The VC produced an attestation at `start_slot(e)` and is scheduled to produce a block at `end_slot(e)`.

If you reboot the VC at `start_slot(e) + 1`, you need to ensure that you don't detect your already-published attestation as a doppelganger, otherwise you'll miss the block proposal.
Mitigation: if you start in epoch `e`, wait until `e + 1` before you start checking for doppelgangers. This is what Lighthouse does.

Note: it's important you don't trigger false-positives with Doppelganger. If you get users used to false-positives, then you're basically telling them to rely on Doppelganger. We want their behaviour to be something along the lines of "oh shit, a doppelganger, I need to drop everything and figure out why there's a duplicate", rather than "oh a doppelganger, I'll just keep doing the same thing until the errors stop."
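A minimal sketch of that rule in Rust; the names are illustrative assumptions, not Lighthouse's actual types:

```rust
/// If the VC started (or restarted) in `started_epoch`, it may already have
/// signed messages earlier in that same epoch, so doppelganger checks only
/// begin from the following epoch.
fn first_doppelganger_check_epoch(started_epoch: u64) -> u64 {
    started_epoch + 1
}

/// A block/attestation seen in `message_epoch` only counts as a doppelganger
/// candidate if it comes from an epoch we were fully silent in.
fn is_doppelganger_candidate(message_epoch: u64, started_epoch: u64) -> bool {
    message_epoch >= first_doppelganger_check_epoch(started_epoch)
}
```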
Consider a user staking on a laptop. They start their validator in epoch `e` and then immediately shut their laptop lid (i.e., suspend). Then, they re-open the lid (i.e., wake from suspend) in epoch `e + 10`.
Say that you usually wait `WAIT_EPOCHS = 2` epochs to check for Doppelgangers before you start signing. It's very easy to write DP in a way that says `e + 10 > e + WAIT_EPOCHS`, therefore the wait is over and it's safe to sign, when actually you've never checked for Doppelgangers at all.
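As a sketch, the flawed version of the check looks something like this (names are illustrative):

```rust
const WAIT_EPOCHS: u64 = 2;

/// Flawed: this only proves that wall-clock time has passed since start-up,
/// not that any doppelganger checks actually ran. A laptop suspended for 10
/// epochs passes this test without the BN ever having been queried.
fn is_safe_to_sign_flawed(current_epoch: u64, start_epoch: u64) -> bool {
    current_epoch > start_epoch + WAIT_EPOCHS
}
```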
In Lighthouse, we've mitigated this by saying a validator needs `n` "checks" before it can start. So, each time we ask the BN if there are any Doppelgangers and it says no, we decrease our "remaining checks" by 1. This makes us safe from the suspend/wake scenario, since we avoid the timing assumption.
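A minimal sketch of that counter-based approach, assuming the BN query boils down to a boolean per check; the names are illustrative, not Lighthouse's actual types:

```rust
struct DoppelgangerState {
    remaining_checks: u64,
}

impl DoppelgangerState {
    fn new(required_checks: u64) -> Self {
        Self { remaining_checks: required_checks }
    }

    /// Call once for each completed "any doppelgangers seen?" query to the BN.
    /// If a doppelganger was seen, the VC should abort signing entirely
    /// (handled elsewhere), so the counter is left untouched here.
    fn register_check(&mut self, doppelganger_seen: bool) {
        if !doppelganger_seen {
            self.remaining_checks = self.remaining_checks.saturating_sub(1);
        }
    }

    /// Safe to sign only after the required number of real checks has
    /// completed, regardless of how much wall-clock time has passed
    /// (e.g. across a suspend/wake).
    fn is_safe_to_sign(&self) -> bool {
        self.remaining_checks == 0
    }
}
```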
Lighthouse went to some effort to ensure that validators undergoing DP still post subnet subscriptions to the BN. This means the BN subscribes to their attestation subnets and has more chance of seeing Doppelgangers.
Note: doing this means you need to sign `selection_proof` objects. This is fine since they're not slashable.
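For reference, a sketch of building the request body for the standard beacon node API endpoint `POST /eth/v1/validator/beacon_committee_subscriptions`; the helper function itself is an illustrative assumption, and `is_aggregator` is what the signed `selection_proof` is needed for:

```rust
use serde_json::{json, Value};

/// Illustrative: body for `POST /eth/v1/validator/beacon_committee_subscriptions`.
/// Posting these while a validator is still in its doppelganger-detection
/// period makes the BN join the relevant attestation subnets, improving the
/// chance of observing a doppelganger.
fn subnet_subscription_body(
    validator_index: u64,
    committee_index: u64,
    committees_at_slot: u64,
    slot: u64,
    is_aggregator: bool, // derived from the signed selection_proof
) -> Value {
    json!([{
        "validator_index": validator_index.to_string(),
        "committee_index": committee_index.to_string(),
        "committees_at_slot": committees_at_slot.to_string(),
        "slot": slot.to_string(),
        "is_aggregator": is_aggregator,
    }])
}
```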
Because of how gossip propagation checks are specified, generally with DP you're asking "have we seen this validator produce a block/attestation in some epoch?"
If that's the question, then we must also consider when to ask it. In Lighthouse, we have a thing called "epoch satisfaction"; it's when we're satisfied that our validator wasn't seen in some epoch `e`.
We say that an epoch `e` is "satisfied" once we reach `end_slot(e + 1)` and the BN has told us that the validator has not been seen in epoch `e`. By this time, an attestation from epoch `e` can no longer be included in any new block, so in all likelihood, if the validator was live, we'd already have seen them.
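A minimal sketch of that "epoch satisfaction" rule, assuming mainnet's `SLOTS_PER_EPOCH = 32`; the function names are illustrative:

```rust
const SLOTS_PER_EPOCH: u64 = 32;

/// Last slot of the given epoch.
fn end_slot(epoch: u64) -> u64 {
    (epoch + 1) * SLOTS_PER_EPOCH - 1
}

/// Epoch `e` is "satisfied" once we've reached `end_slot(e + 1)` and the BN
/// has reported that the validator was not seen producing a block or
/// attestation during `e`.
fn is_epoch_satisfied(epoch: u64, current_slot: u64, seen_in_epoch: bool) -> bool {
    current_slot >= end_slot(epoch + 1) && !seen_in_epoch
}
```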