For the past ~5 months, Lighthouse has been implementing Doppelganger Protection. I believe either Dankrad or Superphiz came up with this concept; you can find a loose description here: https://github.com/sigp/lighthouse/issues/2069. The Lighthouse implementation was merged here: https://github.com/sigp/lighthouse/pull/2230.
Doppelganger Protection (DP) is a nice feature, but it's turned out to be rather difficult to implement. The difficulty stems from two major points: it sits in very critical code paths in the validator client, and failures in DP can easily result in full liveness failures on the network.
I (@paulhauner) created this document to help share some of the edge-cases we found along the way.
The first edge-case is starting before genesis. This one is pretty obvious; I think most teams are familiar with it.
If you wait some constant `N` epochs after start-up when the VC is starting before genesis, then DP is pointless. All nodes will wait `N` epochs (missing all blocks) and then start attesting. No one will detect a doppelganger and you'll miss out on `N` epochs of validator duties. We've seen some clients miss the first few epochs on testnets because of this.
Mitigation: randomise `N` rather than using the same constant for every node. You'll still get a liveness failure, but you might detect doppelgangers.
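A minimal sketch of that randomised wait, in Rust; the function name and the use of the `rand` crate are illustrative assumptions, not Lighthouse's actual implementation:

```rust
use rand::Rng;

/// Illustrative only: pick a per-node doppelganger wait in 1..=max_wait_epochs
/// instead of a constant, so that not every validator goes silent and then
/// starts signing at exactly the same post-genesis epoch.
fn doppelganger_wait_epochs(max_wait_epochs: u64) -> u64 {
    rand::thread_rng().gen_range(1..=max_wait_epochs)
}
```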
Consider that the current epoch is some epoch `e`. The VC produced an attestation at `start_slot(e)` and is scheduled to produce a block at `end_slot(e)`.

If you reboot the VC at `start_slot(e) + 1`, you need to ensure that you don't detect your already-published attestation as a doppelganger, otherwise you'll miss the block proposal.
Mitigation: if you start in epoch `e`, wait until `e + 1` before you start checking for doppelgangers. This is what Lighthouse does.

Note: it's important you don't trigger false-positives with Doppelganger. If you get users used to false-positives, then you're basically telling them to rely on Doppelganger. We want their behaviour to be something along the lines of "oh shit, a doppelganger, I need to drop everything and figure out why there's a duplicate", rather than "oh a doppelganger, I'll just keep doing the same thing until the errors stop."
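A minimal sketch of that rule in Rust; the names are illustrative assumptions, not Lighthouse's actual types:

```rust
/// If the VC started (or restarted) in `started_epoch`, it may already have
/// signed messages earlier in that same epoch, so doppelganger checks only
/// begin from the following epoch.
fn first_doppelganger_check_epoch(started_epoch: u64) -> u64 {
    started_epoch + 1
}

/// A block/attestation seen in `message_epoch` only counts as a doppelganger
/// candidate if it comes from an epoch we were fully silent in.
fn is_doppelganger_candidate(message_epoch: u64, started_epoch: u64) -> bool {
    message_epoch >= first_doppelganger_check_epoch(started_epoch)
}
```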
Consider a user staking on a laptop. They start their validator in epoch `e` and then immediately shut their laptop lid (i.e., suspend). Then, they re-open the lid (i.e., wake from suspend) in epoch `e + 10`.
Say that you usually wait `WAIT_EPOCHS = 2` epochs to check for Doppelgangers before you start signing. It's very easy to write DP in a way that says `e + 10 > e + WAIT_EPOCHS`, therefore the wait is over and it's safe to sign, when actually you've never checked for Doppelgangers at all.
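As a sketch, the flawed version of the check looks something like this (names are illustrative):

```rust
const WAIT_EPOCHS: u64 = 2;

/// Flawed: this only proves that wall-clock time has passed since start-up,
/// not that any doppelganger checks actually ran. A laptop suspended for 10
/// epochs passes this test without the BN ever having been queried.
fn is_safe_to_sign_flawed(current_epoch: u64, start_epoch: u64) -> bool {
    current_epoch > start_epoch + WAIT_EPOCHS
}
```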
In Lighthouse, we've mitigated this by saying a validator needs `n` "checks" before it can start. So, each time we ask the BN if there are any Doppelgangers and it says no, we decrease our "remaining checks" by 1. This makes us safe from the suspend/wake scenario, since we avoid the timing assumption.
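A minimal sketch of that counter-based approach, assuming the BN query boils down to a boolean per check; the names are illustrative, not Lighthouse's actual types:

```rust
struct DoppelgangerState {
    remaining_checks: u64,
}

impl DoppelgangerState {
    fn new(required_checks: u64) -> Self {
        Self { remaining_checks: required_checks }
    }

    /// Call once for each completed "any doppelgangers seen?" query to the BN.
    /// If a doppelganger was seen, the VC should abort signing entirely
    /// (handled elsewhere), so the counter is left untouched here.
    fn register_check(&mut self, doppelganger_seen: bool) {
        if !doppelganger_seen {
            self.remaining_checks = self.remaining_checks.saturating_sub(1);
        }
    }

    /// Safe to sign only after the required number of real checks has
    /// completed, regardless of how much wall-clock time has passed
    /// (e.g. across a suspend/wake).
    fn is_safe_to_sign(&self) -> bool {
        self.remaining_checks == 0
    }
}
```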
Lighthouse went to some effort to ensure that validators undergoing DP still post subnet subscriptions to the BN. This means the BN subscribes to their attestation subnets and has more chance of seeing Doppelgangers.
Note: doing this means you need to sign `selection_proof` objects. This is fine since they're not slashable.
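For reference, a sketch of building the request body for the standard beacon node API endpoint `POST /eth/v1/validator/beacon_committee_subscriptions`; the helper function itself is an illustrative assumption, and `is_aggregator` is what the signed `selection_proof` is needed for:

```rust
use serde_json::{json, Value};

/// Illustrative: body for `POST /eth/v1/validator/beacon_committee_subscriptions`.
/// Posting these while a validator is still in its doppelganger-detection
/// period makes the BN join the relevant attestation subnets, improving the
/// chance of observing a doppelganger.
fn subnet_subscription_body(
    validator_index: u64,
    committee_index: u64,
    committees_at_slot: u64,
    slot: u64,
    is_aggregator: bool, // derived from the signed selection_proof
) -> Value {
    json!([{
        "validator_index": validator_index.to_string(),
        "committee_index": committee_index.to_string(),
        "committees_at_slot": committees_at_slot.to_string(),
        "slot": slot.to_string(),
        "is_aggregator": is_aggregator,
    }])
}
```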
Because of how gossip propagation checks are specified, generally with DP you're asking "have we seen this validator produce a block/attestation in some epoch?"
If that's the question, then we must also consider when to ask it. In Lighthouse, we have a thing called "epoch satisfaction"; it's when we're satisfied that our validator wasn't seen in some epoch `e`.
We say that an epoch `e` is "satisfied" once we reach `end_slot(e + 1)` and the BN has told us that the validator has not been seen in epoch `e`. By this time, an attestation from epoch `e` can no longer be included in any new block, so in all likelihood, if the validator was live, we'd already have seen them.
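A minimal sketch of that "epoch satisfaction" rule, assuming mainnet's `SLOTS_PER_EPOCH = 32`; the function names are illustrative:

```rust
const SLOTS_PER_EPOCH: u64 = 32;

/// Last slot of the given epoch.
fn end_slot(epoch: u64) -> u64 {
    (epoch + 1) * SLOTS_PER_EPOCH - 1
}

/// Epoch `e` is "satisfied" once we've reached `end_slot(e + 1)` and the BN
/// has reported that the validator was not seen producing a block or
/// attestation during `e`.
fn is_epoch_satisfied(epoch: u64, current_slot: u64, seen_in_epoch: bool) -> bool {
    current_slot >= end_slot(epoch + 1) && !seen_in_epoch
}
```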