MaxEB slashing risks

# MaxEB slashing risks

## Slashings background

In the beacon chain, a validator can be slashed in response to evidence of any of the following pairs of signed messages:

- Attestation double vote: two distinct attestations that vote for the same target
- Attestation surround vote: a vote whose source and target surround, or are surrounded by, those of another vote
- Double block proposal: two distinct proposals at the same slot

These slashing conditions address the nothing-at-stake problem by forcing attesters to participate in a single fork. Honest participants running a well-implemented validator client should never sign slashable messages thanks to a built-in slashing protection mechanism. However, there's no protection if the same key is run in two independent instances without a shared slashing protection database.

There are other risks associated with slashable messages, such as fork-choice implementation bugs or slashing protection mechanism bugs, but those are out of scope for this document.

## Running keys twice

The most common cause of slashing due to operator error is running the same key in separate instances that _could_ have different views of the chain. Specifically, a slashing offence may occur when the `AttestationData` produced by each node diverges. An attestation contains the following _subjective_ data for an attestation at `slot`:

1. LMD GHOST head vote = what is the head of the chain at `slot`
2. FFG vote target = what is the checkpoint of the head chain at `slot`
3. FFG vote source = what is the latest justified checkpoint

The head vote can diverge if, at the 4 second mark, each node has a different view of the head. In healthy network conditions that happens when blocks are produced late and some nodes receive the block later than others.

## Divergent rate

EF devops (@ethpandaops) deployed a set of sentry nodes and checked for each slot if their view of the chain was distinct 4 seconds into the slot.
We define the "rate of distinct attestation data" as the number of those events divided by the number of observed slots. For Ethereum mainnet this rate is quite low, allowing few opportunities to produce slashable messages.

![upload_7dec727856d8a141eb6aa2c90764a2fa](https://hackmd.io/_uploads/HkmvErMB6.jpg)

## Slot assignments

Consider an operator running $N$ keys. Each active key controls a validator index, which is assigned a slot to attest during each epoch. You can imagine this operation as a random lottery distributing the $N$ keys into 32 buckets every epoch, where each bucket can hold between $0$ and $N$ keys. The probabilities of each bucket are uniform, so each bucket has $N/32$ keys on average.

![upload_055630ec84e3f9416e289884e21475da](https://hackmd.io/_uploads/r1VKErfrp.png)

Next, each slot has a second independent lottery that decides whether views diverge in that slot. A 1% chance of diverging views means 1 slot every 20 minutes, or roughly 1 slot every 3 epochs. Note that adverse network conditions or client issues can significantly increase this rate.

## Typical slashing incident timeline

Let's break down the typical timeline of a slashing incident caused by operational error:

- **t0**: Some operational error causes keys to run in multiple independent instances. Let's assume no corrective action is taken before the first slashing event happens.
- **t1**: Some slot with assigned indexes "wins" the divergent chain views lottery, signs conflicting messages, and its keys are slashed. Ethereum mainnet is monitored by many parties, who sound alarms in response to slashing events.
- **t2**: In response to the alerts, some action is taken to correct the operational error, such as a human operator stopping a Docker container.

![upload_337d7a0faf7814c062bca451644b982b](https://hackmd.io/_uploads/r1X9EBzrT.png)

Let's define the **total incident cost** as the sum of those red bars, i.e. the total slashed stake over the incident duration.
The total incident cost can be split in two components:

- Initial _guaranteed_ slashing of at least 1 index (to trigger alarms)
- Sum of subsequent slashing events between **t1** and **t2**

### Initial slashing modeling

Assuming there's at least one slashing event before the operational response: how much stake is slashed in this one slot? To compute its cost we need the probability distribution of how many indexes are assigned to that slot, then scale up by stake per index.

Each index has an independent probability of $1/32$ of being assigned to a given slot. Notice this is a yes-no question over independent experiments, so we can use the binomial distribution with $p = 1/32$, $n = N$.

![upload_defbda6921d3bdcd4a82467a45efdb43](https://hackmd.io/_uploads/HyNsErfBa.png)

We want the expected number of validators assigned to a slot, given that at least one validator is assigned. The probability mass function (PMF) of this conditional distribution is given by

$$
P(X=k \mid X>0) = \frac{P(X=k)}{1-P(X=0)} \quad \text{for} \quad k=1,2,\dots,N
$$

where $P(X=k)$ is the PMF of the binomial distribution, $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$.

The expected value of this conditional distribution is calculated by summing over all possible values of $k$, multiplied by their conditional probabilities:

$$
E[X \mid X>0] = \sum_{k=1}^{N} k \cdot P(X=k \mid X>0)
$$

This function starts at $1$ for the single index case, where the only existing index is guaranteed to be slashed. The function tends to $N/32$ for large enough $N$.
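As a cross-check, and assuming the binomial model above holds exactly (independent assignments with $p = 1/32$, $n = N$), the sum admits a closed form. The $k=0$ term contributes nothing to the binomial mean $Np$, so the numerator is just that mean:

$$
E[X \mid X>0] = \frac{\sum_{k=0}^{N} k \cdot P(X=k)}{1 - P(X=0)} = \frac{Np}{1 - (1-p)^{N}}
$$

For $N = 1$ this gives $p / p = 1$, and for large $N$ the denominator tends to $1$, leaving $Np = N/32$. As a worked value, $N = 128$ gives $4 / (1 - (31/32)^{128}) \approx 4.07$, slightly above the unconditional mean of $4$.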
![image](https://hackmd.io/_uploads/SJXsn4GST.png)

<details>
<summary>Chart source code</summary>

```python
import numpy as np
from scipy.stats import binom

def conditional_expected_value_single_slot(M, p_assign):
    # M: number of validator indexes run by the operator
    # p_assign: probability that one index is assigned to a given slot
    # Binomial distribution for the number of validators in a slashable slot
    k_values = np.arange(1, M + 1)
    binom_probs = binom.pmf(k_values, M, p_assign)
    # Probability of at least one validator being assigned to a slot
    prob_at_least_one = 1 - binom.pmf(0, M, p_assign)
    # Conditional probability distribution
    conditional_probs = binom_probs / prob_at_least_one
    # Expected value calculation
    return np.sum(k_values * conditional_probs)
```

</details>

$$
SlashedStake_{t1} = E[X \mid X>0] \cdot S_{index}
$$

The risk profile increases only for smaller $N$ values. So, ***what is $N$***? The number of indexes per host or per operator? Let's postpone this question to a later section.

### Subsequent slashing events modeling

Assuming the operational error is eventually corrected, how much stake is slashed during a finite amount of time (**t2** - **t1**)?

Every epoch, each not-yet-slashed index is randomly assigned a slot in the epoch. Then each slot has some probability $p_{div}$ of causing diverging views. Since $p_{div}$ is quite low, it's likely that an entire epoch passes without any slashable event.

![image](https://hackmd.io/_uploads/HkDJIBzHa.png)

In contrast to the initial slashing event, here we can consider all values, including the case where zero validators are assigned to a slot. Intuitively, if an operator has few indexes, it has a higher likelihood of having zero indexes assigned to slots with diverging views.
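To put numbers on that intuition, here is a minimal sketch of the chance that a given slot contains none of an operator's indexes. It assumes the same independent, uniform $1/32$ assignment used in the binomial model above; the function name `prob_no_index_in_slot` is hypothetical and only used for illustration.

```python
def prob_no_index_in_slot(n_keys: int, slots_per_epoch: int = 32) -> float:
    """Probability that none of n_keys indexes is assigned to one given slot.

    Assumes each index lands in a slot independently and uniformly,
    which is a simplification of the real committee shuffling.
    """
    return (1 - 1 / slots_per_epoch) ** n_keys


if __name__ == "__main__":
    # Small operators often have no index in a divergent slot;
    # large operators almost always have at least one.
    for n_keys in (1, 32, 128, 1024):
        print(f"N={n_keys}: P(no index in slot) = {prob_no_index_in_slot(n_keys):.4f}")
```

Under these assumptions the probability falls off geometrically with $N$, so an operator with a single index almost always escapes a given divergent slot, while an operator with hundreds of indexes almost never does.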
The expected value of the incident cost can be computed for an incident duration of $M$ slots and a stake per index of $S_{index}$:

$$
SlashedStake_{t2-t1} = M \cdot p_{div} \cdot \frac{1}{32} \cdot N \cdot S_{index}
$$

An operator with total stake $S_{operator}$ that distributes equal stake to each index runs $N = S_{operator} / S_{index}$ indexes, so

$$
SlashedStake_{t2-t1} = M \cdot p_{div} \cdot \frac{1}{32} \cdot S_{operator}
$$

### Total cost of the incident

The total cost of the incident, with fixed $S_{operator}$ and $p_{div}$, is a function of two parameters:

- Reaction time to alerts $M = t2-t1$, which decides $SlashedStake_{t2-t1}$
- Consolidation factor $c = S_{operator} / N / 32$, which decides $SlashedStake_{t1}$

$$
SlashedStake_{total}(M,c) = SlashedStake_{t1}(c) + SlashedStake_{t2-t1}(M)
$$

## Key distribution

> Is $N$ the number of keys per host, or the number of keys per operator?

From the previous section [Initial slashing modeling](#Initial-slashing-modeling), as long as consolidation keeps $N \geq 128$, the expected value of the initial slashing exceeds the proportional share $N/32$ by less than 2%, which is negligible. If $N$ can be considered as the number of keys per operator, much higher regimes of consolidation are possible without increasing risk.

![image](https://hackmd.io/_uploads/rkExIBfHp.png)

Can both cases be considered equal for the slashing risk modeling done above? Some factors:

1. Do all hosts experience network divergence equally?
2. Do operational errors affect multiple hosts at once?
3. Does a slashing alert response correct operational errors on all hosts at the same time?

**Do all hosts experience network divergence equally?** TODO needs more research, but probably yes.

**Do operational errors affect multiple hosts at once?** Yes. Operators with large numbers of keys run (sets of) hosts via automation tools, typically with a specific provider. So it's expected that operational errors affect more than one host.

**Does a slashing alert response correct operational errors on all hosts at the same time?** Most likely yes.
The typical fast response to operational errors is to stop all suspected instances. Operators usually have equal access to all infrastructure, so they have the capacity to stop the entire fleet.

## Historical slashing incidents

TODO

- Launchnodes Slashing Incident (October 11 2023) [post-mortem](https://blog.lido.fi/post-mortem-launchnodes-slashing-incident/)
- RockLogic GmbH Slashing Incident (April 13 2023) [post-mortem](https://blog.lido.fi/loe-rocklogic-gmbh-slashing-incident/)
- Gateway.fm Slashing Incident (Gnosis chain) (July 31st 2023)