## Goals

* Clarify how to deal with messages from the past.
* Make sure a node cannot be spammed with messages apparently from the future.
* Make sure the mechanism to deal with the previous point does not lead to punishing honest (but maybe slightly lagging) nodes.

## Current status wrt routing of shards

* As soon as a commitment is published on L1, all available shards are published on the DAL. Note that the commitment is not final yet.
* Any valid message is routed.
* Besides `Valid` and `Invalid`, there is the `Unknown` validity status of a message in Gossipsub, which is used as follows:
  * the message is not routed (as opposed to the `Valid` case)
  * the peer is not punished (as opposed to the `Invalid` case)

```ocaml
type message_id = {
  commitment : Cryptobox.Commitment.t;
  level : int32;
  slot_index : int;
  shard_index : int;
  pkh : Signature.Public_key_hash.t;
}

let valid message message_id =
  let {share; shard_proof} = message in
  let {commitment; shard_index; _} = message_id in
  let shard = Cryptobox.{share; index = shard_index} in
  Cryptobox.verify_shard commitment shard shard_proof
```

## General remarks

Assumptions:

* Tenderbake is used, together with its assumptions, e.g.
  * clocks are synchronized

### Remark on the sync status of the L1 node

The DAL node could (and probably should) check the sync status of the L1 node. Recall [the rules](https://tezos.gitlab.io/shell/sync.html#formal-description-of-the-heuristic):

* The status is `Synced` if there are more than `threshold` peers with heads that are more recent than `latency` (a head is recent if its timestamp is bigger than `now - latency`).
* The status is `Stuck` if the `threshold` most recent peer heads have the same timestamp that is older than `latency`.
* The status is `Unsynced` otherwise.

The default values are `threshold = 4` and `latency = 150` (should probably be at least halved).

## Solution with delaying propagation by one level

The main idea is to publish and route a message only if the corresponding commitment is **published and final**. This means delaying propagation overall by one level.

<!-- Side remark: If we only publish under this constraint, but we don't forward under the same constraint, that is, if we allow forwarding messages considering we're in the past, then we have to deal with the same problems as in the next solution with no delaying. -->

This solution allows removing the commitment from the message id, because we can recover it from the store.

This solution allows for a *strong validity* check: we not only check that the `shard` is valid with respect to its commitment (with `verify_shard`), we also check that the shard corresponds to a published commitment. Without delaying propagation it is not possible to have this strong validity check.

The main downside of this approach, besides the overall increased latency, is that a slightly lagging peer that is also "critical" to the topology [TODO: I need to find the technical term; e.g. a node whose disconnection splits the network] may delay the whole network. Indeed, in contrast to the immediate propagation strategy (see the next "solution"), this peer will first wait for the necessary L1 block to validate the message, and only propagate the message afterwards. On the other hand, gossipsub should ensure that there are no such "critical" peers.
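As an illustration, here is a minimal OCaml sketch of the strong validity check under this solution. `Store.find_commitment` is a hypothetical helper (name and signature are illustrative) returning the commitment of the slot header published and finalized at the given level and slot index, if any; the `message` and `message_id` records are as in the code above (minus the commitment field).

```ocaml
(* Sketch only: [Store.find_commitment] is a hypothetical helper returning the
   commitment published (and final) at [level] for [slot_index], if any. *)
let strongly_valid store message message_id =
  let {share; shard_proof} = message in
  let {level; slot_index; shard_index; _} = message_id in
  match Store.find_commitment store ~level ~slot_index with
  | None ->
      (* No published and final commitment for this level/slot: do not route. *)
      `Unknown
  | Some commitment ->
      let shard = Cryptobox.{share; index = shard_index} in
      if Cryptobox.verify_shard commitment shard shard_proof then `Valid
      else `Invalid
```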
### Messages in the future

**Main idea**: To deal with messages from the future, a node keeps, per peer, a bounded set of messages in the future, which it only forwards when the relevant information from the L1 node becomes available.

**Details**. Say node A is at level `n`, and it receives a message for level `n' >= n`. Node A cannot check its validity. We call such a message "a message in the future" for A.

First, A checks whether the message is plausibly valid: node A computes `t'_min`, the minimal timestamp of a block at level `n'`, by assuming all blocks strictly between `n` and `n'` have round 0. If `t'_min - now > clock_drift`, the message is invalid: the referred block would not have been accepted by the sender's L1 shell. (`clock_drift` is a shell constant, currently 5 sec.)

If the message passes this check, the node puts it in a bounded cache. Several elimination policies (activated when the cache is full) are possible, for instance:

1. newest messages have priority
2. messages with a higher level have priority
3. messages with a lower level have priority

We suggest using policy 3 because messages with a lower level are the first whose validity can be checked, and therefore this policy allows freeing the cache quicker. (A sketch of this check and cache is given after this subsection.)

Second, when a new final block becomes available, the node checks the relevant messages in each cache for validity; it forwards the valid ones, and punishes the peer if there are invalid ones.

**A reasonable bound for the cache**. In the best case, the cache should be able to hold all possible messages for at least one level. Estimating the number of messages per level can be done based on peer A's and the remote peer's subscribed topics (TODO: do the computation and the size estimation).
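A minimal sketch of the plausibility check and of a per-peer bounded cache implementing policy 3, under the assumptions above. Timestamps are represented as plain `int64` seconds; `minimal_block_delay`, `clock_drift`, the head level/timestamp and the cache bound are illustrative parameters.

```ocaml
(* Sketch only: plausibility check for a message at [msg_level], received while
   our head is at [head_level] with timestamp [head_timestamp] (seconds). *)
let plausibly_valid ~now ~head_level ~head_timestamp ~minimal_block_delay
    ~clock_drift ~msg_level =
  (* Minimal timestamp of a block at [msg_level], assuming every block strictly
     between [head_level] and [msg_level] is at round 0. *)
  let levels_ahead = Int64.of_int32 (Int32.sub msg_level head_level) in
  let t'_min =
    Int64.add head_timestamp (Int64.mul levels_ahead minimal_block_delay)
  in
  Int64.compare (Int64.sub t'_min now) clock_drift <= 0

(* Bounded per-peer cache implementing eviction policy 3: lower levels have
   priority, since they are the first whose validity can be checked. *)
module Level_map = Map.Make (Int32)

type 'msg future_cache = {bound : int; mutable entries : 'msg list Level_map.t}

let add_future_message cache ~level msg =
  let size =
    Level_map.fold (fun _ msgs acc -> acc + List.length msgs) cache.entries 0
  in
  let evict_or_reject () =
    match Level_map.max_binding_opt cache.entries with
    | Some (highest, msgs) when Int32.compare level highest < 0 ->
        (* Make room by dropping a message from the highest stored level. *)
        cache.entries <-
          (match msgs with
          | _ :: (_ :: _ as rest) -> Level_map.add highest rest cache.entries
          | _ -> Level_map.remove highest cache.entries);
        true
    | _ ->
        (* The new message is at the highest level: reject it instead. *)
        false
  in
  if size < cache.bound || evict_or_reject () then
    cache.entries <-
      Level_map.update level
        (fun msgs -> Some (msg :: Option.value msgs ~default:[]))
        cache.entries
```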
### Messages in the past

Note on GS: the cache stores message ids for the last `seen_message_slots` (currently `120`) heartbeats. This means two things:

* (re)sending a message "older than 120s" is not accounted for by GS; it's like a "fresh" message
* when (re)starting the DAL node, the cache is empty: all messages sent in the first 120 heartbeats are considered "fresh", no matter how "old" they actually are.

This suggests that the DAL node should deal itself with "old" messages. How?

Main question before going further: how old can a message be to be still considered worth propagating?

Some remarks:

* if messages older than `seen_message_slots` are allowed to be propagated, then a peer can easily "flood" the network with such older messages
* mitigation: allow a few such messages, but the older they are the fewer should be allowed (if the limit is exceeded, punish the peer); solution: keep a counter of such received messages per period of length `seen_message_slots`; need to see what the max per window is, and for this need to see what the max per first (normal) period is; we could also group by topic
* do not send messages if the L1 node is not synced, as otherwise a peer could be punished by the above mechanism
* TODO: the parameters need to match

We next present two orthogonal approaches; they can be used together.

#### Minimalistic approach

**Main idea**. The DAL nodes only propagate "fresh" messages, those that belong to the current attestation window. Said otherwise, with this approach the role of the DAL network is just to propagate recent shards, so as to ensure publication of shards, but not necessarily their retrievability, which could be ensured instead by specialized DAL nodes through RPCs.

This approach seems sufficient in theory: if attestors are honest, and there's a single honest player that stores published data, then this honest player can make the data available. However this approach does not seem ideal, as it would be better to ensure that as many nodes as possible can store slot data even if it was not available to them at publication time. Also, it's not an open, dynamic approach: one needs to keep track of RPC endpoints, trust them, etc.; that's what the p2p network aims to solve.

**Side note on Ethereum**: Ethereum uses (through "domains", implemented by different libp2p protocols; see [protocol ids](https://docs.libp2p.io/concepts/fundamentals/protocols/#protocol-ids) and [binary streams](https://docs.libp2p.io/concepts/fundamentals/protocols/#binary-streams)) two different protocols to exchange fresh versus old messages:

- [gossipsub](https://github.com/ethereum/consensus-specs/blob/dev/specs/phase0/p2p-interface.md#the-gossip-domain-gossipsub) for fresh messages
  - Ethereum GS uses a heartbeat interval of 0.7 seconds, and a `seen_ttl` of 550 heartbeats, that is 385 seconds, which is 1 sec greater than the duration of an epoch (32 slots * 12 seconds per slot). Only messages referring to a slot after the last finalized slot are propagated (which, IIUC, are usually from the last epoch, but could be from older epochs if there are finality issues). Older messages are ignored.
- [Req/Resp](https://github.com/ethereum/consensus-specs/blob/dev/specs/phase0/p2p-interface.md#the-reqresp-domain) for old messages; there's still a limit `MIN_EPOCHS_FOR_BLOCK_REQUESTS` related to weak subjectivity, so a bit like the checkpoint 5 cycles in the past in Tezos.

So an Ethereum node is a bit like a Tezos node running in the rolling history mode, except that Tezos nodes exchange the above data using the same p2p protocol.

**Details**. We put the following constraints on the L1/DAL/GS parameters:

* `seen_ttl` is a bit higher than `attestation_lag * minimal_block_delay`
* `sync.latency` is higher than `seen_ttl`.

We add the rule that if the DAL node is not synced, then it does not publish, nor forward messages.

Under this approach, we fix two constants `a < b` used as follows (a sketch is given after this list). Say node A is at level `n`:

* A message from levels `n - a` to `n-1` is forwarded, where `a := attestation_lag - 1`. If the message is from `n-a` and receiver C is an attestor also at level `n`, then the message is useful when C builds its attestation. (If the message is from a lower level, then the message is not useful in this sense anymore.)
* The peer sending a message with a level smaller than `n-b` is punished, where `b` is say `3 + sync.latency / minimal_block_delay`. The peer should not have sent this message in the first place, see above.
* A message from levels `n-b` to `n-a-1` is ignored.
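A minimal sketch of this level-based classification, assuming the rules above; `attestation_lag`, `sync_latency` and `minimal_block_delay` stand for the corresponding protocol/shell parameters and the names are illustrative.

```ocaml
(* Sketch only: classify a received message by its level, following the rules
   above. [head_level] is node A's current level. Messages from the future
   (msg_level > head_level) are handled separately, see "Messages in the
   future". *)
type past_classification = Forward | Ignore | Punish

let classify_past ~attestation_lag ~sync_latency ~minimal_block_delay
    ~head_level ~msg_level =
  let a = attestation_lag - 1 in
  let b = 3 + (sync_latency / minimal_block_delay) in
  let levels_behind = Int32.to_int (Int32.sub head_level msg_level) in
  if levels_behind <= a then Forward (* levels n-a .. n-1 (or the current level) *)
  else if levels_behind <= b then Ignore (* levels n-b .. n-a-1 *)
  else Punish (* older than n-b: should not have been sent *)
```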
#### Maximalistic approach

**Main idea**. Here the goal is to allow propagating non-fresh data on the p2p network. There still needs to be a bound on how old a message can be. For instance, for the optimistic rollups use case, such a bound could be given by the size of the refutation window (i.e. around 2 weeks). Because we cannot keep track (with the message cache and the `seen_ttl` parameter) of all exchanged messages in such a big time window, and also to keep the bandwidth reasonable, we cannot use gossipsub as is. Instead we introduce a new message `IWantOld` that allows peers to ask for specific messages from the past.

Its payload is the same as `IWant`, and it is handled similarly, but not by the GS; instead by the application, as follows:

* because the message would not (normally) be in the GS cache, the GS asks the application for validity
* the message is valid if
  * the level is within the desired bounds (e.g. between `n-a-1` and `n-c`, where `n` is the current level and `c` some constant)
  * the commitment was published and matches the shard
  * the peer has not exceeded the upper bound on the number of `IWantOld` requests per level or time window, which can be set to say 10% of all messages the peer could be interested in
* If the message id is valid and the shard is present in the node's store (`Old_and_valid msg`), then the message `msg` is sent to the requesting peer.

In addition to `IWantOld`, a node could also attempt to reconstruct missing shards, and then send `IHaveOld` messages to the peers in its topic meshes. The workflow could be as follows:

- for each level between `n - attestation_lag` and `n-c`
  - if the number of shards in the store is sufficient to reconstruct, then reconstruct and send an `IHaveOld` message to all peers in the relevant mesh with the ids of the previously missing shards
  - otherwise, randomly select a number of ids of missing shards (such that the upper bound fixed above is not exceeded) and send `IWantOld` messages

## Solution without delaying propagation

### Main ideas

Whenever we receive a message whose current validity is plausible and cannot be checked, put it in a pool of such messages (bounded per peer), and reconsider their validity when new information is available.

Additionally (and independently), we can specify a time window in which we consider a message as valid. Messages that are too far in the past would be considered invalid.

The following "solution" does not consider these two aspects independently. TODO: improve that

In contrast to the previous solution, a message cannot be considered invalid when the commitment is not published (even when this can be checked) because a peer publishes messages whose commitments may not be published in the end (they are published in the current block, but this block may not be agreed upon in the end).

### Description

Let `delta` be a small time value representing the network latency, say 5 seconds. Let `aw := attestation_lag * time_between_blocks` (`aw` for "attestation window"). We define two/three parameters depending on `aw`, with the following intuitions:

- `w0`: if the "message delay" (to be defined later) is within `[0,w0]`, then the message is timely, it's good to forward; e.g. `w0 := aw - delta`
- `w1`: if the message delay is within `(w0,w1]` then the message is late and not useful, do not forward it; if the message delay is above `w1` then the message should not have been sent in the first place, punish the peer; e.g. `w1 := 2 * aw`

(A sketch of this classification is given at the end of this section.)

Note that if a shard is received `aw` seconds after its publication time, then, assuming the chain advanced normally, the shard cannot be considered "available", in the sense that it will not be taken into account when the attestor attests the availability of the corresponding slot.

We extend the `message_id` type with the *round* of the block containing the published slot header. We could also introduce the *predecessor round* to better deal with one particular case (see case `n=n'` below)!

Also introduce a `Useless` validity status besides `Unknown`: it means we know that the message is valid, but it is just too old. It has the same effect as `Unknown` (do not route, do not punish); it's just used for clarity of the terminology (we could merge the two into `Useless_or_unknown`).
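A minimal sketch of the delay-based classification, assuming `w0` and `w1` are derived from `aw` and `delta` as above; the message delay is `now - t'`, with `t'` the (estimated) timestamp of the block the message id refers to. Names and the use of `int64` seconds are illustrative.

```ocaml
(* Sketch only: classify a message by its delay [now - t']. Negative delays
   (messages from the future) are handled separately, see the Routing cases. *)
type time_validity = Valid | Useless | Invalid

let classify_by_delay ~aw ~delta ~now ~t' =
  let w0 = Int64.sub aw delta in
  let w1 = Int64.mul 2L aw in
  let message_delay = Int64.sub now t' in
  if Int64.compare message_delay w0 <= 0 then Valid (* timely: forward *)
  else if Int64.compare message_delay w1 <= 0 then
    Useless (* late: do not forward, do not punish *)
  else Invalid (* should not have been sent: punish *)
```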
**Ideal property**: Below we'll define a time-based validity check, that should (ideally/hopefully) hold (or not) at a given time independently of (the head of) the peer that checks it.

### Publishing

If the head's timestamp `t` is smaller than `now - w0`, then do not publish the message, as it will not be useful and we risk being punished (see next section).

### Routing

Let A be some node, with its head at level `n`, round `r` and timestamp `t`. Node A receives a shard from node B with level/round `(n',r')`.

#### Case `n' < n`

Given the timestamp of the block at level `n'` in A's chain, A can determine the timestamp `t'` of the `(n',r')` block. Recall that the block at level `n-2` is final, so the timestamps of all blocks for levels up to `n-1` are completely determined by their round. Let `pr` be the round of the block on A's chain at level `n-1`. Let `Delta := now - t'`. Make the following case distinction (a sketch is given after the case analysis):

* If `t' > pt + delta`, where `pt` is the timestamp of the block at level `n'+1` on A's chain, then consider the message as invalid. There was probably no such block proposed, because the proposer should have seen first the block at level `n'+1` on A's chain, which has a higher fitness. (NB: this takes care of cases where `r'` has large values.)
* If `Delta > w1` then declare the message invalid: it is too old, it should not have been sent.
* If `r' <> pr`, then consider the message as "useless"; maybe there was such a block, but it was not agreed upon.
* If `r' = pr`, then ~~check that the commitment was published. If not, the message is invalid. Otherwise,~~
  * If `Delta in (w0,w1]`: consider the message as useless.
  * If `Delta in (0,w0]`: consider the message as valid.

TODO: Check that if `Delta <= 0` then we're in the first case.

#### Case `n' = n`

Let `r0` be the round the L1 node is at, based on `now`. We have `r <= r0`.

* If `r' = r`, then ~~check whether the commitment is published. If not, the message is invalid. If it is, then~~ classify according to `Delta := now - t` as before.
* Otherwise, note that we cannot know for sure the timestamp `t'` of the block the message id refers to. TODO: we need to make an analysis based on an estimate of `t'`, similar to the case `n' > n` below.
  * Can the message be too old??
  * If `t'_min > now + clock_drift`, where `clock_drift` is some tolerated clock drift (see the corresponding shell constant; currently 5s?), then we consider the message as invalid.
  * Otherwise, informally, we're late and we should assume B is honest (unless proven otherwise).

#### Case `n' > n`

Node A is late or spammed. We don't know `t'`, but we can compute `t'_min`, the minimal timestamp of the block corresponding to B's shard, by assuming all blocks strictly between `n` and `n'` have round 0.

As above, if `t'_min - now > clock_drift`, the message is invalid: the referred block is in the future, it would not have been accepted by B's L1 shell.

Otherwise, we forward based on a similar classification as for the past, looking at `t'_min - t` (note that `t < now`). In particular, we don't forward if the difference is big (the difference `t' - t` is even bigger), because:

* either B is honest and thus A is indeed late; then our message will be useless or we'll get punished
* or B is spamming us, there's no need to flood the network

TODO: complete the classification
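A minimal OCaml sketch of the decision tree for the case `n' < n`, reusing the `time_validity`-style bounds `w0`/`w1` from the previous sketch; `t'`, `pt`, `pr` and `delta` are assumed to be computed from node A's chain as described above, and all names are illustrative.

```ocaml
(* Sketch only: routing decision for a message referring to a level n' < n.
   [t'] is the timestamp of the (n', r') block, [pt] the timestamp of the block
   at level n'+1 on A's chain, [pr] the reference round on A's chain (as
   defined above), and [delta] the assumed network latency. *)
type routing_decision = Route | Drop_useless | Punish_invalid

let route_past_message ~now ~t' ~pt ~pr ~r' ~delta ~w0 ~w1 =
  let big_delta = Int64.sub now t' in
  if Int64.compare t' (Int64.add pt delta) > 0 then
    (* The proposer should have seen the higher-fitness block at level n'+1. *)
    Punish_invalid
  else if Int64.compare big_delta w1 > 0 then
    (* Too old: the message should not have been sent. *)
    Punish_invalid
  else if r' <> pr then
    (* Maybe such a block existed, but it was not agreed upon. *)
    Drop_useless
  else if Int64.compare big_delta w0 > 0 then Drop_useless (* Delta in (w0,w1] *)
  else Route (* Delta in (0,w0] *)
```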
#### Bounded pool of Unknown messages per peer

This is somehow similar to what is presented in the section [Messages in the future](#Messages-in-the-future).

Given 1 min max lag, 3s per round, and 1s increment, we have at most `k` rounds per level, with `k` such that `3 + ... + k = k*(k+1)/2 - 3` is around 60, which gives 10-11 rounds at most in 1 min (`k = 10` gives 52 sec, `k = 11` gives 63 sec). With current Mainnet parameters, `k = 3`.

So, for each peer, we'd need to store `num_slots * num_shards * attestation_lag * k` commitments. This means for example `256 * 2048 * 4 * 3 * 48` bytes, that is, around 300MB per peer (in the topic mesh). However, we can put a smaller bound on the pool of commitments to be stored. We simply do not store, nor forward, once this bound is reached.

Note also: for a *valid* message, the worker first forwards the message on the p2p and then notifies the app, which in turn stores the shard (TODO: but currently not the commitment, right?).

Using this pool, we can also check that a peer does not send us two different messages referring to the same `(level,round,slot_index,shard_index)`. If it does, we punish it.

## Other solutions/remarks

### Add-on: also send Head messages

DAL nodes would also send L1 heads on the p2p layer. This would hint (but not prove) that there exists a block with the same level and round as those of the received message. It also allows knowing the timestamp of the block, so there is no need to estimate it. However, this does not ensure that the commitment was indeed published in the given block.

Note that to prove that the block really exists one would need to check the baker's rights and its signature, and also the endorsements (for the next block...).

### Add-on: also send Merkle proofs of inclusion of Publish ops

Idea: send, along with a shard's commitment, the Merkle proof that the commitment was published, that is, that the corresponding publish operation belongs to the referred block's set of operations.

TODO (clarify): We also need the block head, right? The Merkle proof ensures inclusion in some set of operations, but we need to also make sure that this set of operations is that of a valid block.

Pros: In this way we can check that a shard "from the future" is valid (it was indeed published).

Cons: The size of the message id becomes prohibitively big?

### On the presence of the commitment in the message id

This seems important. Without it we cannot even check that the shard corresponds to some commitment. This means that a peer can easily spam with invalid shards in the future (in case we admit them in principle). Note that we can retrieve the commitment stored at the given level, but if there is no stored commitment or the commitment does not match, we cannot conclude anything.

One could say that checking the commitment/shard match does not actually bring anything given that fake pairs can be easily generated (can they?). However, storing commitments does prevent sending multiple ones for the same block, slot index and shard index; and it takes significantly less space than storing shards.
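To make this last point concrete, here is a minimal sketch of the per-peer duplicate detection enabled by storing commitments, as mentioned in the bounded-pool section above; the table, its key type and the `equal` parameter are illustrative.

```ocaml
(* Sketch only: per-peer table of seen commitments, keyed by
   (level, round, slot_index, shard_index). Receiving two different
   commitments for the same key from the same peer is punished. *)
module Key = struct
  type t = int32 * int * int * int (* level, round, slot_index, shard_index *)

  let compare = Stdlib.compare
end

module Key_map = Map.Make (Key)

type 'commitment peer_state = {mutable seen : 'commitment Key_map.t}

let check_no_equivocation ~equal peer_state ~key ~commitment =
  match Key_map.find_opt key peer_state.seen with
  | Some previous when not (equal previous commitment) ->
      `Punish (* two different commitments for the same shard id *)
  | _ ->
      peer_state.seen <- Key_map.add key commitment peer_state.seen;
      `Ok
```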