# On the DoS resistance of sparse blobpool design
In the following we will explore the DoS resistance of the sparse blobpool, laying out weaknesses of the current design as well as suggesting potential fixes.
We begin by noting that the sparse blobpool design is essentially a way to achieve vertical sharding of the blobpool, and its main DoS resistance mechanism is built on the following three principles:
1. A node only propagates a blob if it sees it as available.
2. Provider nodes check availability trivially.
3. Sampling nodes rely on the availability signal they receive from their neighboring providers.
Fundamentally, the sparse blobpool design hinges on provider nodes to propagate blob data through the network. As a result, malicious behavior by providers carries greater weight than malicious behavior by nodes in today's blobpool. This is true for their neighboring provider nodes, but even more so for their neighboring sampling nodes, as the latter cannot independently determine availability.
This opens up the possibility of attacks on individual sampling nodes by successfully signaling availability of a blob that is otherwise (for the rest of the network) unavailable. An example of such an attack has been highlighted by [Bosul](https://hackmd.io/@yZzECM-fQt2ePeYIwC4CUw/ByufNL90le). In this attack a malicious party connects a number of peers $A_1,...,A_k$ to a victim node $V$. By signaling availability of blob $B_1$ to $V$ exclusively, if $V$ ends up sampling $B_1$, it will end up holding a blob in its local blobpool that no other node does. This is because $B_1$ will *not* propagate through the network, as all other nodes do not see it as available. This can be repeated at no cost to the attacker (see details in Bosul's note), eventually filling the local blobpool of $V$ with blobs that will never end up on chain.
It should be noted that the same attack can be performed in today's blobpool, albeit requiring control over all of $V$'s peers.
## Availability signal
We can think of at least two ways to make such targeted attacks more difficult. Fundamentally, the problem is that availability and propagation of a blob are determined by a small subset of provider nodes, making targeted attacks on individual nodes easier than in today's blobpool.
Thus, one way of addressing this is to introduce further communication between nodes regarding availability. In the current sparse blobpool design, a node has no information about whether its peers see a given blob as available unless they take on the role of a provider. We can change this by introducing an additional "availability" message exchange between nodes, as follows:
- Upon fetching/sampling a new blob, a provider/sampler node announces the transaction to its peers (*including* those from which it received the tx) via `NewPooledTransactionHashes`. Receiving such an announcement from a given peer is then interpreted as a signal that this peer sees the blob as available.
- Each node holds `availability_of_peers={"P_1": (1,0,1,1,...,1), "P_2": (1,1,...,0),...,"P_N":(0,1,1,...,1)}` where `availability_of_peers["P_j"][k]=0/1` indicates whether the $j^{th}$ peer has announced availability for the $k^{th}$ blob.
- Based on this a node can choose to keep or drop a given blob. This can be done via a (potentially weighted by peer-score) majority vote.
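The bookkeeping and the (optionally peer-score-weighted) majority vote described above can be sketched as follows. This is a minimal illustration, not part of the design: all names (`AvailabilityTracker`, `peer_scores`) and the weighting scheme are assumptions.

```python
# Sketch: per-peer availability bookkeeping and a weighted majority vote.
# All names and the weighting scheme are illustrative assumptions.

class AvailabilityTracker:
    def __init__(self):
        # availability_of_peers[peer_id] is the set of blob hashes this
        # peer has announced via NewPooledTransactionHashes.
        self.availability_of_peers = {}

    def record_announcement(self, peer_id, blob_hash):
        self.availability_of_peers.setdefault(peer_id, set()).add(blob_hash)

    def should_keep(self, blob_hash, peer_scores):
        """Weighted majority vote: keep the blob iff peers whose combined
        weight exceeds half of the total weight have announced it."""
        total = sum(peer_scores.values())
        if total == 0:
            return False
        available = sum(
            score for peer, score in peer_scores.items()
            if blob_hash in self.availability_of_peers.get(peer, set())
        )
        return available > total / 2


tracker = AvailabilityTracker()
scores = {"P_1": 1.0, "P_2": 1.0, "P_3": 1.0}
tracker.record_announcement("P_1", "0xabc")
tracker.record_announcement("P_2", "0xabc")
print(tracker.should_keep("0xabc", scores))  # 2 of 3 equal-weight peers -> True
```

With uniform scores this reduces to a plain majority vote; non-uniform scores let high-trust peers carry more weight, as suggested above.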
With this mechanism in place, an attack based on establishing differing views of a blob's availability becomes significantly more challenging. Concretely, the attacker would need to control sufficiently many of the victim's peers to force the outcome of the majority vote.
We note that while this is an improvement over the current proposal, it does not fully reduce to the situation in today's blobpool, where the attacker would need to control *all* of the victim's peers.
## Enforcing propagation
The mechanism described above is based on the principle of *dropping* blobs that appear unavailable within the node's neighborhood.
Another, potentially complementary mechanism we can use is based on the principle of *enforcing propagation* of such blobs. This can be done as follows:
- Introduce a peer-scoring mechanism and have the availability signal of peers count towards it.
- Modulate the probability $p$ of being a provider by the score of the peer announcing a blob. If the score is high, i.e., we have high confidence in the peer, keep $p=0.15$. If the score is low, increase $p$ accordingly.
We can think of $p$ taking continuous values or simply $p \in \{0.15, 1\}$, depending on whether the peer score is above or below some threshold. The logic here is very simple: if a peer has a low score, you cannot trust that its claim of availability is valid, and thus you should be less inclined to merely sample. Concretely, if your peers' availability signals indicate unavailability of a given blob, you could switch roles from sampler to provider in order to check availability explicitly and enforce propagation of the blob, if it is available.
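The threshold variant of this role-selection rule can be sketched as follows. The baseline $p=0.15$ comes from the text above; the score threshold and function names are illustrative assumptions.

```python
# Sketch of score-modulated provider probability (threshold variant).
# SCORE_THRESHOLD is an illustrative cutoff, not a spec value.
BASE_PROVIDER_PROB = 0.15  # p for trusted peers, per the design above
SCORE_THRESHOLD = 0.5      # assumed trust cutoff

def provider_probability(peer_score):
    """Fully verify (p = 1) blobs announced by low-score peers;
    otherwise keep the baseline provider probability."""
    return BASE_PROVIDER_PROB if peer_score >= SCORE_THRESHOLD else 1.0

print(provider_probability(0.9))  # 0.15 -- trusted peer, sample as usual
print(provider_probability(0.1))  # 1.0  -- untrusted peer, always verify
```

A continuous variant would interpolate between $0.15$ and $1$ as the score decreases, rather than switching at a hard threshold.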
## From local attacks to global DoS
The most important question at this point would be whether or not such targeted attacks on nodes can in fact lead to DoS on system level.
The answer is "yes, but only if the attacker controls a relatively large fraction of nodes". Concretely, assuming an attacker that controls a majority of the nodes, it is not difficult to see that they could use the attack discussed above to impose differing local blobpools on each node. As a result we have essentially no useful blobpool pre-propagation, which we interpret as a successful DoS attack on the system. What is *not* possible is the type of DoS attack in which an unavailable blob is sent and, by virtue of the sharding mechanism, propagates through the network unhindered (this can happen, for example, in the most naive form of vertical sharding, where nodes have no way to verify availability).
Let $k$ be the threshold number of peers a node needs to observe announcing a blob before it fetches it. It is clear then that a malicious party can make a target node fetch some blob $B_1$ by controlling $k$ of its peers.
With the addition of the "availability signal mechanism" discussed above, the target node will keep $B_1$ iff a majority of its peers signal availability for it.
In order for the malicious party to achieve this, they can either:
1. Directly control a majority of the target's peers, i.e., $D/2$ in total (where $D$ is the number of peers); OR
2. Propagate $B_1$ to a majority of the target's peers via the same mechanism. This would require controlling $kD/2$ nodes ($k$ for each of the $D/2$ peers).
With 1., in order to place $B_1$ in the blobpool of $M$ nodes, the attacker has to repeat the attack independently for each of them. This roughly means they need to control $MD/2$ peers in total. Note: here we assume that peer sets are essentially random, so the expected overlap between peer sets is small. Let $S_A$ be the peer set of node $A$. If $B$ is in $S_A$, then with high probability the overlap of $S_A$ and $S_B$ consists only of the edge $A-B$. This means the sets of nodes the attacker needs to control for 1. are essentially independent.
With 2., it is clear that we get a nested structure: in order to have $B_1$ remain in the blobpool of those $kD/2$ peers, the attacker has to repeat the same logic for each of them independently. This clearly scales far less favorably (from the attacker's perspective) than 1., so we ignore this strategy here.
So, in order to reach a situation in which a system-size-dependent subset of nodes cannot provide the service, we need $M=f(N)$, and thus the number of nodes the attacker needs to control scales by the same factor.
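A back-of-the-envelope comparison of the two strategies makes the scaling concrete. The parameter values ($D=50$, $k=3$, $M=100$) are purely illustrative assumptions:

```python
# Illustrative attacker cost estimate; parameter values are assumptions.
D = 50   # peers per node (assumed)
k = 3    # announcement threshold before a node fetches (assumed)
M = 100  # number of victim nodes (assumed)

# Strategy 1: directly control a majority of each victim's peers,
# with peer sets assumed essentially disjoint -> M * D/2 nodes.
cost_direct = M * D // 2
print(cost_direct)  # 2500

# Strategy 2: for a single victim, seed the blob into a majority of its
# peers via the same mechanism -- k controlled nodes per seeded peer.
# Already the first nesting level costs more per victim than strategy 1
# (75 vs 25 here), and the cost nests again at every further level.
cost_nested_one_victim = k * D // 2
print(cost_nested_one_victim)  # 75
```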
The logic of this analysis, as well as the attack described by Bosul, is primarily concerned with depleting the local blobpool capacity of target nodes. A far more pressing issue is depleting the target node's bandwidth.
The resources (in terms of the number of nodes) required for such a bandwidth-targeted attack depend on the speed with which availability signals can be propagated and processed. *In the worst case* (i.e., if the delay is large) the resource requirements scale as $Mk$ rather than $MD/2$, which is roughly a factor of 10 worse. To see this, begin with the situation in today's blobpool:
In today's blobpool, a malicious party can send invalid blobs to a node, using up its bandwidth. The target node can check validity instantaneously and drop the attacking peer immediately. This means that in today's blobpool, to run a bandwidth depletion attack successfully, an attacker needs to connect $L$ malicious peers where
$$
L\times \mathrm{blob\ size} \geq \mathrm{target's\ bandwidth }
$$
In the sparse blobpool, a sampling node fundamentally cannot determine the validity of a blob. It therefore has to rely on the external availability signal discussed above in order to decide to drop a malicious peer. The questions are then how large the lag introduced by waiting for the availability signal is, and how many peers $L'$ the attacker needs in this case to achieve its goal, given by
$$
L' \times \frac{\mathrm{blob\ size}}{8} \geq \mathrm{target's\ bandwidth}, \qquad L' \geq k
$$
where the latter condition comes from the fact that the target node only starts fetching once it observes a blob being announced by at least $k$ peers. By the same argument as before, it is easy to see that in order to attack $M$ nodes, the malicious party needs to control $ML'$ nodes. The worst case is of course $L'=k$. This occurs if the lag is very high and the node cannot disconnect from a malicious peer before its bandwidth has been depleted.
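Evaluating the two inequalities above gives the minimum number of attacking peers in each setting. The bandwidth and blob-size figures below are illustrative assumptions, not spec values:

```python
import math

# Illustrative parameters -- all values are assumptions.
blob_size = 128 * 1024   # bytes per blob (assumed)
bandwidth = 1_250_000    # target's bandwidth budget, bytes/s (assumed)
k = 3                    # announcement threshold (assumed)

# Today's blobpool: L * blob_size >= bandwidth
L = math.ceil(bandwidth / blob_size)

# Sparse blobpool: L' * (blob_size / 8) >= bandwidth, with L' >= k
L_prime = max(k, math.ceil(bandwidth / (blob_size / 8)))

print(L, L_prime)  # 10 77
```

Note that this is the optimistic (low-lag) case: each attacking peer only costs the target a fraction of a blob's worth of bandwidth before it can be dropped, so more peers are needed. In the high-lag worst case the constraint collapses to $L'=k$, as discussed above.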
It is therefore important to choose an availability signal that does not introduce too much lag. For example, rather than waiting for all peers' availability messages before taking the majority vote, we might only wait for a predefined time window (determined by an upper limit on tolerable lag) and base the availability decision on the information available by then.
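Such a time-windowed vote can be sketched as follows; the window length and function names are illustrative assumptions:

```python
# Sketch: majority vote over announcements received within a time window.
# The window length is an illustrative assumption.
VOTE_WINDOW = 0.5  # seconds; assumed upper limit on tolerable lag

def decide_availability(announcements, num_peers, window=VOTE_WINDOW):
    """Majority vote counting only announcements that arrived in time.
    `announcements` is a list of (arrival_time, peer_id) pairs, with
    arrival times measured from when the blob was first fetched; peers
    that have not announced by the deadline do not count as available."""
    in_time = {peer for t, peer in announcements if t <= window}
    return len(in_time) > num_peers / 2

# With 5 peers, three timely announcements carry the vote...
print(decide_availability([(0.1, "P_1"), (0.2, "P_2"), (0.3, "P_3")], 5))  # True
# ...while announcements arriving after the window are ignored.
print(decide_availability([(0.1, "P_1"), (0.9, "P_2"), (1.2, "P_3")], 5))  # False
```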
## Summary
To summarize, we can establish a certain degree of DoS resistance in the sparse blobpool through a mechanism that aims at establishing a universal view of the blobpool. As we have seen, the weak point in achieving this is the sampling nodes. Since sampling nodes cannot decide on the availability of blobs on their own, they are susceptible to attacks by malicious provider nodes. We provided two mechanisms by which this effect can be alleviated. In both, sampling nodes receive availability signals from their peers and decide availability on that basis. If a blob is deemed unavailable, the node can either:
- Drop the blob from its local blobpool.
- Enforce propagation by switching its role to provider.
In both approaches, a node needs a way to "penalize" and defend itself against malicious peers. To achieve this, we require a form of peer-scoring mechanism on the basis of which nodes can determine their trust in a given peer. This can be used to selectively increase or decrease the probability of being a sampling node or, in the most extreme cases, to drop a peer entirely.
As we saw, both approaches can in principle be understood as mechanisms for establishing a universal blobpool. However, we still have to investigate how best to incorporate and combine them. Intuitively, leaving it to individual nodes to decide whether to drop a blob or enforce its propagation does not seem viable: while the end goal is the same, the two operate on opposing mechanisms, which can render either one ineffective.
## Unconditional payments
Whether or not the mechanisms described above are viable should be decided on the basis of our required security guarantees. If we deem them insufficient, we propose adding a form of "unconditional payments" mechanism alongside the sparse blobpool mechanism. We will not go into the details of the unconditional payments mechanism here and refer the reader to the various other resources on the topic. However, let us briefly revisit the underlying principle.
The underlying principle of the unconditional payment mechanism is that transactions that get propagated in the blobpool, and thus consume real user resources (bandwidth etc.), have to pay fees. The idea is to take an agnostic stance with respect to the inclusion of the blob in one of the following blocks. A blob that is unavailable, and thus cannot be included, but pays the "unconditional" fee (which can be understood as a fee for blobpool propagation and the corresponding use of resources) is not treated differently from a blob that is available.
Concretely, such an "unconditional" fee transaction, associated with a given blob, can then be propagated in the blobpool in the same way as normal transactions are being propagated today. Thus, the unconditional fee transaction propagates with the same safety guarantees as any transaction in today's mempool.