# PeerDAS v1 without peer sampling, aka SubnetDAS as a stepping stone

Any DAS construction has to implement three components: distribution, sampling and reconstruction. In PeerDAS, distribution happens through GossipSub topics, sampling through req/resp with peers, and reconstruction is left to nodes that participate in all topics.

Out of these phases, sampling seems to be by far the trickiest component to get right, because it involves lots of delicate decisions around timing of queries, timeouts, when/whether to do parallel queries and how much, when/whether to increase the number of samples to allow for some failures, etc. To avoid having such a complex and time-sensitive process in the critical path of consensus, it has already been decided to use peer sampling more as a second layer of security, only employed with a delay on the order of an epoch, outside of the critical path (see [here](https://notes.ethereum.org/sFGCCOhYTjGbVH_lzItJnA#Taking-peer-sampling-out-of-the-critical-path) and [here](https://github.com/ethereum/consensus-specs/pull/3779)). For this to work, we need a sufficiently high custody requirement, such that the distribution phase itself provides meaningful security guarantees. In other words, the construction becomes a hybrid of [SubnetDAS](https://ethresear.ch/t/subnetdas-an-intermediate-das-approach/17169/3) and PeerDAS (distribution = custody = "subnet sampling").

In the last ACDC call, it was proposed to go further, and to consider a stepping stone to PeerDAS which does not use peer sampling at all, and where the custody requirement for full nodes is increased somewhat (e.g. to somewhere between 8 and 16), to compensate for the lack of peer sampling. In other words, we would go back to the idea of SubnetDAS as an intermediate step, allowing us to introduce DAS in the protocol in a more iterative manner, while getting some scalability benefits at each step. Peer sampling could then be introduced at a later point (without even needing a coordinated rollout).

The rest of the document explains why such an iterative rollout strategy is considered to be viable from a security perspective. A useful background read is [this document](https://notes.ethereum.org/@fradamt/das-fork-choice-2) on the various roles that sampling plays in the fork-choice, and on how exactly they enhance security.

## Subnet vs peer sampling

To understand the impact of adding or removing peer sampling, we first have to understand how it differs from "subnet sampling" (again, this is what we call custody in the specs).

First of all, sampling gives us a very useful global property: with high probability, only a small fraction of the sampling entities can be tricked into thinking that something unavailable is available, where exactly what "small" means is determined by the amount of sampling. This applies both to validators and to full nodes, and is crucial in ensuring many of the security properties we care about. From now on, let's call it *the global sampling property*. Importantly, when it comes to this property, *it is completely irrelevant how sampling is performed* (e.g. subnet sampling vs peer sampling). The only thing that matters is that the choice of samples is random, and the rest follows from a probabilistic argument. As far as this property is concerned, it would even be ok to just request all of the samples directly from the adversary.
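To make that probabilistic argument concrete, here is a minimal sketch under assumed parameters that are only meant to be illustrative (128 columns with a 2x erasure-coded extension, so that an unavailable block means the adversary releases at most 63 columns, and 8 random column samples per node): it computes the probability that a single sampler sees only released columns, and the probability that more than 1% of many independent samplers do.

```python
from math import comb, exp, lgamma, log

NUM_COLUMNS = 128       # assumed total number of columns (after the 2x extension)
MAX_RELEASED = 63       # "unavailable" = fewer than 50% of columns ever released
SAMPLES_PER_NODE = 8    # assumed number of random columns each node samples/custodies
NUM_NODES = 10_000      # assumed number of independent sampling entities

# Probability that a single node, sampling SAMPLES_PER_NODE distinct random columns,
# only hits released columns and is therefore tricked into seeing the block as available.
p_tricked = comb(MAX_RELEASED, SAMPLES_PER_NODE) / comb(NUM_COLUMNS, SAMPLES_PER_NODE)
print(f"P(single node tricked) ~ {p_tricked:.4f}")  # ~ 0.0027, a bit below (1/2)**8

def log_binom_pmf(n: int, k: int, p: float) -> float:
    """Log of the Binomial(n, p) probability mass at k, computed in log space."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def prob_tricked_fraction_exceeds(fraction: float) -> float:
    """Probability that more than `fraction` of the NUM_NODES samplers are tricked.

    Nodes choose their samples independently, so the number of tricked nodes is Binomial.
    """
    threshold = int(fraction * NUM_NODES)
    return sum(exp(log_binom_pmf(NUM_NODES, i, p_tricked))
               for i in range(threshold + 1, NUM_NODES + 1))

# The expected fraction of tricked nodes is ~0.27%, and the probability that it
# ever exceeds 1% is astronomically small.
print(f"P(more than 1% of nodes tricked) ~ {prob_tricked_fraction_exceeds(0.01):.1e}")
```

The only thing the bound depends on is the number and randomness of the samples, not on how (or from whom) they are fetched.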
In other words, as far as this property is concerned, it is completely fine to remove peer sampling from the system, as long as we set the custody requirement high enough (i.e., replace peer sampling with subnet sampling).

In what way do peer sampling and subnet sampling differ, then, and why do we want to have peer sampling in the protocol at all? There are essentially two answers to this:

- They differ in bandwidth efficiency: peer sampling does not suffer from the message amplification factor of GossipSub, so getting the same number of samples through peer sampling takes nearly an order of magnitude less bandwidth than through subnet sampling. This alone is a big reason why we ultimately would like to keep custody at a minimum (ideally just enough to provide a stable backbone for the GossipSub topics) and do peer sampling for security instead. However, in the short term it can be completely fine to start with the simpler approach and take the scalability hit.
- They differ in how hard it is to trick a *specific* node. With subnet sampling, this is extremely easy, because topic subscriptions are public, so tricking a node just entails publishing exactly (and only) the data corresponding to the topics it is subscribed to. On the other hand, tricking a node that does peer sampling is a bit harder, because some of the queries might, at least at first, be directed to nodes that the adversary does not control. Still, there is a liveness/safety tradeoff here: if the sampling strategy prescribes retrying with other peers (or even finding new peers if necessary) when queries fail, the adversary eventually learns all of the queries and satisfies them, causing a safety fault; if instead the strategy prescribes not expanding the set of queried nodes too much, the adversary can cause a liveness fault. Ultimately, a full resolution of this issue requires solving the [query unlinkability problem](https://ethresear.ch/t/subnetdas-an-intermediate-das-approach/17169#client-safety-and-query-unlinkability-12), which is itself a long-term research question that's definitely not in scope for a first rollout of PeerDAS.

With this in mind, it should be apparent that subnet sampling is just as effective when it comes to global security properties (things that depend on "at most a small fraction of nodes/validators can be tricked", i.e., the global sampling property), and that peer sampling is not much better when it comes to individual security properties (things that depend on "it is hard to trick a specific node/validator"). This is essentially the core of why we think it is ok to remove peer sampling from the first version of DAS. Still, let's run through various scenarios and consider what security properties we get with and without peer sampling.

## Security with honest majority

If we assume that (slightly more than) a majority of the stake is honest, then a sufficient amount of sampling gives us everything we need: an unavailable block can never be finalized, nor can it ever accrue a majority of the available voting weight. This is entirely due to the global sampling property, applied to validators, so it makes no difference whether we rely on peer or subnet sampling. In fact, using the trailing fork-choice with a high custody requirement already meant replacing peer sampling with subnet sampling for consensus purposes. In short, availability considerations have a very limited effect on either the consensus protocol or the security of transaction confirmations.
Moreover, rollups are secure, because the canonical Ethereum chain is available.

## Security against a malicious majority but not supermajority

Let's now consider the case where there is a malicious majority. In this case, we cannot have any guarantee that the consensus protocol will be live, nor that any synchronous confirmation rule (one attempting to confirm transactions that are not yet finalized) will be safe. In other words, the adversary can keep reorging the chain at will. Note that DAS is completely irrelevant here: not even downloading all of the data can save us.

Still, if there is no malicious supermajority, we have the guarantee that an unavailable checkpoint will neither be justified nor finalized. This means two things:

- Availability is irrelevant to the confirmation of finalized transactions, because any finalized checkpoint is guaranteed to be available.
- Availability is irrelevant to the FFG votes of honest validators: any justified checkpoint is guaranteed to be available, so voting for it as source cannot lead to getting stuck on an unavailable branch.

Moreover, at most a small fraction of full nodes can be tricked into following an unavailable chain (full nodes also do a sufficiently high amount of custody), so full nodes still serve as a layer of defense against the corrupted validator set. This is ultimately what rollup security rests on.

## Security against a malicious supermajority

As before, DAS cannot help us guarantee liveness or safety of the consensus protocol. Also as before, rollup safety is ultimately guaranteed by the presence of enough full nodes doing sampling, regardless of how exactly they do it. Unlike before, an unavailable checkpoint can be both justified and finalized, with the following potential consequences:

- A full node might confirm a finalized transaction on an unavailable branch, which is never made available. Eventually the social layer coordinates on a fork which moves away from the unavailable checkpoint, and the full node experiences a double spend.
- A validator might vote with an unavailable FFG source, and get stuck on an unavailable branch.

Before further discussion of these cases, note that we are talking about absolute worst-case scenarios, where a supermajority of validators attacks the protocol and forces the social layer to intervene. In all but this extreme situation, a sufficiently high custody requirement already gives us all the security guarantees we need. Still, we want to be as resilient as possible, including to worst-case attacks. The question then becomes: does adding peer sampling make us more resilient here?

This goes back to the point around individual security and query unlinkability. As we already discussed there, the answer is "kinda, but not really if the adversary tries hard enough". Given that we are considering a situation where the adversary controls a supermajority of the stake, it seems sensible to conclude that peer sampling (in its current form, without any query unlinkability protection) does not make these attacks meaningfully harder.

It is also worth noting that these attack vectors are hardly themselves sufficient to incentivize such a large-scale attack:

- Anyone transacting large amounts (e.g. an exchange or a liquidity provider for a bridge) can simply subscribe to all subnets. This will still be very much accessible, even to many home connections (see the rough bandwidth sketch after this list).
- There's very little to gain from getting a small fraction of validators stuck on an unavailable branch.
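To give a rough sense of what subscribing to all subnets would cost, here is a back-of-the-envelope sketch. Every parameter is an assumption chosen purely for illustration (128 column subnets, 128 KiB blobs with a 2x erasure-coded extension, 16 blobs per block, 12-second slots, and an average GossipSub duplication factor of 4), not a spec value.

```python
# Back-of-the-envelope cost of subscribing to all column subnets.
# Every parameter below is an illustrative assumption, not a spec value.

COLUMN_SUBNETS = 128            # assumed number of column subnets
BLOB_SIZE = 128 * 1024          # bytes per blob, before erasure coding
EXTENSION_FACTOR = 2            # 2x erasure-coded extension
BLOBS_PER_BLOCK = 16            # assumed number of blobs per block
SECONDS_PER_SLOT = 12
GOSSIP_DUPLICATION = 4          # assumed average GossipSub duplication factor

# Each blob contributes BLOB_SIZE * EXTENSION_FACTOR / COLUMN_SUBNETS bytes to each column,
# so summing over all columns and all blobs gives the total column data per slot.
bytes_per_slot = BLOB_SIZE * EXTENSION_FACTOR * BLOBS_PER_BLOCK
ingress_mbps = bytes_per_slot * GOSSIP_DUPLICATION * 8 / SECONDS_PER_SLOT / 1e6

print(f"Column data per slot (all subnets, no duplication): {bytes_per_slot / 2**20:.1f} MiB")
print(f"Estimated ingress with duplication: {ingress_mbps:.1f} Mbit/s")
```

Under these made-up numbers, following everything comes out to on the order of 10 Mbit/s of ingress, which is indeed within reach of many home connections; the figure scales linearly with the blob count and with the duplication factor.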
Moreover, any node running many validators would by default subscribe to all subnets and would not be affected by such an attack, so the potential victims are even fewer.

## Aside: accountable validator custody

We could also improve our defense against these attacks by making [validator custody](https://notes.ethereum.org/sFGCCOhYTjGbVH_lzItJnA#Introducing-validator-custody) accountable, i.e., by having a public mapping from validator index to the columns that validator must custody (or a private one, to be revealed if necessary in an emergency situation like the one described above). A rough sketch of what such a mapping could look like is given after the following list. This way, we get two things:

1. Honest validators can always prove that they are in the right, i.e., that they voted for something unavailable only because the data they were custodying *was* available. A social fork can then allow them to keep participating and not slash them.
2. Malicious validators, who voted to finalize a chain where the data they were supposed to custody was not available, can be identified and slashed as part of a social fork, making any such attack possibly as costly as double finality.
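As promised, here is a minimal sketch of what a public, deterministic validator-index-to-columns mapping could look like. It is purely hypothetical (the function name, parameters and hashing scheme are made up, not taken from the specs); the only property that matters is that anyone can recompute, from public inputs alone, which columns a given validator was responsible for.

```python
from hashlib import sha256

NUMBER_OF_COLUMNS = 128          # assumed total number of columns
VALIDATOR_CUSTODY_COUNT = 8      # assumed number of columns each validator must custody

def get_validator_custody_columns(validator_index: int) -> list[int]:
    """Hypothetical deterministic mapping from validator index to custody columns.

    Since it is a pure function of a public input, anyone can recompute it after the
    fact and check whether the columns a validator was accountable for were actually
    available on the branch that the validator voted to finalize.
    """
    columns: set[int] = set()
    counter = 0
    while len(columns) < VALIDATOR_CUSTODY_COUNT:
        seed = validator_index.to_bytes(8, "little") + counter.to_bytes(8, "little")
        digest = sha256(seed).digest()
        columns.add(int.from_bytes(digest[:8], "little") % NUMBER_OF_COLUMNS)
        counter += 1
    return sorted(columns)

# Example: the columns validator 42 would be accountable for, under these made-up parameters.
print(get_validator_custody_columns(42))
```

One could imagine the private variant mentioned above working similarly, e.g. deriving the columns from a validator-held secret plus a public commitment to it, with the preimage revealed only when accountability needs to be enforced.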