PeerDAS - peer selection during sampling

Thanks Emmanuel Nalepa for the review and contribution.

TL;DR

This doc aims to analyze PeerDAS sampling spec and propose a change to the spec to mitigate the underlying issue.

Background

EIP-7594: PeerDAS - Peer Data Availability Sampling proposes an intermediate step towards full Danksharding, its design separates network interaction between nodes into two phaes: distribution and sampling.

During distribution phase, blobs were extended horizontally (1D PeerDAS), and then divided into NUMBER_OF_COLUMS (currently 128) sections and distributed across DATA_COLUMN_SIDECAR_SUBNET_COUNT (currently 32) subnets. Each node is required to custody at least CUSTODY_REQUIREMENT (currently 1) subnets (4 columns).

During sampling phase, quote the current spec

A node SHOULD maintain a diverse set of peers for each column and each slot 
by verifying responsiveness to sample queries. 

At each slot, a node makes SAMPLES_PER_SLOT queries for samples 
from their peers via DataColumnSidecarsByRoot request. 

A node utilizes get_custody_columns helper to determine
which peer(s) to request from. If a node has enough good/honest
peers across all rows and columns, this has a high chance of success.

Problem

There is 1 issue with the current spec:

a diverse set of peers for each column and each slot does not provide worst case guarantee that all the peers the node has covers them all. In fact, from calculations we'll see in later sections, it's far from enough and mathematically impossible if we just select random peers. Let's do some calculations:

NUMBER_OF_COLUMNS = 128
DATA_COLUMN_SIDECAR_SUBNET_COUNT = 32
TARGET_NUMBER_OF_PEERS = 70
CUSTODY_REQUIREMENT = 1 => custody 4 columns at least

Let's calculate the expected number of columns our peer set could cover:

Probability that a Specific Number is Not Covered:

$P (not covered in one peer) = \frac{127 * 126 * 125 * 124}{128 * 127 * 126 * 125} = \frac{31}{32}$
Probability that a Specific Number is Not Covered in 70 peers:

$P (not covered in 70 peers) = {(\frac{31}{32})}^{70}$
Probability that a Specific Number is covered at Least Once in 70 peers:

$P (covered) = 1 - {(\frac{31}{32})}^{70} = 89.2% .$
expected number of distinct numbers covered

$E (distinct numbers) = 128 * P (covered) = 114$

With E(distinct number) = 114, this means that with 70 expected peers, they're only expected to cover 114 of total 128 columns, leaving 14 columns not covered. And increasing peer number here does not have significant effect on the outcome (See figure below).

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

The expectation brings the issue in front of us: how do we select columns to sample?

Option 1 (recommended)

Approach: randomly select number of columns from what our peer set custodies (minus the columns the node currently custodies)

Pros:

Easier error handling => the additional guarantee (samples have peer that custodies them) greatly simplifies client implementation, clients don't need to handle situations where the column we want to sample does not have peer that custodies them (should we retry selection, try to find peer that custodies to them, etc)
Better sampling time bound guarantee => we could parallelize sampling right after peer selection and decide on sampling result with all the sampling returns.

Cons:

if the peer numbers are low, it could reduce the sample pool make the mathematical guarantees from sampling unsound. => However we argue that the alternative (random sampling) has the same problem in practice as we can generate random columns to sample, but we cannot sample them if there's no peer that custodies them connects to our node.

Option 2

Approach: we randomly select SAMPLES_PER_SLOT columns from 0 to 127.

The drawback is that with expected distinct columns covered at 114, the chance that they are all covered are very low, clients need to consider how to handle this mismatch situation and this greatly complicates implementation (in terms of retry, error handling and parallelization).

P (all covered) = (\frac{114}{128})^{8} = 39.59% .