# EIP-7594 security rationale
Relevant parameters:
- `SAMPLES_PER_SLOT = 8`
- `NUMBER_OF_CUSTODY_GROUPS = 128`
A custody group is just a contiguous set of columns (with the current parameters it is in fact a single column, but this is not important for security). However, for the purpose of this discussion, we can think of custody groups as just another way (like columns) to horizontally partition the matrix into units of equal size (an equipartition) which can be individually verified (because each unit is made up of cells, each with a corresponding cell proof). Let's call this a verifiable horizontal equipartition. Given any horizontal equipartition, the erasure coding of blobs ensures that availability of half of the parts is sufficient to reconstruct the whole matrix. If the equipartition is moreover verifiable, it can be used for data availability sampling, with the parts being the samples.
In our case, a custody group is a sample, and every node downloads at least `SAMPLES_PER_SLOT` custody groups out of a total of `NUMBER_OF_CUSTODY_GROUPS`, with this selection made pseudorandomly by the node based on its `node_id`. Crucially, if a node fails to get its `SAMPLES_PER_SLOT` samples, it does not consider the blob data to be available and does not import the block into its fork choice. Concretely, this means that the node can never consider this block as part of the canonical chain. Still, an individual node might get all of its samples even if the data is overall not available, for example if the only samples that are available are exactly the 8 (`SAMPLES_PER_SLOT`) it requires. However, when we consider many sampling nodes, this can be the case for at most a small minority of them. Letting $n$ be the total number of sampling nodes, $m$ the total number of samples (for us `NUMBER_OF_CUSTODY_GROUPS = 128`) and $k$ the (minimum) number of samples that a node requires, we have the following bound on the probability of convincing a fraction $\epsilon$ of the nodes that unavailable data is available (meaning, making available all samples required by at least $n\epsilon$ nodes):
$$\mathbb{P}(n\epsilon \text{ nodes can be tricked}) \le \binom{n}{n\epsilon}\binom{m}{\frac{m}{2}-1}2^{-kn\epsilon}$$
The first term is the number of possible ways to choose a subset of $n\epsilon$ nodes whose sampling queries should be satisfied (the nodes to be tricked). The second term is the number of ways to choose a maximally large subset of samples to be made available, satisfying the sampling queries of the $n\epsilon$ nodes without making the whole data reconstructible: that is, the number of ways to choose $\frac{m}{2}-1$ out of $m$ samples, since any more would allow for reconstruction. Finally, for any such choice, the third term bounds the probability of success, i.e., the probability that the sampling queries of all chosen $n\epsilon$ nodes are satisfied by the chosen subset of $\frac{m}{2}-1$ samples. Since fewer than half of the samples are available, and each sampling query is chosen uniformly at random, independently of the others, the probability of satisfying any single query is at most $\frac{1}{2}$, and the probability of satisfying all $kn\epsilon$ queries of the $n\epsilon$ nodes is bounded by $2^{-kn\epsilon}$ (to be precise, each node samples without replacement, so queries are not exactly independent, but this only helps us; see below).
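To get a feel for how strong the bound is, we can evaluate its logarithm with the current parameters `m = NUMBER_OF_CUSTODY_GROUPS = 128` and `k = SAMPLES_PER_SLOT = 8`; the node count $n$ and target fraction $\epsilon$ below are illustrative choices, not protocol parameters.

```python
# Numeric evaluation of log2 of the bound C(n, n*eps) * C(m, m/2 - 1) * 2^(-k*n*eps).
from math import comb, log2


def log2_bound(n, eps, m=128, k=8):
    """log2 of the tricking-probability bound for n nodes and fraction eps."""
    s = round(n * eps)  # number of nodes to be tricked
    return log2(comb(n, s)) + log2(comb(m, m // 2 - 1)) - k * s


# With 10_000 sampling nodes and a 10% target fraction, the bound is
# astronomically small: its log2 is far below -3000.
print(log2_bound(10_000, 0.10))
```

Note that the binomial terms grow only polynomially in the exponent's scale, while the $2^{-kn\epsilon}$ term shrinks exponentially in $kn\epsilon$, so for realistic node counts the bound is dominated by the last term.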
Formally, letting $N$ be the set of $n$ nodes and $M$ the set of $m$ samples, we have:
$$
\begin{align*}
& \Pr\!\bigl(n\epsilon\text{ nodes can be tricked}\bigr) \\
&= \Pr\!\Bigl(
\bigcup_{\substack{S\subseteq N\\|S| = n\epsilon}}
\{S \text{ can be tricked}\}
\Bigr) \\[6pt]
&\le \sum_{\substack{S\subseteq N\\|S| = n\epsilon}}
\Pr\!\bigl(S \text{ can be tricked}\bigr) \\[6pt]
&= \sum_{\substack{S\subseteq N\\|S| = n\epsilon}}
\Pr\!\Bigl(
\bigcup_{\substack{T\subseteq M\\|T| = \frac{m}{2}-1}}
\!\!\bigl\{
\text{all } kn\epsilon \text{ samples from }S
\text{ fall in } T
\bigr\}
\Bigr) \\[6pt]
&\le \sum_{\substack{S\subseteq N\\|S| = n\epsilon}}
\sum_{\substack{T\subseteq M\\|T| = \frac{m}{2}-1}}
\Pr\!\Bigl(
\text{all } kn\epsilon \text{ samples from }S
\text{ fall in } T
\Bigr) \\[6pt]
&\le \sum_{\substack{S\subseteq N\\|S| = n\epsilon}}
\sum_{\substack{T\subseteq M\\|T| = \frac{m}{2}-1}}
\Bigl(\tfrac{|T|}{|M|}\Bigr)^{kn\epsilon} \\[6pt]
&< \binom{n}{n\epsilon}
\binom{m}{\tfrac{m}{2}-1}\,
2^{-kn\epsilon}.
\end{align*}
$$
Note that we do not have an equality in the penultimate line, because the $kn\epsilon$ samples are not independent: each node's $k$ samples are drawn without replacement. However, sampling without replacement only reduces the probability of all samples falling within the same subset $T$, so the probability given by sampling with replacement, where all $kn\epsilon$ samples are independent, gives us an upper bound.
Note that it does not matter here whether the adversary knows the queries in advance: all we care about is that the queries are chosen randomly (in our case pseudorandomly, to be precise). Therefore, for the purpose of this analysis it is fine for nodes to make the random choice of samples once and then request the same samples in every slot, and even to advertise their choice, as is the case in our protocol.
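The shape of such a deterministic, advertisable selection can be sketched as below. The function name and the hashing scheme here are illustrative only, not the spec's actual custody-group computation; the point is merely that the choice is a pseudorandom function of `node_id`, so it is fixed across slots and can be published without weakening the analysis above.

```python
# Hypothetical sketch (NOT the spec algorithm): derive a node's custody
# groups as a pseudorandom function of its node id.
from hashlib import sha256

NUMBER_OF_CUSTODY_GROUPS = 128
SAMPLES_PER_SLOT = 8


def custody_groups(node_id: bytes, count: int = SAMPLES_PER_SLOT) -> list[int]:
    """Derive `count` distinct custody-group indices from a node id."""
    groups: list[int] = []
    counter = 0
    while len(groups) < count:
        digest = sha256(node_id + counter.to_bytes(8, "little")).digest()
        group = int.from_bytes(digest[:8], "little") % NUMBER_OF_CUSTODY_GROUPS
        if group not in groups:  # sample without replacement
            groups.append(group)
        counter += 1
    return sorted(groups)


# Same node id -> same groups, in every slot, for every observer.
assert custody_groups(b"\x01" * 32) == custody_groups(b"\x01" * 32)
print(custody_groups(b"\x01" * 32))
```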