# Problem description

In its current form, DAS only comes with security guarantees of the form
\begin{equation}\tag{1}
\Pr[\mathrm{ \alpha \ see \ data \ as \ available|data \ is \ unavailable}] \leq ...
\end{equation}
where $\alpha$ is a fraction of the validator set. This in turn allows us to obtain bounds on the system-level security guarantees, given a sufficiently large number of validators. It would, however, be advantageous to have a protocol which comes with stronger, individual security guarantees, that is, guarantees of the form: given $n$ samples,
\begin{equation}\tag{2}
\Pr[\mathrm{ see \ data \ as \ available|data \ is \ unavailable}] \leq f(n)
\end{equation}
where $f(n)$ scales "favourably" (e.g. inverse polynomially). It is clear that the current DAS protocol does not give rise to such a property, for the simple reason that the samples corresponding to each individual validator can be inferred; a malicious party could therefore always make exactly those queried cells available while keeping the blob itself unavailable.

Evidently, if this property were to hold, it would have far-reaching consequences. Many proposals concerning the future of DAS attempt to reduce the requirements on the block proposer by not requiring them to download entire blobs but rather to sample them as well. All of these proposals therefore need a mechanism by which the proposer can decide availability. One way of doing this is by means of an availability committee which, due to Eq. (1), has a high probability of correctly verifying data availability. Individual-level guarantees as provided by Eq. (2) would instead allow the block proposer to directly assess availability without the need for a committee. In the rest of this note we will sketch out a proposal for DAS with individual-level guarantees.

# Preliminaries

We begin by abstracting away from the concrete details of DAS and simply consider a prover $P$ and a verifier $V$. Here $P$ and $V$ need not be individual actors but can themselves decompose into parts. If those parts are uncoordinated, and thus uncorrelated, we will write this explicitly. In particular, the strongest results we can get are for the setting where a malicious prover is assumed to have a fully coordinated strategy over a number of actors while the verifier does not. In this case we will write $V=\bigotimes_{i=1}^k V_i$. Sampling with individual-level guarantees then reduces to an interactive protocol between $P$ and $V$ which satisfies two properties.

- Completeness: If $P$ is honest, they should be able to convince $V_i$ (for all $i\in[k]$) that "data is available" with probability $p_1\geq 1-\epsilon$.
- Soundness: No dishonest $P'$ should be able to convince $V_i$ (for any $i\in[k]$) that "data is available" with probability $p_2\geq \epsilon'$.

"Data is available" here will simply mean that $P$ has access to some vector $v=(v_1,...,v_n)$. In DAS, validators can verify the content of cells; this translates to $V$ being able to check whether the reply $r$ to a query $i$ satisfies $r=v_i$.

Starting with completeness: an honest $P$, upon receiving any query $i$, can simply return $v_i$ as a response. Thus any protocol in which $V_i$ accepts if all (or a large proportion) of the queries are answered validly will satisfy completeness. Moving to soundness, things become a bit more interesting. It is clear that unless $V$ queries all $i\in [n]$, a dishonest prover $P'$ (with only partial access to $v$) is indistinguishable from the honest one.
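To fix ideas, here is a minimal sketch of this query-response game in Python (all names are illustrative and ours, not part of any existing DAS implementation): an honest prover always passes, while a prover withholding even a single entry passes whenever that entry happens not to be queried.

```python
import random

def make_honest_prover(v):
    """Honest prover: answers every query with the true entry."""
    return lambda i: v[i]

def make_partial_prover(v, missing):
    """Dishonest prover holding v on all indices except `missing`."""
    return lambda i: v[i] if i not in missing else None

def verifier_accepts(prover, v, queries):
    """A verifier accepts iff every queried reply matches the true entry."""
    return all(prover(i) == v[i] for i in queries)

n = 64
v = [random.randrange(256) for _ in range(n)]
honest = make_honest_prover(v)
partial = make_partial_prover(v, missing={0})  # withholds a single entry

queries = random.sample(range(n), 8)
print(verifier_accepts(honest, v, queries))   # always True (completeness)
print(verifier_accepts(partial, v, queries))  # True unless index 0 was queried
```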
If we do not require perfect soundness, one idea worth exploring would be to query both individual entries and sums of entries, i.e. to query, for example, the entry $i$ as well as the sum of the entries $i$ and $j$. In a multi-round setting a dishonest prover then has the choice of answering the second query with $v_i+v_j$ or with $v_i+\delta$ for some arbitrary $\delta$. Unless the verifier queries $j$ in a subsequent round, both answers are equivalent. However, it is not so clear how such a multi-round protocol could easily be employed in the setting of DAS, so we leave this aside for now.

If we restrict the protocol to a single-round interaction and $V$'s action to simply querying some (sub)set of indices $i\in [n]$, we can therefore directly conclude that $V$ inevitably has to query all indices. This is not surprising, since in DAS this corresponds to the setting where validators don't actually sample but rather download entire blobs (today's setting). Alternatively, it corresponds to the setting where $V$ is composed of a number of coordinated validators which query different entries so that their union covers the entire blob (essentially what the availability committee is proposed to do, up to some stochastic factor).

For an uncoordinated verifier $V=\bigotimes_{i=1}^k V_i$ such strategies are not available. A strategy that is available, however, is to randomize the order of the queries coming from $V$. In a deterministic protocol the queries sent to $P$ by $V$ have to be ordered in some predetermined way. Without loss of generality, these can be ordered by the label of the individual component $V_i$, i.e. a query will be a tuple:
\begin{equation}
q=(q_1^{V_1},...,q_c^{V_1},...,q_c^{V_k}).
\end{equation}
In this case it is clear that a dishonest $P'$ can pass the test for some of the $V_i$'s (e.g. it might reply correctly to $(q_1^{V_1},...,q_c^{V_1})$, leading $V_1$ to accept the interaction, while not replying correctly to the remaining queries). If $V$ instead adopts a randomized strategy, a query to $P$ will be given by:
\begin{equation}
q=\pi\left((q_1^{V_1},...,q_c^{V_1},...,q_c^{V_k})\right)
\end{equation}
with $\pi$ some permutation. As long as $\pi$ is unknown to the prover, any dishonest $P'$ can only succeed probabilistically, with the probability decreasing in $k$. In particular, the probability that the $c$ queries of a given $V_i$ all fall among the queries that $P'$ answers correctly is given by
\begin{equation}
p_2=\frac{\binom{ck-c}{l-c}}{\binom{ck}{l}}=\prod_{i=0}^{c-1}\frac{l-i}{ck-i} \leq \prod_{i=0}^{c-1}\frac{n-i-1}{ck-i}.
\end{equation}
Here $l$ is the number of queries that $P'$ replies to correctly, which is bounded by $l\leq n-1$ (a prover withholding data knows at most $n-1$ entries), and $c$ is the number of individual queries made by each $V_i$. The bound becomes non-trivial once $ck>n-1$. To get a feeling, let us fix $c=8$, $n=64$:

![Probability_individual_guarantees](https://hackmd.io/_uploads/BJ4PPj6Yel.png)

which shows exactly the kind of behavior we would have expected.

# Protocol

Of course, the above discussion is highly idealised. In the following section we will discuss how we could develop a sampling protocol inspired by this randomized approach. Indeed, the main ingredient, keeping the permutation $\pi$ private in a P2P setting where individual participants can be either part of $P$ or of $V$, is a challenging problem. In the non-trivial parameter regime $ck>n-1$, randomizing the order of the query $q$ can be interpreted as a way to break the link between individual elements of $q$ and the corresponding $V_i$.
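As a quick numerical illustration of this regime, the following sketch (parameters as in the figure above) evaluates the idealised bound on $p_2$ as a function of $k$:

```python
from math import prod

def p2_bound(c, n, k):
    """Upper bound on a dishonest prover's success probability against a
    fixed V_i when the ck queries arrive in a secretly permuted order:
    prod_{i=0}^{c-1} (n-1-i)/(ck-i), non-trivial once ck > n-1."""
    return prod((n - 1 - i) / (c * k - i) for i in range(c))

c, n = 8, 64
for k in [8, 16, 32, 64, 128]:
    print(f"k={k:4d}  p2 <= {p2_bound(c, n, k):.3e}")
```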
This link between queries and verifiers is particularly evident in the case where $\bigcup_{i,j}q_i^{V_j}=[n]$, since here all entries of the vector are being queried but the dishonest $P'$ can only probabilistically attack an individual $V_i$. The protocol we envision is precisely one that breaks this connection, at least with high probability.

The idea of the protocol is to have subtopics corresponding to the different elements $i\in [n]$. When $V_i$ wants to sample an element $j$, they should connect to a small number of peers $N(V_i)$ in the corresponding subtopic. If $N(V_i)\subseteq V$, there is no way for $P'$ to learn the link between $V_i$ and $j$. If, however, there exists some peer $w\in N(V_i)$ with $w\in P'$, the link between $V_i$ and the query $j$ is revealed to the dishonest prover, who can then choose to act honestly on this entry.

Let $P'$ be uniformly distributed over the different subnets; since all $V_i$'s sample uniformly, it is easy to see that this strategy is optimal from the perspective of $P'$. Let us focus on an individual entry and its corresponding subtopic. If the subtopic has $N$ participants and $K$ of them are in $P'$, the probability that a random node $u$ is connected to one of them is given by
\begin{equation}
\Pr[u \ \mathrm{connected \ to \ some\ } w\in P']=1-\frac{\binom{N-K}{d}}{\binom{N}{d}}=1-\prod_{i=0}^{d-1}\frac{N-K-i}{N-i}\leq \frac{d K}{N-d}
\end{equation}
where $d$ is the degree of the subnet's underlying graph. The overall probability of success for $P'$ can then be expressed as
\begin{align}
p_2=\sum_{i=0}^c\Pr[u \ \mathrm{connected \ to \ some\ } w\in P' \mathrm{\ in} \ i \ \mathrm {subnets} ]&\times \Pr[P' \ \mathrm{selects \ remaining \ } c-i \ \mathrm{subnets}] \\
\leq \sum_{i=0}^c \left (\frac{d K}{N-d}\right)^i&\times \prod_{j=0}^{c-i-1}\frac{n-j-1}{ck-j}
\end{align}
where once again we use $n=64$, $c=8$. We also choose $K=aN$ with $0<a<d^{-1}$, so that the per-subnet bound $dK/(N-d)$ stays below one.

![Probability_2](https://hackmd.io/_uploads/ryDRLgCtge.png)
![Probability_3](https://hackmd.io/_uploads/rJxxvxRtel.png)

# Implementation

In this section we will provide a concrete implementation of the protocol discussed above. Overall, it should be said that everything discussed here can be understood as a specific sub-problem of the "Anonymous Gossiping" challenge, which has already been discussed elsewhere. In particular, the question here is:

- How to establish the local connections (to $d$ random peers) in each subnet in such a way that each connection is known only to the two parties participating in it.

One idea for achieving this could be to have a network described by a graph $G=(V,E)$ of degree $D$. In the sampling process each node sends their locally sampled list of entries to their neighbors. Then, for each pair of nodes $u$, $v$ with local lists $L_u,L_v$, a link is maintained for all subtopics corresponding to $i\in L_u \cap L_v$. Collectively, this generates the subnet structure with some probability. The probability that an element $i\in L_u$ does not get matched with one in $L_w$ for any $w \in \mathcal N (u)$ is given by
\begin{equation}
\Pr[i \ \mathrm{not \ matched}]=\left(1-\frac{c}{n}\right)^D
\end{equation}
where again $c$ is the number of samples taken and $n$ is the number of blob cells (e.g. $c=8$, $n=64$).
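As a quick sanity check of this matching probability (the $D$ values here are purely illustrative), the sketch below compares the closed form against a small Monte Carlo simulation of the neighbour lists:

```python
import random

c, n = 8, 64

def pr_not_matched(D):
    """Probability that a fixed index in L_u appears in none of the
    D neighbours' lists, each an independent uniform c-subset of [n]."""
    return (1 - c / n) ** D

def monte_carlo(D, trials=20_000):
    """Empirical estimate: index 0 is in L_u w.l.o.g.; count the runs in
    which none of the D neighbour lists contain it."""
    misses = sum(
        all(0 not in random.sample(range(n), c) for _ in range(D))
        for _ in range(trials)
    )
    return misses / trials

for D in [10, 20, 40]:
    print(f"D={D:3d}  formula={pr_not_matched(D):.2e}  mc={monte_carlo(D):.2e}")
```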
Hence, the probability that all $c$ indices sampled by $u$ get matched is given by
\begin{equation}
\Pr[ \mathrm{all\ indices \ matched}]=\left(1-\left(1-\frac{c}{n}\right)^D \right)^c.
\end{equation}
Using a crude union bound over all nodes, we can see that with, say, $N=10000$ nodes we obtain an upper bound of order $0.01$ on the probability that some node fails to get all its indices matched, by choosing $D=100$. In this setting the probability of a dishonest $P'$ being connected to $V_i$ is given by $D/N$ if the choice of the $D$ neighbors is uniform.

====Note: here I assumed that the adversarial nodes are picked in advance. Of course, things become more difficult if we assume that the adversary can choose their nodes dynamically, based on some knowledge of the connectivity.====

====Question: how do we get to the aforementioned network? How do we connect to the $D$ nodes in an appropriate way?====

- Using the existing discovery protocol, each node can construct a sufficiently large table of node records.
- Sample $D$ of the nodes in the table and establish a connection to each.
- Propagate the list $L_{u}$ to your $D$ neighbours.

A simulation of this construction is sketched at the end of this section. In the setting where the adversary is assumed to be able to choose their nodes in the graph dynamically, we can slightly improve on the above strategy. First of all, the process can be periodically repeated so that the neighbouring set is not constant. Further, instead of sampling $D$ nodes and propagating to all of them, we can sample $W\geq D$ and subsample, in order to delay the information about the neighbourhood reaching the adversary.

====Question: how easy is it for an adversary to actually learn the connections a target node maintains?====
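To make the construction concrete, here is a minimal simulation of the list-propagation step sketched in the bullets above (pure Python; discovery, real networking, and the $W\geq D$ subsampling refinement are omitted, and all names are ours):

```python
import random

def build_subnets(N, n, c, D, seed=0):
    """Each node samples c of n indices, picks D random peers from its
    'discovery table' (here simply: all other nodes), exchanges lists,
    and keeps a per-index link for every shared index."""
    rng = random.Random(seed)
    lists = [set(rng.sample(range(n), c)) for _ in range(N)]
    neighbours = [rng.sample([w for w in range(N) if w != u], D) for u in range(N)]
    links = [{i: [] for i in lists[u]} for u in range(N)]
    for u in range(N):
        for w in neighbours[u]:
            for i in lists[u] & lists[w]:
                links[u][i].append(w)  # link in the subtopic of index i
    return lists, links

N, n, c, D = 2000, 64, 8, 100
lists, links = build_subnets(N, n, c, D)
unmatched = sum(1 for u in range(N) for i in lists[u] if not links[u][i])
print(f"unmatched (node, index) pairs: {unmatched} of {N * c}")
```

With these parameters the expected number of unmatched (node, index) pairs is roughly $Nc(1-c/n)^D \approx 0.03$, consistent with the union-bound estimate above.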