# 1/8-1/10: Fixing the Proof of [[DP23]](https://eprint.iacr.org/2023/630.pdf)
Continued from https://hackmd.io/c2eTRG3PSLeverwHTMkNDQ
A similar approach is now integrated into the eprint manuscript.
## The Issue
The issue with the original paper is that it runs the *straight-line extractor* of BSCS16, and argues as if a single execution of $\mathcal{A}$ is sufficient to extract all the $(u_i)$ via BSCS16.
However, this is not actually true - a single execution of $\mathcal{A}$ only gives $\kappa$ column openings, which is definitely not sufficient. The extractor of BSCS16 allows one to query arbitrarily many times, and only guarantees extraction of the merkle trees "up to the queried leaves".
So for example, consider the following scenario.
Suppose one creates $c$ by committing to the first $n-1$ columns via merkle as normal, but replaces the "merkle root for the final column" with a random value. Then one can never extract a full $(u_i)$ matching the definition of the current $\text{Open}$, since there is no $(u_i)$ that completely merkle-hashes to $c$. However, such a $c$ can pass a non-negligible fraction of the tests - whenever the $\kappa$ queried columns all fall within the first $n-1$ columns, the adversary still passes. Note that we assume the adversary deliberately sabotages the final column when creating $c$, while otherwise knowing a valid $t$.
This shows that one execution of $\mathcal{A}$ is definitely not sufficient - even inspecting the random oracle queries that $\mathcal{A}$ used to *generate* $c$ is not sufficient on its own.
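To get a feel for the numbers, here is a minimal numeric sketch of how often this $n-1$ column attack survives verification, assuming the verifier samples $\kappa$ independent uniform column indices (the parameters $n$ and $\kappa$ below are purely illustrative, not DP23's):

```python
import random

def attack_success_rate(n: int, kappa: int, trials: int = 100_000) -> float:
    """Estimate how often an adversary that can open only the first n-1
    columns passes verification, when the verifier samples kappa
    independent uniform column indices."""
    successes = 0
    for _ in range(trials):
        queries = [random.randrange(n) for _ in range(kappa)]
        # the attack fails only when the unopenable last column is queried
        if all(q != n - 1 for q in queries):
            successes += 1
    return successes / trials

if __name__ == "__main__":
    n, kappa = 256, 100  # illustrative only
    print("empirical success rate :", attack_success_rate(n, kappa))
    print("(1 - 1/n)^kappa        :", (1 - 1 / n) ** kappa)
```

With these illustrative numbers the attack passes well over half of the time, so the event "the final column is never queried" is very far from negligible.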
## Discussion / Developing Ideas
I discussed with Benjamin Diamond, one of the authors of the DP23 paper.
First, I asked whether the above issue was actually an issue. I also asked whether the following strategy works - simply collect verifying proofs until *all* column merkle proofs have been gathered.
At first I thought this was a valid fix - since this is now equivalent to the [coupon collector's problem](https://en.m.wikipedia.org/wiki/Coupon_collector%27s_problem), one has a rough estimate of the expected number of trials needed to gather all columns. The [tail bounds on the coupon collector's problem](https://en.m.wikipedia.org/wiki/Coupon_collector%27s_problem#Tail_estimates) are well studied, and give better bounds than the usual Markov argument used by the original General Forking Lemma of BCC+16.
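A quick back-of-the-envelope sketch of the coupon-collector estimates, under the (idealized, and as it turns out unjustified) assumption that the queried columns behave like independent uniform draws unconditioned on $\mathcal{A}$'s success; all parameters are illustrative:

```python
def expected_draws(n: int) -> float:
    """Coupon collector: expected number of uniform single-column draws
    needed to see all n columns, i.e. n * H_n."""
    return n * sum(1 / i for i in range(1, n + 1))

def tail_bound(n: int, beta: float) -> float:
    """Classical tail estimate: P[T > beta * n * ln(n)] <= n^(1 - beta)."""
    return n ** (1 - beta)

if __name__ == "__main__":
    n, kappa = 256, 100                      # illustrative parameters only
    draws = expected_draws(n)
    print("expected single-column draws :", draws)
    print("~ expected verifying proofs  :", draws / kappa)
    print("P[T > 3 n ln n] <=           :", tail_bound(n, 3.0))
```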
However, Benjamin Diamond pointed out that there is actually a flaw in this idea - the exact same flaw that the $\sqrt{\delta}$ trick in the DP23 proof tries to deal with. The problem is that, of course, the distribution of the $\kappa$ queried columns *conditioned* on $\mathcal{A}$'s success is not uniform. Moreover, as in the original showcase of the issue, "never asking for the final column" is not an event with negligible probability. This means that the same $\sqrt{\delta}$ trick doesn't work - in the lemmas using this trick, we only had to deal with negligible events; here, the event isn't negligible.
This leads us back to the original $n-1$ column attack - the reality is that this attack cannot be handled with the current definitions. $\mathcal{A}$ succeeds on roughly $(1-1/n)^\kappa$ of the queries - which is non-negligible - yet one can never fully recover $(u_i)$.
Now we wonder whether the full $(u_i)$ is even important. We can't simply ignore the $(u_i)$'s when opening, since we do have to compare merkle openings for $c$ against the $\text{Enc}(t_i)$'s somehow.
However, at the same time, the **only thing that's important is** that
$$d^m((u_i)_{i=0}^{m-1}, (\text{Enc}(t_i))_{i=0}^{m-1}) < d/3$$
which means that, **if we agree on $n-d/3$ columns, then the remaining columns don't matter**. For example, if we show merkle openings for $n-d/3$ columns that fully agree with the $\text{Enc}(t_i)$'s, why should the remaining $d/3$ columns matter? The current definition of a valid $(u_i)$ certainly doesn't care - so why should we even bother to provide a merkle opening for them?
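For concreteness, a minimal sketch of the correlated distance $d^m$ used above (a hypothetical helper, with the matrices given as lists of $m$ equal-length rows): it counts the columns on which *any* of the $m$ rows disagree.

```python
def correlated_distance(U, W):
    """d^m: number of column indices j at which the row-lists U and W
    differ in at least one of the m rows."""
    m, n = len(U), len(U[0])
    assert len(W) == m and all(len(row) == n for row in U + W)
    return sum(
        any(U[i][j] != W[i][j] for i in range(m))
        for j in range(n)
    )

# example: two 2x4 matrices differing in columns 1 and 3
print(correlated_distance([[0, 1, 2, 3], [4, 5, 6, 7]],
                          [[0, 9, 2, 3], [4, 5, 6, 8]]))  # -> 2
```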
This leads us to **change our definition of opening** as follows.
The opener provides $t$ and $u_{i_1}, \cdots, u_{i_k}$ for $k > n - d/3$, and shows that (see the sketch after this list)
- $u_{i_t}$ is the $i_t$'th column of $c$, by providing a merkle opening
- the $i_t$'th column of the $\text{Enc}(t_i)$'s is exactly equal to $u_{i_t}$
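Here is a minimal sketch of what the verifier of this modified $\text{Open}$ would check; `merkle_verify` and `enc` are hypothetical callables standing in for the merkle-opening check against $c$ and the encoding map $\text{Enc}$ (this is a sketch of the definition above, not DP23's actual interface):

```python
from typing import Callable, Dict, List, Sequence

def check_modified_opening(
    c: bytes,                                   # the merkle commitment
    t: Sequence[Sequence[int]],                 # the claimed messages t_0..t_{m-1}
    opened: Dict[int, List[int]],               # column index -> claimed column
    merkle_paths: Dict[int, object],            # column index -> merkle opening for c
    merkle_verify: Callable[[bytes, int, List[int], object], bool],
    enc: Callable[[Sequence[int]], List[int]],  # row-wise encoding Enc
    n: int,
    d: int,
) -> bool:
    """Sketch of the modified Open: accept t if strictly more than n - d/3
    columns are merkle-opened against c AND agree with Enc(t) column-wise."""
    if len(opened) * 3 <= 3 * n - d:            # need |opened| > n - d/3
        return False
    codewords = [enc(row) for row in t]         # Enc(t_0), ..., Enc(t_{m-1})
    for j, column in opened.items():
        # 1) the opened column really sits at position j of the committed matrix
        if not merkle_verify(c, j, column, merkle_paths[j]):
            return False
        # 2) it agrees exactly with the j-th column of the encodings
        if column != [cw[j] for cw in codewords]:
            return False
    return True
```

The point is simply that only $k > n - d/3$ columns ever need a merkle opening; nothing is claimed about the remaining columns.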
It's clear that this still forces a binding $t$ - the same logic from the current proof of binding applies.
## Final Proof Outline
The sketch is as follows - using the BSCS16 extractor we can extract the set $M$ of columns that can actually be merkle-opened. One can show that $\lvert M^C \rvert < d/3$, since being unable to answer $d/3$ of the columns leads to a negligible success rate when asked to provide $\kappa$ independent column openings. We then need to stop caring about the entries at the columns in $M^C$ - this can be done by considering the projection of the code $V$ onto $M$.
The extractor definition stays the same - but the BSCS16 extractor doesn't recover the entire $u$.
Denote $\epsilon$ as the success probability of $\mathcal{A}$ - as we know, if $\epsilon$ is negligible we are done.
### Part 1: Merkle Extraction of BSCS16
We note that we have complete knowledge of the points at which $\mathcal{A}$ queried the random oracle to obtain $c$. This tells us which columns have sound merkle openings and which don't - note that this process *is* equivalent to BSCS16's extractor. We now let $M$ be the set of columns for which $\mathcal{A}$ can provide a valid merkle opening - this set can be determined by our extractor $\mathcal{E}$ at the moment $\mathcal{A}$ outputs $c$ - and it cannot change afterwards, i.e. $\mathcal{A}$ cannot somehow provide merkle openings for columns outside of $M$. We also note that $\mathcal{E}$ knows all the column values inside $M$. Essentially, the entire $(u_i)$ of the previous proof is now replaced by $M$.
### Part 2: Modification of Lemma 4.9 with Projected Codes
First, as a warm-up: $\lvert M^C \rvert < d/3$. This is easy - if $\lvert M^C \rvert \ge d/3$, then since every queried column must lie inside $M$ for the verification to pass, the success probability is at most
$$\left(1 - \frac{\lvert M^C \rvert}{n} \right)^\kappa < \left(1 - \frac{d}{3n} \right)^\kappa$$
which is negligible. Denote $r = \lvert M^C \rvert < d/3$. Now we move on to the main lemma.
We claim that there must be a set of codewords $\text{Enc}(t_i)$ such that the $\text{Enc}(t_i)$'s and the $u_i$'s differ **on $M$** in at most $d/3 - r$ correlated entries. To be more precise, define the new distance
$$d_M(u, v) = \lvert \{i \in M: u_i \neq v_i \} \rvert$$
We see that the distance $d_M$ for the code $V$ is well-defined, as if $u, v \in V$ are distinct
$$d_M(u, v) \ge d(u, v) - \lvert M^C \rvert \ge d - r$$
so, in a way, we can view this as a "projected" code - something like an $[n - r, d - r]$ code.
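As a small sanity check of this "projected code" intuition, here is a sketch with a random binary linear code small enough to enumerate exhaustively (purely illustrative - the code $V$ in the paper is not binary, and the inequality $d_M \ge d - \lvert M^C \rvert$ holds for any code simply because restricting to $M$ removes at most $\lvert M^C \rvert$ disagreeing positions):

```python
import itertools
import random

def min_distance(codewords, positions):
    """Minimum pairwise Hamming distance, restricted to `positions`."""
    best = len(positions)
    for a, b in itertools.combinations(codewords, 2):
        best = min(best, sum(a[i] != b[i] for i in positions))
    return best

random.seed(0)
n, k, r = 12, 4, 2                      # a systematic [12, 4] binary code, drop r = 2 columns
G = [[1 if j == i else 0 for j in range(k)] +
     [random.randrange(2) for _ in range(n - k)] for i in range(k)]
codewords = [
    [sum(msg[i] * G[i][j] for i in range(k)) % 2 for j in range(n)]
    for msg in itertools.product([0, 1], repeat=k)
]

d = min_distance(codewords, range(n))
M = range(n - r)                        # keep the first n - r coordinates
print(f"d = {d}, d_M = {min_distance(codewords, M)} >= d - r = {d - r}")
```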
Assume otherwise - that the distance between the projected $V$ and the projected $u_i$'s exceeds $d/3-r$.
In this case, as we have
$$d/3 - r \le (d - r)/3 \le d_M / 3$$
we can use the exact same Lemma with the projected code. This means that $\text{Enc}(t')$'s projection and $\sum_{i=1}^m r_i u_i$'s projection would differ on more than $d/3 - r$ entries with overwhelming probability. Since the verification query fails at these entries as well as at the $r$ entries in $M^C$, we see that the verification succeeds with probability at most
$$\frac{2dl}{3q} + \left( 1 - \frac{d}{3n} \right)^\kappa$$
which is once again negligible. We note that this $t_i$ is also unique - if two distinct $t_i$'s worked, the projected distance of their encodings would be at most $2(d/3-r) < d-r$, forcing them to be equal - a contradiction.
This is our definition of $t_0, \cdots, t_{m-1}$ - and our modification of Lemma 4.9.
### Part 3: Modification of Lemma 4.10 with Projected Codes
We show that $t' = ([\otimes_{i=l}^{2l-1} (1 - r_i, r_i)] \cdot [t_0, \cdots, t_{m-1}]^T)$ holds.
Assume otherwise. In that case, write
$$u' = ([\otimes_{i=l}^{2l-1} (1 - r_i, r_i)] \cdot [u_0, \cdots, u_{m-1}]^T)_M$$
$$v' = ([\otimes_{i=l}^{2l-1} (1 - r_i, r_i)] \cdot [\text{Enc}(t_0), \cdots, \text{Enc}(t_{m-1})]^T)_M$$
By the triangle inequality,
$$d_M(\text{Enc}(t')_M, u') \ge d_M(\text{Enc}(t')_M, v') - d_M(u', v') \ge (d-r) - (d/3-r) = 2d/3$$
so even without considering the $M^C$ cases, we see that the success probability is less than
$$\left( 1 - \frac{2d}{3n} \right)^\kappa$$
which is once again negligible.
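To make "negligible" slightly more concrete, here is a quick numeric sketch of the two bounds above; every parameter below ($n$, $d$, $\kappa$, the lemma's $l$, and the field size $q$) is an illustrative placeholder rather than DP23's concrete choice, and the point is only the exponential decay in $\kappa$:

```python
# illustrative placeholders only
n, d, kappa = 256, 128, 100        # code length, minimum distance, queried columns
l, q = 2 ** 10, 2 ** 128           # lemma's error-term parameter and field size

part2_bound = 2 * d * l / (3 * q) + (1 - d / (3 * n)) ** kappa
part3_bound = (1 - 2 * d / (3 * n)) ** kappa

print("Part 2 bound:", part2_bound)  # field term is tiny; second term decays in kappa
print("Part 3 bound:", part3_bound)
```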
### Finale: Lemma 4.11 and Lemma 4.12
We can use the same proof as Lemma 4.11 - note that the only thing that matters is that the conditional probabilities themselves are negligible - so the same proof works.
We use the same proof from Lemma 4.12 - note that the proof of Lemma 4.12 doesn't deal with the $(u_i)$'s or the definitions of the $(t_i)$'s and so forth - all it requires is that the set of $(r_i)$'s making the $\otimes_{j=l}^{2l-1} (1 - r_{i,j}, r_{i,j})$'s linearly dependent has negligible density. Therefore, one obtains the same conclusion as Lemma 4.12, i.e. $\left( [ \otimes_{j=l}^{2l-1} (1 - r_{i, j}, r_{i, j}) ] \right)_{i=0}^{m-1}$ is invertible.
Therefore, the proof is finished - since there are at most $d/3 - r$ correlated disagreements on $M$, we have $n - r - (d/3-r) = n - d/3$ columns ready to open.
## Appendix: "Modification" of BSCS16's extraction
We need a guarantee of the following sort - by viewing all calls to the random oracle $\rho$, at the moment where $\mathcal{A}$ outputs $c$, the extractor $\mathcal{E}$ can extract a set of columns $M$ such that
- $\mathcal{E}$ knows all the column values at $M$
- $\mathcal{A}$ cannot provide column openings for columns in $M^C$, except with negligible probability
The only thing we have to work with is the fact that $\rho$ is a random oracle.
We assume that
- both the prover and the verifier know what the merkle tree looks like
- the verifier handles the authentication path lengths correctly as well
- since the verifier knows the queried indices, it knows what merkle authentication path length the prover needs to provide - so none of the usual merkle tree attacks go through
Assume we express every element and hash as $\lambda$ bits, and say $Q$ distinct invocations were used. The random oracle is now $\rho: \{0, 1\}^{2\lambda} \rightarrow \{0, 1\}^{\lambda}$. We write each invocation as
$$\rho(l_i || r_i) = v_i, \quad q_i = \{l_i, r_i\}$$
where the invocations are written in order. We note $q_i \neq q_j$ for $i \neq j$.
The cases we have to deal with are as follows
- $v_i = v_j$ holding for some $i \neq j$
- $v_j \in q_i$ for some $j \ge i$
We remove all such cases - it is easy to show that their total probability is $\mathcal{O}(Q^2/2^\lambda)$, which is $\text{negl}(\lambda)$.
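A minimal sketch of this filtering step, with the transcript represented as an ordered list of entries `((l_i, r_i), v_i)` (names and representation are hypothetical; each event below occurs with probability $\mathcal{O}(Q^2/2^\lambda)$ over the oracle's randomness):

```python
def has_bad_event(transcript):
    """transcript: ordered list of random-oracle calls ((l_i, r_i), v_i).
    Returns True on one of the events we condition away:
      - an output collision v_i == v_j for some i != j, or
      - an output v_j appearing as an input of a query q_i with j >= i
        (the adversary "predicted" an oracle output before receiving it)."""
    outputs = [v for (_, v) in transcript]
    if len(set(outputs)) != len(outputs):          # output collision
        return True
    for i, ((l_i, r_i), _) in enumerate(transcript):
        # does any later-or-equal output already appear inside this query?
        if any(v_j in (l_i, r_i) for v_j in outputs[i:]):
            return True
    return False
```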
![File](https://hackmd.io/_uploads/H15-Lu9dp.jpg)
One of the issues in BSCS16 is that they actually build a DAG, but describe the algorithm as if they were building a binary tree. This is solely because they wanted to avoid the potential exponential blow-up of the binary tree. For example, if we naively expand the entire binary tree as in the "reality" part of the figure, and the queries look like $x_{i+1} = \rho(x_i || x_i)$, then the binary tree size can be as large as $2^Q$, making the extractor run in exponential time. We can bypass this by working solely on the subtree rooted at the commitment $c$, and bounding the depth of the tree by $\lceil \log (mn) \rceil + 1$ or something like that. Therefore, one only has to work with at most $\mathcal{O}(mn)$ nodes, and there are no collisions since we already removed those cases.
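A minimal sketch of this bounded-depth extraction, under the assumptions above (the bad events have been removed) and with hypothetical names: `transcript` as in the previous sketch, `root` the commitment $c$, and `depth` something like $\lceil \log (mn) \rceil + 1$; this illustrates the idea rather than reproducing BSCS16's exact algorithm.

```python
def extract_known_leaves(transcript, root, depth):
    """Walk the merkle tree downward from `root`, following only hashes the
    adversary actually obtained from the oracle. Returns the leaf positions
    (with their values) whose whole authentication path is explained by the
    transcript - i.e. exactly the columns in M that can later be opened."""
    # invert the oracle on queried points; well-defined since output
    # collisions were already removed from the transcript
    preimage = {v: (l, r) for ((l, r), v) in transcript}

    known = {}                       # leaf position -> leaf value
    frontier = [(root, 0, 0)]        # (node value, level, position in level)
    while frontier:
        node, lvl, pos = frontier.pop()
        if lvl == depth:             # leaf layer: record the column hash
            known[pos] = node
            continue
        if node not in preimage:     # children never queried: nothing below
            continue                 # this node can ever be merkle-opened
        left, right = preimage[node]
        frontier.append((left, lvl + 1, 2 * pos))
        frontier.append((right, lvl + 1, 2 * pos + 1))
    return known
```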
This is sufficient to build the tree, and get our necessary guarantees.