如何判断两个大文件内容是否相等

# 如何判断两个大文件内容是否相等 # 问题两个相隔千里的人Alice和Bob，每人拥有一个非常大的文件，其中各自包含$n$个字符，分别记为：$a=\left(a_{1}, \ldots, a_{n}\right)$， $b = \left(b_{1}, \ldots, b_{n}\right)$。二人的目标是，判断他们的两个文件是不是相等的，即对所有$i=1, \ldots, n$，是否$a_{i}=b_{i}$ 。 # 基准解法 Alice把他的整个文件发给Bob，Bob检验对所有$i=1, \ldots, n$，是否$a_{i}=b_{i}$ 。显然当文件大小很大时，该解法通信复杂量过大。 ## 下界：基于姚氏通信复杂性模型（这一小节其实是写本篇的初衷，但过于技术，恕我不翻译。） > 定理：没有确定性算法可以比上述基准算法通信复杂度更小 ### Protocol & Communication Complexity A **protocol** $\mathcal{P}$ over domain $X \times Y$ with range $Z$ is a binary tree where each internal node $v$ is labeled either by a function $a_{v}: X \rightarrow\{0,1\}$ or by a function $b_{v}: Y \rightarrow\{0,1\}$, and each leaf is labeled with an element $z \in Z$. The value of the protocol $\mathcal{P}$ on input $(x, y)$ is the label of the leaf reached by starting from the root, and walking on the tree. At each internal node $v$ labeled by $a_{v}$ walking left if $a_{v}(x)=0$ and right if $a_{v}(x)=1$, and at each internal node labeled by $b_{v}$ walking left if $b_{v}(y)=0$ and right if $b_{v}(y)=1$. The cost of the protocol $\mathcal{P}$ on input $(x, y)$ is the length of the path taken on input $(x, y)$. The **cost** of the protocol $\mathcal{P}$ is the height of the tree. For a function $f: X \times Y \rightarrow Z$, the (deterministic) communication complexity of $f$ is the minimum cost of $\mathcal{P}$, over all protocols $\mathcal{P}$ that compute $f$. We denote the **(deterministic) communication complexity** of $f$ by $D(f)$. The simplest way for Alice and Bob to evaluate a function $f$ is for one of the players, say Alice, to send all her input to Bob (requiring $\log _{2}|X|$ bits using an appropriate encoding), for Bob to compute $f(x, y)$ privately (with his unlimited computational power), and then for Bob to send the answer back to Alice ( $\log _{2}|Z|$ more bits). We thus have: > Proposition 1: For every function $f: X \times Y \rightarrow Z$, > $$ D(f) \leq \log _{2}|X|+\log _{2}|Z| $$ ### Rectangles > A **combinatorial rectangle** (in short, a rectangle) in $X \times Y$ is a subset $R \subseteq X \times Y$ such that $R=A \times B$ for some $A \subseteq X$ and $B \subseteq Y$. > $R \subseteq X \times Y$ is a rectangle if and only if $$ \left(x_{1}, y_{1}\right) \in R \text { and }\left(x_{2}, y_{2}\right) \in R \Rightarrow\left(x_{1}, y_{2}\right) \in R $$ For every protocol $\mathcal{P}$ and leaf $\ell$ in it, $R_{\ell}$ is a rectangle. A subset $R \subseteq X \times Y$ is called $f$-**monochromatic** (in short, monochromatic) if $f$ is fixed on $R$. ### Fooling Set Method Let $f: X \times Y \rightarrow\{0,1\}$. A set $S \subset X \times Y$ is called a **fooling set** (for $f)$ if there exists a value $z \in\{0,1\}$ such that - For every $(x, y) \in S, f(x, y)=z$. - For every two distinct pairs $\left(x_{1}, y_{1}\right)$ and $\left(x_{2}, y_{2}\right)$ in $S$, either $f\left(x_{1}, y_{2}\right) \neq z$ or $f\left(x_{2}, y_{1}\right) \neq$ $z$. > Lemma 1: If $f$ has a fooling set $S$ of size $t$, then $D(f) \geq \log _{2} t$. ### Tight Lower Bound for Equality Function Assuming Alice and Bob each hold an $n$-bit string, $x, y \in\{0,1\}^{n}$. The equality function, $\operatorname{EQ}(x, y)$, is defined to be 1 if $x=y$ and 0 otherwise. A fooling set of size $2^{n}$ is $$ S=\left\{(\alpha, \alpha) \mid \alpha \in\{0,1\}^{n}\right\} $$ (because for every $\alpha, \mathrm{EQ}(\alpha, \alpha)=1$, whereas for every $\alpha \neq \beta, \mathrm{EQ}(\alpha, \beta)=0$ ). It follows that Lemma 1: $D(\mathrm{EQ}) \geq n$. By also counting 0-rectangles, we conclude $D(\mathrm{EQ}) \geq n+1$. Finally, recall that $D(f) \leq n+1$ for every function $f:\{0,1\}^{n} \times\{0,1\}^{n} \rightarrow\{0,1\}$ (by Proposition 1). Therefore, $D(\mathrm{EQ})=n+1$. # Reed-Solomon指纹识别协议基准解法通信复杂度过大，上述下界又排除了更好的确定性算法的可能性，那么我们应该怎么办呢？我们可以放宽一些条件，允许算法可以以很小的概率去输出错误结果，如此一来上述下界可被突破，可设计出大大减少通信复杂度的随机化算法。 ## Reed-Solomon 编码对每个 $r \in \mathbb{F}_{p}$, 定义 $h_{r}\left(a_{1}, \ldots, a_{n}\right)=\sum_{i=1}^{n} a_{i} \cdot r^{i}$. 哈希函数族 $\mathcal{H}=\left\{h_{r}: r \in \mathbb{F}_{p}\right\} .$ > Reed-Solomon 编码是允许你在做核酸之前可以不用把手机放得端端正正再递上去扫码的唯一原因。 ## 协议 - Alice 随机选择$r \in \mathbb{F}_{p}$，计算$v=h_{r}(a)$ - Alice不直接全文发送$a$到Bob，而是发送$v$和$r$。 - Bob 检查是否$v = h_r(b)$ - 否，则$a\ne b$ - 是，则$a=b$ w.h.p (以很高概率) ## 开销分析 - Alice只发送了两个$\mathbb{F}_{p}$ 中的元素，论比特数是$O(\log n)$ （假设 $p \leq n^{c}$ ） - 这相对于确定性基准协议的$n$比特是指数级别的通信复杂度性能提升。