# [Noise-Tolerant Learning, the Parity Problem, and the Statistical Query Model](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/2003-Noise-Tolerant_Learning.pdf)
## Background
1. Any $\mathcal{H}$ efficiently learnable under the SQ model is also efficiently PAC learnable with random classification noise (RCN).
2. Before this paper, it was not known whether SQ learnability is necessary for efficient PAC learning under RCN.
3. This paper shows that
a. $\mathrm{SQ~Learnable} \subsetneq \mathrm{PAC~Learnable~with~RCN}$, i.e. the containment is strict.
b. The $k$-wise SQ model with $k=O(\log n)$ is equivalent to the unary SQ model.
## Main Problem
1. Can we learn a parity function with RCN efficiently?
a. How about parity functions that only depend on the first $k=O(\log n \log\log n)$ bits of $x \in \{0,1\}^n$?
2. Equivalence of unary SQ model and $k=O(\log n)$-wise SQ model
## Results
**Lemma 4** *Let $(x_1,\ell_1),\dots,(x_s,\ell_s)$ be examples labeled by the target function $c$ and corrupted by random classification noise of rate $\eta$. Then $\ell_1 + \cdots + \ell_s$ is the correct value of $(x_1 + \cdots + x_s) \cdot c$ with probability $\frac{1}{2}\left(1+(1-2\eta)^s\right)$.*
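A quick way to see this (the usual bias argument, not spelled out above): write $\ell_i = x_i \cdot c + e_i \pmod 2$ with independent noise bits $e_i \sim \mathrm{Bernoulli}(\eta)$. Then $\ell_1+\cdots+\ell_s = (x_1+\cdots+x_s)\cdot c + (e_1+\cdots+e_s)$, and since $\mathbb{E}\left[(-1)^{e_i}\right] = 1-2\eta$, independence gives $\mathbb{E}\left[(-1)^{e_1+\cdots+e_s}\right] = (1-2\eta)^s$, i.e. $\Pr[e_1+\cdots+e_s = 0] = \frac{1}{2}\left(1+(1-2\eta)^s\right)$.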
**Algorithm**
1. Request $a \cdot 2^b$ labeled examples. Write each example $x$ as $a$ blocks of $b$ bits each, so that $ab = n$.
2. Group the examples by the bits in their last block: examples that agree on the last block go into the same group, so there are at most $2^b$ groups.
3. In each group, pick one example $x$ uniformly at random, add it (over $\mathbb{F}_2$) to every other example in the group, and discard the picked example. Every remaining example now ends in $0^b$.
- e.g. If $a=2$, $b=2$ and the group is $\{1001, 0001, 1101\}$, pick one example at random (say $1001$) and add it to the others, then discard $1001$, leaving $\{1000, 0100\}$.
4. Perform this grouping-and-adding step $a-1$ times in total, once for each block other than the first, so that all blocks except the first are zeroed out. Whenever two examples are added, their labels are added (XORed) as well.
5. With probability roughly $1-1/e$, at least one of the remaining examples equals $10\cdots0$ (the remaining $\approx 2^b$ examples have nearly uniform first blocks); let $\ell$ denote its label.
- If no example equals $10\cdots0$, repeat the whole process with fresh samples.
6. By Lemma 4 with $s=2^{a-1}$, $\ell$ equals $(10\cdots0)\cdot c$, i.e. it correctly indicates whether the first bit participates in the target parity, with probability $p=\frac{1}{2}\left(1+(1-2\eta)^s\right)$: after the $r$-th round every surviving example is a sum of $2^r$ original examples, so after $a-1$ rounds it is a sum of $s = 2^{a-1}$ of them.
7. Repeat the whole procedure with fresh samples and take a majority vote over the resulting labels to determine the correct value of $\ell$ w.h.p. Repeating this for each of the $n$ bit positions recovers the target parity function. (A toy simulation of these steps is sketched below.)
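Below is a minimal simulation of these steps, not the paper's implementation: the function names (`sample_examples`, `zero_block`, `recover_first_bit`) and the toy parameters are assumptions of this sketch, and efficiency is ignored in favor of readability.

```python
import random

def sample_examples(secret, n, eta, count):
    """Draw `count` uniform examples labeled by the parity `secret`,
    each label flipped independently with probability `eta` (RCN)."""
    data = []
    for _ in range(count):
        x = [random.randint(0, 1) for _ in range(n)]
        label = sum(xi * si for xi, si in zip(x, secret)) % 2
        if random.random() < eta:
            label ^= 1  # random classification noise
        data.append((x, label))
    return data

def zero_block(examples, lo, hi):
    """One round: group on bit positions [lo, hi), XOR a random
    representative into the rest of its group, then discard it."""
    groups = {}
    for x, l in examples:
        groups.setdefault(tuple(x[lo:hi]), []).append((x, l))
    out = []
    for members in groups.values():
        rep_x, rep_l = random.choice(members)
        for x, l in members:
            if x is rep_x:
                continue  # the representative itself is discarded
            out.append(([xi ^ ri for xi, ri in zip(x, rep_x)], l ^ rep_l))
    return out

def recover_first_bit(secret, n, a, b, eta, votes_needed=201):
    """Majority-vote over independent runs for the first parity coefficient."""
    target = [1] + [0] * (n - 1)  # looking for the example 10...0
    votes = []
    while len(votes) < votes_needed:
        examples = sample_examples(secret, n, eta, a * 2 ** b)
        for r in range(a - 1):  # zero out blocks, from the last one backwards
            examples = zero_block(examples, n - (r + 1) * b, n - r * b)
        for x, l in examples:
            if x == target:
                votes.append(l)  # noisy vote for secret[0], bias (1-2*eta)**(2**(a-1))
                break
    return int(2 * sum(votes) > len(votes))

if __name__ == "__main__":
    n, a, b, eta = 8, 2, 4, 0.1        # toy sizes with a*b = n
    secret = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical target parity
    print("recovered first coefficient:", recover_first_bit(secret, n, a, b, eta))
    print("true first coefficient:     ", secret[0])
```

With these toy parameters each surviving example is a sum of only $s = 2^{a-1} = 2$ original examples, so each vote is correct with probability $\frac{1}{2}(1+0.8^2) = 0.82$, and the majority over 201 independent runs identifies the first coefficient with overwhelming probability.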
**Theorem 2** The algorithm above runs in $\mathrm{poly}\left(\left(\frac{1}{1-2\eta}\right)^{2^a}, 2^b\right)$ time and sample complexity: each run requests $a \cdot 2^b$ samples, and amplifying a vote of bias $(1-2\eta)^{2^{a-1}}$ by majority takes on the order of $\left(\frac{1}{1-2\eta}\right)^{2^a}$ independent runs.
**Corollary 3** If the target parity function depends only on the first $k$ bits, the algorithm runs in $2^{O(k/\log k)}$ time and sample complexity.
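A rough sketch of where the bound comes from (assuming a constant noise rate $\eta < 1/2$): apply the algorithm to the $k$ relevant bits with, say, $a \approx \frac{1}{2}\log k$ and $b = k/a \approx 2k/\log k$. Then $2^b = 2^{O(k/\log k)}$, while $\left(\frac{1}{1-2\eta}\right)^{2^a} = 2^{O(\sqrt{k})}$ is dominated by the first term. In particular, for $k = \Theta(\log n \log\log n)$ the total cost is $\mathrm{poly}(n)$, which gives the separation claimed in the Background section.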
###### tags: `Digest`