# 1-2: Generalization: PAC learning

###### tags: `machine learning`

## Probably Approximately Correct (PAC) Learning

### Definition

With the **probability assumption** we made in [1-1](https://hackmd.io/@jerry1249756/rkS6MFefo), statistical guarantees can be made for some learning problems. We start by introducing the formal definition of the Probably Approximately Correct (PAC) learning framework:

:::success
**Definition (PAC learnable)**
Let the hypothesis set $\mathcal{H}$ be realizable, i.e., there exists some $h^*_\mathcal{H}\in\mathcal{H}$ s.t. $R_P(h^*_\mathcal{H})=0$. $\mathcal{H}$ is said to be *Probably Approximately Correct (PAC)-learnable* if
$$\forall \epsilon>0, \delta>0, \exists \mathcal{A}: \mathcal{D}^*\rightarrow\mathcal{H} \text{ s.t. } \Pr\{R_P(\hat{h}) > \epsilon\} \leq \delta$$
where $\hat{h} = \mathcal{A}(\mathcal{D}^n)$ is the predictor chosen by the learning algorithm from $n>m_{\mathcal{H}}(\epsilon, \delta)$ samples. $m_{\mathcal{H}}(\cdot,\cdot)$ is usually termed the *sample complexity*.

**Definition (Agnostic PAC learnable)**
The hypothesis set $\mathcal{H}$ is said to be *agnostic PAC-learnable* if the realizability assumption is removed, that is,
$$\forall \epsilon>0, \delta>0, \exists \mathcal{A}: \mathcal{D}^*\rightarrow\mathcal{H} \text{ s.t. } \Pr\{R_P(\hat{h})-R_P(h^*_\mathcal{H}) > \epsilon\} \leq \delta$$
In the following content, "PAC learnable" refers to this more general (agnostic) version.
:::

We leave a simple exercise here:

:::info
**Example (Monotonicity of sample complexity)**
Suppose that $\mathcal{H}$ is PAC learnable and its sample complexity is given by $m_{\mathcal{H}}(\cdot,\cdot)$. Show that $m_{\mathcal{H}}$ is monotonically non-increasing in each of its parameters $\epsilon$ and $\delta$. That is, show that given $\delta\in(0,1)$, and given $0<\epsilon_1\leq\epsilon_2<1$, we have $m_{\mathcal{H}}(\epsilon_1,\delta)\geq m_{\mathcal{H}}(\epsilon_2,\delta)$.
:::

**[Sol]**
$$\begin{aligned}
\Pr\{R_P(\hat{h})-R_P(h^*_\mathcal{H}) > \epsilon_1\} \leq \delta, &\quad\forall n>m_{\mathcal{H}}(\epsilon_1,\delta)\\
\Pr\{R_P(\hat{h})-R_P(h^*_\mathcal{H}) > \epsilon_2\} \leq \delta, &\quad\forall n>m_{\mathcal{H}}(\epsilon_2,\delta)
\end{aligned}$$
Since $\epsilon_1\leq\epsilon_2$, the event $\{R_P(\hat{h})-R_P(h^*_\mathcal{H}) > \epsilon_2\}$ is contained in $\{R_P(\hat{h})-R_P(h^*_\mathcal{H}) > \epsilon_1\}$, so whenever the first guarantee holds, the second holds as well. Hence any $n$ that suffices for $(\epsilon_1,\delta)$ also suffices for $(\epsilon_2,\delta)$, which implies $m_{\mathcal{H}}(\epsilon_1,\delta)\geq m_{\mathcal{H}}(\epsilon_2,\delta)$. A similar argument works for $\delta$.

### Bounding the estimation error

Recall the excess risk decomposition:
$$\mathcal{ER}_P(h):=R_P(h)-R_P(h^*)=\underbrace{[R_P(h)-R_P(h^*_\mathcal{H})]}_{\text{estimation error}}+\underbrace{[R_P(h^*_\mathcal{H})-R_P(h^*)]}_{\text{approximation error}}$$
The approximation error is deterministic and is mainly determined by the choice of $\mathcal{H}$.
The estimation error, however, can be bounded by the **generalization gap** if we use ERM:

:::success
**Definition (Generalization gap)**
The *generalization gap* of a predictor $h$ over training data $\mathcal{D}^n$ and testing distribution $P$ is defined as
$$G_P^{\mathcal{D}^n}(h) := R_P(h) - \hat{R_n}(h)$$
:::

Since the ERM predictor $\hat{h_n}$ minimizes the empirical risk, we have $\hat{R_n}(\hat{h_n})\leq\hat{R_n}(h^*_\mathcal{H})$, and therefore
$$\begin{aligned}
R_P(\hat{h_n})-R_P(h^*_\mathcal{H}) &= R_P(\hat{h_n}) - \hat{R_n}(\hat{h_n}) + \hat{R_n}(\hat{h_n}) -R_P(h^*_\mathcal{H})\\
&\leq [R_P(\hat{h_n}) - \hat{R_n}(\hat{h_n})] + [\hat{R_n}(h^*_\mathcal{H}) -R_P(h^*_\mathcal{H})]\\
&\leq 2 \underset{h\in\mathcal{H}}{\sup}\mid R_P(h) - \hat{R_n}(h)\mid
\end{aligned}$$
This immediately translates into the following:
$$\Pr\{R_P(\hat{h_n})-R_P(h^*_\mathcal{H}) \geq \epsilon \} \leq \Pr\{\underset{h\in\mathcal{H}}{\sup} \mid R_P(h)-\hat{R_n}(h)\mid \geq \frac{\epsilon}{2} \}$$
We will use the approach called **uniform convergence**, that is, we show that the right-hand side above converges to 0 in probability. Our main focus is to derive a *distribution-independent* bound on $G_P^{\mathcal{D}^n}(h)$ in terms of
1. the training data size $n$ and
2. the size of the hypothesis space $\mid \mathcal{H} \mid$.
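As a warm-up, here is a minimal numerical sketch of what the agnostic PAC guarantee says for ERM. The distribution, the finite threshold class, and all variable names below are illustrative choices and not part of the notes; the only point is that $\Pr\{R_P(\hat{h_n})-R_P(h^*_\mathcal{H})>\epsilon\}$ shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative): X ~ Uniform[0, 1], Y = 1{X >= 0.5} with labels
# flipped w.p. 0.1 (Bayes risk 0.1). H = { h_t(x) = 1{x >= t} : t on a grid }.
thresholds = np.linspace(0.0, 1.0, 21)            # the finite hypothesis set H
true_risk = 0.1 + 0.8 * np.abs(thresholds - 0.5)  # R_P(h_t) in closed form
best_risk = true_risk.min()                       # R_P(h*_H) = 0.1

def erm(x, y):
    """Return the index of the empirical-risk-minimizing threshold."""
    preds = (x[None, :] >= thresholds[:, None]).astype(int)  # |H| x n predictions
    emp_risk = (preds != y[None, :]).mean(axis=1)
    return int(np.argmin(emp_risk))

eps, trials = 0.05, 2000
for n in [50, 200, 800]:
    excess = np.empty(trials)
    for t in range(trials):
        x = rng.uniform(size=n)
        y = (x >= 0.5).astype(int) ^ (rng.uniform(size=n) < 0.1).astype(int)
        excess[t] = true_risk[erm(x, y)] - best_risk   # estimation error of ERM
    print(f"n={n:4d}  Pr[R_P(h_hat) - R_P(h*_H) > {eps}] ~ {(excess > eps).mean():.3f}")
```

The estimated failure probability should drop rapidly with $n$, which is exactly the behaviour the uniform-convergence argument below quantifies.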
Before we get started, we shall introduce some concentration inequalities as preparation.

### Concentration Inequalities

I'll assume that you already know the following basic concentration inequalities:

:::success
**Theorem (Markov's Inequality)**
For a non-negative RV $X$ whose expectation exists,
$$\forall a>0, \Pr\{X\geq a\}\leq \frac{\mathbb{E}[X]}{a}$$

**Theorem (Chebyshev's Inequality)**
For a RV $X$ whose variance exists,
$$\forall a>0, \Pr\{\mid X -\mathbb{E}[X]\mid\geq a\}\leq \frac{Var[X]}{a^2}$$

**Theorem (Chernoff Bound)**
For a RV $X$ whose moment generating function (MGF) $M_X(t)$ exists,
$$\forall a\in \mathbb{R}, \Pr\{ X\geq a\}\leq \inf_{t\geq 0}e^{-at}M_X(t)$$
:::

Now let's play a little math game.

:::success
**Lemma (Hoeffding's Lemma)**
For a bounded RV $X$ with $\Pr\{X\in [a,b]\} = 1$,
$$\forall t\in \mathbb{R}, M_{X-\mathbb{E}[X]}(t) = \mathbb{E}[e^{t(X-\mathbb{E}[X])}]\leq \exp{\left(\frac{t^2(b-a)^2}{8}\right)}$$
:::

**[Proof]**
Without loss of generality, we can assume that $\mathbb{E}[X]=0$ and thus $a\leq0\leq b$. Since $\mathbb{E}[(b-X)(X-a)]=-\mathbb{E}[X^2]+(a+b)\mathbb{E}[X]-ab\geq0$, the variance can be bounded as follows (this is Popoviciu's inequality):
\begin{align}
Var[X] &= \mathbb{E}[X^2] -\mathbb{E}[X]^2 \leq (a+b)\mathbb{E}[X]-ab -\mathbb{E}[X]^2\\
&=(b-\mathbb{E}[X])(\mathbb{E}[X]-a) \leq \frac{1}{4}(b-a)^2
\end{align}
where the last inequality utilizes the AM-GM inequality. Note that the expectation (resp. variance) is the 1st (resp. 2nd) derivative of $C_X(t) := \ln M_X(t)$ (the **cumulant generating function**, CGF) evaluated at $t=0$:
$$\frac{d}{dt}C_X(t)\biggr|_{t=0} = \frac{M_X'(0)}{M_X(0)}=\mathbb{E}[X]$$
$$\frac{d^2}{dt^2}C_X(t)\biggr|_{t=0} = \frac{d}{dt}\frac{M_X'(t)}{M_X(t)}\biggr|_{t=0}=\frac{M_X^{''}(0)M_X(0)-M_X^{'}(0)^2}{M_X(0)^2} =Var[X]$$
Taking the Taylor expansion at $t=0$, there exists a $\theta \in [0,t]$ such that
$$C_X(t)=C_X(0) + C_X^{'}(0)t+\frac{C_X^{''}(\theta)}{2}t^2\leq \frac{t^2(b-a)^2}{8}$$
where the inequality holds because $C_X^{''}(\theta)$ is the variance of $X$ under the exponentially tilted distribution $dP_\theta \propto e^{\theta x}dP$, which is still supported on $[a,b]$, so the same bound $\frac{1}{4}(b-a)^2$ applies. Thus, we have
$$M_X(t)\leq \exp{\left(\frac{t^2(b-a)^2}{8}\right)}$$

:::success
**Theorem (Hoeffding's Inequality)**
For independent bounded RVs $X_1, ..., X_n$ with $\Pr\{X_i\in [a_i,b_i]\} = 1$ for all $i=1,...,n$,
$$\forall \epsilon >0, \Pr\left\{\biggr| \sum_{i=1}^n \left( X_i -\mathbb{E}[X_i]\right) \biggr| \geq \epsilon\right\}\leq 2\exp{\left(-\frac{2\epsilon^2}{\sum_{i=1}^n(b_i-a_i)^2}\right)}$$
:::

**[Proof]**
With the Chernoff bound and Hoeffding's Lemma, we have, for any $t\geq 0$,
$$\Pr\left\{\biggr| \sum_{i=1}^n \left( X_i -\mathbb{E}[X_i]\right) \biggr| \geq \epsilon\right\}\leq 2e^{-t\epsilon} \prod_{i=1}^n M_{X_i-\mathbb{E}[X_i]}(t)\leq2\exp{\left(-t\epsilon+ \sum_{i=1}^n\frac{t^2(b_i-a_i)^2}{8}\right)}$$
Taking
$$t=\frac{4\epsilon}{\sum_{i=1}^n(b_i-a_i)^2}\implies \Pr\left\{\biggr| \sum_{i=1}^n \left( X_i -\mathbb{E}[X_i]\right) \biggr| \geq \epsilon\right\}\leq 2\exp{\left(-\frac{2\epsilon^2}{\sum_{i=1}^n(b_i-a_i)^2}\right)}$$
If we apply the result to Bernoulli RVs (so $a_i=0$, $b_i=1$), the bound becomes
$$\Pr\left\{\frac{1}{n}\biggr| \sum_{i=1}^n \left( X_i -\mathbb{E}[X_i]\right) \biggr| \geq \epsilon\right\}\leq 2\exp{\left(-2n\epsilon^2 \right)}$$
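To get a feel for how tight this is, here is a small sanity check of the Bernoulli form of the bound. The coin probability, $n$, and $\epsilon$ below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Empirical check of Hoeffding's inequality for Bernoulli(p) variables:
# compare Pr{ |(1/n) * sum(X_i) - p| >= eps } (estimated by simulation)
# with the bound 2 * exp(-2 * n * eps^2).
p, n, eps, trials = 0.5, 200, 0.1, 20_000

samples = rng.random((trials, n)) < p          # trials x n Bernoulli draws
deviation = np.abs(samples.mean(axis=1) - p)   # |empirical mean - p| per trial

empirical = (deviation >= eps).mean()
bound = 2 * np.exp(-2 * n * eps**2)
print(f"empirical tail probability ~ {empirical:.5f}")
print(f"Hoeffding bound            = {bound:.5f}")
```

The simulated tail probability should come out well below the bound, reflecting that Hoeffding's inequality is distribution-free and therefore conservative.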
## Some examples

### Finite hypothesis sets

Now let's look at a few examples.

:::info
**Example (Single Classifier)**
Suppose that there is only one classifier in the hypothesis set, i.e., $\mathcal{H}=\{h\}$. For a *consistent* learner, i.e., $\hat{R_n}(h)=0$, we have
$$\Pr\{R_P(h)>\frac{1}{n}\ln{\left(\frac{1}{\delta}\right)} \}\leq \delta$$
For an *inconsistent* learner, we have
$$\Pr\{ \biggr|\hat{R_n}(h) - R_P(h) \biggr|> \sqrt{\frac{1}{2n}\ln{\left(\frac{2}{\delta}\right)}} \}\leq \delta$$
:::

**[Proof]**
(1) Suppose $R_P(h)>\epsilon$. The probability that $h$ nevertheless makes no error on $n$ i.i.d. samples (i.e., $\hat{R_n}(h)=0$) is
$$\Pr\{\hat{R_n}(h)=0\} = (1-R_P(h))^n \leq (1-\epsilon)^n\leq e^{-n\epsilon}$$
Setting $\delta=e^{-n\epsilon}$, i.e., $\epsilon=\frac{1}{n}\ln(\frac{1}{\delta})$, leads to
$$\Pr\{R_P(h)>\frac{1}{n}\ln{\left(\frac{1}{\delta}\right)} \}\leq \delta$$
(2) Utilizing Hoeffding's inequality,
$$\Pr\{ \biggr|\hat{R_n}(h) - R_P(h) \biggr|> \epsilon \}\leq 2\exp{\left(-2n\epsilon^2 \right)}$$
A similar substitution yields the stated bound. Note that the consistent case has a tighter bound ($\epsilon$ scales as $1/n$ instead of $1/\sqrt{n}$).

:::info
**Example (Finite number of Classifiers)**
Suppose that the hypothesis set is finite, i.e., $\mid\mathcal{H}\mid<\infty$. Then,
$$\Pr\{\exists h\in\mathcal{H} \text{ s.t. } \biggr|\hat{R_n}(h) - R_P(h) \biggr|>\sqrt{\frac{1}{2n}\ln{\left(\frac{2\mid\mathcal{H}\mid}{\delta}\right)}}\}\leq \delta$$
:::

**[Proof]**
By the union bound,
\begin{align}
\Pr\{\exists h\in\mathcal{H} \text{ s.t. } \biggr|\hat{R_n}(h) - R_P(h) \biggr|> \epsilon\}&= \Pr\{\bigcup_{h\in\mathcal{H}} \{\biggr|\hat{R_n}(h) - R_P(h) \biggr|> \epsilon\}\}\\
&\leq\sum_{h\in\mathcal{H}}\Pr\{ \biggr|\hat{R_n}(h) - R_P(h) \biggr|> \epsilon\}\leq 2\mid\mathcal{H}\mid\exp{\left(-2n\epsilon^2 \right)}
\end{align}
A similar substitution yields the result above. Some observations about this bound:

* For finite $\mathcal{H}$, as $n\to \infty$, $\hat{R_n}(h) \simeq R_P(h)$ uniformly over $\mathcal{H}$.
* If the algorithm $\mathcal{A}$ can find some $\hat{h_n}$ with $\hat{R_n}(\hat{h_n})\to0$, then the bound gives a PAC guarantee that this learning problem is solvable!
* There is a tradeoff with respect to the size of $\mathcal{H}$ (a numerical sketch follows the table):

| $\mid \mathcal{H} \mid$ | $G_P^{\mathcal{D}^n}(h)$ | $\hat{R_n}(h)$ |
| ----------------------- | ------------------------ | -------------- |
| Small                   | Small                    | Large          |
| Large                   | Large                    | Small          |
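The following sketch estimates the $(1-\delta)$-quantile of $\sup_{h\in\mathcal{H}}\mid\hat{R_n}(h)-R_P(h)\mid$ for finite threshold classes of several sizes and compares it with $\sqrt{\frac{1}{2n}\ln(\frac{2\mid\mathcal{H}\mid}{\delta})}$. The threshold class, distribution, and sample sizes are arbitrary illustrative choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Numerical sketch of the finite-|H| uniform convergence bound.
# Illustrative setup: X ~ Uniform[0, 1], Y = 1{X >= 0.5}, and
# H = { h_t(x) = 1{x >= t} : t on a grid }, so R_P(h_t) = |t - 0.5| exactly.
def sup_gap_quantile(n, H_size, delta, trials=500):
    """Empirical (1 - delta)-quantile of sup_h |R_hat_n(h) - R_P(h)|."""
    thresholds = np.linspace(0.0, 1.0, H_size)
    true_risk = np.abs(thresholds - 0.5)
    gaps = np.empty(trials)
    for t in range(trials):
        x = rng.uniform(size=n)
        y = (x >= 0.5).astype(int)
        preds = (x[None, :] >= thresholds[:, None]).astype(int)  # |H| x n
        emp_risk = (preds != y[None, :]).mean(axis=1)
        gaps[t] = np.max(np.abs(emp_risk - true_risk))
    return np.quantile(gaps, 1 - delta)

n, delta = 500, 0.05
for H_size in [10, 100, 1000]:
    bound = np.sqrt(np.log(2 * H_size / delta) / (2 * n))
    observed = sup_gap_quantile(n, H_size, delta)
    print(f"|H|={H_size:5d}  observed 95%-quantile of sup-gap ~ {observed:.4f}"
          f"  <= bound {bound:.4f}")
```

Because the bound depends on the class size only through $\ln\mid\mathcal{H}\mid$, even a 100-fold increase in $\mid\mathcal{H}\mid$ moves it only modestly; this is the tradeoff summarized in the table above.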
### An infinite $\mathcal{H}$

:::info
**Example (Axis-aligned rectangle)**
Consider the class of axis-aligned rectangle classifiers in $\mathbb{R}^d$, denoted by
$$h_{\textbf{(A,B)}}(\textbf{x})=\begin{cases} 1, &a_i \leq x_i \leq b_i, \forall i=1,...,d\\ 0, &\text{otherwise} \end{cases}$$
Note that $\mathcal{H}$ is an uncountable set.
(1) Assume that $R_P(h^*_\mathcal{H})=0$ (realizable). Let $\mathcal{A}$ be the learning algorithm that returns the tightest rectangle enclosing all positive points. Show that $\mathcal{A}$ is an ERM learner.
(2) Show that this problem is PAC learnable with sample complexity
$$m_{\mathcal{H}}(\epsilon,\delta) = \frac{2d}{\epsilon}\ln{\left(\frac{2d}{\delta}\right)}$$
(3) Show that the runtime of $\mathcal{A}$ is polynomial in $d$, $1/\epsilon$ and $\ln{(1/\delta)}$.
:::

**[Sol]**
(1) In the realizable case, the tightest rectangle enclosing all positive training points classifies every positive sample correctly, and since it is contained in the true rectangle, every negative sample falls outside it and is also classified correctly. Hence $\hat{R_n}(h)=0$, which is the smallest possible empirical risk, and we conclude that $\mathcal{A}$ is an ERM learner.

(2) Suppose that the target (true) classifier $h^*$ with borders $[a_i^*, b_i^*]$ has positive-labelled region $V^*:=\{x \mid h^*(x)=1\}$. Consider $h=\mathcal{A}(\mathcal{D}^n)$ with borders $[a_i, b_i]$ and positive-labelled region $V:=\{x \mid h(x)=1\}$. Let's also consider a shrunken rectangle $h_0$ (with positive-labelled region $V_0$) whose borders $[s_i, t_i]$ are defined as
\begin{align}
s_i &:= \inf\left\{s : \Pr\{x\in V^* \land x_i\in[a_i^*,s]\}\geq \frac{\epsilon}{2d}\right\}\\
t_i &:= \sup\left\{t : \Pr\{x\in V^* \land x_i\in[t,b_i^*]\}\geq \frac{\epsilon}{2d}\right\}
\end{align}
so that each of the $2d$ strips trimmed off $V^*$ to form $V_0$ carries probability mass $\frac{\epsilon}{2d}$ (assuming, for simplicity, that $P$ puts no mass on the strip boundaries). It can be shown that $V \subseteq V^*$. **(Why?)** The case $d=2$ is illustrated below:

![](https://i.imgur.com/PKf4cFU.jpg =500x)

If $V_0 \subseteq V$, then $h$ can only err on $V^*\setminus V \subseteq V^*\setminus V_0$, which is covered by the $2d$ strips. Applying the union bound over the strips, the hypothesis returned by $\mathcal{A}$ then has risk at most $\epsilon$:
$$R_P(h) \leq R_P(h_0) \leq 2d\cdot\frac{\epsilon}{2d}=\epsilon$$
The event $V_0 \subseteq V$ holds whenever every strip contains at least one training sample, since the tightest rectangle then stretches into each strip. The probability that some strip receives no sample is bounded by
$$\Pr\{V_0 \not\subseteq V\} \leq 2d\cdot\left(1-\frac{\epsilon}{2d}\right)^n\leq 2d\exp{\left(-\frac{n\epsilon}{2d}\right)}$$
Thus
$$\Pr\{R_P(h)>\epsilon\}\leq2d\exp{\left(-\frac{n\epsilon}{2d}\right)}$$
Setting the right-hand side equal to $\delta$ and solving for $n$ gives the sample complexity
$$n\geq m_{\mathcal{H}}(\epsilon,\delta) = \frac{2d}{\epsilon}\ln{\left(\frac{2d}{\delta}\right)}$$

(3) For each dimension, the algorithm only has to find the minimum and maximum coordinates among the positive training points, so the runtime is $O(nd)$. With the sample complexity above, this is polynomial in $d$, $1/\epsilon$ and $\ln{(1/\delta)}$. A minimal code sketch of $\mathcal{A}$ is given below.
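As a closing illustration, here is a minimal sketch of the tightest-rectangle learner $\mathcal{A}$ from part (1). The synthetic data generator, the true rectangle $[0.2, 0.7]^d$, and all names are illustrative choices rather than anything prescribed by the notes.

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_tightest_rectangle(X, y):
    """Tightest axis-aligned rectangle enclosing the positive points (O(n*d))."""
    pos = X[y == 1]
    if len(pos) == 0:
        # No positive samples: return an empty rectangle (predicts all 0).
        d = X.shape[1]
        return np.full(d, np.inf), np.full(d, -np.inf)
    return pos.min(axis=0), pos.max(axis=0)

def predict(rect, X):
    lower, upper = rect
    return np.all((X >= lower) & (X <= upper), axis=1).astype(int)

# Synthetic realizable data: true rectangle [0.2, 0.7]^d inside Uniform[0, 1]^d.
d, n_train, n_test = 3, 2000, 50_000
a_star, b_star = 0.2, 0.7

def label(X):
    return np.all((X >= a_star) & (X <= b_star), axis=1).astype(int)

X_train = rng.uniform(size=(n_train, d))
rect = fit_tightest_rectangle(X_train, label(X_train))

X_test = rng.uniform(size=(n_test, d))
test_error = np.mean(predict(rect, X_test) != label(X_test))
print(f"estimated risk R_P(h) with n={n_train}: {test_error:.4f}")
```

Since the learned rectangle sits inside the true one, all of its errors are false negatives near the boundary, and the estimated risk shrinks as the number of training samples grows, consistent with the sample complexity derived above.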