# 1-2: Generalization: PAC learning
## Probably Approximately Correct (PAC) Learning
### Definition
With the **probability assumption** we've made in [1-1](https://hackmd.io/@jerry1249756/rkS6MFefo), statistical guarantees can be made for some learning problems. We'll start by introducing the formal definition of the Probably Approximately Correct (PAC) learning framework:
:::success
**Definition (PAC learnable)**
Let the hypothesis set $\mathcal{H}$ be realizable, i.e., there exists some $h^*_\mathcal{H}\in\mathcal{H}$ s.t. $R_P(h^*_\mathcal{H})=0$.
$\mathcal{H}$ is said to be *Probably Approximately Correct (PAC) learnable* if
$$\forall \epsilon>0, \delta>0, \exists \mathcal{A}: \mathcal{D}^*\rightarrow\mathcal{H} \text{ s.t. } \Pr\{R_P(\hat{h}) > \epsilon\} \leq \delta$$
where $\hat{h} = \mathcal{A}(\mathcal{D}^n)$ is the predictor chosen by the learning algorithm from $n>m_{\mathcal{H}}(\epsilon, \delta)$ samples. $m_{\mathcal{H}}(⋅,⋅)$ is usually termed the *sample complexity*.
**Definition (Agnostic PAC learnable)**
The hypothesis set $\mathcal{H}$ is said to be *Agnostic PAC-learnable* if the realizability constraint is removed, that is,
$$\forall \epsilon>0, \delta>0, \exists \mathcal{A}: \mathcal{D}^*\rightarrow\mathcal{H} \text{ s.t. } \Pr\{R_P(\hat{h})-R_P(h^*_\mathcal{H}) > \epsilon\} \leq \delta$$
In the rest of these notes, "PAC learnable" will refer to this more general, agnostic version.
:::
We'll leave a simple exercise here:
:::info
**Example (Monotonicity of Sample complexity)**
Suppose that $\mathcal{H}$ is PAC learnable and its sample complexity is given by $m_{\mathcal{H}}(⋅,⋅)$.
Show that $m_{\mathcal{H}}$ is monotonically non-increasing in each of its parameters $\epsilon$ and $\delta$. That is, show that given $\delta\in(0,1)$ and $0<\epsilon_1\leq\epsilon_2<1$, we have $m_{\mathcal{H}}(\epsilon_1,\delta)\geq m_{\mathcal{H}}(\epsilon_2,\delta)$.
:::
**[Sol]**
\begin{aligned}
\Pr\{R_P(\hat{h})-R_P(h^*_\mathcal{H}) > \epsilon_1\} \leq \delta, &\forall n>m_{\mathcal{H}}(\epsilon_1,\delta)\\
\Pr\{R_P(\hat{h})-R_P(h^*_\mathcal{H}) > \epsilon_2\} \leq \delta, &\forall n>m_{\mathcal{H}}(\epsilon_2,\delta)
\end{aligned}
Since $\epsilon_1\leq\epsilon_2$, the event $\{R_P(\hat{h})-R_P(h^*_\mathcal{H}) > \epsilon_2\}$ is contained in the event $\{R_P(\hat{h})-R_P(h^*_\mathcal{H}) > \epsilon_1\}$, so whenever the first inequality holds, the second holds as well. Hence any sample size sufficient for accuracy $\epsilon_1$ is also sufficient for $\epsilon_2$, which implies $m_{\mathcal{H}}(\epsilon_1,\delta)\geq m_{\mathcal{H}}(\epsilon_2,\delta)$. A similar argument works for $\delta$.
### Bounding the estimation error
Recall the excess risk decomposition:
$$\mathcal{ER}_P(h):=R_P(h)-R_P(h^*)=\underbrace{[R_P(h)-R_P(h^*_\mathcal{H})]}_{\text{estimation error}}+\underbrace{[R_P(h^*_\mathcal{H})-R_P(h^*)]}_{\text{approximation error}}$$
The approximation error is deterministic: it depends only on the choice of $\mathcal{H}$ (and on $P$), not on the training data.
The estimation error, however, can be bounded by the **generalization gap** if we use ERM:
:::success
**Definition (Generalization gap)**
The *generalization gap* of a predictor $h$ over training data $\mathcal{D}^n$ and testing distribution $P$ is defined as
$$G_P^{\mathcal{D}^n}(h) := R_P(h) - \hat{R_n}(h)$$
:::
\begin{aligned}
R_P(\hat{h_n})-R_P(h^*_\mathcal{H}) &= R_P(\hat{h_n}) - \hat{R_n}(\hat{h_n}) + \hat{R_n}(\hat{h_n}) -R_P(h^*_\mathcal{H})\\
&\leq [R_P(\hat{h_n}) - \hat{R_n}(\hat{h_n})] + [\hat{R_n}(h^*_\mathcal{H}) -R_P(h^*_\mathcal{H})]\\
&\leq 2 \underset{h\in\mathcal{H}}{\sup}\mid R_P(h) - \hat{R_n}(h)\mid
\end{aligned}
The first inequality uses the ERM property $\hat{R_n}(\hat{h_n}) \leq \hat{R_n}(h^*_\mathcal{H})$. This immediately translates into the following:
$$\Pr\{R_P(\hat{h_n})-R_P(h^*_\mathcal{H}) \geq \epsilon \} \leq \Pr\{\underset{h\in\mathcal{H}}{\sup} \mid R_P(h)-\hat{R_n}(h)\mid \geq \frac{\epsilon}{2} \} $$
We'll use the approach called **uniform convergence**: we show that the right-hand side of the above inequality converges to 0 in probability (a small numerical illustration follows the list below).
Our main focus is to derive a *distribution-independent* bound on $G_P^{\mathcal{D}^n}(h)$ in terms of
1. training data size $n$ and
2. size of hypothesis space $\mid \mathcal{H} \mid$
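To make the idea concrete before deriving anything, here is a small illustrative sketch (my own toy setup, not part of the original derivation): a finite class of threshold classifiers $h_\theta(x)=\mathbb{1}\{x\geq\theta\}$ on $[0,1]$, with data drawn uniformly and labelled by the true threshold $0.5$, so that the true risk is $|\theta-0.5|$ in closed form. The supremum of the generalization gap over the class shrinks as $n$ grows, which is exactly the uniform convergence we are after.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy finite class: threshold classifiers h_theta(x) = 1{x >= theta} on [0, 1].
thetas = np.linspace(0, 1, 21)

def sup_gap(n):
    x = rng.uniform(0, 1, size=n)
    y = (x >= 0.5).astype(int)                      # labels from the true threshold 0.5
    emp = np.array([np.mean((x >= t).astype(int) != y) for t in thetas])  # empirical risks
    true = np.abs(thetas - 0.5)                     # true risk of h_theta under Uniform(0, 1)
    return np.max(np.abs(emp - true))               # sup over the class of the generalization gap

for n in (50, 500, 5_000, 50_000):
    print(f"n = {n:>6}: sup_h |R_P(h) - R_hat_n(h)| ~ {sup_gap(n):.4f}")
```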
Before deriving such a bound, we shall introduce some concentration inequalities.
### Concentration Inequalities
I'll assume that you already know the following basic concentration inequalities:
:::success
**Theorem (Markov's Inequality)**
For a non-negative RV $X$ whose expectation exists,
$$\forall a>0, \Pr\{X\geq a\}\leq \frac{\mathbb{E}[X]}{a} $$
**Theorem (Chebyshev's Inequality)**
For a RV $X$ whose variance exists,
$$\forall a>0, \Pr\{\mid X -\mathbb{E}[X]\mid\geq a\}\leq \frac{Var[X]}{a^2} $$
**Theorem (Chernoff Bound)**
For a RV $X$ whose moment generating function (MGF) $M_X(t)$ exists,
$$\forall a\in \mathbb{R}, \Pr\{ X\geq a\}\leq \inf_{t\geq 0}e^{-at}M_X(t) $$
:::
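As a quick sanity check (my own snippet, not from the notes), the following compares the three bounds against a Monte Carlo estimate of a tail probability for an $\mathrm{Exp}(1)$ random variable, whose mean, variance, and MGF ($M_X(t)=\frac{1}{1-t}$ for $t<1$) are known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)    # Exp(1): E[X] = Var[X] = 1

a = 10.0
empirical = np.mean(X >= a)

markov = 1.0 / a                                   # E[X] / a
chebyshev = 1.0 / (a - 1.0) ** 2                   # Var[X] / (a - E[X])^2, valid since a > E[X]
ts = np.linspace(0.0, 0.999, 1000)                 # MGF of Exp(1) is 1/(1-t) for t < 1
chernoff = np.min(np.exp(-a * ts) / (1.0 - ts))    # inf over t >= 0 of e^{-at} M_X(t)

print(f"Pr{{X >= {a}}} ~ {empirical:.2e}")
print(f"Markov   : {markov:.2e}")
print(f"Chebyshev: {chebyshev:.2e}")
print(f"Chernoff : {chernoff:.2e}")
```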
Now let's do a bit of math.
:::success
**Lemma (Hoeffding's Lemma)**
For a bounded RV $X$ with $\Pr\{X\in [a,b]\} = 1$, we have
$$\forall t\in \mathbb{R}, M_{X-\mathbb{E}[X]}(t) = \mathbb{E}[e^{t(X-\mathbb{E}[X])}]\leq \exp{\left(\frac{t^2(b-a)^2}{8}\right)}$$
:::
**[Proof]**
Without loss of generality, we can assume that $\mathbb{E}[X]=0$ and thus $a\leq0\leq b$.
Since $\mathbb{E}[(b-X)(X-a)]=-\mathbb{E}[X^2]+(a+b)\mathbb{E}[X]-ab\geq0$, the variance can be bounded as follows (this is Popoviciu's inequality):
\begin{align}
Var[X] &= \mathbb{E}[X^2] -\mathbb{E}[X]^2 \leq (a+b)\mathbb{E}[X]-ab -\mathbb{E}[X]^2\\
&=(b-\mathbb{E}[X])(\mathbb{E}[X]-a) \leq \frac{1}{4}(b-a)^2
\end{align}
where the last inequality utilizes the AM-GM Inequality.
Note that the expectation (resp. variance) is the 1st (resp. 2nd) derivative of $C_X(t) := \ln M_X(t)$ (the **cumulant generating function**, CGF) evaluated at $t=0$:
$$\frac{d}{dt}C_X(t)\biggr|_{t=0} = \frac{M_X'(0)}{M_X(0)}=\mathbb{E}[X]$$
$$\frac{d^2}{dt^2}C_X(t)\biggr|_{t=0} = \frac{d}{dt}\frac{M_X'(t)}{M_X(t)}\biggr|_{t=0}=\frac{M_X^{''}(0)M_X(0)-M_X^{'}(0)^2}{M_X(0)^2} =Var[X]$$
By Taylor's theorem at $t=0$, there exists a $\theta$ between $0$ and $t$ such that
$$C_X(t)=C_X(0) + C_X^{'}(0)t+\frac{C_X^{''}(\theta)}{2}t^2\leq \frac{t^2(b-a)^2}{8}$$
where the inequality holds because $C_X(0)=0$, $C_X^{'}(0)=\mathbb{E}[X]=0$, and $C_X^{''}(\theta)$ is the variance of $X$ under the exponentially tilted distribution $\mathrm{d}P_\theta \propto e^{\theta x}\,\mathrm{d}P$, which is still supported on $[a,b]$ and thus at most $\frac{(b-a)^2}{4}$ by the bound above.
Thus, we have
$$M_X(t)\leq \exp{\left(\frac{t^2(b-a)^2}{8}\right)}$$
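To see the lemma in a concrete case (an illustration I've added, not part of the proof): for $X\sim\mathrm{Bernoulli}(p)$ on $[0,1]$ the centered MGF is available in closed form, $\mathbb{E}[e^{t(X-p)}]=pe^{t(1-p)}+(1-p)e^{-tp}$, and a quick numerical scan confirms it never exceeds $e^{t^2/8}$.

```python
import numpy as np

def centered_mgf_bernoulli(t, p):
    """E[exp(t(X - E[X]))] for X ~ Bernoulli(p), computed in closed form."""
    return p * np.exp(t * (1 - p)) + (1 - p) * np.exp(-t * p)

def hoeffding_bound(t, a=0.0, b=1.0):
    """Right-hand side of Hoeffding's Lemma: exp(t^2 (b-a)^2 / 8)."""
    return np.exp(t ** 2 * (b - a) ** 2 / 8)

ts = np.linspace(-5, 5, 201)
for p in (0.1, 0.5, 0.9):
    gap = hoeffding_bound(ts) - centered_mgf_bernoulli(ts, p)
    assert np.all(gap >= 0), f"lemma violated for p={p}?!"
    print(f"p={p}: min slack over t = {gap.min():.4f} (0 at t=0, where both sides equal 1)")
```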
:::success
**Theorem (Hoeffding's Inequality)**
For independent bounded RVs $X_1, ..., X_n$ with $\Pr\{X_i\in [a_i,b_i]\} = 1$ for all $i=1,...,n$, we have
$$\forall \epsilon >0, \Pr\{\biggr| \sum_{i=1}^n \left( X_i -\mathbb{E}[X_i]\right) \biggr| \geq \epsilon\}\leq 2\exp{\left(-\frac{2\epsilon^2}{\sum_{i=1}^n(b_i-a_i)^2}\right)}$$
:::
**[Proof]**
With the Chernoff bound and Hoeffding's Lemma, for any $t\geq 0$ we have
$$\Pr\{\biggr| \sum_{i=1}^n \left( X_i -\mathbb{E}[X_i]\right) \biggr| \geq \epsilon\}\leq 2e^{-t\epsilon} \prod_{i=1}^n M_{X_i-\mathbb{E}[X_i]}(t)\leq2\exp{\left(-t\epsilon+ \sum_{i=1}^n\frac{t^2(b_i-a_i)^2}{8}\right)}$$
Minimizing the exponent $-t\epsilon + \frac{t^2}{8}\sum_{i=1}^n(b_i-a_i)^2$ over $t\geq 0$ yields the optimal choice
$$t=\frac{4\epsilon}{\sum_{i=1}^n(b_i-a_i)^2}\implies \Pr\{\biggr| \sum_{i=1}^n \left( X_i -\mathbb{E}[X_i]\right) \biggr| \geq \epsilon\}\leq 2\exp{\left(-\frac{2\epsilon^2}{\sum_{i=1}^n(b_i-a_i)^2}\right)}$$
If we apply the result to the average of Bernoulli RVs (so that $b_i-a_i=1$), the inequality becomes
$$\Pr\{\biggr|\frac{1}{n} \sum_{i=1}^n \left( X_i -\mathbb{E}[X_i]\right) \biggr| \geq \epsilon\}\leq 2\exp{\left(-2n\epsilon^2 \right)}$$
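A quick Monte Carlo check of the last inequality (my own sketch, using Bernoulli 0-1 losses): the observed frequency of large deviations of the sample mean indeed stays below $2\exp(-2n\epsilon^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, eps, trials = 0.3, 200, 0.05, 100_000

# Each row is one "training set" of n Bernoulli(p) losses; deviation of the sample mean from p.
samples = rng.binomial(1, p, size=(trials, n))
deviation = np.abs(samples.mean(axis=1) - p)

empirical = np.mean(deviation >= eps)
hoeffding = 2 * np.exp(-2 * n * eps ** 2)

print(f"empirical Pr{{|mean - p| >= {eps}}} = {empirical:.4f}")
print(f"Hoeffding bound                  = {hoeffding:.4f}")   # always an upper bound
```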
## Some examples
### Finite hypothesis sets
Now let's look at a few examples.
:::info
**Example (Single Classifier)**
Suppose that there is only one classifier in the hypothesis set, i.e., $\mathcal{H}=\{h\}$.
For a *consistent* learner, i.e., one with $\hat{R_n}(h)=0$, we have
$$\Pr\{R_P(h)>\frac{1}{n}\ln{\left(\frac{1}{\delta}\right)} \}\leq \delta$$
For an *inconsistent* learner, we have
$$\Pr\{ \biggr|\hat{R_n}(h) - R_P(h) \biggr|> \sqrt{\frac{1}{2n}\ln{\left(\frac{2}{\delta}\right)}} \}\leq \delta$$
:::
**[Proof]**
(1)
Let's first look at the probability of drawing a consistent sample, $\hat{R_n}(h)=0$, when $R_P(h)>\epsilon$:
$$\Pr\{\hat{R_n}(h)=0\} = (1-R_P(h))^n \leq (1-\epsilon)^n\leq e^{-n\epsilon}$$
Substituting $\delta=e^{-n\epsilon}$, i.e., $\epsilon=\frac{1}{n}\ln{\left(\frac{1}{\delta}\right)}$, leads to
$$\Pr\{R_P(h)>\frac{1}{n}\ln{\left(\frac{1}{\delta}\right)} \}\leq \delta$$
(2)
Utilizing Hoeffding's Inequality,
$$\Pr\{ \biggr|\hat{R_n}(h) - R_P(h) \biggr|> \epsilon \}\leq 2\exp{\left(-2n\epsilon^2 \right)}$$
A similar substitution yields the stated result.
Note that the consistent case has a tighter bound.
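To see how much tighter, the following sketch (my own, simply inverting the two bounds above) computes the sample size each bound asks for at a given accuracy $\epsilon$ and confidence $\delta$: the consistent bound scales like $1/\epsilon$, the Hoeffding-based one like $1/\epsilon^2$.

```python
import math

def n_consistent(eps, delta):
    """Samples so that (1/n) ln(1/delta) <= eps, i.e. the consistent-case bound."""
    return math.ceil(math.log(1 / delta) / eps)

def n_inconsistent(eps, delta):
    """Samples so that sqrt(ln(2/delta) / (2n)) <= eps, i.e. the Hoeffding-based bound."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

eps, delta = 0.05, 0.01
print("consistent  :", n_consistent(eps, delta))    # ~ 93   (scales as 1/eps)
print("inconsistent:", n_inconsistent(eps, delta))  # ~ 1060 (scales as 1/eps^2)
```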
:::info
**Example (Finite number of Classifiers)**
Suppose that the hypothesis set is finite, i.e., $\mid\mathcal{H}\mid<\infty$. Then, simultaneously for all $h\in\mathcal{H}$,
$$\Pr\{ \biggr|\hat{R_n}(h) - R_P(h) \biggr|>\sqrt{\frac{1}{2n}\ln{\left(\frac{2\mid\mathcal{H}\mid}{\delta}\right)}}\}\leq \delta$$
:::
**[Proof]**
By the union bound,
\begin{align}
\Pr\{\exists h\in\mathcal{H} \text{ s.t. } \biggr|\hat{R_n}(h) - R_P(h) \biggr|> \epsilon\}&= \Pr\{\bigcup_{h\in\mathcal{H}} \biggr|\hat{R_n}(h) - R_P(h) \biggr|> \epsilon\}\\
&\leq\sum_{h\in\mathcal{H}}\Pr\{ \biggr|\hat{R_n}(h) - R_P(h) \biggr|> \epsilon\}\leq 2\mid\mathcal{H}\mid\exp{\left(-2n\epsilon^2 \right)}
\end{align}
A similar substitution yields the stated result.
Note a few implications of this bound:
* For finite $\mathcal{H}$, as $n\to \infty$, $\hat{R_n}(h) \simeq R_P(h)$ uniformly over $h\in\mathcal{H}$.
* If the algorithm $\mathcal{A}$ can find some $\hat{h_n}$ with $\hat{R_n}(\hat{h_n})\to0$, then we have a PAC guarantee that this learning problem is solvable!
* There is a tradeoff governed by the size of $\mathcal{H}$ (see the quantitative sketch after the table):
| $\mid \mathcal{H} \mid$ | $G_P^{\mathcal{D}^n}(h)$ | $\hat{R_n}(h)$ |
| ----------------------- | ------------------------ | -------------- |
| Small | Small | Large |
| Large | Large | Small |
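The tradeoff can be made quantitative with the bound just derived. The snippet below (illustrative, my own) evaluates the uniform gap bound $\sqrt{\frac{1}{2n}\ln\frac{2\mid\mathcal{H}\mid}{\delta}}$ for a few values of $n$ and $\mid\mathcal{H}\mid$: the gap grows only logarithmically in $\mid\mathcal{H}\mid$ but shrinks as $1/\sqrt{n}$.

```python
import math

def gap_bound(n, H_size, delta=0.05):
    """Uniform bound on |R_hat_n(h) - R_P(h)| over a finite class (union bound + Hoeffding)."""
    return math.sqrt(math.log(2 * H_size / delta) / (2 * n))

for n in (100, 1_000, 10_000):
    cells = "  ".join(f"|H|=1e{k}: {gap_bound(n, 10**k):.3f}" for k in (1, 3, 6))
    print(f"n={n:>6} | {cells}")
```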
### An uncountably infinite $\mathcal{H}$
:::info
**Example (Axis-aligned rectangle)**
Consider the class of axis-aligned rectangle classifiers in $\mathbb{R}^d$, parametrized by corner vectors $\textbf{A}=(a_1,\dots,a_d)$ and $\textbf{B}=(b_1,\dots,b_d)$ and denoted by
$$h_{\textbf{(A,B)}}(\textbf{x})=\begin{cases}
1, &a_i \leq x_i \leq b_i, \forall i=1,...,d\\
0, &\text{otherwise}
\end{cases}$$
Note that $\mathcal{H}$ is an uncountable set.
(1) Assume that $R_P(h^*_\mathcal{H})=0$ (consistent). Let $\mathcal{A}$ be a learning algorithm that returns the tightest rectangle enclosing all positive points. Show that $\mathcal{A}$ is an ERM learner.
(2) Show that this problem is PAC learnable with sample complexity of
$$m_{\mathcal{H}}(\epsilon,\delta) = \frac{2d}{\epsilon}\ln{\left(\frac{2d}{\delta}\right)}$$
(3) Show the runtime of $\mathcal{A}$ is polynomial in $d$, $1/\epsilon$ and $\ln{(1/\delta)}$.
:::
**[Sol]**
(1) In the realizable case, the tightest rectangle enclosing all positive training points attains zero empirical risk: it classifies every positive sample correctly by construction, and since it is contained in the true rectangle it cannot mislabel any negative sample. We conclude that $\mathcal{A}$ is an ERM learner.
(2)
Suppose that the target (true) classifier $h^*$ with border $[a_i^*, b_i^*]$ has positive-labelled space $V^*:=\{x \mid h^*(x)=1\}$.
Consider $h=\mathcal{A}(\mathcal{D}^n)$ with border $[a_i, b_i]$ and positive-labelled space $V:=\{x \mid h(x)=1\}$.
Let's also consider $h_0$ (with positive-labelled space $V_0$), a tighter rectangle whose border $[s_i, t_i]$ along each dimension $i$ is chosen so that each of the $2d$ strips of $V^*$ cut off by $V_0$ carries probability mass $\frac{\epsilon}{2d}$ (assuming for simplicity that such a choice exists, e.g. continuous marginals):
\begin{align}
s_i &:= \inf\left\{s : \Pr\{x\in V^* \land x_i\in[a_i^*,s]\}\geq \frac{\epsilon}{2d}\right\}\\
t_i &:= \sup\left\{t : \Pr\{x\in V^* \land x_i\in[t,b_i^*]\}\geq \frac{\epsilon}{2d}\right\}
\end{align}
It can be shown that $V \subset V^*$. **(Why?)**
The example in the case $d=2$ is shown below:

If $V_0 \subseteq V$, then by the union bound over the $2d$ strips, the hypothesis returned by $\mathcal{A}$ has risk at most $\epsilon$:
$$R_P(h) \leq R_P(h_0) \leq 2d\cdot\frac{\epsilon}{2d}=\epsilon$$
However, $V_0 \subseteq V$ holds only if every one of the $2d$ strips contains at least one of the $n$ samples; by the union bound, the failure event has probability bound
$$\Pr\{V_0 \not\subset V\} \leq 2d\cdot\left(1-\frac{\epsilon}{2d}\right)^n\leq 2d\exp{\left(-\frac{n\epsilon}{2d}\right)}$$
Thus
$$\Pr\{R_P(h)>\epsilon\}\leq2d\exp{\left(-\frac{n\epsilon}{2d}\right)}$$
Setting the right-hand side to $\delta$ and solving for $n$ gives the sample complexity
$$n\geq m_{\mathcal{H}}(\epsilon,\delta) = \frac{2d}{\epsilon}\ln{\left(\frac{2d}{\delta}\right)}$$
(3)
For each dimension, the algorithm only has to find the minimum and maximum coordinates among the positive instances, so its runtime is $O(nd)$. Plugging in $n = m_{\mathcal{H}}(\epsilon,\delta)$ from (2), the runtime is polynomial in $d$, $1/\epsilon$ and $\ln{(1/\delta)}$.
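For completeness, here is a minimal sketch of $\mathcal{A}$ (my own implementation, assuming the data come as NumPy arrays with labels in $\{0,1\}$): fitting is one min/max per dimension over the positive points, which makes the $O(nd)$ runtime explicit.

```python
import numpy as np

class TightestRectangle:
    """ERM learner for axis-aligned rectangles under the realizability assumption."""

    def fit(self, X, y):
        # Tightest rectangle enclosing all positive training points: one min/max per dimension.
        pos = X[y == 1]
        if len(pos) == 0:                      # no positive samples: predict all-negative
            self.a_ = np.full(X.shape[1], np.inf)
            self.b_ = np.full(X.shape[1], -np.inf)
        else:
            self.a_, self.b_ = pos.min(axis=0), pos.max(axis=0)
        return self

    def predict(self, X):
        return np.all((X >= self.a_) & (X <= self.b_), axis=1).astype(int)

# Quick check against a known target rectangle [0.2, 0.8]^2 in d = 2.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = np.all((X >= 0.2) & (X <= 0.8), axis=1).astype(int)

h = TightestRectangle().fit(X, y)
X_test = rng.uniform(0, 1, size=(20_000, 2))
y_test = np.all((X_test >= 0.2) & (X_test <= 0.8), axis=1).astype(int)
print("empirical risk:", np.mean(h.predict(X) != y))            # 0 by construction
print("test risk     :", np.mean(h.predict(X_test) != y_test))  # small, shrinks with n
```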
###### tags: `machine learning`