# 1-4: Rademacher Complexity
## Fundamental Theorem of Statistical Learning
Let's sum up what we've learned so far.
:::success
**Theorem (Fundamental Theorem of Statistical Learning)**
Consider the binary classification setting with $\mathcal{Y}=\{\pm 1 \}$ and 0/1 loss. The following statements are equivalent: (Proof omitted)
* The Hypothesis set $\mathcal{H}$ is PAC learnable.
* The Hypothesis set $\mathcal{H}$ has finite VC dimension.
* The Hypothesis set $\mathcal{H}$ has the uniform convergence property.
* ERM is a successful learning algorithm for the Hypothesis set $\mathcal{H}$.
:::
## Rademacher Complexity
<font size=8>
Warning of Hard Math!!!
</font>
### Motivation
One important property of the VC dimension is that it is *distribution independent*. Hence, it allows us to obtain bounds that hold for any distribution. On the other hand, the bounds may not be tight for some specific distributions. Furthermore, the concepts used in defining the VC dimension apply only to binary classification, but we are often interested in generalization bounds for multiclass classification and regression as well.
**Rademacher complexity** is a more modern notion of complexity that is *distribution and dataset dependent* and is defined for any class of real-valued functions. We'll provide a generalization bound based on this measure in the following.
However, computing it exactly is NP-hard in general, so we'll later relate it to the growth function that we've learned.
We'll continue to use $\mathcal{H}$ as the hypothesis space and $h$ as an element of $\mathcal{H}$. In the following, we'll use $\mathcal{G}$ as a *family of loss functions associated with* $\mathcal{H}$.
Given a space $\mathcal{Z}$, let $\mathcal{D}^n = (z_1,z_2, ..., z_n)$ be i.i.d. samples drawn from $\mathcal{Z}$ (we can just imagine that $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$) according to some unknown distribution $P$.
Let $\mathcal{G}$ be a family of functions $g: \mathcal{Z} \rightarrow \mathbb{R}$. (We can just view each $g$ as a loss function.)
:::success
**Definition (Rademacher Complexity)**
The *Empirical Rademacher Complexity* of $\mathcal{G}$ is defined as
$$\hat{\mathfrak{R}}_{\mathcal{D}^n}(\mathcal{G}):=\mathbb{E}_{\boldsymbol\sigma}\left[ \underset{g\in\mathcal{G}}{\sup}\frac{1}{n}\sum_{i=1}^n \sigma_ig(z_i) \right]$$
where $\boldsymbol\sigma = (\sigma_1, ..., \sigma_n)$, $\sigma_1, ..., \sigma_n$ are independent uniform random variables chosen from $\{-1, +1\}$, which are called *Rademacher random variables*.
The *Rademacher Complexity* of $\mathcal{G}$ is then the expectation of the Empirical Rademacher Complexity:
$$\mathfrak{R}_n(\mathcal{G}):= \mathbb{E}_{\mathcal{D}^n\sim P^n} [\hat{\mathfrak{R}}_{\mathcal{D}^n}(\mathcal{G})]$$
:::
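To get a feel for the definition, here is a minimal Monte Carlo sketch that estimates the empirical Rademacher complexity of a small finite function class by averaging the supremum over randomly drawn sign vectors. The toy data and the threshold-classifier loss class at the bottom are illustrative assumptions, not anything fixed by the notes.

```python
import numpy as np

def empirical_rademacher(values, n_trials=2000, rng=None):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    values: array of shape (|G|, n), where values[j, i] = g_j(z_i),
            i.e. each row is one function g evaluated on the sample.
    """
    rng = np.random.default_rng(rng)
    n = values.shape[1]
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher random variables
        total += np.max(values @ sigma) / n       # sup over g of (1/n) sum_i sigma_i g(z_i)
    return total / n_trials

# Illustrative finite class: 0/1 losses of threshold classifiers on a toy sample.
rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=20))
y = np.sign(x + 0.1 * rng.normal(size=20))
thresholds = np.linspace(-3, 3, 50)
# values[j, i] = 1 if the j-th threshold classifier mislabels (x_i, y_i), else 0
values = np.array([(np.sign(x - t) != y).astype(float) for t in thresholds])
print("estimated empirical Rademacher complexity:", empirical_rademacher(values))
```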
### Generalization Bound
In deriving generalization bounds using Rademacher complexity, we will make use of the following concentration bound:
:::success
**Theorem (McDiarmid's Inequality)**
Let $Z_1, ..., Z_n$ be independent random variables taking values in a set $\mathcal{Z}$ and let $c_1, ..., c_n$ be positive real numbers. If a function $f:\mathcal{Z}^n \rightarrow \mathbb{R}$ satisfies the *bounded difference property*
$$\forall 1 \leq i\leq n, \quad \underset{z_1, ..., z_n, z_i'}{\sup}\mid f(z_1, ..., z_i, ..., z_n) - f(z_1, ..., z_i', ..., z_n)\mid \leq c_i$$
then
$$\Pr\{ | f(Z_1, ..., Z_n) - \mathbb{E}[f(Z_1, ..., Z_n)] |\geq \epsilon \}\leq 2\exp{\left(\frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2}\right)}$$
:::
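As a quick sanity check on the inequality, consider the sample mean $f(Z_1, ..., Z_n)=\frac{1}{n}\sum_{i=1}^n Z_i$ with each $Z_i\in[a,b]$: changing one coordinate changes $f$ by at most $\frac{b-a}{n}$, so $c_i=\frac{b-a}{n}$ and $\sum_{i=1}^n c_i^2 = \frac{(b-a)^2}{n}$, and McDiarmid's inequality reduces to Hoeffding's inequality:
$$\Pr\left\{ \left| \frac{1}{n}\sum_{i=1}^n Z_i - \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n Z_i\right] \right| \geq \epsilon \right\} \leq 2\exp{\left(\frac{-2n\epsilon^2}{(b-a)^2}\right)}$$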
We are now ready to present the generalization bounds based on Rademacher complexity. Note that now the functions can be arbitrary real-valued functions.
:::success
**Theorem (One-sided Rademacher Complexity Bound)**
Let $\mathcal{G}$ be a family of functions $g: \mathcal{Z} \rightarrow [a,b]$ with $a<b$.
Then the following statements hold:
1. With probability at least $1-\delta$ over the draw of $\mathcal{D}^n$, for all $g\in\mathcal{G}$,
$$\mathbb{E}[g(Z)] \leq \frac{1}{n}\sum_{i=1}^ng(Z_i) + 2\mathfrak{R}_n(\mathcal{G}) +(b-a)\sqrt{\frac{\ln{(1/\delta)}}{2n}}$$
2. In terms of the particular drawn dataset $\mathcal{D}^n$, with probability at least $1-\delta$, for all $g\in\mathcal{G}$,
$$\mathbb{E}[g(Z)] \leq \frac{1}{n}\sum_{i=1}^ng(Z_i) + 2\hat{\mathfrak{R}}_{\mathcal{D}^n}(\mathcal{G}) +3(b-a)\sqrt{\frac{\ln{(2/\delta)}}{2n}}$$
:::
**[proof]**
Let's apply McDiarmid's Inequality to the following function:
$$\Phi(\mathcal{D}^n) = \underset{g\in\mathcal{G}}{\sup} \left( \mathbb{E}[g(Z)] - \hat{\mathbb{E}}_{\mathcal{D}^n}[g(Z)] \right) := \underset{g\in\mathcal{G}}{\sup} \left( \mathbb{E}[g(Z)] - \frac{1}{n}\sum_{i=1}^n g(Z_i) \right)$$
Since $\sup_xf_1(x)- \sup_xf_2(x) \leq \sup_x \left( f_1(x)-f_2(x) \right)$, if we consider another sample $\mathcal{D}^{n'}$ that differs from $\mathcal{D}^n$ only in the $j$-th example (i.e. $Z_i' = Z_i$ for all $i\neq j$), we will have the following:
\begin{align}
\Phi(\mathcal{D}^n) - \Phi(\mathcal{D}^{n'}) &= \underset{g\in\mathcal{G}}{\sup} \left( \mathbb{E}[g(Z)] - \frac{1}{n}\sum_{i=1}^n g(Z_i) \right) - \underset{g\in\mathcal{G}}{\sup} \left( \mathbb{E}[g(Z)] - \frac{1}{n}\sum_{i=1}^n g(Z_i') \right)\\
&\leq \underset{g\in\mathcal{G}}{\sup} \frac{1}{n}\sum_{i=1}^n\left( g(Z_i')-g(Z_i) \right) = \underset{g\in\mathcal{G}}{\sup} \frac{g(Z_j')-g(Z_j)}{n} \leq \frac{b-a}{n}
\end{align}
Similarly, we have
$$\Phi(\mathcal{D}^{n'}) - \Phi(\mathcal{D}^n)\leq \frac{b-a}{n} \implies |\Phi(\mathcal{D}^n) - \Phi(\mathcal{D}^{n'}) |\leq \frac{b-a}{n}$$
So $\Phi$ satisfies the bounded difference property with $c_i = \frac{b-a}{n}$, and we can apply McDiarmid's Inequality (the one-sided version, so the coefficient 2 in front is removed):
$$\Pr\{ \Phi(\mathcal{D}^n) - \mathbb{E}[\Phi(\mathcal{D}^n)] \geq \epsilon \}\leq \exp{\left(\frac{-2n\epsilon^2}{(b-a)^2}\right)}$$
Substituting $\delta = \exp{\left(\frac{-2n\epsilon^2}{(b-a)^2}\right)}$, i.e. $\epsilon = (b-a)\sqrt{\frac{\ln{(1/\delta)}}{2n}}$, it remains to show that $\mathbb{E}[\Phi(\mathcal{D}^n)] \leq 2\mathfrak{R}_n(\mathcal{G})$.
We'll utilize a symmetrization argument. We introduce a "ghost sample" $\mathcal{D}^{n'} = \{Z'_1, ... , Z'_n\}$ (different from the sample used above) drawn i.i.d. from the same distribution as $\mathcal{D}^n$.
\begin{align}
\mathbb{E}[\Phi(\mathcal{D}^n)] &= \mathbb{E}_{\mathcal{D}^n}\left[ \underset{g\in\mathcal{G}}{\sup} \left( \mathbb{E}[g(Z)] - \frac{1}{n}\sum_{i=1}^n g(Z_i) \right) \right]\\
&= \mathbb{E}_{\mathcal{D}^n}\left[ \underset{g\in\mathcal{G}}{\sup} \left( \mathbb{E}_{\mathcal{D}^{n'}}[\frac{1}{n}\sum_{i=1}^n g(Z_i')] - \frac{1}{n}\sum_{i=1}^n g(Z_i) \right) \right]\\
&\leq \mathbb{E}_{\mathcal{D}^n, \mathcal{D}^{n'}}\left[ \underset{g\in\mathcal{G}}{\sup} \frac{1}{n}\sum_{i=1}^n \left( g(Z_i') - g(Z_i) \right)\right]\\
&=\mathbb{E}_{\mathcal{D}^n, \mathcal{D}^{n'}, \boldsymbol\sigma}\left[ \underset{g\in\mathcal{G}}{\sup} \frac{1}{n}\sum_{i=1}^n \sigma_i\left( g(Z_i') - g(Z_i) \right)\right]\\
&\leq\mathbb{E}_{\mathcal{D}^{n'},\boldsymbol\sigma}\left[ \underset{g\in\mathcal{G}}{\sup} \frac{1}{n}\sum_{i=1}^n \sigma_i g(Z_i') \right] + \mathbb{E}_{\mathcal{D}^n,\boldsymbol\sigma} \left[ \underset{g\in\mathcal{G}}{\sup} \frac{1}{n}\sum_{i=1}^n (-\sigma_i) g(Z_i) \right]=2\mathfrak{R}_n(\mathcal{G})
\end{align}
Some techniques in the proof:
* 2nd equality: the expectation of the empirical mean over the ghost sample equals the true expectation $\mathbb{E}[g(Z)]$.
* 3rd inequality: Jensen's inequality (the supremum is a convex function), moving the inner expectation outside the supremum.
* 4th equality: $g(Z_i') - g(Z_i)$ is a symmetric random variable ($X$ and $-X$ have the same distribution), so inserting the Rademacher signs does not change the expectation.
* 5th step: sub-additivity of the supremum; the final equality holds because $-\sigma_i$ has the same distribution as $\sigma_i$ and $\mathcal{D}^{n'}$ has the same distribution as $\mathcal{D}^n$.
The second part of the theorem follows by applying McDiarmid's inequality once more, this time to the Empirical Rademacher Complexity, which also satisfies the bounded difference property with $c_i = \frac{b-a}{n}$. That is, with probability at least $1-\delta$,
$$\mathfrak{R}_n(\mathcal{G}) \leq\hat{\mathfrak{R}}_{\mathcal{D}^n}(\mathcal{G}) + (b-a)\sqrt{\frac{\ln{(1/\delta)}}{2n}}$$
Replacing $\delta$ with $\frac{\delta}{2}$ in this inequality and in the first part, and combining the two events with a union bound, proves the second part of the theorem.
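As an aside, once the empirical Rademacher complexity has been estimated (for example with a Monte Carlo sketch like the one after the definition above), the data-dependent bound in part 2 of the theorem is a one-line computation. Below is a minimal helper; the loss range $[a,b]=[0,1]$, the random losses, and the complexity value `0.05` are all illustrative assumptions.

```python
import numpy as np

def rademacher_generalization_bound(emp_losses, emp_rademacher, delta, a=0.0, b=1.0):
    """Data-dependent upper bound on E[g(Z)] from part 2 of the theorem:
    empirical mean + 2 * empirical Rademacher complexity + 3(b-a)sqrt(ln(2/delta)/(2n)).
    """
    n = len(emp_losses)
    return (np.mean(emp_losses)
            + 2.0 * emp_rademacher
            + 3.0 * (b - a) * np.sqrt(np.log(2.0 / delta) / (2.0 * n)))

# Example: 0/1 losses on 100 samples, with an assumed estimated complexity of 0.05.
losses = np.random.default_rng(1).integers(0, 2, size=100).astype(float)
print(rademacher_generalization_bound(losses, emp_rademacher=0.05, delta=0.05))
```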
### Relation with Initial Hypothesis Space
We used $\mathcal{G}$, the loss class, in the above derivation. However, this class is closely related to the original hypothesis space $\mathcal{H}$.
:::success
**Lemma (Relation between $\mathcal{G}$ and $\mathcal{H}$)**
Let $\mathcal{G} = l \circ \mathcal{H} := \{ g \mid g(x,y)=l(h(x), y),\ h\in \mathcal{H}\}$ be the loss class induced from $\mathcal{H}$.
If we are in the binary classification setting with output space $\mathcal{Y}=\{\pm 1\}$ and $l$ is the 0/1 loss, then
$$\mathfrak{R}_n(\mathcal{G}) = \frac{1}{2}\mathfrak{R}_n(\mathcal{H})$$
:::
**[proof]**
Utilizing the analytical expression of the 0/1 loss (each $g\in\mathcal{G}$ corresponds to an $h\in\mathcal{H}$),
\begin{align}
\mathfrak{R}_n(\mathcal{G}) &= \mathbb{E}_{\mathcal{D}^n,\boldsymbol\sigma}\left[ \underset{h\in\mathcal{H}}{\sup}\frac{1}{n}\sum_{i=1}^n \sigma_i \mathbb{I}\{Y_i \neq h(X_i)\} \right]\\
&=\mathbb{E}_{\mathcal{D}^n,\boldsymbol\sigma}\left[ \underset{h\in\mathcal{H}}{\sup}\frac{1}{n}\sum_{i=1}^n \sigma_i \frac{1-Y_ih(X_i)}{2} \right]\\
&= \frac{1}{2}\mathbb{E}_{\mathcal{D}^n,\boldsymbol\sigma}\left[ \underset{h\in\mathcal{H}}{\sup}\frac{1}{n}\sum_{i=1}^n (-\sigma_iY_i)h(X_i) \right] = \frac{1}{2}\mathfrak{R}_n(\mathcal{H})
\end{align}
For the third equality, the term $\frac{1}{2n}\sum_{i=1}^n\sigma_i$ does not depend on $h$, so it can be pulled out of the supremum, and its expectation is $0$. For the last equality, note that $-\sigma_iY_i$ and $\sigma_i$ have the same distribution.
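As a quick check of the identity $\mathbb{I}\{y \neq h(x)\} = \frac{1-yh(x)}{2}$ used in the second equality: since $y, h(x) \in \{\pm 1\}$, if $y = h(x)$ then $yh(x)=1$ and $\frac{1-yh(x)}{2}=0=\mathbb{I}\{y\neq h(x)\}$, while if $y \neq h(x)$ then $yh(x)=-1$ and $\frac{1-yh(x)}{2}=1=\mathbb{I}\{y\neq h(x)\}$.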
### Relation with Growth Function
We can bound the Rademacher Complexity in terms of the growth function.
:::success
**Lemma (Massart’s)**
Let $A\subseteq\mathbb{R}^n$ be a finite set ($|A|<\infty$) and $r=\sup_{\textbf{u}\in A} ||\textbf{u}||_2$, where $|| \cdot ||_2$ denotes the $L^2$ norm. Then,
$$\mathbb{E}_{\boldsymbol\sigma}\left[ \underset{\textbf{u}\in A}{\sup}\frac{1}{n}\sum_{i=1}^n \sigma_iu_i \right] \leq \frac{r\sqrt{2\ln{|A|}}}{n}$$
:::
**[proof]**
\begin{align}
\exp{\left(t\mathbb{E}_{\boldsymbol\sigma}\left[ \underset{\textbf{u}\in A}{\sup}\sum_{i=1}^n \sigma_iu_i \right]\right)} &\leq \mathbb{E}_{\boldsymbol\sigma}\left[\exp{\left(t \underset{\textbf{u}\in A}{\sup}\sum_{i=1}^n \sigma_iu_i \right)}\right] \\
&= \mathbb{E}_{\boldsymbol\sigma}\left[\underset{\textbf{u}\in A}{\sup}\exp{\left(t \sum_{i=1}^n \sigma_iu_i \right)}\right]\\
&\leq \sum_{\textbf{u}\in A}\mathbb{E}_{\boldsymbol\sigma}\left[\exp{\left(t \sum_{i=1}^n \sigma_iu_i \right)}\right] \\
&= \sum_{\textbf{u}\in A}\prod_{i=1}^n \mathbb{E}_{\boldsymbol\sigma}\left[ \exp{\left(t\sigma_iu_i \right)}\right]\\
&\leq \sum_{\textbf{u}\in A}\prod_{i=1}^n \exp{\left(\frac{t^2(2u_i)^2}{8}\right)} \leq |A| \exp{\left(\frac{t^2r^2}{2}\right)}
\end{align}
Some techniques in the proof:
* 1st inequality: Jensen's inequality ($\exp$ is convex).
* 2nd equality: $\exp$ is increasing, so it can be exchanged with the supremum.
* 3rd inequality: bound the supremum over the finite set $A$ by the sum over $A$.
* 4th equality: independence of the $\sigma_i$'s.
* 5th inequality: since $\sigma_iu_i$ is bounded in $[-|u_i|, |u_i|]$, we can apply Hoeffding's lemma to bound each factor.
Taking the logarithm on both sides and dividing by $t>0$, we have
$$\mathbb{E}_{\boldsymbol\sigma}\left[ \underset{\textbf{u}\in A}{\sup}\sum_{i=1}^n \sigma_iu_i \right] \leq \frac{\ln{|A|}}{t}+\frac{tr^2}{2}$$
Taking $t=\frac{\sqrt{2\ln{|A|}}}{r}$, which minimizes this bound, leads to
$$\mathbb{E}_{\boldsymbol\sigma}\left[ \underset{\textbf{u}\in A}{\sup}\sum_{i=1}^n \sigma_iu_i \right] \leq r\sqrt{2\ln{|A|}}$$
Dividing both sides by $n$ completes the proof.
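As a quick numerical illustration, the sketch below compares Massart's bound with a direct Monte Carlo estimate of the left-hand side for a finite set $A\subseteq\mathbb{R}^n$; the random Gaussian set and the dimensions are arbitrary assumed choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_vectors = 50, 30
A = rng.normal(size=(num_vectors, n))    # finite set A of vectors in R^n (one per row)
r = np.max(np.linalg.norm(A, axis=1))    # r = sup of L2 norms over A

# Monte Carlo estimate of E_sigma[ sup_{u in A} (1/n) sum_i sigma_i u_i ]
trials = 5000
estimates = []
for _ in range(trials):
    sigma = rng.choice([-1.0, 1.0], size=n)
    estimates.append(np.max(A @ sigma) / n)

print("Monte Carlo estimate:", np.mean(estimates))
print("Massart bound       :", r * np.sqrt(2 * np.log(num_vectors)) / n)
```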
:::success
**Corollary (Relation with Growth function)**
In the binary classification setting with output space $\mathcal{Y}=\{\pm 1\}$, one has
$$\mathfrak{R}_n(\mathcal{H})\leq \sqrt{\frac{2\ln{G_{\mathcal{H}}(n)}}{n}} $$
:::
**[proof]**
Given the sample, write $S=(x_1, ..., x_n)$ and let $\mathcal{H}_S:=\{(h(x_1),...,h(x_n))\mid h\in \mathcal{H}\}$ denote all possible labelings of $S$ realized by $\mathcal{H}$. Since the outputs take values only in $\mathcal{Y}=\{\pm 1\}$, the maximum norm of an output vector is $r=\sqrt{n}$. Then by Massart's lemma and the definition of the growth function ($|\mathcal{H}_S|\leq G_{\mathcal{H}}(n)$), we have
$$\mathfrak{R}_n(\mathcal{H}) = \mathbb{E}_{\mathcal{D}^n} \left[\mathbb{E}_{\boldsymbol\sigma}\left[ \underset{\textbf{u} \in\mathcal{H}_S}{\sup}\frac{1}{n}\sum_{i=1}^n \sigma_iu_i \right]\right]\leq \mathbb{E}_{\mathcal{D}^n} \left[\frac{\sqrt{n}\sqrt{2\ln{|\mathcal{H}_S|}}}{n}\right]\leq \sqrt{\frac{2\ln{G_{\mathcal{H}}(n)}}{n}} $$
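As a concrete sanity check, consider one-dimensional threshold classifiers $h_t(x)=\operatorname{sign}(x-t)$: their growth function is $G_{\mathcal{H}}(n)=n+1$, so the corollary gives
$$\mathfrak{R}_n(\mathcal{H})\leq \sqrt{\frac{2\ln{(n+1)}}{n}}$$
which vanishes as $n\rightarrow\infty$.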
###### tags: `machine learning`