# Rademacher Complexity
Denotation
---
* $\mathcal{F} = l \circ \mathcal{H} = \{z \longmapsto \mathcal{l}\left(h,z \right) : h \in \mathcal{H} \}$ be the
* Given $$f \in \mathcal{F}$$
* Define Loss of training set with distribution and loss function $L_{\mathcal{D}} \left ( f\right ) = \mathbb{E}_{z \sim \mathcal{D}}\left[ f(z)\right ]$
* Define loss of sample set $L_{S} \left ( f\right ) = \frac{1}{m}\sum_{i = 1}^{m}{f(z_{i})}$
* ### ***Representativeness*** define as follow
* $$ Rep_{\mathcal{D}}(\mathcal{F},S) = sup_{f \in \mathcal{F}}\left( L_{\mathcal{D}} \left ( f\right ) - L_{S} \left ( f\right ) \right)$$
* ### ***The set of all possible evaluations a functions.***
* $$ \mathcal{F} \circ S = \{ \left ( f(z_1),f(z_2), ... , f(z_m) \right) : f \in \mathcal{F} \} $$
* ### ***Rademancher complexity***
* $$R(\mathcal{F} \circ S) = \frac{1}{m}\mathbb{E}_{\sigma \sim \{\pm 1\}^m }\left [ sup_{f \in \mathcal{F}} \sum_{i = 1}^{m} \sigma_{i}f(z_i) \right]$$
######
* $$R(A) = \frac{1}{m} \mathbb{E}_{\sigma}\left[ sup_{a \in A} \sum_{i = 1}^{m} \sigma_{i}a_i \right]$$
* ### $\phi$ definition
* $\phi = \left\{ \phi_1,...,\phi_m \right\}, \forall{\phi} = \pm1$
Basic properties
---
### Definition 26.1
* A training set $S$ called $\epsilon-representative$ with $(Z,\mathcal{H},l)$ and distribution $D$. if
$$
sup_{h \in \mathcal{H}} \left | L_D(h) - L_S(h) \right | \leq \epsilon
$$
### **Lemma 26.2**
$$ \mathbb{E}_{S \sim \mathcal{D}^{m}}\left[Rep_{\mathcal{D}}(\mathcal{F},S)\right] \leq 2\mathbb{E}_{S \sim \mathcal{D}^{m}}\left[R(\mathcal{F} \circ S)\right]$$
### **Lemma 26.3**
* $$\mathbb{E}_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(ERM_{\mathcal{H}}(S) ) - L_{S}(ERM_{\mathcal{H}}(S))\right] \leq 2 \mathbb{E}_{S \sim \mathcal{D}^m}R(\mathcal{l} \circ \mathcal{H} \circ S) $$
* $$\mathbb{E}_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(ERM_{\mathcal{H}}(S) ) - L_{\mathcal{D}}(h^*)\right] \leq 2 \mathbb{E}_{S \sim \mathcal{D}^m}R(\mathcal{l} \circ \mathcal{H} \circ S), \forall h^* \in \mathcal{H}$$
Hence, the form of above can be more further as, for all $$ h^* = argmin_{h \in \mathcal{H}}(L_{\mathcal{D}}(h^*)) $$
We have $$ P_{S \sim \mathcal{D}}\left[L_{\mathcal{D}}(ERM_{\mathcal{H}}(S) ) - L_{\mathcal{D}}(h^*) \leq \frac{ \mathbb{E}_{S \sim \mathcal{D}^m}R(\mathcal{l} \circ \mathcal{H} \circ S)}{\sigma}\right ] \geq 1 - \sigma $$
### Lemma 26.4
* (McDiarmid's Inequality)
* Let $V$ be some set and let $f: V^m \rightarrow \mathbb{R}$ be a function ò $m$ variables such that for some $c > 0$, for all $i \in \left[m\right]$ and for all $x_1,..x_m\in V$ we have
$$\left | f(x_1,...,x_m) - f(x_1,...,x_{i - 1}, x^{'}_{i},x_{i+1},...,x_m) \right | \leq c $$
* Let $X_1,...,X_m$ be $m$ independent random variables taking values in $V$. Then
$$
P_{X_i}\left [ \left | f(x_1,...,x_m) - \mathbb{E}\left [f(X_1,...,X_{m}) \right] \right | \leq c\sqrt{ln(\frac{2}{\sigma})\frac{m}{2}} \right] \geq 1 - \sigma
$$
**For 26.4 we can prove some result below**
### Theorem 26.5:
* Assume that for all $z$ and $h \in \mathcal{H}$ we have that $\left | l(h,z) \right | \leq c$. Then, *with probability of at least $1 - \sigma$, for all $h \in \mathcal{H}$*
1.
$$
L_{\mathcal{D}}(h) - L_S(h) \leq 2 \mathbb{E}_{S \sim \mathcal{D}^m}R(\mathcal{l} \circ \mathcal{H} \circ S^{'}) + c\sqrt{\frac{2ln(\frac{2}{\sigma})}{m}}
$$
2.
$$
L_{\mathcal{D}}(h) - L_S(h) \leq 2 R(\mathcal{l} \circ \mathcal{H} \circ S^) + 4c\sqrt{\frac{2ln(\frac{4}{\sigma})}{m}}
$$
3.
$$
L_{\mathcal{D}}(h) - L_S(h) \leq 2 \mathbb{E}_{S \sim \mathcal{D}^m}R(\mathcal{l} \circ \mathcal{H} \circ S^{'}) + c\sqrt{\frac{2ln(\frac{2}{\sigma})}{m}}
$$
---
## Rademancher Calculus
### Theorem 26.6:
* For any $A \subset \mathbb{R}^n$ and $A^{'} = \left\{ ca + a_0 : a \in A \right\}$ with $a_{0} \in \mathbb{R}^n$ and $c$ is scalar then
$$ R(A^{'}) \leq \left|c\right|R(A) $$
**Following lemma states that the convex hull of $A$ is have a same Rademancher Complexity as $A$**
### Theorem 26.7:
* Let $A$ be a subset of $\mathbb{R}^n$ and let $A^{'} = \left\{\sum_{j = 1}^{N}\alpha_{i}a^{(j)} : N \in \mathbb{N}, \forall j,a^{(j)}\in A,\alpha_j \geq 0, \left |\left |a \right | \right |_{1} = 1\right\}$. Then $R(A) = R(A^{'})$
### Lemma 26.8:
* (Massart Lemma) The Rademancher Complexity of of a finite set is growth logarithmically with the size of the set.
* Formally, let $A = \left\{a_1,...,a_N\right\}$ be a finite set of vectors in $\mathbb{R}^m$. Define $\overline{a} = \frac{1}{N}\sum_{i = 1}^{n}a_i$. Then,
$$ R(A) = \mathbb{E}_{\sigma}\left[max_{a \in A}(\left||a - \overline{a}\right||) \right] \frac{\sqrt{2log(N)}}{m} $$
### Lemma 26.9
* (Contraction Lemma) For each $i \in [m]$, let $\phi_{i}: \mathbb{R} \rightarrow \mathbb{R}$ be $\rho-Lipschitz$ function. For all $a \in \mathbb{R}^n$ let $\phi(a)$ denote the vector $(\phi_{1}(a_{1}),...,\phi_{m}(a_m))$. Let $\phi\circ A = \left \{\phi(a):a\in A \right \}$. Then,
$$
R(\phi \circ A) \leq \rho R(A)
$$
Rademancher Complexity for Linear Classes
---
Define 2 following classes:
$\mathcal{H}_1 = \left\{x \mapsto \left\langle w,x \right\rangle : ||w||_1 \leq 1 \right\}$ $\mathcal{H}_2 = \left\{x \mapsto \left\langle w,x \right\rangle : ||w||_2 \leq 1 \right\}$
### Lemma 26.10
* (Bounding RC for $\mathcal{H}_2$) Let $S = \left\{x_1,...,x_m\right\}$ be a vector in **Hilbert space** $\mathcal{H}_2 \circ S = \left\{\left\langle w,x_1 \right\rangle,...,\left\langle w,x_m \right\rangle : ||w||_2 \leq 1 \right\}$. Then,
$$
R(\mathcal{H}_2 \circ S) \leq \frac{max_{i}||x_i||_2}{\sqrt{m}}
$$
### Lemma 26.11
* (Bounding RC for $\mathcal{H}_1$) Let $S = \left\{x_1,...,x_m\right\}$ be a vector in $\mathbb{R}^n$ $\mathcal{H}_1 \circ S = \left\{\left\langle w,x_1 \right\rangle,...,\left\langle w,x_m \right\rangle : ||w||_2 \leq 1 \right\}$. Then,
$$
R(\mathcal{H}_2 \circ S) \leq max_{i}||x_i||_{\infty}\sqrt{\frac{2log(2n)}{m}}
$$
Generalization Bounds for SVM
---
* Let $\mathcal{H} = \left\{w, ||w|| \leq b\right\}$
### Theorem 26.12
* Suppose $\mathcal{D}$ be a distribution over $Z=$ $\mathcal{X} \times \mathcal{Y}$ such that with probability $1$ we have that $||x||_2 \leq R$. Let $\mathcal{l}:\mathcal{H}\times Z \rightarrow \mathbb{R}$ be a loss function of the form $l(w,(x,y))= \phi(\langle w,x \rangle,y)$ where $\phi:\mathbb{R} \times \mathcal{Y}$ is a $\rho$-Lipschitz function, and such that for all $y \in \mathcal{Y}$ we have $max_{a \in \left[-BR,BR \right]}|\phi(\langle w,x \rangle,y)| \leq c$. Then for any $\delta \in (0,1)$, with the probability at least $1 - \delta$ over choice i.i.d sample of size m,
$$
\forall{w} \in \mathcal{H}, L_{\mathcal{D}}(w) - L_{S}(w) \leq \frac{2\rho BR}{\sqrt{m}} + c\sqrt{\frac{2ln(\frac{2}{\delta})}{m}}
$$
### Theorem 26.13
* Consider $\mathcal{D}$ is a distribution over $Z=$ $\mathcal{X} \times \left\{\pm 1\right\}$ such that with probability $1$ there is exists $w^{*}$ with $\mathbb{P}_{(x,y) \thicksim \mathcal{D}}[y\langle w^{*},x\rangle \geq 1]$ such that that $||x||_2 \leq R$. Let $w_S$ be the output of hard-SVM.Then for any $\delta \in (0,1)$, with the probability at least $1 - \delta$ over choice i.i.d sample of size m,
$$
\mathbb{P}_{(x,y) \thicksim \mathcal{D}}[y \neq sign(\langle w_{s},x\rangle) ] \leq \frac{2R||w^{*}||}{\sqrt{m}} + (1 + R||w^*||)\sqrt{\frac{2ln(\frac{2}{\delta})}{m}}
$$
### Theorem 26.14
Assume that the conditions of **Theorem 26.13** hold. Then, with probability of at least $1-\delta$ over the choice $S \thicksim \mathcal{D}^m$, we have
$$
\mathbb{P}_{(x,y) \thicksim \mathcal{D}}[y \neq sign(\langle w_{s},x\rangle) ] \leq \frac{4R||w_S||}{\sqrt{m}} + \sqrt{\frac{ln(\frac{4log_{2}(||w_S||)}{\delta})}{m}}
$$