# Rademacher Complexity Denotation --- * $\mathcal{F} = l \circ \mathcal{H} = \{z \longmapsto \mathcal{l}\left(h,z \right) : h \in \mathcal{H} \}$ be the * Given $$f \in \mathcal{F}$$ * Define Loss of training set with distribution and loss function $L_{\mathcal{D}} \left ( f\right ) = \mathbb{E}_{z \sim \mathcal{D}}\left[ f(z)\right ]$ * Define loss of sample set $L_{S} \left ( f\right ) = \frac{1}{m}\sum_{i = 1}^{m}{f(z_{i})}$ * ### ***Representativeness*** define as follow * $$ Rep_{\mathcal{D}}(\mathcal{F},S) = sup_{f \in \mathcal{F}}\left( L_{\mathcal{D}} \left ( f\right ) - L_{S} \left ( f\right ) \right)$$ * ### ***The set of all possible evaluations a functions.*** * $$ \mathcal{F} \circ S = \{ \left ( f(z_1),f(z_2), ... , f(z_m) \right) : f \in \mathcal{F} \} $$ * ### ***Rademancher complexity*** * $$R(\mathcal{F} \circ S) = \frac{1}{m}\mathbb{E}_{\sigma \sim \{\pm 1\}^m }\left [ sup_{f \in \mathcal{F}} \sum_{i = 1}^{m} \sigma_{i}f(z_i) \right]$$ ###### * $$R(A) = \frac{1}{m} \mathbb{E}_{\sigma}\left[ sup_{a \in A} \sum_{i = 1}^{m} \sigma_{i}a_i \right]$$ * ### $\phi$ definition * $\phi = \left\{ \phi_1,...,\phi_m \right\}, \forall{\phi} = \pm1$ Basic properties --- ### Definition 26.1 * A training set $S$ called $\epsilon-representative$ with $(Z,\mathcal{H},l)$ and distribution $D$. if $$ sup_{h \in \mathcal{H}} \left | L_D(h) - L_S(h) \right | \leq \epsilon $$ ### **Lemma 26.2** $$ \mathbb{E}_{S \sim \mathcal{D}^{m}}\left[Rep_{\mathcal{D}}(\mathcal{F},S)\right] \leq 2\mathbb{E}_{S \sim \mathcal{D}^{m}}\left[R(\mathcal{F} \circ S)\right]$$ ### **Lemma 26.3** * $$\mathbb{E}_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(ERM_{\mathcal{H}}(S) ) - L_{S}(ERM_{\mathcal{H}}(S))\right] \leq 2 \mathbb{E}_{S \sim \mathcal{D}^m}R(\mathcal{l} \circ \mathcal{H} \circ S) $$ * $$\mathbb{E}_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(ERM_{\mathcal{H}}(S) ) - L_{\mathcal{D}}(h^*)\right] \leq 2 \mathbb{E}_{S \sim \mathcal{D}^m}R(\mathcal{l} \circ \mathcal{H} \circ S), \forall h^* \in \mathcal{H}$$ Hence, the form of above can be more further as, for all $$ h^* = argmin_{h \in \mathcal{H}}(L_{\mathcal{D}}(h^*)) $$ We have $$ P_{S \sim \mathcal{D}}\left[L_{\mathcal{D}}(ERM_{\mathcal{H}}(S) ) - L_{\mathcal{D}}(h^*) \leq \frac{ \mathbb{E}_{S \sim \mathcal{D}^m}R(\mathcal{l} \circ \mathcal{H} \circ S)}{\sigma}\right ] \geq 1 - \sigma $$ ### Lemma 26.4 * (McDiarmid's Inequality) * Let $V$ be some set and let $f: V^m \rightarrow \mathbb{R}$ be a function ò $m$ variables such that for some $c > 0$, for all $i \in \left[m\right]$ and for all $x_1,..x_m\in V$ we have $$\left | f(x_1,...,x_m) - f(x_1,...,x_{i - 1}, x^{'}_{i},x_{i+1},...,x_m) \right | \leq c $$ * Let $X_1,...,X_m$ be $m$ independent random variables taking values in $V$. Then $$ P_{X_i}\left [ \left | f(x_1,...,x_m) - \mathbb{E}\left [f(X_1,...,X_{m}) \right] \right | \leq c\sqrt{ln(\frac{2}{\sigma})\frac{m}{2}} \right] \geq 1 - \sigma $$ **For 26.4 we can prove some result below** ### Theorem 26.5: * Assume that for all $z$ and $h \in \mathcal{H}$ we have that $\left | l(h,z) \right | \leq c$. Then, *with probability of at least $1 - \sigma$, for all $h \in \mathcal{H}$* 1. $$ L_{\mathcal{D}}(h) - L_S(h) \leq 2 \mathbb{E}_{S \sim \mathcal{D}^m}R(\mathcal{l} \circ \mathcal{H} \circ S^{'}) + c\sqrt{\frac{2ln(\frac{2}{\sigma})}{m}} $$ 2. $$ L_{\mathcal{D}}(h) - L_S(h) \leq 2 R(\mathcal{l} \circ \mathcal{H} \circ S^) + 4c\sqrt{\frac{2ln(\frac{4}{\sigma})}{m}} $$ 3. $$ L_{\mathcal{D}}(h) - L_S(h) \leq 2 \mathbb{E}_{S \sim \mathcal{D}^m}R(\mathcal{l} \circ \mathcal{H} \circ S^{'}) + c\sqrt{\frac{2ln(\frac{2}{\sigma})}{m}} $$ --- ## Rademancher Calculus ### Theorem 26.6: * For any $A \subset \mathbb{R}^n$ and $A^{'} = \left\{ ca + a_0 : a \in A \right\}$ with $a_{0} \in \mathbb{R}^n$ and $c$ is scalar then $$ R(A^{'}) \leq \left|c\right|R(A) $$ **Following lemma states that the convex hull of $A$ is have a same Rademancher Complexity as $A$** ### Theorem 26.7: * Let $A$ be a subset of $\mathbb{R}^n$ and let $A^{'} = \left\{\sum_{j = 1}^{N}\alpha_{i}a^{(j)} : N \in \mathbb{N}, \forall j,a^{(j)}\in A,\alpha_j \geq 0, \left |\left |a \right | \right |_{1} = 1\right\}$. Then $R(A) = R(A^{'})$ ### Lemma 26.8: * (Massart Lemma) The Rademancher Complexity of of a finite set is growth logarithmically with the size of the set. * Formally, let $A = \left\{a_1,...,a_N\right\}$ be a finite set of vectors in $\mathbb{R}^m$. Define $\overline{a} = \frac{1}{N}\sum_{i = 1}^{n}a_i$. Then, $$ R(A) = \mathbb{E}_{\sigma}\left[max_{a \in A}(\left||a - \overline{a}\right||) \right] \frac{\sqrt{2log(N)}}{m} $$ ### Lemma 26.9 * (Contraction Lemma) For each $i \in [m]$, let $\phi_{i}: \mathbb{R} \rightarrow \mathbb{R}$ be $\rho-Lipschitz$ function. For all $a \in \mathbb{R}^n$ let $\phi(a)$ denote the vector $(\phi_{1}(a_{1}),...,\phi_{m}(a_m))$. Let $\phi\circ A = \left \{\phi(a):a\in A \right \}$. Then, $$ R(\phi \circ A) \leq \rho R(A) $$ Rademancher Complexity for Linear Classes --- Define 2 following classes: $\mathcal{H}_1 = \left\{x \mapsto \left\langle w,x \right\rangle : ||w||_1 \leq 1 \right\}$ $\mathcal{H}_2 = \left\{x \mapsto \left\langle w,x \right\rangle : ||w||_2 \leq 1 \right\}$ ### Lemma 26.10 * (Bounding RC for $\mathcal{H}_2$) Let $S = \left\{x_1,...,x_m\right\}$ be a vector in **Hilbert space** $\mathcal{H}_2 \circ S = \left\{\left\langle w,x_1 \right\rangle,...,\left\langle w,x_m \right\rangle : ||w||_2 \leq 1 \right\}$. Then, $$ R(\mathcal{H}_2 \circ S) \leq \frac{max_{i}||x_i||_2}{\sqrt{m}} $$ ### Lemma 26.11 * (Bounding RC for $\mathcal{H}_1$) Let $S = \left\{x_1,...,x_m\right\}$ be a vector in $\mathbb{R}^n$ $\mathcal{H}_1 \circ S = \left\{\left\langle w,x_1 \right\rangle,...,\left\langle w,x_m \right\rangle : ||w||_2 \leq 1 \right\}$. Then, $$ R(\mathcal{H}_2 \circ S) \leq max_{i}||x_i||_{\infty}\sqrt{\frac{2log(2n)}{m}} $$ Generalization Bounds for SVM --- * Let $\mathcal{H} = \left\{w, ||w|| \leq b\right\}$ ### Theorem 26.12 * Suppose $\mathcal{D}$ be a distribution over $Z=$ $\mathcal{X} \times \mathcal{Y}$ such that with probability $1$ we have that $||x||_2 \leq R$. Let $\mathcal{l}:\mathcal{H}\times Z \rightarrow \mathbb{R}$ be a loss function of the form $l(w,(x,y))= \phi(\langle w,x \rangle,y)$ where $\phi:\mathbb{R} \times \mathcal{Y}$ is a $\rho$-Lipschitz function, and such that for all $y \in \mathcal{Y}$ we have $max_{a \in \left[-BR,BR \right]}|\phi(\langle w,x \rangle,y)| \leq c$. Then for any $\delta \in (0,1)$, with the probability at least $1 - \delta$ over choice i.i.d sample of size m, $$ \forall{w} \in \mathcal{H}, L_{\mathcal{D}}(w) - L_{S}(w) \leq \frac{2\rho BR}{\sqrt{m}} + c\sqrt{\frac{2ln(\frac{2}{\delta})}{m}} $$ ### Theorem 26.13 * Consider $\mathcal{D}$ is a distribution over $Z=$ $\mathcal{X} \times \left\{\pm 1\right\}$ such that with probability $1$ there is exists $w^{*}$ with $\mathbb{P}_{(x,y) \thicksim \mathcal{D}}[y\langle w^{*},x\rangle \geq 1]$ such that that $||x||_2 \leq R$. Let $w_S$ be the output of hard-SVM.Then for any $\delta \in (0,1)$, with the probability at least $1 - \delta$ over choice i.i.d sample of size m, $$ \mathbb{P}_{(x,y) \thicksim \mathcal{D}}[y \neq sign(\langle w_{s},x\rangle) ] \leq \frac{2R||w^{*}||}{\sqrt{m}} + (1 + R||w^*||)\sqrt{\frac{2ln(\frac{2}{\delta})}{m}} $$ ### Theorem 26.14 Assume that the conditions of **Theorem 26.13** hold. Then, with probability of at least $1-\delta$ over the choice $S \thicksim \mathcal{D}^m$, we have $$ \mathbb{P}_{(x,y) \thicksim \mathcal{D}}[y \neq sign(\langle w_{s},x\rangle) ] \leq \frac{4R||w_S||}{\sqrt{m}} + \sqrt{\frac{ln(\frac{4log_{2}(||w_S||)}{\delta})}{m}} $$