# 1-3: Growth Function and VC Dimension
## Growth Function and VC dimension
### Definition
For the example above, we can see that for an infinite hypothesis class, it is hard to use $\mid\mathcal{H}\mid$ to bound the generalization gap. The idea is to introduce a new quantity that measures the effective size of the hypothesis space -- by "grouping" functions that behave identically on the given sample.
:::warning
The following content about VC theory is in the binary classification setting, i.e., $\mathcal{Y}=\{\pm 1 \}$ and the 0/1 loss.
:::
Let $\mathcal{D}^n:=\{(x_1,y_1), ..., (x_n, y_n)\}$ and $S=\{x_1, ..., x_n\}$. Also, let
$$\mathcal{H}_S:=\{(h(x_1),...,h(x_n))\mid h\in \mathcal{H}\}$$
denote all possible labelings of $S$ realized by hypotheses in $\mathcal{H}$. (We assume that every $h\in \mathcal{H}$ has a finite range, i.e., $h(x_i) < \infty$.)
:::success
**Definition (Growth function)**
The *growth function* of a hypothesis set is the maximum number of distinct labelings it can produce on any $n$ points:
$$G_{\mathcal{H}}(n):=\sup_{\mid S\mid=n}\mid \mathcal{H}_S \mid$$
:::
:::success
**Definition (Shattering, VC dimension)**
We say that $\mathcal{H}$ *shatters* $S$ if $\mid \mathcal{H}_S \mid = 2^{\mid S\mid}$.
The *Vapnik–Chervonenkis (VC) dimension* of a hypothesis set is the largest cardinality of a set that $\mathcal{H}$ can shatter:
$$VCDim(\mathcal{H}):=\max \{n \mid G_{\mathcal{H}}(n)=2^n\}$$
If no such $n$ exists, then $VCDim(\mathcal{H})=\infty$.
:::
Typical steps to show that $VCDim(\mathcal{H}) = d$:
1. Show that no set of $d + 1$ points can be shattered.
2. Find $d$ points that can be shattered.
### Some examples
Let's look at a few examples.
:::info
**Example (Threshold Classifiers)**
Let $\mathcal{X}=\mathbb{R}$, $\mathcal{H}=\{h: x\to \text{sign}(x-a) \mid a\in \mathbb{R} \}$. Then,
$$G_{\mathcal{H}}(n)=n+1$$
$$VCDim(\mathcal{H})=1$$
:::
**[Idea]**
Place the $n$ points on the real line. Starting to the left of the leftmost point, move the threshold $a$ to the right. Each time it passes a point the labeling changes, so it generates $n+1$ different labelings.
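A quick brute-force check of this count (a Python sketch, not part of the notes; the helper `threshold_labelings` is just an illustrative name): enumerate one threshold per gap between the sorted points and collect the induced labelings.
```python
# Illustrative sketch: count the distinct labelings that threshold classifiers
# sign(x - a) induce on n points; the count should be n + 1.
def threshold_labelings(points):
    points = sorted(points)
    # One candidate threshold per "region": left of all points, between
    # consecutive points, and right of all points.
    candidates = [points[0] - 1.0]
    candidates += [(points[i] + points[i + 1]) / 2 for i in range(len(points) - 1)]
    candidates += [points[-1] + 1.0]
    return {tuple(1 if x > a else -1 for x in points) for a in candidates}

pts = [0.3, 1.7, 2.2, 5.0]              # n = 4 arbitrary distinct points
print(len(threshold_labelings(pts)))    # prints 5 = n + 1
```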
:::info
**Example (Interval Classifiers)**
Let $\mathcal{X}=\mathbb{R}$, $\mathcal{H}=\{h: x\to \text{sign}((a-x)(x-b)) \mid a\leq b\in \mathbb{R} \}$. Then,
$$G_{\mathcal{H}}(n)=1+n+(n-1)+...+1 =\frac{n^2+n+2}{2}$$
$$VCDim(\mathcal{H})=2$$
:::
**[Idea]**
Note that the number of positively labeled points can range from $0$ to $n$, and the positively labeled points must be consecutive in the sorted order. Thus there are $1, n, n-1, ..., 1$ possibilities for these cases respectively, leading to the result above.
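The same kind of brute-force check works here (again an illustrative sketch; `interval_labelings` is a hypothetical helper): enumerate every block of consecutive points as the positive set, plus the all-negative labeling.
```python
# Illustrative sketch: count the distinct labelings that interval classifiers
# induce on n points on a line; the count should be (n^2 + n + 2) / 2.
def interval_labelings(n):
    labelings = {tuple([-1] * n)}        # interval containing no point: all negative
    for i in range(n):                   # positive points form a consecutive block i..j
        for j in range(i, n):
            labelings.add(tuple(1 if i <= k <= j else -1 for k in range(n)))
    return labelings

print(len(interval_labelings(4)))        # prints 11 = (4**2 + 4 + 2) / 2
```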
:::info
**Example (Linear Classifiers in $\mathbb{R}^d$)**
Let $\mathcal{X}=\mathbb{R}^d$, $\mathcal{H}=\{h: \textbf{x}\to \text{sign}(\textbf{w}^T\textbf{x}) \mid \textbf{w}\in \mathbb{R}^d \}$. Then,
$$G_{\mathcal{H}}(n)=2\sum_{i=0}^{d-1} {n-1\choose i}$$
$$VCDim(\mathcal{H})=d$$
:::
Let $\textbf{I}_d$ be the $d\times d$ identity matrix, whose columns are the unit vectors $\textbf{e}_i$, $i=1,...,d$:
$$\textbf{I}_d =
\begin{bmatrix}
\vert & & \vert \\
\textbf{e}_1 &... & \textbf{e}_d \\
\vert & & \vert
\end{bmatrix}
$$
Then for any output labeling $\textbf{y}=(y_1,...,y_d)^T$, taking $\textbf{w}=\textbf{y}$ gives $\text{sign}(\textbf{w}^T\textbf{e}_i)=\text{sign}(y_i)=y_i$, so the $d$ points $\{\textbf{e}_1, ..., \textbf{e}_d\}$ are shattered.
Consider any $d+1$ points $\{\textbf{x}_1, ..., \textbf{x}_{d+1}\}$. Since the dimension of $\mathbb{R}^d$ is $d$, these vectors are linearly dependent, i.e.,
$$\exists \textbf{a} = (a_1,...,a_{d+1})^T \neq\textbf{0}\text{ s.t. } \sum_{i=1}^{d+1}a_i\textbf{x}_i=\textbf{0}$$
Denote the set of indices $I=\{i \mid a_i>0\}$ and $J=\{j\mid a_j<0\}$. Then $I$ and $J$ are disjoint. We then have
$$\sum_{i\in I}a_i\textbf{x}_i=\sum_{j\in J}\mid a_j \mid\textbf{x}_j$$
If these $d+1$ points can be shattered, then there exists some $\textbf{w}\in \mathbb{R}^{d}$ such that
\begin{align}
\textbf{w}\cdot\textbf{x}_i&>0, \forall i \in I\\
\textbf{w}\cdot\textbf{x}_j&<0, \forall j \in J\\
\end{align}
Suppose first that both $I$ and $J$ are non-empty. Then this yields
$$0<\sum_{i\in I}a_i\textbf{w}\cdot\textbf{x}_i=\textbf{w}\cdot\left(\sum_{i\in I}a_i\textbf{x}_i\right)=\textbf{w}\cdot\left(\sum_{j\in J}\mid a_j \mid\textbf{x}_j\right)=\sum_{j\in J}\mid a_j \mid\textbf{w}\cdot\textbf{x}_j<0$$
a contradiction. Finally, note that if either $I$ or $J$ is empty, the corresponding sum is $\textbf{0}$, so the chain above reads $0<0$ and the contradiction still follows.
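As a sanity check of the shattering construction (a small numerical sketch, not part of the proof), we can verify for a specific $d$ that $\textbf{w}=\textbf{y}$ realizes every labeling of $\{\textbf{e}_1, ..., \textbf{e}_d\}$:
```python
# Numerical check: with points e_1, ..., e_d, the choice w = y realizes every
# labeling y in {-1, +1}^d, since sign(w^T e_i) = sign(y_i) = y_i.
import numpy as np
from itertools import product

d = 4
E = np.eye(d)                                   # row i is e_i
for y in product([-1, 1], repeat=d):
    w = np.array(y, dtype=float)
    assert np.array_equal(np.sign(E @ w), w)    # labeling y is realized
print(f"all {2**d} labelings of d = {d} points are realized")
```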
:::warning
If $\mathcal{H}=\{h: \textbf{x}\to \text{sign}(\textbf{w}^T\textbf{x}+w_0) \mid \textbf{w}\in \mathbb{R}^d, w_0\in\mathbb{R} \}$, then it becomes an *affine classifier*, which can be viewed as adding a constant coordinate $1$. Let the augmented vector be
$$\textbf{z}:=[1, \textbf{x}]^T\in\mathbb{R}^{d+1}$$
and redefine the parameters as
$$\textbf{u}:=[w_0,\textbf{w}]^T\in\mathbb{R}^{d+1}$$
Then the classifier can be rewritten as $\mathcal{H}=\{h: \textbf{z}\to \text{sign}(\textbf{u}^T\textbf{z}) \mid \textbf{u}\in \mathbb{R}^{d+1} \}$, thus
$$VCDim(\mathcal{H})=d+1$$
:::
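A tiny numerical illustration of this augmentation trick (a sketch with arbitrary values, not part of the notes):
```python
# The affine classifier sign(w^T x + w_0) equals the homogeneous linear
# classifier sign(u^T z) after augmenting x with a constant coordinate 1.
import numpy as np

x = np.array([2.0, -1.0, 0.5])
w, w0 = np.array([1.0, 3.0, -2.0]), 0.7
z = np.concatenate(([1.0], x))          # augmented input  z = [1, x]
u = np.concatenate(([w0], w))           # augmented weight u = [w_0, w]
assert np.isclose(u @ z, w @ x + w0)    # same score, hence the same sign
```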
We'll leave an exercise here.
:::info
**Example (Axis aligned rectangles)**
Show that the class of axis-aligned rectangles in $\mathbb{R}^2$ has $VCDim(\mathcal{H})=4$.
:::
### Properties of VC dimension
:::success
**Lemma**
For a finite hypothesis class, $G_{\mathcal{H}}(n)\leq\mid\mathcal{H}\mid$ and $VCDim(\mathcal{H})\leq\log_2{\mid\mathcal{H}\mid}$.
:::
**[proof]**
The first statement is immediate: since there are only $\mid\mathcal{H}\mid$ functions, there are at most $\mid\mathcal{H}\mid$ possible labelings.
For the second statement, let $d=VCDim(\mathcal{H})$. By the definition of shattering,
$$2^d=G_{\mathcal{H}}(d)\leq \mid \mathcal{H} \mid$$
Therefore $VCDim(\mathcal{H})\leq\log_2{\mid\mathcal{H}\mid}$.
:::success
**Lemma (Sauer)**
Let $\mathcal{H}$ be a class of functions with binary outputs and VC dimension $VCDim(\mathcal{H})=d$. Then for all $n\in\mathbb{N}$,
$$G_{\mathcal{H}}(n)\leq\sum_{i=0}^d {n\choose i}=\Phi_d(n)=O(n^d)$$
Furthermore, for $n\geq d$,
$$G_{\mathcal{H}}(n)\leq\left(\frac{ne}{d}\right)^d$$
:::
We first prove the inequality $G_{\mathcal{H}}(n)\leq\Phi_d(n)$ by induction on $n+d$.
**Base Case:**
For $d=0$, $\mathcal{H}$ cannot shatter even a single point, so all functions in $\mathcal{H}$ induce the same labeling, i.e., $G_{\mathcal{H}}(n)=\Phi_0(n)=1$.
For $n=0$, labeling the empty set has exactly one possibility, i.e., $G_{\mathcal{H}}(0)=\Phi_d(0)=1$.
**Induction Step:**
Assume that the theorem holds for all $n'+d'<n+d$.
Given $S=\{x_1, ..., x_n\}$, and without loss of generality, identify functions that agree on $S$, so that distinct functions in $\mathcal{H}$ induce distinct labelings of $S$ (that is, $\forall h\neq h' \in \mathcal{H},\ h(S)\neq h'(S)$).
We define $S'=\{x_1, ..., x_{n-1}\}$ and two classes of functions:
\begin{align}
\mathcal{H}_1&:=\text{a maximal subset of }\mathcal{H}\text{ such that }\forall h\neq h' \in \mathcal{H}_1,\ h(S')\neq h'(S')\\
\mathcal{H}_2&:=\{h \in \mathcal{H} \mid \exists h'\in\mathcal{H}_1 \text{ s.t. } h(x_i)=h'(x_i)\ \forall i<n \text{ and } h(x_n)\neq h'(x_n)\}
\end{align}
That is, $\mathcal{H}_1$ keeps exactly one representative for each distinct labeling of $S'$, and $\mathcal{H}_2$ collects the functions that agree with some representative on the first $(n-1)$ inputs but differ at $x_n$. Note that $\mathcal{H}_1$ and $\mathcal{H}_2$ form a *partition* of $\mathcal{H}$. As a concrete example with $n=2$: if the labelings of $S=\{x_1,x_2\}$ realized by $\mathcal{H}$ are $(+,+)$, $(+,-)$ and $(-,+)$, we may take $\mathcal{H}_1=\{(+,+),(-,+)\}$, and then $\mathcal{H}_2=\{(+,-)\}$.
Since $\mathcal{H}_1 \subseteq \mathcal{H}$, we have $VCdim(\mathcal{H}_1) \leq VCdim(\mathcal{H}) = d$; by the induction hypothesis, $G_{\mathcal{H}_1}(n-1)\leq \Phi_d(n-1)$.
For every $h\in\mathcal{H}_2$, both labels of $x_n$ are realized in $\mathcal{H}$ together with $h$'s labeling of the first $(n-1)$ points. Hence, if $\mathcal{H}_2$ shatters some $T\subseteq S'$, then $\mathcal{H}$ shatters $T\cup\{x_n\}$, so $VCdim(\mathcal{H}_2) \leq d -1$; by the induction hypothesis, $G_{\mathcal{H}_2}(n-1)\leq \Phi_{d-1}(n-1)$.
Combining the two inequalities yields
\begin{align}
G_{\mathcal{H}}(n)&\leq G_{\mathcal{H}_1}(n-1)+G_{\mathcal{H}_2}(n-1)\\
&\leq \Phi_d(n-1)+\Phi_{d-1}(n-1)=\Phi_d(n)
\end{align}
The last equality uses the following recurrence (Pascal's rule), whose verification is left as an exercise:
$${n\choose k}={n-1\choose k}+{n-1\choose k-1}$$
For $n\geq d$, we have
\begin{align}
\sum_{i=0}^d {n\choose i}&\leq\left(\frac{n}{d}\right)^d\sum_{i=0}^d {n\choose i}\left(\frac{d}{n}\right)^i\leq\left(\frac{n}{d}\right)^d\sum_{i=0}^n {n\choose i}\left(\frac{d}{n}\right)^i\\
&=\left(\frac{n}{d}\right)^d\left(1+\frac{d}{n}\right)^n\leq\left(\frac{ne}{d}\right)^d
\end{align}
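To get a feel for these bounds, here is a small numerical comparison (an illustrative sketch; `phi` is a hypothetical helper) of $\Phi_d(n)$ with $(ne/d)^d$ and with $2^n$:
```python
# Compare Sauer's bound Phi_d(n) with the cleaner (ne/d)^d bound for n >= d.
from math import comb, e

def phi(d, n):
    return sum(comb(n, i) for i in range(d + 1))

d = 3
for n in (3, 10, 100):
    print(n, phi(d, n), round((n * e / d) ** d, 1), 2 ** n)
# Phi_d(n) grows only polynomially in n (roughly n^d), far below 2^n.
```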
## Generalization Bound for Infinite Function Class
Now we are ready to bound the generalization gap with growth function $G_{\mathcal{H}}$ and VC dimension $VCDim(\mathcal{H})$.
### Symmetrization
We'll make use of the following fact and will not prove it here:
(A similar one will appear in the next unit again)
:::success
**Lemma (Symmetrization)**
For any $\epsilon>\sqrt{\frac{2}{n}}$,
$$\Pr\{\underset{h\in\mathcal{H}}{\sup}\mid R_P(h) - \hat{R_n}(h)\mid \geq \epsilon \} \leq 2\Pr \{\underset{h\in\mathcal{H}}{\sup}\mid\hat{R_n'}(h) - \hat{R_n}(h)\mid \geq \frac{\epsilon}{2} \}$$
where $\hat{R_n'}(h)$ denotes the empirical risk on a *ghost dataset* $\mathcal{D}^{n'}$ of the same size $n$, drawn independently of the original dataset $\mathcal{D}^n$.
:::
The symmetrization lemma replaces the true risk with the empirical risk over the ghost sample. As a result, the right-hand side depends only on the samples $\mathcal{D}^n$ and $\mathcal{D}^{n'}$. Note that $\mathcal{D}^{n'}$ has no practical implications: we don't need another dataset at training time; it's just a mathematical device used in the proof.
Now we are ready to give a generalization bound for infinite function class.
:::success
**Theorem (Vapnik-Chervonenkis)**
With probability at least $1-\delta$,
$$\forall h\in\mathcal{H}, | R_P(h)-\hat{R_n}(h) | \leq 2\sqrt{\frac{2}{n} \left(\ln G_{\mathcal{H}}(2n)+ \ln\frac{4}{\delta} \right)}$$
:::
**[proof]**
By symmetrization, the supremum now depends only on the values of $h\in\mathcal{H}$ on the two datasets $\mathcal{D}^n$ and $\mathcal{D}^{n'}$, so we only need to consider how $\mathcal{H}$ can label these $2n$ points, i.e., at most $G_{\mathcal{H}}(2n)$ distinct behaviors. So using the union bound,
\begin{align}
\Pr\{\underset{h\in\mathcal{H}}{\sup}| R_P(h) - \hat{R_n}(h)| \geq \epsilon \} &\leq 2 \Pr \{\underset{h\in\mathcal{H}}{\sup}|\hat{R_n'}(h) - \hat{R_n}(h)| \geq \frac{\epsilon}{2} \}\\
&\leq 2G_{\mathcal{H}}(2n)\Pr \{|\hat{R_n'}(h) - \hat{R_n}(h)| \geq \frac{\epsilon}{2} \}\\
&\leq 4G_{\mathcal{H}}(2n)\exp{\left(-\frac{n\epsilon^2}{8}\right)}
\end{align}
The last inequality uses Hoeffding's inequality.
Setting $\delta = 4G_{\mathcal{H}}(2n)\exp{\left(-\frac{n\epsilon^2}{8}\right)}$ and solving for $\epsilon$ gives the desired result.
### Connecting to the VC dimension
Recalling Sauer's lemma, we immediately obtain the following corollary:
:::success
**Corollary (Generalization Bound with VC dimension)**
Let the hypothesis space $\mathcal{H}$ have $VCDim(\mathcal{H}) = d$. Then, for $n\geq d$, with probability at least $1-\delta$,
$$\forall h\in\mathcal{H}, | R_P(h)-\hat{R_n}(h) | \leq 2\sqrt{\frac{2}{n} \left(d\ln \frac{2en}{d}+ \ln\frac{4}{\delta} \right)}$$
:::
Note that in order for the result to be meaningful, $d$ must be finite.
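As an illustration of how the bound scales (a sketch; `vc_bound` and the chosen values of $n$, $d$, $\delta$ are arbitrary), plugging numbers into the corollary shows the gap shrinking roughly like $\sqrt{d\ln n / n}$:
```python
# Evaluate the VC bound 2*sqrt((2/n) * (d*ln(2en/d) + ln(4/delta))) from above.
from math import e, log, sqrt

def vc_bound(n, d, delta):
    return 2 * sqrt(2 / n * (d * log(2 * e * n / d) + log(4 / delta)))

for n in (1_000, 10_000, 100_000):
    print(n, round(vc_bound(n, d=10, delta=0.05), 3))   # bound decreases with n
```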
###### tags: `machine learning`