# 1-3: Growth Function and VC Dimension
## Growth Function and VC dimension
### Definition
For the example above, we can see that for an infinite hypothesis class, it is hard to use $\mid\mathcal{H}\mid$ to bound the generalization gap. The idea is to introduce a new quantity that measures the effective size of the hypothesis space -- by "grouping" functions that behave identically on the given sample.
:::warning
The following content about VC theory is in the binary classification setting, i.e., $\mathcal{Y}=\{\pm 1 \}$ and the 0/1 loss.
:::
Let $\mathcal{D}^n:=\{(x_1,y_1), ..., (x_n, y_n)\}$ and $S=\{x_1, ..., x_n\}$. Also, let
$$\mathcal{H}_S:=\{(h(x_1),...,h(x_n))\mid h\in \mathcal{H}\}$$
denote all possible labelings of $S$ realized by hypotheses in $\mathcal{H}$. (We assume that every $h\in \mathcal{H}$ has a finite range, i.e., $h(x_i) < \infty$.)
:::success
**Definition (Growth function)**
The *growth function* of a hypothesis set is the maximum number of distinct labelings it can produce on any $n$ points:
$$G_{\mathcal{H}}(n):=\sup_{\mid S\mid=n}\mid \mathcal{H}_S \mid$$
:::
:::success
**Definition (Shattering, VC dimension)**
We say that $\mathcal{H}$ *shatters* $S$ if $\mid \mathcal{H}_S \mid = 2^{\mid S\mid}$.
The *Vapnik–Chervonenkis (VC) dimension* of a hypothesis set is the largest cardinality of a set that $\mathcal{H}$ can shatter:
$$VCDim(\mathcal{H}):=\max \{n \mid G_{\mathcal{H}}(n)=2^n\}$$
If no such $n$ exists, then $VCDim(\mathcal{H})=\infty$.
:::
Typical steps to show that $VCDim(\mathcal{H}) = d$:
1. Show that no set of $d + 1$ points can be shattered.
2. Find $d$ points that can be shattered.
### Some examples
Let's look at a few examples.
:::info
**Example (Threshold Classifiers)**
Let $\mathcal{X}=\mathbb{R}$, $\mathcal{H}=\{h: x\to \text{sign}(x-a) \mid a\in \mathbb{R} \}$. Then,
$$G_{\mathcal{H}}(n)=n+1$$
$$VCDim(\mathcal{H})=1$$
:::
**[Idea]**
Place the $n$ points on the real line. Starting to the left of the leftmost point, move the threshold $a$ to the right. Each time it passes a point the labeling changes, so it generates $n+1$ different labelings.
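A quick brute-force check of this count (a Python sketch, not part of the notes; the helper `threshold_labelings` is just an illustrative name): enumerate one threshold per gap between the sorted points and collect the induced labelings.
```python
# Illustrative sketch: count the distinct labelings that threshold classifiers
# sign(x - a) induce on n points; the count should be n + 1.
def threshold_labelings(points):
    points = sorted(points)
    # One candidate threshold per "region": left of all points, between
    # consecutive points, and right of all points.
    candidates = [points[0] - 1.0]
    candidates += [(points[i] + points[i + 1]) / 2 for i in range(len(points) - 1)]
    candidates += [points[-1] + 1.0]
    return {tuple(1 if x > a else -1 for x in points) for a in candidates}

pts = [0.3, 1.7, 2.2, 5.0]              # n = 4 arbitrary distinct points
print(len(threshold_labelings(pts)))    # prints 5 = n + 1
```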
:::info
**Example (Interval Classifiers)**
Let $\mathcal{X}=\mathbb{R}$, $\mathcal{H}=\{h: x\to \text{sign}((a-x)(x-b)) \mid a\leq b\in \mathbb{R} \}$. Then,
$$G_{\mathcal{H}}(n)=1+n+(n-1)+...+1 =\frac{n^2+n+2}{2}$$
$$VCDim(\mathcal{H})=2$$
:::
**[Idea]**
Note that the number of positively labeled points can range from $0$ to $n$, and the positively labeled points must be consecutive in the sorted order. Thus there are $1, n, n-1, ..., 1$ possibilities for these cases respectively, leading to the result above.
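The same kind of brute-force check works here (again an illustrative sketch; `interval_labelings` is a hypothetical helper): enumerate every block of consecutive points as the positive set, plus the all-negative labeling.
```python
# Illustrative sketch: count the distinct labelings that interval classifiers
# induce on n points on a line; the count should be (n^2 + n + 2) / 2.
def interval_labelings(n):
    labelings = {tuple([-1] * n)}        # interval containing no point: all negative
    for i in range(n):                   # positive points form a consecutive block i..j
        for j in range(i, n):
            labelings.add(tuple(1 if i <= k <= j else -1 for k in range(n)))
    return labelings

print(len(interval_labelings(4)))        # prints 11 = (4**2 + 4 + 2) / 2
```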
:::info
**Example (Linear Classifiers in $\mathbb{R}^d$)**
Let $\mathcal{X}=\mathbb{R}^d$, $\mathcal{H}=\{h: \textbf{x}\to \text{sign}(\textbf{w}^T\textbf{x}) \mid \textbf{w}\in \mathbb{R}^d \}$. Then,
$$G_{\mathcal{H}}(n)=2\sum_{i=0}^{d-1} {n-1\choose i}$$
$$VCDim(\mathcal{H})=d$$
:::
Let $\textbf{I}_d$ be the $d\times d$ identity matrix, whose columns are the unit vectors $\textbf{e}_i$, $i=1,...,d$:
$$\textbf{I}_d =
\begin{bmatrix}
\vert & & \vert \\
\textbf{e}_1 &... & \textbf{e}_d \\
\vert & & \vert
\end{bmatrix}
$$
Then for any output labeling $\textbf{y}=(y_1,...,y_d)^T$, taking $\textbf{w}=\textbf{y}$ gives $\text{sign}(\textbf{w}^T\textbf{e}_i)=\text{sign}(y_i)=y_i$, so the $d$ points $\{\textbf{e}_1, ..., \textbf{e}_d\}$ are shattered.
Consider any $d+1$ points $\{\textbf{x}_1, ..., \textbf{x}_{d+1}\}$. Since the dimension of $\mathbb{R}^d$ is $d$, these vectors are linearly dependent, i.e.,
$$\exists \textbf{a} = (a_1,...,a_{d+1})^T \neq\textbf{0}\text{ s.t. } \sum_{i=1}^{d+1}a_i\textbf{x}_i=\textbf{0}$$
Denote the set of indices $I=\{i \mid a_i>0\}$ and $J=\{j\mid a_j<0\}$. Then $I$ and $J$ are disjoint. We then have
$$\sum_{i\in I}a_i\textbf{x}_i=\sum_{j\in J}\mid a_j \mid\textbf{x}_j$$
If these $d+1$ points can be shattered, then there exists some $\textbf{w}\in \mathbb{R}^{d}$ such that
\begin{align}
\textbf{w}\cdot\textbf{x}_i&>0, \forall i \in I\\
\textbf{w}\cdot\textbf{x}_j&<0, \forall j \in J\\
\end{align}
Suppose first that both $I$ and $J$ are non-empty. Then this yields
$$0<\sum_{i\in I}a_i\textbf{w}\cdot\textbf{x}_i=\textbf{w}\cdot\left(\sum_{i\in I}a_i\textbf{x}_i\right)=\textbf{w}\cdot\left(\sum_{j\in J}\mid a_j \mid\textbf{x}_j\right)=\sum_{j\in J}\mid a_j \mid\textbf{w}\cdot\textbf{x}_j<0$$
a contradiction. Finally, note that if either $I$ or $J$ is empty, the corresponding sum is $\textbf{0}$, so the chain above reads $0<0$ and the contradiction still follows.
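As a sanity check of the shattering construction (a small numerical sketch, not part of the proof), we can verify for a specific $d$ that $\textbf{w}=\textbf{y}$ realizes every labeling of $\{\textbf{e}_1, ..., \textbf{e}_d\}$:
```python
# Numerical check: with points e_1, ..., e_d, the choice w = y realizes every
# labeling y in {-1, +1}^d, since sign(w^T e_i) = sign(y_i) = y_i.
import numpy as np
from itertools import product

d = 4
E = np.eye(d)                                   # row i is e_i
for y in product([-1, 1], repeat=d):
    w = np.array(y, dtype=float)
    assert np.array_equal(np.sign(E @ w), w)    # labeling y is realized
print(f"all {2**d} labelings of d = {d} points are realized")
```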
:::warning
If $\mathcal{H}=\{h: \textbf{x}\to \text{sign}(\textbf{w}^T\textbf{x}+w_0) \mid \textbf{w}\in \mathbb{R}^d, w_0\in\mathbb{R} \}$, then it becomes an *affine classifier*, which can be viewed as adding a constant coordinate $1$. Let the augmented vector be
$$\textbf{z}:=[1, \textbf{x}]^T\in\mathbb{R}^{d+1}$$
and redefine the parameters as
$$\textbf{u}:=[w_0,\textbf{w}]^T\in\mathbb{R}^{d+1}$$
Then the classifier can be rewritten as $\mathcal{H}=\{h: \textbf{z}\to \text{sign}(\textbf{u}^T\textbf{z}) \mid \textbf{u}\in \mathbb{R}^{d+1} \}$, thus
$$VCDim(\mathcal{H})=d+1$$
:::
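A tiny numerical illustration of this augmentation trick (a sketch with arbitrary values, not part of the notes):
```python
# The affine classifier sign(w^T x + w_0) equals the homogeneous linear
# classifier sign(u^T z) after augmenting x with a constant coordinate 1.
import numpy as np

x = np.array([2.0, -1.0, 0.5])
w, w0 = np.array([1.0, 3.0, -2.0]), 0.7
z = np.concatenate(([1.0], x))          # augmented input  z = [1, x]
u = np.concatenate(([w0], w))           # augmented weight u = [w_0, w]
assert np.isclose(u @ z, w @ x + w0)    # same score, hence the same sign
```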
We'll leave an exercise here.
:::info
**Example (Axis aligned rectangles)**
Show that the class of axis-aligned rectangles in $\mathbb{R}^2$ has $VCDim(\mathcal{H})=4$.
:::
### Properties of VC dimension
:::success
**Lemma**
For a finite hypothesis class, $G_{\mathcal{H}}(n)\leq\mid\mathcal{H}\mid$ and $VCDim(\mathcal{H})\leq\log_2{\mid\mathcal{H}\mid}$.
:::
**[proof]**
The first statement is immediate: since there are only $\mid\mathcal{H}\mid$ functions, there are at most $\mid\mathcal{H}\mid$ possible labelings.
For the second statement, let $d=VCDim(\mathcal{H})$. By the definition of shattering,
$$2^d=G_{\mathcal{H}}(d)\leq \mid \mathcal{H} \mid$$
Therefore $VCDim(\mathcal{H})\leq\log_2{\mid\mathcal{H}\mid}$.
:::success
**Lemma (Sauer)**
Let $\mathcal{H}$ be a class of functions with binary outputs and VC dimension $VCDim(\mathcal{H})=d$. Then for all $n\in\mathbb{N}$,
$$G_{\mathcal{H}}(n)\leq\sum_{i=0}^d {n\choose i}=\Phi_d(n)=O(n^d)$$
Furthermore, for $n\geq d$,
$$G_{\mathcal{H}}(n)\leq\left(\frac{ne}{d}\right)^d$$
:::
We first prove the inequality $G_{\mathcal{H}}(n)\leq\Phi_d(n)$ by induction on $n+d$.
**Base Case:**
For $d=0$, $\mathcal{H}$ cannot shatter even a single point, so all functions in $\mathcal{H}$ induce the same labeling, i.e., $G_{\mathcal{H}}(n)=\Phi_0(n)=1$.
For $n=0$, labeling the empty set has exactly one possibility, i.e., $G_{\mathcal{H}}(0)=\Phi_d(0)=1$.
**Induction Step:**
Assume that the theorem holds for all $n'+d'<n+d$.
Given $S=\{x_1, ..., x_n\}$, and without loss of generality, identify functions that agree on $S$, so that distinct functions in $\mathcal{H}$ induce distinct labelings of $S$ (that is, $\forall h\neq h' \in \mathcal{H},\ h(S)\neq h'(S)$).
We define $S'=\{x_1, ..., x_{n-1}\}$ and two classes of functions:
\begin{align}
\mathcal{H}_1&:=\text{a maximal subset of }\mathcal{H}\text{ such that }\forall h\neq h' \in \mathcal{H}_1,\ h(S')\neq h'(S')\\
\mathcal{H}_2&:=\{h \in \mathcal{H} \mid \exists h'\in\mathcal{H}_1 \text{ s.t. } h(x_i)=h'(x_i)\ \forall i<n \text{ and } h(x_n)\neq h'(x_n)\}
\end{align}
That is, $\mathcal{H}_1$ keeps exactly one representative for each distinct labeling of $S'$, and $\mathcal{H}_2$ collects the functions that agree with some representative on the first $(n-1)$ inputs but differ at $x_n$. Note that $\mathcal{H}_1$ and $\mathcal{H}_2$ form a *partition* of $\mathcal{H}$. As a concrete example with $n=2$: if the labelings of $S=\{x_1,x_2\}$ realized by $\mathcal{H}$ are $(+,+)$, $(+,-)$ and $(-,+)$, we may take $\mathcal{H}_1=\{(+,+),(-,+)\}$, and then $\mathcal{H}_2=\{(+,-)\}$.
Since $\mathcal{H}_1 \subseteq \mathcal{H}$, we have $VCdim(\mathcal{H}_1) \leq VCdim(\mathcal{H}) = d$; by the induction hypothesis, $G_{\mathcal{H}_1}(n-1)\leq \Phi_d(n-1)$.
For every $h\in\mathcal{H}_2$, both labels of $x_n$ are realized in $\mathcal{H}$ together with $h$'s labeling of the first $(n-1)$ points. Hence, if $\mathcal{H}_2$ shatters some $T\subseteq S'$, then $\mathcal{H}$ shatters $T\cup\{x_n\}$, so $VCdim(\mathcal{H}_2) \leq d -1$; by the induction hypothesis, $G_{\mathcal{H}_2}(n-1)\leq \Phi_{d-1}(n-1)$.
Combining the two inequalities yields
\begin{align}
G_{\mathcal{H}}(n)&\leq G_{\mathcal{H}_1}(n-1)+G_{\mathcal{H}_2}(n-1)\\
&\leq \Phi_d(n-1)+\Phi_{d-1}(n-1)=\Phi_d(n)
\end{align}
The last equality uses the following recurrence (Pascal's rule), whose verification is left as an exercise:
$${n\choose k}={n-1\choose k}+{n-1\choose k-1}$$
For $n\geq d$, we have
\begin{align}
\sum_{i=0}^d {n\choose i}&\leq\left(\frac{n}{d}\right)^d\sum_{i=0}^d {n\choose i}\left(\frac{d}{n}\right)^i\leq\left(\frac{n}{d}\right)^d\sum_{i=0}^n {n\choose i}\left(\frac{d}{n}\right)^i\\
&=\left(\frac{n}{d}\right)^d\left(1+\frac{d}{n}\right)^n\leq\left(\frac{ne}{d}\right)^d
\end{align}
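To get a feel for these bounds, here is a small numerical comparison (an illustrative sketch; `phi` is a hypothetical helper) of $\Phi_d(n)$ with $(ne/d)^d$ and with $2^n$:
```python
# Compare Sauer's bound Phi_d(n) with the cleaner (ne/d)^d bound for n >= d.
from math import comb, e

def phi(d, n):
    return sum(comb(n, i) for i in range(d + 1))

d = 3
for n in (3, 10, 100):
    print(n, phi(d, n), round((n * e / d) ** d, 1), 2 ** n)
# Phi_d(n) grows only polynomially in n (roughly n^d), far below 2^n.
```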
## Generalization Bound for Infinite Function Class
Now we are ready to bound the generalization gap with growth function $G_{\mathcal{H}}$ and VC dimension $VCDim(\mathcal{H})$.
### Symmetrization
We'll make use of the following fact and will not prove it here:
(A similar one will appear in the next unit again)
:::success
**Lemma (Symmetrization)**
For any $\epsilon>\sqrt{\frac{2}{n}}$,
$$\Pr\{\underset{h\in\mathcal{H}}{\sup}\mid R_P(h) - \hat{R_n}(h)\mid \geq \epsilon \} \leq 2\Pr \{\underset{h\in\mathcal{H}}{\sup}\mid\hat{R_n'}(h) - \hat{R_n}(h)\mid \geq \frac{\epsilon}{2} \}$$
where $\hat{R_n'}(h)$ denotes the empirical risk on a *ghost dataset* $\mathcal{D}^{n'}$ of the same size $n$, drawn independently of the original dataset $\mathcal{D}^n$.
:::
The symmetrization lemma replaces the true risk with the empirical risk over the ghost sample. As a result, the right-hand side depends only on the samples $\mathcal{D}^n$ and $\mathcal{D}^{n'}$. Note that $\mathcal{D}^{n'}$ has no practical implications: we don't need another dataset at training time; it's just a mathematical device used in the proof.
Now we are ready to give a generalization bound for infinite function class.
:::success
**Theorem (Vapnik-Chervonenkis)**
With probability at least $1-\delta$,
$$\forall h\in\mathcal{H}, | R_P(h)-\hat{R_n}(h) | \leq 2\sqrt{\frac{2}{n} \left(\ln G_{\mathcal{H}}(2n)+ \ln\frac{4}{\delta} \right)}$$
:::
**[proof]**
By symmetrization, the supremum now depends only on the values of $h\in\mathcal{H}$ on the two datasets $\mathcal{D}^n$ and $\mathcal{D}^{n'}$, so we only need to consider how $\mathcal{H}$ can label these $2n$ points, i.e., at most $G_{\mathcal{H}}(2n)$ distinct behaviors. So using the union bound,
\begin{align}
\Pr\{\underset{h\in\mathcal{H}}{\sup}| R_P(h) - \hat{R_n}(h)| \geq \epsilon \} &\leq 2 \Pr \{\underset{h\in\mathcal{H}}{\sup}|\hat{R_n'}(h) - \hat{R_n}(h)| \geq \frac{\epsilon}{2} \}\\
&\leq 2G_{\mathcal{H}}(2n)\Pr \{|\hat{R_n'}(h) - \hat{R_n}(h)| \geq \frac{\epsilon}{2} \}\\
&\leq 4G_{\mathcal{H}}(2n)\exp{\left(-\frac{n\epsilon^2}{8}\right)}
\end{align}
The last inequality uses Hoeffding's inequality.
Setting $\delta = 4G_{\mathcal{H}}(2n)\exp{\left(-\frac{n\epsilon^2}{8}\right)}$ and solving for $\epsilon$ gives the desired result.
### Connecting to the VC dimension
Recalling Sauer's lemma, we immediately obtain the following corollary:
:::success
**Corollary (Generalization Bound with VC dimension)**
Let the hypothesis space $\mathcal{H}$ have $VCDim(\mathcal{H}) = d$. Then, for $n\geq d$, with probability at least $1-\delta$,
$$\forall h\in\mathcal{H}, | R_P(h)-\hat{R_n}(h) | \leq 2\sqrt{\frac{2}{n} \left(d\ln \frac{2en}{d}+ \ln\frac{4}{\delta} \right)}$$
:::
Note that in order for the result to be meaningful, $d$ must be finite.
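As an illustration of how the bound scales (a sketch; `vc_bound` and the chosen values of $n$, $d$, $\delta$ are arbitrary), plugging numbers into the corollary shows the gap shrinking roughly like $\sqrt{d\ln n / n}$:
```python
# Evaluate the VC bound 2*sqrt((2/n) * (d*ln(2en/d) + ln(4/delta))) from above.
from math import e, log, sqrt

def vc_bound(n, d, delta):
    return 2 * sqrt(2 / n * (d * log(2 * e * n / d) + log(4 / delta)))

for n in (1_000, 10_000, 100_000):
    print(n, round(vc_bound(n, d=10, delta=0.05), 3))   # bound decreases with n
```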
###### tags: `machine learning`