# 1-1: Learning Framework for Supervised Learning
**Statistical Learning Theory (SLT)** is a field that formalizes the framework of machine learning in order to rigorously establish when and why learning is possible. The questions that statistical learning theory aims to address are quite fundamental:
* Which learning tasks can be performed by computers in general (positive and negative results)?
* What kind of assumptions do we have to make such that machine learning can be successful?
* What are the key properties a learning algorithm needs to satisfy in order to be successful?
* Which performance guarantees can we give on the results of certain learning algorithms?
To answer these questions, we need to build on a mathematical framework. Let's start with supervised learning.
## Setup
Let $\mathcal{X}$ and $\mathcal{Y}$ denote the **input space** and **output space** respectively. A data set $\mathcal{D}^n := \{ (x_1, y_1), ..., (x_n, y_n)\}$ is collected where each pair $(x_i, y_i)$ has $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$.
Assume we have a predictor $h: \mathcal{X} \rightarrow \mathcal{Y}$. How do we quantify its performance at predicting the data?
We shall introduce the **loss function**. Let $\hat{y} = h(x)$ be the predicted output and $y$ be the true output (corresponding to $x$); then a loss function can be defined, for example, as
$$l(\hat{y},y) =
\begin{cases}
\mathbb{I}\{\hat{y} \neq y\},&\text{(0/1 loss)}\\
(y-\hat{y})^2,&\text{(MSE loss)}
\end{cases}$$
Usually, the 0/1 loss is used for classification problems, whereas the MSE loss is used for regression.
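As a quick concrete sketch, the two losses can be written in Python (the function names here are our own, not from any library):

```python
import numpy as np

def zero_one_loss(y_hat, y):
    """0/1 loss: 1 if the prediction is wrong, 0 otherwise."""
    return np.asarray(y_hat != y, dtype=float)

def squared_loss(y_hat, y):
    """Squared (MSE) loss: the squared difference between prediction and truth."""
    return (np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)) ** 2

# Classification: predicted [+1, -1] vs true [-1, -1] -> losses [1, 0].
print(zero_one_loss(np.array([1, -1]), np.array([-1, -1])))  # [1. 0.]
# Regression: prediction 2.5 vs true 3.0 -> loss 0.25.
print(squared_loss(2.5, 3.0))                                # 0.25
```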
We must also make a probabilistic assumption about how $x$ and $y$ are related. Assume that the pairs $(x_i, y_i)$ in the data set are i.i.d. samples drawn from the joint distribution $P$ of two random variables (RVs) $X$ and $Y$.
The performance of the predictor $h$ is then the expected loss, or the **risk (generalization error)**:
$$R_P(h) := \mathbb{E}_{(X,Y)\sim P}[l(h(X),Y)]$$
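Since the risk is an expectation over $P$, it can be approximated by Monte Carlo sampling whenever $P$ can be sampled from. Below is a minimal sketch; the particular choice of $P$ and of the predictor $h$ is an arbitrary illustration of our own:

```python
import numpy as np

rng = np.random.default_rng(0)

# An assumed, illustrative joint distribution P:
# X ~ Uniform[0, 1], Y = +1 with probability eta(X), else -1.
def eta(x):
    return 0.2 + 0.6 * x   # Pr{Y = 1 | X = x}

n = 100_000
x = rng.uniform(0, 1, n)
y = np.where(rng.uniform(0, 1, n) < eta(x), 1, -1)

# A simple (hypothetical) predictor: h(x) = sign(x - 1/2).
h = np.where(x >= 0.5, 1, -1)

# Monte Carlo estimate of the risk R_P(h) under 0/1 loss.
risk_estimate = np.mean(h != y)
print(f"estimated risk: {risk_estimate:.3f}")   # ~0.35 for this particular P and h
```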
## Learning Algorithm
### No Free Lunch (NFL) Theorem
> Learning is not possible without assumptions.
In general, we cannot search for the optimal solution among all measurable functions, since this would become an infinite-dimensional optimization problem.
So, is there a *universal* learning algorithm that can deal with any kind of problem? The following theorem answers "no":
:::success
**Theorem (No Free Lunch, Wolpert et al., 1996)**
Without prior assumptions (a hypothesis), the test-set errors of any two algorithms are equal when averaged over all possible scenarios.
:::
In other words, for any learning algorithm there always exists a bad distribution $P$ on which the generalization error is arbitrarily bad.
So, does that mean we are doomed to have no information outside the training set $\mathcal{D}^n$?
### Inductive Bias
We can restrict ourselves to a function space $\mathcal{H}$ (formally called the **Hypothesis Space**, or simply *the model*) and find the best hypothesis $h$ in that space. Such a restriction is often called an *inductive bias*.
A simple way to facilitate the procedure is to parametrize the functions by some variables; for example, consider the class of functions described by a parameter $\textbf{w}$:
$$\mathcal{H}_\textbf{w} := \{h_\textbf{w} \mid \textbf{w}\in W\subseteq \mathbb{R}^d \}$$
and the objective is to minimize the empirical risk $\hat{R_n}(h_\textbf{w})$ (defined below) over the parameter $\textbf{w}$.
This is the main approach in statistical learning theory, since it does not put restrictive assumptions on the data. Such a hypothesis space can be viewed as reflecting some *prior knowledge* that the learner has about the task.
A **Learning Algorithm** under the inductive bias, $\mathcal{A}: \mathcal{D}^n \rightarrow \mathcal{H}$, finds an estimator $h \in \mathcal{H}$ in the hypothesis space from the given data set $\mathcal{D}^n := \{ (x_1, y_1), ..., (x_n, y_n)\}$. We can use a simple picture to illustrate the relationship between the components we've mentioned above:

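In code, these components can be sketched roughly as follows (a minimal illustration of our own, with a linear threshold class and a brute-force grid search standing in for the learning algorithm):

```python
import numpy as np

# Parametrized hypothesis space H = { h_{w,b}(x) = sign(w . x + b) } (our example).
def make_hypothesis(w, b):
    def h(x):
        return np.where(x @ w + b >= 0, 1, -1)
    return h

# A learning algorithm A: D^n -> H, here a brute-force search over a
# small parameter grid (purely illustrative, not a practical method).
def learn(xs, ys, grid):
    best_h, best_err = None, np.inf
    for w, b in grid:
        h = make_hypothesis(w, b)
        err = np.mean(h(xs) != ys)   # fraction of training mistakes
        if err < best_err:
            best_h, best_err = h, err
    return best_h

xs = np.array([[0.0], [1.0], [2.0], [3.0]])
ys = np.array([-1, -1, 1, 1])
grid = [(np.array([1.0]), -t) for t in np.linspace(0.0, 3.0, 31)]
h_hat = learn(xs, ys, grid)
print(h_hat(xs))   # [-1 -1  1  1] once a separating threshold is found
```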
### Excess Risk Decomposition
In general, the underlying distribution $P$ in an ML problem is unknown, so one uses the best predictor in $\mathcal{H}$ as the baseline for measuring the quality of the algorithm $\mathcal{A}$:
$$h^*_\mathcal{H} := \underset{h\in\mathcal{H}}{\arg\min} R_P(h)$$
For any predictor $h$, we can define and decompose the **excess risk** as follows, where $h^*$ denotes the Bayes hypothesis, i.e. the risk minimizer over all measurable functions (defined formally below):
$$\mathcal{ER}_P(h):=R_P(h)-R_P(h^*)=\underbrace{[R_P(h)-R_P(h^*_\mathcal{H})]}_{\text{estimation error}}+\underbrace{[R_P(h^*_\mathcal{H})-R_P(h^*)]}_{\text{approximation error}}$$

The approximation error is deterministic and is caused by the restriction of using $\mathcal{H}$.
The estimation error is caused by using a finite sample of size $n$, which cannot completely represent the underlying distribution; this term is random, and it is the part that can be analyzed and optimized. (We'll cover this later.)
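To see the decomposition in action, here is a toy numerical sketch (all modeling choices below, including the distribution and the noise level, are our own illustrative assumptions): $X \sim \mathrm{Unif}[0,1]$, $Y = \sin(2\pi X) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, squared loss, and $\mathcal{H}$ the class of constant predictors $h_c(x) = c$. Here the Bayes predictor is $\mathbb{E}[Y \mid X = x] = \sin(2\pi x)$ with Bayes risk $\sigma^2$, and the best constant is $c^* = \mathbb{E}[Y] = 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3   # noise standard deviation (arbitrary choice)
n = 50        # sample size

# Training data from the assumed model Y = sin(2*pi*X) + noise.
x_train = rng.uniform(0, 1, n)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, sigma, n)

# For constant predictors h_c(x) = c under squared loss:
#   R_P(c) = Var(Y) + (E[Y] - c)^2, with E[Y] = 0 here.
c_hat = y_train.mean()            # ERM over constants is the sample mean
var_y = sigma**2 + 0.5            # Var(Y) = sigma^2 + Var(sin(2*pi*X)), Var(sin) = 1/2

bayes_risk  = sigma**2            # risk of the Bayes predictor sin(2*pi*x)
risk_c_star = var_y               # risk of the best constant c* = 0
risk_c_hat  = var_y + c_hat**2    # risk of the ERM constant

print("estimation error   :", risk_c_hat - risk_c_star)   # (c_hat)^2, shrinks with n
print("approximation error:", risk_c_star - bayes_risk)   # 0.5, independent of n
```

The approximation error stays at $\mathrm{Var}(\sin(2\pi X)) = 1/2$ no matter how much data we collect, while the estimation error $(\hat{c} - c^*)^2$ shrinks as $n$ grows.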
## Learning Algorithms: Examples
### Bayes Hypothesis
When the distribution $P$ is known, one can find the best function $h^*$ among all measurable functions, i.e. the one minimizing the risk, by some optimization method. This $h^*$ is usually termed the **Bayes hypothesis**:
$$h^* := \underset{h}{\arg\min} R_P(h)$$
And the risk of the Bayes hypothesis is called the **Bayes risk**:
$$R^*:=R_P(h^*)=\inf_h R_P(h)$$
For the loss functions introduced above, analytical solutions to this optimization problem are known; for example:
* classification problem with 0-1 loss: maximum a posteriori probability (MAP) rule
* regression problem with squared loss: minimum mean square error (MMSE) rule
Let's look at some examples of Bayes hypothesis.
:::info
**Example: Binary Classification**
For binary classification with $\mathcal{Y} = \{ \pm 1 \}$ and 0/1 loss, one can show that the following predictor is a Bayes hypothesis:
$$h^*(x) =
\begin{cases}
1,&Pr\{Y=1 \mid X=x\} \geq \frac{1}{2}\\
-1,&\text{otherwise}
\end{cases} $$
:::
**[Proof]**
The risk of a predictor $h$ can be written as
\begin{aligned}
R_P(h) &= \mathbb{E}_{(X,Y)\sim P}[\mathbb{I}\{h(X) \neq Y\}]\\
&= \mathbb{E}_X[\mathbb{E}_{Y\mid X}[\mathbb{I}\{h(X) \neq Y\}]]\\
&= \mathbb{E}_X[Pr\{Y=1 \mid X\}\cdot \mathbb{I}\{h(X) \neq 1\} + Pr\{Y=-1 \mid X\}\cdot \mathbb{I}\{h(X) \neq -1\}]\\
&=\mathbb{E}_X[\eta(X) \mathbb{I}\{h(X) =-1\} + (1-\eta(X) )\mathbb{I}\{h(X) =1\}]
\end{aligned}
where we write $\eta(x) := Pr\{Y=1 \mid X=x\}$. For every $x$, the integrand is at least $\min(\eta(x), 1-\eta(x))$, so the Bayes risk is
$$R_P^* = \mathbb{E}_X[\min(\eta(X), 1-\eta(X))]$$
In particular, one can show that $h^*(x)$ achieves this risk. Since $$\eta(X)<\frac{1}{2} \implies \eta(X) = \min(\eta(X), 1-\eta(X))$$
and similarly $1-\eta(X) = \min(\eta(X), 1-\eta(X))$ when $\eta(X) \geq \frac{1}{2}$, the risk of $h^*$ is
\begin{aligned}
R_P(h^*) &= \mathbb{E}_X[\eta(X) \mathbb{I}\{h^*(X) =-1\} + (1-\eta(X) )\mathbb{I}\{h^*(X) =1\}]\\
&=\mathbb{E}_X[\eta(X) \mathbb{I}\{\eta(X)<\frac{1}{2}\} + (1-\eta(X) )\mathbb{I}\{\eta(X)\geq \frac{1}{2}\}]\\
&= \mathbb{E}_X[(\mathbb{I}\{\eta(X)<\frac{1}{2}\} + \mathbb{I}\{\eta(X)\geq \frac{1}{2}\}) \min(\eta(X), 1-\eta(X))]\\
&= \mathbb{E}_X[\min(\eta(X), 1-\eta(X))]
\end{aligned}
Thus $h^*$ is a Bayes hypothesis for binary classification.
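A quick Monte Carlo sanity check of this result, with an arbitrary illustrative choice of $\eta$ (our own): the empirical error of $h^*$ should match $\mathbb{E}_X[\min(\eta(X), 1-\eta(X))]$.

```python
import numpy as np

rng = np.random.default_rng(2)
eta = lambda x: 0.9 * x   # assumed Pr{Y = 1 | X = x}, with X ~ Uniform[0, 1]

n = 200_000
x = rng.uniform(0, 1, n)
y = np.where(rng.uniform(0, 1, n) < eta(x), 1, -1)

h_star = np.where(eta(x) >= 0.5, 1, -1)   # the Bayes predictor from the example

# Both quantities below estimate the Bayes risk and should agree.
print("risk of h*       :", np.mean(h_star != y))
print("E[min(eta,1-eta)]:", np.mean(np.minimum(eta(x), 1 - eta(x))))
```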
:::info
**Example: Regression**
For regression with $\mathcal{Y} = \mathbb{R}$ and MSE loss, one can show that the following predictor is a Bayes hypothesis:
$$\hat{h}(x) = \mathbb{E}[Y \mid X=x]$$
:::
The proof will be left as an exercise. :smirk:
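While the proof is left as an exercise, a Monte Carlo sanity check is straightforward (the data model below is our own toy choice): the conditional mean should attain a smaller MSE than any competing predictor.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.uniform(-1, 1, n)
y = x**2 + rng.normal(0, 0.1, n)   # so E[Y | X = x] = x^2 (toy model)

# The conditional mean should attain the smallest MSE (~0.01, the noise variance).
for name, pred in [("E[Y|X] = x^2", x**2),
                   ("x^2 + 0.05  ", x**2 + 0.05),
                   ("|x|         ", np.abs(x))]:
    print(name, "MSE:", np.mean((y - pred)**2))
```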
### Empirical Risk Minimization (ERM)
Even when an inductive bias is imposed, the distribution $P$ of the data is in general unknown. Thus it is hard to find the best solution in the hypothesis space $\mathcal{H}$ when $P$ is unknown:
$$h^*_\mathcal{H} := \underset{h\in\mathcal{H}}{\arg\min} R_P(h)$$
However, we can replace the risk with the **empirical risk**, which can be computed directly from the data:
$$ \hat{R_n}(h)= \frac{1}{n}\sum_{i=1}^nl(h(x_i),y_i)$$
Thus, it becomes quite reasonable to minimize the empirical risk instead. This learning paradigm is known as **Empirical Risk Minimization (ERM)**:
$$\hat{h_n} := \underset{h\in\mathcal{H}}{\arg\min}\hat{R_n}(h)$$
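For squared loss and a linear hypothesis class, the ERM problem even has a closed-form solution (ordinary least squares). A minimal sketch on synthetic data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(-1, 1, n)
y = 2.0 * x - 1.0 + rng.normal(0, 0.2, n)   # assumed linear ground truth

# H = { h_w(x) = w1 * x + w0 }; ERM with squared loss is least squares.
A = np.column_stack([x, np.ones(n)])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
h_hat = lambda t: w[0] * t + w[1]

emp_risk = np.mean((y - h_hat(x))**2)
print("w:", w, "| empirical risk:", emp_risk)   # w ~ [2, -1], risk ~ 0.04
```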
Although the ERM approach seems very natural, it may cause overfitting.
For example, let's consider an extreme case, the memorizing predictor:
$$\hat{h}(x)=\begin{cases}
y_i, &\text{if } x=x_i \text{ for some } (x_i, y_i) \in \mathcal{D}^n\\
\text{random},&\text{otherwise}
\end{cases}$$
This predictor attains zero empirical risk, yet it clearly does NOT learn anything! So when using the ERM method, one should be very careful about overfitting.
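The memorizing predictor is easy to simulate (toy setup of our own): it attains zero empirical risk, yet performs at chance level on fresh data.

```python
import numpy as np

rng = np.random.default_rng(5)

def memorizer(xs, ys):
    table = dict(zip(xs.tolist(), ys.tolist()))
    def h(x):
        # Return the stored label if x was seen, otherwise guess at random.
        return table.get(x, rng.choice([-1, 1]))
    return h

x_train = rng.uniform(0, 1, 100)
y_train = np.where(np.sin(10 * x_train) > 0, 1, -1)   # arbitrary labeling rule
h = memorizer(x_train, y_train)

train_err = np.mean([h(xi) != yi for xi, yi in zip(x_train, y_train)])
x_test = rng.uniform(0, 1, 1000)    # fresh draws: almost surely unseen
y_test = np.where(np.sin(10 * x_test) > 0, 1, -1)
test_err = np.mean([h(xi) != yi for xi, yi in zip(x_test, y_test)])
print("train error:", train_err, "| test error:", test_err)   # 0.0 vs ~0.5
```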
### Structural Risk Minimization (SRM)
A variant of ERM is SRM, which mitigates the overfitting issue.
Take a sequence of increasing hypothesis spaces $\mathcal{H}_1, \mathcal{H}_2, ...$ with $\mathcal{H}_i \subset \mathcal{H}_{i+1}$ and $\cup_{i=1}^{\infty}\mathcal{H}_i = \mathcal{H}$.
Given the dataset $\mathcal{D}^n$, we find the function that minimizes the empirical risk within each model, and select among these solutions by adding a complexity penalty $\text{pen}(n, i)$ that grows with the model index $i$:
$$\hat{h_n} := \underset{h\in\mathcal{H}_i,\ i\in\mathbb{N}}{\arg\min}\left[\hat{R_n}(h) + \text{pen}(n, i)\right]$$
This is also called the *method of Sieves*.
While SRM benefits from strong theoretical guarantees, it is typically computationally very expensive, since it requires determining the solution of multiple ERM problems.
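A minimal SRM sketch under our own illustrative assumptions (nested polynomial classes $\mathcal{H}_d$ of increasing degree, and a simple penalty growing with the model index; real SRM penalties come from complexity measures such as the VC dimension):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.uniform(-1, 1, n)
y = np.sin(2 * x) + rng.normal(0, 0.2, n)   # toy data of our own choosing

best_d, best_score = None, np.inf
for d in range(1, 11):                       # nested classes H_1 ⊂ H_2 ⊂ ...
    coeffs = np.polyfit(x, y, d)             # ERM (least squares) within H_d
    emp_risk = np.mean((np.polyval(coeffs, x) - y) ** 2)
    penalty = 0.5 * (d + 1) / n              # illustrative complexity penalty
    score = emp_risk + penalty
    if score < best_score:
        best_d, best_score = d, score
print("selected degree:", best_d)
```

Without the penalty term, the loop would always favor the largest class, i.e. plain ERM over the union; the penalty is what trades empirical fit against model complexity.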
###### tags: `machine learning`