Statistical Learning Theory (SLT) is a field that formalizes the framework of machine learning in order to rigorously justify when and why learning is possible. The questions that statistical learning aims to address are quite fundamental.
To answer these questions, we need to build a suitable mathematical framework. Let's start with supervised learning.
Let $\mathcal{X}$ denote the input space and $\mathcal{Y}$ the output space, and suppose we observe a training set $S = \{(x_1, y_1), \dots, (x_n, y_n)\} \subseteq \mathcal{X} \times \mathcal{Y}$.
Assume we have a predictor $h: \mathcal{X} \to \mathcal{Y}$. How should we measure its quality?
We shall introduce the loss function. Let $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}$ quantify the discrepancy between a prediction $h(x)$ and the true label $y$.
Usually, the 0/1 loss $\ell(\hat{y}, y) = \mathbb{1}[\hat{y} \neq y]$ is used for classification problems, whereas the squared (MSE) loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$ is used for regression.
We must also make a probabilistic assumption about how the data is generated: we assume each sample $(x_i, y_i)$ is drawn i.i.d. from an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$.
The performance of the predictor $h$ is then measured by its risk, the expected loss under $\mathcal{D}$:

$$R(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\ell(h(x), y)\big].$$
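To make the setup concrete, here is a minimal Python sketch; the toy distribution, the function names, and the threshold predictor are my own illustrative choices, not from the text. It implements the 0/1 and squared losses and approximates the risk $R(h)$ by Monte Carlo averaging over fresh samples from $\mathcal{D}$.

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    # 0/1 loss: 1 if the prediction is wrong, 0 otherwise
    return (y_pred != y_true).astype(float)

def squared_loss(y_pred, y_true):
    # squared (MSE) loss for regression
    return (y_pred - y_true) ** 2

def estimate_risk(h, sample_from_D, loss, n=100_000, seed=0):
    """Approximate R(h) = E_{(x,y)~D}[loss(h(x), y)] by averaging over n draws."""
    rng = np.random.default_rng(seed)
    x, y = sample_from_D(rng, n)
    return float(np.mean(loss(h(x), y)))

# Toy distribution D (an assumption for illustration):
# x ~ Uniform[0, 1], y = 1 with probability x, else 0.
def sample_from_D(rng, n):
    x = rng.uniform(0.0, 1.0, size=n)
    y = rng.binomial(1, x)
    return x, y

h = lambda x: (x >= 0.5).astype(int)              # a simple threshold predictor
print(estimate_risk(h, sample_from_D, zero_one_loss))   # roughly 0.25 for this D
```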
Learning is not possible without assumptions.
Usually, we would not search for the optimal solution among all measurable functions, since this would in general be an infinite-dimensional optimization problem.
So, is there a universal learning algorithm that can deal with any kind of problem? The following theorem answers "no":
Theorem (No Free Lunch, Wolpert et al., 1996)
Without a prior (hypothesis), the test-set errors of any two algorithms are equal when averaged over all possible scenarios.
In other words, for every learning algorithm there always exists a bad distribution $\mathcal{D}$ on which it performs no better than random guessing.
So, does that mean we are doomed to have no information beyond the training set $S$? Not quite.
We can restrict our search to a function space (hypothesis space) $\mathcal{H} \subseteq \{h : \mathcal{X} \to \mathcal{Y}\}$.
A simple way to make the procedure tractable is to parametrize the functions by some variables. For example, consider the class of functions described by a parameter $\theta \in \Theta$,

$$\mathcal{H} = \{h_\theta : \mathcal{X} \to \mathcal{Y} \mid \theta \in \Theta\},$$

and the objective becomes to minimize the empirical risk over the parameter $\theta$ rather than over an arbitrary function.
This is the main approach in statistical learning theory, since it does not impose restrictive assumptions on the data. Such a hypothesis space can be viewed as reflecting some prior knowledge that the learner has about the task.
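As an illustration of a parametrized hypothesis class, here is a small sketch; the linear-threshold family and the toy data below are assumptions of mine for illustration (the text keeps the class abstract). The class is $\mathcal{H} = \{h_\theta(x) = \mathbb{1}[\theta_0 + \theta_1 x \geq 0] : \theta \in \mathbb{R}^2\}$, and we evaluate the average 0/1 loss of one member on a sample.

```python
import numpy as np

def h_theta(theta, x):
    # linear-threshold predictor: 1 if theta_0 + theta_1 * x >= 0, else 0
    return (theta[0] + theta[1] * x >= 0).astype(int)

def empirical_risk(theta, x, y):
    """Average 0/1 loss of h_theta on the sample (x, y)."""
    return float(np.mean(h_theta(theta, x) != y))

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = rng.binomial(1, x)                               # toy data with P(y=1|x) = x
print(empirical_risk(np.array([-0.5, 1.0]), x, y))   # thresholds at x = 0.5
```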
A learning algorithm under the inductive bias $\mathcal{H}$ is a map $\mathcal{A}$ that takes a training set $S$ as input and outputs a hypothesis $\mathcal{A}(S) \in \mathcal{H}$.
In general, the hidden distribution $\mathcal{D}$ is unknown, so we cannot find the best predictor directly; instead, we analyze how far the learned predictor is from optimal.
For any predictor $h \in \mathcal{H}$, the excess risk can be decomposed as

$$R(h) - R^* = \underbrace{\Big(R(h) - \inf_{h' \in \mathcal{H}} R(h')\Big)}_{\text{estimation error}} + \underbrace{\Big(\inf_{h' \in \mathcal{H}} R(h') - R^*\Big)}_{\text{approximation error}},$$

where $R^*$ denotes the best achievable risk over all measurable functions (defined below).
The approximation error is deterministic and is caused by the restriction to $\mathcal{H}$ instead of all measurable functions.
The estimation error is random and is caused by the use of a finite sample $S$ to choose the predictor; under suitable conditions it shrinks as the sample size $n$ grows.
When the distribution $\mathcal{D}$ is known, the best possible predictor over all measurable functions is called the Bayes hypothesis:

$$h^* \in \arg\min_{h \text{ measurable}} R(h).$$

And the risk of the Bayes hypothesis is called the Bayes risk:

$$R^* = R(h^*) = \inf_{h \text{ measurable}} R(h).$$
For the loss functions introduced above, there are analytical solutions to this optimization problem. Let's look at some examples of the Bayes hypothesis.
Example: Binary Classification
For binary classification with $\mathcal{Y} = \{0, 1\}$ and the 0/1 loss, the Bayes hypothesis is

$$h^*(x) = \mathbb{1}\big[\eta(x) \geq 1/2\big], \quad \text{where } \eta(x) = \mathbb{P}(y = 1 \mid x).$$
[Proof]
The risk of a predictor $h$ under the 0/1 loss is

$$R(h) = \mathbb{P}(h(x) \neq y) = \mathbb{E}_x\big[\mathbb{P}(h(x) \neq y \mid x)\big],$$

where we rewrite the conditional error using $\eta(x) = \mathbb{P}(y = 1 \mid x)$:

$$\mathbb{P}(h(x) \neq y \mid x) = \eta(x)\,\mathbb{1}[h(x) = 0] + (1 - \eta(x))\,\mathbb{1}[h(x) = 1].$$

In particular, one can show that when $\eta(x) \geq 1/2$, the conditional error is minimized by predicting $h(x) = 1$,
and vice versa for $\eta(x) < 1/2$, where predicting $h(x) = 0$ is optimal.
Thus $h^*(x) = \mathbb{1}[\eta(x) \geq 1/2]$ minimizes the conditional error pointwise and therefore minimizes the risk. $\blacksquare$
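A quick numerical sanity check of this result is given below; the specific conditional probability $\eta(x) = x^2$ and the alternative thresholds are my own toy choices. Thresholding $\eta$ at $1/2$ should give the smallest estimated 0/1 risk among the candidates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0, 1, size=n)
eta = x ** 2                          # assumed conditional probability P(y=1|x)
y = rng.binomial(1, eta)

bayes = (eta >= 0.5).astype(int)      # Bayes hypothesis h*(x) = 1[eta(x) >= 1/2]
others = {t: (eta >= t).astype(int) for t in (0.2, 0.35, 0.65, 0.8)}

print("Bayes risk estimate:", np.mean(bayes != y))
for t, pred in others.items():
    print(f"threshold {t}:", np.mean(pred != y))   # each at least as large
```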
Example: Regression
For regression with $\mathcal{Y} = \mathbb{R}$ and the squared loss, the Bayes hypothesis is the conditional mean,

$$h^*(x) = \mathbb{E}[y \mid x].$$
The proof will be left as an exercise.
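As a hint toward the exercise, here is a small numerical check; the data-generating model $y = \sin(2\pi x) + \text{noise}$ and the candidate predictors are assumptions of mine. The conditional mean should attain the smallest estimated squared-loss risk.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=n)   # E[y|x] = sin(2*pi*x)

candidates = {
    "conditional mean": lambda x: np.sin(2 * np.pi * x),
    "zero predictor":   lambda x: np.zeros_like(x),
    "linear guess":     lambda x: 1.0 - 2.0 * x,
}
for name, h in candidates.items():
    # estimated squared-loss risk; smallest for the conditional mean
    print(name, np.mean((h(x) - y) ** 2))
```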
Even when the inductive bias is imposed, the distribution of the data $\mathcal{D}$ remains unknown, so the risk $R(h)$ cannot be computed, let alone minimized, directly.
However, we can replace the risk by the empirical risk, which can be computed from the sample:

$$\hat{R}_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i).$$

Thus, it becomes quite reasonable to minimize the empirical risk. This learning paradigm is known as Empirical Risk Minimization (ERM):

$$h_{\mathrm{ERM}} \in \arg\min_{h \in \mathcal{H}} \hat{R}_S(h).$$
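Here is a minimal ERM sketch under assumptions of my own: the hypothesis class is the one-parameter threshold family $\{x \mapsto \mathbb{1}[x \geq t]\}$, and the minimization is done by a simple grid search over $t$.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(0, 1, size=100)
y_train = rng.binomial(1, x_train)            # toy data with P(y=1|x) = x

def emp_risk(t, x, y):
    # empirical 0/1 risk of the threshold predictor 1[x >= t]
    return float(np.mean((x >= t).astype(int) != y))

grid = np.linspace(0, 1, 101)
risks = [emp_risk(t, x_train, y_train) for t in grid]
t_erm = grid[int(np.argmin(risks))]           # ERM solution within the class
print("ERM threshold:", t_erm, "empirical risk:", min(risks))
```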
Although the ERM approach seems very natural, it may cause overfitting.
For example, let's consider an extreme case: the memorizing predictor, which outputs $y_i$ whenever the query point coincides with a training point $x_i$ and outputs an arbitrary label otherwise. This predictor achieves zero empirical risk on the training set $S$.
It's clear that this predictor does NOT learn anything! So when using the ERM method, one should be very careful about overfitting.
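The memorizing predictor from the extreme case above can be written in a few lines; the toy data with labels independent of $x$ (so there is nothing to learn) and the choice of returning 0 on unseen points are my own illustrative assumptions. It attains zero empirical risk but only chance-level risk on fresh test data.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, size=50)
y_train = rng.binomial(1, 0.5, size=50)        # labels independent of x

def memorizer(x_query):
    """Return the stored label if x_query was seen during training, else 0."""
    out = np.zeros_like(x_query, dtype=int)
    for i, xq in enumerate(x_query):
        match = np.where(x_train == xq)[0]
        if match.size > 0:
            out[i] = y_train[match[0]]
    return out

print("empirical risk:", np.mean(memorizer(x_train) != y_train))   # 0.0
x_test = rng.uniform(0, 1, size=10_000)
y_test = rng.binomial(1, 0.5, size=10_000)
print("test risk:", np.mean(memorizer(x_test) != y_test))          # about 0.5
```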
A variant of ERM is Structural Risk Minimization (SRM), which can mitigate the overfitting issue.
Take a sequence of increasing (nested) hypothesis spaces $\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \cdots$.
Given the dataset $S$, SRM solves the ERM problem within each $\mathcal{H}_k$ and then selects the hypothesis that minimizes the empirical risk plus a complexity penalty that grows with $k$ (see the sketch after this paragraph).
This is also called the method of sieves.
While SRM benefits from strong theoretical guarantees, it is typically computationally very expensive, since it requires determining the solution of multiple ERM problems.
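A rough SRM sketch is given below; the nested classes (polynomials of increasing degree, where least squares plays the role of ERM under the squared loss) and the particular form of the complexity penalty are my own illustrative assumptions, not a prescription from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = rng.uniform(-1, 1, size=n)
y = np.sin(np.pi * x) + rng.normal(0, 0.2, size=n)      # toy regression data

def srm(x, y, max_degree=8, c=0.05):
    """Pick the degree minimizing empirical risk + a penalty growing with class size."""
    best = None
    for k in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, deg=k)                 # ERM within H_k (least squares)
        emp_risk = np.mean((np.polyval(coeffs, x) - y) ** 2)
        penalized = emp_risk + c * (k + 1) / len(x)      # illustrative complexity penalty
        if best is None or penalized < best[0]:
            best = (penalized, k, coeffs)
    return best

penalized, degree, coeffs = srm(x, y)
print("SRM-selected degree:", degree)
```

Note that the loop above makes the computational cost explicit: one ERM problem is solved per hypothesis space, which is exactly why SRM is expensive.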