
1-1: Learning Framework for Supervised Learning

Statistical Learning Theory (SLT) is a field that formalizes the framework of machine learning in order to give concrete proofs of its possibility. The questions that statistical learning aims to address are quite fundamental:

  • Which learning tasks can be performed by computers in general (positive and negative results)?
  • What kind of assumptions do we have to make such that machine learning can be successful?
  • What are the key properties a learning algorithm needs to satisfy in order to be successful?
  • Which performance guarantees can we give on the results of certain learning algorithms?

To answer those questions, we need to build on a certain mathematical framework. Let's start from supervised learning.

Setup

Let $\mathcal{X}$ and $\mathcal{Y}$ denote the input space and the output space, respectively. A data set $D_n := \{(x_1, y_1), \dots, (x_n, y_n)\}$ is collected, where each pair $(x_i, y_i)$ satisfies $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$.

Assume we have a predictor $h: \mathcal{X} \to \mathcal{Y}$. How do we quantify its performance in predicting data?

We shall introduce the loss function. Let $\hat{y} = h(x)$ be the predicted output and $y$ the true output (corresponding to $x$); then the loss function can be defined, for example, as

$$\ell(\hat{y}, y) = \begin{cases} \mathbb{I}\{\hat{y} \neq y\} & \text{(0/1 loss)} \\ (y - \hat{y})^2 & \text{(MSE loss)} \end{cases}$$

Usually, the 0/1 loss is used for classification problems, whereas the MSE loss is used for regression.
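As a concrete sketch, the two losses above can be written in a few lines of Python (the function names are my own):

```python
def zero_one_loss(y_hat, y):
    """0/1 loss: 1 if the predicted label is wrong, 0 otherwise."""
    return 0 if y_hat == y else 1

def mse_loss(y_hat, y):
    """Squared-error (MSE) loss for regression."""
    return (y - y_hat) ** 2

# The 0/1 loss only cares whether the label matches...
print(zero_one_loss(1, -1))  # 1
# ...while the squared loss penalizes the magnitude of the error.
print(mse_loss(2.5, 2.0))    # 0.25
```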

We must make some probabilistic assumption about how $x$ and $y$ are related. Assume that the pairs $(x_i, y_i)$ in the data set are i.i.d. samples drawn from the joint distribution $P$ of two random variables (RVs) $X$ and $Y$.

The performance of the predictor $h$ is then the expected loss, also called the risk (or generalization error):

$$R_P(h) := \mathbb{E}_{(X,Y) \sim P}\big[\ell(h(X), Y)\big]$$
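When one can sample from $P$, the risk can be approximated by a Monte Carlo average. Below is a minimal Python sketch, where the joint distribution and the predictor are made up purely for illustration:

```python
import random

random.seed(0)

def sample_pair():
    # Hypothetical joint distribution P:
    # X ~ Uniform(0, 1) and Y = 1 with probability x, else -1.
    x = random.random()
    y = 1 if random.random() < x else -1
    return x, y

def h(x):
    # A simple threshold predictor.
    return 1 if x >= 0.5 else -1

# Approximate R_P(h) = E[l(h(X), Y)] under the 0/1 loss.
n = 100_000
risk = sum(h(x) != y for x, y in (sample_pair() for _ in range(n))) / n
print(round(risk, 2))  # close to the true risk, which is 0.25 here
```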

Learning Algorithm

No Free Lunch (NFL) Theorem

Learning is not possible without assumptions.

Usually, we cannot search for the optimal solution among all measurable functions, since this could become an infinite-dimensional optimization problem in general.

So, is there a universal learning algorithm that can deal with any kind of problem? The following theorem answers "no":

Theorem (No Free Lunch, Wolpert et al., 1996)
Without a prior (hypothesis), the test-set errors of any two algorithms are equal when averaged over all possible scenarios.

In other words, for any algorithm there always exists a bad distribution $P$ such that the generalization error can be arbitrarily bad.

So, does that mean we are doomed to have no information outside the training set $D_n$?

Inductive Bias

We can restrict attention to a function space $\mathcal{H}$ (formally called the hypothesis space, or simply the model) and find the best hypothesis $h$ in that space. Such a restriction is often called an inductive bias.

A simple way to facilitate the procedure is to parametrize the functions by some variables; for example, consider the class of functions described by a parameter $w$:

$$\mathcal{H}_w := \{h_w \mid w \in \mathcal{W} \subseteq \mathbb{R}^n\}$$

and the objective is to minimize the empirical risk $\hat{R_n}(h_w)$ over the parameter $w$.

This is the main approach in statistical learning theory, since it does not put restrictive assumptions on the data. Such a hypothesis space can be viewed as reflecting some prior knowledge that the learner has about the task.

A learning algorithm under the inductive bias, $\mathcal{A}: D_n \mapsto \mathcal{H}$, finds an estimator $h \in \mathcal{H}$ in the hypothesis space from the given data set $D_n := \{(x_1, y_1), \dots, (x_n, y_n)\}$.


Excess risk decomposition

In general, the hidden distribution $P$ in an ML problem is unknown, so one uses the best predictor in $\mathcal{H}$ as the baseline for measuring the quality of the algorithm $\mathcal{A}$:

$$h_{\mathcal{H}}^* := \arg\min_{h \in \mathcal{H}} R_P(h)$$

For any predictor $h$, we can define and decompose the excess risk as follows:

$$\mathcal{E}_P(h) := R_P(h) - R_P(h^*) = \underbrace{R_P(h) - R_P(h_{\mathcal{H}}^*)}_{\text{estimation error}} + \underbrace{R_P(h_{\mathcal{H}}^*) - R_P(h^*)}_{\text{approximation error}}$$

where $h^*$ denotes the risk minimizer over all measurable functions.


The approximation error is deterministic and is caused by the restriction to $\mathcal{H}$.
The estimation error is caused by the use of a finite sample of size $n$, which cannot completely represent the underlying distribution; it is therefore the part that can be analyzed and optimized. (We'll cover it later.)

Learning Algorithms: Examples

Bayes hypothesis

When the distribution $P$ is known, one can find the best function $h^*$ among all measurable functions, i.e., the one minimizing the risk, by some optimization method. Usually, $h^*$ is termed the Bayes hypothesis:

$$h^* := \arg\min_{h} R_P(h)$$

And the risk of the Bayes hypothesis is called the Bayes risk:

$$R^* := R_P(h^*) = \inf_{h} R_P(h)$$

For the loss functions introduced above, there are already analytical solutions to this optimization problem, for example:

  • classification problem with 0-1 loss: maximum a posteriori probability (MAP) rule
  • regression problem with squared loss: minimum mean square error (MMSE) rule

Let's look at some examples of Bayes hypothesis.

Example: Binary Classification
For binary classification with $\mathcal{Y} = \{\pm 1\}$ and 0/1 loss, one can show that the following predictor is a Bayes hypothesis:

$$h^*(x) = \begin{cases} 1, & \Pr\{Y = 1 \mid X = x\} \ge \frac{1}{2} \\ -1, & \text{otherwise} \end{cases}$$

[Proof]
The risk of a predictor $h$ can be written as

$$\begin{aligned} R_P(h) &= \mathbb{E}_{(X,Y) \sim P}\big[\mathbb{I}\{h(X) \neq Y\}\big] \\ &= \mathbb{E}_X\big[\mathbb{E}_{Y \mid X}[\mathbb{I}\{h(X) \neq Y\}]\big] \\ &= \mathbb{E}_X\big[\Pr\{Y = 1 \mid X\}\,\mathbb{I}\{h(X) \neq 1\} + \Pr\{Y = -1 \mid X\}\,\mathbb{I}\{h(X) \neq -1\}\big] \\ &= \mathbb{E}_X\big[\eta(X)\,\mathbb{I}\{h(X) = -1\} + (1 - \eta(X))\,\mathbb{I}\{h(X) = 1\}\big] \end{aligned}$$

where we write $\eta(x) := \Pr\{Y = 1 \mid X = x\}$. The Bayes risk is thus

$$R_P^* = \mathbb{E}_X\big[\min(\eta(X), 1 - \eta(X))\big]$$

In particular, one can show that $h^*(x)$ achieves this risk. Since

$$\eta(X) < \tfrac{1}{2} \implies \eta(X) = \min(\eta(X), 1 - \eta(X))$$

and vice versa for $\eta(X) \ge \tfrac{1}{2}$, the risk of $h^*$ is

$$\begin{aligned} R_P(h^*) &= \mathbb{E}_X\big[\eta(X)\,\mathbb{I}\{h^*(X) = -1\} + (1 - \eta(X))\,\mathbb{I}\{h^*(X) = 1\}\big] \\ &= \mathbb{E}_X\big[\eta(X)\,\mathbb{I}\{\eta(X) < \tfrac{1}{2}\} + (1 - \eta(X))\,\mathbb{I}\{\eta(X) \ge \tfrac{1}{2}\}\big] \\ &= \mathbb{E}_X\big[\big(\mathbb{I}\{\eta(X) < \tfrac{1}{2}\} + \mathbb{I}\{\eta(X) \ge \tfrac{1}{2}\}\big)\min(\eta(X), 1 - \eta(X))\big] \\ &= \mathbb{E}_X\big[\min(\eta(X), 1 - \eta(X))\big] \end{aligned}$$

Thus $h^*$ is a Bayes hypothesis for binary classification.
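As a quick numerical sanity check of this result, the sketch below (in Python, with an arbitrary made-up posterior $\eta$) compares the Monte Carlo risk of the Bayes predictor with that of a naive one:

```python
import random

random.seed(1)

def eta(x):
    # Hypothetical posterior Pr{Y = 1 | X = x}, with X ~ Uniform(0, 1).
    return x * x

def bayes_h(x):
    # The Bayes predictor: output 1 iff eta(x) >= 1/2.
    return 1 if eta(x) >= 0.5 else -1

def risk(h, n=200_000):
    # Monte Carlo estimate of the 0/1 risk.
    errors = 0
    for _ in range(n):
        x = random.random()
        y = 1 if random.random() < eta(x) else -1
        errors += h(x) != y
    return errors / n

bayes_risk = risk(bayes_h)
naive_risk = risk(lambda x: 1)  # always predict 1
print(bayes_risk < naive_risk)  # True
```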

Example: Regression
For regression with $\mathcal{Y} = \mathbb{R}$ and MSE loss, one can show that the following predictor is a Bayes hypothesis:

$$h^*(x) = \mathbb{E}[Y \mid X = x]$$

The proof will be left as an exercise.
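Short of the proof, here is a numerical illustration (a Python sketch with a made-up joint distribution): the conditional mean attains a lower Monte Carlo MSE than a perturbed predictor.

```python
import random

random.seed(2)

def sample_pair():
    # Hypothetical model: Y = 2X + Gaussian noise, so E[Y | X = x] = 2x.
    x = random.random()
    y = 2 * x + random.gauss(0, 0.5)
    return x, y

def mse(h, n=100_000):
    # Monte Carlo estimate of the risk E[(Y - h(X))^2].
    return sum((y - h(x)) ** 2 for x, y in (sample_pair() for _ in range(n))) / n

cond_mean_mse = mse(lambda x: 2 * x)      # the conditional mean E[Y | X = x]
shifted_mse = mse(lambda x: 2 * x + 0.3)  # any other predictor does worse
print(cond_mean_mse < shifted_mse)  # True
```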


Empirical Risk Minimization (ERM)

Even when an inductive bias is imposed, the distribution $P$ of the data is unknown in general. Thus it is hard to find the best solution in the hypothesis space $\mathcal{H}$ with unknown $P$:

$$h_{\mathcal{H}}^* := \arg\min_{h \in \mathcal{H}} R_P(h)$$

However, we can replace the risk by the empirical risk, which is measurable:

$$\hat{R_n}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$$

Thus it becomes quite reasonable to minimize the empirical risk instead. This learning paradigm is known as Empirical Risk Minimization (ERM):

$$\hat{h_n} := \arg\min_{h \in \mathcal{H}} \hat{R_n}(h)$$
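A minimal ERM sketch in Python over a one-parameter family of threshold classifiers (the data set and the grid are invented for illustration):

```python
# ERM over H = {h_w(x) = sign(x - w)} on a toy labeled data set D_n.
data = [(0.1, -1), (0.3, -1), (0.4, -1), (0.6, 1), (0.8, 1), (0.9, 1)]

def h(w, x):
    return 1 if x >= w else -1

def empirical_risk(w):
    # Average 0/1 loss of h_w on the data set.
    return sum(h(w, x) != y for x, y in data) / len(data)

# Approximate the argmin by a grid search over the parameter space W = [0, 1].
candidates = [i / 100 for i in range(101)]
w_hat = min(candidates, key=empirical_risk)
print(empirical_risk(w_hat))  # 0.0 -- this toy data set is separable
```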

Although the ERM approach seems very natural, it may cause overfitting.
For example, let's consider an extreme case:

$$\hat{h}(x) = \begin{cases} y_i, & x = x_i, \text{ where } (x_i, y_i) \in D_n \\ \text{random}, & \text{otherwise} \end{cases}$$

It's clear that this predictor achieves zero empirical risk yet does NOT learn anything! So when using the ERM method, one should be very careful about overfitting.
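To make the danger concrete, here is a Python sketch of such a memorizing predictor on toy data (the distribution is invented): it attains zero empirical risk but only chance-level risk on fresh samples.

```python
import random

random.seed(3)

def label(x):
    # Hypothetical ground truth: y = 1 iff x > 0.5.
    return 1 if x > 0.5 else -1

# A small training set D_n drawn from X ~ Uniform(0, 1).
train = []
for _ in range(20):
    x = random.random()
    train.append((x, label(x)))
memory = dict(train)

def memorizer(x):
    # Returns the stored label on training points, a coin flip elsewhere.
    return memory[x] if x in memory else random.choice([-1, 1])

train_err = sum(memorizer(x) != y for x, y in train) / len(train)

n_test = 10_000
test_errors = 0
for _ in range(n_test):
    x = random.random()
    test_errors += memorizer(x) != label(x)
test_err = test_errors / n_test

print(train_err)           # 0.0: perfect on the training set
print(round(test_err, 1))  # about 0.5: no better than guessing
```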

Structural Risk Minimization (SRM)

A variant of ERM is SRM, which can mitigate the overfitting issue.

Take a sequence of increasing hypothesis spaces $\mathcal{H}_1, \mathcal{H}_2, \dots$ with $\mathcal{H}_i \subseteq \mathcal{H}_{i+1}$ and $\bigcup_{i=1}^{\infty} \mathcal{H}_i = \mathcal{H}$.
Given the data set $D_n$, we find a function that minimizes the empirical risk within each model:

$$\hat{h_n}^{(i)} := \arg\min_{h \in \mathcal{H}_i} \hat{R_n}(h), \quad i \in \mathbb{N}$$

This is also called the method of Sieves.

While SRM benefits from strong theoretical guarantees, it is typically computationally very expensive, since it requires determining the solution of multiple ERM problems.
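A rough Python sketch of the sieve idea, using nested spaces of constant predictors $\mathcal{H}_i = \{h(x) = k/2^i\}$ under the MSE loss (everything here is invented for illustration): since the spaces are nested, enlarging them can only decrease the per-space ERM value.

```python
# Toy targets; constant predictors ignore the inputs entirely.
ys = [0.2, 0.9, 0.4, 0.7, 0.35]

def empirical_risk(c):
    # Empirical MSE risk of the constant predictor h(x) = c.
    return sum((y - c) ** 2 for y in ys) / len(ys)

def erm_in_H(i):
    # ERM restricted to the i-th hypothesis space of the sieve:
    # H_i = {k / 2**i : k = 0, ..., 2**i}, so H_i is a subset of H_{i+1}.
    grid = [k / 2 ** i for k in range(2 ** i + 1)]
    return min(grid, key=empirical_risk)

# Solve one ERM problem per model -- this repetition is the cost of SRM.
risks = [empirical_risk(erm_in_H(i)) for i in range(6)]
print(risks[0] >= risks[-1])  # True: richer spaces never increase the ERM risk
```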

tags: machine learning