# 1-1: Learning Framework for Supervised Learning
**Statistical Learning Theory (SLT)** is a field that formalizes the framework of machine learning in order to rigorously establish when and why learning is possible. The questions that statistical learning theory aims to address are quite fundamental:
* Which learning tasks can be performed by computers in general (positive and negative results)?
* What kind of assumptions do we have to make such that machine learning can be successful?
* What are the key properties a learning algorithm needs to satisfy in order to be successful?
* Which performance guarantees can we give on the results of certain learning algorithms?
To answer these questions, we need to build on a mathematical framework. Let's start with supervised learning.
## Setup
Let $\mathcal{X}$ and $\mathcal{Y}$ denote the **input space** and **output space** respectively. A data set $\mathcal{D}^n := \{ (x_1, y_1), ..., (x_n, y_n)\}$ is collected where each pair $(x_i, y_i)$ has $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$.
Assume we have a predictor $h: \mathcal{X} \rightarrow \mathcal{Y}$. How do we quantify its performance at predicting the data?
We shall introduce the **loss function**. Let $\hat{y} = h(x)$ be the predicted output and $y$ be the true output (corresponding to $x$); then a loss function can be defined, for example, as
$$l(\hat{y},y) =
\begin{cases}
\mathbb{I}\{\hat{y} \neq y\},&\text{(0/1 loss)}\\
(y-\hat{y})^2,&\text{(MSE loss)}
\end{cases}$$
Usually, the 0/1 loss is used for classification problems, whereas the MSE loss is used for regression.
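As a quick concrete sketch, the two losses can be written in Python (the function names here are our own, not from any library):

```python
import numpy as np

def zero_one_loss(y_hat, y):
    """0/1 loss: 1 if the prediction is wrong, 0 otherwise."""
    return np.asarray(y_hat != y, dtype=float)

def squared_loss(y_hat, y):
    """Squared (MSE) loss: the squared difference between prediction and truth."""
    return (np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)) ** 2

# Classification: predicted [+1, -1] vs true [-1, -1] -> losses [1, 0].
print(zero_one_loss(np.array([1, -1]), np.array([-1, -1])))  # [1. 0.]
# Regression: prediction 2.5 vs true 3.0 -> loss 0.25.
print(squared_loss(2.5, 3.0))                                # 0.25
```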
We must also make a probabilistic assumption about how $x$ and $y$ are related. Assume that the pairs $(x_i, y_i)$ in the data set are i.i.d. samples drawn from the joint distribution $P$ of two random variables (RVs) $X$ and $Y$.
The performance of the predictor $h$ is then the expected loss, or the **risk (generalization error)**:
$$R_P(h) := \mathbb{E}_{(X,Y)\sim P}[l(h(X),Y)]$$
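Since the risk is an expectation over $P$, it can be approximated by Monte Carlo sampling whenever $P$ can be sampled from. Below is a minimal sketch; the particular choice of $P$ and of the predictor $h$ is an arbitrary illustration of our own:

```python
import numpy as np

rng = np.random.default_rng(0)

# An assumed, illustrative joint distribution P:
# X ~ Uniform[0, 1], Y = +1 with probability eta(X), else -1.
def eta(x):
    return 0.2 + 0.6 * x   # Pr{Y = 1 | X = x}

n = 100_000
x = rng.uniform(0, 1, n)
y = np.where(rng.uniform(0, 1, n) < eta(x), 1, -1)

# A simple (hypothetical) predictor: h(x) = sign(x - 1/2).
h = np.where(x >= 0.5, 1, -1)

# Monte Carlo estimate of the risk R_P(h) under 0/1 loss.
risk_estimate = np.mean(h != y)
print(f"estimated risk: {risk_estimate:.3f}")   # ~0.35 for this particular P and h
```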
## Learning Algorithm
### No Free Lunch (NFL) Theorem
> Learning is not possible without assumptions.
In general, we cannot search for the optimal solution among all measurable functions, since this would become an infinite-dimensional optimization problem.
So, is there a *universal* learning algorithm that can deal with any kind of problem? The following theorem answers "no":
:::success
**Theorem (No Free Lunch, Wolpert et al., 1996)**
Without prior assumptions (a hypothesis), the test-set errors of any two algorithms are equal when averaged over all possible scenarios.
:::
In other words, for any learning algorithm there always exists a bad distribution $P$ on which the generalization error is arbitrarily bad.
So, does that mean we are doomed to have no information outside the training set $\mathcal{D}^n$?
### Inductive Bias
We can restrict ourselves to a function space $\mathcal{H}$ (formally called the **Hypothesis Space**, or simply *the model*) and find the best hypothesis $h$ in that space. Such a restriction is often called an *inductive bias*.
A simple way to facilitate the procedure is to parametrize the functions by some variables; for example, consider the class of functions described by a parameter $\textbf{w}$:
$$\mathcal{H}_\textbf{w} := \{h_\textbf{w} \mid \textbf{w}\in W\subseteq \mathbb{R}^d \}$$
and the objective is to minimize the empirical risk $\hat{R_n}(h_\textbf{w})$ (defined below) over the parameter $\textbf{w}$.
This is the main approach in statistical learning theory, since it does not put restrictive assumptions on the data. Such a hypothesis space can be viewed as reflecting some *prior knowledge* that the learner has about the task.
A **Learning Algorithm** under the inductive bias, $\mathcal{A}: \mathcal{D}^n \rightarrow \mathcal{H}$, finds an estimator $h \in \mathcal{H}$ in the hypothesis space from the given data set $\mathcal{D}^n := \{ (x_1, y_1), ..., (x_n, y_n)\}$. We can use a simple picture to illustrate the relationship between the components we've mentioned above:

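In code, these components can be sketched roughly as follows (a minimal illustration of our own, with a linear threshold class and a brute-force grid search standing in for the learning algorithm):

```python
import numpy as np

# Parametrized hypothesis space H = { h_{w,b}(x) = sign(w . x + b) } (our example).
def make_hypothesis(w, b):
    def h(x):
        return np.where(x @ w + b >= 0, 1, -1)
    return h

# A learning algorithm A: D^n -> H, here a brute-force search over a
# small parameter grid (purely illustrative, not a practical method).
def learn(xs, ys, grid):
    best_h, best_err = None, np.inf
    for w, b in grid:
        h = make_hypothesis(w, b)
        err = np.mean(h(xs) != ys)   # fraction of training mistakes
        if err < best_err:
            best_h, best_err = h, err
    return best_h

xs = np.array([[0.0], [1.0], [2.0], [3.0]])
ys = np.array([-1, -1, 1, 1])
grid = [(np.array([1.0]), -t) for t in np.linspace(0.0, 3.0, 31)]
h_hat = learn(xs, ys, grid)
print(h_hat(xs))   # [-1 -1  1  1] once a separating threshold is found
```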
### Excess Risk Decomposition
In general, the underlying distribution $P$ in an ML problem is unknown, so one uses the best predictor in $\mathcal{H}$ as the baseline for measuring the quality of the algorithm $\mathcal{A}$:
$$h^*_\mathcal{H} := \underset{h\in\mathcal{H}}{\arg\min} R_P(h)$$
For any predictor $h$, we can define and decompose the **excess risk** as follows, where $h^*$ denotes the Bayes hypothesis, i.e. the risk minimizer over all measurable functions (defined formally below):
$$\mathcal{ER}_P(h):=R_P(h)-R_P(h^*)=\underbrace{[R_P(h)-R_P(h^*_\mathcal{H})]}_{\text{estimation error}}+\underbrace{[R_P(h^*_\mathcal{H})-R_P(h^*)]}_{\text{approximation error}}$$

The approximation error is deterministic and is caused by the restriction of using $\mathcal{H}$.
The estimation error is caused by using a finite sample of size $n$, which cannot completely represent the underlying distribution; this term is random, and it is the part that can be analyzed and optimized. (We'll cover this later.)
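To see the decomposition in action, here is a toy numerical sketch (all modeling choices below, including the distribution and the noise level, are our own illustrative assumptions): $X \sim \mathrm{Unif}[0,1]$, $Y = \sin(2\pi X) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, squared loss, and $\mathcal{H}$ the class of constant predictors $h_c(x) = c$. Here the Bayes predictor is $\mathbb{E}[Y \mid X = x] = \sin(2\pi x)$ with Bayes risk $\sigma^2$, and the best constant is $c^* = \mathbb{E}[Y] = 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3   # noise standard deviation (arbitrary choice)
n = 50        # sample size

# Training data from the assumed model Y = sin(2*pi*X) + noise.
x_train = rng.uniform(0, 1, n)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, sigma, n)

# For constant predictors h_c(x) = c under squared loss:
#   R_P(c) = Var(Y) + (E[Y] - c)^2, with E[Y] = 0 here.
c_hat = y_train.mean()            # ERM over constants is the sample mean
var_y = sigma**2 + 0.5            # Var(Y) = sigma^2 + Var(sin(2*pi*X)), Var(sin) = 1/2

bayes_risk  = sigma**2            # risk of the Bayes predictor sin(2*pi*x)
risk_c_star = var_y               # risk of the best constant c* = 0
risk_c_hat  = var_y + c_hat**2    # risk of the ERM constant

print("estimation error   :", risk_c_hat - risk_c_star)   # (c_hat)^2, shrinks with n
print("approximation error:", risk_c_star - bayes_risk)   # 0.5, independent of n
```

The approximation error stays at $\mathrm{Var}(\sin(2\pi X)) = 1/2$ no matter how much data we collect, while the estimation error $(\hat{c} - c^*)^2$ shrinks as $n$ grows.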
## Learning Algorithms: Examples
### Bayes Hypothesis
When the distribution $P$ is known, one can find the best function $h^*$ among all measurable functions, i.e. the one minimizing the risk, by some optimization method. This $h^*$ is usually termed the **Bayes hypothesis**:
$$h^* := \underset{h}{\arg\min} R_P(h)$$
And the risk of the Bayes hypothesis is called the **Bayes risk**:
$$R^*:=R_P(h^*)=\inf_h R_P(h)$$
For the loss functions introduced above, analytical solutions to this optimization problem are known; for example:
* classification problem with 0-1 loss: maximum a posteriori probability (MAP) rule
* regression problem with squared loss: minimum mean square error (MMSE) rule
Let's look at some examples of Bayes hypothesis.
:::info
**Example: Binary Classification**
For binary classification with $\mathcal{Y} = \{ \pm 1 \}$ and 0/1 loss, one can show that the following predictor is a Bayes hypothesis:
$$h^*(x) =
\begin{cases}
1,&Pr\{Y=1 \mid X=x\} \geq \frac{1}{2}\\
-1,&\text{otherwise}
\end{cases} $$
:::
**[Proof]**
The risk of a predictor $h$ can be written as
\begin{aligned}
R_P(h) &= \mathbb{E}_{(X,Y)\sim P}[\mathbb{I}\{h(X) \neq Y\}]\\
&= \mathbb{E}_X[\mathbb{E}_{Y\mid X}[\mathbb{I}\{h(X) \neq Y\}]]\\
&= \mathbb{E}_X[Pr\{Y=1 \mid X\}\cdot \mathbb{I}\{h(X) \neq 1\} + Pr\{Y=-1 \mid X\}\cdot \mathbb{I}\{h(X) \neq -1\}]\\
&=\mathbb{E}_X[\eta(X) \mathbb{I}\{h(X) =-1\} + (1-\eta(X) )\mathbb{I}\{h(X) =1\}]
\end{aligned}
where we write $\eta(x) := Pr\{Y=1 \mid X=x\}$. For every $x$, the integrand is at least $\min(\eta(x), 1-\eta(x))$, so the Bayes risk is
$$R_P^* = \mathbb{E}_X[\min(\eta(X), 1-\eta(X))]$$
In particular, one can show that $h^*(x)$ achieves this risk. Since $$\eta(X)<\frac{1}{2} \implies \eta(X) = \min(\eta(X), 1-\eta(X))$$
and similarly $1-\eta(X) = \min(\eta(X), 1-\eta(X))$ when $\eta(X) \geq \frac{1}{2}$, the risk of $h^*$ is
\begin{aligned}
R_P(h^*) &= \mathbb{E}_X[\eta(X) \mathbb{I}\{h^*(X) =-1\} + (1-\eta(X) )\mathbb{I}\{h^*(X) =1\}]\\
&=\mathbb{E}_X[\eta(X) \mathbb{I}\{\eta(X)<\frac{1}{2}\} + (1-\eta(X) )\mathbb{I}\{\eta(X)\geq \frac{1}{2}\}]\\
&= \mathbb{E}_X[(\mathbb{I}\{\eta(X)<\frac{1}{2}\} + \mathbb{I}\{\eta(X)\geq \frac{1}{2}\}) \min(\eta(X), 1-\eta(X))]\\
&= \mathbb{E}_X[\min(\eta(X), 1-\eta(X))]
\end{aligned}
Thus $h^*$ is a Bayes hypothesis for binary classification.
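A quick Monte Carlo sanity check of this result, with an arbitrary illustrative choice of $\eta$ (our own): the empirical error of $h^*$ should match $\mathbb{E}_X[\min(\eta(X), 1-\eta(X))]$.

```python
import numpy as np

rng = np.random.default_rng(2)
eta = lambda x: 0.9 * x   # assumed Pr{Y = 1 | X = x}, with X ~ Uniform[0, 1]

n = 200_000
x = rng.uniform(0, 1, n)
y = np.where(rng.uniform(0, 1, n) < eta(x), 1, -1)

h_star = np.where(eta(x) >= 0.5, 1, -1)   # the Bayes predictor from the example

# Both quantities below estimate the Bayes risk and should agree.
print("risk of h*       :", np.mean(h_star != y))
print("E[min(eta,1-eta)]:", np.mean(np.minimum(eta(x), 1 - eta(x))))
```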
:::info
**Example: Regression**
For regression with $\mathcal{Y} = \mathbb{R}$ and MSE loss, one can show that the following predictor is a Bayes hypothesis:
$$\hat{h}(x) = \mathbb{E}[Y \mid X=x]$$
:::
The proof will be left as an exercise. :smirk:
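While the proof is left as an exercise, a Monte Carlo sanity check is straightforward (the data model below is our own toy choice): the conditional mean should attain a smaller MSE than any competing predictor.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.uniform(-1, 1, n)
y = x**2 + rng.normal(0, 0.1, n)   # so E[Y | X = x] = x^2 (toy model)

# The conditional mean should attain the smallest MSE (~0.01, the noise variance).
for name, pred in [("E[Y|X] = x^2", x**2),
                   ("x^2 + 0.05  ", x**2 + 0.05),
                   ("|x|         ", np.abs(x))]:
    print(name, "MSE:", np.mean((y - pred)**2))
```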
### Empirical Risk Minimization (ERM)
Even when an inductive bias is imposed, the distribution $P$ of the data is in general unknown. Thus it is hard to find the best solution in the hypothesis space $\mathcal{H}$ when $P$ is unknown:
$$h^*_\mathcal{H} := \underset{h\in\mathcal{H}}{\arg\min} R_P(h)$$
However, we can replace the risk with the **empirical risk**, which can be computed directly from the data:
$$ \hat{R_n}(h)= \frac{1}{n}\sum_{i=1}^nl(h(x_i),y_i)$$
Thus, it becomes quite reasonable to minimize the empirical risk instead. This learning paradigm is known as **Empirical Risk Minimization (ERM)**:
$$\hat{h_n} := \underset{h\in\mathcal{H}}{\arg\min}\hat{R_n}(h)$$
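For squared loss and a linear hypothesis class, the ERM problem even has a closed-form solution (ordinary least squares). A minimal sketch on synthetic data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(-1, 1, n)
y = 2.0 * x - 1.0 + rng.normal(0, 0.2, n)   # assumed linear ground truth

# H = { h_w(x) = w1 * x + w0 }; ERM with squared loss is least squares.
A = np.column_stack([x, np.ones(n)])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
h_hat = lambda t: w[0] * t + w[1]

emp_risk = np.mean((y - h_hat(x))**2)
print("w:", w, "| empirical risk:", emp_risk)   # w ~ [2, -1], risk ~ 0.04
```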
Although the ERM approach seems very natural, it may cause overfitting.
For example, let's consider an extreme case, the memorizing predictor:
$$\hat{h}(x)=\begin{cases}
y_i, &\text{if } x=x_i \text{ for some } (x_i, y_i) \in \mathcal{D}^n\\
\text{random},&\text{otherwise}
\end{cases}$$
This predictor attains zero empirical risk, yet it clearly does NOT learn anything! So when using the ERM method, one should be very careful about overfitting.
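The memorizing predictor is easy to simulate (toy setup of our own): it attains zero empirical risk, yet performs at chance level on fresh data.

```python
import numpy as np

rng = np.random.default_rng(5)

def memorizer(xs, ys):
    table = dict(zip(xs.tolist(), ys.tolist()))
    def h(x):
        # Return the stored label if x was seen, otherwise guess at random.
        return table.get(x, rng.choice([-1, 1]))
    return h

x_train = rng.uniform(0, 1, 100)
y_train = np.where(np.sin(10 * x_train) > 0, 1, -1)   # arbitrary labeling rule
h = memorizer(x_train, y_train)

train_err = np.mean([h(xi) != yi for xi, yi in zip(x_train, y_train)])
x_test = rng.uniform(0, 1, 1000)    # fresh draws: almost surely unseen
y_test = np.where(np.sin(10 * x_test) > 0, 1, -1)
test_err = np.mean([h(xi) != yi for xi, yi in zip(x_test, y_test)])
print("train error:", train_err, "| test error:", test_err)   # 0.0 vs ~0.5
```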
### Structural Risk Minimization (SRM)
A variant of ERM is SRM, which mitigates the overfitting issue.
Take a sequence of increasing hypothesis spaces $\mathcal{H}_1, \mathcal{H}_2, ...$ with $\mathcal{H}_i \subset \mathcal{H}_{i+1}$ and $\cup_{i=1}^{\infty}\mathcal{H}_i = \mathcal{H}$.
Given the dataset $\mathcal{D}^n$, we find the function that minimizes the empirical risk within each model, and select among these solutions by adding a complexity penalty $\text{pen}(n, i)$ that grows with the model index $i$:
$$\hat{h_n} := \underset{h\in\mathcal{H}_i,\ i\in\mathbb{N}}{\arg\min}\left[\hat{R_n}(h) + \text{pen}(n, i)\right]$$
This is also called the *method of Sieves*.
While SRM benefits from strong theoretical guarantees, it is typically computationally very expensive, since it requires determining the solution of multiple ERM problems.
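A minimal SRM sketch under our own illustrative assumptions (nested polynomial classes $\mathcal{H}_d$ of increasing degree, and a simple penalty growing with the model index; real SRM penalties come from complexity measures such as the VC dimension):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.uniform(-1, 1, n)
y = np.sin(2 * x) + rng.normal(0, 0.2, n)   # toy data of our own choosing

best_d, best_score = None, np.inf
for d in range(1, 11):                       # nested classes H_1 ⊂ H_2 ⊂ ...
    coeffs = np.polyfit(x, y, d)             # ERM (least squares) within H_d
    emp_risk = np.mean((np.polyval(coeffs, x) - y) ** 2)
    penalty = 0.5 * (d + 1) / n              # illustrative complexity penalty
    score = emp_risk + penalty
    if score < best_score:
        best_d, best_score = d, score
print("selected degree:", best_d)
```

Without the penalty term, the loop would always favor the largest class, i.e. plain ERM over the union; the penalty is what trades empirical fit against model complexity.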
###### tags: `machine learning`