# Machine learning basics

###### tags: `machine learning`

[toc]

The material is from Andrew Ng's course: Machine Learning[^1a]. Another excellent resource is the book _"Neural networks and deep learning"_ written by Michael Nielsen[^1b].

## Linear regression with one variable

Training sample $(x,y)$, where $x$ is called the feature (input) and $y$ the target (output). $(x^{(i)},y^{(i)})$ denotes the $i^{th}$ instance of the training sample.

__Hypothesis__: $h_{\theta}(x) = \theta_0 + \theta_1 x$. Given a value of the feature, predict the value of the target.

:::success
What is the flow chart of the machine learning process?
```flow
st=>start: training set
e=>end: hypothesis
op=>operation: learning algorithm
op1=>operation: Given input
op2=>operation: prediction
st->op->e
```
:::

__Cost function__: Measures the error between the prediction from the hypothesis and the target value from the training set. We denote it as $J$. We can define many different versions of such measures. One of the most common choices is the so-called $L_2$ norm, which is defined by the equation
$$
J(\theta_0,\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2,
$$
where $m$ is the size of the training sample. The goal of the machine learning algorithm is to find the parameters, $\theta_0$ and $\theta_1$, such that the cost function $J(\theta_0,\theta_1)$ achieves its minimum value.

:raised_hand: How do we find such a combination of $(\theta_0,\theta_1)$?

:point_right: __Gradient descent (GD) method__

Starting from an initial guess of $\theta_0$ and $\theta_1$, we update them as
$$
\theta_j \leftarrow \theta_j - \alpha \frac{\partial J}{\partial\theta_j}, \qquad j = 0,1,
$$
where $\alpha$ is referred to as the learning rate. When applying the GD method to the linear regression, we have
$$
\begin{align}
\theta_0 &\leftarrow \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)}), \\
\theta_1 &\leftarrow \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}.
\end{align}
$$
The aforementioned method is often referred to as the batch GD method, since at every step it updates the parameters using the information from all instances of the training sample.

## Linear regression with multiple variables

The model we consider now has more than one feature, $x_1, x_2, ..., x_n$, yet still one target value, $y$. The hypothesis becomes
$$
h_{\theta}(\boldsymbol{x}) = \boldsymbol{\theta}^{\text{T}}\cdot\boldsymbol{x} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n,
$$
where $\boldsymbol{x}$ is a column vector, $\boldsymbol{x} = (x_0,x_1,x_2,...,x_n)^{\text{T}}$, with $x_0=1$ called the zeroth feature. The cost function becomes
$$
J(\theta_0,\theta_1,...,\theta_n) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(\boldsymbol{x}^{(i)}) - y^{(i)})^2.
$$
The corresponding updates of $\theta_0$, $\theta_1$, ..., $\theta_n$ are
$$
\begin{align}
\theta_0 &\leftarrow \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(\boldsymbol{x}^{(i)})-y^{(i)}), \\
\theta_1 &\leftarrow \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(\boldsymbol{x}^{(i)})-y^{(i)})x_1^{(i)}, \\
\theta_2 &\leftarrow \theta_2 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(\boldsymbol{x}^{(i)})-y^{(i)})x_2^{(i)}, \\
&\vdots \\
\theta_n &\leftarrow \theta_n - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(\boldsymbol{x}^{(i)})-y^{(i)})x_n^{(i)}.
\end{align}
$$
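To make the update rules concrete, here is a minimal NumPy sketch of batch gradient descent for linear regression. The function name `batch_gradient_descent`, the hyperparameters, and the toy data are my own illustrations, not from the course:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Batch GD for linear regression. X is an m x (n+1) design matrix
    whose first column is all ones (the zeroth feature); y has length m."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(n_iters):
        error = X @ theta - y                # h_theta(x^(i)) - y^(i) for all i
        theta -= alpha * (X.T @ error) / m   # simultaneous update of every theta_j
    return theta

# Toy usage: recover y ~ 1 + 2x from noisy samples
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.05 * rng.normal(size=50)
print(batch_gradient_descent(X, y))          # approximately [1.0, 2.0]
```

Because the column of ones is part of `X`, the intercept $\theta_0$ is updated by the same rule as every other parameter.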
:bulb: It is desirable to have features whose values are of the same magnitude or on a similar scale (for fast convergence of the learning algorithm). The way to achieve this is called feature scaling. Often we scale the value of the $j^{th}$ feature as
$$
x_j := \frac{x_j-\mu_j}{\sigma_j}, \qquad j = 1, 2,...,n,
$$
where $\mu_j$ and $\sigma_j$ are the average and standard deviation of $x_j$.

## Logistic regression

We now move on to classification problems, e.g., whether an email is spam or not, in which the prediction takes discrete values. Logistic regression is one of the most common learning algorithms for classification problems.

### Binary classification

The output of logistic regression is always in the range $[0,1]$.

__Logistic regression model__: $h_{\theta}(\boldsymbol{x}) \in [0,1]$
$$
\begin{align}
h_{\theta}(\boldsymbol{x}) &= g(\boldsymbol{\theta}^{\text{T}}\cdot\boldsymbol{x}), \\
g(z) &= \frac{1}{1+e^{-z}},
\end{align}
$$
where $g$ is called the __logistic/sigmoid function__. For a set of parameters $\boldsymbol{\theta}$ and input $\boldsymbol{x}$, the model estimates the probability that the output equals one, i.e.,
$$
h_{\theta}(\boldsymbol{x}) = P(y=1|\boldsymbol{x},\boldsymbol{\theta}).
$$

:bulb:__Decision boundary__: The model predicts $y=1$ for $h_{\theta}(\boldsymbol{x}) \ge 0.5$ and $y=0$ for $h_{\theta}(\boldsymbol{x}) < 0.5$. From the graph of the sigmoid function we have $g(z=0) = 0.5$, $g(z>0) > 0.5$, and $g(z<0) < 0.5$. Thus, we get that $y=1$ when $\boldsymbol{\theta}^{\text{T}}\cdot\boldsymbol{x} \ge 0$ and $y=0$ otherwise. Here is an example: $h_{\theta}(\boldsymbol{x}) = g(-3 + x_1 + x_2)$. It predicts a cross when $-3+x_1+x_2 \ge 0$ and a circle otherwise.

<img src="https://i.imgur.com/yRpzlcR.png" width=500>

__Cost function__: To construct a convex cost function, we define
$$
Cost(h_{\theta}(\boldsymbol{x}),y) =
\begin{cases}
-\log(h_{\theta}(\boldsymbol{x})) \quad &\text{if} \quad y = 1, \\
-\log(1-h_{\theta}(\boldsymbol{x})) \quad &\text{if} \quad y = 0.
\end{cases}
$$
This means that when $y=1$ and the prediction is also 1, the cost is 0; when $y=1$ but the prediction is 0, the cost is $\infty$. The case of $y=0$ is similar. The cost function is then
$$
J(\boldsymbol{\theta}) = \frac{1}{m}\sum_{i=1}^m Cost(h_{\theta}(\boldsymbol{x}^{(i)}),y^{(i)}).
$$
Note that $y$ can only take the value 0 or 1. We can rewrite $Cost(h_{\theta}(\boldsymbol{x}),y)$ as
$$
Cost(h_{\theta}(\boldsymbol{x}),y) = -y\log(h_{\theta}(\boldsymbol{x}))-(1-y)\log(1-h_{\theta}(\boldsymbol{x})).
$$
And the cost function becomes
$$
J(\boldsymbol{\theta}) = \frac{1}{m}\sum_{i=1}^m[-y^{(i)}\log(h_{\theta}(\boldsymbol{x}^{(i)}))-(1-y^{(i)})\log(1-h_{\theta}(\boldsymbol{x}^{(i)}))].
$$
The above function is also called the cross-entropy loss function[^2]. We update the parameters $\boldsymbol{\theta}$ as
$$
\theta_j \leftarrow \theta_j - \alpha \frac{\partial J(\boldsymbol{\theta})}{\partial \theta_j},
$$
where
$$
\frac{\partial J(\boldsymbol{\theta})}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m}[h_{\theta}(\boldsymbol{x}^{(i)}) - y^{(i)}]x_j^{(i)},
$$
for $j=0,1,...,n$.

### Multiclass classification

<img src="https://i.imgur.com/KVezdki.png" width=500>

Train the three classifiers $h_{\theta}^{(i)}(\boldsymbol{x})$, one for each class. For a given $\boldsymbol{x}$, choose the $i$ such that $h_{\theta}^{(i)}(\boldsymbol{x}) = P(y=i|\boldsymbol{x},\boldsymbol{\theta})$ is maximal, i.e., $\max_{i}h_{\theta}^{(i)}(\boldsymbol{x})$.
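The gradient above has exactly the same form as in linear regression, only with the sigmoid hypothesis. Here is a minimal NumPy sketch of binary logistic regression trained by gradient descent; the names and hyperparameters are my own, not from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.1, n_iters=5000):
    """Gradient descent on the cross-entropy cost. X is m x (n+1) with a
    leading column of ones; y holds 0/1 labels."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)                 # P(y=1 | x, theta) per sample
        theta -= alpha * (X.T @ (h - y)) / m   # dJ/dtheta_j = mean of (h - y) x_j
    return theta

# Decision boundary from the example in the text: h = g(-3 + x_1 + x_2)
theta = np.array([-3.0, 1.0, 1.0])
x = np.array([1.0, 2.0, 2.0])                  # x_0 = 1, x_1 = x_2 = 2
print(sigmoid(theta @ x) >= 0.5)               # True: -3 + 2 + 2 = 1 >= 0
```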
## Overfitting and regularization

Overfitting of the model to the data comes from the higher-order terms in the hypothesis $h_{\theta}(\boldsymbol{x})$. To avoid overfitting, we can penalize the higher-order terms in the cost function, i.e., regularization. We introduce a __regularization parameter__, $\lambda$, into the cost function; the larger $\lambda$ is, the more heavily large parameters are penalized.

__Regularized linear regression__: The regularized cost function is
$$
J(\boldsymbol{\theta}) = \frac{1}{2m} \left[ \sum_{i=1}^m (h_{\theta}(\boldsymbol{x}^{(i)})-y^{(i)})^2 + \lambda \sum_{j=1}^n \theta_j^2 \right].
$$
The learning algorithm is
$$
\begin{align}
\theta_0 &\leftarrow \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m (h_{\theta}(\boldsymbol{x}^{(i)})-y^{(i)}), \\
\theta_j &\leftarrow \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^m (h_{\theta}(\boldsymbol{x}^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j\right], \quad j=1,2,...,n.
\end{align}
$$
Note that, by convention, the bias parameter $\theta_0$ is not regularized.

__Regularized logistic regression__: The regularized cost function is
$$
J(\boldsymbol{\theta}) = \frac{1}{m}\sum_{i=1}^m[-y^{(i)}\log(h_{\theta}(\boldsymbol{x}^{(i)}))-(1-y^{(i)})\log(1-h_{\theta}(\boldsymbol{x}^{(i)}))] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2.
$$
The learning algorithm has the same expression as that of the regularized linear regression.
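A sketch of how the regularized update might look in NumPy, following the convention above that $\theta_0$ is left unpenalized; the function and variable names are my own:

```python
import numpy as np

def regularized_linear_regression(X, y, alpha=0.1, lam=1.0, n_iters=2000):
    """Gradient descent on the regularized cost. X is m x (n+1) with a
    leading column of ones; theta_0 is not penalized."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m    # unregularized part of the gradient
        grad[1:] += (lam / m) * theta[1:]   # shrink theta_1 ... theta_n only
        theta -= alpha * grad
    return theta
```

For regularized logistic regression, only the hypothesis changes: replace `X @ theta` in the gradient with the sigmoid of `X @ theta`.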
## Neural networks

When the model has too many features, i.e., the dimension of $\boldsymbol{x}$ is large, the computational complexity of the linear and logistic regressions becomes huge. This is where neural networks come into play. First of all, we may ask why neural networks are so powerful. The answer lies in the fact that a neural network can approximate any continuous function to any desired accuracy, which has been proven mathematically, that is,

> Suppose we're given a [continuous] function $f(x)$ which we'd like to compute to within some desired accuracy $\epsilon>0$. The guarantee is that by using enough hidden neurons we can always find a neural network whose output $g(x)$ satisfies:
>
> $$
> |g(x)-f(x)| < \epsilon
> $$
>
> for all input $x$. In other words, the approximation will be good to within the desired accuracy for every possible input.
> --- Michael Nielsen, Neural networks and deep learning[^1c]

### Model representation

__Biological neurons__:

<img src="https://i.imgur.com/7A4Vrxr.png" width=500>
<img src="https://i.imgur.com/QGWEmdh.png" width=500>

__Neural networks__: A neural network consists of many layers of neurons, and each neuron (or activation unit) takes some features as input and provides an output according to a learning model. Below is a single neuron with a sigmoid activation function. In neural networks, the parameters are also referred to as weights.

<img src="https://i.imgur.com/AuPJAql.png" width=400>

For the neural network shown in the figure below, the first layer is called the input layer, the last layer the output layer, and the middle layer the hidden layer. In each layer there is a bias unit.

<img src="https://i.imgur.com/FTI9IPv.png" width=500>

### Forward propagation

In order to represent the hypothesis of the neural network, we introduce the following notations:

$a_i^{(j)}$: the $i^{th}$ activation unit in the $j^{th}$ layer;

$\Theta^{(j)}$: the matrix of weights controlling the function mapping from layer $j$ to layer $j+1$; its size is $s_{j+1} \times (s_j+1)$, where $s_j$ and $s_{j+1}$ are the numbers of units in the $j^{th}$ and $(j+1)^{th}$ layers, respectively.

In the above example,
$$
\begin{align}
a_1^{(2)} &= g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3), \\
a_2^{(2)} &= g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3), \\
a_3^{(2)} &= g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3),
\end{align}
$$
and
$$
h_{\Theta}(\boldsymbol{x}) = g(\Theta^{(2)}_{10}a^{(2)}_0 + \Theta^{(2)}_{11}a^{(2)}_1 + \Theta^{(2)}_{12}a^{(2)}_2 + \Theta^{(2)}_{13}a^{(2)}_3).
$$
We can vectorize the representation as follows. Let
$$
\begin{align}
\boldsymbol{x} =
\begin{bmatrix}
x_0 \\ x_1 \\ x_2 \\ x_3
\end{bmatrix}, \quad
\boldsymbol{z}^{(2)} =
\begin{bmatrix}
z^{(2)}_1 \\ z^{(2)}_2 \\ z^{(2)}_3
\end{bmatrix}, \quad
\boldsymbol{z}^{(2)} = \Theta^{(1)} \boldsymbol{x}, \quad \boldsymbol{a}^{(2)} = g(\boldsymbol{z}^{(2)}),
\end{align}
$$
and, after prepending the bias unit $a^{(2)}_0 = 1$ to $\boldsymbol{a}^{(2)}$, let
$$
\boldsymbol{z}^{(3)} = \Theta^{(2)} \boldsymbol{a}^{(2)}, \quad h_{\Theta}(\boldsymbol{x}) = a^{(3)}_1 = g(\boldsymbol{z}^{(3)}).
$$
The units $\boldsymbol{a}^{(j)}$ in the hidden layers can be viewed as newly evolved features derived from the original features $\boldsymbol{x}$.
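The vectorized forward pass translates almost line by line into code. A minimal sketch, assuming sigmoid activations throughout; the helper `forward_propagate` and the random weights are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Thetas):
    """Vectorized forward pass. Thetas[j] has size s_{j+1} x (s_j + 1);
    x contains the raw features, without the bias unit x_0."""
    a = x
    for Theta in Thetas:
        a = np.concatenate(([1.0], a))   # prepend the bias unit a_0 = 1
        a = sigmoid(Theta @ a)           # z^(j+1) = Theta^(j) a^(j); a^(j+1) = g(z)
    return a                             # output layer, h_Theta(x)

# The 3-3-1 network from the example: Theta^(1) is 3x4, Theta^(2) is 1x4
rng = np.random.default_rng(1)
Thetas = [rng.normal(size=(3, 4)), rng.normal(size=(1, 4))]
print(forward_propagate(np.array([0.5, -1.2, 3.0]), Thetas))
```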
### Cost function

Neural network classification problems can be divided into binary classification and multi-class classification.

<img src="https://i.imgur.com/mHlpxUQ.png" width=500>

In the neural network, the output could be a vector of dimension $K$, that is, $h_{\Theta}(\boldsymbol{x}) \in \mathbb{R}^{K}$. We denote the $k^{th}$ output as $(h_{\Theta}(\boldsymbol{x}))_k$. In analogy to the cost function of logistic regression, the cost function of the neural network can be written as
$$
\begin{align}
J(\Theta) = &-\frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left( y_k^{(i)}\log(h_{\Theta}(\boldsymbol{x}^{(i)}))_{k} + (1-y_k^{(i)})\log(1-(h_{\Theta}(\boldsymbol{x}^{(i)}))_k) \right) \\
&+\frac{\lambda}{2m} \sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta^{(l)}_{ji})^2.
\end{align}
$$
(Note $m$ denotes the total number of training samples and $K$ denotes the total number of classifiers.) The cost function above is a summation of two parts: the first is a summation of the differences between the output of the hypothesis and the real value from the training sample for each class, and the second is a summation of the squared components of the weight matrices for all layers, excluding the bias terms.

### Back propagation

#### Understand from a simple case

Note: In this section the notations are different from those adopted by Andrew Ng.

To help understand what back propagation is, let us start with the simplest case, where there is only one neuron in each layer and there are no regularization terms.

<img src="https://i.imgur.com/dNSZwPe.png" width=500>

The following derivations are from the video tutorial by 3BLUE1BROWN[^3]. Another excellent introduction to the back propagation algorithm can be found [on this page](https://mc.ai/a-beginners-guide-to-deriving-and-implementing-backpropagation/)[^4].

From the forward propagation we know that, given a training sample $(x,y)$, the error between the output $h_{\theta}(x)$ and the real value $y$, or the cost function, is $C(w_1,b_1,w_2,b_2,w_3,b_3)$, where $w_i$ and $b_i$ are, respectively, the weight and bias associated with the function mapping from layer $i$ to layer $i+1$. We try to minimize the value of $C$ by tuning $w_i$ and $b_i$ according to the algorithm we discuss next.

First of all, let us only consider the last two layers and see how the value of the cost function, $C_0$, is affected by the weight $w^{(L)}$, the bias $b^{(L)}$, and the value of the activation unit $a^{(L-1)}$ from the previous layer. Here $z^{(L)} = w^{(L)}a^{(L-1)} + b^{(L)}$ and $a^{(L)} = \sigma(z^{(L)})$.

<img src="https://i.imgur.com/l58X0kd.png" width=500>

According to the chain rule, we have
$$
\frac{\partial C_0}{\partial w^{(L)}} = \frac{\partial C_0}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial w^{(L)}}.
$$
(We understand $\partial C_0$ as the change in $C_0$ due to a slight change $\partial w^{(L)}$ in $w^{(L)}$.) Since
$$
\begin{align}
\frac{\partial z^{(L)}}{\partial w^{(L)}} &= a^{(L-1)}, \\
\frac{\partial a^{(L)}}{\partial z^{(L)}} &= \sigma'\left(z^{(L)}\right),
\end{align}
$$
we have
$$
\frac{\partial C_0}{\partial w^{(L)}} = a^{(L-1)} \sigma'\left(z^{(L)}\right) \frac{\partial C_0}{\partial a^{(L)}}.
$$
Similarly,
$$
\begin{align}
\frac{\partial C_0}{\partial b^{(L)}} &= \sigma'\left(z^{(L)}\right) \frac{\partial C_0}{\partial a^{(L)}}, \\
\color{red}{\frac{\partial C_0}{\partial a^{(L-1)}}} &= w^{(L)} \sigma'\left(z^{(L)}\right) \color{red}{\frac{\partial C_0}{\partial a^{(L)}}}.
\end{align}
$$
Now we are ready to study the change of $C_0$ due to the changes of the weight $w^{(L-1)}$, the bias $b^{(L-1)}$, and the activation unit value $a^{(L-2)}$, which is where the idea of _back propagation_ comes in.

<img src="https://i.imgur.com/7XAj4Hl.png" width=500>

Again, by the chain rule we have
$$
\begin{align}
\frac{\partial C_0}{\partial w^{(L-1)}} &= \color{red}{\frac{\partial C_0}{\partial a^{(L-1)}}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} \frac{\partial z^{(L-1)}}{\partial w^{(L-1)}} \\
&= a^{(L-2)} \sigma'\left(z^{(L-1)}\right) \color{red}{\frac{\partial C_0}{\partial a^{(L-1)}}}, \\
\frac{\partial C_0}{\partial b^{(L-1)}} &= \color{red}{\frac{\partial C_0}{\partial a^{(L-1)}}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} \frac{\partial z^{(L-1)}}{\partial b^{(L-1)}} \\
&= \sigma'\left(z^{(L-1)}\right) \color{red}{\frac{\partial C_0}{\partial a^{(L-1)}}}, \\
\color{red}{\frac{\partial C_0}{\partial a^{(L-2)}}} &= \color{red}{\frac{\partial C_0}{\partial a^{(L-1)}}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} \frac{\partial z^{(L-1)}}{\partial a^{(L-2)}} \\
&= w^{(L-1)} \sigma'\left(z^{(L-1)}\right) \color{red}{\frac{\partial C_0}{\partial a^{(L-1)}}}.
\end{align}
$$
We can continue in this fashion until we find the derivatives of $C_0$ with respect to the weight $w^{(1)}$ and bias $b^{(1)}$ controlling the function mapping from the first layer to the second. With all the derivatives of $C_0$ with respect to $w_i$ and $b_i$, we are able to update them using the gradient descent method to minimize $C_0$. That is,
$$
\begin{align}
w_i &\leftarrow w_i - \alpha \frac{\partial C_0}{\partial w_i}, \\
b_i &\leftarrow b_i - \alpha \frac{\partial C_0}{\partial b_i}.
\end{align}
$$
The above derivation is for a neural network with only one neuron in each layer and for only one training sample. For a neural network composed of multiple neurons in each layer and trained on a set of data, the expressions become more complicated, but the core idea remains the same.
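The one-neuron-per-layer derivation can be implemented in a few lines. Below is a sketch assuming the quadratic cost $C_0 = (a^{(L)} - y)^2$; the function `backprop_chain` and the toy numbers are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_chain(x, y, w, b, alpha=0.5):
    """One gradient descent step for a chain of single-neuron layers.
    w[i], b[i] map layer i to layer i+1; cost C0 = (a^(L) - y)^2."""
    # Forward pass, storing every z and activation
    a, zs = [x], []
    for wi, bi in zip(w, b):
        zs.append(wi * a[-1] + bi)       # z = w * a_prev + b
        a.append(sigmoid(zs[-1]))
    # Backward pass, starting from dC0/da at the output layer
    dC_da = 2.0 * (a[-1] - y)
    for i in reversed(range(len(w))):
        sp = sigmoid(zs[i]) * (1.0 - sigmoid(zs[i]))   # sigma'(z)
        dC_dz = sp * dC_da
        dC_da = w[i] * dC_dz             # dC0/da_prev, before w[i] changes
        w[i] -= alpha * a[i] * dC_dz     # dC0/dw = a_prev * sigma'(z) * dC0/da
        b[i] -= alpha * dC_dz            # dC0/db = sigma'(z) * dC0/da
    return w, b

# Drive the output toward y = 0.25 for the single sample x = 1.0
w, b = [0.5, -0.3, 0.8], [0.1, 0.1, 0.1]
for _ in range(1000):
    w, b = backprop_chain(1.0, 0.25, w, b)
```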
#### Algorithm

Recall that for a complicated neural network trained on a large data set, we need to minimize the cost function $J(\Theta)$, which involves computing the derivative $\partial J(\Theta)/\partial \Theta^{(l)}_{ij}$. Once again we note that $l \in \{1,2,...,L-1\}$ is the layer number, $i \in \{1,2,...,s_{l+1}\}$, and $j \in \{0,1,...,s_l\}$, where $s_l$ and $s_{l+1}$ are the numbers of activation units in the $l^{th}$ and $(l+1)^{th}$ layers, respectively.

As an example, we consider a neural network with 4 layers shown in the figure below. The error between the output of the hypothesis and the real value in the last layer is
$$
\delta^{(4)}_j = a^{(4)}_j - y_j, \quad \text{or} \quad \boldsymbol{\delta}^{(4)} = \boldsymbol{a}^{(4)} - \boldsymbol{y}.
$$
We denote the "error" of unit $j$ in layer $l$ as $\delta^{(l)}_j:=\frac{\partial J}{\partial z_j^{(l)}}$. Then we can compute the "error" in the previous layers in a fashion of backward propagation. That is,
$$
\begin{align}
\boldsymbol{\delta}^{(3)} &= (\Theta^{(3)})^{\text{T}} \boldsymbol{\delta}^{(4)}.*g'(\boldsymbol{z}^{(3)}), \\
\boldsymbol{\delta}^{(2)} &= (\Theta^{(2)})^{\text{T}} \boldsymbol{\delta}^{(3)}.*g'(\boldsymbol{z}^{(2)}).
\end{align}
$$
The symbol ".*" denotes the element-wise multiplication of matrices, also known as the Hadamard product[^5]. Recall $\boldsymbol{z}^{(3)} = \Theta^{(2)} \boldsymbol{a}^{(2)}$ and $\boldsymbol{z}^{(2)} = \Theta^{(1)} \boldsymbol{a}^{(1)}$.

<img src="https://i.imgur.com/jKk8NOE.png" width=500>

It can be shown that $g'(\boldsymbol{z}^{(l)}) = \boldsymbol{a}^{(l)}.*(\boldsymbol{1}-\boldsymbol{a}^{(l)})$.

:::spoiler Proof
Recall that
$$
g(\boldsymbol{z}^{(l)}) = \frac{1}{1+e^{-\boldsymbol{z}^{(l)}}}.
$$
By taking the derivative, we have
$$
\begin{align}
g'(\boldsymbol{z}^{(l)}) &= \frac{\partial g}{\partial\boldsymbol{z}^{(l)}} \\
&= \frac{1}{1+e^{-\boldsymbol{z}^{(l)}}} .* \left( 1- \frac{1}{1+e^{-\boldsymbol{z}^{(l)}}} \right) \\
&= g(\boldsymbol{z}^{(l)}).* \left(1-g(\boldsymbol{z}^{(l)})\right) \\
&= \boldsymbol{a}^{(l)}.* \left(1-\boldsymbol{a}^{(l)}\right).
\end{align}
$$
:::

It can be further shown that when ignoring regularization ($\lambda = 0$), we have
$$
\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} = a_j^{(l)} \delta_i^{(l+1)}.
$$
The proof can be found on [this page](https://blog.innovea.tech/https-medium-com-sebastien-attia-cheat-sheet-on-neural-networks-1-30735616584a)[^6] and [this page](https://stats.stackexchange.com/questions/94387/how-to-derive-errors-in-neural-network-with-the-backpropagation-algorithm)[^7]. I reproduced the proof procedure in [this note](https://hackmd.io/@wldeng/back-propagation)[^8].

[^1a]: https://www.coursera.org/learn/machine-learning
[^1b]: http://neuralnetworksanddeeplearning.com/chap2.html
[^1c]: http://neuralnetworksanddeeplearning.com/chap4.html
[^2]: https://en.wikipedia.org/wiki/Cross_entropy
[^3]: https://www.youtube.com/watch?v=tIeHLnjs5U8
[^4]: https://mc.ai/a-beginners-guide-to-deriving-and-implementing-backpropagation/
[^5]: https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29
[^6]: https://blog.innovea.tech/https-medium-com-sebastien-attia-cheat-sheet-on-neural-networks-1-30735616584a
[^7]: https://stats.stackexchange.com/questions/94387/how-to-derive-errors-in-neural-network-with-the-backpropagation-algorithm
[^8]: https://hackmd.io/@wldeng/back-propagation
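Putting the pieces together, here is a sketch of the backward pass for one training sample with $\lambda = 0$. It is my own implementation of the formulas above, assuming sigmoid activations and the common convention of dropping the bias component of the back-propagated error:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(x, y, Thetas):
    """dJ/dTheta^(l) for a single sample (lambda = 0). Thetas is the
    0-based list of the matrices Theta^(l); x excludes the bias unit."""
    # Forward pass; keep each layer's activation together with its bias unit
    a_with_bias, a = [], x
    for Theta in Thetas:
        a = np.concatenate(([1.0], a))
        a_with_bias.append(a)
        a = sigmoid(Theta @ a)
    delta = a - y                                   # delta^(L) = a^(L) - y
    grads = [None] * len(Thetas)
    for l in reversed(range(len(Thetas))):
        grads[l] = np.outer(delta, a_with_bias[l])  # dJ/dTheta_ij = a_j delta_i
        if l > 0:
            back = (Thetas[l].T @ delta)[1:]        # drop the bias component
            a_prev = a_with_bias[l][1:]             # activations without bias
            delta = back * a_prev * (1.0 - a_prev)  # .* g'(z) = a .* (1 - a)
    return grads
```

A numerical gradient check against finite differences of $J(\Theta)$ is a good way to validate an implementation like this before trusting it for training.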