
# Machine learning basics

###### tags: `machine learning`

The material is from Andrew Ng's course: Machine Learning[1]. Another excellent resource is the book "Neural networks and deep learning" written by Michael Nielsen[2].

## Linear regression with one variable

**Training sample:** $(x, y)$, where $x$ is called the feature (input) and $y$ the target (output). $(x^{(i)}, y^{(i)})$ denotes the $i$th instance of the training sample.

**Hypothesis:** $h_\theta(x) = \theta_0 + \theta_1 x$. Given a value of the feature, predict the value of the target.

What is the flow chart of the machine learning process?

*(Flow chart: training set → learning algorithm → hypothesis.)*

**Cost function:** Measures the error between the prediction from the hypothesis and the target value from the training set. We denote it as $J$. We can define many different versions of such measures. One of the most common choices is the so-called $L_2$ norm, which is defined by the equation

$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2,$$

where $m$ is the size of the training sample.
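As a quick check of the definition, here is a minimal NumPy sketch that evaluates $J$ for given parameters (the function name and toy data are my own, purely for illustration):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Mean squared error cost J(theta0, theta1) with the 1/(2m) convention."""
    m = len(y)
    h = theta0 + theta1 * x           # predictions h_theta(x^(i)) for all i
    return np.sum((h - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(cost(0.0, 2.0, x, y))           # 0.0, since h(x) = 2x fits this data exactly
print(cost(0.0, 1.0, x, y))           # > 0, an imperfect fit
```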

The goal of the machine learning algorithm is to find the parameters $\theta_0$ and $\theta_1$ such that the cost function $J(\theta_0, \theta_1)$ achieves its minimum value.

How do we find such a combination of $(\theta_0, \theta_1)$?
### Gradient descent (GD) method

Starting from an initial guess of $\theta_0$ and $\theta_1$, we update them as

$$\theta_j \leftarrow \theta_j - \alpha\frac{\partial J}{\partial\theta_j}, \quad j = 0, 1,$$

where $\alpha$ is referred to as the learning rate.

When applying the GD method to the linear regression, we have

$$\begin{aligned}
\theta_0 &\leftarrow \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right),\\
\theta_1 &\leftarrow \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}.
\end{aligned}$$

The aforementioned method is often referred to as the batch GD method, since every update of the parameters uses the information from all instances of the training sample.
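A minimal NumPy sketch of these updates (the toy data, learning rate, and iteration count are made-up values for illustration):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.02, n_iters=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0                  # initial guess
    for _ in range(n_iters):
        error = theta0 + theta1 * x - y        # h_theta(x^(i)) - y^(i)
        # "Batch": every update sums over all m training instances.
        theta0 -= alpha * error.sum() / m
        theta1 -= alpha * (error * x).sum() / m
    return theta0, theta1

# Toy data drawn near the line y = 2 + 3x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 50)
y = 2 + 3 * x + rng.normal(0, 0.1, 50)
print(batch_gradient_descent(x, y))            # approximately (2, 3)
```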

## Linear regression with multiple variables

The model we consider now has more than one feature, $x_1, x_2, \ldots, x_n$, yet still one target value, $y$.

The hypothesis becomes

$$h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n,$$

where $x$ is a column vector, $x = (x_0, x_1, x_2, \ldots, x_n)^T$, with $x_0 = 1$ called the zero feature.

The cost function becomes

$$J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2.$$

The corresponding updates of $\theta_0, \theta_1, \ldots, \theta_n$ are

$$\begin{aligned}
\theta_0 &\leftarrow \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right),\\
\theta_1 &\leftarrow \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_1^{(i)},\\
\theta_2 &\leftarrow \theta_2 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_2^{(i)},\\
&\;\;\vdots\\
\theta_n &\leftarrow \theta_n - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_n^{(i)}.
\end{aligned}$$

It is desirable to have features whose values are of the same magnitude or on a similar scale (for fast convergence of the learning algorithm). The way to do this is called feature scaling. Often we scale the value of the $j$th feature as

$$x_j := \frac{x_j - \mu_j}{\sigma_j}, \quad j = 1, 2, \ldots, n,$$

where $\mu_j$ and $\sigma_j$ are the average and standard deviation of $x_j$.
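A minimal sketch of this standardization in NumPy (function and variable names are my own):

```python
import numpy as np

def feature_scale(X):
    """Standardize each feature column: subtract the mean, divide by the std."""
    mu = X.mean(axis=0)                 # mu_j, the average of feature j
    sigma = X.std(axis=0)               # sigma_j, the standard deviation of feature j
    return (X - mu) / sigma, mu, sigma

# E.g., house sizes and bedroom counts live on very different scales.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])
X_scaled, mu, sigma = feature_scale(X)
print(X_scaled.mean(axis=0))            # ~[0, 0]
print(X_scaled.std(axis=0))             # [1, 1]
```

Note that the same $\mu_j$ and $\sigma_j$ computed on the training sample must be reused to scale any new input before prediction.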

## Logistic regression

We now move on to classification problems, e.g., whether an email is spam or not, in which the prediction takes discrete values.

Logistic regression is one of the most common learning algorithms for classification problems.

### Binary classification

The output of logistic regression is always in the range $[0, 1]$.

Logistic regression model:

$$h_\theta(x) \in [0, 1],$$

$$h_\theta(x) = g(\theta^T x), \quad g(z) = \frac{1}{1 + e^{-z}},$$

where $g$ is called the logistic/sigmoid function.

For a set of parameters $\theta$ and input $x$, the model estimates the probability that the output equals one, i.e.,

$$h_\theta(x) = P(y = 1\,|\,x, \theta).$$

*(Figure: graph of the sigmoid function.)*
**Decision boundary:**
The model predicts $y = 1$ for $h_\theta(x) \ge 0.5$ and $y = 0$ for $h_\theta(x) < 0.5$. From the graph of the sigmoid function we have $g(z = 0) = 0.5$, $g(z > 0) > 0.5$, and $g(z < 0) < 0.5$. Thus, we get that $y = 1$ when $\theta^T x \ge 0$ and $y = 0$ otherwise.

Here is an example: $h_\theta(x) = g(-3 + x_1 + x_2)$. It predicts a cross when $-3 + x_1 + x_2 \ge 0$ and a circle otherwise.
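A quick sketch of this decision rule in NumPy, using the example's parameters $\theta = (-3, 1, 1)$ (the test points are made up):

```python
import numpy as np

def predict(theta, x):
    """Predict y = 1 (cross) when theta^T x >= 0, else y = 0 (circle)."""
    return int(np.dot(theta, x) >= 0)

theta = np.array([-3.0, 1.0, 1.0])      # h_theta(x) = g(-3 + x1 + x2)
print(predict(theta, [1.0, 2.0, 2.0]))  # x1 + x2 = 4 >= 3 -> 1 (cross)
print(predict(theta, [1.0, 1.0, 1.0]))  # x1 + x2 = 2 <  3 -> 0 (circle)
```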

**Cost function:**

To construct a convex cost function, we define

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1,\\ -\log(1 - h_\theta(x)) & \text{if } y = 0. \end{cases}$$

It means that when $y = 1$ and the prediction is also 1, the cost is 0; when $y = 1$ but the prediction is 0, the cost is $\infty$. Similarly for the case when $y = 0$.

Then the cost function is

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}).$$

Note that $y$ can only take the value 0 or 1. We can therefore rewrite $\mathrm{Cost}(h_\theta(x), y)$ as

$$\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x)).$$

And the cost function becomes

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right].$$

The above function is also called the cross-entropy loss function[3].

We update the parameters $\theta$ as

$$\theta_j \leftarrow \theta_j - \alpha\frac{\partial J(\theta)}{\partial\theta_j},$$

where

$$\frac{\partial J(\theta)}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left[h_\theta(x^{(i)}) - y^{(i)}\right]x_j^{(i)},$$

for $j = 0, 1, \ldots, n$.
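Putting the cost and its gradient together, here is a minimal NumPy sketch of one gradient-descent step for logistic regression (names and the toy data are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(theta, X, y):
    """Cross-entropy cost J(theta) and its gradient.

    X has shape (m, n+1), with the first column all ones (x0 = 1)."""
    m = len(y)
    h = sigmoid(X @ theta)                  # h_theta(x^(i)) for all i
    J = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    grad = X.T @ (h - y) / m                # (1/m) sum_i (h - y) x_j^(i)
    return J, grad

# One update step on a toy dataset.
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]])
y = np.array([0.0, 1.0, 0.0])
theta, alpha = np.zeros(2), 0.1
J, grad = cost_and_gradient(theta, X, y)
theta -= alpha * grad                       # theta_j <- theta_j - alpha * dJ/dtheta_j
```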

### Multiclass classification


We train one classifier per class (three classifiers in this example), $h_\theta^{(i)}(x)$. For a given $x$, we choose the $i$ such that $h_\theta^{(i)}(x) = P(y = i\,|\,x, \theta)$ is maximum, i.e., $\max_i h_\theta^{(i)}(x)$.
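A sketch of this one-vs-all prediction rule, assuming the classifiers' parameters are stacked as rows of a matrix `Theta` (the numerical values below are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(Theta, x):
    """Pick the class i whose classifier output h^(i)(x) is largest."""
    probs = sigmoid(Theta @ x)              # h^(i)(x) = P(y = i | x, theta^(i))
    return int(np.argmax(probs))

Theta = np.array([[ 1.0, -2.0,  0.5],       # parameters of classifier 0
                  [-1.0,  1.5,  0.3],       # parameters of classifier 1
                  [ 0.2,  0.1, -1.0]])      # parameters of classifier 2
x = np.array([1.0, 2.0, 0.5])               # x0 = 1, then two features
print(predict_class(Theta, x))              # 1 for these made-up values
```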

## Overfitting and regularization

Overfitting of the model to the data comes from the higher-order terms in the hypothesis $h_\theta(x)$.

To avoid overfitting, we can penalize the higher-order terms in the cost function; this is called regularization.

We introduce a regularization parameter, $\lambda$, which is a large number, into the cost function.

**Regularized linear regression:**

The regularized cost function is

$$J(\theta) = \frac{1}{2m}\left(\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right).$$

The learning algorithm is

$$\begin{aligned}
\theta_0 &\leftarrow \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right),\\
\theta_j &\leftarrow \theta_j - \alpha\left(\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right), \quad j = 1, 2, \ldots, n.
\end{aligned}$$
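A sketch of one regularized update step in NumPy (names are my own; note that $\theta_0$ is not penalized):

```python
import numpy as np

def regularized_step(theta, X, y, alpha, lam):
    """One gradient-descent step for regularized linear regression.

    X has shape (m, n+1), with the first column all ones."""
    m = len(y)
    error = X @ theta - y                  # h_theta(x^(i)) - y^(i)
    grad = X.T @ error / m
    grad[1:] += (lam / m) * theta[1:]      # penalize theta_1 .. theta_n only
    return theta - alpha * grad
```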

**Regularized logistic regression:**

The regularized cost function is

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2.$$

The learning algorithm has the same expression as that of the regularized linear regression.

## Neural networks

When the model has too many features, i.e., the dimension of $x$ is large, the computational complexity of linear and logistic regression becomes huge. This is where neural networks come into play.

First of all, we may ask why neural networks are so powerful. The answer lies in the fact that a neural network can approximate any function to any desired accuracy, which has been proven mathematically. That is,

> Suppose we're given a [continuous] function $f(x)$ which we'd like to compute to within some desired accuracy $\epsilon > 0$. The guarantee is that by using enough hidden neurons we can always find a neural network whose output $g(x)$ satisfies
>
> $$|g(x) - f(x)| < \epsilon$$
>
> for all inputs $x$. In other words, the approximation will be good to within the desired accuracy for every possible input.
>
> — Michael Nielsen, Neural networks and deep learning[4]

### Model representation

**Biological neurons:**

*(Figures: biological neurons.)*

**Neural networks:**
A neural network consists of many layers of neurons, and each neuron (or activation unit) takes some features as input and produces an output according to a learning model. Below is a sigmoid activation function, i.e., a single neuron. In neural networks, the parameters are also referred to as weights.

*(Figure: a single sigmoid neuron.)*

For the neural network shown in the figure below, the first layer is called the input layer, the last layer the output layer, and the middle layer the hidden layer. In each layer there is a bias unit.

*(Figure: a three-layer neural network with input, hidden, and output layers.)*

### Forward propagation

In order to represent the hypothesis of the neural network, we introduce the following notation:

- $a_i^{(j)}$: the $i$th activation unit in the $j$th layer;
- $\Theta^{(j)}$: the matrix of weights controlling the function mapping from layer $j$ to layer $j+1$; its size is $s_{j+1} \times (s_j + 1)$, where $s_j$ and $s_{j+1}$ are the numbers of units in the $j$th and $(j+1)$th layers, respectively.

In the above example,

$$\begin{aligned}
a_1^{(2)} &= g\left(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3\right),\\
a_2^{(2)} &= g\left(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3\right),\\
a_3^{(2)} &= g\left(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3\right),
\end{aligned}$$

and

$$h_\Theta(x) = g\left(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}\right).$$

We can vectorize the representation as follows. Let

$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad z^{(2)} = \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix}, \quad z^{(2)} = \Theta^{(1)}x, \quad a^{(2)} = g(z^{(2)}),$$

and let

$$z^{(3)} = \Theta^{(2)}a^{(2)}, \quad h_\Theta(x) = a_1^{(3)} = g(z^{(3)}).$$
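A vectorized sketch of these two steps in NumPy (weights are randomly initialized purely for illustration; the bias units are appended by hand):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Forward propagation for the 3-layer example (x already includes x0 = 1)."""
    z2 = Theta1 @ x                     # z^(2) = Theta^(1) x
    a2 = sigmoid(z2)                    # a^(2) = g(z^(2))
    a2 = np.concatenate(([1.0], a2))    # prepend the bias unit a0^(2) = 1
    z3 = Theta2 @ a2                    # z^(3) = Theta^(2) a^(2)
    return sigmoid(z3)                  # h_Theta(x) = g(z^(3))

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))        # maps layer 1 (3 features + bias) to layer 2
Theta2 = rng.normal(size=(1, 4))        # maps layer 2 (3 units + bias) to the output
x = np.array([1.0, 0.2, -0.5, 0.7])     # x0 = 1, then three features
print(forward(x, Theta1, Theta2))
```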

The units $a^{(j)}$ in the hidden layers can be viewed as new features evolved from the original features $x$.

### Cost function

Neural-network classification problems can be divided into binary classification and multi-class classification.


In a neural network the output can be a vector of dimension $K$, that is, $h_\Theta(x) \in \mathbb{R}^K$. We denote the $k$th output as $(h_\Theta(x))_k$.

In analogy to the cost function of logistic regression, the cost function of the neural network can be written as

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left(y_k^{(i)}\log(h_\Theta(x^{(i)}))_k + (1 - y_k^{(i)})\log(1 - (h_\Theta(x^{(i)}))_k)\right) + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2.$$

(Note that $m$ denotes the total number of training samples and $K$ the total number of classes.)

The cost function above is a sum of two parts: the first sums, over each class, the difference between the output of the hypothesis and the real value from the training sample, and the second sums the squared components of the weight matrices for all layers, excluding the bias terms.

### Back propagation

#### Understand from a simple case

Note: In this section the notation differs from that adopted by Andrew Ng.

To help understand what back propagation is, let us start with the simplest case, where there is only one neuron in each layer and there are no regularization terms.

*(Figure: a chain of layers with a single neuron in each.)*

The following derivations are from the video tutorial by 3BLUE1BROWN[5].

Another excellent resource on the back propagation algorithm can be found on this page[6].

From forward propagation we know that, given a training sample $(x, y)$, the error between the output $h_\theta(x)$ and the real value $y$, i.e., the cost function, is $C(w_1, b_1, w_2, b_2, w_3, b_3)$, where $w_i$ and $b_i$ are, respectively, the weight and bias associated with the function mapping from layer $i$ to layer $i+1$. We try to minimize the value of $C$ by tuning the values of $w_i$ and $b_i$ according to the algorithm we will discuss next.

First of all, let us consider only the last two layers and see how the value of the cost function, $C_0$, is affected by the weight $w^{(L)}$, the bias $b^{(L)}$, and the value of the activation unit $a^{(L-1)}$ from the previous layer.
*(Figure: the weight, bias, and activations connecting the last two layers.)*

According to the chain rule, we have

$$\frac{\partial C_0}{\partial w^{(L)}} = \frac{\partial C_0}{\partial a^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial z^{(L)}}{\partial w^{(L)}}.$$

(We understand $\partial C_0$ as the change in $C_0$ due to a slight change $\partial w^{(L)}$ in $w^{(L)}$.)

Since

$$\frac{\partial z^{(L)}}{\partial w^{(L)}} = a^{(L-1)}, \qquad \frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma'(z^{(L)}),$$

we have

$$\frac{\partial C_0}{\partial w^{(L)}} = a^{(L-1)}\,\sigma'(z^{(L)})\,\frac{\partial C_0}{\partial a^{(L)}}.$$

Similarly,

$$\frac{\partial C_0}{\partial b^{(L)}} = \sigma'(z^{(L)})\,\frac{\partial C_0}{\partial a^{(L)}}, \qquad {\color{red}\frac{\partial C_0}{\partial a^{(L-1)}}} = w^{(L)}\,\sigma'(z^{(L)})\,\frac{\partial C_0}{\partial a^{(L)}}.$$

Now we are ready to study the change of $C_0$ due to changes of the weight $w^{(L-1)}$, the bias $b^{(L-1)}$, and the activation unit value $a^{(L-2)}$; this is where the idea of back propagation comes in.
*(Figure: the weights, biases, and activations of layers $L-2$ through $L$.)*

Again, by the chain rule we have

$$\begin{aligned}
\frac{\partial C_0}{\partial w^{(L-1)}} &= {\color{red}\frac{\partial C_0}{\partial a^{(L-1)}}}\frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}\frac{\partial z^{(L-1)}}{\partial w^{(L-1)}} = a^{(L-2)}\,\sigma'(z^{(L-1)})\,{\color{red}\frac{\partial C_0}{\partial a^{(L-1)}}},\\
\frac{\partial C_0}{\partial b^{(L-1)}} &= {\color{red}\frac{\partial C_0}{\partial a^{(L-1)}}}\frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}\frac{\partial z^{(L-1)}}{\partial b^{(L-1)}} = \sigma'(z^{(L-1)})\,{\color{red}\frac{\partial C_0}{\partial a^{(L-1)}}},\\
{\color{red}\frac{\partial C_0}{\partial a^{(L-2)}}} &= {\color{red}\frac{\partial C_0}{\partial a^{(L-1)}}}\frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}\frac{\partial z^{(L-1)}}{\partial a^{(L-2)}} = w^{(L-1)}\,\sigma'(z^{(L-1)})\,{\color{red}\frac{\partial C_0}{\partial a^{(L-1)}}}.
\end{aligned}$$

We can continue in this way until we find the derivatives of $C_0$ with respect to the weight $w^{(1)}$ and bias $b^{(1)}$ controlling the function mapping from the first layer to the second. With all the derivatives of $C_0$ with respect to $w_i$ and $b_i$, we are able to update them using the gradient descent method to minimize $C_0$.

That is,

$$w_i \leftarrow w_i - \alpha\frac{\partial C_0}{\partial w_i}, \qquad b_i \leftarrow b_i - \alpha\frac{\partial C_0}{\partial b_i}.$$

The above derivation is for a neural network with only one neuron in each layer and for only one training sample. For a neural network composed of multiple neurons in each layer and trained on a set of data, the expressions become more complicated, but the core idea remains the same.
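To make the chain-rule bookkeeping concrete, here is a sketch for such a one-neuron-per-layer chain and a single training sample $(x, y)$, assuming the quadratic cost $C_0 = (a^{(L)} - y)^2$ used in the 3Blue1Brown video (all names and numbers are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_chain(x, y, ws, bs):
    """Gradients of C0 = (a^(L) - y)^2 for a chain of one-neuron layers.

    ws, bs are lists of scalar weights and biases, one per layer transition."""
    # Forward pass: store every z^(l) and a^(l).
    a, zs, activations = x, [], [x]
    for w, b in zip(ws, bs):
        z = w * a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    # Backward pass: start from dC0/da^(L) and walk back layer by layer.
    dC_da = 2 * (a - y)
    grads_w, grads_b = [], []
    for l in reversed(range(len(ws))):
        sig = sigmoid(zs[l])
        dC_dz = sig * (1 - sig) * dC_da            # sigma'(z) * dC0/da
        grads_w.insert(0, activations[l] * dC_dz)  # dC0/dw = a_prev * dC0/dz
        grads_b.insert(0, dC_dz)                   # dC0/db = dC0/dz
        dC_da = ws[l] * dC_dz                      # propagate back to a_prev
    return grads_w, grads_b

print(backprop_chain(x=1.0, y=0.0, ws=[0.5, -0.3, 0.8], bs=[0.1, 0.2, -0.1]))
```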

#### Algorithm

Recall that for a complicated neural network trained on a large data set, we need to minimize the cost function $J(\Theta)$, which involves computing the derivatives $\partial J(\Theta)/\partial\Theta_{ij}^{(l)}$. Once again we note that $l \in \{1, 2, \ldots, L-1\}$ is the layer number, $i \in \{1, 2, \ldots, s_{l+1}\}$, and $j \in \{0, 1, \ldots, s_l\}$, where $s_l$ and $s_{l+1}$ are the numbers of activation units in the $l$th and $(l+1)$th layers, respectively.

As an example, we consider the neural network with 4 layers shown in the figure below. The error between the output of the hypothesis and the real value in the last layer is

$$\delta_j^{(4)} = a_j^{(4)} - y_j, \quad \text{or} \quad \delta^{(4)} = a^{(4)} - y.$$

We denote the "error" of unit $j$ in layer $l$ as $\delta_j^{(l)} := \dfrac{\partial J}{\partial z_j^{(l)}}$.

Then we can compute the "errors" in the previous layers in a backward-propagating fashion. That is,

$$\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)} \mathrel{.*} g'(z^{(3)}), \qquad \delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)} \mathrel{.*} g'(z^{(2)}).$$

The symbol ".*" denotes the element-wise multiplication of matrices, also known as the Hadamard product[7]. Recall that $z^{(3)} = \Theta^{(2)}a^{(2)}$ and $z^{(2)} = \Theta^{(1)}a^{(1)}$.
*(Figure: a four-layer neural network.)*

It can be shown that

$$g'(z^{(l)}) = a^{(l)} \mathrel{.*} (1 - a^{(l)}).$$

**Proof:**

Recall that

$$g(z^{(l)}) = \frac{1}{1 + e^{-z^{(l)}}}.$$

By taking the derivative, we have

$$g'(z^{(l)}) = \frac{\partial g}{\partial z^{(l)}} = \frac{1}{1 + e^{-z^{(l)}}} \mathrel{.*} \left(1 - \frac{1}{1 + e^{-z^{(l)}}}\right) = g(z^{(l)}) \mathrel{.*} \left(1 - g(z^{(l)})\right) = a^{(l)} \mathrel{.*} (1 - a^{(l)}).$$

It can further be shown that, when ignoring regularization ($\lambda = 0$), we have

$$\frac{\partial J(\Theta)}{\partial\Theta_{ij}^{(l)}} = a_j^{(l)}\delta_i^{(l+1)}.$$

The proof can be found on this page[8] and this page[9].

I repeated the proof procedure in this note[10].
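Putting the recursion for $\delta^{(l)}$ and the gradient formula together, here is a sketch of backpropagation for the 4-layer example and a single training sample (regularization ignored, $\lambda = 0$; the layer sizes, weights, and toy input are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(x, y, Thetas):
    """Backpropagation for one sample in a 4-layer network (lambda = 0).

    Thetas = [Theta1, Theta2, Theta3]; each Theta^(l) has shape
    s_{l+1} x (s_l + 1), the extra column acting on the bias unit."""
    # Forward pass, keeping every a^(l) (with bias unit) and z^(l).
    a = np.concatenate(([1.0], x))
    activations, zs = [a], []
    for l, Theta in enumerate(Thetas):
        z = Theta @ a
        zs.append(z)
        a = sigmoid(z)
        if l < len(Thetas) - 1:
            a = np.concatenate(([1.0], a))          # bias unit for hidden layers
        activations.append(a)
    # Backward pass: delta^(4) = a^(4) - y, then the delta recursion.
    delta = activations[-1] - y
    grads = [None] * len(Thetas)
    for l in reversed(range(len(Thetas))):
        grads[l] = np.outer(delta, activations[l])  # dJ/dTheta_ij = a_j delta_i
        if l > 0:
            g_prime = sigmoid(zs[l - 1]) * (1 - sigmoid(zs[l - 1]))
            delta = (Thetas[l].T @ delta)[1:] * g_prime  # drop the bias component
    return grads

rng = np.random.default_rng(0)
Thetas = [rng.normal(size=(4, 4)), rng.normal(size=(4, 5)), rng.normal(size=(2, 5))]
x, y = np.array([0.5, -0.2, 0.1]), np.array([1.0, 0.0])
for G in gradients(x, y, Thetas):
    print(G.shape)                                  # (4, 4), (4, 5), (2, 5)
```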


  1. https://www.coursera.org/learn/machine-learning ↩︎

  2. http://neuralnetworksanddeeplearning.com/chap2.html ↩︎

  3. https://en.wikipedia.org/wiki/Cross_entropy ↩︎

  4. http://neuralnetworksanddeeplearning.com/chap4.html ↩︎

  5. https://www.youtube.com/watch?v=tIeHLnjs5U8 ↩︎

  6. https://mc.ai/a-beginners-guide-to-deriving-and-implementing-backpropagation/ ↩︎

  7. https://en.wikipedia.org/wiki/Hadamard_product_(matrices) ↩︎

  8. https://blog.innovea.tech/https-medium-com-sebastien-attia-cheat-sheet-on-neural-networks-1-30735616584a ↩︎

  9. https://stats.stackexchange.com/questions/94387/how-to-derive-errors-in-neural-network-with-the-backpropagation-algorithm ↩︎

  10. https://hackmd.io/@wldeng/back-propagation ↩︎