Machine Learning
The material is from Andrew Ng's course Machine Learning[1]. Another excellent resource is the book "Neural Networks and Deep Learning" by Michael Nielsen[2].
Training sample: $\{(x^{(i)}, y^{(i)});\ i = 1, \dots, m\}$, where $x$ is called the feature (input) and $y$ the target (output).
$(x^{(i)}, y^{(i)})$ –- the $i$th instance of the training sample.
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$. Given a value of the feature, it predicts the value of the target.
What is the flow chart of the machine learning process?
Cost function: Measures the error between the prediction of the hypothesis and the target value from the training set. We denote it as $J(\theta_0, \theta_1)$. We can define many different versions of such measures. One of the most common choices is the so-called $L^2$ norm, which is defined by the equation
$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2,$$
where $m$ is the size of the training sample.
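As a concrete check, here is a minimal sketch in Python (using NumPy) of the cost function above, evaluated on a tiny hypothetical data set:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J = (1/(2m)) * sum((h(x_i) - y_i)^2) for h(x) = theta0 + theta1*x."""
    m = len(x)
    h = theta0 + theta1 * x
    return (1 / (2 * m)) * np.sum((h - y) ** 2)

# Hypothetical data lying exactly on y = 2x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
# A perfect fit (theta0=0, theta1=2) gives zero cost; a worse fit costs more.
perfect = cost(0.0, 2.0, x, y)
worse = cost(0.0, 1.0, x, y)
```

Note that a smaller cost always corresponds to a hypothesis closer to the data, which is what gradient descent will exploit below.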
The goal of the machine learning algorithm is to find the parameters $\theta_0$ and $\theta_1$ such that the cost function achieves its minimum value.
Starting from an initial guess of $\theta_0$ and $\theta_1$, we update them as
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1), \quad j = 0, 1,$$
where $\alpha$ is referred to as the learning rate.
When applying the GD method to the linear regression, we have
$$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right),$$
$$\theta_1 := \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}.$$
The aforementioned method is often referred to as the batch GD method, since every iteration updates the parameters using the information from all instances of the training sample.
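The batch GD updates above can be sketched as follows; the data and hyperparameters are hypothetical, chosen so that the exact line $y = 1 + 2x$ is recoverable:

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.1, iters=1000):
    """Batch GD for h(x) = theta0 + theta1*x: each step uses all m samples."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        h = theta0 + theta1 * x               # predictions for all samples
        theta0 -= alpha * (1 / m) * np.sum(h - y)
        theta1 -= alpha * (1 / m) * np.sum((h - y) * x)
    return theta0, theta1

# Noiseless data from y = 1 + 2x; GD should recover both parameters.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
t0, t1 = batch_gradient_descent(x, y)
```

For noiseless data on a line, the cost has a unique minimum at the true parameters, so the iterates converge there for a suitably small learning rate.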
The model we consider now has more than one feature, $x_1, x_2, \dots, x_n$, yet still one target value, $y$.
The hypothesis becomes
$$h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n,$$
where $\theta = (\theta_0, \theta_1, \dots, \theta_n)^T$ is a column vector and $x = (x_0, x_1, \dots, x_n)^T$ with $x_0 = 1$ called the zero feature.
The cost function becomes
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2.$$
The corresponding updates of $\theta_0$, $\theta_1$, …, $\theta_n$ are
$$\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}, \quad j = 0, 1, \dots, n.$$
To help GD converge faster, we can normalize each feature as $x_j \leftarrow (x_j - \mu_j)/s_j$, where $\mu_j$ and $s_j$ are the average and standard deviation of $x_j$.
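A vectorized sketch of multivariate batch GD with feature normalization; the data here are hypothetical and noiseless, with two features on very different scales:

```python
import numpy as np

def scale_features(X):
    """Normalize each feature: x_j <- (x_j - mu_j) / s_j."""
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    return (X - mu) / s

def gradient_descent(X, y, alpha=0.1, iters=2000):
    """Vectorized batch GD; X must already include the x0 = 1 column."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        theta -= alpha * (1 / m) * X.T @ (X @ theta - y)
    return theta

# Hypothetical data: feature 1 in [0, 1], feature 2 in [0, 1000].
rng = np.random.default_rng(0)
raw = rng.uniform([0.0, 0.0], [1.0, 1000.0], size=(50, 2))
y = 3.0 + 2.0 * raw[:, 0] - 0.01 * raw[:, 1]      # exactly linear target
X = np.column_stack([np.ones(len(raw)), scale_features(raw)])
theta = gradient_descent(X, y)
```

Without the normalization step, the same learning rate would either diverge or converge very slowly, because the cost surface is badly elongated along the large-scale feature.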
We now move on to classification problems, e.g., deciding whether an email is spam or not, in which the prediction takes discrete values.
The logistic regression is one of the most common learning algorithms for classification problems.
The output of logistic regression always lies in the range $(0, 1)$.
Logistic regression model:
$$h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}},$$
where $g(z)$ is called the logistic/sigmoid function.
For a set of parameters $\theta$ and input $x$, the model estimates the probability that the output equals one, i.e., $h_\theta(x) = P(y = 1 \mid x; \theta)$.
Here is an example: $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$. It predicts a cross when $h_\theta(x) \geq 0.5$, i.e., $\theta^T x \geq 0$, and a circle otherwise.
Cost function:
To construct a convex cost function, we define
$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1, \\ -\log(1 - h_\theta(x)) & \text{if } y = 0. \end{cases}$$
It means that when $y = 1$ and the prediction is also 1, the cost is 0; when $y = 1$ but the prediction is 0, the cost is $\infty$. Similarly for the case when $y = 0$.
Then the cost function is
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right).$$
Note that $y$ can only take the value 0 or 1. We can rewrite $\mathrm{Cost}$ as
$$\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x)).$$
And the cost function becomes
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right].$$
The above function is also called cross-entropy loss function[3].
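A minimal sketch of the cross-entropy cost in Python; the toy data below are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(theta, X, y):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Hypothetical toy data: x0 = 1 plus one feature.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
# With theta = 0 every prediction is 0.5, so the cost is log(2) per sample.
cost_at_zero = cross_entropy_cost(np.zeros(2), X, y)
```

The value $\log 2 \approx 0.693$ at $\theta = 0$ is a useful sanity check when implementing this cost: any trained model should do better than that baseline.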
We update the parameters as
$$\theta_j := \theta_j - \alpha\frac{\partial J(\theta)}{\partial \theta_j},$$
where
$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$
for $j = 0, 1, \dots, n$.
Train the three classifiers $h_\theta^{(i)}(x)$, $i = 1, 2, 3$. For a given $x$, choose the class $i$ such that $h_\theta^{(i)}(x)$ is maximum, i.e., $\max_i h_\theta^{(i)}(x)$.
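The one-vs-all prediction rule can be sketched as follows; the weight matrix here is hypothetical, one row of parameters per trained classifier:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(Theta, x):
    """Theta holds one row of parameters per classifier.
    Return the class index whose classifier output h_i(x) is largest."""
    scores = sigmoid(Theta @ x)
    return int(np.argmax(scores))

# Hypothetical weights for three classifiers over features (x0=1, x1, x2).
Theta = np.array([[ 1.0,  2.0, 0.0],
                  [ 0.0, -1.0, 1.0],
                  [-1.0,  0.0, 2.0]])
x = np.array([1.0, 0.0, 3.0])
predicted = predict_one_vs_all(Theta, x)   # classifier 2 has the largest score
```

Because the sigmoid is monotonic, taking the argmax of the raw scores $\theta^T x$ gives the same answer as taking the argmax of the probabilities.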
Overfitting of the model to the data comes from the higher-order terms in the hypothesis $h_\theta(x)$.
To avoid overfitting, we can penalize the higher-order terms in the cost function, i.e., regularization.
Introduce a regularization parameter, $\lambda$, which is a large number, in the cost function.
Regularized linear regression:
The regularized cost function is
$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right].$$
The learning algorithm is
$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)},$$
$$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}, \quad j = 1, 2, \dots, n.$$
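One regularized GD step can be sketched as below. Note that $\theta_0$ is excluded from the penalty; when the gradient term vanishes, a step simply shrinks each $\theta_j$ ($j \geq 1$) by the factor $(1 - \alpha\lambda/m)$. The data are hypothetical:

```python
import numpy as np

def regularized_gd_step(theta, X, y, alpha, lam):
    """One GD step for regularized linear regression.
    theta_0 (the bias term) is not penalized."""
    m = len(y)
    grad = (1.0 / m) * X.T @ (X @ theta - y)
    reg = (lam / m) * theta
    reg[0] = 0.0                          # no penalty on theta_0
    return theta - alpha * (grad + reg)

# Construct an exact fit so the gradient term is zero: the step should
# shrink theta_1 by (1 - alpha*lam/m) = 0.95 and leave theta_0 unchanged.
X = np.array([[1.0, 1.0], [1.0, 2.0]])
theta = np.array([1.0, 2.0])
y = X @ theta                              # data generated by theta exactly
new_theta = regularized_gd_step(theta, X, y, alpha=0.1, lam=1.0)
```

This shrinkage view makes clear why regularization is sometimes called "weight decay": even at a perfect fit, the penalty keeps pulling the non-bias weights toward zero.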
Regularized logistic regression:
The regularized cost function is
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2.$$
The learning algorithm has the same expression as that of the regularized linear regression.
When the model has too many features, i.e., the dimension of $x$ is large, the computational complexity of linear and logistic regression becomes huge. This is where neural networks come into play.
First of all, we may ask why neural networks are so powerful. The answer lies in the fact that a neural network can approximate any function to any desired accuracy, which has been proven mathematically, that is,
Suppose we're given a [continuous] function $f(x)$ which we'd like to compute to within some desired accuracy $\epsilon > 0$. The guarantee is that by using enough hidden neurons we can always find a neural network whose output $g(x)$ satisfies
$$|g(x) - f(x)| < \epsilon$$
for all inputs $x$. In other words, the approximation will be good to within the desired accuracy for every possible input.
–- Michael Nielsen, Neural networks and deep learning[4]
Biological neurons:
Neural networks:
A neural network consists of many layers of neurons, and each neuron (or activation unit) takes some features as input and provides an output according to a learning model. Below is a sigmoid activation function, or a single neuron. In neural networks, the parameters $\theta$ are also referred to as weights.
For the neural network shown in the figure below, the first layer is called the input layer, the last layer the output layer, and the middle layer the hidden layer. In each layer there is a bias unit.
In order to represent the hypothesis of the neural networks, we introduce the following notations:
$a_i^{(j)}$: the activation of unit $i$ in layer $j$;
$\Theta^{(j)}$: the matrix of weights controlling the function mapping from layer $j$ to layer $j+1$; its size is $s_{j+1} \times (s_j + 1)$, where $s_j$ and $s_{j+1}$ are the numbers of units in the $j$th and $(j+1)$th layers, respectively.
In the above example,
$$a_i^{(2)} = g\left(\Theta_{i0}^{(1)} x_0 + \Theta_{i1}^{(1)} x_1 + \Theta_{i2}^{(1)} x_2 + \Theta_{i3}^{(1)} x_3\right), \quad i = 1, 2, 3,$$
and
$$h_\Theta(x) = a_1^{(3)} = g\left(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\right).$$
We can vectorize the representation as follows. Let
$$z^{(j)} = \Theta^{(j-1)} a^{(j-1)}, \quad \text{with } a^{(1)} = x,$$
and let
$$a^{(j)} = g\left(z^{(j)}\right).$$
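The vectorized forward propagation above can be sketched directly in code; the layer sizes and weight values below are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Thetas):
    """Forward propagation: a^(1) = x; for each layer,
    z^(j) = Theta^(j-1) @ a^(j-1) (bias unit prepended), a^(j) = g(z^(j))."""
    a = x
    for Theta in Thetas:
        a = np.concatenate(([1.0], a))    # add the bias unit a_0 = 1
        a = sigmoid(Theta @ a)
    return a

# Hypothetical 2-3-1 network: Theta^(1) is 3x3, Theta^(2) is 1x4.
Theta1 = np.full((3, 3), 0.5)
Theta2 = np.full((1, 4), 0.5)
out = forward(np.array([1.0, 0.0]), [Theta1, Theta2])
```

Note how the matrix shapes follow the rule $s_{j+1} \times (s_j + 1)$ from the notation above: the extra column in each $\Theta^{(j)}$ multiplies the bias unit.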
The units in the hidden layers can be viewed as new features evolved from the original features $x_1, \dots, x_n$.
Neural network classification problems can be divided into binary classification and multi-class classification.
In the multi-class case, the output could be a vector of dimension $K$, that is, $y \in \mathbb{R}^K$. We denote the output as $h_\Theta(x) \in \mathbb{R}^K$.
In analogy to the cost function of logistic regression, the cost function of the neural network can be written as
$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log\left(h_\Theta(x^{(i)})\right)_k + \left(1 - y_k^{(i)}\right)\log\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2.$$
(Note $m$ denotes the total number of training samples and $K$ denotes the total number of classes.)
The cost function above is a summation of two parts: the first is a summation of the difference between the output of the hypothesis and the real value from the training sample for each class, and the second is a summation of the squared components of the matrices of weights for all layers, except for the bias terms.
Note: In this section the notations are different from that are adopted by Andrew Ng.
To help understand what back propagation is, let us start with the simplest case, where there is only one neuron in each layer and no regularization term.
The following derivations are from the video tutorial by 3BLUE1BROWN[5].
Another excellent discussion of the back propagation algorithm can be found on this page[6].
From the forward propagation we know that, given a training sample $(x, y)$, the error between the output $a^{(L)}$ and the real value $y$, or the cost function, is $C = \left(a^{(L)} - y\right)^2$, where $a^{(l)} = \sigma\left(z^{(l)}\right)$ and $z^{(l)} = w^{(l)} a^{(l-1)} + b^{(l)}$, with $w^{(l)}$ and $b^{(l)}$, respectively, the weight and bias associated with the function mapping from layer $l-1$ to layer $l$. We try to minimize the value of $C$ by tuning the values of $w^{(l)}$ and $b^{(l)}$, according to an algorithm we will discuss next.
First of all, let us only consider the last two layers and see how the value of the cost function, $C$, is affected by the weight $w^{(L)}$, the bias $b^{(L)}$, and the value of the activation unit $a^{(L-1)}$ from the previous layer.
According to the chain rule, we have
$$\frac{\partial C}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial C}{\partial a^{(L)}}.$$
(We understand $\partial C/\partial w^{(L)}$ as the change in $C$ due to a slight change in $w^{(L)}$.)
Since
$$z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}, \qquad a^{(L)} = \sigma\left(z^{(L)}\right), \qquad C = \left(a^{(L)} - y\right)^2,$$
we have
$$\frac{\partial C}{\partial w^{(L)}} = a^{(L-1)}\,\sigma'\left(z^{(L)}\right)\,2\left(a^{(L)} - y\right).$$
Similarly,
$$\frac{\partial C}{\partial b^{(L)}} = \sigma'\left(z^{(L)}\right)\,2\left(a^{(L)} - y\right), \qquad \frac{\partial C}{\partial a^{(L-1)}} = w^{(L)}\,\sigma'\left(z^{(L)}\right)\,2\left(a^{(L)} - y\right).$$
Now we are ready to study the change of $C$ due to the changes of the weight $w^{(L-1)}$, the bias $b^{(L-1)}$, and the activation unit value $a^{(L-2)}$, which is where the idea of back propagation comes in.
Again, by the chain rule we have
$$\frac{\partial C}{\partial w^{(L-1)}} = \frac{\partial z^{(L-1)}}{\partial w^{(L-1)}}\frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}\frac{\partial C}{\partial a^{(L-1)}},$$
and similarly for $\partial C/\partial b^{(L-1)}$ and $\partial C/\partial a^{(L-2)}$, where $\partial C/\partial a^{(L-1)}$ is already known from the previous step.
We can move on until we find the derivatives of $C$ with respect to the weight and bias controlling the function mapping from the first layer to the second. With all the derivatives of $C$ with respect to $w^{(l)}$ and $b^{(l)}$, we are able to update them using the gradient descent method to minimize $C$.
That is,
$$w^{(l)} := w^{(l)} - \alpha\frac{\partial C}{\partial w^{(l)}}, \qquad b^{(l)} := b^{(l)} - \alpha\frac{\partial C}{\partial b^{(l)}}.$$
The above derivation is for a neural network with only one neuron in each layer and for only one training sample. For a neural network composed of multiple neurons in each layer and trained by a set of data, the expressions become more complicated, but the core idea remains the same.
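The single-neuron-per-layer derivation above can be sketched as code. The derivatives follow the chain-rule expressions just derived, and the result is checked against a finite-difference approximation; the weights, biases, and sample below are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_chain(ws, bs, x, y):
    """Backprop for a chain of single-neuron layers with C = (a_L - y)^2.
    Returns dC/dw and dC/db for every layer."""
    # Forward pass: store z and a for every layer.
    a = [x]
    zs = []
    for w, b in zip(ws, bs):
        z = w * a[-1] + b
        zs.append(z)
        a.append(sigmoid(z))
    # Backward pass: start from dC/da at the output, then chain backwards.
    dC_da = 2 * (a[-1] - y)
    dws, dbs = [], []
    for l in reversed(range(len(ws))):
        s = sigmoid(zs[l])
        dC_dz = dC_da * s * (1 - s)       # sigma'(z) = sigma(z)(1 - sigma(z))
        dws.insert(0, dC_dz * a[l])       # dC/dw = a_prev * sigma'(z) * dC/da
        dbs.insert(0, dC_dz)              # dC/db = sigma'(z) * dC/da
        dC_da = dC_dz * ws[l]             # propagate the error one layer back
    return dws, dbs

# Hypothetical 3-layer chain (two weight/bias pairs) and one sample.
ws, bs = [0.5, -0.3], [0.1, 0.2]
dws, dbs = backprop_chain(ws, bs, x=1.0, y=0.0)
```

Comparing these analytic derivatives against a numerical difference of the cost is a standard "gradient check" for catching indexing mistakes in backprop implementations.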
Recall that for a complicated neural network trained by a large data set, we need to minimize the cost function $J(\Theta)$, which involves computing the derivative $\partial J(\Theta)/\partial \Theta_{ij}^{(l)}$. Once again we note $l$ is the layer number, $l = 1, \dots, L-1$, and $\Theta^{(l)} \in \mathbb{R}^{s_{l+1}\times(s_l+1)}$, where $s_l$ and $s_{l+1}$ are the numbers of activation units in the $l$th and $(l+1)$th layers, respectively.
As an example, we consider the neural network with 4 layers shown in the figure below. The error between the output of the hypothesis and the real value in the last layer is
$$\delta^{(4)} = a^{(4)} - y.$$
We denote the "error" of unit $j$ in layer $l$ as $\delta_j^{(l)}$.
Then we can compute the "error" in the previous layers in a fashion of backward propagation. That is,
$$\delta^{(3)} = \left(\Theta^{(3)}\right)^T \delta^{(4)} .* g'\left(z^{(3)}\right), \qquad \delta^{(2)} = \left(\Theta^{(2)}\right)^T \delta^{(3)} .* g'\left(z^{(2)}\right).$$
The symbol ".*" denotes the element-wise multiplication of matrices, also known as the Hadamard product[7]. Recall $z^{(l)} = \Theta^{(l-1)} a^{(l-1)}$ and $a^{(l)} = g\left(z^{(l)}\right)$.
It can be shown that $g'\left(z^{(l)}\right) = a^{(l)} .* \left(1 - a^{(l)}\right)$.
Recall that
$$g(z) = \frac{1}{1 + e^{-z}}.$$
By taking the derivative, we have
$$g'(z) = \frac{e^{-z}}{\left(1 + e^{-z}\right)^2} = g(z)\left(1 - g(z)\right).$$
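The identity $g'(z) = g(z)(1 - g(z))$ is easy to verify numerically at a sample point with a centered finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Numerically verify g'(z) = g(z) * (1 - g(z)) at z = 0.7.
z = 0.7
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
```

This identity is what makes the sigmoid convenient in backprop: the derivative is computed from the already-stored activations, with no extra exponentials.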
It can be further shown that, when ignoring regularization ($\lambda = 0$), we have
$$\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} = a_j^{(l)}\,\delta_i^{(l+1)}.$$
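The delta recursion and the gradient formula above can be sketched for one training sample; this assumes the cross-entropy cost from earlier (so $\delta^{(L)} = a^{(L)} - y$) and $\lambda = 0$, with hypothetical network sizes and random weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_deltas(Thetas, x, y):
    """For one sample, compute dJ/dTheta^(l) = delta^(l+1) a^(l)^T
    via the backward delta recursion, ignoring regularization."""
    # Forward pass, storing activations with their bias units prepended.
    a = [np.concatenate(([1.0], x))]
    for Theta in Thetas[:-1]:
        a.append(np.concatenate(([1.0], sigmoid(Theta @ a[-1]))))
    a.append(sigmoid(Thetas[-1] @ a[-1]))          # output layer, no bias unit
    # Backward pass: delta^(L) = a^(L) - y for sigmoid + cross-entropy.
    delta = a[-1] - y
    grads = [np.outer(delta, a[-2])]
    for l in range(len(Thetas) - 2, -1, -1):
        # Propagate through Theta^(l+1), drop the bias component, and
        # multiply element-wise by g'(z) = a .* (1 - a).
        delta = (Thetas[l + 1].T @ delta)[1:] * a[l + 1][1:] * (1 - a[l + 1][1:])
        grads.insert(0, np.outer(delta, a[l]))
    return grads

# Hypothetical 2-3-1 network with random weights for illustration.
rng = np.random.default_rng(1)
Thetas = [rng.normal(size=(3, 3)), rng.normal(size=(1, 4))]
grads = backprop_deltas(Thetas, np.array([0.5, -0.2]), np.array([1.0]))
```

Each gradient matrix has exactly the shape of the corresponding $\Theta^{(l)}$, which is what allows the gradient descent update $\Theta^{(l)} := \Theta^{(l)} - \alpha\,\partial J/\partial\Theta^{(l)}$ to be written matrix-wise.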
The proof can be found on this page[8] and this page[9].
I repeated the proof procedure in this note[10].
https://en.wikipedia.org/wiki/Hadamard_product_(matrices)
https://mc.ai/a-beginners-guide-to-deriving-and-implementing-backpropagation/
https://blog.innovea.tech/https-medium-com-sebastien-attia-cheat-sheet-on-neural-networks-1-30735616584a
https://stats.stackexchange.com/questions/94387/how-to-derive-errors-in-neural-network-with-the-backpropagation-algorithm