# Machine learning basics

###### tags: `machine learning`

[toc]

The material is from Andrew Ng's course: Machine Learning[^1a]. Another excellent resource is the book _"Neural networks and deep learning"_ written by Michael Nielsen[^1b].

## Linear regression with one variable

Training sample $(x,y)$, where $x$ is called the feature (input) and $y$ the target (output). $(x^{(i)},y^{(i)})$ denotes the $i^{th}$ instance of the training sample.

__Hypothesis__: $h_{\theta}(x) = \theta_0 + \theta_1 x$. Given a value of the feature, predict the value of the target.

:::success
What is the flow chart of the machine learning process?
```flow
st=>start: training set
e=>end: hypothesis
op=>operation: learning algorithm
op1=>operation: Given input
op2=>operation: prediction
st->op->e
```
:::

__Cost function__: Measures the error between the prediction from the hypothesis and the target value from the training set. We denote it as $J$. We can define many different versions of such measures. One of the most common choices is the so-called $L_2$ norm, which is defined by the equation
$$
J(\theta_0,\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2,
$$
where $m$ is the size of the training sample. The goal of the machine learning algorithm is to find the parameters, $\theta_0$ and $\theta_1$, such that the cost function $J(\theta_0,\theta_1)$ achieves its minimum value.

:raised_hand: How do we find such a combination of $(\theta_0,\theta_1)$?

:point_right: __Gradient descent (GD) method__

Starting from an initial guess of $\theta_0$ and $\theta_1$, we update them as
$$
\theta_j \leftarrow \theta_j - \alpha \frac{\partial J}{\partial\theta_j}, \qquad j = 0,1,
$$
where $\alpha$ is referred to as the learning rate. When applying the GD method to the linear regression, we have
$$
\begin{align}
\theta_0 &\leftarrow \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)}), \\
\theta_1 &\leftarrow \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}.
\end{align}
$$
The aforementioned method is often referred to as the batch GD method, since at every step it updates the parameters using the information from all instances of the training sample.

## Linear regression with multiple variables

The model we consider now has more than one feature, $x_1, x_2, ..., x_n$, yet still one target value, $y$. The hypothesis becomes
$$
h_{\theta}(\boldsymbol{x}) = \boldsymbol{\theta}^{\text{T}}\cdot\boldsymbol{x} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n,
$$
where $\boldsymbol{x}$ is a column vector, $\boldsymbol{x} = (x_0,x_1,x_2,...,x_n)^{\text{T}}$, with $x_0=1$ called the zeroth feature. The cost function becomes
$$
J(\theta_0,\theta_1,...,\theta_n) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(\boldsymbol{x}^{(i)}) - y^{(i)})^2.
$$
The corresponding updates of $\theta_0$, $\theta_1$, ..., $\theta_n$ are
$$
\begin{align}
\theta_0 &\leftarrow \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(\boldsymbol{x}^{(i)})-y^{(i)}), \\
\theta_1 &\leftarrow \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(\boldsymbol{x}^{(i)})-y^{(i)})x_1^{(i)}, \\
\theta_2 &\leftarrow \theta_2 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(\boldsymbol{x}^{(i)})-y^{(i)})x_2^{(i)}, \\
&\vdots \\
\theta_n &\leftarrow \theta_n - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(\boldsymbol{x}^{(i)})-y^{(i)})x_n^{(i)}.
\end{align}
$$
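To make the update rules concrete, here is a minimal NumPy sketch of batch gradient descent for linear regression. The function name `batch_gradient_descent`, the hyperparameters, and the toy data are my own illustrations, not from the course:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Batch GD for linear regression. X is an m x (n+1) design matrix
    whose first column is all ones (the zeroth feature); y has length m."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(n_iters):
        error = X @ theta - y                # h_theta(x^(i)) - y^(i) for all i
        theta -= alpha * (X.T @ error) / m   # simultaneous update of every theta_j
    return theta

# Toy usage: recover y ~ 1 + 2x from noisy samples
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.05 * rng.normal(size=50)
print(batch_gradient_descent(X, y))          # approximately [1.0, 2.0]
```

Because the column of ones is part of `X`, the intercept $\theta_0$ is updated by the same rule as every other parameter.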
:bulb: It is desirable to have features whose values are of the same magnitude or on a similar scale (for fast convergence of the learning algorithm). The way to achieve this is called feature scaling. Often we scale the value of the $j^{th}$ feature as
$$
x_j := \frac{x_j-\mu_j}{\sigma_j}, \qquad j = 1, 2,...,n,
$$
where $\mu_j$ and $\sigma_j$ are the average and standard deviation of $x_j$.

## Logistic regression

We now move on to classification problems, e.g., whether an email is spam or not, in which the prediction takes discrete values. Logistic regression is one of the most common learning algorithms for classification problems.

### Binary classification

The output of logistic regression is always in the range $[0,1]$.

__Logistic regression model__: $h_{\theta}(\boldsymbol{x}) \in [0,1]$
$$
\begin{align}
h_{\theta}(\boldsymbol{x}) &= g(\boldsymbol{\theta}^{\text{T}}\cdot\boldsymbol{x}), \\
g(z) &= \frac{1}{1+e^{-z}},
\end{align}
$$
where $g$ is called the __logistic/sigmoid function__. For a set of parameters $\boldsymbol{\theta}$ and input $\boldsymbol{x}$, the model estimates the probability that the output equals one, i.e.,
$$
h_{\theta}(\boldsymbol{x}) = P(y=1|\boldsymbol{x},\boldsymbol{\theta}).
$$

:bulb:__Decision boundary__: The model predicts $y=1$ for $h_{\theta}(\boldsymbol{x}) \ge 0.5$ and $y=0$ for $h_{\theta}(\boldsymbol{x}) < 0.5$. From the graph of the sigmoid function we have $g(z=0) = 0.5$, $g(z>0) > 0.5$, and $g(z<0) < 0.5$. Thus, we get that $y=1$ when $\boldsymbol{\theta}^{\text{T}}\cdot\boldsymbol{x} \ge 0$ and $y=0$ otherwise. Here is an example: $h_{\theta}(\boldsymbol{x}) = g(-3 + x_1 + x_2)$. It predicts a cross when $-3+x_1+x_2 \ge 0$ and a circle otherwise.

<img src="https://i.imgur.com/yRpzlcR.png" width=500>

__Cost function__: To construct a convex cost function, we define
$$
Cost(h_{\theta}(\boldsymbol{x}),y) =
\begin{cases}
-\log(h_{\theta}(\boldsymbol{x})) \quad &\text{if} \quad y = 1, \\
-\log(1-h_{\theta}(\boldsymbol{x})) \quad &\text{if} \quad y = 0.
\end{cases}
$$
This means that when $y=1$ and the prediction is also 1, the cost is 0; when $y=1$ but the prediction is 0, the cost is $\infty$. The case of $y=0$ is similar. The cost function is then
$$
J(\boldsymbol{\theta}) = \frac{1}{m}\sum_{i=1}^m Cost(h_{\theta}(\boldsymbol{x}^{(i)}),y^{(i)}).
$$
Note that $y$ can only take the value 0 or 1. We can rewrite $Cost(h_{\theta}(\boldsymbol{x}),y)$ as
$$
Cost(h_{\theta}(\boldsymbol{x}),y) = -y\log(h_{\theta}(\boldsymbol{x}))-(1-y)\log(1-h_{\theta}(\boldsymbol{x})).
$$
And the cost function becomes
$$
J(\boldsymbol{\theta}) = \frac{1}{m}\sum_{i=1}^m[-y^{(i)}\log(h_{\theta}(\boldsymbol{x}^{(i)}))-(1-y^{(i)})\log(1-h_{\theta}(\boldsymbol{x}^{(i)}))].
$$
The above function is also called the cross-entropy loss function[^2]. We update the parameters $\boldsymbol{\theta}$ as
$$
\theta_j \leftarrow \theta_j - \alpha \frac{\partial J(\boldsymbol{\theta})}{\partial \theta_j},
$$
where
$$
\frac{\partial J(\boldsymbol{\theta})}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m}[h_{\theta}(\boldsymbol{x}^{(i)}) - y^{(i)}]x_j^{(i)},
$$
for $j=0,1,...,n$.

### Multiclass classification

<img src="https://i.imgur.com/KVezdki.png" width=500>

Train the three classifiers $h_{\theta}^{(i)}(\boldsymbol{x})$, one for each class. For a given $\boldsymbol{x}$, choose the $i$ such that $h_{\theta}^{(i)}(\boldsymbol{x}) = P(y=i|\boldsymbol{x},\boldsymbol{\theta})$ is maximal, i.e., $\max_{i}h_{\theta}^{(i)}(\boldsymbol{x})$.
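The gradient above has exactly the same form as in linear regression, only with the sigmoid hypothesis. Here is a minimal NumPy sketch of binary logistic regression trained by gradient descent; the names and hyperparameters are my own, not from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.1, n_iters=5000):
    """Gradient descent on the cross-entropy cost. X is m x (n+1) with a
    leading column of ones; y holds 0/1 labels."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)                 # P(y=1 | x, theta) per sample
        theta -= alpha * (X.T @ (h - y)) / m   # dJ/dtheta_j = mean of (h - y) x_j
    return theta

# Decision boundary from the example in the text: h = g(-3 + x_1 + x_2)
theta = np.array([-3.0, 1.0, 1.0])
x = np.array([1.0, 2.0, 2.0])                  # x_0 = 1, x_1 = x_2 = 2
print(sigmoid(theta @ x) >= 0.5)               # True: -3 + 2 + 2 = 1 >= 0
```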
## Overfitting and regularization

Overfitting of the model to the data comes from the higher-order terms in the hypothesis $h_{\theta}(\boldsymbol{x})$. To avoid overfitting, we can penalize the higher-order terms in the cost function, i.e., regularization. We introduce a __regularization parameter__, $\lambda$, into the cost function; the larger $\lambda$ is, the more heavily large parameters are penalized.

__Regularized linear regression__: The regularized cost function is
$$
J(\boldsymbol{\theta}) = \frac{1}{2m} \left[ \sum_{i=1}^m (h_{\theta}(\boldsymbol{x}^{(i)})-y^{(i)})^2 + \lambda \sum_{j=1}^n \theta_j^2 \right].
$$
The learning algorithm is
$$
\begin{align}
\theta_0 &\leftarrow \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m (h_{\theta}(\boldsymbol{x}^{(i)})-y^{(i)}), \\
\theta_j &\leftarrow \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^m (h_{\theta}(\boldsymbol{x}^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j\right], \quad j=1,2,...,n.
\end{align}
$$
Note that, by convention, the bias parameter $\theta_0$ is not regularized.

__Regularized logistic regression__: The regularized cost function is
$$
J(\boldsymbol{\theta}) = \frac{1}{m}\sum_{i=1}^m[-y^{(i)}\log(h_{\theta}(\boldsymbol{x}^{(i)}))-(1-y^{(i)})\log(1-h_{\theta}(\boldsymbol{x}^{(i)}))] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2.
$$
The learning algorithm has the same expression as that of the regularized linear regression.
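A sketch of how the regularized update might look in NumPy, following the convention above that $\theta_0$ is left unpenalized; the function and variable names are my own:

```python
import numpy as np

def regularized_linear_regression(X, y, alpha=0.1, lam=1.0, n_iters=2000):
    """Gradient descent on the regularized cost. X is m x (n+1) with a
    leading column of ones; theta_0 is not penalized."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m    # unregularized part of the gradient
        grad[1:] += (lam / m) * theta[1:]   # shrink theta_1 ... theta_n only
        theta -= alpha * grad
    return theta
```

For regularized logistic regression, only the hypothesis changes: replace `X @ theta` in the gradient with the sigmoid of `X @ theta`.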
## Neural networks

When the model has too many features, i.e., the dimension of $\boldsymbol{x}$ is large, the computational complexity of the linear and logistic regressions becomes huge. This is where neural networks come into play. First of all, we may ask why neural networks are so powerful. The answer lies in the fact that a neural network can approximate any continuous function to any desired accuracy, which has been proven mathematically, that is,

> Suppose we're given a [continuous] function $f(x)$ which we'd like to compute to within some desired accuracy $\epsilon>0$. The guarantee is that by using enough hidden neurons we can always find a neural network whose output $g(x)$ satisfies:
>
> $$
> |g(x)-f(x)| < \epsilon
> $$
>
> for all input $x$. In other words, the approximation will be good to within the desired accuracy for every possible input.
> --- Michael Nielsen, Neural networks and deep learning[^1c]

### Model representation

__Biological neurons__:

<img src="https://i.imgur.com/7A4Vrxr.png" width=500>
<img src="https://i.imgur.com/QGWEmdh.png" width=500>

__Neural networks__: A neural network consists of many layers of neurons, and each neuron (or activation unit) takes some features as input and provides an output according to a learning model. Below is a single neuron with a sigmoid activation function. In neural networks, the parameters are also referred to as weights.

<img src="https://i.imgur.com/AuPJAql.png" width=400>

For the neural network shown in the figure below, the first layer is called the input layer, the last layer the output layer, and the middle layer the hidden layer. In each layer there is a bias unit.

<img src="https://i.imgur.com/FTI9IPv.png" width=500>

### Forward propagation

In order to represent the hypothesis of the neural network, we introduce the following notations:

$a_i^{(j)}$: the $i^{th}$ activation unit in the $j^{th}$ layer;

$\Theta^{(j)}$: the matrix of weights controlling the function mapping from layer $j$ to layer $j+1$; its size is $s_{j+1} \times (s_j+1)$, where $s_j$ and $s_{j+1}$ are the numbers of units in the $j^{th}$ and $(j+1)^{th}$ layers, respectively.

In the above example,
$$
\begin{align}
a_1^{(2)} &= g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3), \\
a_2^{(2)} &= g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3), \\
a_3^{(2)} &= g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3),
\end{align}
$$
and
$$
h_{\Theta}(\boldsymbol{x}) = g(\Theta^{(2)}_{10}a^{(2)}_0 + \Theta^{(2)}_{11}a^{(2)}_1 + \Theta^{(2)}_{12}a^{(2)}_2 + \Theta^{(2)}_{13}a^{(2)}_3).
$$
We can vectorize the representation as follows. Let
$$
\begin{align}
\boldsymbol{x} =
\begin{bmatrix}
x_0 \\ x_1 \\ x_2 \\ x_3
\end{bmatrix}, \quad
\boldsymbol{z}^{(2)} =
\begin{bmatrix}
z^{(2)}_1 \\ z^{(2)}_2 \\ z^{(2)}_3
\end{bmatrix}, \quad
\boldsymbol{z}^{(2)} = \Theta^{(1)} \boldsymbol{x}, \quad \boldsymbol{a}^{(2)} = g(\boldsymbol{z}^{(2)}),
\end{align}
$$
and, after prepending the bias unit $a^{(2)}_0 = 1$ to $\boldsymbol{a}^{(2)}$, let
$$
\boldsymbol{z}^{(3)} = \Theta^{(2)} \boldsymbol{a}^{(2)}, \quad h_{\Theta}(\boldsymbol{x}) = a^{(3)}_1 = g(\boldsymbol{z}^{(3)}).
$$
The units $\boldsymbol{a}^{(j)}$ in the hidden layers can be viewed as newly evolved features derived from the original features $\boldsymbol{x}$.
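The vectorized forward pass translates almost line by line into code. A minimal sketch, assuming sigmoid activations throughout; the helper `forward_propagate` and the random weights are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Thetas):
    """Vectorized forward pass. Thetas[j] has size s_{j+1} x (s_j + 1);
    x contains the raw features, without the bias unit x_0."""
    a = x
    for Theta in Thetas:
        a = np.concatenate(([1.0], a))   # prepend the bias unit a_0 = 1
        a = sigmoid(Theta @ a)           # z^(j+1) = Theta^(j) a^(j); a^(j+1) = g(z)
    return a                             # output layer, h_Theta(x)

# The 3-3-1 network from the example: Theta^(1) is 3x4, Theta^(2) is 1x4
rng = np.random.default_rng(1)
Thetas = [rng.normal(size=(3, 4)), rng.normal(size=(1, 4))]
print(forward_propagate(np.array([0.5, -1.2, 3.0]), Thetas))
```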
### Cost function

Neural network classification problems can be divided into binary classification and multi-class classification.

<img src="https://i.imgur.com/mHlpxUQ.png" width=500>

In the neural network, the output could be a vector of dimension $K$, that is, $h_{\Theta}(\boldsymbol{x}) \in \mathbb{R}^{K}$. We denote the $k^{th}$ output as $(h_{\Theta}(\boldsymbol{x}))_k$. In analogy to the cost function of logistic regression, the cost function of the neural network can be written as
$$
\begin{align}
J(\Theta) = &-\frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left( y_k^{(i)}\log(h_{\Theta}(\boldsymbol{x}^{(i)}))_{k} + (1-y_k^{(i)})\log(1-(h_{\Theta}(\boldsymbol{x}^{(i)}))_k) \right) \\
&+\frac{\lambda}{2m} \sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta^{(l)}_{ji})^2.
\end{align}
$$
(Note $m$ denotes the total number of training samples and $K$ denotes the total number of classifiers.) The cost function above is a summation of two parts: the first is a summation of the differences between the output of the hypothesis and the real value from the training sample for each class, and the second is a summation of the squared components of the weight matrices for all layers, excluding the bias terms.

### Back propagation

#### Understand from a simple case

Note: In this section the notations are different from those adopted by Andrew Ng.

To help understand what back propagation is, let us start with the simplest case, where there is only one neuron in each layer and there are no regularization terms.

<img src="https://i.imgur.com/dNSZwPe.png" width=500>

The following derivations are from the video tutorial by 3BLUE1BROWN[^3]. Another excellent introduction to the back propagation algorithm can be found [on this page](https://mc.ai/a-beginners-guide-to-deriving-and-implementing-backpropagation/)[^4].

From the forward propagation we know that, given a training sample $(x,y)$, the error between the output $h_{\theta}(x)$ and the real value $y$, or the cost function, is $C(w_1,b_1,w_2,b_2,w_3,b_3)$, where $w_i$ and $b_i$ are, respectively, the weight and bias associated with the function mapping from layer $i$ to layer $i+1$. We try to minimize the value of $C$ by tuning $w_i$ and $b_i$ according to the algorithm we discuss next.

First of all, let us only consider the last two layers and see how the value of the cost function, $C_0$, is affected by the weight $w^{(L)}$, the bias $b^{(L)}$, and the value of the activation unit $a^{(L-1)}$ from the previous layer. Here $z^{(L)} = w^{(L)}a^{(L-1)} + b^{(L)}$ and $a^{(L)} = \sigma(z^{(L)})$.

<img src="https://i.imgur.com/l58X0kd.png" width=500>

According to the chain rule, we have
$$
\frac{\partial C_0}{\partial w^{(L)}} = \frac{\partial C_0}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial w^{(L)}}.
$$
(We understand $\partial C_0$ as the change in $C_0$ due to a slight change $\partial w^{(L)}$ in $w^{(L)}$.) Since
$$
\begin{align}
\frac{\partial z^{(L)}}{\partial w^{(L)}} &= a^{(L-1)}, \\
\frac{\partial a^{(L)}}{\partial z^{(L)}} &= \sigma'\left(z^{(L)}\right),
\end{align}
$$
we have
$$
\frac{\partial C_0}{\partial w^{(L)}} = a^{(L-1)} \sigma'\left(z^{(L)}\right) \frac{\partial C_0}{\partial a^{(L)}}.
$$
Similarly,
$$
\begin{align}
\frac{\partial C_0}{\partial b^{(L)}} &= \sigma'\left(z^{(L)}\right) \frac{\partial C_0}{\partial a^{(L)}}, \\
\color{red}{\frac{\partial C_0}{\partial a^{(L-1)}}} &= w^{(L)} \sigma'\left(z^{(L)}\right) \color{red}{\frac{\partial C_0}{\partial a^{(L)}}}.
\end{align}
$$
Now we are ready to study the change of $C_0$ due to the changes of the weight $w^{(L-1)}$, the bias $b^{(L-1)}$, and the activation unit value $a^{(L-2)}$, which is where the idea of _back propagation_ comes in.

<img src="https://i.imgur.com/7XAj4Hl.png" width=500>

Again, by the chain rule we have
$$
\begin{align}
\frac{\partial C_0}{\partial w^{(L-1)}} &= \color{red}{\frac{\partial C_0}{\partial a^{(L-1)}}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} \frac{\partial z^{(L-1)}}{\partial w^{(L-1)}} \\
&= a^{(L-2)} \sigma'\left(z^{(L-1)}\right) \color{red}{\frac{\partial C_0}{\partial a^{(L-1)}}}, \\
\frac{\partial C_0}{\partial b^{(L-1)}} &= \color{red}{\frac{\partial C_0}{\partial a^{(L-1)}}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} \frac{\partial z^{(L-1)}}{\partial b^{(L-1)}} \\
&= \sigma'\left(z^{(L-1)}\right) \color{red}{\frac{\partial C_0}{\partial a^{(L-1)}}}, \\
\color{red}{\frac{\partial C_0}{\partial a^{(L-2)}}} &= \color{red}{\frac{\partial C_0}{\partial a^{(L-1)}}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} \frac{\partial z^{(L-1)}}{\partial a^{(L-2)}} \\
&= w^{(L-1)} \sigma'\left(z^{(L-1)}\right) \color{red}{\frac{\partial C_0}{\partial a^{(L-1)}}}.
\end{align}
$$
We can continue in this fashion until we find the derivatives of $C_0$ with respect to the weight $w^{(1)}$ and bias $b^{(1)}$ controlling the function mapping from the first layer to the second. With all the derivatives of $C_0$ with respect to $w_i$ and $b_i$, we are able to update them using the gradient descent method to minimize $C_0$. That is,
$$
\begin{align}
w_i &\leftarrow w_i - \alpha \frac{\partial C_0}{\partial w_i}, \\
b_i &\leftarrow b_i - \alpha \frac{\partial C_0}{\partial b_i}.
\end{align}
$$
The above derivation is for a neural network with only one neuron in each layer and for only one training sample. For a neural network composed of multiple neurons in each layer and trained on a set of data, the expressions become more complicated, but the core idea remains the same.
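The one-neuron-per-layer derivation can be implemented in a few lines. Below is a sketch assuming the quadratic cost $C_0 = (a^{(L)} - y)^2$; the function `backprop_chain` and the toy numbers are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_chain(x, y, w, b, alpha=0.5):
    """One gradient descent step for a chain of single-neuron layers.
    w[i], b[i] map layer i to layer i+1; cost C0 = (a^(L) - y)^2."""
    # Forward pass, storing every z and activation
    a, zs = [x], []
    for wi, bi in zip(w, b):
        zs.append(wi * a[-1] + bi)       # z = w * a_prev + b
        a.append(sigmoid(zs[-1]))
    # Backward pass, starting from dC0/da at the output layer
    dC_da = 2.0 * (a[-1] - y)
    for i in reversed(range(len(w))):
        sp = sigmoid(zs[i]) * (1.0 - sigmoid(zs[i]))   # sigma'(z)
        dC_dz = sp * dC_da
        dC_da = w[i] * dC_dz             # dC0/da_prev, before w[i] changes
        w[i] -= alpha * a[i] * dC_dz     # dC0/dw = a_prev * sigma'(z) * dC0/da
        b[i] -= alpha * dC_dz            # dC0/db = sigma'(z) * dC0/da
    return w, b

# Drive the output toward y = 0.25 for the single sample x = 1.0
w, b = [0.5, -0.3, 0.8], [0.1, 0.1, 0.1]
for _ in range(1000):
    w, b = backprop_chain(1.0, 0.25, w, b)
```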
#### Algorithm

Recall that for a complicated neural network trained on a large data set, we need to minimize the cost function $J(\Theta)$, which involves computing the derivative $\partial J(\Theta)/\partial \Theta^{(l)}_{ij}$. Once again we note that $l \in \{1,2,...,L-1\}$ is the layer number, $i \in \{1,2,...,s_{l+1}\}$, and $j \in \{0,1,...,s_l\}$, where $s_l$ and $s_{l+1}$ are the numbers of activation units in the $l^{th}$ and $(l+1)^{th}$ layers, respectively.

As an example, we consider a neural network with 4 layers shown in the figure below. The error between the output of the hypothesis and the real value in the last layer is
$$
\delta^{(4)}_j = a^{(4)}_j - y_j, \quad \text{or} \quad \boldsymbol{\delta}^{(4)} = \boldsymbol{a}^{(4)} - \boldsymbol{y}.
$$
We denote the "error" of unit $j$ in layer $l$ as $\delta^{(l)}_j:=\frac{\partial J}{\partial z_j^{(l)}}$. Then we can compute the "error" in the previous layers in a fashion of backward propagation. That is,
$$
\begin{align}
\boldsymbol{\delta}^{(3)} &= (\Theta^{(3)})^{\text{T}} \boldsymbol{\delta}^{(4)}.*g'(\boldsymbol{z}^{(3)}), \\
\boldsymbol{\delta}^{(2)} &= (\Theta^{(2)})^{\text{T}} \boldsymbol{\delta}^{(3)}.*g'(\boldsymbol{z}^{(2)}).
\end{align}
$$
The symbol ".*" denotes the element-wise multiplication of matrices, also known as the Hadamard product[^5]. Recall $\boldsymbol{z}^{(3)} = \Theta^{(2)} \boldsymbol{a}^{(2)}$ and $\boldsymbol{z}^{(2)} = \Theta^{(1)} \boldsymbol{a}^{(1)}$.

<img src="https://i.imgur.com/jKk8NOE.png" width=500>

It can be shown that $g'(\boldsymbol{z}^{(l)}) = \boldsymbol{a}^{(l)}.*(\boldsymbol{1}-\boldsymbol{a}^{(l)})$.

:::spoiler Proof
Recall that
$$
g(\boldsymbol{z}^{(l)}) = \frac{1}{1+e^{-\boldsymbol{z}^{(l)}}}.
$$
By taking the derivative, we have
$$
\begin{align}
g'(\boldsymbol{z}^{(l)}) &= \frac{\partial g}{\partial\boldsymbol{z}^{(l)}} \\
&= \frac{1}{1+e^{-\boldsymbol{z}^{(l)}}} .* \left( 1- \frac{1}{1+e^{-\boldsymbol{z}^{(l)}}} \right) \\
&= g(\boldsymbol{z}^{(l)}).* \left(1-g(\boldsymbol{z}^{(l)})\right) \\
&= \boldsymbol{a}^{(l)}.* \left(1-\boldsymbol{a}^{(l)}\right).
\end{align}
$$
:::

It can be further shown that when ignoring regularization ($\lambda = 0$), we have
$$
\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} = a_j^{(l)} \delta_i^{(l+1)}.
$$
The proof can be found on [this page](https://blog.innovea.tech/https-medium-com-sebastien-attia-cheat-sheet-on-neural-networks-1-30735616584a)[^6] and [this page](https://stats.stackexchange.com/questions/94387/how-to-derive-errors-in-neural-network-with-the-backpropagation-algorithm)[^7]. I reproduced the proof procedure in [this note](https://hackmd.io/@wldeng/back-propagation)[^8].

[^1a]: https://www.coursera.org/learn/machine-learning
[^1b]: http://neuralnetworksanddeeplearning.com/chap2.html
[^1c]: http://neuralnetworksanddeeplearning.com/chap4.html
[^2]: https://en.wikipedia.org/wiki/Cross_entropy
[^3]: https://www.youtube.com/watch?v=tIeHLnjs5U8
[^4]: https://mc.ai/a-beginners-guide-to-deriving-and-implementing-backpropagation/
[^5]: https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29
[^6]: https://blog.innovea.tech/https-medium-com-sebastien-attia-cheat-sheet-on-neural-networks-1-30735616584a
[^7]: https://stats.stackexchange.com/questions/94387/how-to-derive-errors-in-neural-network-with-the-backpropagation-algorithm
[^8]: https://hackmd.io/@wldeng/back-propagation
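Putting the pieces together, here is a sketch of the backward pass for one training sample with $\lambda = 0$. It is my own implementation of the formulas above, assuming sigmoid activations and the common convention of dropping the bias component of the back-propagated error:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(x, y, Thetas):
    """dJ/dTheta^(l) for a single sample (lambda = 0). Thetas is the
    0-based list of the matrices Theta^(l); x excludes the bias unit."""
    # Forward pass; keep each layer's activation together with its bias unit
    a_with_bias, a = [], x
    for Theta in Thetas:
        a = np.concatenate(([1.0], a))
        a_with_bias.append(a)
        a = sigmoid(Theta @ a)
    delta = a - y                                   # delta^(L) = a^(L) - y
    grads = [None] * len(Thetas)
    for l in reversed(range(len(Thetas))):
        grads[l] = np.outer(delta, a_with_bias[l])  # dJ/dTheta_ij = a_j delta_i
        if l > 0:
            back = (Thetas[l].T @ delta)[1:]        # drop the bias component
            a_prev = a_with_bias[l][1:]             # activations without bias
            delta = back * a_prev * (1.0 - a_prev)  # .* g'(z) = a .* (1 - a)
    return grads
```

A numerical gradient check against finite differences of $J(\Theta)$ is a good way to validate an implementation like this before trusting it for training.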