# Proof of the backpropagation algorithm
###### tags: `machine learning`
[toc]
In the [machine learning note](https://hackmd.io/@wldeng/machine-learning-note), we give a brief summary of machine learning basics and concepts. Here we present a detailed proof of the backpropagation algorithm for neural networks. The material is mainly adapted from[^1][^2].
Note that the convention I use here for $\Theta^{(l)}$ is consistent with Andrew Ng's notation, where $\Theta^{(l)}$ denotes the matrix of weights controlling the mapping from layer $l$ to layer $l+1$; in[^1] it instead denotes the mapping from layer $l-1$ to layer $l$.
## Conventions
Let us define a neural network with the following conventions:
$$
\begin{align}
L &= \text{total number of layers in the network},\\
s_l &= \text{number of units (not including bias unit) in the $l^{th}$ layer}, \\
K &= \text{number of units in the output layer, } (K=s_L), \\
\Theta_{i,j}^{(l)} &= \text{the entries of the weight matrix controlling the mapping from layer $l$ to layer $l+1$, } \\
&\qquad \text{where $i \in \{1,2,...,s_{l+1}\}$ and $j \in \{0,1,...,s_{l}\}$, i.e., $\Theta^{(l)} \in \mathbb{R}^{s_{l+1}\times (s_{l}+1)}$,}
\end{align}
$$
<img src="https://i.imgur.com/tRYlx7K.png" width=500>
and
$$
\begin{align}
\boldsymbol{a}^{(l)} &= \text{the vector of activation units in layer $l$, $\boldsymbol{a}^{(l)}\in \mathbb{R}^{s_l+1}$ with $a^{(l)}_0 = 1$ (bias unit)} \\
&\qquad\text{and $a^{(l)}_i = g(z^{(l)}_i)$ for $i = 1,2,...,s_l$}, \\
\boldsymbol{z}^{(l)} &= \Theta^{(l-1)}\boldsymbol{a}^{(l-1)}, \quad \boldsymbol{z}^{(l)} \in \mathbb{R}^{s_l}, \text{ or}, \\ z^{(l)}_i &= (\Theta^{(l-1)}_i)^\text{T}\cdot\boldsymbol{a}^{(l-1)}, \text{ for } i = 1,2,...,s_l, \text{ where } \Theta^{(l-1)}_i \in \mathbb{R}^{s_{l-1}+1}.
\end{align}
$$
We write the weight matrix $\Theta^{(l)}$ for the $l^{th}$ layer as follows:
$$
\begin{align}
\Theta^{(l)} =
\begin{bmatrix}
(\Theta^{(l)}_1)^{\text{T}}\\
(\Theta^{(l)}_2)^{\text{T}}\\
\vdots \\
(\Theta^{(l)}_{s_{l+1}})^{\text{T}}\\
\end{bmatrix}
=
\begin{bmatrix}
\Theta^{(l)}_{1,0} & \Theta^{(l)}_{1,1} & \ldots & \Theta^{(l)}_{1,s_{l}} \\
\Theta^{(l)}_{2,0} & \Theta^{(l)}_{2,1} & \ldots & \Theta^{(l)}_{2,s_{l}} \\
\vdots & \vdots & \vdots &\vdots \\
\Theta^{(l)}_{s_{l+1},0} & \Theta^{(l)}_{s_{l+1},1} & \ldots & \Theta^{(l)}_{s_{l+1},s_{l}}
\end{bmatrix}
\in
\mathbb{R}^{s_{l+1}\times (s_{l}+1)}.
\end{align}
$$
Then we can write
$$
\begin{align}
\boldsymbol{a}^{(l+1)}
=
\begin{pmatrix}
1 \\
g\left(\boldsymbol{z}^{(l+1)}\right)
\end{pmatrix}
=
\begin{pmatrix}
1 \\
g\left(\Theta^{(l)}\boldsymbol{a}^{(l)}\right)
\end{pmatrix}.
\end{align}
$$
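To make these conventions concrete, here is a minimal NumPy sketch of the forward pass. The sigmoid activation and the `Thetas` list layout are assumptions for illustration, not part of the derivation:
```python
import numpy as np

def g(z):
    """Sigmoid activation (assumed throughout these sketches)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Thetas):
    """Forward pass. Thetas[l-1] holds the note's Theta^(l), of shape
    (s_{l+1}, s_l + 1). Returns the activations a^(l) (each with the
    bias unit a_0 = 1 prepended) and the weighted inputs z^(l)."""
    a = np.concatenate(([1.0], x))         # a^(1): input with bias unit
    activations, zs = [a], []
    for Theta in Thetas:
        z = Theta @ a                      # z^(l+1) = Theta^(l) a^(l)
        zs.append(z)
        a = np.concatenate(([1.0], g(z)))  # prepend the bias unit
        activations.append(a)
    return activations, zs
```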
## Cost function
Recall that the cost function of the neural network is
$$
\begin{align}
J(\Theta) = \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \text{cost}\left(a_k^{(L)},y_k^{(i)}\right)
+\frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_{l+1}} \sum_{j=1}^{s_l}(\Theta^{(l)}_{i,j})^2,
\end{align}
$$
where $a_k^{(L)}=(h_{\Theta}(\boldsymbol{x}^{(i)}))_k = g(z^{(L)}_k)$ and
$$
\text{cost}(a_k^{(L)},y_k^{(i)}) = -\left( y_k^{(i)}\log\left((h_{\Theta}(\boldsymbol{x}^{(i)}))_{k}\right) + (1-y_k^{(i)})\log\left(1-(h_{\Theta}(\boldsymbol{x}^{(i)}))_k\right) \right).
$$
Note that the regularization term does not include the weights of the biases, $\Theta^{(l)}_{i,0}$, $i=1,2,...,s_{l+1}$.
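As a sketch, this cost translates into code as follows (reusing the hypothetical `forward` helper above; note how column $0$ of each $\Theta^{(l)}$, the bias weights, is skipped in the regularization sum):
```python
def cost(Thetas, X, Y, lam):
    """Regularized cross-entropy cost J(Theta). X holds one example per
    row, Y the corresponding target vectors; lam is lambda."""
    m = X.shape[0]
    J = 0.0
    for x, y in zip(X, Y):
        a_L = forward(x, Thetas)[0][-1][1:]   # output a^(L), bias dropped
        J += -np.sum(y * np.log(a_L) + (1 - y) * np.log(1 - a_L))
    J /= m
    # Regularization skips column 0, i.e. the bias weights Theta_{i,0}.
    J += lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return J
```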
## Algorithm
We use the gradient descent method to find the minimum of $J$, which requires computing
$$
\frac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}}.
$$
In what follows, we consider only the first term of $J(\Theta)$ (i.e., we set the regularization parameter $\lambda=0$); the partial derivative of the regularization term is easy to compute.
==__Definition of $\delta$__==
Recall that
$$
z_{i}^{(l)} = (\Theta_i^{(l-1)})^{\text{T}}\cdot\boldsymbol{a}^{(l-1)} = \sum_{p=0}^{s_{l-1}}\Theta_{i,p}^{(l-1)}a_p^{(l-1)},
$$
where $i=1,2,...,s_l$ and $l=2,3,...,L$.
Then we have
$$
\begin{align}
\frac{\partial z^{(l+1)}_k}{\partial \Theta_{i,j}^{(l)}} &= 0 \quad \text{if} \quad k \ne i,\\
\frac{\partial z^{(l+1)}_k}{\partial \Theta_{i,j}^{(l)}} &= a_j^{(l)} \quad \text{if} \quad k = i.
\end{align}
$$
Applying the chain rule, we get
$$
\frac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}} =
\sum_{k=1}^{s_{l+1}} \frac{\partial J(\Theta)}{\partial z^{(l+1)}_k} \frac{\partial z^{(l+1)}_k}{\partial \Theta_{i,j}^{(l)}}.
$$
By defining
$$
\color{red}{
\delta^{(l)}_k = \frac{\partial J(\Theta)}{\partial z^{(l)}_k}, \quad k = 1,2,...,s_l,
}
$$
i.e.,
$$
\boldsymbol{\delta}^{(l)} = \nabla_{\boldsymbol{z}^{(l)}}J(\Theta) \in \mathbb{R}^{s_l},
$$
we have
$$
\color{red}{
\frac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}} = a^{(l)}_j\delta^{(l+1)}_i,
}
$$
where $i=1,2,...,s_{l+1}$, $j=0,1,...,s_{l}$, and $l=1,2,...,L-1$.
More specifically, for the derivatives of the bias units' weights, we have:
$$
\frac{\partial J(\Theta)}{\partial \Theta_{i,0}^{(l)}} = a^{(l)}_0\delta^{(l+1)}_i =\delta^{(l+1)}_i.
$$
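In vectorized form, the boxed identity says that the gradient with respect to $\Theta^{(l)}$ is an outer product. A one-line sketch, with `delta_next` and `a_l` standing for $\boldsymbol{\delta}^{(l+1)}$ and $\boldsymbol{a}^{(l)}$ (names assumed for illustration):
```python
# Entry (i, j) of grad_l is a^(l)_j * delta^(l+1)_i; the bias column
# j = 0 comes out automatically because a^(l)_0 = 1.
grad_l = np.outer(delta_next, a_l)   # shape (s_{l+1}, s_l + 1)
```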
==__Value of $\boldsymbol{\delta}^{(L)}$__==
Now let us find $\boldsymbol{\delta}$ for the output layer (layer $L$). By the definition and the chain rule,
$$
\delta^{(L)}_i = \frac{\partial J(\Theta)}{\partial z^{(L)}_i}
= \sum_{k=1}^{s_L} \frac{\partial J(\Theta)}{\partial a^{(L)}_k} \frac{\partial a^{(L)}_k}{\partial z_{i}^{(L)}}
=\frac{\partial J(\Theta)}{\partial a^{(L)}_i} g'(z_i^{(L)}),
$$
where $i= 1,2,...,s_L$, i.e.,
$$
\color{red}{
\boldsymbol{\delta}^{(L)} = \nabla_{\boldsymbol{a}^{(L)}}J(\Theta).*g'(\boldsymbol{z}^{(L)}).
}
$$
Note we used the result
$$
\begin{align}
\frac{\partial a^{(l)}_k}{\partial z_{i}^{(l)}} =
\begin{cases}
0, &\quad \text{if } k \ne i,\\
g'(z_i^{(l)}), &\quad \text{if } k = i.
\end{cases}
\end{align}
$$
The exact form of $\nabla_{\boldsymbol{a}^{(L)}}J$ will, of course, depend on the form of the cost function. For the quadratic cost, $J = \frac{1}{2}\sum_{i=1}^{s_L}(a_i^{(L)}-y_i)^2$, we have $\nabla_{\boldsymbol{a}^{(L)}}J = \boldsymbol{a}^{(L)} - \boldsymbol{y}$, so
$$
\boldsymbol{\delta}^{(L)} = (\boldsymbol{a}^{(L)} - \boldsymbol{y}).*g'(\boldsymbol{z}^{(L)}).
$$
For the cross-entropy cost defined above with the sigmoid activation, the factor $g'(\boldsymbol{z}^{(L)}) = \boldsymbol{a}^{(L)}.*(\boldsymbol{1}-\boldsymbol{a}^{(L)})$ cancels the denominators in $\nabla_{\boldsymbol{a}^{(L)}}J$, leaving simply $\boldsymbol{\delta}^{(L)} = \boldsymbol{a}^{(L)} - \boldsymbol{y}$.
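In code (assuming sigmoid activations and the variable names of the earlier sketches), the two cases read:
```python
a_L = activations[-1][1:]   # a^(L) with the bias unit dropped
# Quadratic cost: the factor g'(z^(L)) = a^(L) .* (1 - a^(L)) remains.
delta_L_quadratic = (a_L - y) * a_L * (1 - a_L)
# Cross-entropy cost as defined above: the g' factor cancels exactly.
delta_L_crossent = a_L - y
```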
==__Relation between $\boldsymbol{\delta}^{(l)}$ and $\boldsymbol{\delta}^{(l+1)}$__==
Recall that
$$
z^{(l+1)}_k
= (\Theta_k^{(l)})^{\text{T}} \cdot \boldsymbol{a}^{(l)}
= \Theta_{k,0}^{(l)} + \sum_{p=1}^{s_{l}}\Theta_{k,p}^{(l)}g(z_p^{(l)}),
$$
since $a^{(l)}_0 = 1$ and $a^{(l)}_p = g(z^{(l)}_p)$ for $p = 1,2,...,s_l$. Then
$$
\frac{\partial z^{(l+1)}_k}{\partial z_{i}^{(l)}}
= \Theta_{k,i}^{(l)}g'(z_i^{(l)}), \quad i = 1,2,...,s_l.
$$
By definition and chain rule, we have
$$
\delta^{(l)}_i = \frac{\partial J(\Theta)}{\partial z^{(l)}_i}
= \sum_{k=1}^{s_{l+1}} \frac{\partial J(\Theta)}{\partial z^{(l+1)}_k} \frac{\partial z^{(l+1)}_k}{\partial z_{i}^{(l)}}
= \sum_{k=1}^{s_{l+1}} \delta_k^{(l+1)} \Theta_{k,i}^{(l)}g'(z_i^{(l)}),
$$
i.e.,
$$
\color{red}{
\boldsymbol{\delta}^{(l)} =
(\Theta^{(l)})^{\text{T}}
\boldsymbol{\delta}^{(l+1)}.*g'(\boldsymbol{z}^{(l)}).
}
$$
This equation appears complicated, but each element has a nice interpretation. Suppose we know the error $\boldsymbol{\delta}^{(l+1)}$ at the $(l+1)^{th}$ layer. When we apply the transposed weight matrix, $(\Theta^{(l)})^{\text{T}}$, we can think of this intuitively as moving the error backward through the network, giving us a measure of the error at the output of the $l^{th}$ layer. We then take the Hadamard product with $g'(\boldsymbol{z}^{(l)})$. This moves the error backward through the activation function in layer $l$, giving us the error $\boldsymbol{\delta}^{(l)}$ in the weighted input to layer $l$.
<img src="https://i.imgur.com/awbx6Jk.png" width=500>
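In code, the backward recursion can be sketched as follows (assuming the cross-entropy cost and sigmoid activations; component $0$ of $(\Theta^{(l)})^{\text{T}}\boldsymbol{\delta}^{(l+1)}$ corresponds to the bias unit and is dropped, since $\boldsymbol{\delta}^{(l)}$ has only $s_l$ components):
```python
def backward(activations, y, Thetas):
    """Backpropagate the error; returns [delta^(2), ..., delta^(L)]."""
    a_L = activations[-1][1:]
    deltas = [a_L - y]                       # delta^(L) for cross-entropy
    for l in range(len(Thetas) - 1, 0, -1):
        a = activations[l][1:]               # a^(l+1) without the bias unit
        gprime = a * (1 - a)                 # g'(z^(l+1)) for the sigmoid
        # (Theta^(l+1))^T delta^(l+2), dropping the bias component [0].
        delta = (Thetas[l].T @ deltas[0])[1:] * gprime
        deltas.insert(0, delta)
    return deltas
```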
### Summary: Equations of back propagation
Backpropagate the error
$$
\begin{align}
\boldsymbol{\delta}^{(L)} &= \nabla_{\boldsymbol{a}^{(L)}}J(\Theta).*g'(\boldsymbol{z}^{(L)}),\\
\boldsymbol{\delta}^{(l)} &=
(\Theta^{(l)})^{\text{T}}
\boldsymbol{\delta}^{(l+1)}.*g'(\boldsymbol{z}^{(l)}),
\end{align}
$$
and compute the derivative
$$
\frac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}} = a^{(l)}_j\delta^{(l+1)}_i,
$$
where $i=1,2,...,s_{l+1}$, $j=0,1,...,s_{l}$, and $l=1,2,...,L-1$.
Recall that for the sigmoid activation we showed[^3]
$$
g'(\boldsymbol{z}^{(l)}) = \boldsymbol{a}^{(l)}.*(\boldsymbol{1}-\boldsymbol{a}^{(l)}),
$$
where the bias unit $a^{(l)}_0$ is excluded on the right-hand side.
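Putting the three boxed equations together, a finite-difference check is a quick way to validate the derivation. The sketch below wires the hypothetical helpers from the earlier snippets into a gradient routine and compares one backpropagated derivative against a numerical one on a small random network (all sizes and names are assumptions for the example):
```python
def gradients(x, y, Thetas):
    """Unregularized dJ/dTheta^(l) for one example: forward pass,
    backward pass, then the outer-product rule for each layer."""
    activations, _ = forward(x, Thetas)
    deltas = backward(activations, y, Thetas)
    return [np.outer(d, a) for d, a in zip(deltas, activations[:-1])]

# A tiny 2-3-2 network (L = 3) with random weights.
rng = np.random.default_rng(0)
Thetas = [rng.normal(size=(3, 3)), rng.normal(size=(2, 4))]
x, y = rng.normal(size=2), np.array([1.0, 0.0])

grads = gradients(x, y, Thetas)

# Numerical derivative of J w.r.t. one entry of Theta^(1), by central differences.
eps = 1e-6
Tp = [T.copy() for T in Thetas]; Tp[0][1, 2] += eps
Tm = [T.copy() for T in Thetas]; Tm[0][1, 2] -= eps
numeric = (cost(Tp, x[None, :], y[None, :], 0.0)
           - cost(Tm, x[None, :], y[None, :], 0.0)) / (2 * eps)
print(grads[0][1, 2], numeric)   # should agree to several decimal places
```
If the two printed numbers disagree, the mismatch almost always points to an indexing error around the bias units.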
[^1]: https://blog.innovea.tech/https-medium-com-sebastien-attia-cheat-sheet-on-neural-networks-1-30735616584a
[^2]: http://neuralnetworksanddeeplearning.com/chap2.html
[^3]: https://hackmd.io/@wldeng/machine-learning-note#Algorithm