# Proof of the backpropagation algorithm
###### tags: `machine learning`
[toc]
In the [machine learning note](https://hackmd.io/@wldeng/machine-learning-note), we give a brief summary of machine learning basics and concepts. Here we present a detailed proof of the backpropagation algorithm for neural networks. The material is mainly adapted from[^1][^2].
Note that the convention I use here for $\Theta^{(l)}$ is consistent with Andrew Ng's notation, where $\Theta^{(l)}$ denotes the matrix of weights controlling the mapping from layer $l$ to layer $l+1$; in[^1] it instead denotes the mapping from layer $l-1$ to layer $l$.
## Conventions
Let us define a neural network with the following conventions:
$$
\begin{align}
L &= \text{total number of layers in the network},\\
s_l &= \text{number of units (not including bias unit) in the $l^{th}$ layer}, \\
K &= \text{number of units in the output layer, } (K=s_L), \\
\Theta_{i,j}^{(l)} &= \text{the entries of the weight matrix controlling the mapping from layer $l$ to layer $l+1$, } \\
&\qquad \text{where $i \in \{1,2,...,s_{l+1}\}$ and $j \in \{0,1,...,s_{l}\}$, i.e., $\Theta^{(l)} \in \mathbb{R}^{s_{l+1}\times (s_{l}+1)}$,}
\end{align}
$$
<img src="https://i.imgur.com/tRYlx7K.png" width=500>
and
$$
\begin{align}
\boldsymbol{a}^{(l)} &= \text{the vector of activation units in layer $l$, $\boldsymbol{a}^{(l)}\in \mathbb{R}^{s_l+1}$ with $a^{(l)}_0 = 1$ (bias unit)} \\
&\qquad\text{and $a^{(l)}_i = g(z^{(l)}_i)$ for $i = 1,2,...,s_l$}, \\
\boldsymbol{z}^{(l)} &= \Theta^{(l-1)}\boldsymbol{a}^{(l-1)}, \quad \boldsymbol{z}^{(l)} \in \mathbb{R}^{s_l}, \text{ or}, \\ z^{(l)}_i &= (\Theta^{(l-1)}_i)^\text{T}\cdot\boldsymbol{a}^{(l-1)}, \text{ for } i = 1,2,...,s_l, \text{ where } \Theta^{(l-1)}_i \in \mathbb{R}^{s_{l-1}+1}.
\end{align}
$$
We write the weight matrix $\Theta^{(l)}$ for the $l^{th}$ layer as follows:
$$
\begin{align}
\Theta^{(l)} =
\begin{bmatrix}
(\Theta^{(l)}_1)^{\text{T}}\\
(\Theta^{(l)}_2)^{\text{T}}\\
\vdots \\
(\Theta^{(l)}_{s_{l+1}})^{\text{T}}\\
\end{bmatrix}
=
\begin{bmatrix}
\Theta^{(l)}_{1,0} & \Theta^{(l)}_{1,1} & \ldots & \Theta^{(l)}_{1,s_{l}} \\
\Theta^{(l)}_{2,0} & \Theta^{(l)}_{2,1} & \ldots & \Theta^{(l)}_{2,s_{l}} \\
\vdots & \vdots & \vdots &\vdots \\
\Theta^{(l)}_{s_{l+1},0} & \Theta^{(l)}_{s_{l+1},1} & \ldots & \Theta^{(l)}_{s_{l+1},s_{l}}
\end{bmatrix}
\in
\mathbb{R}^{s_{l+1}\times (s_{l}+1)}.
\end{align}
$$
Then we can write
$$
\begin{align}
\boldsymbol{a}^{(l+1)}
=
\begin{pmatrix}
1 \\
g\left(\boldsymbol{z}^{(l+1)}\right)
\end{pmatrix}
=
\begin{pmatrix}
1 \\
g\left(\Theta^{(l)}\boldsymbol{a}^{(l)}\right)
\end{pmatrix}.
\end{align}
$$
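To make these conventions concrete, here is a minimal NumPy sketch of the forward pass. The sigmoid activation and the `Thetas` list layout are assumptions for illustration, not part of the derivation:
```python
import numpy as np

def g(z):
    """Sigmoid activation (assumed throughout these sketches)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Thetas):
    """Forward pass. Thetas[l-1] holds the note's Theta^(l), of shape
    (s_{l+1}, s_l + 1). Returns the activations a^(l) (each with the
    bias unit a_0 = 1 prepended) and the weighted inputs z^(l)."""
    a = np.concatenate(([1.0], x))         # a^(1): input with bias unit
    activations, zs = [a], []
    for Theta in Thetas:
        z = Theta @ a                      # z^(l+1) = Theta^(l) a^(l)
        zs.append(z)
        a = np.concatenate(([1.0], g(z)))  # prepend the bias unit
        activations.append(a)
    return activations, zs
```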
## Cost function
Recall that the cost function of the neural network is
$$
\begin{align}
J(\Theta) = \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \text{cost}\left(a_k^{(L)},y_k^{(i)}\right)
+\frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_{l+1}} \sum_{j=1}^{s_l}(\Theta^{(l)}_{i,j})^2,
\end{align}
$$
where $a_k^{(L)}=(h_{\Theta}(\boldsymbol{x}^{(i)}))_k = g(z^{(L)}_k)$ and
$$
\text{cost}(a_k^{(L)},y_k^{(i)}) = -\left( y_k^{(i)}\log\left((h_{\Theta}(\boldsymbol{x}^{(i)}))_{k}\right) + (1-y_k^{(i)})\log\left(1-(h_{\Theta}(\boldsymbol{x}^{(i)}))_k\right) \right).
$$
Note that the regularization term does not include the weights of the biases, $\Theta^{(l)}_{i,0}$, $i=1,2,...,s_{l+1}$.
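As a sketch, this cost translates into code as follows (reusing the hypothetical `forward` helper above; note how column $0$ of each $\Theta^{(l)}$, the bias weights, is skipped in the regularization sum):
```python
def cost(Thetas, X, Y, lam):
    """Regularized cross-entropy cost J(Theta). X holds one example per
    row, Y the corresponding target vectors; lam is lambda."""
    m = X.shape[0]
    J = 0.0
    for x, y in zip(X, Y):
        a_L = forward(x, Thetas)[0][-1][1:]   # output a^(L), bias dropped
        J += -np.sum(y * np.log(a_L) + (1 - y) * np.log(1 - a_L))
    J /= m
    # Regularization skips column 0, i.e. the bias weights Theta_{i,0}.
    J += lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return J
```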
## Algorithm
We use the gradient descent method to find the minimum of $J$, which requires computing
$$
\frac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}}.
$$
In what follows, we consider only the first term of $J(\Theta)$ (i.e., we set the regularization parameter $\lambda=0$); the partial derivative of the regularization term is easy to compute.
==__Definition of $\delta$__==
Recall that
$$
z_{i}^{(l)} = (\Theta_i^{(l-1)})^{\text{T}}\cdot\boldsymbol{a}^{(l-1)} = \sum_{p=0}^{s_{l-1}}\Theta_{i,p}^{(l-1)}a_p^{(l-1)},
$$
where $i=1,2,...,s_l$ and $l=2,3,...,L$.
Then we have
$$
\begin{align}
\frac{\partial z^{(l+1)}_k}{\partial \Theta_{i,j}^{(l)}} &= 0 \quad \text{if} \quad k \ne i,\\
\frac{\partial z^{(l+1)}_k}{\partial \Theta_{i,j}^{(l)}} &= a_j^{(l)} \quad \text{if} \quad k = i.
\end{align}
$$
Applying the chain rule, we get
$$
\frac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}} =
\sum_{k=1}^{s_{l+1}} \frac{\partial J(\Theta)}{\partial z^{(l+1)}_k} \frac{\partial z^{(l+1)}_k}{\partial \Theta_{i,j}^{(l)}}.
$$
By defining
$$
\color{red}{
\delta^{(l)}_k = \frac{\partial J(\Theta)}{\partial z^{(l)}_k}, \quad k = 1,2,...,s_l,
}
$$
i.e.,
$$
\boldsymbol{\delta}^{(l)} = \nabla_{\boldsymbol{z}^{(l)}}J(\Theta) \in \mathbb{R}^{s_l},
$$
we have
$$
\color{red}{
\frac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}} = a^{(l)}_j\delta^{(l+1)}_i,
}
$$
where $i=1,2,...,s_{l+1}$, $j=0,1,...,s_{l}$, and $l=1,2,...,L-1$.
More specifically, for the derivatives of the bias units' weights, we have:
$$
\frac{\partial J(\Theta)}{\partial \Theta_{i,0}^{(l)}} = a^{(l)}_0\delta^{(l+1)}_i =\delta^{(l+1)}_i.
$$
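In vectorized form, the boxed identity says that the gradient with respect to $\Theta^{(l)}$ is an outer product. A one-line sketch, with `delta_next` and `a_l` standing for $\boldsymbol{\delta}^{(l+1)}$ and $\boldsymbol{a}^{(l)}$ (names assumed for illustration):
```python
# Entry (i, j) of grad_l is a^(l)_j * delta^(l+1)_i; the bias column
# j = 0 comes out automatically because a^(l)_0 = 1.
grad_l = np.outer(delta_next, a_l)   # shape (s_{l+1}, s_l + 1)
```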
==__Value of $\boldsymbol{\delta}^{(L)}$__==
Now let us find $\boldsymbol{\delta}$ for the output layer (layer $L$). By the definition and the chain rule,
$$
\delta^{(L)}_i = \frac{\partial J(\Theta)}{\partial z^{(L)}_i}
= \sum_{k=1}^{s_L} \frac{\partial J(\Theta)}{\partial a^{(L)}_k} \frac{\partial a^{(L)}_k}{\partial z_{i}^{(L)}}
=\frac{\partial J(\Theta)}{\partial a^{(L)}_i} g'(z_i^{(L)}),
$$
where $i= 1,2,...,s_L$, i.e.,
$$
\color{red}{
\boldsymbol{\delta}^{(L)} = \nabla_{\boldsymbol{a}^{(L)}}J(\Theta).*g'(\boldsymbol{z}^{(L)}).
}
$$
Note we used the result
$$
\begin{align}
\frac{\partial a^{(l)}_k}{\partial z_{i}^{(l)}} =
\begin{cases}
0, &\quad \text{if } k \ne i,\\
g'(z_i^{(l)}), &\quad \text{if } k = i.
\end{cases}
\end{align}
$$
The exact form of $\nabla_{\boldsymbol{a}^{(L)}}J$ will, of course, depend on the form of the cost function. For the quadratic cost, $J = \frac{1}{2}\sum_{i=1}^{s_L}(a_i^{(L)}-y_i)^2$, we have $\nabla_{\boldsymbol{a}^{(L)}}J = \boldsymbol{a}^{(L)} - \boldsymbol{y}$, so
$$
\boldsymbol{\delta}^{(L)} = (\boldsymbol{a}^{(L)} - \boldsymbol{y}).*g'(\boldsymbol{z}^{(L)}).
$$
For the cross-entropy cost defined above with the sigmoid activation, the factor $g'(\boldsymbol{z}^{(L)}) = \boldsymbol{a}^{(L)}.*(\boldsymbol{1}-\boldsymbol{a}^{(L)})$ cancels the denominators in $\nabla_{\boldsymbol{a}^{(L)}}J$, leaving simply $\boldsymbol{\delta}^{(L)} = \boldsymbol{a}^{(L)} - \boldsymbol{y}$.
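In code (assuming sigmoid activations and the variable names of the earlier sketches), the two cases read:
```python
a_L = activations[-1][1:]   # a^(L) with the bias unit dropped
# Quadratic cost: the factor g'(z^(L)) = a^(L) .* (1 - a^(L)) remains.
delta_L_quadratic = (a_L - y) * a_L * (1 - a_L)
# Cross-entropy cost as defined above: the g' factor cancels exactly.
delta_L_crossent = a_L - y
```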
==__Relation between $\boldsymbol{\delta}^{(l)}$ and $\boldsymbol{\delta}^{(l+1)}$__==
Recall that
$$
z^{(l+1)}_k
= (\Theta_k^{(l)})^{\text{T}} \cdot \boldsymbol{a}^{(l)}
= \Theta_{k,0}^{(l)} + \sum_{p=1}^{s_{l}}\Theta_{k,p}^{(l)}g(z_p^{(l)}),
$$
since $a^{(l)}_0 = 1$ and $a^{(l)}_p = g(z^{(l)}_p)$ for $p = 1,2,...,s_l$. Then
$$
\frac{\partial z^{(l+1)}_k}{\partial z_{i}^{(l)}}
= \Theta_{k,i}^{(l)}g'(z_i^{(l)}), \quad i = 1,2,...,s_l.
$$
By definition and chain rule, we have
$$
\delta^{(l)}_i = \frac{\partial J(\Theta)}{\partial z^{(l)}_i}
= \sum_{k=1}^{s_{l+1}} \frac{\partial J(\Theta)}{\partial z^{(l+1)}_k} \frac{\partial z^{(l+1)}_k}{\partial z_{i}^{(l)}}
= \sum_{k=1}^{s_{l+1}} \delta_k^{(l+1)} \Theta_{k,i}^{(l)}g'(z_i^{(l)}),
$$
i.e.,
$$
\color{red}{
\boldsymbol{\delta}^{(l)} =
(\Theta^{(l)})^{\text{T}}
\boldsymbol{\delta}^{(l+1)}.*g'(\boldsymbol{z}^{(l)}).
}
$$
This equation appears complicated, but each element has a nice interpretation. Suppose we know the error $\boldsymbol{\delta}^{(l+1)}$ at the $(l+1)^{th}$ layer. When we apply the transposed weight matrix, $(\Theta^{(l)})^{\text{T}}$, we can think of this intuitively as moving the error backward through the network, giving us a measure of the error at the output of the $l^{th}$ layer. We then take the Hadamard product with $g'(\boldsymbol{z}^{(l)})$. This moves the error backward through the activation function in layer $l$, giving us the error $\boldsymbol{\delta}^{(l)}$ in the weighted input to layer $l$.
<img src="https://i.imgur.com/awbx6Jk.png" width=500>
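In code, the backward recursion can be sketched as follows (assuming the cross-entropy cost and sigmoid activations; component $0$ of $(\Theta^{(l)})^{\text{T}}\boldsymbol{\delta}^{(l+1)}$ corresponds to the bias unit and is dropped, since $\boldsymbol{\delta}^{(l)}$ has only $s_l$ components):
```python
def backward(activations, y, Thetas):
    """Backpropagate the error; returns [delta^(2), ..., delta^(L)]."""
    a_L = activations[-1][1:]
    deltas = [a_L - y]                       # delta^(L) for cross-entropy
    for l in range(len(Thetas) - 1, 0, -1):
        a = activations[l][1:]               # a^(l+1) without the bias unit
        gprime = a * (1 - a)                 # g'(z^(l+1)) for the sigmoid
        # (Theta^(l+1))^T delta^(l+2), dropping the bias component [0].
        delta = (Thetas[l].T @ deltas[0])[1:] * gprime
        deltas.insert(0, delta)
    return deltas
```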
### Summary: Equations of back propagation
Backpropagate the error
$$
\begin{align}
\boldsymbol{\delta}^{(L)} &= \nabla_{\boldsymbol{a}^{(L)}}J(\Theta).*g'(\boldsymbol{z}^{(L)}),\\
\boldsymbol{\delta}^{(l)} &=
(\Theta^{(l)})^{\text{T}}
\boldsymbol{\delta}^{(l+1)}.*g'(\boldsymbol{z}^{(l)}),
\end{align}
$$
and compute the derivative
$$
\frac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}} = a^{(l)}_j\delta^{(l+1)}_i,
$$
where $i=1,2,...,s_{l+1}$, $j=0,1,...,s_{l}$, and $l=1,2,...,L-1$.
Recall that for the sigmoid activation we showed[^3]
$$
g'(\boldsymbol{z}^{(l)}) = \boldsymbol{a}^{(l)}.*(\boldsymbol{1}-\boldsymbol{a}^{(l)}),
$$
where the bias unit $a^{(l)}_0$ is excluded on the right-hand side.
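Putting the three boxed equations together, a finite-difference check is a quick way to validate the derivation. The sketch below wires the hypothetical helpers from the earlier snippets into a gradient routine and compares one backpropagated derivative against a numerical one on a small random network (all sizes and names are assumptions for the example):
```python
def gradients(x, y, Thetas):
    """Unregularized dJ/dTheta^(l) for one example: forward pass,
    backward pass, then the outer-product rule for each layer."""
    activations, _ = forward(x, Thetas)
    deltas = backward(activations, y, Thetas)
    return [np.outer(d, a) for d, a in zip(deltas, activations[:-1])]

# A tiny 2-3-2 network (L = 3) with random weights.
rng = np.random.default_rng(0)
Thetas = [rng.normal(size=(3, 3)), rng.normal(size=(2, 4))]
x, y = rng.normal(size=2), np.array([1.0, 0.0])

grads = gradients(x, y, Thetas)

# Numerical derivative of J w.r.t. one entry of Theta^(1), by central differences.
eps = 1e-6
Tp = [T.copy() for T in Thetas]; Tp[0][1, 2] += eps
Tm = [T.copy() for T in Thetas]; Tm[0][1, 2] -= eps
numeric = (cost(Tp, x[None, :], y[None, :], 0.0)
           - cost(Tm, x[None, :], y[None, :], 0.0)) / (2 * eps)
print(grads[0][1, 2], numeric)   # should agree to several decimal places
```
If the two printed numbers disagree, the mismatch almost always points to an indexing error around the bias units.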
[^1]: https://blog.innovea.tech/https-medium-com-sebastien-attia-cheat-sheet-on-neural-networks-1-30735616584a
[^2]: http://neuralnetworksanddeeplearning.com/chap2.html
[^3]: https://hackmd.io/@wldeng/machine-learning-note#Algorithm