# Batch Normalization
###### tags: `Data Science`, `Deep Neural Network`
## Motivation
* Stochastic gradient descent requires careful tuning of the model hyper-parameters, specifically the __learning rate__ and the __initial parameter values__.
* __(Internal) Covariate shift__: the change in the distribution of each hidden layer's inputs is a problem because the layers must continuously adapt to a new input distribution. Consider the loss $l=F_2(F_1(u, \Theta_1), \Theta_2)$ and let $\mathbf{x}=F_1(u, \Theta_1)$; the gradient-descent update of the parameters $\Theta_2$ is:
\begin{equation}
\Theta_2:=\Theta_2-\frac{\eta}{m}\sum_{i=1}^{m}\frac{\partial F_2{(x_i, \Theta_2)}}{\partial \Theta_2}
\end{equation}where $m$ is the mini-batch size and $\eta$ is the learning rate. When the distribution of $\mathbf{x}$ changes drastically, $\Theta_2$ has to readjust to compensate, which makes training unstable.
* __Gradient vanishing__: with a sigmoid activation function, gradients vanish because only a small range of the input domain has a non-negligible gradient, which makes gradient updates extremely inefficient.
## Contribution
* The authors propose an algorithm that __reduces internal covariate shift__ and, in doing so, __enables higher learning rates__ for deep neural nets.
* The authors also mention that the method regularizes the model, improving its generalization (see _Section 3.3_ of the original paper).
## Methods
### Definitions
* $l \in \{1, 2, \dots, L\}$: the index of the (input and hidden) layers.
* $\mathbf{x}^{(l)}$: the inputs to activation layer $l$ over a mini-batch (i.e., the outputs of the previous layer's activation function).
* $m$: mini-batch size
* $\mu^{(l)}$: $\frac{1}{m}\sum_{i=1}^mx_i^{(l)}$
* $\sigma^{2(l)}$: $\frac{1}{m}\sum_{i=1}^m(x_i^{(l)}-\mu^{(l)})^2$
* $\epsilon$: constant to ensure numerical stability.
### Formulation
The authors proposed to normalize __each dimension independently__ (for each mini-batch):
\begin{equation}
\hat{x}^{(l)}_i=\frac{x^{(l)}_i-\mu^{(l)}}{\sqrt{\sigma^{2(l)}+\epsilon}}
\end{equation}Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the _linear regime_ of the nonlinearity. To address this, the authors introduce, for each activation input $\mathbf{x}^{(l)}$, a pair of __trainable__ parameters $\gamma^{(l)}, \beta^{(l)}$:
\begin{equation}
y^{(l)}_i=\gamma^{(l)}\hat{x}^{(l)}_i+\beta^{(l)}
\end{equation}
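To make the per-dimension normalization and the trainable scale/shift concrete, below is a minimal NumPy sketch of the training-time forward pass for one layer. The function name `batchnorm_forward` and the returned `cache` are illustrative choices, not from the paper.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for one layer.

    x     : (m, d) mini-batch of layer inputs (m examples, d dimensions)
    gamma : (d,) trainable scale
    beta  : (d,) trainable shift
    """
    mu = x.mean(axis=0)                      # per-dimension mean mu^(l)
    var = x.var(axis=0)                      # per-dimension variance sigma^2(l)
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize each dimension independently
    y = gamma * x_hat + beta                 # scale and shift
    cache = (x, x_hat, mu, var, gamma, eps)  # saved for the backward pass
    return y, cache
```

During training, `y, cache = batchnorm_forward(x_batch, gamma, beta)` would be applied before the nonlinearity of each normalized layer.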
### Backpropagation
During training, the gradients of the loss (denoted $l$ in the numerators below) are propagated through the normalization as follows:
\begin{align}
\frac{\partial l}{\partial \hat{x}_i^{(l)}}&=\frac{\partial l}{\partial y_i^{(l)}}\gamma^{(l)}\\
\frac{\partial l}{\partial \sigma^{2(l)}}&=\sum_{i=1}^m\frac{\partial l}{\partial \hat{x}_i^{(l)}}(x^{(l)}_i-\mu^{(l)})\frac{-1}{2}(\sigma^{2(l)}+\epsilon)^{-3/2}\\
\frac{\partial l}{\partial \mu^{(l)}}&=\sum_{i=1}^m\frac{\partial l}{\partial \hat{x}_i^{(l)}}\frac{-1}{\sqrt{\sigma^{2(l)}+\epsilon}}+\frac{\partial l}{\partial\sigma^{2(l)}}\frac{\sum_{i=1}^m-2(x^{(l)}_i-\mu^{(l)})}{m}\\
\frac{\partial l}{\partial x^{(l)}_i}&=\frac{\partial l}{\partial \hat{x}_i^{(l)}}\frac{1}{\sqrt{\sigma^{2(l)}+\epsilon}}+\frac{\partial l}{\partial\sigma^{2(l)}}\frac{2(x^{(l)}_i-\mu^{(l)})}{m}+\frac{\partial l}{\partial \mu^{(l)}}\frac{1}{m}\\
\frac{\partial l}{\partial \gamma^{(l)}}&=\sum_{i=1}^{m}\frac{\partial l}{\partial y_i^{(l)}}\hat{x}^{(l)}_i\\
\frac{\partial l}{\partial \beta^{(l)}}&=\sum_{i=1}^{m}\frac{\partial l}{\partial y_i^{(l)}}
\end{align}
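The equations above translate line by line into a backward pass. A sketch, assuming the `cache` returned by the `batchnorm_forward` sketch above:

```python
def batchnorm_backward(dy, cache):
    """Gradients of the loss w.r.t. x, gamma, and beta.

    dy    : (m, d) gradient of the loss w.r.t. y
    cache : values saved by batchnorm_forward
    """
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    std_inv = 1.0 / np.sqrt(var + eps)

    dx_hat = dy * gamma                                            # dl/dx_hat
    dvar = np.sum(dx_hat * (x - mu) * -0.5 * std_inv**3, axis=0)   # dl/dsigma^2
    dmu = (np.sum(dx_hat * -std_inv, axis=0)
           + dvar * np.mean(-2.0 * (x - mu), axis=0))              # dl/dmu
    dx = dx_hat * std_inv + dvar * 2.0 * (x - mu) / m + dmu / m    # dl/dx_i
    dgamma = np.sum(dy * x_hat, axis=0)                            # dl/dgamma
    dbeta = np.sum(dy, axis=0)                                     # dl/dbeta
    return dx, dgamma, dbeta
```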
### Testing
In the testing stage, the authors suggest using the mean and variance computed over the __entire population__ rather than the mini-batch statistics:
\begin{equation}
\hat{x}^{(l)}=\frac{x^{(l)}-E[x^{(l)}]}{\sqrt{Var[x^{(l)}]+\epsilon}}
\end{equation}where $Var[x^{(l)}]=\frac{m}{m-1}E_B[\sigma^{2(l)}_B]$ and $E_B[\sigma^{2(l)}_B]$ is the expectation of the mini-batch variances over all training mini-batches $B$ (the factor $\frac{m}{m-1}$ yields an unbiased variance estimate). The normalized value is then composed with the same scaling and shifting parameters:
\begin{equation}
y^{(l)}_i=\gamma^{(l)}\hat{x}^{(l)}_i+\beta^{(l)}
\end{equation}
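At test time, a sketch of the same transform with fixed statistics; `pop_mean` and `pop_var` are assumed to have been estimated from the training mini-batches with the unbiased $\frac{m}{m-1}$ correction above (most frameworks instead track running averages during training):

```python
def batchnorm_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Inference-time normalization with fixed population statistics."""
    x_hat = (x - pop_mean) / np.sqrt(pop_var + eps)
    return gamma * x_hat + beta

# pop_mean: average of the per-mini-batch means mu_B over training
# pop_var : m / (m - 1) * average of the per-mini-batch variances sigma^2_B
```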
## Results
* MNIST accuracy

* ImageNet accuracy

## Reference
* Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167.