# Batch Normalization
###### tags: `Data Science`, `Deep Neural Network`
## Motivation
* Stochastic gradient descent requires careful tuning of the model hyper-parameters, specifically the __learning rate__ and the __initial parameter values__.
* __(Internal) Covariate shift__: the change in the distribution of each hidden layer's inputs is a problem because the layers must continuously adapt to a new input distribution. Consider the loss $l=F_2(F_1(u, \Theta_1), \Theta_2)$ and let $\mathbf{x}=F_1(u, \Theta_1)$; the gradient-descent update of the parameters $\Theta_2$ is:
\begin{equation}
\Theta_2:=\Theta_2-\frac{\eta}{m}\sum_{i=1}^{m}\frac{\partial F_2{(x_i, \Theta_2)}}{\partial \Theta_2}
\end{equation}where $m$ is the mini-batch size and $\eta$ is the learning rate. When the distribution of $\mathbf{x}$ changes drastically, $\Theta_2$ has to readjust to compensate, which makes training unstable.
* __Gradient vanishing__: with a sigmoid activation function, gradients vanish because only a small range of the input domain has a non-negligible gradient, which makes gradient updates extremely inefficient.
## Contribution
* The authors propose an algorithm that __reduces internal covariate shift__ and, in doing so, __enables higher learning rates__ for deep neural nets.
* The authors also mention that the method regularizes the model, improving its generalization (see _Section 3.3_ of the original paper).
## Methods
### Definitions
* $l \in \{1, 2, \dots, L\}$: the index of the (input and hidden) layers.
* $\mathbf{x}^{(l)}$: the inputs to activation layer $l$ over a mini-batch (i.e., the outputs of the previous layer's activation function).
* $m$: mini-batch size
* $\mu^{(l)}$: $\frac{1}{m}\sum_{i=1}^mx_i^{(l)}$
* $\sigma^{2(l)}$: $\frac{1}{m}\sum_{i=1}^m(x_i^{(l)}-\mu^{(l)})^2$
* $\epsilon$: constant to ensure numerical stability.
### Formulation
The authors proposed to normalize __each dimension independently__ (for each mini-batch):
\begin{equation}
\hat{x}^{(l)}_i=\frac{x^{(l)}_i-\mu^{(l)}}{\sqrt{\sigma^{2(l)}+\epsilon}}
\end{equation}Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the _linear regime_ of the nonlinearity. To address this, the authors introduce, for each activation input $\mathbf{x}^{(l)}$, a pair of __trainable__ parameters $\gamma^{(l)}, \beta^{(l)}$:
\begin{equation}
y^{(l)}_i=\gamma^{(l)}\hat{x}^{(l)}_i+\beta^{(l)}
\end{equation}
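To make the per-dimension normalization and the trainable scale/shift concrete, below is a minimal NumPy sketch of the training-time forward pass for one layer. The function name `batchnorm_forward` and the returned `cache` are illustrative choices, not from the paper.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for one layer.

    x     : (m, d) mini-batch of layer inputs (m examples, d dimensions)
    gamma : (d,) trainable scale
    beta  : (d,) trainable shift
    """
    mu = x.mean(axis=0)                      # per-dimension mean mu^(l)
    var = x.var(axis=0)                      # per-dimension variance sigma^2(l)
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize each dimension independently
    y = gamma * x_hat + beta                 # scale and shift
    cache = (x, x_hat, mu, var, gamma, eps)  # saved for the backward pass
    return y, cache
```

During training, `y, cache = batchnorm_forward(x_batch, gamma, beta)` would be applied before the nonlinearity of each normalized layer.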
### Backpropagation
During training, the gradients of the loss (denoted $l$ in the numerators below) are propagated through the normalization as follows:
\begin{align}
\frac{\partial l}{\partial \hat{x}_i^{(l)}}&=\frac{\partial l}{\partial y_i^{(l)}}\gamma^{(l)}\\
\frac{\partial l}{\partial \sigma^{2(l)}}&=\sum_{i=1}^m\frac{\partial l}{\partial \hat{x}_i^{(l)}}(x^{(l)}_i-\mu^{(l)})\frac{-1}{2}(\sigma^{2(l)}+\epsilon)^{-3/2}\\
\frac{\partial l}{\partial \mu^{(l)}}&=\sum_{i=1}^m\frac{\partial l}{\partial \hat{x}_i^{(l)}}\frac{-1}{\sqrt{\sigma^{2(l)}+\epsilon}}+\frac{\partial l}{\partial\sigma^{2(l)}}\frac{\sum_{i=1}^m-2(x^{(l)}_i-\mu^{(l)})}{m}\\
\frac{\partial l}{\partial x^{(l)}_i}&=\frac{\partial l}{\partial \hat{x}_i^{(l)}}\frac{1}{\sqrt{\sigma^{2(l)}+\epsilon}}+\frac{\partial l}{\partial\sigma^{2(l)}}\frac{2(x^{(l)}_i-\mu^{(l)})}{m}+\frac{\partial l}{\partial \mu^{(l)}}\frac{1}{m}\\
\frac{\partial l}{\partial \gamma^{(l)}}&=\sum_{i=1}^{m}\frac{\partial l}{\partial y_i^{(l)}}\hat{x}^{(l)}_i\\
\frac{\partial l}{\partial \beta^{(l)}}&=\sum_{i=1}^{m}\frac{\partial l}{\partial y_i^{(l)}}
\end{align}
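The equations above translate line by line into a backward pass. A sketch, assuming the `cache` returned by the `batchnorm_forward` sketch above:

```python
def batchnorm_backward(dy, cache):
    """Gradients of the loss w.r.t. x, gamma, and beta.

    dy    : (m, d) gradient of the loss w.r.t. y
    cache : values saved by batchnorm_forward
    """
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    std_inv = 1.0 / np.sqrt(var + eps)

    dx_hat = dy * gamma                                            # dl/dx_hat
    dvar = np.sum(dx_hat * (x - mu) * -0.5 * std_inv**3, axis=0)   # dl/dsigma^2
    dmu = (np.sum(dx_hat * -std_inv, axis=0)
           + dvar * np.mean(-2.0 * (x - mu), axis=0))              # dl/dmu
    dx = dx_hat * std_inv + dvar * 2.0 * (x - mu) / m + dmu / m    # dl/dx_i
    dgamma = np.sum(dy * x_hat, axis=0)                            # dl/dgamma
    dbeta = np.sum(dy, axis=0)                                     # dl/dbeta
    return dx, dgamma, dbeta
```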
### Testing
In the testing stage, the authors suggest using the mean and variance computed over the __entire population__ rather than the mini-batch statistics:
\begin{equation}
\hat{x}^{(l)}=\frac{x^{(l)}-E[x^{(l)}]}{\sqrt{Var[x^{(l)}]+\epsilon}}
\end{equation}where $Var[x^{(l)}]=\frac{m}{m-1}E_B[\sigma^{2(l)}_B]$ and $E_B[\sigma^{2(l)}_B]$ is the expectation of the mini-batch variances over all training mini-batches $B$ (the factor $\frac{m}{m-1}$ yields an unbiased variance estimate). The normalized value is then composed with the same scaling and shifting parameters:
\begin{equation}
y^{(l)}_i=\gamma^{(l)}\hat{x}^{(l)}_i+\beta^{(l)}
\end{equation}
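At test time, a sketch of the same transform with fixed statistics; `pop_mean` and `pop_var` are assumed to have been estimated from the training mini-batches with the unbiased $\frac{m}{m-1}$ correction above (most frameworks instead track running averages during training):

```python
def batchnorm_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Inference-time normalization with fixed population statistics."""
    x_hat = (x - pop_mean) / np.sqrt(pop_var + eps)
    return gamma * x_hat + beta

# pop_mean: average of the per-mini-batch means mu_B over training
# pop_var : m / (m - 1) * average of the per-mini-batch variances sigma^2_B
```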
## Results
* MNIST accuracy

* ImageNet accuracy

## Reference
* Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167.