# Dropout

###### tags: `Data Science`, `Deep Neural Network`

## Motivation

* Overfitting is a serious problem in deep neural networks.
* Ensembling many models helps to avoid overfitting, but training and evaluating an ensemble of large neural networks is too slow and expensive to be practical.

## Contribution

* The authors propose an algorithm that prevents overfitting and provides a way of approximately combining exponentially many different neural network architectures efficiently.

## Methods

### Definitions

* $l \in \{1, 2, \dots, L\}$: the index of the hidden layers of the network.
* $\mathbf{z}^{(l)}$: the vector of inputs into layer $l$.
* $\mathbf{y}^{(l)}$: the vector of outputs from layer $l$ ($\mathbf{y}^{(0)}$ is the input).
* $\mathbf{W}^{(l)}$: the weights of layer $l$.
* $\mathbf{b}^{(l)}$: the biases of layer $l$.

### Formulation

![](https://i.imgur.com/xv4oOgA.png)

The feed-forward operation for a particular unit $i$ in layer $l+1$ of a standard deep neural network can be formulated as follows:

\begin{align}
z^{(l+1)}_i &= \mathbf{w}^{(l+1)}_i \mathbf{y}^{(l)} + b_i^{(l+1)} \\
y^{(l+1)}_i &= f(z^{(l+1)}_i)
\end{align}

where $f$ is any activation function. With dropout, the feed-forward operation becomes:

\begin{align}
r_j^{(l)} &\sim \text{Bernoulli}(p) \\
\tilde{\mathbf{y}}^{(l)} &= \mathbf{r}^{(l)} \odot \mathbf{y}^{(l)} \\
z^{(l+1)}_i &= \mathbf{w}^{(l+1)}_i \tilde{\mathbf{y}}^{(l)} + b_i^{(l+1)} \\
y^{(l+1)}_i &= f(z^{(l+1)}_i)
\end{align}

where $\odot$ denotes an element-wise product and $p$ is the probability that a unit is retained.

### Backpropagation and Testing

![](https://i.imgur.com/KxIHOmz.png)

During training, the retention probability $p$ of the Bernoulli distribution is a hyperparameter chosen by the user, and the mask $\mathbf{r}^{(l)}$ is then sampled from that distribution. Backpropagation is performed exactly as in a standard deep neural network, but only through the resulting thinned network.

At test time, no units are dropped; instead the weights are scaled as $\mathbf{W}_{test} = p\mathbf{W}_{train}$, so that the expected input to each unit matches what it received during training. (A minimal NumPy sketch of the training- and test-time feed-forward is given at the end of this note.)

### Backpropagation (Mini-batch Stochastic Gradient Descent)

Backpropagation for mini-batch SGD can be summarized as follows:

* For each training case (sample) in a mini-batch, a thinned network is sampled by dropping out units.
* Forward and backpropagation for that training case are done only on this thinned network.
* The gradients for each parameter are __averaged__ over the training cases in each mini-batch.
* Any training case which does not use a parameter contributes a gradient of __zero__ for that parameter (see the second sketch at the end of this note).

The pseudocode for backpropagation over mini-batches (names such as `sample_mask`, `forward`, `backward`, `average`, and `update` are placeholders, not functions from the paper):

```python
for batch in mini_batches:
    grads = []
    for sample in batch:
        mask = sample_mask(p)                         # sample a thinned network by dropping out units
        activations = forward(sample, weights, mask)  # feed-forward on the thinned network only
        grads.append(backward(activations, sample, weights, mask))  # dropped units get zero gradient
    avg_grad = average(grads)                         # average the gradients over the mini-batch
    weights = update(weights, avg_grad)               # one SGD step with the averaged gradient
```

## Results

![](https://i.imgur.com/35f47Ci.png)

## Reference

* Nitish Srivastava et al. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15: 1929–1958.
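
## Appendix: NumPy Sketches

The following is a minimal sketch of the dropout feed-forward described above, written in NumPy. It is not the authors' implementation; the function names, the toy layer sizes, and the choice of ReLU as the activation are assumptions made for illustration. Here `p` is the probability of *retaining* a unit, which matches the test-time scaling $\mathbf{W}_{test}=p\mathbf{W}_{train}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward_train(y_prev, W, b, p, f):
    """Training-time feed-forward for one layer with dropout applied to its inputs."""
    r = rng.binomial(1, p, size=y_prev.shape)  # r_j ~ Bernoulli(p); 1 means unit j is kept
    y_thin = r * y_prev                        # element-wise mask: thinned outputs of layer l
    z = W @ y_thin + b                         # z = W y~ + b
    return f(z), r                             # return the mask so backprop can reuse it

def dropout_forward_test(y_prev, W, b, p, f):
    """Test-time feed-forward: no units are dropped, weights are scaled by p."""
    z = (p * W) @ y_prev + b                   # W_test = p * W_train
    return f(z)

# Toy usage with made-up layer sizes.
relu = lambda z: np.maximum(z, 0.0)
y_prev = rng.normal(size=5)                    # outputs of the previous layer
W, b = rng.normal(size=(3, 5)), np.zeros(3)
y_train, mask = dropout_forward_train(y_prev, W, b, p=0.5, f=relu)
y_test = dropout_forward_test(y_prev, W, b, p=0.5, f=relu)
```

At test time the scaled network uses all units, so a single forward pass approximates averaging the predictions of the exponentially many thinned networks sampled during training.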
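
The second sketch illustrates the mini-batch rule above on a deliberately tiny example: a single linear layer with a squared-error loss (an assumption made for brevity, not the paper's architecture). Each training case samples its own thinned layer, the weight columns of dropped inputs receive a gradient of exactly zero, and the per-case gradients are averaged before the SGD update.

```python
import numpy as np

rng = np.random.default_rng(1)

def per_case_grad(W, b, y, t, p):
    """Gradient of 0.5 * ||W(r * y) + b - t||^2 for one training case on its own thinned layer."""
    r = rng.binomial(1, p, size=y.shape)  # sample a thinned network for this case
    y_thin = r * y
    err = W @ y_thin + b - t              # residual for this case
    dW = np.outer(err, y_thin)            # columns for dropped inputs are exactly zero
    db = err
    return dW, db

# Made-up mini-batch of training cases.
d_in, d_out, batch_size, p, lr = 5, 3, 4, 0.5, 0.1
W, b = rng.normal(size=(d_out, d_in)), np.zeros(d_out)
Y = rng.normal(size=(batch_size, d_in))   # previous-layer outputs, one row per case
T = rng.normal(size=(batch_size, d_out))  # targets, one row per case

grads = [per_case_grad(W, b, y, t, p) for y, t in zip(Y, T)]
dW = np.mean([g[0] for g in grads], axis=0)  # average the gradients over the mini-batch
db = np.mean([g[1] for g in grads], axis=0)
W -= lr * dW                                 # one SGD step with the averaged gradient
b -= lr * db
```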