# Dropout
###### tags: `Data Science`, `Deep Neural Network`
## Motivation
* Overfitting is a serious problem in deep neural networks.
* Model ensembling helps to avoid overfitting, but training and evaluating many large deep neural networks is too slow and expensive to be practical.
## Contribution
* The authors proposed an algorithm that prevents overfitting and provides a way of approximately combining exponentially many different neural network architectures efficiently.
## Methods
### Definitions
* $l \in \{1, 2, \dots, L\}$: the index of the hidden layers ($L$ is the number of hidden layers).
* $\mathbf{z}^{(l)}$: the vector of inputs into layer $l$.
* $\mathbf{y}^{(l)}$: the vector of outputs from layer $l$ ($\mathbf{y}^{(0)}$ is the network input).
* $\mathbf{W}^{(l)}$: the weights at layer $l$.
* $\mathbf{b}^{(l)}$: the biases at layer $l$.
### Formulation

The feed-forward operation of a particular node $i$ in layer $l+1$ of a standard deep neural network can be formulated as follows (where $\mathbf{w}^{(l+1)}_i$ is the row of $\mathbf{W}^{(l+1)}$ corresponding to node $i$):
\begin{align}z^{(l+1)}_i&=\mathbf{w}^{(l+1)}_i\mathbf{y}^{(l)}+b_i^{(l+1)}\\
y^{(l+1)}_i&=f(z^{(l+1)}_i)\end{align}
where $f$ is any activation function. With dropout, the feed-forward operation becomes:
\begin{align}r_i^{(l)} &\sim \text{Bernoulli}(p)\\
\tilde{\mathbf{y}}^{(l)}&=\mathbf{r}^{(l)} \odot \mathbf{y}^{(l)}\\
z^{(l+1)}_i&=\mathbf{w}^{(l+1)}_i\tilde{\mathbf{y}}^{(l)}+b_i^{(l+1)}\\
y^{(l+1)}_i&=f(z^{(l+1)}_i)
\end{align}
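
A minimal NumPy sketch of this dropout feed-forward step for a single layer; the layer sizes, the random `W` and `b`, the value of `p`, and the sigmoid activation are illustrative assumptions, not taken from the paper:
```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: layer l has 3 units, layer l+1 has 4 units.
W = rng.normal(size=(4, 3))              # rows are the weight vectors w_i^{(l+1)}
b = np.zeros(4)                          # biases b_i^{(l+1)}
y_l = rng.normal(size=3)                 # outputs y^{(l)} of layer l
p = 0.5                                  # retention probability

r = rng.binomial(1, p, size=y_l.shape)   # r^{(l)}: one Bernoulli(p) draw per unit
y_thin = r * y_l                         # elementwise product r^{(l)} * y^{(l)}
z = W @ y_thin + b                       # z^{(l+1)}
y_next = sigmoid(z)                      # y^{(l+1)} = f(z^{(l+1)})
```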
### Backpropagation and Testing

During training, the retention probability $p$ of the Bernoulli distribution is a hyperparameter set by the user, and $\mathbf{r}^{(l)}$ is sampled independently for each layer and each training case. Backpropagation works exactly as in a standard deep neural network, except that it only passes through the sampled (thinned) network. At test time, no units are dropped and the weights are scaled as $\mathbf{W}^{(l)}_{test}=p\mathbf{W}^{(l)}_{train}$, so that the expected input each unit receives matches what it saw during training.
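
The train/test asymmetry can be sketched as follows, again with toy shapes and values assumed for illustration: at training time a freshly sampled mask drops units of the previous layer, while at test time no units are dropped and the learned weights are multiplied by $p$.
```python
import numpy as np

rng = np.random.default_rng(1)

p = 0.5
W_train = rng.normal(size=(4, 3))            # weights learned with dropout (toy values)
b = np.zeros(4)
y_prev = rng.normal(size=3)                  # outputs of the previous layer

# Training: drop units of the previous layer with probability 1 - p.
r = rng.binomial(1, p, size=y_prev.shape)
z_train = W_train @ (r * y_prev) + b

# Testing: keep all units and scale the weights, W_test = p * W_train.
W_test = p * W_train
z_test = W_test @ y_prev + b
```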
### Backpropagation (Mini-batch Stochastic Gradient Descent)
The backpropagation for mini-batch SGD can be summarized as follows:
* For each training case (sample) in a mini-batch, a thinned network is sampled by dropping out units.
* The forward and backward passes for that training case are done only on this thinned network.
* The gradients for each parameter are __averaged__ over the training cases in each mini-batch.
* Any training case which does not use a parameter contributes a gradient of __zero__ for that parameter.

The pseudocode for mini-batch backpropagation with dropout:
```python
# Pseudocode (the helper functions are placeholders, not a real API).
for batch in mini_batches:
    gradients = []
    for sample in batch:
        thinned_net = sample_thinned_network(p)   # drop units with probability 1 - p
        output = feed_forward(thinned_net, sample)
        gradients.append(backprop(thinned_net, sample, output))
    update_weights(average(gradients))            # unused parameters contribute zero
```
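
A runnable NumPy sketch of this loop, assuming a one-hidden-layer regression network with dropout only on the hidden layer, a squared-error loss, a ReLU activation, and synthetic data (all illustrative choices, not from the paper). Each training case gets its own dropout mask, the backward pass reuses that mask so dropped units contribute zero gradient, and the per-sample gradients are averaged before each update.
```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data and a one-hidden-layer network (all sizes illustrative).
X = rng.normal(size=(64, 5))                 # 64 training cases, 5 features
y = X @ rng.normal(size=5)                   # linear targets, just to have something to fit
D, H = 5, 16                                 # input size, hidden size
W1, b1 = 0.1 * rng.normal(size=(H, D)), np.zeros(H)
W2, b2 = 0.1 * rng.normal(size=H), 0.0
p, lr, batch_size = 0.5, 0.05, 8             # retention probability, learning rate, batch size

for epoch in range(20):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        gW1, gb1 = np.zeros_like(W1), np.zeros_like(b1)
        gW2, gb2 = np.zeros_like(W2), 0.0
        for i in batch:
            # Sample a thinned network: drop hidden units with probability 1 - p.
            r = rng.binomial(1, p, size=H)
            # Forward pass on the thinned network (ReLU hidden layer, linear output).
            z1 = W1 @ X[i] + b1
            h = np.maximum(z1, 0.0)
            h_thin = r * h
            pred = W2 @ h_thin + b2
            # Backward pass on the same thinned network (squared-error loss).
            dpred = pred - y[i]
            gW2 += dpred * h_thin                # dropped units contribute zero gradient
            gb2 += dpred
            dh = r * (dpred * W2)                # mask the gradient with the same r
            dz1 = dh * (z1 > 0)
            gW1 += np.outer(dz1, X[i])
            gb1 += dz1
        # Average the per-sample gradients over the mini-batch and update the weights.
        n = len(batch)
        W1 -= lr * gW1 / n
        b1 -= lr * gb1 / n
        W2 -= lr * gW2 / n
        b2 -= lr * gb2 / n

# At test time the full network is used, with the outgoing weights of the
# dropped layer scaled by p.
W2_test = p * W2
```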
## Results

## Reference
* Nitish Srivastava _et al._ 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. _Journal of Machine Learning Research_ 15.