# L7-Regularization
> Organization contact [name= [ierosodin](ierosodin@gmail.com)]
###### tags: `deep learning` `study notes`
==[Back to Catalog](https://hackmd.io/@ierosodin/Deep_Learning)==
* http://www.deeplearningbook.org/contents/regularization.html
* In machine learning, we hope that training on the training set will also minimize the expected loss over the true data-generating distribution $p_{data}(x, y)$
* Strategies used to reduce test error, possibly at the expense of increased training error, are known collectively as regularization
* Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty $\Omega(\theta)$.
* $\tilde J(\theta; x, y) = J(\theta; x, y) + \alpha\Omega(\theta)$, where $\alpha \ge 0$ weights the penalty relative to the original objective $J$
* We usually minimize the empirical loss on the training set, i.e. the expectation under the empirical distribution $\hat p_{data}$
* However, we want to achieve low expected loss under the true distribution $p_{data}$ (low generalization error)
* For the regression problem, the optimal predictor is $g^*(x) = \arg\min_g E_{x, y \sim p_{data}}[(y - g(x))^2]$
* $g^*(x) = E_{y \sim p_{data}(y|x)}[y]$
* $\begin{split}
E_{x, y \sim p_{data}}[(y - \hat y(x))^2] &= E_{x, y \sim p_{data}}[((y - g^*(x)) + (g^*(x) - \hat y(x)))^2] \\
&= E_{x, y \sim p_{data}}[(y - g^*(x))^2 + (g^*(x) - \hat y(x))^2]
\end{split}$
* where the cross-term
$\begin{split}&E_{x, y \sim p_{data}}[2(y - g^*(x))(g^*(x) - \hat y(x))] \\
&= E_{x \sim p_{data}}E_{y \sim p_{data} (y|x)}[2(y - g^*(x))(g^*(x) - \hat y(x))] \\
&= E_{x \sim p_{data}}\big[2(g^*(x) - \hat y(x))\,E_{y \sim p_{data}(y|x)}[y - g^*(x)]\big]\end{split}$
vanishes, since $E_{y \sim p_{data}(y|x)}[y - g^*(x)] = g^*(x) - g^*(x) = 0$.
* $E_{x, y \sim p_{data}}[(y - g^*(x))^2]$ is the Bayes error.
* The Bayes error, which arises from the intrinsic noise on data, represents the minimum achievable value of the expected generalization error. We cannot minimize it.
* Since the learned model $\hat y(x)$ depends on the particular training set, we additionally take the expectation over training sets $\hat p_{data}$ drawn from $p_{data}$ (written $E_{\hat p_{data}}$ below):
$\begin{split}
E_{x \sim p_{data}}E_{\hat p_{data}}[(g^*(x) - \hat y(x))^2] &= E_{x \sim p_{data}}E_{\hat p_{data}}[(g^*(x) - E_{\hat p_{data}}[\hat y(x)] + E_{\hat p_{data}}[\hat y(x)] - \hat y(x))^2] \\
&= E_{x \sim p_{data}}\big[(g^*(x) - E_{\hat p_{data}}[\hat y(x)])^2 + E_{\hat p_{data}}[(E_{\hat p_{data}}[\hat y(x)] - \hat y(x))^2]\big]
\end{split}$
* where the cross-term
$E_{x \sim p_{data}}E_{\hat p_{data}}[2(g^*(x) - E_{\hat p_{data}}[\hat y(x)])(E_{\hat p_{data}}[\hat y(x)] - \hat y(x))]$
vanishes, since $E_{\hat p_{data}}[E_{\hat p_{data}}[\hat y(x)] - \hat y(x)] = 0$.
* $(g^*(x) - E_{\hat p_{data}}[\hat y(x)])^2$ is the bias term, which represents the extent to which the average prediction over all training sets differs from the optimal prediction.
* $E_{\hat p_{data}}[(E_{\hat p_{data}}[\hat y(x)] - \hat y(x))^2]$ is the variance term, which measures the extent to which the model $\hat y(x)$ is sensitive to the particular choice of training set.
* Generalization loss = Bayes error + $E_{x \sim p_{data}}[(bias)^2] + E_{x \sim p_{data}}[variance]$
* Trading off Bias and Variance
* In general, models of high capacity have low bias and high variance, whereas models of low capacity have high bias and low variance
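* To make the tradeoff concrete, here is a minimal numpy sketch: it estimates squared bias and variance empirically by fitting polynomials of several degrees to many freshly sampled training sets (the target function, noise level, and degrees are arbitrary choices for the demo).

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):                      # g*(x): the noise-free target
    return np.sin(2 * np.pi * x)

def estimate_bias_variance(degree, n_datasets=200, n_train=20, noise=0.3):
    """Fit `n_datasets` polynomial models, each on a fresh training set,
    and measure squared bias and variance on a fixed test grid."""
    x_test = np.linspace(0, 1, 100)
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        x = rng.uniform(0, 1, n_train)
        y = true_fn(x) + noise * rng.standard_normal(n_train)
        coeffs = np.polyfit(x, y, degree)          # \hat y(x) for this training set
        preds[d] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)                  # average prediction over training sets
    bias_sq = np.mean((true_fn(x_test) - mean_pred) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for degree in (1, 3, 9):
    b, v = estimate_bias_variance(degree)
    print(f"degree={degree}: bias^2={b:.3f}, variance={v:.3f}")
# Low-capacity models (degree 1) show high bias / low variance,
# high-capacity models (degree 9) show low bias / high variance.
```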
* L2 Regularization
* L2 parameter regularization (weight decay) drives the weights closer to the origin by adding an L2-norm penalty
* The penalty is $\Omega(\theta) = \frac{1}{2}\|w\|_2^2$, giving the regularized objective $\tilde J(w; x, y) = J(w; x, y) + \frac{\alpha}{2}w^Tw$
* The corresponding gradient is $\nabla_w\tilde J(w; x, y) = \nabla_wJ(w; x, y) + \alpha w$, so a single gradient step becomes $w \leftarrow (1 - \epsilon\alpha)w - \epsilon\nabla_wJ(w; x, y)$, i.e. the weights are multiplicatively shrunk on every step
* To see the overall effect, approximate $J$ by a quadratic around the unregularized optimum $w^* = \arg\min_wJ(w)$: $\hat J(w) = J(w^*) + \frac{1}{2}(w - w^*)^TH(w - w^*)$, where $H$ is the Hessian of $J$ at $w^*$ (positive semidefinite)
* The minimum of the regularized $\hat J$ satisfies $\alpha\tilde w + H(\tilde w - w^*) = 0$, so $\tilde w = (H + \alpha I)^{-1}Hw^*$
* With the eigendecomposition $H = Q\Lambda Q^T$, this becomes $\tilde w = Q(\Lambda + \alpha I)^{-1}\Lambda Q^Tw^*$
* In other words, the component of $w^*$ along the $i$-th eigenvector of $H$ is rescaled by the factor $\frac{\lambda_i}{\lambda_i + \alpha}$
* Along directions corresponding to large $\lambda_i$, a further deviation from $w^*$ contributes significantly to increasing the objective function, so the regularization leaves these components relatively intact
* The effect of L2 regularization is to decay away components of $w^∗$ along unimportant directions with $\lambda_i << \alpha$
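* A quick numerical sketch of this result (the quadratic objective below is an arbitrary example): the minimizer of the L2-regularized quadratic computed directly coincides with $Q(\Lambda + \alpha I)^{-1}\Lambda Q^Tw^*$.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.5

# Random positive-definite Hessian H and unregularized optimum w*
A = rng.standard_normal((4, 4))
H = A @ A.T + 0.1 * np.eye(4)
w_star = rng.standard_normal(4)

# Minimizer of the L2-regularized quadratic: (H + alpha*I) w = H w*
w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Same solution expressed through the eigendecomposition H = Q Lambda Q^T
lam, Q = np.linalg.eigh(H)
w_eig = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))

print(np.allclose(w_tilde, w_eig))   # True: each component is shrunk by lambda_i / (lambda_i + alpha)
```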
* L1 Regularization
* Formally, L1 regularization on the model parameter $w$ is defined as
* $\Omega(\theta) = \|w\|_1 = \sum_i|w_i|$, giving the regularized objective $\tilde J(w; x, y) = J(w; x, y) + \alpha\|w\|_1$
* However, the fully general Hessian does not admit a clean algebraic solution in this case, so we make the further simplifying assumption that the Hessian is diagonal, $H = \mathrm{diag}([H_{1,1}, \dots, H_{n,n}])$ with each $H_{i,i} > 0$
* Under this assumption, the quadratic approximation of the regularized objective decomposes over components: $\hat J(w; x, y) = J(w^*; x, y) + \sum_i\big[\frac{1}{2}H_{i,i}(w_i - w_i^*)^2 + \alpha|w_i|\big]$
* Minimizing each term separately gives $w_i = \mathrm{sign}(w_i^*)\max\big\{|w_i^*| - \frac{\alpha}{H_{i,i}}, 0\big\}$
* If $|w_i^*| \le \frac{\alpha}{H_{i,i}}$, the optimal value of $w_i$ is exactly $0$: the penalty outweighs the gain from fitting that direction
* If $|w_i^*| > \frac{\alpha}{H_{i,i}}$, the component is kept but shifted toward zero by $\frac{\alpha}{H_{i,i}}$
* In contrast to L2 regularization, which only rescales components, L1 regularization therefore produces sparse solutions, i.e. some parameters are exactly zero
* The sparsity property induced by L1 regularization has been used extensively as a feature selection mechanism (e.g. LASSO, which combines an L1 penalty with a linear model). Feature selection simplifies a machine learning problem by choosing which subset of the available features should be used.
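* A minimal sketch of this component-wise solution (soft thresholding); the values of $w^*$, $H_{i,i}$, and $\alpha$ are arbitrary and just show which components get zeroed out.

```python
import numpy as np

def l1_solution(w_star, H_diag, alpha):
    """Closed-form minimizer of 0.5*H_ii*(w_i - w*_i)^2 + alpha*|w_i|
    for each component (soft thresholding)."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / H_diag, 0.0)

w_star = np.array([ 2.0, -0.3, 0.05, -1.2,  0.4])
H_diag = np.array([ 1.0,  1.0, 1.0,   2.0,  0.5])

print(l1_solution(w_star, H_diag, alpha=0.5))
# -> [ 1.5  -0.    0.   -0.95  0. ]  components with |w*_i| <= alpha/H_ii are zeroed out
```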
* Norm Penalties as Constrained Optimization
* A penalty can equivalently be viewed as a constraint $\Omega(\theta) \le k$, enforced through the generalized Lagrangian $\mathcal{L}(\theta, \alpha; x, y) = J(\theta; x, y) + \alpha(\Omega(\theta) - k)$
* The constrained solution is $\theta^* = \arg\min_\theta\max_{\alpha \ge 0}\mathcal{L}(\theta, \alpha)$; fixing $\alpha$ at its optimal value $\alpha^*$, this reduces to $\theta^* = \arg\min_\theta J(\theta; x, y) + \alpha^*\Omega(\theta)$
* This is exactly the same as the regularized training problem of minimizing $\tilde J$. We can thus think of a parameter norm penalty as imposing a constraint on the weights: if $\Omega$ is the L2 norm, the weights are constrained to lie in an L2 ball; if $\Omega$ is the L1 norm, the weights are constrained to lie in a region of limited L1 norm.
* Usually we do not know the size of the constraint region imposed by using weight decay with coefficient $\alpha^*$, because the value of $\alpha^*$ does not directly tell us the value of $k$. In principle one can solve for $k$, but the relationship between $k$ and $\alpha^*$ depends on the form of $J$.
* While we do not know the exact size of the constraint region, we can control it roughly by increasing or decreasing $\alpha$: a larger $\alpha$ results in a smaller constraint region, and a smaller $\alpha$ results in a larger constraint region.
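* The book also discusses using an explicit constraint instead of a penalty, by re-projecting $\theta$ onto the constraint region after each gradient step; a minimal sketch of a projected gradient step onto an L2 ball of radius $k$ (the function names and step size here are illustrative):

```python
import numpy as np

def project_l2_ball(w, k):
    """Project w onto the L2 ball {w : ||w||_2 <= k}."""
    norm = np.linalg.norm(w)
    return w if norm <= k else w * (k / norm)

def constrained_step(w, grad, lr=0.1, k=1.0):
    """One projected-gradient step: descend on J, then re-project onto the constraint region."""
    return project_l2_ball(w - lr * grad, k)

w = np.array([3.0, 4.0])             # ||w|| = 5, outside the ball of radius 1
print(project_l2_ball(w, k=1.0))     # -> [0.6 0.8], rescaled onto the ball
```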
* Trajectory of L2 Regularization
* As $\alpha$ grows, each component of $w^*$ is shrunk smoothly toward the origin by the factor $\frac{\lambda_i}{\lambda_i + \alpha}$, but components generally remain nonzero.
* Trajectory of L1 Regularization
* As $\alpha$ grows, more and more components satisfy $|w_i^*| \le \frac{\alpha}{H_{i,i}}$ and are set exactly to zero, so the solution becomes increasingly sparse.
* Early Stopping
* Train while periodically evaluating on a validation set, and stop (returning the best parameters seen so far) when the validation loss reaches its lowest point, i.e. when it stops improving.
* Early Stopping as Regularization
* In a sense, early stopping restricts the training to a small volume of parameter space around the initial $w_0$
* The reachable volume is determined by the number of training iterations $\tau$ and the learning rate $\epsilon$: after $\tau$ optimization steps with learning rate $\epsilon$, we can view the product $\tau\epsilon$ as a measure of effective capacity.
* Assuming the gradient is bounded, restricting both the number of iterations and the learning rate limits the volume of parameter space reachable from $\theta_0$. In this sense, $\tau\epsilon$ behaves as if it were the reciprocal of the coefficient used for weight decay.
* To see why, approximate $J$ by a quadratic around $w^*$ as before and run gradient descent from $w^{(0)} = 0$: in the eigenbasis of $H = Q\Lambda Q^T$, the parameters after $\tau$ steps satisfy $Q^Tw^{(\tau)} = [I - (I - \epsilon\Lambda)^\tau]Q^Tw^*$ (assuming $\epsilon$ is small enough that $|1 - \epsilon\lambda_i| < 1$)
* Comparing with the L2-regularized solution $Q^T\tilde w = [I - (\Lambda + \alpha I)^{-1}\alpha]Q^Tw^*$, the two coincide when $(I - \epsilon\Lambda)^\tau = (\Lambda + \alpha I)^{-1}\alpha$; for small $\epsilon\lambda_i$ and $\lambda_i/\alpha$ this gives $\alpha \approx \frac{1}{\tau\epsilon}$
* The derivations in this section have shown that a trajectory of length $\tau$ ends at a point that corresponds to a minimum of the L2-regularized objective. Early stopping is of course more than the mere restriction of the trajectory length; instead, early stopping typically involves monitoring the validation set error in order to stop the trajectory at a particularly good point in space. Early stopping therefore has the advantage over weight decay in that it automatically determines the correct amount of regularization while weight decay requires many training experiments with different values of its hyperparameter.
* In other words, the stopping time $\tau$ implicitly determines the amount of regularization, with $\tau\epsilon \approx \alpha^{-1}$
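* A minimal sketch of the early-stopping loop with a patience criterion (in the spirit of the book's Algorithm 7.1); `train_one_epoch`, `validation_loss`, and `model.state` are placeholders assumed to exist:

```python
import copy

def early_stopping_train(model, train_one_epoch, validation_loss, patience=5):
    """Train until the validation loss has not improved for `patience` evaluations,
    then return the parameters (and step count) that achieved the best validation loss."""
    best_loss = float("inf")
    best_params = copy.deepcopy(model.state)   # `state` is a hypothetical parameter container
    best_step = 0
    step = 0
    epochs_without_improvement = 0
    while epochs_without_improvement < patience:
        train_one_epoch(model)
        step += 1
        loss = validation_loss(model)
        if loss < best_loss:
            best_loss, best_step = loss, step
            best_params = copy.deepcopy(model.state)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
    return best_params, best_step
```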
* Bootstrap Aggregating (Bagging)
* To train several models separately and have them vote on the output for test examples (a.k.a. model averaging or ensemble methods)
* Suppose each of $k$ regression models makes an error $\epsilon_i$ on each example, with variances $E[\epsilon_i^2] = v$ and covariances $E[\epsilon_i\epsilon_j] = c$; the expected squared error of the ensemble prediction (the average of the $k$ models) is $E\big[\big(\frac{1}{k}\sum_i\epsilon_i\big)^2\big] = \frac{1}{k}v + \frac{k-1}{k}c$
* In other words, on average, the ensemble will perform at least as well as any of its members, and significantly better if the members make independent errors ($c = 0$, in which case the ensemble error drops to $\frac{v}{k}$)
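* A minimal bagging sketch: fit $k$ models, each on a bootstrap resample of the training set, and average their predictions (the polynomial base model and data here are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)

def bagged_predict(x_train, y_train, x_test, k=10, degree=3):
    """Fit `k` polynomial models, each on a bootstrap resample of the training set,
    and average their predictions (a minimal bagging ensemble)."""
    preds = []
    n = len(x_train)
    for _ in range(k):
        idx = rng.integers(0, n, size=n)          # sample n examples with replacement
        coeffs = np.polyfit(x_train[idx], y_train[idx], degree)
        preds.append(np.polyval(coeffs, x_test))
    return np.mean(preds, axis=0)

x_train = rng.uniform(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(30)
x_test = np.linspace(0, 1, 5)
print(bagged_predict(x_train, y_train, x_test))
```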
* Dropout
* To train a bagged ensemble of exponentially many neural networks: all the subnetworks that can be formed from a base network
* Subnetwork construction
* To remove non-output units by multiplying their output values by zero, with a mask vector $\mu$ indicating which units to keep
* Dropout training
* In each mini-batch step, we randomly sample a binary mask $\mu$; run forward- and back-propagation; and update parameters as usual.
* This corresponds to minimizing $E_{\mu}[J(\theta, \mu)]$; the expectation contains exponentially many terms, but each sampled $\mu$ gives an unbiased estimate of its gradient
* Dropout training is not quite the same as bagging training. In the case of bagging, the models are all independent. In the case of dropout, the models share parameters, with each model inheriting a different subset of parameters from the parent neural network.
* In the case of bagging, each model is trained to convergence on its respective training set. In the case of dropout, typically most models are not explicitly trained at all.
* Pros and Cons
* Pros: dropout is computationally cheap (generating and applying the mask costs only $O(n)$ per example per update) and does not significantly limit the type of model or training procedure that can be used
* Cons: it reduces the effective capacity of the model, so the model size or number of training iterations usually must be increased, and it is less effective when very few labeled training examples are available
* Another empirical approach, termed the weight scaling inference rule, allows us to approximate $p(y|x)$ in one model: the model with all units, but with the weights going out of unit $i$ multiplied by the probability of including unit $i$.
* The motivation is to capture the right expected value of output from that unit, or to make sure that the expected total input to a unit at test time is roughly the same as that at training time.
* The weight scaling inference rule is exact in some settings, e.g. deep networks that have hidden layers without non-linearities
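* A minimal numpy sketch of dropout on one hidden layer, sampling the mask $\mu$ at training time and applying the weight scaling inference rule at test time (layer sizes and the keep probability are arbitrary choices for the demo).

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8                      # probability of keeping each hidden unit

W1 = rng.standard_normal((10, 32))   # input -> hidden
W2 = rng.standard_normal((32, 1))    # hidden -> output

def forward_train(x):
    """Training-time forward pass: sample a binary mask mu and zero out dropped units."""
    h = np.maximum(x @ W1, 0.0)                        # ReLU hidden layer
    mu = rng.binomial(1, keep_prob, size=h.shape)      # mask: which units to keep
    return (h * mu) @ W2

def forward_test(x):
    """Weight scaling inference rule: keep all units, but scale outgoing weights by keep_prob."""
    h = np.maximum(x @ W1, 0.0)
    return h @ (keep_prob * W2)

x = rng.standard_normal((4, 10))
print(forward_train(x).shape, forward_test(x).shape)   # (4, 1) (4, 1)
```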
* Multitask Learning
* To improve generalization by pooling examples for several tasks.
* A common architecture shares the lower layers (a common intermediate representation) across tasks, with task-specific output layers on top, so the shared parameters benefit from the pooled examples of all tasks
* Assumption: there exists a common pool of factors that explain the data variations, while each task is linked to a subset of these factors
* The dominant approach is to minimize a weighted linear sum of the per-task losses
* $L_{total}(\theta) = \sum_iw_iL_i(\theta)$, where $L_i$ is the loss of task $i$
* Model performance is, however, sensitive to the choice of the weights $w_i$
* Training Multitask Likelihoods
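* A widely used instance of this idea, assuming it refers to the homoscedastic-uncertainty weighting of Kendall et al. (2018): model each (regression-style) task likelihood as Gaussian with a learned noise scale $\sigma_i$; maximizing the joint likelihood then amounts to minimizing
* $L_{total}(\theta, \sigma_1, \dots, \sigma_n) = \sum_i\big[\frac{1}{2\sigma_i^2}L_i(\theta) + \log\sigma_i\big]$, so tasks with larger learned uncertainty automatically receive smaller effective weights $w_i = \frac{1}{2\sigma_i^2}$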
* Data Augmentation
* To add fake data to the training set, e.g. transformed copies of existing examples (translations, flips, noise injection), so that the model generalizes better
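* A minimal numpy sketch of typical image-style augmentations (random horizontal flip, padded random crop, additive noise); the specific transformations and parameters are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, pad=4, noise_std=0.05):
    """Return a randomly transformed copy of `image` (H x W array)."""
    # Random horizontal flip
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # Random crop after zero-padding
    h, w = image.shape
    padded = np.pad(image, pad)
    top, left = rng.integers(0, 2 * pad + 1, size=2)
    image = padded[top:top + h, left:left + w]
    # Additive Gaussian noise
    return image + noise_std * rng.standard_normal(image.shape)

fake_image = rng.random((28, 28))
print(augment(fake_image).shape)     # (28, 28): same shape, new "fake" training example
```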
* Adversarial Training
* To encourage the model $\hat y(x)$ to be locally constant in the vicinity of training data $x$ by including adversarial examples for training.
* Adversarial examples can be constructed by perturbing an input in the direction that most increases the loss, e.g. with the fast gradient sign method: $x' = x + \epsilon\,\mathrm{sign}(\nabla_xJ(\theta, x, y))$
* Training on these perturbed examples with their original labels penalizes the model for changing its output under small perturbations of the input
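* A minimal numpy sketch of generating an adversarial example with the fast gradient sign method for a logistic-regression model (the model, label, and $\epsilon$ are stand-ins chosen for the illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

w = rng.standard_normal(10)          # logistic regression weights
x = rng.standard_normal(10)          # a training input
y = 1.0                              # its label (0 or 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient(x, y, w):
    """Gradient of the cross-entropy loss w.r.t. the input x."""
    return (sigmoid(w @ x) - y) * w

# Fast gradient sign method: move the input in the direction that increases the loss
epsilon = 0.1
x_adv = x + epsilon * np.sign(input_gradient(x, y, w))

print(sigmoid(w @ x), sigmoid(w @ x_adv))   # the adversarial example shifts the prediction
```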