# Regularizing a machine learning model

<!-- Put the link to this slide here so people can follow -->

slide: https://hackmd.io/@ccornwell/regularization

---

<h3>Balancing the space of parameters for a model</h3>

- <font size=+1>The procedure that minimizes the loss function on the training data can favor making the parameters "large".</font>

![](https://i.imgur.com/LWP10uD.png =x350)

---

<h3>A penalty term on the loss function</h3>

- <font size=+2>To counter the tendency toward large (in absolute value) parameter values, add a penalty term to the original loss function $\ell$.</font>

  <font>New loss function, for some $\alpha>0$: $\ell({\bf w}) + \alpha||{\bf w}||^2$</font>

- <font size=+2>The norm in this expression is often the (squared) Euclidean norm, $||{\bf w}||^2 = w_1^2 + w_2^2 + \ldots + w_n^2$; this choice is called "$L_2$-regularization". (A code sketch follows on the next slide.)</font>
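---

<h3>Sketch: an $L_2$-penalized loss in code</h3>

A minimal sketch (not part of the original slides) of evaluating $\ell({\bf w}) + \alpha||{\bf w}||^2$ in NumPy, assuming for illustration that $\ell$ is the mean squared error of a linear model; the names `penalized_loss`, `X`, `y`, and the value of `alpha` are all illustrative.

```python
import numpy as np

def penalized_loss(w, X, y, alpha=0.1):
    """Base loss ell(w) plus the L2 penalty alpha * ||w||^2.

    Illustrative assumption: ell is the mean squared error of the
    linear model X @ w against targets y.
    """
    residuals = X @ w - y
    base_loss = np.mean(residuals ** 2)   # ell(w)
    penalty = alpha * np.dot(w, w)        # alpha * (w_1^2 + ... + w_n^2)
    return base_loss + penalty
```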
---

<h3>Gradients of loss with penalty</h3>

New loss function, for some $\alpha>0$: $\ell({\bf w}) + \alpha||{\bf w}||^2$

- <font size=+2>Note that for the $j^{th}$ parameter $w_j$, the new partial derivative w.r.t. $w_j$ is $\frac{\partial\ell}{\partial w_j} + 2\alpha w_j$.</font>
- <font size=+2>In an update, you subtract a _constant_ times $w_j$, in addition to subtracting the partial derivative of $\ell$.</font>
- <font size=+2>**Weight decay.** For this to work well in gradient descent, it is best to choose $\alpha\in(0,1/2)$, since with learning rate $\eta\in(0,1)$ the updated weight is</font>

  <font>$w_j - \eta\left(\frac{\partial\ell}{\partial w_j} + 2\alpha w_j\right) = (1-2\eta\alpha)\,w_j - \eta\frac{\partial\ell}{\partial w_j},$</font>

  <font size=+2>so each weight first shrinks ("decays") by the factor $1-2\eta\alpha\in(0,1)$ and is then updated with the usual gradient step on $\ell$. (A code sketch of this update follows on the next slide.)</font>
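---

<h3>Sketch: the weight-decay update in code</h3>

A minimal sketch (not part of the original slides) checking numerically that one gradient step on the penalized loss equals the "weight decay" form $(1-2\eta\alpha)w_j - \eta\,\frac{\partial\ell}{\partial w_j}$; the array values and the name `grad_ell` are illustrative.

```python
import numpy as np

# Illustrative current weights and the gradient of ell at those weights.
w = np.array([1.0, -2.0, 0.5])
grad_ell = np.array([0.3, -0.1, 0.2])
eta, alpha = 0.1, 0.01   # learning rate in (0,1), penalty strength in (0,1/2)

# Step on the penalized loss: subtract eta * (grad_ell + 2*alpha*w).
direct = w - eta * (grad_ell + 2 * alpha * w)

# Equivalent "weight decay" form: shrink w, then take the usual step on ell.
decayed = (1 - 2 * eta * alpha) * w - eta * grad_ell

assert np.allclose(direct, decayed)
```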
{"metaMigratedAt":"2023-06-15T23:53:05.093Z","metaMigratedFrom":"YAML","title":"Regularization","breaks":true,"description":"View the slide with \"Slide Mode\".","contributors":"[{\"id\":\"da8891d8-b47c-4b6d-adeb-858379287e60\",\"add\":4877,\"del\":10}]"}
    262 views