---
title: Regularization
tags: Senior Seminar, Talk
description: View the slide with "Slide Mode".
---
# Regularizing a machine learning model
<!-- Put the link to this slide here so people can follow -->
slide: https://hackmd.io/@ccornwell/regularization
---
<h3>Balancing the space of parameters for a model</h3>
- <font size=+1>The procedure that minimizes the loss function on the training data can favor "large" parameter values (see the sketch below).</font>
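
A small illustration of this tendency (not from the slides; the synthetic data and polynomial degrees below are assumptions chosen for the demo): fitting a high-degree polynomial to a few noisy points drives the training error toward zero, and the fitted coefficients tend to come out very large.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 noisy samples of a simple trend on [0, 1]
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

# A modest fit vs. a fit with enough parameters to (nearly) interpolate the data
low_deg = np.polyfit(x, y, deg=2)
high_deg = np.polyfit(x, y, deg=9)

# The degree-9 fit has (near-)zero training error, and its coefficients are
# typically orders of magnitude larger than the degree-2 coefficients.
print("max |coef|, degree 2:", np.max(np.abs(low_deg)))
print("max |coef|, degree 9:", np.max(np.abs(high_deg)))
```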

---
<h3>A penalty term on the loss function</h3>
- <font size=+2>To counter the tendency towards large (in absolute value) parameter values, add a penalty term to the original loss function $\ell$.</font>
<font style="color:#181818;">New loss function, for some $\alpha>0$: $\ell({\bf w}) + \alpha||{\bf w}||^2$</font>
- <font size=+2 style="color:#181818;">Often the norm in the expression is the squared Euclidean norm, $||{\bf w}||^2 = w_1^2 + w_2^2 + \ldots+w_n^2$; this is called "$L_2$-regularization".</font>
----
<h3>A penalty term on the loss function</h3>
- <font size=+2>To counter the tendency towards large (in absolute value) parameter values, add a penalty term to the original loss function $\ell$.</font>
<font>New loss function, for some $\alpha>0$: $\ell({\bf w}) + \alpha||{\bf w}||^2$</font>
- <font size=+2 style="color:#181818;">Often the norm in the expression is the squared Euclidean norm, $||{\bf w}||^2 = w_1^2 + w_2^2 + \ldots+w_n^2$; this is called "$L_2$-regularization".</font>
----
<h3>A penalty term on the loss function</h3>
- <font size=+2>To counter the tendency towards large (in absolute value) parameter values, add a penalty term to the original loss function $\ell$.</font>
<font>New loss function, for some $\alpha>0$: $\ell({\bf w}) + \alpha||{\bf w}||^2$</font>
- <font size=+2>Often the norm in the expression is the squared Euclidean norm, $||{\bf w}||^2 = w_1^2 + w_2^2 + \ldots+w_n^2$; this is called "$L_2$-regularization" (see the code sketch below).</font>
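
As a concrete sketch (assuming a linear model with mean-squared-error as $\ell$; the function and variable names are made up for illustration), the penalized loss can be computed as:

```python
import numpy as np

def l2_penalized_loss(w, X, y, alpha):
    """Original loss l(w) -- here the MSE of a linear model on (X, y) --
    plus the L2 penalty  alpha * ||w||^2 = alpha * (w_1^2 + ... + w_n^2)."""
    residual = X @ w - y
    loss = np.mean(residual ** 2)       # l(w)
    penalty = alpha * np.dot(w, w)      # alpha * ||w||^2
    return loss + penalty
```

For linear regression, minimizing a penalized objective of this form is ridge regression (e.g., scikit-learn's `Ridge`), up to the exact scaling convention used for the data-fit term.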
---
<h3>Gradients of loss with penalty</h3>
New loss function, for some $\alpha>0$: $\ell({\bf w}) + \alpha||{\bf w}||^2$
- <font size=+2>Note that for the $j^{th}$ parameter $w_j$, the new partial derivative w.r.t. $w_j$ is $\frac{\partial\ell}{\partial w_j} + (2\alpha)w_j$.</font>
- <font size=+2 style="color:#181818;">In an update, we subtract a _constant_ times $w_j$, in addition to subtracting the partial derivative of $\ell$.</font>
- <font size=+2 style="color:#181818;">**Weight decay.** To work well in gradient descent, it is best to choose $\alpha\in(0,1/2)$, since with learning rate $\eta\in(0,1)$ the updated weight is:</font>
<font style="color:#181818;">$w_j - \eta(\frac{\partial\ell}{\partial w_j} + 2\alpha w_j) = (1-2\eta\alpha)w_j - \eta(\frac{\partial\ell}{\partial w_j})$</font>
----
<h3>Gradients of loss with penalty</h3>
New loss function, for some $\alpha>0$: $\ell({\bf w}) + \alpha||{\bf w}||^2$
- <font size=+2>Note that for the $j^{th}$ parameter $w_j$, the new partial derivative w.r.t. $w_j$ is $\frac{\partial\ell}{\partial w_j} + (2\alpha)w_j$.</font>
- <font size=+2>In an update, we subtract a _constant_ times $w_j$, in addition to subtracting the partial derivative of $\ell$.</font>
- <font size=+2 style="color:#181818;">**Weight decay.** To work well in gradient descent, it is best to choose $\alpha\in(0,1/2)$, since with learning rate $\eta\in(0,1)$ the updated weight is:</font>
<font style="color:#181818;">$w_j - \eta(\frac{\partial\ell}{\partial w_j} + 2\alpha w_j) = (1-2\eta\alpha)w_j - \eta(\frac{\partial\ell}{\partial w_j})$</font>
----
<h3>Gradients of loss with penalty</h3>
New loss function, for some $\alpha>0$: $\ell({\bf w}) + \alpha||{\bf w}||^2$
- <font size=+2>Note that for the $j^{th}$ parameter $w_j$, the new partial derivative w.r.t. $w_j$ is $\frac{\partial\ell}{\partial w_j} + (2\alpha)w_j$.</font>
- <font size=+2>In an update, we subtract a _constant_ times $w_j$, in addition to subtracting the partial derivative of $\ell$.</font>
- <font size=+2>**Weight decay.** To work well in gradient descent, it is best to choose $\alpha\in(0,1/2)$, since with learning rate $\eta\in(0,1)$ the updated weight is:</font>
<font style="color:#181818;">$w_j - \eta(\frac{\partial\ell}{\partial w_j} + 2\alpha w_j) = (1-2\eta\alpha)w_j - \eta(\frac{\partial\ell}{\partial w_j})$</font>
----
<h3>Gradients of loss with penalty</h3>
New loss function, for some $\alpha>0$: $\ell({\bf w}) + \alpha||{\bf w}||^2$
- <font size=+2>Note that for the $j^{th}$ parameter $w_j$, the new partial derivative w.r.t. $w_j$ is $\frac{\partial\ell}{\partial w_j} + (2\alpha)w_j$.</font>
- <font size=+2>In an update, we subtract a _constant_ times $w_j$, in addition to subtracting the partial derivative of $\ell$.</font>
- <font size=+2>**Weight decay.** To work well in gradient descent, it is best to choose $\alpha\in(0,1/2)$, since with learning rate $\eta\in(0,1)$ the updated weight is (code sketch below):</font>
<font>$w_j - \eta(\frac{\partial\ell}{\partial w_j} + 2\alpha w_j) = (1-2\eta\alpha)w_j - \eta(\frac{\partial\ell}{\partial w_j})$</font>
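
A minimal NumPy sketch of this update (the gradient function, learning rate, and example loss below are placeholder choices, not from the slides):

```python
import numpy as np

def descent_step(w, grad_loss, eta=0.1, alpha=0.01):
    """One gradient-descent step on  l(w) + alpha * ||w||^2.

    Equivalent to  w <- (1 - 2*eta*alpha) * w - eta * grad_loss(w):
    each step shrinks ("decays") the weights by the factor (1 - 2*eta*alpha),
    which stays in (0, 1) when eta is in (0, 1) and alpha is in (0, 1/2).
    """
    return w - eta * (grad_loss(w) + 2 * alpha * w)

# Example: l(w) = ||w - 1||^2, whose gradient is 2*(w - 1).
w = np.array([5.0, -3.0])
for _ in range(100):
    w = descent_step(w, grad_loss=lambda v: 2.0 * (v - 1.0))
# w converges to about 1/(1 + alpha) in each coordinate: pulled slightly
# toward 0 relative to the unpenalized minimizer (1, 1).
```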