---
title: Regularization
tags: Senior Seminar, Talk
description: View the slide with "Slide Mode".
---
# Regularizing a machine learning model
<!-- Put the link to this slide here so people can follow -->
slide: https://hackmd.io/@ccornwell/regularization
---
<h3>Balancing the space of parameters for a model</h3>
- <font size=+1>The procedure that minimizes the loss function on the training data can favor "large" parameter values (see the sketch below).</font>
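
A small illustration of this tendency (not from the slides; the synthetic data and polynomial degrees below are assumptions chosen for the demo): fitting a high-degree polynomial to a few noisy points drives the training error toward zero, and the fitted coefficients tend to come out very large.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 noisy samples of a simple trend on [0, 1]
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

# A modest fit vs. a fit with enough parameters to (nearly) interpolate the data
low_deg = np.polyfit(x, y, deg=2)
high_deg = np.polyfit(x, y, deg=9)

# The degree-9 fit has (near-)zero training error, and its coefficients are
# typically orders of magnitude larger than the degree-2 coefficients.
print("max |coef|, degree 2:", np.max(np.abs(low_deg)))
print("max |coef|, degree 9:", np.max(np.abs(high_deg)))
```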

---
<h3>A penalty term on the loss function</h3>
- <font size=+2>To counter the tendency towards large (in absolute value) parameter values, add a penalty term to the original loss function $\ell$.</font>
<font style="color:#181818;">New loss function, for some $\alpha>0$: $\ell({\bf w}) + \alpha||{\bf w}||^2$</font>
- <font size=+2 style="color:#181818;">Often the norm in the expression is the squared Euclidean norm, $||{\bf w}||^2 = w_1^2 + w_2^2 + \ldots+w_n^2$; this is called "$L_2$-regularization".</font>
----
<h3>A penalty term on the loss function</h3>
- <font size=+2>To counter the tendency towards large (in absolute value) parameter values, add a penalty term to the original loss function $\ell$.</font>
<font>New loss function, for some $\alpha>0$: $\ell({\bf w}) + \alpha||{\bf w}||^2$</font>
- <font size=+2 style="color:#181818;">Often the norm in the expression is the squared Euclidean norm, $||{\bf w}||^2 = w_1^2 + w_2^2 + \ldots+w_n^2$; this is called "$L_2$-regularization".</font>
----
<h3>A penalty term on the loss function</h3>
- <font size=+2>To counter the tendency towards large (in absolute value) parameter values, add a penalty term to the original loss function $\ell$.</font>
<font>New loss function, for some $\alpha>0$: $\ell({\bf w}) + \alpha||{\bf w}||^2$</font>
- <font size=+2>Often the norm in the expression is the squared Euclidean norm, $||{\bf w}||^2 = w_1^2 + w_2^2 + \ldots+w_n^2$; this is called "$L_2$-regularization" (see the code sketch below).</font>
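
As a concrete sketch (assuming a linear model with mean-squared-error as $\ell$; the function and variable names are made up for illustration), the penalized loss can be computed as:

```python
import numpy as np

def l2_penalized_loss(w, X, y, alpha):
    """Original loss l(w) -- here the MSE of a linear model on (X, y) --
    plus the L2 penalty  alpha * ||w||^2 = alpha * (w_1^2 + ... + w_n^2)."""
    residual = X @ w - y
    loss = np.mean(residual ** 2)       # l(w)
    penalty = alpha * np.dot(w, w)      # alpha * ||w||^2
    return loss + penalty
```

For linear regression, minimizing a penalized objective of this form is ridge regression (e.g., scikit-learn's `Ridge`), up to the exact scaling convention used for the data-fit term.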
---
<h3>Gradients of loss with penalty</h3>
New loss function, for some $\alpha>0$: $\ell({\bf w}) + \alpha||{\bf w}||^2$
- <font size=+2>Note that for the $j^{th}$ parameter $w_j$, the new partial derivative w.r.t. $w_j$ is $\frac{\partial\ell}{\partial w_j} + (2\alpha)w_j$.</font>
- <font size=+2 style="color:#181818;">In an update, we subtract a _constant_ times $w_j$, in addition to subtracting the partial derivative of $\ell$.</font>
- <font size=+2 style="color:#181818;">**Weight decay.** To work well in gradient descent, it is best to choose $\alpha\in(0,1/2)$, since with learning rate $\eta\in(0,1)$ the updated weight is:</font>
<font style="color:#181818;">$w_j - \eta(\frac{\partial\ell}{\partial w_j} + 2\alpha w_j) = (1-2\eta\alpha)w_j - \eta(\frac{\partial\ell}{\partial w_j})$</font>
----
<h3>Gradients of loss with penalty</h3>
New loss function, for some $\alpha>0$: $\ell({\bf w}) + \alpha||{\bf w}||^2$
- <font size=+2>Note that for the $j^{th}$ parameter $w_j$, the new partial derivative w.r.t. $w_j$ is $\frac{\partial\ell}{\partial w_j} + (2\alpha)w_j$.</font>
- <font size=+2>In an update, we subtract a _constant_ times $w_j$, in addition to subtracting the partial derivative of $\ell$.</font>
- <font size=+2 style="color:#181818;">**Weight decay.** To work well in gradient descent, it is best to choose $\alpha\in(0,1/2)$, since with learning rate $\eta\in(0,1)$ the updated weight is:</font>
<font style="color:#181818;">$w_j - \eta(\frac{\partial\ell}{\partial w_j} + 2\alpha w_j) = (1-2\eta\alpha)w_j - \eta(\frac{\partial\ell}{\partial w_j})$</font>
----
<h3>Gradients of loss with penalty</h3>
New loss function, for some $\alpha>0$: $\ell({\bf w}) + \alpha||{\bf w}||^2$
- <font size=+2>Note that for the $j^{th}$ parameter $w_j$, the new partial derivative w.r.t. $w_j$ is $\frac{\partial\ell}{\partial w_j} + (2\alpha)w_j$.</font>
- <font size=+2>In an update, we subtract a _constant_ times $w_j$, in addition to subtracting the partial derivative of $\ell$.</font>
- <font size=+2>**Weight decay.** To work well in gradient descent, it is best to choose $\alpha\in(0,1/2)$, since with learning rate $\eta\in(0,1)$ the updated weight is:</font>
<font style="color:#181818;">$w_j - \eta(\frac{\partial\ell}{\partial w_j} + 2\alpha w_j) = (1-2\eta\alpha)w_j - \eta(\frac{\partial\ell}{\partial w_j})$</font>
----
<h3>Gradients of loss with penalty</h3>
New loss function, for some $\alpha>0$: $\ell({\bf w}) + \alpha||{\bf w}||^2$
- <font size=+2>Note that for the $j^{th}$ parameter $w_j$, the new partial derivative w.r.t. $w_j$ is $\frac{\partial\ell}{\partial w_j} + (2\alpha)w_j$.</font>
- <font size=+2>In an update, we subtract a _constant_ times $w_j$, in addition to subtracting the partial derivative of $\ell$.</font>
- <font size=+2>**Weight decay.** To work well in gradient descent, it is best to choose $\alpha\in(0,1/2)$, since with learning rate $\eta\in(0,1)$ the updated weight is (code sketch below):</font>
<font>$w_j - \eta(\frac{\partial\ell}{\partial w_j} + 2\alpha w_j) = (1-2\eta\alpha)w_j - \eta(\frac{\partial\ell}{\partial w_j})$</font>
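
A minimal NumPy sketch of this update (the gradient function, learning rate, and example loss below are placeholder choices, not from the slides):

```python
import numpy as np

def descent_step(w, grad_loss, eta=0.1, alpha=0.01):
    """One gradient-descent step on  l(w) + alpha * ||w||^2.

    Equivalent to  w <- (1 - 2*eta*alpha) * w - eta * grad_loss(w):
    each step shrinks ("decays") the weights by the factor (1 - 2*eta*alpha),
    which stays in (0, 1) when eta is in (0, 1) and alpha is in (0, 1/2).
    """
    return w - eta * (grad_loss(w) + 2 * alpha * w)

# Example: l(w) = ||w - 1||^2, whose gradient is 2*(w - 1).
w = np.array([5.0, -3.0])
for _ in range(100):
    w = descent_step(w, grad_loss=lambda v: 2.0 * (v - 1.0))
# w converges to about 1/(1 + alpha) in each coordinate: pulled slightly
# toward 0 relative to the unpenalized minimizer (1, 1).
```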