# Minimizing MSE for linear models
slide: https://hackmd.io/@ccornwell/min-MSE
---
<h3>Mean Squared Error (MSE)</h3>
> considered as function of $w,b$.
- <font size=+2>Recall: $\operatorname{MSE}(w,b) = \frac{1}{m}\sum_{i=1}^m(h(x_i)-y_i)^2$, where $h(x_i)=wx_i + b$.</font>
- <font size=+2>The *least squares regression* (LSR) line has $w=\hat w$, and $b=\hat b$, minimizing $\operatorname{MSE}(w,b)$.</font>
- <font size=+2>e.g. for any $b_0$, $\operatorname{MSE}(\hat w, \hat b) \le \frac{1}{m}\sum_{i=1}^m(b_0 - y_i)^2.$</font>
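The definition above can be checked numerically. A minimal Python sketch (the data points are made up for illustration):

```python
# MSE of a candidate line h(x) = w*x + b on small, made-up data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def mse(w, b):
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / m

# A line near the data beats any horizontal line, as the LSR bound suggests.
print(mse(2.0, 0.0))
print(mse(0.0, 5.0))
```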
----
<h3>Aside: "Coeff. of Determination"</h3>
- <font size=+2>A different overall scale of the observed $y_i$ values changes $\operatorname{MSE}$, so the raw value is hard to judge. (*How well does the data fit the LSR line if $\operatorname{MSE}(\hat w, \hat b)=2.5$?*)</font>
<span style="color:#181818;">
- <font size=+2>Using $\bar y$ for the average of the $y_i$,
$R^2 = 1 - \dfrac{\sum_i(\hat y_i - y_i)^2}{\sum_i(\bar y - y_i)^2}.$</font>
- <font size=+2>Claim: $0\le R^2\le 1$.</font>
- <font size=+2>$R^2 = 1$ only if the data lie exactly *on* the LSR line.</font>
</span>
----
<h3>Aside: "Coeff. of Determination"</h3>
- <font size=+2>A different overall scale of the observed $y_i$ values changes $\operatorname{MSE}$, so the raw value is hard to judge. (*How well does the data fit the LSR line if $\operatorname{MSE}(\hat w, \hat b)=2.5$?*)</font>
- <font size=+2>Using $\bar y$ for the average of the $y_i$,
$R^2 = 1 - \dfrac{\sum_i(\hat y_i - y_i)^2}{\sum_i(\bar y - y_i)^2}.$</font>
- <font size=+2>Claim: $0\le R^2\le 1$.</font>
- <font size=+2>$R^2 = 1$ only if the data lie exactly *on* the LSR line.</font>
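As a sanity check on the claim, $R^2$ for a least-squares fit takes only a few lines of Python; the data below and the closed-form coefficients (a standard formula, equivalent to setting both partial derivatives of $\operatorname{MSE}$ to zero) are illustrative:

```python
# R^2 for the LSR fit on made-up data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
m = len(xs)
xbar, ybar = sum(xs) / m, sum(ys) / m

# Closed-form LSR coefficients (standard least-squares formulas).
w_hat = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
b_hat = ybar - w_hat * xbar
y_hat = [w_hat * x + b_hat for x in xs]

r2 = 1 - sum((yh - y) ** 2 for yh, y in zip(y_hat, ys)) / sum((ybar - y) ** 2 for y in ys)
# r2 lies in [0, 1], and is close to 1 when the data sit near the line
```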
---
<h3>MSE as loss function</h3>
> considered as function of $w,b$.
- <font size=+2>$\dfrac{\partial}{\partial w}\operatorname{MSE}(w,b) = \frac{2}{m}\sum_{i=1}^m(wx_i+b-y_i)x_i$</font>
- <font size=+2>$\dfrac{\partial}{\partial b}\operatorname{MSE}(w,b) = \frac{2}{m}\sum_{i=1}^m(wx_i+b-y_i)$</font>
----
<h3>The gradient</h3>
- <font size=+2>The gradient of a function $f:\mathbb R^n \to \mathbb R$:
$$\nabla f = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right).$$
</font>
- <font size=+2>if a local min/max of $f$ occurs at ${\bf w}$, then $\nabla f({\bf w}) = \vec{0}$.</font>
<span style="color:#181818;">
- <font size=+2>for ${\bf w}\in\mathbb R^n$, direction of $-\nabla f({\bf w})$ is direction of most rapid decrease of $f$.</font>
- <font size=+2>i.e. for small $\varepsilon > 0$, the value $f({\bf w} - \varepsilon\nabla f({\bf w}))$ is lower than $f({\bf w})$ by "the most" among steps of that size...</font>
</span>
----
<h3>The gradient</h3>
- <font size=+2>The gradient of a function $f:\mathbb R^n \to \mathbb R$:
$$\nabla f = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right).$$
</font>
- <font size=+2>if a local min/max of $f$ occurs at ${\bf w}$, then $\nabla f({\bf w}) = \vec{0}$.</font>
- <font size=+2>for ${\bf w}\in\mathbb R^n$, direction of $-\nabla f({\bf w})$ is direction of most rapid decrease of $f$.</font>
- <font size=+2>i.e. for small $\varepsilon > 0$, the value $f({\bf w} - \varepsilon\nabla f({\bf w}))$ is lower than $f({\bf w})$ by "the most" among steps of that size...</font>
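The decrease property is easy to see on a simple function. A sketch with $f(x,y) = x^2 + y^2$, whose gradient is $(2x, 2y)$ (the point and step size are arbitrary choices):

```python
# Stepping a small amount against the gradient lowers f.
def f(x, y):
    return x * x + y * y

def grad_f(x, y):
    return (2 * x, 2 * y)

x, y = 3.0, 4.0
eps = 0.1
gx, gy = grad_f(x, y)
x_new, y_new = x - eps * gx, y - eps * gy
# f(x_new, y_new) is smaller than f(x, y)
```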
----
<h3>MSE as loss function</h3>
> considered as function of $w,b$.
- <font size=+2>$\dfrac{\partial}{\partial w}\operatorname{MSE}(w,b) = \frac{2}{m}\sum_{i=1}^m(wx_i+b-y_i)x_i$</font>
- <font size=+2>$\dfrac{\partial}{\partial b}\operatorname{MSE}(w,b) = \frac{2}{m}\sum_{i=1}^m(wx_i+b-y_i)$</font>
<font size=+2>Given data and *any* $w, b$, we can calculate the right-hand sides above (the gradient of $\operatorname{MSE}$); subtracting $\varepsilon$ times those values from $w$ and $b$, respectively, should decrease $\operatorname{MSE}$.</font>
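That update rule can be sketched directly in Python; the data, the step size $\varepsilon = 0.01$, and the iteration count are made-up choices for illustration:

```python
# Gradient descent on MSE(w, b) using the two partial derivatives above.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
m = len(xs)

def mse(w, b):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / m

def grads(w, b):
    dw = (2 / m) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
    db = (2 / m) * sum((w * x + b - y) for x, y in zip(xs, ys))
    return dw, db

w, b, eps = 0.0, 0.0, 0.01
for _ in range(1000):
    dw, db = grads(w, b)
    w, b = w - eps * dw, b - eps * db
# MSE(w, b) is now far below MSE(0, 0)
```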
---
<h3>Meaning of zero partials</h3>
- <font size=+2>$\dfrac{\partial}{\partial b}\operatorname{MSE}(w,b) = 0\ \Rightarrow\ \frac{1}{m}\sum_ih(x_i) = \frac{1}{m}\sum_iy_i$</font>
- <font size=+2>$\hat y = h(x)$ is vertically balanced among the observed $y$-values: the mean of the predictions equals the mean of the $y_i$</font>
<span style="color:#181818;">
- <font size=+2>$\dfrac{\partial}{\partial w}\operatorname{MSE}(w,b) = 0\ \Rightarrow\ \frac{1}{m}\sum_ih(x_i)x_i = \frac{1}{m}\sum_iy_ix_i$</font>
- <font size=+2>Like a center-of-mass balance: if each $h(x_i)-y_i$ were a force on the line, together they would cause no rotation about $(0,b)$</font>
</span>
----
<h3>Meaning of zero partials</h3>
- <font size=+2>$\dfrac{\partial}{\partial b}\operatorname{MSE}(w,b) = 0\ \Rightarrow\ \frac{1}{m}\sum_ih(x_i) = \frac{1}{m}\sum_iy_i$</font>
- <font size=+2>$\hat y = h(x)$ is vertically balanced among the observed $y$-values: the mean of the predictions equals the mean of the $y_i$</font>
- <font size=+2>$\dfrac{\partial}{\partial w}\operatorname{MSE}(w,b) = 0\ \Rightarrow\ \frac{1}{m}\sum_ih(x_i)x_i = \frac{1}{m}\sum_iy_ix_i$</font>
- <font size=+2>Like a center-of-mass balance: if each $h(x_i)-y_i$ were a force on the line, together they would cause no rotation about $(0,b)$</font>
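Both balance conditions can be verified numerically for an LSR fit; a sketch on made-up data, using the standard closed-form coefficients:

```python
# Check: at the LSR fit, predictions and observations have equal means,
# and equal x-weighted means (the two zero-partial conditions above).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
m = len(xs)
xbar, ybar = sum(xs) / m, sum(ys) / m
w_hat = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
b_hat = ybar - w_hat * xbar
h = [w_hat * x + b_hat for x in xs]

mean_pred, mean_obs = sum(h) / m, sum(ys) / m
xmean_pred = sum(hi * x for hi, x in zip(h, xs)) / m
xmean_obs = sum(y * x for y, x in zip(ys, xs)) / m
# mean_pred == mean_obs and xmean_pred == xmean_obs (up to rounding)
```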
---
<h3>Solving for where gradient is zero</h3>
- <font size=+2>Rewrite $\nabla\operatorname{MSE}(w,b) = \vec 0$.</font>
- <font size=+2>$\dfrac{\partial}{\partial w}\operatorname{MSE}(w,b) = \frac{2}{m}\sum_{i=1}^m(wx_i+b-y_i)x_i$</font>
- <font size=+2>$\dfrac{\partial}{\partial b}\operatorname{MSE}(w,b) = \frac{2}{m}\sum_{i=1}^m(wx_i+b-y_i)$</font>
<span style="color:#181818;">
- <font size=+2>For short, write $s = \sum_ix_i,\quad u=\sum_ix_i^2$; then want $w,b$ so
$$sw + mb = \sum_iy_i\quad \text{ and }\quad uw + sb = \sum_ix_iy_i$$
</font>
</span>
----
<h3>Solving for where gradient is zero</h3>
- <font size=+2>Rewrite $\nabla\operatorname{MSE}(w,b) = \vec 0$.</font>
- <font size=+2>$\dfrac{\partial}{\partial w}\operatorname{MSE}(w,b) = \frac{2}{m}\sum_{i=1}^m(wx_i+b-y_i)x_i$</font>
- <font size=+2>$\dfrac{\partial}{\partial b}\operatorname{MSE}(w,b) = \frac{2}{m}\sum_{i=1}^m(wx_i+b-y_i)$</font>
- <font size=+2>For short, write $s = \sum_ix_i,\quad u=\sum_ix_i^2$; then want $w,b$ so
$$sw + mb = \sum_iy_i\quad \text{ and }\quad uw + sb = \sum_ix_iy_i$$
</font>
----
<h3>Solving for where gradient is zero</h3>
- <font size=+2>Rewrite $\nabla\operatorname{MSE}(w,b) = \vec 0$.</font>
- <font size=+2>Using $s, u$ as before, also write $t = \sum_iy_i$ and $v = \sum_ix_iy_i$. Then the gradient of $\operatorname{MSE}$ is zero when</font>
$\begin{bmatrix}m&s\\ s&u\end{bmatrix}\begin{bmatrix}b\\ w\end{bmatrix} = \begin{bmatrix}t\\ v\end{bmatrix}$
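Since the matrix is $2\times 2$, the system can be solved directly (e.g. by Cramer's rule). A Python sketch on made-up data, assuming the $x_i$ are not all equal so the determinant is nonzero:

```python
# Solve [[m, s], [s, u]] [b, w]^T = [t, v]^T for the LSR coefficients.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
m = len(xs)
s = sum(xs)
u = sum(x * x for x in xs)
t = sum(ys)
v = sum(x * y for x, y in zip(xs, ys))

det = m * u - s * s  # nonzero when the x_i are not all equal
b_hat = (t * u - s * v) / det  # Cramer's rule, first column replaced
w_hat = (m * v - s * t) / det  # Cramer's rule, second column replaced
```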
---
<h3>Discussion</h3>
<br />
<br />
<br />
<br />
<br />
<br />