# Minimizing MSE for linear models

<!-- Put the link to this slide here so people can follow -->
slide: https://hackmd.io/@ccornwell/min-MSE

---

<h3>Mean Squared Error (MSE)</h3>

> considered as function of $w,b$.

- <font size=+2>Recall: $\operatorname{MSE}(w,b) = \frac{1}{m}\sum_{i=1}^m(h(x_i)-y_i)^2$, where $h(x_i)=wx_i + b$.</font>
- <font size=+2>The *least squares regression* (LSR) line has $w=\hat w$ and $b=\hat b$, the values minimizing $\operatorname{MSE}(w,b)$.</font>
- <font size=+2>e.g. for any constant prediction $b_0$, $\operatorname{MSE}(\hat w, \hat b) \le \frac{1}{m}\sum_{i=1}^m(b_0 - y_i)^2.$</font>

----

<h3>Aside: "Coeff. of Determination"</h3>

- <font size=+2>A different overall scale of the observed $y_i$ values changes $\operatorname{MSE}$. (*How well does the data fit the LSR line if $\operatorname{MSE}(\hat w, \hat b)=2.5$?*)</font>

<span style="color:#181818;">

- <font size=+2>Using $\bar y$ for the average of the $y_i$, and $\hat y_i = \hat wx_i + \hat b$, $R^2 = 1 - \dfrac{\sum_i(\hat y_i - y_i)^2}{\sum_i(\bar y - y_i)^2}.$</font>
- <font size=+2>Claim: $0\le R^2\le 1$. ($R^2 \le 1$ since the ratio is $\ge 0$; $R^2 \ge 0$ since, taking $b_0 = \bar y$ above, the ratio is $\le 1$.)</font>
- <font size=+2>$R^2 = 1$ only if the data is exactly *on* the LSR line.</font>

</span>

----

<h3>Aside: "Coeff. of Determination"</h3>

- <font size=+2>A different overall scale of the observed $y_i$ values changes $\operatorname{MSE}$. (*How well does the data fit the LSR line if $\operatorname{MSE}(\hat w, \hat b)=2.5$?*)</font>
- <font size=+2>Using $\bar y$ for the average of the $y_i$, and $\hat y_i = \hat wx_i + \hat b$, $R^2 = 1 - \dfrac{\sum_i(\hat y_i - y_i)^2}{\sum_i(\bar y - y_i)^2}.$</font>
- <font size=+2>Claim: $0\le R^2\le 1$. ($R^2 \le 1$ since the ratio is $\ge 0$; $R^2 \ge 0$ since, taking $b_0 = \bar y$ above, the ratio is $\le 1$.)</font>
- <font size=+2>$R^2 = 1$ only if the data is exactly *on* the LSR line.</font>

---

<h3>MSE as loss function</h3>

> considered as function of $w,b$.
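Before differentiating, it may help to see $\operatorname{MSE}$ computed as a function of $w,b$ for fixed data; a minimal sketch assuming NumPy, with made-up data:

```python
import numpy as np

# Toy data (made up for illustration); it lies exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

def mse(w, b):
    """MSE(w, b) = (1/m) * sum_i (w*x_i + b - y_i)^2."""
    return np.mean((w * x + b - y) ** 2)

print(mse(2.0, 1.0))  # 0.0 -- the data sits exactly on this line
print(mse(0.0, 4.0))  # 5.0 -- a constant prediction b_0 = 4
```

With the data on the line $y = 2x+1$, the minimum value $0$ is attained at $(w,b)=(2,1)$; any other choice, such as the constant prediction, gives a larger value.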
- <font size=+2>$\dfrac{\partial}{\partial w}\operatorname{MSE}(w,b) = \frac{2}{m}\sum_{i=1}^m(wx_i+b-y_i)x_i$</font>
- <font size=+2>$\dfrac{\partial}{\partial b}\operatorname{MSE}(w,b) = \frac{2}{m}\sum_{i=1}^m(wx_i+b-y_i)$</font>

----

<h3>The gradient</h3>

- <font size=+2>The gradient of a function $f:\mathbb R^n \to \mathbb R$: $$\nabla f = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right).$$ </font>
- <font size=+2>If a local min/max occurs at ${\bf w}$, then $\nabla f({\bf w}) = \vec{0}$.</font>

<span style="color:#181818;">

- <font size=+2>For ${\bf w}\in\mathbb R^n$, the direction of $-\nabla f({\bf w})$ is the direction of most rapid decrease of $f$.</font>
- <font size=+2>i.e. for small $\varepsilon > 0$, the value $f({\bf w} - \varepsilon\nabla f({\bf w}))$ is lower by "the most".</font>

</span>

----

<h3>The gradient</h3>

- <font size=+2>The gradient of a function $f:\mathbb R^n \to \mathbb R$: $$\nabla f = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right).$$ </font>
- <font size=+2>If a local min/max occurs at ${\bf w}$, then $\nabla f({\bf w}) = \vec{0}$.</font>
- <font size=+2>For ${\bf w}\in\mathbb R^n$, the direction of $-\nabla f({\bf w})$ is the direction of most rapid decrease of $f$.</font>
- <font size=+2>i.e. for small $\varepsilon > 0$, the value $f({\bf w} - \varepsilon\nabla f({\bf w}))$ is lower by "the most".</font>

----

<h3>MSE as loss function</h3>

> considered as function of $w,b$.

- <font size=+2>$\dfrac{\partial}{\partial w}\operatorname{MSE}(w,b) = \frac{2}{m}\sum_{i=1}^m(wx_i+b-y_i)x_i$</font>
- <font size=+2>$\dfrac{\partial}{\partial b}\operatorname{MSE}(w,b) = \frac{2}{m}\sum_{i=1}^m(wx_i+b-y_i)$</font>

<font size=+2>Given data and *any* $w, b$, we can calculate the right-hand sides above (the gradient of $\operatorname{MSE}$); subtract $\varepsilon$ times those values from $w$ and $b$, respectively. This should decrease $\operatorname{MSE}$.</font>

---

<h3>Meaning of zero partials</h3>

- <font size=+2>$\dfrac{\partial}{\partial b}\operatorname{MSE}(w,b) = 0\ \Rightarrow\ \frac{1}{m}\sum_ih(x_i) = \frac{1}{m}\sum_iy_i$</font>
- <font size=+2>$\hat y = h(x)$ is vertically balanced between the observed $y$-values</font>

<span style="color:#181818;">

- <font size=+2>$\dfrac{\partial}{\partial w}\operatorname{MSE}(w,b) = 0\ \Rightarrow\ \frac{1}{m}\sum_ih(x_i)x_i = \frac{1}{m}\sum_iy_ix_i$</font>
- <font size=+2>Like a center-of-mass balance: if $h(x_i)-y_i$ were forces on the line, they wouldn't cause rotation about $(0,b)$</font>

</span>

----

<h3>Meaning of zero partials</h3>

- <font size=+2>$\dfrac{\partial}{\partial b}\operatorname{MSE}(w,b) = 0\ \Rightarrow\ \frac{1}{m}\sum_ih(x_i) = \frac{1}{m}\sum_iy_i$</font>
- <font size=+2>$\hat y = h(x)$ is vertically balanced between the observed $y$-values</font>
- <font size=+2>$\dfrac{\partial}{\partial w}\operatorname{MSE}(w,b) = 0\ \Rightarrow\ \frac{1}{m}\sum_ih(x_i)x_i = \frac{1}{m}\sum_iy_ix_i$</font>
- <font size=+2>Like a center-of-mass balance: if $h(x_i)-y_i$ were forces on the line, they wouldn't cause rotation about $(0,b)$</font>

---

<h3>Solving for where gradient is zero</h3>

- <font size=+2>Rewrite $\nabla\operatorname{MSE}(w,b) = \vec 0$.</font>
- <font size=+2>$\dfrac{\partial}{\partial w}\operatorname{MSE}(w,b) = \frac{2}{m}\sum_{i=1}^m(wx_i+b-y_i)x_i$</font>
- <font size=+2>$\dfrac{\partial}{\partial b}\operatorname{MSE}(w,b) = \frac{2}{m}\sum_{i=1}^m(wx_i+b-y_i)$</font>

<span style="color:#181818;">

- <font size=+2>For short, write $s = \sum_ix_i,\quad u=\sum_ix_i^2$; then we want $w,b$ so that $$sw + mb = \sum_iy_i\quad \text{ and }\quad uw + sb = \sum_ix_iy_i$$ </font>

</span>

----

<h3>Solving for where gradient is zero</h3>

- <font size=+2>Rewrite $\nabla\operatorname{MSE}(w,b) = \vec 0$.</font>
- <font size=+2>$\dfrac{\partial}{\partial w}\operatorname{MSE}(w,b) = \frac{2}{m}\sum_{i=1}^m(wx_i+b-y_i)x_i$</font>
- <font size=+2>$\dfrac{\partial}{\partial b}\operatorname{MSE}(w,b) = \frac{2}{m}\sum_{i=1}^m(wx_i+b-y_i)$</font>
- <font size=+2>For short, write $s = \sum_ix_i,\quad u=\sum_ix_i^2$; then we want $w,b$ so that $$sw + mb = \sum_iy_i\quad \text{ and }\quad uw + sb = \sum_ix_iy_i$$ </font>

----

<h3>Solving for where gradient is zero</h3>

- <font size=+2>Rewrite $\nabla\operatorname{MSE}(w,b) = \vec 0$.</font>
- <font size=+2>Using $s, u$ as before, also write $t = \sum_iy_i$ and $v = \sum_ix_iy_i$. Then the gradient of MSE is zero when</font>

$\begin{bmatrix}m&s\\ s&u\end{bmatrix}\begin{bmatrix}b\\ w\end{bmatrix} = \begin{bmatrix}t\\ v\end{bmatrix}$

---

<h3>Discussion</h3>

<br />
<br />
<br />
<br />
<br />
<br />
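----

<h3>Checking the two approaches in code</h3>

As a sketch for discussion (assuming NumPy, with made-up data), both routes above can be compared: solving the $2\times 2$ system gives $\hat w,\hat b$ directly, while gradient descent repeatedly subtracts $\varepsilon$ times the gradient of $\operatorname{MSE}$ from $w$ and $b$.

```python
import numpy as np

# Made-up data, roughly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

m = len(x)
s, u = x.sum(), (x * x).sum()   # s = sum x_i,   u = sum x_i^2
t, v = y.sum(), (x * y).sum()   # t = sum y_i,   v = sum x_i*y_i

# Closed form: solve [[m, s], [s, u]] [b, w]^T = [t, v]^T.
b_hat, w_hat = np.linalg.solve([[m, s], [s, u]], [t, v])

# Gradient descent: step against the gradient of MSE(w, b).
w, b, eps = 0.0, 0.0, 0.01
for _ in range(20000):
    r = w * x + b - y                   # residuals h(x_i) - y_i
    w -= eps * (2 / m) * (r * x).sum()  # dMSE/dw
    b -= eps * (2 / m) * r.sum()        # dMSE/db

print(w_hat, b_hat)  # least-squares slope and intercept
print(w, b)          # should agree to many decimal places
```

Both should land on the same $(\hat w, \hat b)$; the system gives it in one step, while descent needs many small steps, with $\varepsilon$ small enough that each step actually decreases $\operatorname{MSE}$.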