# Gradient descent
<!-- Put the link to this slide here so people can follow -->
slide: https://hackmd.io/@ccornwell/gradient-descent
---
<h3>Using the gradient for optimization</h3>
- <font size=+2>Sometimes we cannot solve exactly for where $\nabla f = \vec 0$</font>
- <font size=+2>Since, for ${\bf w}\in\mathbb R^{n+1}$, the direction of $-\nabla f({\bf w})$ is the direction of most rapid decrease of $f$,</font>
- <font size=+2>we can lower the value of $f$ by nudging the input: $f({\bf w} - \varepsilon\nabla f({\bf w})) < f({\bf w})$ for small enough $\varepsilon>0$ (when $\nabla f({\bf w})\ne\vec 0$)</font>
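- <font size=+2>e.g. for $f(w) = w^2$ and $w = 3$: $\nabla f(3) = 6$, so with $\varepsilon = 0.1$ we get $f(3 - 0.1\cdot 6) = f(2.4) = 5.76 < 9 = f(3)$</font>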
---
<h3>Gradient Descent Procedure</h3>
- <font size=+2>Fix a "learning rate" $\eta>0$ (this plays the role of $\varepsilon$ above). Start with initial input parameters ${\bf w^{(0)}}$.</font>
- <font size=+2>($t=0$ to start) Compute $\nabla f({\bf w^{(t)}})$ and update parameters: ${\bf w^{(t+1)}} = {\bf w^{(t)}} - \eta\nabla f({\bf w^{(t)}})$.</font>
- <font size=+2>Increase $t$ and repeat the previous step until some stopping condition is met (e.g. $|\eta\nabla f({\bf w^{(t)}})|$ has fallen below a specified tolerance); see the sketch below.</font>
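- <font size=+1>A minimal sketch of this loop (illustrative names; `grad_f` stands for whatever function computes $\nabla f$):</font>
```python
import numpy as np

def gradient_descent(grad_f, w0, eta=0.1, tolerance=0.005, max_steps=10000):
    """Repeat w <- w - eta*grad_f(w) until |eta*grad_f(w)| <= tolerance."""
    w = np.asarray(w0, dtype=float)
    for t in range(max_steps):
        step = eta * grad_f(w)                # eta * grad f(w^(t))
        if np.linalg.norm(step) <= tolerance:
            break                             # stopping condition met
        w = w - step                          # w^(t+1) = w^(t) - eta*grad f(w^(t))
    return w
```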
---
<h3>Linear regression example</h3>
- <font size=+2>For linear regression, $f=\operatorname{MSE}$; say we have 100 data points whose $x$- and $y$-coordinates are stored in arrays `x` and `y`,</font>
- <font size=+2>and initial parameters `wbold = np.array([0,0])`.</font>
```python
import numpy as np

eta = 0.1
tolerance = 0.005
# partials of MSE w.r.t. w and b (vectorized with NumPy); 0.02 = 2/100 for 100 points
partial_w = 0.02*np.sum((wbold[0]*x + wbold[1] - y)*x)
partial_b = 0.02*np.sum(wbold[0]*x + wbold[1] - y)
grad_mse = np.array([partial_w, partial_b])
while np.linalg.norm(eta*grad_mse) > tolerance:
    # update wbold
    wbold = wbold - eta*grad_mse
    # recompute grad_mse at the new wbold
    partial_w = 0.02*np.sum((wbold[0]*x + wbold[1] - y)*x)
    partial_b = 0.02*np.sum(wbold[0]*x + wbold[1] - y)
    grad_mse = np.array([partial_w, partial_b])
```
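- <font size=+1>For completeness, one way the arrays could be set up before running the code above (an illustrative sketch: the true line $y=-2x+1$ matches the next slide, but the $x$-range, noise level, and seed are assumptions):</font>
```python
import numpy as np

rng = np.random.default_rng(0)                # fixed seed, chosen for reproducibility
x = rng.uniform(-1, 1, size=100)              # 100 x-coordinates (assumed range)
y = -2*x + 1 + rng.normal(0, 0.3, size=100)   # noisy points near y = -2x + 1 (assumed noise)
wbold = np.array([0, 0])                      # initial parameters (w, b), as on this slide
```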
---
<h3>Linear regression example</h3>

- <font size=+2>True least-squares regression line: $y = -2x + 1$;</font>
- <font size=+2>Slope and intercept from gradient descent: $w \approx -1.850$, $b \approx 0.974$.</font>
- <font size=+1>Used the code from the previous slide; number of updates: 51; $\approx$ 0.3 seconds</font>
---
<h3>Does gradient descent converge (to minimum)?</h3>
- <font size=+2>Not always, even for nice functions.</font>
- <font size=+2>e.g. say $f(x) = x^2$</font>

- <font size=+2>$\eta = 1.05$ on the left (diverges); $\eta = 0.95$ on the right (converges); see the numerical check below</font>
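- <font size=+1>A small numerical check of the two cases (the starting point $x^{(0)} = 1$ and step count are assumptions; the plots' exact setup isn't shown):</font>
```python
def descend(eta, x0=1.0, steps=10):
    """Gradient descent on f(x) = x^2, whose derivative is 2x."""
    x = x0
    for _ in range(steps):
        x = x - eta * 2 * x   # x^(t+1) = x^(t) - eta * f'(x^(t))
    return x

print(descend(0.95))   # ~0.35: |x| shrinks by a factor 0.9 each step (converges toward 0)
print(descend(1.05))   # ~2.59: |x| grows by a factor 1.1 each step (diverges)
```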
---
<h3>Does gradient descent converge (to minimum)?</h3>
- <font size=+2>In some special cases, yes.</font>
- <font size=+2>Suppose that $f$ is differentiable and **convex**...</font>
<span style="color:#181818;">
- <font size=+2>Suppose $\nabla f$ is also **Lipschitz continuous** with Lipschitz constant $L>0$...
and that $\eta \le 1/L$</font>
</span>
<span style="color:#181818;background-color:#181818;">
> Theorem: If $\hat{\bf w}$ is (global) minimizer of $f$, then
$|f({\bf w^{(t)}}) - f(\hat{\bf w})| \le \frac{|{\bf w^{(0)}} - \hat{\bf w}|^2}{2\eta t}$
</span>
----
<h3>Does gradient descent converge (to minimum)?</h3>
- <font size=+2>In some special cases, yes.</font>
- <font size=+2>Suppose that $f$ is differentiable and **convex**...</font>
- <font size=+2>Suppose $\nabla f$ is also **Lipschitz continuous** with Lipschitz constant $L>0$...
and that $\eta \le 1/L$</font>
<span style="color:#181818;background-color:#181818;">
> Theorem: If $\hat{\bf w}$ is (global) minimizer of $f$, then
$|f({\bf w^{(t)}}) - f(\hat{\bf w})| \le \frac{|{\bf w^{(0)}} - \hat{\bf w}|^2}{2\eta t}$
</span>
----
<h3>Does gradient descent converge (to minimum)?</h3>
- <font size=+2>In some special cases, yes.</font>
- <font size=+2>Suppose that $f$ is differentiable and **convex**...</font>
- <font size=+2>Suppose $\nabla f$ is also **Lipschitz continuous** with Lipschitz constant $L>0$...
and that $\eta \le 1/L$</font>
> Theorem: If $\hat{\bf w}$ is (global) minimizer of $f$, then
$|f({\bf w^{(t)}}) - f(\hat{\bf w})| \le \frac{|{\bf w^{(0)}} - \hat{\bf w}|^2}{2\eta t}$
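- <font size=+1>For instance, $f(x)=x^2$ has $|f'(a)-f'(b)| = 2|a-b|$, so $L=2$; taking $\eta = 1/L = \tfrac12$, the bound reads $f(x^{(t)}) - f(0) \le \frac{(x^{(0)})^2}{t}$ (and indeed $x^{(1)} = x^{(0)} - \tfrac12\cdot 2x^{(0)} = 0$, so the minimum is reached after one step).</font>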
---
<h3>
Discussion
</h3>
<br />
<br />
<br />
<br />
<br />
<br />
<br />
---
<h3>Wrap up</h3>
- <font size=+2>The MSE does satisfy the conditions (convex, and its gradient is Lipschitz continuous).</font>
<span style="color:#181818;">
- <font size=+2>Learning a good logistic regression model (learning the half-space determined by ${\bf w}\cdot{\bf x} + b = 0$) uses a loss function for which convergence is guaranteed.</font>
- <font size=+2>Many loss functions we will want to use are not convex.</font>
</span>
----
<h3>Wrap up</h3>
- <font size=+2>The MSE does satisfy the conditions (convex, and its gradient is Lipschitz continuous).</font>
- <font size=+2>Learning a good logistic regression model (learning the half-space determined by ${\bf w}\cdot{\bf x} + b = 0$) uses a loss function for which convergence is guaranteed.</font>
- <font size=+2>Many loss functions we will want to use are not convex.</font>
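- <font size=+1>One way to check the MSE claim (writing $\tilde X$ for the matrix of data points with a column of 1s appended, a notation not used above): the MSE Hessian is $\frac{2}{n}\tilde X^\top\tilde X$, which is positive semi-definite, so MSE is convex, and its gradient is Lipschitz with $L = \frac{2}{n}\lambda_{\max}(\tilde X^\top\tilde X)$.</font>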
---
<h3>
Discussion
</h3>
<br />
<br />
<br />
<br />
<br />
<br />
<br />
{"metaMigratedAt":"2023-06-15T20:14:29.433Z","metaMigratedFrom":"YAML","title":"Gradient descent","breaks":true,"description":"View the slide with \"Slide Mode\".","contributors":"[{\"id\":\"da8891d8-b47c-4b6d-adeb-858379287e60\",\"add\":5207,\"del\":44}]"}