# Gradient descent
<!-- Put the link to this slide here so people can follow -->
slide: https://hackmd.io/@ccornwell/gradient-descent
---
<h3>Using the gradient for optimization</h3>
- <font size=+2>Sometimes we cannot solve exactly for where $\nabla f = \vec 0$</font>
- <font size=+2>Since, for ${\bf w}\in\mathbb R^{n+1}$, the direction of $-\nabla f({\bf w})$ is the direction of most rapid decrease of $f$,</font>
- <font size=+2>we can lower the value of $f$ by nudging the input: $f({\bf w} - \varepsilon\nabla f({\bf w})) < f({\bf w})$ for small enough $\varepsilon>0$ (when $\nabla f({\bf w})\ne\vec 0$)</font>
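- <font size=+2>e.g. for $f(w) = w^2$ and $w = 3$: $\nabla f(3) = 6$, so with $\varepsilon = 0.1$ we get $f(3 - 0.1\cdot 6) = f(2.4) = 5.76 < 9 = f(3)$</font>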
---
<h3>Gradient Descent Procedure</h3>
- <font size=+2>Fix a "learning rate" $\eta>0$ (this plays the role of $\varepsilon$ above). Start with initial input parameters ${\bf w^{(0)}}$.</font>
- <font size=+2>($t=0$ to start) Compute $\nabla f({\bf w^{(t)}})$ and update parameters: ${\bf w^{(t+1)}} = {\bf w^{(t)}} - \eta\nabla f({\bf w^{(t)}})$.</font>
- <font size=+2>Increase $t$ and repeat the previous step until some stopping condition is met (e.g. $|\eta\nabla f({\bf w^{(t)}})|$ has fallen below a specified tolerance); see the sketch below.</font>
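- <font size=+1>A minimal sketch of this loop (illustrative names; `grad_f` stands for whatever function computes $\nabla f$):</font>
```python
import numpy as np

def gradient_descent(grad_f, w0, eta=0.1, tolerance=0.005, max_steps=10000):
    """Repeat w <- w - eta*grad_f(w) until |eta*grad_f(w)| <= tolerance."""
    w = np.asarray(w0, dtype=float)
    for t in range(max_steps):
        step = eta * grad_f(w)                # eta * grad f(w^(t))
        if np.linalg.norm(step) <= tolerance:
            break                             # stopping condition met
        w = w - step                          # w^(t+1) = w^(t) - eta*grad f(w^(t))
    return w
```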
---
<h3>Linear regression example</h3>
- <font size=+2>For linear regression, $f=\operatorname{MSE}$; say we have 100 data points whose $x$- and $y$-coordinates are stored in arrays `x` and `y`,</font>
- <font size=+2>and initial parameters `wbold = np.array([0,0])`.</font>
```python
import numpy as np

eta = 0.1
tolerance = 0.005
# partials of MSE w.r.t. w and b (vectorized with NumPy); 0.02 = 2/100 for 100 points
partial_w = 0.02*np.sum((wbold[0]*x + wbold[1] - y)*x)
partial_b = 0.02*np.sum(wbold[0]*x + wbold[1] - y)
grad_mse = np.array([partial_w, partial_b])
while np.linalg.norm(eta*grad_mse) > tolerance:
    # update wbold
    wbold = wbold - eta*grad_mse
    # recompute grad_mse at the new wbold
    partial_w = 0.02*np.sum((wbold[0]*x + wbold[1] - y)*x)
    partial_b = 0.02*np.sum(wbold[0]*x + wbold[1] - y)
    grad_mse = np.array([partial_w, partial_b])
```
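- <font size=+1>For completeness, one way the arrays could be set up before running the code above (an illustrative sketch: the true line $y=-2x+1$ matches the next slide, but the $x$-range, noise level, and seed are assumptions):</font>
```python
import numpy as np

rng = np.random.default_rng(0)                # fixed seed, chosen for reproducibility
x = rng.uniform(-1, 1, size=100)              # 100 x-coordinates (assumed range)
y = -2*x + 1 + rng.normal(0, 0.3, size=100)   # noisy points near y = -2x + 1 (assumed noise)
wbold = np.array([0, 0])                      # initial parameters (w, b), as on this slide
```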
---
<h3>Linear regression example</h3>

- <font size=+2>True least-squares regression line: $y = -2x + 1$;</font>
- <font size=+2>Slope and intercept from gradient descent: $w \approx -1.850$, $b \approx 0.974$.</font>
- <font size=+1>Used the code from the previous slide; number of updates: 51; $\approx$ 0.3 seconds</font>
---
<h3>Does gradient descent converge (to minimum)?</h3>
- <font size=+2>Not always, even for nice functions.</font>
- <font size=+2>e.g. say $f(x) = x^2$</font>

- <font size=+2>$\eta = 1.05$ on the left (diverges); $\eta = 0.95$ on the right (converges); see the numerical check below</font>
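- <font size=+1>A small numerical check of the two cases (the starting point $x^{(0)} = 1$ and step count are assumptions; the plots' exact setup isn't shown):</font>
```python
def descend(eta, x0=1.0, steps=10):
    """Gradient descent on f(x) = x^2, whose derivative is 2x."""
    x = x0
    for _ in range(steps):
        x = x - eta * 2 * x   # x^(t+1) = x^(t) - eta * f'(x^(t))
    return x

print(descend(0.95))   # ~0.35: |x| shrinks by a factor 0.9 each step (converges toward 0)
print(descend(1.05))   # ~2.59: |x| grows by a factor 1.1 each step (diverges)
```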
---
<h3>Does gradient descent converge (to minimum)?</h3>
- <font size=+2>In some special cases, yes.</font>
- <font size=+2>Suppose that $f$ is differentiable and **convex**...</font>
<span style="color:#181818;">
- <font size=+2>Suppose $\nabla f$ is also **Lipschitz continuous** with Lipschitz constant $L>0$...
and that $\eta \le 1/L$</font>
</span>
<span style="color:#181818;background-color:#181818;">
> Theorem: If $\hat{\bf w}$ is (global) minimizer of $f$, then
$|f({\bf w^{(t)}}) - f(\hat{\bf w})| \le \frac{|{\bf w^{(0)}} - \hat{\bf w}|^2}{2\eta t}$
</span>
----
<h3>Does gradient descent converge (to minimum)?</h3>
- <font size=+2>In some special cases, yes.</font>
- <font size=+2>Suppose that $f$ is differentiable and **convex**...</font>
- <font size=+2>Suppose $\nabla f$ is also **Lipschitz continuous** with Lipschitz constant $L>0$...
and that $\eta \le 1/L$</font>
<span style="color:#181818;background-color:#181818;">
> Theorem: If $\hat{\bf w}$ is (global) minimizer of $f$, then
$|f({\bf w^{(t)}}) - f(\hat{\bf w})| \le \frac{|{\bf w^{(0)}} - \hat{\bf w}|^2}{2\eta t}$
</span>
----
<h3>Does gradient descent converge (to minimum)?</h3>
- <font size=+2>In some special cases, yes.</font>
- <font size=+2>Suppose that $f$ is differentiable and **convex**...</font>
- <font size=+2>Suppose $\nabla f$ is also **Lipschitz continuous** with Lipschitz constant $L>0$...
and that $\eta \le 1/L$</font>
> Theorem: If $\hat{\bf w}$ is (global) minimizer of $f$, then
$|f({\bf w^{(t)}}) - f(\hat{\bf w})| \le \frac{|{\bf w^{(0)}} - \hat{\bf w}|^2}{2\eta t}$
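- <font size=+1>For instance, $f(x)=x^2$ has $|f'(a)-f'(b)| = 2|a-b|$, so $L=2$; taking $\eta = 1/L = \tfrac12$, the bound reads $f(x^{(t)}) - f(0) \le \frac{(x^{(0)})^2}{t}$ (and indeed $x^{(1)} = x^{(0)} - \tfrac12\cdot 2x^{(0)} = 0$, so the minimum is reached after one step).</font>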
---
<h3>
Discussion
</h3>
<br />
<br />
<br />
<br />
<br />
<br />
<br />
---
<h3>Wrap up</h3>
- <font size=+2>The MSE does satisfy the conditions (convex, and its gradient is Lipschitz continuous).</font>
<span style="color:#181818;">
- <font size=+2>Learning a good logistic regression model (learning the half-space determined by ${\bf w}\cdot{\bf x} + b = 0$) uses a loss function for which convergence is guaranteed.</font>
- <font size=+2>Many loss functions we will want to use are not convex.</font>
</span>
----
<h3>Wrap up</h3>
- <font size=+2>The MSE does satisfy the conditions (convex, and its gradient is Lipschitz continuous).</font>
- <font size=+2>Learning a good logistic regression model (learning the half-space determined by ${\bf w}\cdot{\bf x} + b = 0$) uses a loss function for which convergence is guaranteed.</font>
- <font size=+2>Many loss functions we will want to use are not convex.</font>
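- <font size=+1>One way to check the MSE claim (writing $\tilde X$ for the matrix of data points with a column of 1s appended, a notation not used above): the MSE Hessian is $\frac{2}{n}\tilde X^\top\tilde X$, which is positive semi-definite, so MSE is convex, and its gradient is Lipschitz with $L = \frac{2}{n}\lambda_{\max}(\tilde X^\top\tilde X)$.</font>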
---
<h3>
Discussion
</h3>
<br />
<br />
<br />
<br />
<br />
<br />
<br />
{"metaMigratedAt":"2023-06-15T20:14:29.433Z","metaMigratedFrom":"YAML","title":"Gradient descent","breaks":true,"description":"View the slide with \"Slide Mode\".","contributors":"[{\"id\":\"da8891d8-b47c-4b6d-adeb-858379287e60\",\"add\":5207,\"del\":44}]"}