# Miscellaneous topics on optimization
<!-- Put the link to this slide here so people can follow -->
slide: https://hackmd.io/@ccornwell/miscellaneous-topics
---
<h3>Derivatives from composition</h3>
- <font size=+2>Have $\ell:\mathbb R^n\to\mathbb R$ (e.g. a loss function).</font>
- <font size=+2>Want the derivative, but it may help to think of $\ell$ as a composition and use the derivative of each piece.</font>
- <font size=+2 style="color:#181818;">For example, have $F:\mathbb R^n\to\mathbb R^m$ and $g:\mathbb R^m\to\mathbb R$, so that $\ell = g\circ F$.</font>
- <font size=+2 style="color:#181818;">How can you think about the gradient of $\ell$ by using "appropriate" derivatives of $g$ and of $F$ (*General Chain Rule*)?</font>
----
<h3>Derivatives from composition</h3>
- <font size=+2>Have $\ell:\mathbb R^n\to\mathbb R$ (e.g. a loss function).</font>
- <font size=+2>Want the derivative, but it may help to think of $\ell$ as a composition and use the derivative of each piece.</font>
- <font size=+2>For example, have $F:\mathbb R^n\to\mathbb R^m$ and $g:\mathbb R^m\to\mathbb R$, so that $\ell = g\circ F$.</font>
- <font size=+2>How can you think about the gradient of $\ell$ by using "appropriate" derivatives of $g$ and of $F$ (*General Chain Rule*)?</font>
---
<h3>Recall the Jacobian</h3>
- <font size=+2>An idea you saw in Calc III: for transformation $T:\mathbb R^2\to\mathbb R^2$, with</font> $$\scriptsize T(u,v) = (f(u,v), g(u,v))$$
- <font size=+2 style="color:#181818;">the *Jacobian* of $T$: determinant of $$\scriptsize\begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}$$</font>
- <font size=+2 style="color:#181818;">What may have been missed...the right way to think of the *derivative of* $T$.</font>
----
<h3>Recall the Jacobian</h3>
- <font size=+2>An idea you saw in Calc III: for transformation $T:\mathbb R^2\to\mathbb R^2$, with</font> $$\scriptsize T(u,v) = (f(u,v), g(u,v))$$
- <font size=+2>the *Jacobian* of $T$: determinant of $$\scriptsize\begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}$$</font>
- <font size=+2 style="color:#181818;">What may have been missed...the right way to think of the *derivative of* $T$.</font>
----
<h3>Recall the Jacobian</h3>
- <font size=+2>An idea you saw in Calc III: for transformation $T:\mathbb R^2\to\mathbb R^2$, with</font> $$\scriptsize T(u,v) = (f(u,v), g(u,v))$$
- <font size=+2>the *Jacobian* of $T$: determinant of $$\scriptsize\begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}$$</font>
- <font size=+2>What may have been missed...the right way to think of the *derivative of* $T$.</font>
---
<h3>The derivative done right</h3>
- <font size=+2>The derivative at a point $(u,v)$ is the matrix $DT$, evaluated at $(u,v)$:</font> $$\scriptsize DT = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}$$
- <font size=+2 style="color:#181818;">For a general function $T:\mathbb R^n\to\mathbb R^m$, with ${\bf p}$ in the domain of $DT$, the derivative at ${\bf p}$ is the linear transformation ${\bf x} \mapsto DT({\bf p})\cdot{\bf x}.$</font>
- <font size=+2 style="color:#181818;">The matrix $DT({\bf p})$ is $m\times n$; "matrix of partials"</font>
- <font size=+2 style="color:#181818;">columns correspond to input variables; rows correspond to (output) coordinate functions.</font>
----
<h3>The derivative done right</h3>
- <font size=+2>The derivative at a point $(u,v)$ is the matrix $DT$, evaluated at $(u,v)$:</font> $$\scriptsize DT = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}$$
- <font size=+2>For a general function $T:\mathbb R^n\to\mathbb R^m$, with ${\bf p}$ in the domain of $DT$, the derivative at ${\bf p}$ is the linear transformation ${\bf x} \mapsto DT({\bf p})\cdot{\bf x}.$</font>
- <font size=+2 style="color:#181818;">The matrix $DT({\bf p})$ is $m\times n$; "matrix of partials"</font>
- <font size=+2 style="color:#181818;">columns correspond to input variables; rows correspond to (output) coordinate functions.</font>
----
<h3>The derivative done right</h3>
- <font size=+2>The derivative at a point $(u,v)$ is the matrix $DT$, evaluated at $(u,v)$:</font> $$\scriptsize DT = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}$$
- <font size=+2>For a general function $T:\mathbb R^n\to\mathbb R^m$, with ${\bf p}$ in the domain of $DT$, the derivative at ${\bf p}$ is the linear transformation ${\bf x} \mapsto DT({\bf p})\cdot{\bf x}.$</font>
- <font size=+2>The matrix $DT({\bf p})$ is $m\times n$; "matrix of partials"</font>
- <font size=+2 style="color:#181818;">columns correspond to input variables; rows correspond to (output) coordinate functions.</font>
----
<h3>The derivative done right</h3>
- <font size=+2>The derivative at a point $(u,v)$ is the matrix $DT$, evaluated at $(u,v)$:</font> $$\scriptsize DT = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}$$
- <font size=+2>For a general function $T:\mathbb R^n\to\mathbb R^m$, with ${\bf p}$ in the domain of $DT$, the derivative at ${\bf p}$ is the linear transformation ${\bf x} \mapsto DT({\bf p})\cdot{\bf x}.$</font>
- <font size=+2>The matrix $DT({\bf p})$ is $m\times n$; "matrix of partials"</font>
- <font size=+2>columns correspond to input variables; rows correspond to (output) coordinate functions.</font>
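As a sketch (not from the slides; the sample map $T$ is made up for illustration), the "matrix of partials" can be approximated numerically with central differences and checked against the hand-computed partials:

```python
import numpy as np

def jacobian(T, p, h=1e-6):
    """Approximate the m x n matrix of partials DT(p) by central differences."""
    p = np.asarray(p, dtype=float)
    m, n = len(T(p)), len(p)
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        # column j = partial derivatives with respect to input variable j
        J[:, j] = (np.asarray(T(p + e)) - np.asarray(T(p - e))) / (2 * h)
    return J

# Hypothetical T: R^2 -> R^2, T(u, v) = (u^2 * v, sin(u) + v)
T = lambda p: np.array([p[0]**2 * p[1], np.sin(p[0]) + p[1]])
J = jacobian(T, [1.0, 2.0])
# By hand, DT(1, 2) = [[2uv, u^2], [cos u, 1]] = [[4, 1], [cos(1), 1]]
```

Note the shape: two rows (one per coordinate function of $T$) and two columns (one per input variable), matching the convention on the slide.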
---
<h3>General Chain Rule</h3>
- <font size=+2>This perspective tells you how the chain rule *must work*.</font>
- <font size=+2>If $F:\mathbb R^n\to\mathbb R^m$ and $g:\mathbb R^m\to\mathbb R$, and we have $\ell = g\circ F$, then</font>
- <font size=+2>$DF$ is an $m\times n$ matrix of partials;</font>
- <font size=+2>$Dg$ is a $1\times m$ matrix of partials;</font>
- <font size=+2 style="color:#181818;">only thing that will work: $D\ell = Dg\cdot DF$, which gives $1\times n$ matrix.</font>
- <font size=+2 style="color:#181818;">To evaluate at a point: say ${\bf p}\in\mathbb R^n$; can evaluate $DF$ at ${\bf p}$, but not $Dg$...</font>
- <font size=+2 style="color:#181818;">but $F({\bf p})\in\mathbb R^m$...so</font> <font style="color:#181818;">$$\scriptsize D\ell({\bf p}) = Dg(F({\bf p}))\cdot DF({\bf p}).$$</font>
----
<h3>General Chain Rule</h3>
- <font size=+2>This perspective tells you how the chain rule *must work*.</font>
- <font size=+2>If $F:\mathbb R^n\to\mathbb R^m$ and $g:\mathbb R^m\to\mathbb R$, and we have $\ell = g\circ F$, then</font>
- <font size=+2>$DF$ is an $m\times n$ matrix of partials;</font>
- <font size=+2>$Dg$ is a $1\times m$ matrix of partials;</font>
- <font size=+2>only thing that will work: $D\ell = Dg\cdot DF$, which gives $1\times n$ matrix.</font>
- <font size=+2 style="color:#181818;">To evaluate at a point: say ${\bf p}\in\mathbb R^n$; can evaluate $DF$ at ${\bf p}$, but not $Dg$...</font>
- <font size=+2 style="color:#181818;">but $F({\bf p})\in\mathbb R^m$...so</font> <font style="color:#181818;">$$\scriptsize D\ell({\bf p}) = Dg(F({\bf p}))\cdot DF({\bf p}).$$</font>
----
<h3>General Chain Rule</h3>
- <font size=+2>This perspective tells you how the chain rule *must work*.</font>
- <font size=+2>If $F:\mathbb R^n\to\mathbb R^m$ and $g:\mathbb R^m\to\mathbb R$, and we have $\ell = g\circ F$, then</font>
- <font size=+2>$DF$ is an $m\times n$ matrix of partials;</font>
- <font size=+2>$Dg$ is a $1\times m$ matrix of partials;</font>
- <font size=+2>only thing that will work: $D\ell = Dg\cdot DF$, which gives $1\times n$ matrix.</font>
- <font size=+2>To evaluate at a point: say ${\bf p}\in\mathbb R^n$; can evaluate $DF$ at ${\bf p}$, but not $Dg$...</font>
- <font size=+2 style="color:#181818;">but $F({\bf p})\in\mathbb R^m$...so</font> <font style="color:#181818;">$$\scriptsize D\ell({\bf p}) = Dg(F({\bf p}))\cdot DF({\bf p}).$$</font>
----
<h3>General Chain Rule</h3>
- <font size=+2>This perspective tells you how the chain rule *must work*.</font>
- <font size=+2>If $F:\mathbb R^n\to\mathbb R^m$ and $g:\mathbb R^m\to\mathbb R$, and we have $\ell = g\circ F$, then</font>
- <font size=+2>$DF$ is an $m\times n$ matrix of partials;</font>
- <font size=+2>$Dg$ is a $1\times m$ matrix of partials;</font>
- <font size=+2>only thing that will work: $D\ell = Dg\cdot DF$, which gives $1\times n$ matrix.</font>
- <font size=+2>To evaluate at a point: say ${\bf p}\in\mathbb R^n$; can evaluate $DF$ at ${\bf p}$, but not $Dg$...</font>
- <font size=+2>but $F({\bf p})\in\mathbb R^m$...so</font> $$\scriptsize D\ell({\bf p}) = Dg(F({\bf p}))\cdot DF({\bf p}).$$
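A quick numerical sanity check of $D\ell({\bf p}) = Dg(F({\bf p}))\cdot DF({\bf p})$ (a sketch; the particular $F$ and $g$ here are made-up examples, not from the slides):

```python
import numpy as np

def grad(f, p, h=1e-6):
    """Central-difference matrix of partials of scalar-valued f at p (a 1 x n row)."""
    p = np.asarray(p, dtype=float)
    out = np.zeros(len(p))
    for j in range(len(p)):
        e = np.zeros(len(p))
        e[j] = h
        out[j] = (f(p + e) - f(p - e)) / (2 * h)
    return out

# Hypothetical F: R^3 -> R^2 and g: R^2 -> R, with ell = g o F
F = lambda p: np.array([p[0] * p[1], p[1] + p[2]**2])
g = lambda q: q[0]**2 + 3 * q[1]
ell = lambda p: g(F(p))

p = np.array([1.0, 2.0, 0.5])
# DF(p): 2 x 3 matrix of partials (row i = gradient of coordinate i of F)
DF = np.vstack([grad(lambda x: F(x)[i], p) for i in range(2)])
Dg = grad(g, F(p))        # 1 x 2, evaluated at F(p), not at p
chain = Dg @ DF           # chain rule: 1 x n matrix D ell(p)
direct = grad(ell, p)     # differentiate the composition directly
```

Here `chain` and `direct` agree (both are $(8, 7, 3)$ for this choice of $F$, $g$, ${\bf p}$), and note that $Dg$ really must be evaluated at $F({\bf p})$.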
---
<h3>Example</h3>
- <font size=+2>Returning to transformation $T:\mathbb R^2\to\mathbb R^2$, with</font> $$\scriptsize DT = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix},$$
- <font size=+2 style="color:#181818;">suppose $u$ and $v$ are functions of $t$ (a curve); so say $\gamma:\mathbb R\to\mathbb R^2$, with $\gamma(t) = (u(t), v(t))$. Let $c(t) = T(\gamma(t))$. Then $Dc = DT\cdot D\gamma$:</font> <font style="color:#181818;">$$\scriptsize Dc = DT\cdot D\gamma = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}\begin{bmatrix}\frac{\partial u}{\partial t}\\ \frac{\partial v}{\partial t}\end{bmatrix} = \begin{bmatrix}\frac{\partial f}{\partial u}\frac{\partial u}{\partial t}+\frac{\partial f}{\partial v}\frac{\partial v}{\partial t}\\ \frac{\partial g}{\partial u}\frac{\partial u}{\partial t}+\frac{\partial g}{\partial v}\frac{\partial v}{\partial t}\end{bmatrix}$$</font>
----
<h3>Example</h3>
- <font size=+2>Returning to transformation $T:\mathbb R^2\to\mathbb R^2$, with</font> $$\scriptsize DT = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix},$$
- <font size=+2>suppose $u$ and $v$ are functions of $t$ (a curve); so say $\gamma:\mathbb R\to\mathbb R^2$, with $\gamma(t) = (u(t), v(t))$. Let $c(t) = T(\gamma(t))$.</font> <font size=+2 style="color:#181818;">Then $Dc = DT\cdot D\gamma$:</font> <font style="color:#181818;">$$\scriptsize Dc = DT\cdot D\gamma = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}\begin{bmatrix}\frac{\partial u}{\partial t}\\ \frac{\partial v}{\partial t}\end{bmatrix} = \begin{bmatrix}\frac{\partial f}{\partial u}\frac{\partial u}{\partial t}+\frac{\partial f}{\partial v}\frac{\partial v}{\partial t}\\ \frac{\partial g}{\partial u}\frac{\partial u}{\partial t}+\frac{\partial g}{\partial v}\frac{\partial v}{\partial t}\end{bmatrix}$$</font>
----
<h3>Example</h3>
- <font size=+2>Returning to transformation $T:\mathbb R^2\to\mathbb R^2$, with</font> $$\scriptsize DT = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix},$$
- <font size=+2>suppose $u$ and $v$ are functions of $t$ (a curve); so say $\gamma:\mathbb R\to\mathbb R^2$, with $\gamma(t) = (u(t), v(t))$. Let $c(t) = T(\gamma(t))$. Then $Dc = DT\cdot D\gamma$:</font> $$\scriptsize Dc = DT\cdot D\gamma = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}\begin{bmatrix}\frac{\partial u}{\partial t}\\ \frac{\partial v}{\partial t}\end{bmatrix} = \begin{bmatrix}\frac{\partial f}{\partial u}\frac{\partial u}{\partial t}+\frac{\partial f}{\partial v}\frac{\partial v}{\partial t}\\ \frac{\partial g}{\partial u}\frac{\partial u}{\partial t}+\frac{\partial g}{\partial v}\frac{\partial v}{\partial t}\end{bmatrix}$$
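The curve computation above can also be checked numerically; a minimal sketch, with a made-up $T$ and curve $\gamma$ (any smooth choices would do):

```python
import numpy as np

h = 1e-6
# Hypothetical T(u, v) = (u*v, u + v^2) and curve gamma(t) = (cos t, sin t)
T = lambda u, v: np.array([u * v, u + v**2])
gamma = lambda t: np.array([np.cos(t), np.sin(t)])
c = lambda t: T(*gamma(t))          # the composition c(t) = T(gamma(t))

t = 0.7
u, v = gamma(t)
du, dv = -np.sin(t), np.cos(t)      # D gamma(t), a 2 x 1 column
DT = np.array([[v, u],              # matrix of partials of T, evaluated
               [1.0, 2 * v]])       # at the point gamma(t)
chain = DT @ np.array([du, dv])     # Dc(t) = DT(gamma(t)) . D gamma(t)
direct = (c(t + h) - c(t - h)) / (2 * h)   # differentiate c directly
```

The two results match, illustrating that the single-variable "multivariable chain rule" from Calc III is just this one matrix product.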
---
<h3>Momentum in gradient descent</h3>
- <font size=+2>When optimizing functions with high-dimensional input, *ravines* (regions far steeper in some directions than in others) often occur.</font>

----
<h3>Momentum in gradient descent</h3>
- <font size=+2>When optimizing functions with high-dimensional input, *ravines* (regions far steeper in some directions than in others) often occur.</font>
- <font size=+2>Ravines cause the gradient to point mostly toward the opposite side of the ravine; gradient descent "hops" from one steep wall to the other $\Rightarrow$ slow convergence.</font>
- <font size=+2 style="color:#181818;">**Momentum** tries to fix this, by giving procedure some "memory" of past updates.</font>
- <font size=+2 style="color:#181818;">Let $\Delta{\bf w}_t = {\bf w^{(t)}}-{\bf w^{(t-1)}}$.</font>
- <font size=+2 style="color:#181818;">Gradient descent with momentum updates the parameters by the following rule, for some constant $\alpha$:</font> <font style="color:#181818;">$$\scriptsize {\bf w^{(t+1)}} = {\bf w^{(t)}} - \eta\nabla f({\bf w^{(t)}}) + \alpha\Delta{\bf w}_t.$$</font>
----
<h3>Momentum in gradient descent</h3>
- <font size=+2>When optimizing functions with high-dimensional input, *ravines* (regions far steeper in some directions than in others) often occur.</font>
- <font size=+2>Ravines cause the gradient to point mostly toward the opposite side of the ravine; gradient descent "hops" from one steep wall to the other $\Rightarrow$ slow convergence.</font>
- <font size=+2>**Momentum** tries to fix this, by giving procedure some "memory" of past updates.</font>
- <font size=+2 style="color:#181818;">Let $\Delta{\bf w}_t = {\bf w^{(t)}}-{\bf w^{(t-1)}}$.</font>
- <font size=+2 style="color:#181818;">Gradient descent with momentum updates the parameters by the following rule, for some constant $\alpha$:</font> <font style="color:#181818;">$$\scriptsize {\bf w^{(t+1)}} = {\bf w^{(t)}} - \eta\nabla f({\bf w^{(t)}}) + \alpha\Delta{\bf w}_t.$$</font>
----
<h3>Momentum in gradient descent</h3>
- <font size=+2>When optimizing functions with high-dimensional input, *ravines* (regions far steeper in some directions than in others) often occur.</font>
- <font size=+2>Ravines cause the gradient to point mostly toward the opposite side of the ravine; gradient descent "hops" from one steep wall to the other $\Rightarrow$ slow convergence.</font>
- <font size=+2>**Momentum** tries to fix this, by giving procedure some "memory" of past updates.</font>
- <font size=+2>Let $\Delta{\bf w}_t = {\bf w^{(t)}}-{\bf w^{(t-1)}}$.</font>
- <font size=+2>Gradient descent with momentum updates the parameters by the following rule, for some constant $\alpha$:</font> <font>$$\scriptsize {\bf w^{(t+1)}} = {\bf w^{(t)}} - \eta\nabla f({\bf w^{(t)}}) + \alpha\Delta{\bf w}_t.$$</font>
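The update rule can be sketched in a few lines; the ravine-shaped quadratic, the learning rate $\eta$, and the momentum constant $\alpha$ below are illustrative choices, not from the slides:

```python
import numpy as np

# Ravine-shaped quadratic: f(w) = w0^2 + 10*w1^2 (much steeper in w1 than w0)
grad_f = lambda w: np.array([2 * w[0], 20 * w[1]])

eta, alpha = 0.02, 0.9                      # learning rate and momentum constant
w = np.array([5.0, 1.0])                    # w^(t)
w_prev = w.copy()                           # w^(t-1); initially no past step
for _ in range(200):
    delta = w - w_prev                      # Delta w_t = w^(t) - w^(t-1)
    # w^(t+1) = w^(t) - eta * grad f(w^(t)) + alpha * Delta w_t
    w, w_prev = w - eta * grad_f(w) + alpha * delta, w
# The iterates spiral in toward the minimizer at the origin
```

With $\alpha = 0$ this reduces to plain gradient descent; the $\alpha\Delta{\bf w}_t$ term averages out the hopping across the steep direction while accumulating speed along the shallow one.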
----
<h3>Momentum in gradient descent</h3>

