# Miscellaneous topics on optimization
<!-- Put the link to this slide here so people can follow -->
slide: https://hackmd.io/@ccornwell/miscellaneous-topics
---
<h3>Derivatives from composition</h3>
- <font size=+2>Have $\ell:\mathbb R^n\to\mathbb R$ (e.g. a loss function).</font>
- <font size=+2>Want the derivative, but it may help to think of $\ell$ as a composition and use the derivative of each piece.</font>
- <font size=+2 style="color:#181818;">For example, have $F:\mathbb R^n\to\mathbb R^m$ and $g:\mathbb R^m\to\mathbb R$, so that $\ell = g\circ F$.</font>
- <font size=+2 style="color:#181818;">How can you think about the gradient of $\ell$ by using "appropriate" derivatives of $g$ and of $F$ (*General Chain Rule*)?</font>
----
<h3>Derivatives from composition</h3>
- <font size=+2>Have $\ell:\mathbb R^n\to\mathbb R$ (e.g. a loss function).</font>
- <font size=+2>Want the derivative, but it may help to think of $\ell$ as a composition and use the derivative of each piece.</font>
- <font size=+2>For example, have $F:\mathbb R^n\to\mathbb R^m$ and $g:\mathbb R^m\to\mathbb R$, so that $\ell = g\circ F$.</font>
- <font size=+2>How can you think about the gradient of $\ell$ by using "appropriate" derivatives of $g$ and of $F$ (*General Chain Rule*)?</font>
---
<h3>Recall the Jacobian</h3>
- <font size=+2>An idea you saw in Calc III: for transformation $T:\mathbb R^2\to\mathbb R^2$, with</font> $$\scriptsize T(u,v) = (f(u,v), g(u,v))$$
- <font size=+2 style="color:#181818;">the *Jacobian* of $T$: determinant of $$\scriptsize\begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}$$</font>
- <font size=+2 style="color:#181818;">What may have been missed...the right way to think of the *derivative of* $T$.</font>
----
<h3>Recall the Jacobian</h3>
- <font size=+2>An idea you saw in Calc III: for transformation $T:\mathbb R^2\to\mathbb R^2$, with</font> $$\scriptsize T(u,v) = (f(u,v), g(u,v))$$
- <font size=+2>the *Jacobian* of $T$: determinant of $$\scriptsize\begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}$$</font>
- <font size=+2 style="color:#181818;">What may have been missed...the right way to think of the *derivative of* $T$.</font>
----
<h3>Recall the Jacobian</h3>
- <font size=+2>An idea you saw in Calc III: for transformation $T:\mathbb R^2\to\mathbb R^2$, with</font> $$\scriptsize T(u,v) = (f(u,v), g(u,v))$$
- <font size=+2>the *Jacobian* of $T$: determinant of $$\scriptsize\begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}$$</font>
- <font size=+2>What may have been missed...the right way to think of the *derivative of* $T$.</font>
---
<h3>The derivative done right</h3>
- <font size=+2>The derivative at a point $(u,v)$ is the matrix $DT$, evaluated at $(u,v)$:</font> $$\scriptsize DT = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}$$
- <font size=+2 style="color:#181818;">For a general function $T:\mathbb R^n\to\mathbb R^m$, with ${\bf p}$ in the domain of $DT$, the derivative at ${\bf p}$ is the linear transformation ${\bf x} \mapsto DT({\bf p})\cdot{\bf x}.$</font>
- <font size=+2 style="color:#181818;">The matrix $DT({\bf p})$ is $m\times n$; "matrix of partials"</font>
- <font size=+2 style="color:#181818;">columns correspond to input variables; rows correspond to (output) coordinate functions.</font>
----
<h3>The derivative done right</h3>
- <font size=+2>The derivative at a point $(u,v)$ is the matrix $DT$, evaluated at $(u,v)$:</font> $$\scriptsize DT = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}$$
- <font size=+2>For a general function $T:\mathbb R^n\to\mathbb R^m$, with ${\bf p}$ in the domain of $DT$, the derivative at ${\bf p}$ is the linear transformation ${\bf x} \mapsto DT({\bf p})\cdot{\bf x}.$</font>
- <font size=+2 style="color:#181818;">The matrix $DT({\bf p})$ is $m\times n$; "matrix of partials"</font>
- <font size=+2 style="color:#181818;">columns correspond to input variables; rows correspond to (output) coordinate functions.</font>
----
<h3>The derivative done right</h3>
- <font size=+2>The derivative at a point $(u,v)$ is the matrix $DT$, evaluated at $(u,v)$:</font> $$\scriptsize DT = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}$$
- <font size=+2>For a general function $T:\mathbb R^n\to\mathbb R^m$, with ${\bf p}$ in the domain of $DT$, the derivative at ${\bf p}$ is the linear transformation ${\bf x} \mapsto DT({\bf p})\cdot{\bf x}.$</font>
- <font size=+2>The matrix $DT({\bf p})$ is $m\times n$; "matrix of partials"</font>
- <font size=+2 style="color:#181818;">columns correspond to input variables; rows correspond to (output) coordinate functions.</font>
----
<h3>The derivative done right</h3>
- <font size=+2>The derivative at a point $(u,v)$ is the matrix $DT$, evaluated at $(u,v)$:</font> $$\scriptsize DT = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}$$
- <font size=+2>For a general function $T:\mathbb R^n\to\mathbb R^m$, with ${\bf p}$ in the domain of $DT$, the derivative at ${\bf p}$ is the linear transformation ${\bf x} \mapsto DT({\bf p})\cdot{\bf x}.$</font>
- <font size=+2>The matrix $DT({\bf p})$ is $m\times n$; "matrix of partials"</font>
- <font size=+2>columns correspond to input variables; rows correspond to (output) coordinate functions.</font>
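As a sketch (not from the slides; the sample map $T$ is made up for illustration), the "matrix of partials" can be approximated numerically with central differences and checked against the hand-computed partials:

```python
import numpy as np

def jacobian(T, p, h=1e-6):
    """Approximate the m x n matrix of partials DT(p) by central differences."""
    p = np.asarray(p, dtype=float)
    m, n = len(T(p)), len(p)
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        # column j = partial derivatives with respect to input variable j
        J[:, j] = (np.asarray(T(p + e)) - np.asarray(T(p - e))) / (2 * h)
    return J

# Hypothetical T: R^2 -> R^2, T(u, v) = (u^2 * v, sin(u) + v)
T = lambda p: np.array([p[0]**2 * p[1], np.sin(p[0]) + p[1]])
J = jacobian(T, [1.0, 2.0])
# By hand, DT(1, 2) = [[2uv, u^2], [cos u, 1]] = [[4, 1], [cos(1), 1]]
```

Note the shape: two rows (one per coordinate function of $T$) and two columns (one per input variable), matching the convention on the slide.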
---
<h3>General Chain Rule</h3>
- <font size=+2>This perspective tells you how the chain rule *must work*.</font>
- <font size=+2>If $F:\mathbb R^n\to\mathbb R^m$ and $g:\mathbb R^m\to\mathbb R$, and we have $\ell = g\circ F$, then</font>
- <font size=+2>$DF$ is an $m\times n$ matrix of partials;</font>
- <font size=+2>$Dg$ is a $1\times m$ matrix of partials;</font>
- <font size=+2 style="color:#181818;">only thing that will work: $D\ell = Dg\cdot DF$, which gives $1\times n$ matrix.</font>
- <font size=+2 style="color:#181818;">To evaluate at a point: say ${\bf p}\in\mathbb R^n$; can evaluate $DF$ at ${\bf p}$, but not $Dg$...</font>
- <font size=+2 style="color:#181818;">but $F({\bf p})\in\mathbb R^m$...so</font> <font style="color:#181818;">$$\scriptsize D\ell({\bf p}) = Dg(F({\bf p}))\cdot DF({\bf p}).$$</font>
----
<h3>General Chain Rule</h3>
- <font size=+2>This perspective tells you how the chain rule *must work*.</font>
- <font size=+2>If $F:\mathbb R^n\to\mathbb R^m$ and $g:\mathbb R^m\to\mathbb R$, and we have $\ell = g\circ F$, then</font>
- <font size=+2>$DF$ is an $m\times n$ matrix of partials;</font>
- <font size=+2>$Dg$ is a $1\times m$ matrix of partials;</font>
- <font size=+2>only thing that will work: $D\ell = Dg\cdot DF$, which gives $1\times n$ matrix.</font>
- <font size=+2 style="color:#181818;">To evaluate at a point: say ${\bf p}\in\mathbb R^n$; can evaluate $DF$ at ${\bf p}$, but not $Dg$...</font>
- <font size=+2 style="color:#181818;">but $F({\bf p})\in\mathbb R^m$...so</font> <font style="color:#181818;">$$\scriptsize D\ell({\bf p}) = Dg(F({\bf p}))\cdot DF({\bf p}).$$</font>
----
<h3>General Chain Rule</h3>
- <font size=+2>This perspective tells you how the chain rule *must work*.</font>
- <font size=+2>If $F:\mathbb R^n\to\mathbb R^m$ and $g:\mathbb R^m\to\mathbb R$, and we have $\ell = g\circ F$, then</font>
- <font size=+2>$DF$ is an $m\times n$ matrix of partials;</font>
- <font size=+2>$Dg$ is a $1\times m$ matrix of partials;</font>
- <font size=+2>only thing that will work: $D\ell = Dg\cdot DF$, which gives $1\times n$ matrix.</font>
- <font size=+2>To evaluate at a point: say ${\bf p}\in\mathbb R^n$; can evaluate $DF$ at ${\bf p}$, but not $Dg$...</font>
- <font size=+2 style="color:#181818;">but $F({\bf p})\in\mathbb R^m$...so</font> <font style="color:#181818;">$$\scriptsize D\ell({\bf p}) = Dg(F({\bf p}))\cdot DF({\bf p}).$$</font>
----
<h3>General Chain Rule</h3>
- <font size=+2>This perspective tells you how the chain rule *must work*.</font>
- <font size=+2>If $F:\mathbb R^n\to\mathbb R^m$ and $g:\mathbb R^m\to\mathbb R$, and we have $\ell = g\circ F$, then</font>
- <font size=+2>$DF$ is an $m\times n$ matrix of partials;</font>
- <font size=+2>$Dg$ is a $1\times m$ matrix of partials;</font>
- <font size=+2>only thing that will work: $D\ell = Dg\cdot DF$, which gives $1\times n$ matrix.</font>
- <font size=+2>To evaluate at a point: say ${\bf p}\in\mathbb R^n$; can evaluate $DF$ at ${\bf p}$, but not $Dg$...</font>
- <font size=+2>but $F({\bf p})\in\mathbb R^m$...so</font> $$\scriptsize D\ell({\bf p}) = Dg(F({\bf p}))\cdot DF({\bf p}).$$
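A quick numerical sanity check of $D\ell({\bf p}) = Dg(F({\bf p}))\cdot DF({\bf p})$ (a sketch; the particular $F$ and $g$ here are made-up examples, not from the slides):

```python
import numpy as np

def grad(f, p, h=1e-6):
    """Central-difference matrix of partials of scalar-valued f at p (a 1 x n row)."""
    p = np.asarray(p, dtype=float)
    out = np.zeros(len(p))
    for j in range(len(p)):
        e = np.zeros(len(p))
        e[j] = h
        out[j] = (f(p + e) - f(p - e)) / (2 * h)
    return out

# Hypothetical F: R^3 -> R^2 and g: R^2 -> R, with ell = g o F
F = lambda p: np.array([p[0] * p[1], p[1] + p[2]**2])
g = lambda q: q[0]**2 + 3 * q[1]
ell = lambda p: g(F(p))

p = np.array([1.0, 2.0, 0.5])
# DF(p): 2 x 3 matrix of partials (row i = gradient of coordinate i of F)
DF = np.vstack([grad(lambda x: F(x)[i], p) for i in range(2)])
Dg = grad(g, F(p))        # 1 x 2, evaluated at F(p), not at p
chain = Dg @ DF           # chain rule: 1 x n matrix D ell(p)
direct = grad(ell, p)     # differentiate the composition directly
```

Here `chain` and `direct` agree (both are $(8, 7, 3)$ for this choice of $F$, $g$, ${\bf p}$), and note that $Dg$ really must be evaluated at $F({\bf p})$.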
---
<h3>Example</h3>
- <font size=+2>Returning to transformation $T:\mathbb R^2\to\mathbb R^2$, with</font> $$\scriptsize DT = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix},$$
- <font size=+2 style="color:#181818;">suppose $u$ and $v$ are functions of $t$ (a curve); so say $\gamma:\mathbb R\to\mathbb R^2$, with $\gamma(t) = (u(t), v(t))$. Let $c(t) = T(\gamma(t))$. Then $Dc = DT\cdot D\gamma$:</font> <font style="color:#181818;">$$\scriptsize Dc = DT\cdot D\gamma = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}\begin{bmatrix}\frac{\partial u}{\partial t}\\ \frac{\partial v}{\partial t}\end{bmatrix} = \begin{bmatrix}\frac{\partial f}{\partial u}\frac{\partial u}{\partial t}+\frac{\partial f}{\partial v}\frac{\partial v}{\partial t}\\ \frac{\partial g}{\partial u}\frac{\partial u}{\partial t}+\frac{\partial g}{\partial v}\frac{\partial v}{\partial t}\end{bmatrix}$$</font>
----
<h3>Example</h3>
- <font size=+2>Returning to transformation $T:\mathbb R^2\to\mathbb R^2$, with</font> $$\scriptsize DT = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix},$$
- <font size=+2>suppose $u$ and $v$ are functions of $t$ (a curve); so say $\gamma:\mathbb R\to\mathbb R^2$, with $\gamma(t) = (u(t), v(t))$. Let $c(t) = T(\gamma(t))$.</font> <font size=+2 style="color:#181818;">Then $Dc = DT\cdot D\gamma$:</font> <font style="color:#181818;">$$\scriptsize Dc = DT\cdot D\gamma = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}\begin{bmatrix}\frac{\partial u}{\partial t}\\ \frac{\partial v}{\partial t}\end{bmatrix} = \begin{bmatrix}\frac{\partial f}{\partial u}\frac{\partial u}{\partial t}+\frac{\partial f}{\partial v}\frac{\partial v}{\partial t}\\ \frac{\partial g}{\partial u}\frac{\partial u}{\partial t}+\frac{\partial g}{\partial v}\frac{\partial v}{\partial t}\end{bmatrix}$$</font>
----
<h3>Example</h3>
- <font size=+2>Returning to transformation $T:\mathbb R^2\to\mathbb R^2$, with</font> $$\scriptsize DT = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix},$$
- <font size=+2>suppose $u$ and $v$ are functions of $t$ (a curve); so say $\gamma:\mathbb R\to\mathbb R^2$, with $\gamma(t) = (u(t), v(t))$. Let $c(t) = T(\gamma(t))$. Then $Dc = DT\cdot D\gamma$:</font> $$\scriptsize Dc = DT\cdot D\gamma = \begin{bmatrix}\frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} \\ \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v}\end{bmatrix}\begin{bmatrix}\frac{\partial u}{\partial t}\\ \frac{\partial v}{\partial t}\end{bmatrix} = \begin{bmatrix}\frac{\partial f}{\partial u}\frac{\partial u}{\partial t}+\frac{\partial f}{\partial v}\frac{\partial v}{\partial t}\\ \frac{\partial g}{\partial u}\frac{\partial u}{\partial t}+\frac{\partial g}{\partial v}\frac{\partial v}{\partial t}\end{bmatrix}$$
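The curve computation above can also be checked numerically; a minimal sketch, with a made-up $T$ and curve $\gamma$ (any smooth choices would do):

```python
import numpy as np

h = 1e-6
# Hypothetical T(u, v) = (u*v, u + v^2) and curve gamma(t) = (cos t, sin t)
T = lambda u, v: np.array([u * v, u + v**2])
gamma = lambda t: np.array([np.cos(t), np.sin(t)])
c = lambda t: T(*gamma(t))          # the composition c(t) = T(gamma(t))

t = 0.7
u, v = gamma(t)
du, dv = -np.sin(t), np.cos(t)      # D gamma(t), a 2 x 1 column
DT = np.array([[v, u],              # matrix of partials of T, evaluated
               [1.0, 2 * v]])       # at the point gamma(t)
chain = DT @ np.array([du, dv])     # Dc(t) = DT(gamma(t)) . D gamma(t)
direct = (c(t + h) - c(t - h)) / (2 * h)   # differentiate c directly
```

The two results match, illustrating that the single-variable "multivariable chain rule" from Calc III is just this one matrix product.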
---
<h3>Momentum in gradient descent</h3>
- <font size=+2>When optimizing functions with high-dimensional input, *ravines* (regions far steeper in some directions than in others) often occur.</font>

----
<h3>Momentum in gradient descent</h3>
- <font size=+2>When optimizing functions with high-dimensional input, *ravines* (regions far steeper in some directions than in others) often occur.</font>
- <font size=+2>Ravines cause the gradient to point mostly toward the opposite side of the ravine; gradient descent "hops" from one steep wall to the other $\Rightarrow$ slow convergence.</font>
- <font size=+2 style="color:#181818;">**Momentum** tries to fix this, by giving procedure some "memory" of past updates.</font>
- <font size=+2 style="color:#181818;">Let $\Delta{\bf w}_t = {\bf w^{(t)}}-{\bf w^{(t-1)}}$.</font>
- <font size=+2 style="color:#181818;">Gradient descent with momentum updates the parameters by the following rule, for some constant $\alpha$:</font> <font style="color:#181818;">$$\scriptsize {\bf w^{(t+1)}} = {\bf w^{(t)}} - \eta\nabla f({\bf w^{(t)}}) + \alpha\Delta{\bf w}_t.$$</font>
----
<h3>Momentum in gradient descent</h3>
- <font size=+2>When optimizing functions with high-dimensional input, *ravines* (regions far steeper in some directions than in others) often occur.</font>
- <font size=+2>Ravines cause the gradient to point mostly toward the opposite side of the ravine; gradient descent "hops" from one steep wall to the other $\Rightarrow$ slow convergence.</font>
- <font size=+2>**Momentum** tries to fix this, by giving procedure some "memory" of past updates.</font>
- <font size=+2 style="color:#181818;">Let $\Delta{\bf w}_t = {\bf w^{(t)}}-{\bf w^{(t-1)}}$.</font>
- <font size=+2 style="color:#181818;">Gradient descent with momentum updates the parameters by the following rule, for some constant $\alpha$:</font> <font style="color:#181818;">$$\scriptsize {\bf w^{(t+1)}} = {\bf w^{(t)}} - \eta\nabla f({\bf w^{(t)}}) + \alpha\Delta{\bf w}_t.$$</font>
----
<h3>Momentum in gradient descent</h3>
- <font size=+2>When optimizing functions with high-dimensional input, *ravines* (regions far steeper in some directions than in others) often occur.</font>
- <font size=+2>Ravines cause the gradient to point mostly toward the opposite side of the ravine; gradient descent "hops" from one steep wall to the other $\Rightarrow$ slow convergence.</font>
- <font size=+2>**Momentum** tries to fix this, by giving procedure some "memory" of past updates.</font>
- <font size=+2>Let $\Delta{\bf w}_t = {\bf w^{(t)}}-{\bf w^{(t-1)}}$.</font>
- <font size=+2>Gradient descent with momentum updates the parameters by the following rule, for some constant $\alpha$:</font> <font>$$\scriptsize {\bf w^{(t+1)}} = {\bf w^{(t)}} - \eta\nabla f({\bf w^{(t)}}) + \alpha\Delta{\bf w}_t.$$</font>
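The update rule can be sketched in a few lines; the ravine-shaped quadratic, the learning rate $\eta$, and the momentum constant $\alpha$ below are illustrative choices, not from the slides:

```python
import numpy as np

# Ravine-shaped quadratic: f(w) = w0^2 + 10*w1^2 (much steeper in w1 than w0)
grad_f = lambda w: np.array([2 * w[0], 20 * w[1]])

eta, alpha = 0.02, 0.9                      # learning rate and momentum constant
w = np.array([5.0, 1.0])                    # w^(t)
w_prev = w.copy()                           # w^(t-1); initially no past step
for _ in range(200):
    delta = w - w_prev                      # Delta w_t = w^(t) - w^(t-1)
    # w^(t+1) = w^(t) - eta * grad f(w^(t)) + alpha * Delta w_t
    w, w_prev = w - eta * grad_f(w) + alpha * delta, w
# The iterates spiral in toward the minimizer at the origin
```

With $\alpha = 0$ this reduces to plain gradient descent; the $\alpha\Delta{\bf w}_t$ term averages out the hopping across the steep direction while accumulating speed along the shallow one.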
----
<h3>Momentum in gradient descent</h3>

