---
title: "Riemannopt@HSE - SLIDES, pt. 1/2"
description: 'Sketch of slides for the Riemann opts talk at HSE. View the slide with "Slide Mode".'
slideOptions:
  transition: fade
  allottedMinutes: 60
  theme: white
---

# Optimization on manifolds

- [geoopt/geoopt](https://github.com/geoopt/geoopt) ([@ferrine](https://twitter.com/ferrine96/), [@rrkarim](https://github.com/rrkarim), [@SomeoneS](https://twitter.com/SomeoneSerge))
- **March 13**: an overview, @SomeoneS
  - Notes: https://hackmd.io/@SomeoneS/S1gDMMFV8
- **March 20**: Hyperbolic deep learning, graph embeddings, @ferrine96

---

References
---

- Differential geometry:
  - Frederic Schuller's lectures on YouTube
  - Lee's series
- Optimization: Absil et al.
- Details in the notes

---

Metagoals
---

- This talk fails to provide examples (yet)
- Aims at stating the problems
- ...and actually asking questions

---

"Riemannian Adaptive Optimization Methods"
---

<center><img src="https://i.imgur.com/Wd5ofKf.png" class="img-responsive" width="75%" /> (Becigneul, Ganea)</center>

---

Gradient descent
---

- $p_0\in K,~\alpha>0,~f: K\subset \mathbb{R}^d\to \mathbb{R},$
- $f(p)\to \min.$

> 1. $g_t = f'(p_t)^\top$
> 2. $p_{t+1} = p_t - \alpha g_t$

<center><img src="https://i.imgur.com/ZgWHr6T.jpg" class="img-responsive" width="25%" /></center>

---

Manifolds
---

An $n$-manifold: "a ['nice'](https://hackmd.io/@newkozlukov/manopt-notes#Manifolds) top. space $(M, \Theta)$ every point $p\in M$ of which has an open neighbourhood homeomorphic to (a piece of) $\mathbb{R}^n$"

<center><img src="https://i.imgur.com/kzdBEOI.jpg" class="img-responsive" width="45%"/></center>

$$\phi: U\xrightarrow{\sim} \phi(U)\subset \mathbb{R}^n$$

---

Manifolds. Missing tools
---

- Descent direction: **no derivatives**
- **No** notion of **differentiability**
- Step along a direction: **no addition**

---

Smooth manifolds
---

"A manifold covered with a maximal atlas of smoothly-compatible charts", $(M, \Theta, \mathcal{A})$

<center><img src="https://i.imgur.com/F0NBjOs.jpg" class="img-responsive" width="40%" /></center>

---

Smooth *functions*
---

<center><img src="https://i.imgur.com/8BHgHRv.jpg" class="img-responsive" width="75%" /></center>

---

Directional derivatives
---

- $(M, \Theta, \mathcal{A}),$
- $p\in M,$
- Smooth curve $\gamma:I\subset\mathbb{R}\to M,$ $\gamma(0)=p,$
- $f: M\to \mathbb{R}$ smooth.

$$(f\circ \gamma)'(0).$$

---

Directional derivatives
---

$$X_{\gamma, p}f = (f\circ \gamma)'(0)$$

"Directional derivative of $f$ along $\gamma$ at $p$"

---

<center><img src="https://i.imgur.com/4Zu3yoc.jpg" style="border-radius: 1em; margin: 1em;" width="45%" class="img-responsive"/></center>

---

Directional derivatives
---

$$X_{\gamma, p}f = (f\circ \gamma)'(0).$$

- $X_{\gamma,p}:C^\infty(M)\to\mathbb{R}$ is a **linear operator**.
- **Leibniz rule**: $X_{\gamma, p}(fg) = f(p)X_{\gamma, p}g + X_{\gamma, p}(f)g(p).$

---

<center><img src="https://i.imgur.com/mSv7diH.jpg" class="img-responsive" width="45%" style="border-radius: 1em; margin: 1em" /></center>

---

Tangent space
---

$$T_pM = \left\{ X_{\gamma,p}\left| \gamma:I\to M~\text{smooth}, ~\gamma(0)=p \right. \right\}$$

- "Tangent space to $M$ at $p$"
- Coincides with the set of all _derivations_
- _Derivations_ :left_right_arrow: _curve_ segments via *charts*
- Forms a linear space
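---

Tangent vectors. In code
---

A minimal sketch of $X_{\gamma,p}f = (f\circ\gamma)'(0)$ computed with plain PyTorch autograd; the curve, the function and the numbers below are arbitrary illustrative choices.

```python
import torch

def f(x):
    # A smooth function on R^3; its restriction to S^2 is smooth on the manifold.
    return x[0] + 2 * x[1]

def gamma(t):
    # A smooth curve on the unit sphere S^2 with gamma(0) = p = (1, 0, 0).
    return torch.stack([torch.cos(t), torch.sin(t), torch.zeros_like(t)])

t = torch.tensor(0.0, requires_grad=True)  # curve parameter
value = f(gamma(t))                        # (f ∘ gamma)(t)
value.backward()                           # differentiate the composition at t = 0
print(t.grad)                              # X_{gamma,p} f = (f ∘ gamma)'(0) = 2.0
```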
<center><img src="https://i.imgur.com/wypzdvi.jpg" class="img-responsive" width="30%" /></center> - Denoted: $\frac{\partial}{\partial x^1},~\ldots,~\frac{\partial}{\partial x^n}.$ - Acts: $\frac{\partial}{\partial x^k}f = \partial_k (\underbrace{f\circ (x^k)^{-1}}_{\mathbb{R}^n\to\mathbb{R}})(\underbrace{x(p)}_{\mathbb{R}^n})$ where $\partial_k$ is partial derivative --- Exponential map --- _TODO: move to after geodesics_ - "_Direction_" means $X\in T_pM$ - $\exists$ open $C(p)\subset M$ (cut locus) s.t. ~~curve~~ geodesic $\gamma$ representing $X$ is **unique** $$\exp_pX = \gamma(1)~\text{if defined}.$$ <center><img src="https://i.imgur.com/d7wI2OA.jpg" class="img-responsive" width="30%" /></center> --- Smooth manifolds. Missing tools --- - Before: $p_{t+1} = p_t - \alpha g_{t},$ - After: $p_{t+1} = \exp_{p_t}(-\alpha X_t).$ :confused: Given objective $f:M\to\mathbb{R}$ and $p_t\in M$, **how to pick $X_t$**? --- Derivative of a function --- Given smooth $F:M\to N,$ "the _push-forward_ of $F$ at $p$" is the linear map $F_{*,p}:T_p{M}\to T_{F(p)}N$ $$(\underbrace{F_{*,p}X}_{T_{F(p)}N})\underbrace{g}_{C^\infty(N)} = \underbrace{X}_{T_p M}(\underbrace{g\circ F}_{C^\infty(M)}).$$ - In coordinate basis: Jacobian matrix --- Derivative of objective --- Given objective $f:M\to \mathbb{R}$ its derivative at point $p\in M$ is a linear functional $$T_p{M}\to T\mathbb{R}\cong \mathbb{R}.$$ - $f_{*,p}\in T_p^* M$, - need $T_p M$ :confused: . --- Metric manifolds --- "A smooth manifold endowed with _smooth_ assignment of (almost) '_inner products_' $g:T_p M\times T_p M\to \mathbb{R}$ at each $p$". Notes for details - Rieszc in $\mathbb{R}^n$: vectors :left_right_arrow: linear functionals via _inner product_ - $\omega = g(X, \cdot) = X^\flat$ - $\omega \mapsto X=\omega^\sharp$ as unique $X$ s.t. $\omega(Y) = g(X, Y)$. --- RGD --- $$X_t = \operatorname{grad}f(p_t) \equiv f_{*,p_t}^\sharp,$$ $$x_{t+1} = \exp_{p_t}(-\alpha X_t).$$ Ready to implement? --- Extrinsic representation --- _Immersion_: smooth $\phi: M\to\mathbb{R}^n$ such that $\phi_{*,p}$ is injective for every $p\in M$. _Embedding_: a diffeomorphic immersion. [Strong] Whitney: can embed in $\mathbb{R}^{2\operatorname{dim}M}.$ --- Extrinsic representation --- $M\subset\mathbb{R}^d$ be a level-set ($\{p\left|F(p)=0\right.\}$). - $\left.\operatorname{id}\right|_{M}$ is an embedding and - can represent points and tangents via (redundant) $\mathbb{R}^d$ coordinates, - can induce metric from Euclidean using $F$, - need __projection__ op. - (prevalent approach, including geoopt) - (examples using local charts?) --- <center><img src="https://i.imgur.com/Wd5ofKf.png" class="img-responsive" width="75%" /> (Becigneul, Ganea)</center> --- Retractions --- - Closed-form $\exp$ and $\sharp$? - Usually unknown or infeasible - First-order approximations to $\exp$: **retractions** $$R_p:T_p M \to M,$$ $$R_p(0) = p,$$ $$(R_p)_{*,0} "=" \operatorname{id}_{T_p M}.$$ --- Faster methods --- - In vector spaces: - second-order methods (Newton) - dynamic approximations of Hessian (quasi-Newton) - "momentum", "adaptivity term" - On manifold: - First derivative is a vector field - **How to differentiate vector fields**? 
---

Faster methods
---

- In vector spaces:
  - second-order methods (Newton)
  - dynamic approximations of the Hessian (quasi-Newton)
  - "momentum", "adaptivity term"
- On manifolds:
  - The first derivative is a vector field
  - **How to differentiate vector fields**?

---

Vector fields
---

A **bundle** is a triple $(E, \pi, M)$,

$$E\xrightarrow{\pi} M$$

- $E$ -- "total space" (smooth mfd),
- $\pi$ -- "projection" (smooth map),
- $M$ -- "base space" (smooth mfd).

---

Vector fields
---

The tangent bundle:

$$E = TM = \bigsqcup_p T_p M,$$

$$\pi : X_{\gamma, p} \mapsto p.$$

---

Vector fields
---

A smooth **section** of a bundle $E\xrightarrow\pi M$ is a smooth map

$$\sigma:M\to E$$

such that

$$\pi\circ\sigma = \operatorname{id}_M$$

The set of all smooth sections is denoted $\Gamma(E)$.

---

Vector fields
---

A **vector field** is a smooth section of the tangent bundle $TM$.

- Assigns to each $p\in M$ a vector $X_p\in T_p M$
- but in a "smooth way"

Another example: the metric $g$ is a smooth section of the bundle of bilinear forms (a $(0,2)$-tensor field)

---

Covariant derivative
---

"Covariant derivative of a vector field $Y$ along $X$":

$$\nabla_XY$$

Wishlist:

1. $\nabla_{fX+Y}Z = f\nabla_XZ + \nabla_Y Z,$
2. $\nabla_X(Y+Z) = \nabla_XY + \nabla_X Z,$
3. $\nabla_X(fY) = X(f)Y + f\nabla_XY.$

---

![](https://i.imgur.com/6nISqDk.jpg)

---

Covariant derivative
---

In a chart: ... (Christoffel symbols)

---

Levi-Civita connection
---

...

---

![](https://i.imgur.com/LUIGeV8.jpg)

---

Vector field along a curve
---

- $\gamma:I\to M$
- $X:I\to TM$ s.t. $X_\lambda \in T_{\gamma(\lambda)}M$

**Extendible** if there is $\tilde{X}\in\Gamma(TM)$ s.t. $X_\lambda = \tilde{X}_{\gamma(\lambda)}.$

- Thm: locally unique

---

Parallel translate
---

If $X:I \to TM$ is a vector field along $\gamma$ extendible to $\tilde{X}$ and

$$\nabla_{\dot\gamma}\tilde{X} = 0$$

then $X$ is called **parallelly transported** along $\gamma$

- "Transports" $X_0\in T_{\gamma(0)}M$ to $X_\lambda\in T_{\gamma(\lambda)}M$

---

![](https://i.imgur.com/kimW3tk.jpg)

---

<center><img src="https://i.imgur.com/Wd5ofKf.png" class="img-responsive" width="75%" /> (Becigneul, Ganea)</center>

---

Vector transport
---

- No closed-form parallel translation either
- "Vector transport" -- a 1st-order approx. (see Absil)

---

(here should go some qualitative derivation of vector transports for level sets)

---

Newton
---

_later addition_

> $f: \mathbb{R} \to \mathbb{R},$ find $x$ s.t. $f(x)=0$.

- Instead, replace $f$ with the linearization $h(x) = f(x_t) + f'(x_t)(x - x_t)$
- and solve $h(x)=0$:
- $x = x_t-f'(x_t)^{-1}f(x_t)$

---

Newton
---

_later addition_

> $f: \mathbb{R}\to \mathbb{R},$ find $x$ s.t. $f'(x)=0$.

- Apply Newton's iteration to $f'$
- $x_{t+1} = x_t - f''(x_t)^{-1} f'(x_t)$

---

Newton
---

_later addition_

> $f: \mathbb{R}^n\to\mathbb{R}^m$, find $x$ s.t. $f(x)=0$.

- $\tilde{f}(x) = f(x_t) + f'(x_t)(x-x_t),$
- $\tilde{f}(x_{t+1}) = 0.$

---

Newton
---

_later addition_

> $f: \mathbb{R}^n\to\mathbb{R}$, find $x$ s.t. $f'(x)=0$.

- $\tilde{f'}(x) = f'(x_t) + \underbrace{f''(x_t)(x - x_t)}_{\in~{^n}\mathbb{R}},$
- $f''(x_t)(x - x_t) = -f'(x_t)$

---

Newton
---

Solve

$$\nabla_{X_t}\operatorname{grad}f(p_t) = -\operatorname{grad}f(p_t)$$

for $X_t\in T_{p_t}M$.

$$p_{t+1} = \exp_{p_t}X_t.$$

---

Accelerated first-order methods
---

- Stochastic optimization
- Accelerated SGD?
- Adam: coordinate-wise adaptive learning rate (more or less)
- Parallel translation of "momentum"

"Coordinates" on manifolds?

---

Product manifolds
---

$$M = M_1\times \cdots\times M_n,$$

Product topology, product metric, ...

**RAdam**: an adaptivity term for each product component!
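---

RAdam in geoopt. A sketch
---

A minimal sketch, assuming geoopt's `ManifoldParameter` / `geoopt.optim.RiemannianAdam` interface: two parameters living on different manifolds act as components of a product, and the optimizer keeps its state per parameter. The objective and the numbers are illustrative.

```python
import torch
import geoopt

sphere = geoopt.Sphere()
ball = geoopt.PoincareBall(c=1.0)

# Parameters that remember which manifold they live on.
u = geoopt.ManifoldParameter(sphere.projx(torch.randn(3)), manifold=sphere)
v = geoopt.ManifoldParameter(ball.projx(torch.randn(2) * 0.1), manifold=ball)

a = torch.tensor([0.0, 0.0, 1.0])              # target direction on the sphere
target = ball.projx(torch.tensor([0.3, 0.4]))  # target point in the ball

opt = geoopt.optim.RiemannianAdam([u, v], lr=1e-2)

for _ in range(500):
    opt.zero_grad()
    # Illustrative objective: align u with a, pull v towards the target point.
    loss = -(u @ a) + ball.dist(v, target) ** 2
    loss.backward()
    opt.step()  # Riemannian Adam step: per-parameter adaptivity, retraction/expmap

print(u.detach(), v.detach())
```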
---

<center><img src="https://i.imgur.com/Wd5ofKf.png" class="img-responsive" width="75%" /> (Becigneul, Ganea)</center>

---

$\varkappa$-Stereographic space
---

- (Becigneul, Ganea) use RAdam to embed WordNet in a Poincaré ball (constant curvature $c$)
- Hyperbolic neural networks
- $c$ -- a hyperparameter
- Optimize for $c$?

---

$\varkappa$-Stereographic space
---

- $S^\varkappa$ -- the $\varkappa$-stereographic space, universal for positive and negative curvature ([Bloch](https://andbloch.github.io/K-Stereographic-Model/))
- "Almost an $\mathbb{R}$-line bundle":
  - $E\xrightarrow\pi \mathbb{R}$
  - $\pi^{-1}(\varkappa) \simeq S^\varkappa$
- Can we induce a metric, etc.? OK for optimization?
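---

Optimizing for $c$. A naive sketch
---

A naive sketch of the "optimize for $c$?" question, assuming the standard Möbius addition and distance formulas of the $c$-Poincaré ball (as in the hyperbolic neural networks literature): if $c$ enters the formulas as an ordinary tensor, autograd already provides $\partial\,\mathrm{loss}/\partial c$. This ignores the real difficulty that the points must stay inside the ball of radius $1/\sqrt{c}$ while $c$ itself moves.

```python
import torch

def mobius_add(x, y, c):
    # Möbius addition in the c-Poincaré ball.
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def dist(x, y, c):
    # d_c(x, y) = (2 / sqrt(c)) * artanh( sqrt(c) * || (-x) ⊕_c y || )
    sqrt_c = c.sqrt()
    return 2 / sqrt_c * torch.atanh(sqrt_c * mobius_add(-x, y, c).norm(dim=-1))

c = torch.tensor(1.0, requires_grad=True)   # curvature as a learnable scalar
x = torch.tensor([0.1, 0.2])
y = torch.tensor([-0.3, 0.4])

d = dist(x, y, c)
d.backward()
print(d.item(), c.grad)   # the distance and its sensitivity to the curvature
```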
{"metaMigratedAt":"2023-06-15T04:45:26.758Z","metaMigratedFrom":"YAML","title":"Riemannopt@HSE - SLIDES, pt. 1/2","breaks":false,"description":"Sketch of slides for the Riemann opts talk at HSE. View the slide with \"Slide Mode\".","slideOptions":"{\"transition\":\"fade\",\"allottedMinutes\":60,\"theme\":\"white\"}","contributors":"[{\"id\":\"5df6d742-95e4-4a3e-8568-902ecaabcfca\",\"add\":15267,\"del\":4364}]"}