Riemannopt@HSE
===
https://hackmd.io/@SomeoneS/manopt-slides
It's a two-part talk about optimization on manifolds
and hyperbolic embeddings in deep learning.
- The preliminary talk, given by [@SomeoneS](https://github.com/SomeoneSerge/), takes place on ~~March, 6~~ **March, 13** and covers the basics.
- What are manifolds? Why Riemannian optimization?
    - What does it mean to make a "step in a direction" on a manifold?
- How to differentiate functions between manifolds?
- What's tricky about higher-order derivatives, and what do "acceleration" and "parallel" mean?
- How to measure distances?
- How to represent manifolds with numerical arrays? Embedded manifolds
- How to get by without closed-form expressions? Retractions
- How to speed up? Adaptive methods on product-manifolds and discussion.
- The second talk, given by [@ferrine96](https://twitter.com/ferrine96/), is devoted to applications in deep learning, and in particular to graph embeddings. The date is ~~March, 13~~ **March, 20**.
- Embedding graphs in hyperbolic spaces
- Mixed curvature and [$\kappa$-stereographic model](https://andbloch.github.io/K-Stereographic-Model/)
- [Hyperbolic neural networks](https://arxiv.org/abs/1805.09112)
- Batch-Normalization on manifolds
- Generalizing convolutional layers
- And more
Notes and slides:
- [Notes](https://hackmd.io/@newkozlukov/manopt-notes) and [slides](https://hackmd.io/@newkozlukov/manopt-slides) for the first talk.
The notes are a work-in-progress, many sections are still incomplete (some lack visualizations, some lack accompanying text, some both). We'll be updating the script as we approach the D-day.
- Notes and slides for the second talk: coming soon
References:
- Gravity and Light workshop: https://gravity-and-light.herokuapp.com/tutorials
- "Geometrical Anatomy of Theoretical Physics" by Frederic Schuller on [YouTube](https://youtu.be/V49i_LM8B0E?list=PLPH7f_7ZlzxTi6kS4vCmv4ZKm9u8g5yic)
- Lee's series
- Introduction to Topological Manifolds: https://archive.org/details/springer_10.1007-978-0-387-22727-6/mode/2up
- Introduction to Smooth Manifolds: http://webmath2.unito.it/paginepersonali/sergio.console/lee.pdf
- Riemannian Manifolds: https://www.maths.ed.ac.uk/~v1ranick/papers/leeriemm.pdf
- Absil, Mahony, and Sepulchre, *Optimization Algorithms on Matrix Manifolds*:
- Book online: https://press.princeton.edu/absil
- Book: http://www.eeci-institute.eu/GSC2011/Photos-EECI/EECI-GSC-2011-M5/book_AMS.pdf
- Slides, Feb, 2011: http://www.eeci-institute.eu/GSC2011/Photos-EECI/EECI-GSC-2011-M5/slides_lectures.pdf
- Slides, Nov, 2014: http://gdr-mia.math.cnrs.fr/events/optimgeo-14/program/slides/absil.pdf
Additional:
- http://www.cnbc.cmu.edu/cns/papers/edelman98.pdf
- https://arxiv.org/abs/1907.00949
- https://arxiv.org/abs/1607.02833
Starting point -- RAdam
---
Let us first set the goals for today.
Somewhat unconventionally, we jump off with the algorithm from
<a href="https://openreview.net/forum?id=r1eiqi09K7&noteId=HygEnIClNE">
"Riemannian Adaptive Optimization Methods"</a>
by Becigneul and Ganea. It might look scary at first,
but we're going to work through it iteratively, decoding it bit by bit.
<!--<img src="https://i.imgur.com/pjKeUTZ.png" alt="RAMSGRAD" style="width: 60%" />
Let's take a closer look and see what terms we'd need to learn first:
-->

Manifolds
---
An $n$-dimensional manifold $M$ is a topological space $(M, \theta)$
that
- is "nice":
- Hausdorff (points have disjoint neighbourhoods)
- second-countable (has countable basis for topology $\theta$)
- at every point $p\in M$ has a neighbourhood $U\subset M$
in which it looks like Euclidean space:
there is a homeomorphism $x:U \to x(U)\subset \mathbb{R}^n$.
- $x$ is called a (local) "chart" at $p$; it provides "coordinates" on that piece of manifold
- "homeromorphism" means: continuous, invertible, with continuous inverse
Roughly speaking, it's a "locally-Euclidean space".
For instance, $\mathbb{R}^n$ itself is a manifold,
with the (global) chart being the identity map.
An additional piece of structure we're going to need
(on top of the topology and local Euclidean-ness)
is that of a **smooth manifold**. Consider
two charts $(U, x)$, $(V, y)$ that share some part of their domains:
$U\cap V \neq \varnothing.$ Both charts being invertible functions,
we can construct a "**change of coordinates**", or "**transition map**"
from $x$ to $y$ as $y\circ x^{-1}$. We say that $x$ and $y$
are **smoothly compatible**, if either $U\cap V = \varnothing$
or the transition map is a **diffeomorphism** (in the Euclidean sense):
$$y\circ x^{-1}: x(U\cap V)\subset\mathbb{R}^n
\to y(U\cap V)\subset\mathbb{R}^n.$$
Now, to endow $M$ with a **smooth structure**, means to provide
a **maximal atlas** $A$ on $M$ -- a collection of *smoothly-compatible*
charts, whose domains cover entire $M$, and such that
this collection cannot be extended with a new chart in a smoothly-compatible way.
<img src="https://i.imgur.com/F0NBjOs.jpg" style="width: 60%; margin: 1em; border-radius: 1em;"/>
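For a concrete picture, take $S^1\subset\mathbb{R}^2$ with the two stereographic charts (projection from the north and from the south pole); on their overlap the transition map works out to $t\mapsto 1/t$, which is smooth away from $0$, so the two charts are smoothly compatible. A minimal numerical sketch (the chart conventions below are our own choice):

```python
import jax.numpy as jnp

# Two charts on the circle S^1 = {(a, b) : a^2 + b^2 = 1}:
# stereographic projections from the north pole N = (0, 1) and the south pole S = (0, -1).
x_north = lambda p: p[0] / (1.0 - p[1])        # defined on S^1 \ {N}
x_south = lambda p: p[0] / (1.0 + p[1])        # defined on S^1 \ {S}
x_north_inv = lambda t: jnp.array([2.0 * t, t**2 - 1.0]) / (t**2 + 1.0)

# On the overlap (S^1 minus both poles) the transition map x_south ∘ x_north^{-1}
# is t -> 1/t, a diffeomorphism of R \ {0}, so the charts are smoothly compatible.
for t in [0.5, -2.0, 3.7]:
    assert jnp.isclose(x_south(x_north_inv(t)), 1.0 / t)
```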
Now that we've done some hard work, we probably want to see some benefits.
What *smooth structure* allows us to define, is: what does it mean for a
function to be a **smooth function between manifolds**?
Given two **smooth manifolds** $(M, \theta_M, A_M)$
and $(N, \theta_N, A_N)$, we say that a function $F: M\to N$
is **smooth** (resp. $C^k$) if
$$y\circ F\circ x^{-1}: x(U)\subset\mathbb{R}^m \to y(V)\subset\mathbb{R}^n$$
is smooth (resp. $C^k$)
for every $(U, x)\in A_M$, $(V, y) \in A_N$.
<img src="https://i.imgur.com/8BHgHRv.jpg" class="img-responsive" style="width: 60%; margin: 1em; border-radius: 1em;"/>
## Submanifolds, product manifolds, bundles, fibers, sections
_On one hand, I want to introduce bundles&c after tangent spaces, in the
context of vector fields&c; but on the other, it seems meaningful
to introduce submanifolds&c at very beginning so that we have examples
early on; this section will be probably split and moved later;
this section follows Frederic Schuller's lecture_
_I also need bundles to try and introduce curvature learning on stereographic model spaces_
One way to construct manifolds, and the one that we're going
to use the most, is from other manifolds.
Given a topological $n$-manifold $(M, \theta)$,
we call $(N, \left.\theta\right|_N)$, $N\subset M$, a **submanifold** of $M$
if $(N, \left.\theta\right|_N)$ is itself a manifold.
Given $m$-manifold $M$ and $n$-manifold $N$,
their **product manifold** is the manifold
with underlying set the cartesian product
and endowed with product topology:
$(M{\times}N, \theta_{M{\times}N})$.
- $n$-torus $T^n$ is product $S^1{\times}\cdots{\times}S^1$.
A **bundle** of topological manifolds
is a triple $(E, \pi, M)$
of total space $E$, continuous surjective **projection** map $\pi:E\to M$,
and base space $M$.
A bundle is a weaker version of a product: instead of saying that $E$
can be constructed from its components, we only say that $E$ can
be projected onto a base space
- $\pi: E\to M$
- A **fiber** of $E$ at $p\in M$ is: $\pi^{-1}(p)$
- Future uses: vector bundles, tangent/cotangent bundle, sections
Example: product manifold is a bundle (REMOVE?)
- $E = M{\times}N,$
- $\pi: M{\times}N \to M,$
- $(p, q) \mapsto p$ (continuous w.r.t. product topology),
Example: Mobius strip is a bundle, but not a product
- TODO: PICTURE
- $E$ -- Mobius Strip,
- $M = S^1$,
- $\pi^{-1}(p) \cong [-1, 1].$
A **fiber bundle** is a bundle $\pi:E\to M$ s.t.
$\pi^{-1}(p) \cong F$ for all $p\in M$ and some ("typical fiber") manifold $F$.
Lo and behold, given a bundle $\pi:E\to M,$
a **section** $\sigma: M\to E$
assigns to each point $p$ in manifold $M$
a point from its fiber $\sigma(p)\in \pi^{-1}(p)\subset E$.
In other words, it's a map $\sigma: M\to E$ such that $\pi \circ \sigma = \operatorname{id}_M$.
Example: vector fields -- a **vector field** is a section of the tangent bundle $\operatorname{T}M$ (introduced below).
TODO
- **Bundle morphisms**
- **Bundle isomorphisms**
- **Locally isomorphic bundles**
- **Trivial bundles, locally trivial bundles**, and **local trivialization**
- **Pullback bundle**
Gradient descent, "step in a direction", and differentiation
---
_In this section we talk about line-search along geodesics,
derivations and differentiation of functions between manifolds_
**TODO:** actually, we need a Riemannian metric
before we can finish this section -- in order to convert
the derivative into a tangent vector.
One of the simplest optimization methods one can think of
is the gradient descent:
given a (piecewise-)differentiable objective function
$f:X\subset\mathbb{R}^n \to \mathbb{R}$,
an initial estimate $x_0 \in X$,
and a learning rate $\alpha\in\mathbb{R}$,
we gradually improve our estimate, by making
a small linear step in the direction of steepest descent
of the loss function:
$$x_t = x_{t-1} - \alpha \operatorname{grad} f (x_{t-1}).$$
One way to interpret such an iteration is that at step $t$
we approximate our objective $f$ around $x_{t-1}$ by its first-order Taylor expansion,
and minimize that linear model over a small ball around $x_{t-1}$:
the minimizer lies in the direction $-\operatorname{grad} f(x_{t-1})$.
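A minimal sketch of this iteration (the quadratic objective and the step size below are arbitrary illustrative choices):

```python
import jax.numpy as jnp
from jax import grad

# Illustrative objective: a strictly convex quadratic on R^2.
f = lambda x: 0.5 * x @ jnp.array([[2.0, 0.0], [0.0, 10.0]]) @ x

x, alpha = jnp.array([3.0, -1.0]), 0.05
for _ in range(200):
    x = x - alpha * grad(f)(x)   # x_t = x_{t-1} - alpha * grad f(x_{t-1})
# x is now close to the minimizer (0, 0)
```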
Let's begin with the notion of "**direction**".
The key observation for generalization of "directions" and "direction fields" to manifolds, is that "direction" attached to a point **acts** on **scalar functions** as a directional derivative at that point.
Specifically, given a "test function" $f:\mathbb{R}^n \to \mathbb{R}$, a vector $X\in\mathbb{R}^n$
attached to point $p\in\mathbb{R}^n$
_acts_ on $f$ by computing
$$(\lambda\mapsto f(p + \lambda X))'(0).$$
In fact, for a fixed $p$, there's a one-to-one correspondence between such linear functionals and $\mathbb{R}^n$-vectors.
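In $\mathbb{R}^n$ this action is exactly a Jacobian-vector product, which automatic differentiation computes directly; a small sketch (the test function, point, and direction are arbitrary choices):

```python
import jax.numpy as jnp
from jax import jvp

f = lambda p: jnp.sin(p[0]) * p[1]           # a test function on R^2
p = jnp.array([0.3, 2.0])                    # the point the direction is attached to
X = jnp.array([1.0, -1.0])                   # the direction

# (lambda -> f(p + lambda * X))'(0) is the Jacobian-vector product of f at p with X:
_, dfX = jvp(f, (p,), (X,))
# the same value by hand: <grad f(p), X>
assert jnp.isclose(dfX, jnp.cos(p[0]) * p[1] * 1.0 + jnp.sin(p[0]) * (-1.0))
```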
For an abstract smooth manifold $M$,
we can take $C^\infty(M)$ for the space of test functions,
while "directional derivatives" would be
_derivatives along smooth curves_.
Given a smooth curve $\gamma: I\subset\mathbb{R}\to M$, passing through a point $p\in M$ (s.t. $\gamma(0)=p$), the derivative of a test function $f:M\to \mathbb{R}$ at $p$ along $\gamma$ is
$$X_{\gamma, p}(f) = (f\circ \gamma)'(0).$$
We could read this formula as "initial velocity of $f$ along $\gamma$".
There's a handful of observations to be made about this concept:
- Only the local behaviour of $\gamma$ matters: two different curves may represent the same directional derivative (a numerical check follows this list)
- $X_{\gamma, p}$ is a linear functional
- It satisfies Leibniz rule: $X(fg) = X(f)g(p) + f(p)X(g)$. Such linear functionals are called **derivations**.
<div style="margin: 1em;">
<img src="https://i.imgur.com/4Zu3yoc.jpg" style="border-radius: 1em; margin: 1em;" />
- The set $T_pM = \{X_{\gamma, p}:~\gamma~\text{smooth curve through}~p\}$ actually forms a linear space: the action of linear combinations of derivations is defined pointwise, and an example curve representing the combination may be constructed in a chart:
<img src="https://i.imgur.com/mSv7diH.jpg" style="border-radius: 1em; margin: 1em" />
</div>
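Here is the promised numerical check of the first observation, in the simplest case $M=\mathbb{R}^2$: a straight line and a curved path through the same point with the same initial velocity act identically on a test function (all the specific choices below are arbitrary):

```python
import jax.numpy as jnp
from jax import grad

p, v, w = jnp.array([1.0, 2.0]), jnp.array([0.5, -1.0]), jnp.array([3.0, 3.0])
f = lambda q: jnp.exp(q[0]) * jnp.cos(q[1])           # test function

gamma1 = lambda t: p + t * v                          # straight line through p
gamma2 = lambda t: p + t * v + t**2 * w               # different curve, same initial velocity

d1 = grad(lambda t: f(gamma1(t)))(0.0)                # (f ∘ gamma1)'(0)
d2 = grad(lambda t: f(gamma2(t)))(0.0)                # (f ∘ gamma2)'(0)
assert jnp.isclose(d1, d2)                            # the same derivation X_{gamma, p}
```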
**Leibniz rule and derivations**
Derivatives
---
_Total derivative, pullback, pushforward_
A $C^k$ function $F:M\to N$ between manifolds
induces maps between (co-)tangent bundles
of $M$ and $N$.
The **push-forward** $F_*X_p$ of $X_p\in T_pM$ ("on the left") under $F$
is a tangent vector in $T_{F(p)}N$ such that for $f\in C^\infty(N)$
$$(F_*X_p)(f) = X_p(f\circ F).$$
Similarly, for a co-tangent $\omega\in T^*_{F(p)}N$ ("on the right"), its **pull-back** under $F$ is $F^*\omega \in T^*_pM$ such that for $X\in T_pM$
$$(F^*\omega)(X) = \omega(F_*X).$$
The notion of push-forward generalizes that of a derivative.
In coordinate basis, push-forward corresponds to Jacobian matrix acting on column-vectors.
_Hmm, pull-backs should correspond to Jacobian matrix too, matrix acting from the left, am I right? Yet we can't have correspondence between tangents and cotangents without metric tensor? Oh, wait, the derivative, apparently, should be a $(1,1)$-tensor?! Ah, not even so, it takes one index from left manifold and the other from the right mfd_
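In coordinates, automatic differentiation hands us both operations: the push-forward of a tangent vector under $F$ is a Jacobian-vector product, and the pull-back of a covector is a vector-Jacobian product. A sketch for a map written in charts (the particular $F$, point, and covector are arbitrary choices):

```python
import jax.numpy as jnp
from jax import jvp, vjp

# F: R^2 -> R^3, thought of as a map between manifolds written in charts.
F = lambda u: jnp.array([u[0] * u[1], jnp.sin(u[0]), u[1] ** 2])

p  = jnp.array([0.7, -1.2])        # point in a chart of M
Xp = jnp.array([1.0, 2.0])         # tangent vector at p

# Push-forward F_* X_p: the Jacobian of F at p acting on the column vector X_p.
_, FstarX = jvp(F, (p,), (Xp,))

# Pull-back F^* omega of a covector omega at F(p): the row vector omega
# times the same Jacobian, i.e. a vector-Jacobian product.
omega = jnp.array([1.0, 0.0, -3.0])
_, pullback = vjp(F, p)
(Fstar_omega,) = pullback(omega)

# Consistency check: (F^* omega)(X_p) == omega(F_* X_p).
assert jnp.isclose(Fstar_omega @ Xp, omega @ FstarX)
```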
Coordinate frame in $T_pM$
---
_Push-forwards of inverse charts form a basis in $T_pM$.
This also allows us to determine the dimension of the tangent bundle as a manifold_
Given a chart $(U, x)$, we can construct a basis in $T_pM,~p\in\operatorname{dom}x$ from the canonical basis in $T_{x(p)}\mathbb{R}^n \cong \mathbb{R}^n$,
by pushing it forward via the inverse chart:
$$E_i = \frac{\partial}{\partial x_i} = x^{-1}_* \partial_i,$$
where $\partial_i$ is the $i$'th partial derivative operator acting on functions on $\mathbb{R}^n$.
This is visualized as follows: one can construct a curve $\gamma_i$ emanating from $p$
by following the $i$-th coordinate line in the chart $x$:
$$\gamma_i(t)=x^{-1}(x(p) + t e_i).$$
Then $$\frac{\partial}{\partial x_i}(f) = \partial_i (f \circ x^{-1})(x(p)) = (t\mapsto f(x^{-1}(x(p) + te_i)))'(0) = X_{\gamma_i, p}(f).$$
In other words, $\frac{\partial}{\partial x_i}$ is represented by the curve $\gamma_i$. It now takes just a little exercise to show that they
are linearly independent and form a basis of $T_pM$.
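A numerical sketch on the unit sphere $S^2\subset\mathbb{R}^3$: take the spherical-coordinate chart, whose inverse is the usual parametrization; the columns of the Jacobian of the inverse chart are the coordinate basis vectors $\partial/\partial\theta$, $\partial/\partial\varphi$, represented as ambient vectors tangent to the sphere (the chart conventions are our own choice):

```python
import jax.numpy as jnp
from jax import jacobian

# Inverse chart x^{-1}: (theta, phi) -> S^2 ⊂ R^3 (spherical coordinates).
def chart_inv(u):
    theta, phi = u
    return jnp.array([jnp.sin(theta) * jnp.cos(phi),
                      jnp.sin(theta) * jnp.sin(phi),
                      jnp.cos(theta)])

u = jnp.array([0.9, 2.3])                  # chart coordinates of a point p
p = chart_inv(u)

J = jacobian(chart_inv)(u)                 # 3x2 Jacobian of the inverse chart
E_theta, E_phi = J[:, 0], J[:, 1]          # coordinate basis of T_p S^2, pushed into R^3

# Both basis vectors are tangent to the sphere at p (orthogonal to the position vector).
assert jnp.allclose(jnp.array([p @ E_theta, p @ E_phi]), 0.0, atol=1e-6)
```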
Distances, angles, isometry of $T_pM$ and $T_p^*M$
---
_So far, we've only built machinery for reasoning
about **continuity** and **smoothness**. We still
don't have any tools for discussing **geometry**.
Now we introduce the notion of **Riemannian metric**,
which will allow us to measure lengths of curves,
compute angles, and, most importantly,
convert the derivative of our objective function,
which is a cotangent vector, into element
of tangent space -- the **gradient**,
that is, a direction to make a step in._
Roughly, a **metric** on $M$ is a smooth assignment
of a symmetric "_inner product_" $g_p = \langle\cdot,\cdot\rangle_p$ to each $T_pM$.
A metric is **Riemannian** if it is positive definite ($g(X, X) > 0$ for $X \neq 0$). _In fact, in the second talk on applications, we'll be more interested in **Lorentzian** metrics of signature $(+--\cdots-)$, which correspond to hyperbolic spaces (via the hyperboloid model)._
TLDR:
- Now, finally, given a curve, we can measure its _speed_: $\|X\|_p^2 = \langle X, X \rangle_p,$ for $X\in\operatorname{T}_pM$.
- We can measure angles (via cosines).
- We can now measure lengths: $L_{[a,b]}(\gamma) = \int_a^b \|\gamma'(t)\|_{\gamma(t)}\operatorname{d}t.$
- We can now convert elements of tangent and cotangent bundle back and forth:
- $X\in T_pM$ now defines a linear functional $X^\flat=Y\mapsto \langle X, Y \rangle_p,$
- and a cotangent $\omega\in T_p^*M$ corresponds to tangent vector $\omega^\sharp\in T_pM$ satisfying equation $\langle \omega^\sharp, Y\rangle=\omega(Y)$ (i.e. resulting from application of inverse metric tensor)
- TODO: Show in a chart and under change of coordinates?
- This allows us to define, for $f:M\to\mathbb{R}$, $$\operatorname{grad}f = \operatorname{d}f^\sharp \in \Gamma(\operatorname{T}M).$$
Now, by a "smooth assignment" we mean that
a **metric** on a smooth manifold $M$ is a smooth map
$$g:\Gamma(\operatorname{T}M)\times\Gamma(\operatorname{T}M)\to C^\infty(M)$$
that takes two vector fields
and produces a scalar function.
_In other words, it's a smooth $(0,2)$-tensor field._
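To make the $\sharp$ conversion from the TLDR above concrete: on the Poincaré ball the metric is conformal, $g_p = \lambda_p^2 I$ with $\lambda_p = 2/(1-\|p\|^2)$, so applying the inverse metric to the Euclidean differential just rescales it. A sketch (the conformal factor is the standard one for the curvature $-1$ ball; the objective is an arbitrary illustration):

```python
import jax.numpy as jnp
from jax import grad

def riemannian_grad(f, p):
    """grad f = (df)^sharp on the Poincaré ball: apply the inverse metric to the differential."""
    lam = 2.0 / (1.0 - p @ p)          # conformal factor lambda_p
    return grad(f)(p) / lam**2         # g_p^{-1} = lambda_p^{-2} * I

f = lambda p: jnp.sum(p**2) + p[0]     # some smooth objective on the ball
p = jnp.array([0.3, -0.2])             # a point with ||p|| < 1
rgrad = riemannian_grad(f, p)
```

This rescaling is the main ingredient that distinguishes Riemannian gradient steps on the ball from Euclidean ones; the other is the exponential map (or retraction) used to actually take the step.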
Comparing vectors from different tangent spaces
---
_In this section we talk about differentiation of vector fields,
linear connection, zero-acceleration, and parallel translation_
**NB**: in some of the following pictures we temporarily
use the sometimes abusive and mostly obsolete notation from Absil et al.
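As a preview of where this section is heading: on the unit sphere, parallel translation along the minimizing great-circle arc from $x$ to $y$ has a closed form, and one can check numerically that the result lands in $T_yS^{n-1}$ and preserves inner products. A sketch using that standard formula (not taken from the slides):

```python
import jax.numpy as jnp

def transport_sphere(x, y, v):
    """Parallel transport of v in T_x S^{n-1} to T_y S^{n-1} along the minimizing geodesic."""
    return v - (v @ y) / (1.0 + x @ y) * (x + y)

x = jnp.array([1.0, 0.0, 0.0])
y = jnp.array([0.0, 1.0, 0.0])
v = jnp.array([0.0, 0.5, 0.5])          # tangent at x: orthogonal to x
w = transport_sphere(x, y, v)

assert jnp.isclose(w @ y, 0.0)          # the result is tangent at y
assert jnp.isclose(w @ w, v @ v)        # parallel transport is an isometry
```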







Newton method
---
_Now that we've learned to differentiate vector fields,
we can generalize Newton's method to manifolds_
$$\nabla_{\eta_t}\operatorname{grad}f(x_t)
= -\operatorname{grad}f(x_t),$$
which we solve for the Newton direction $\eta_t \in T_{x_t}M$ and then step along (via the exponential map or a retraction, introduced below).
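A sketch of this iteration on the unit sphere $S^2$ for $f(x)=x^\top A x$, using the standard embedded-sphere formulas $\operatorname{grad}f(x)=P_x\nabla \bar f(x)$ and $\nabla_\eta\operatorname{grad}f = P_x\nabla^2 \bar f(x)\,\eta - (x^\top\nabla \bar f(x))\,\eta$; the least-squares solve is just one convenient way of restricting the linear system to the tangent space (the matrix $A$ is an arbitrary example):

```python
import jax.numpy as jnp

A = jnp.array([[2.0, 0.3, 0.0],
               [0.3, 1.0, 0.1],
               [0.0, 0.1, 0.5]])
f_grad = lambda x: 2.0 * A @ x                 # Euclidean gradient of f(x) = x^T A x
f_hess = lambda x: 2.0 * A                     # Euclidean Hessian

def newton_step(x):
    P = jnp.eye(3) - jnp.outer(x, x)           # projector onto T_x S^2
    rgrad = P @ f_grad(x)                      # Riemannian gradient
    # Riemannian Hessian, written as a matrix acting on the tangent space:
    rhess = P @ f_hess(x) @ P - (x @ f_grad(x)) * P
    # Solve  rhess @ eta = -rgrad  (the minimum-norm solution stays in T_x S^2).
    eta, *_ = jnp.linalg.lstsq(rhess, -rgrad)
    x_new = x + eta
    return x_new / jnp.linalg.norm(x_new)      # map back to the sphere (a retraction)

x = jnp.array([1.0, 1.0, 1.0]) / jnp.sqrt(3.0)
for _ in range(10):
    x = newton_step(x)
# if all goes well, x approaches an eigenvector of A, i.e. a critical point of f on the sphere
```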
Curvature
---
_Not even sure if I should include it...
would be useful to prepare for the second talk_
Hard truth -- we don't know closed-form expmaps and parallel translations
---
_Here we discuss **retractions** and **vector transport** -- first-order approximations to the exponential map and parallel transport._
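For instance, on the unit sphere one often replaces the exponential map by the cheap retraction $R_x(v)=(x+v)/\|x+v\|$; a minimal Riemannian gradient-descent sketch with this retraction (the objective is an arbitrary illustration):

```python
import jax.numpy as jnp
from jax import grad

A = jnp.diag(jnp.array([3.0, 1.0, 0.5]))
f = lambda x: x @ A @ x                        # objective restricted to the unit sphere

def retraction(x, v):
    """First-order approximation of the exponential map on S^{n-1}: step, then renormalize."""
    y = x + v
    return y / jnp.linalg.norm(y)

x, alpha = jnp.array([1.0, 1.0, 1.0]) / jnp.sqrt(3.0), 0.1
for _ in range(100):
    egrad = grad(f)(x)                         # Euclidean gradient
    rgrad = egrad - (x @ egrad) * x            # project onto the tangent space T_x S^2
    x = retraction(x, -alpha * rgrad)          # replace the exponential map by a retraction
# x approaches the minimizer of f on the sphere: the eigenvector of the smallest eigenvalue
```

Replacing the retraction by the exact exponential map of the sphere, $\exp_x(v) = \cos(\|v\|)\,x + \sin(\|v\|)\,v/\|v\|$, would give the same first-order behaviour.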
{"metaMigratedAt":"2023-06-15T04:47:45.475Z","metaMigratedFrom":"YAML","title":"Riemannopt@HSE - NOTES, pt. 1/2","breaks":false,"description":"https://hackmd.io/@newkozlukov/manopt-slides","contributors":"[{\"id\":\"5df6d742-95e4-4a3e-8568-902ecaabcfca\",\"add\":18831,\"del\":2630},{\"id\":null,\"add\":1,\"del\":0}]"}