Riemannopt@HSE
===
https://hackmd.io/@SomeoneS/manopt-slides
It's a two-part talk about optimization on manifolds
and hyperbolic embeddings in deep learning.
- The preliminary talk, given by [@SomeoneS](https://github.com/SomeoneSerge/), takes place on ~~March, 6~~ **March, 13** and covers the basics.
- What are manifolds? Why Riemannian optimization?
    - What does it mean to make a "step in a direction" on a manifold?
- How to differentiate functions between manifolds?
- What's tricky about higher-order derivatives, and what do "acceleration" and "parallel" mean?
- How to measure distances?
- How to represent manifolds with numerical arrays? Embedded manifolds
- How to get by without closed-form expressions? Retractions
- How to speed up? Adaptive methods on product-manifolds and discussion.
- The second talk, given by [@ferrine96](https://twitter.com/ferrine96/), is devoted to applications in deep learning, and in particular to graph embeddings. The date is ~~March, 13~~ **March, 20**.
- Embedding graphs in hyperbolic spaces
- Mixed curvature and [$\kappa$-stereographic model](https://andbloch.github.io/K-Stereographic-Model/)
- [Hyperbolic neural networks](https://arxiv.org/abs/1805.09112)
- Batch-Normalization on manifolds
- Generalizing convolutional layers
- And more
Notes and slides:
- [Notes](https://hackmd.io/@newkozlukov/manopt-notes) and [slides](https://hackmd.io/@newkozlukov/manopt-slides) for the first talk.
The notes are a work-in-progress, many sections are still incomplete (some lack visualizations, some lack accompanying text, some both). We'll be updating the script as we approach the D-day.
- Notes and slides for the second talk: coming soon
References:
- Gravity and Light workshop: https://gravity-and-light.herokuapp.com/tutorials
- "Geometrical Anatomy of Theoretical Physics" by Frederic Schuller on [YouTube](https://youtu.be/V49i_LM8B0E?list=PLPH7f_7ZlzxTi6kS4vCmv4ZKm9u8g5yic)
- Lee's series
- Introduction to Topological Manifolds: https://archive.org/details/springer_10.1007-978-0-387-22727-6/mode/2up
- Introduction to Smooth Manifolds: http://webmath2.unito.it/paginepersonali/sergio.console/lee.pdf
- Riemannian Manifolds: https://www.maths.ed.ac.uk/~v1ranick/papers/leeriemm.pdf
- Absil, Mahony, and Sepulchre, *Optimization Algorithms on Matrix Manifolds*:
- Book online: https://press.princeton.edu/absil
- Book: http://www.eeci-institute.eu/GSC2011/Photos-EECI/EECI-GSC-2011-M5/book_AMS.pdf
- Slides, Feb, 2011: http://www.eeci-institute.eu/GSC2011/Photos-EECI/EECI-GSC-2011-M5/slides_lectures.pdf
- Slides, Nov, 2014: http://gdr-mia.math.cnrs.fr/events/optimgeo-14/program/slides/absil.pdf
Additional:
- http://www.cnbc.cmu.edu/cns/papers/edelman98.pdf
- https://arxiv.org/abs/1907.00949
- https://arxiv.org/abs/1607.02833
Starting point -- RAdam
---
Let us first set the goals for today.
Somewhat unconventionally, we jump off with the algorithm from
<a href="https://openreview.net/forum?id=r1eiqi09K7&noteId=HygEnIClNE">
"Riemannian Adaptive Optimization Methods"</a>
by Becigneul and Ganea. It might look scary at first,
but we're going to work through it iteratively, decoding it bit by bit.
<!--<img src="https://i.imgur.com/pjKeUTZ.png" alt="RAMSGRAD" style="width: 60%" />
Let's take a closer look and see what terms we'd need to learn first:
-->

Manifolds
---
An $n$-dimensional manifold $M$ is a topological space $(M, \theta)$
that
- is "nice":
- Hausdorff (points have disjoint neighbourhoods)
- second-countable (has countable basis for topology $\theta$)
- at every point $p\in M$ has a neighbourhood $U\subset M$
in which it looks like Euclidean space:
there is a homeomorphism $x:U \to x(U)\subset \mathbb{R}^n$.
- $x$ is called a (local) "chart" at $p$; it provides "coordinates" on that piece of manifold
- "homeromorphism" means: continuous, invertible, with continuous inverse
Roughly speaking, it's a "locally-Euclidean space".
For instance, $\mathbb{R}^n$ itself is a manifold,
with the (global) chart being the identity map.
An additional piece of structure we're going to need
(on top of the topology and local Euclidean-ness)
is that of a **smooth manifold**. Consider
two charts $(U, x)$, $(V, y)$ that share some part of their domains:
$U\cap V \neq \varnothing.$ Both charts being invertible functions,
we can construct a "**change of coordinates**", or "**transition map**"
from $x$ to $y$ as $y\circ x^{-1}$. We say that $x$ and $y$
are **smoothly compatible**, if either $U\cap V = \varnothing$
or the transition map is a **diffeomorphism** (in the Euclidean sense):
$$y\circ x^{-1}: x(U\cap V)\subset\mathbb{R}^n
\to y(U\cap V)\subset\mathbb{R}^n.$$
Now, to endow $M$ with a **smooth structure**, means to provide
a **maximal atlas** $A$ on $M$ -- a collection of *smoothly-compatible*
charts, whose domains cover entire $M$, and such that
this collection cannot be extended with a new chart in a smoothly-compatible way.
<img src="https://i.imgur.com/F0NBjOs.jpg" style="width: 60%; margin: 1em; border-radius: 1em;"/>
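For a concrete picture, take $S^1\subset\mathbb{R}^2$ with the two stereographic charts (projection from the north and from the south pole); on their overlap the transition map works out to $t\mapsto 1/t$, which is smooth away from $0$, so the two charts are smoothly compatible. A minimal numerical sketch (the chart conventions below are our own choice):

```python
import jax.numpy as jnp

# Two charts on the circle S^1 = {(a, b) : a^2 + b^2 = 1}:
# stereographic projections from the north pole N = (0, 1) and the south pole S = (0, -1).
x_north = lambda p: p[0] / (1.0 - p[1])        # defined on S^1 \ {N}
x_south = lambda p: p[0] / (1.0 + p[1])        # defined on S^1 \ {S}
x_north_inv = lambda t: jnp.array([2.0 * t, t**2 - 1.0]) / (t**2 + 1.0)

# On the overlap (S^1 minus both poles) the transition map x_south ∘ x_north^{-1}
# is t -> 1/t, a diffeomorphism of R \ {0}, so the charts are smoothly compatible.
for t in [0.5, -2.0, 3.7]:
    assert jnp.isclose(x_south(x_north_inv(t)), 1.0 / t)
```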
Now that we've done some hard work, we probably want to see some benefits.
What *smooth structure* allows us to define, is: what does it mean for a
function to be a **smooth function between manifolds**?
Given two **smooth manifolds** $(M, \theta_M, A_M)$
and $(N, \theta_N, A_N)$, we say that a function $F: M\to N$
is **smooth** (resp. $C^k$) if
$$y\circ F\circ x^{-1}: x(U)\subset\mathbb{R}^m \to y(V)\subset\mathbb{R}^n$$
is smooth (resp. $C^k$)
for every $(U, x)\in A_M$, $(V, y) \in A_N$.
<img src="https://i.imgur.com/8BHgHRv.jpg" class="img-responsive" style="width: 60%; margin: 1em; border-radius: 1em;"/>
## Submanifolds, product manifolds, bundles, fibers, sections
_On one hand, I want to introduce bundles&c after tangent spaces, in the
context of vector fields&c; but on the other, it seems meaningful
to introduce submanifolds&c at very beginning so that we have examples
early on; this section will be probably split and moved later;
this section follows Frederic Schuller's lecture_
_I also need bundles to try and introduce curvature learning on stereographic model spaces_
One way to construct manifolds, and the one that we're going
to use the most, is from other manifolds.
Given a topological $n$-manifold $(M, \theta)$,
we call $(N, \left.\theta\right|_N)$, $N\subset M$, a **submanifold** of $M$
if $(N, \left.\theta\right|_N)$ is itself a manifold.
Given $m$-manifold $M$ and $n$-manifold $N$,
their **product manifold** is the manifold
with underlying set the cartesian product
and endowed with product topology:
$(M{\times}N, \theta_{M{\times}N})$.
- $n$-torus $T^n$ is product $S^1{\times}\cdots{\times}S^1$.
A **bundle** of topological manifolds
is a triple $(E, \pi, M)$
of total space $E$, continuous surjective **projection** map $\pi:E\to M$,
and base space $M$.
A bundle is a weaker version of a product: instead of saying that $E$
can be constructed from its components, we only say that $E$ can
be projected onto a base space
- $\pi: E\to M$
- A **fiber** of $E$ at $p\in M$ is: $\pi^{-1}(p)$
- Future uses: vector bundles, tangent/cotangent bundle, sections
Example: product manifold is a bundle (REMOVE?)
- $E = M{\times}N,$
- $\pi: M{\times}N \to M,$
- $(p, q) \mapsto p$ (continuous w.r.t. product topology),
Example: Mobius strip is a bundle, but not a product
- TODO: PICTURE
- $E$ -- Mobius Strip,
- $M = S^1$,
- $\pi^{-1}(p) \cong [-1, 1].$
A **fiber bundle** is a bundle $\pi:E\to M$ s.t.
$\pi^{-1}(p) \cong F$ for all $p\in M$ and some ("typical fiber") manifold $F$.
Lo and behold, given a bundle $\pi:E\to M,$
a **section** $\sigma: M\to E$
assigns to each point $p$ in manifold $M$
a point from its fiber $\sigma(p)\in \pi^{-1}(p)\subset E$.
In other words, it's a map $\sigma: M\to E$ such that $\pi \circ \sigma = \operatorname{id}_M$.
Example: vector fields -- a **vector field** is a section of the tangent bundle $\operatorname{T}M$ (introduced below).
TODO
- **Bundle morphisms**
- **Bundle isomorphisms**
- **Locally isomorphic bundles**
- **Trivial bundles, locally trivial bundles**, and **local trivialization**
- **Pullback bundle**
Gradient descent, "step in a direction", and differentiation
---
_In this section we talk about line-search along geodesics,
derivations and differentiation of functions between manifolds_
**TODO:** actually, we need a Riemannian metric
before we can finish this section -- in order to convert
the derivative into a tangent vector.
One of the simplest optimization methods one can think of
is the gradient descent:
given a (piecewise-)differentiable objective function
$f:X\subset\mathbb{R}^n \to \mathbb{R}$,
an initial estimate $x_0 \in X$,
and a learning rate $\alpha\in\mathbb{R}$,
we gradually improve our estimate, by making
a small linear step in the direction of steepest descent
of the loss function:
$$x_t = x_{t-1} - \alpha \operatorname{grad} f (x_{t-1}).$$
One way to interpret such an iteration is that at step $t$
we approximate our objective $f$ around $x_{t-1}$ by its first-order Taylor expansion,
and minimize that linear model over a small ball around $x_{t-1}$:
the minimizer lies in the direction $-\operatorname{grad} f(x_{t-1})$.
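A minimal sketch of this iteration (the quadratic objective and the step size below are arbitrary illustrative choices):

```python
import jax.numpy as jnp
from jax import grad

# Illustrative objective: a strictly convex quadratic on R^2.
f = lambda x: 0.5 * x @ jnp.array([[2.0, 0.0], [0.0, 10.0]]) @ x

x, alpha = jnp.array([3.0, -1.0]), 0.05
for _ in range(200):
    x = x - alpha * grad(f)(x)   # x_t = x_{t-1} - alpha * grad f(x_{t-1})
# x is now close to the minimizer (0, 0)
```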
Let's begin with the notion of "**direction**".
The key observation for generalization of "directions" and "direction fields" to manifolds, is that "direction" attached to a point **acts** on **scalar functions** as a directional derivative at that point.
Specifically, given a "test function" $f:\mathbb{R}^n \to \mathbb{R}$, a vector $X\in\mathbb{R}^n$
attached to point $p\in\mathbb{R}^n$
_acts_ on $f$ by computing
$$(\lambda\mapsto f(p + \lambda X))'(0).$$
In fact, for a fixed $p$, there's a one-to-one correspondence between such linear functionals and $\mathbb{R}^n$-vectors.
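In $\mathbb{R}^n$ this action is exactly a Jacobian-vector product, which automatic differentiation computes directly; a small sketch (the test function, point, and direction are arbitrary choices):

```python
import jax.numpy as jnp
from jax import jvp

f = lambda p: jnp.sin(p[0]) * p[1]           # a test function on R^2
p = jnp.array([0.3, 2.0])                    # the point the direction is attached to
X = jnp.array([1.0, -1.0])                   # the direction

# (lambda -> f(p + lambda * X))'(0) is the Jacobian-vector product of f at p with X:
_, dfX = jvp(f, (p,), (X,))
# the same value by hand: <grad f(p), X>
assert jnp.isclose(dfX, jnp.cos(p[0]) * p[1] * 1.0 + jnp.sin(p[0]) * (-1.0))
```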
For an abstract smooth manifold $M$,
we can take $C^\infty(M)$ for the space of test functions,
while "directional derivatives" would be
_derivatives along smooth curves_.
Given a smooth curve $\gamma: I\subset\mathbb{R}\to M$, passing through a point $p\in M$ (s.t. $\gamma(0)=p$), the derivative of a test function $f:M\to \mathbb{R}$ at $p$ along $\gamma$ is
$$X_{\gamma, p}(f) = (f\circ \gamma)'(0).$$
We could read this formula as "initial velocity of $f$ along $\gamma$".
There's a handful of observations to be made about this concept:
- Only the local behaviour of $\gamma$ matters: two different curves may represent the same directional derivative (a numerical check follows this list)
- $X_{\gamma, p}$ is a linear functional
- It satisfies Leibniz rule: $X(fg) = X(f)g(p) + f(p)X(g)$. Such linear functionals are called **derivations**.
<div style="margin: 1em;">
<img src="https://i.imgur.com/4Zu3yoc.jpg" style="border-radius: 1em; margin: 1em;" />
- The set $T_pM = \{X_{\gamma, p}:~\gamma~\text{smooth curve through}~p\}$ actually forms a linear space: the action of linear combinations of derivations is defined pointwise, and an example curve representing the combination may be constructed in a chart:
<img src="https://i.imgur.com/mSv7diH.jpg" style="border-radius: 1em; margin: 1em" />
</div>
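Here is the promised numerical check of the first observation, in the simplest case $M=\mathbb{R}^2$: a straight line and a curved path through the same point with the same initial velocity act identically on a test function (all the specific choices below are arbitrary):

```python
import jax.numpy as jnp
from jax import grad

p, v, w = jnp.array([1.0, 2.0]), jnp.array([0.5, -1.0]), jnp.array([3.0, 3.0])
f = lambda q: jnp.exp(q[0]) * jnp.cos(q[1])           # test function

gamma1 = lambda t: p + t * v                          # straight line through p
gamma2 = lambda t: p + t * v + t**2 * w               # different curve, same initial velocity

d1 = grad(lambda t: f(gamma1(t)))(0.0)                # (f ∘ gamma1)'(0)
d2 = grad(lambda t: f(gamma2(t)))(0.0)                # (f ∘ gamma2)'(0)
assert jnp.isclose(d1, d2)                            # the same derivation X_{gamma, p}
```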
**Leibniz rule and derivations**
Derivatives
---
_Total derivative, pullback, pushforward_
A $C^k$ function $F:M\to N$ between manifolds
induces maps between (co-)tangent bundles
of $M$ and $N$.
The **push-forward** $F_*X_p$ of $X_p\in T_pM$ ("on the left") under $F$
is a tangent vector in $T_{F(p)}N$ such that for $f\in C^\infty(N)$
$$(F_*X_p)(f) = X_p(f\circ F).$$
Similarly, for a co-tangent $\omega\in T^*_{F(p)}N$ ("on the right"), its **pull-back** under $F$ is $F^*\omega \in T^*_pM$ such that for $X\in T_pM$
$$(F^*\omega)(X) = \omega(F_*X).$$
The notion of push-forward generalizes that of a derivative.
In coordinate basis, push-forward corresponds to Jacobian matrix acting on column-vectors.
_Hmm, pull-backs should correspond to Jacobian matrix too, matrix acting from the left, am I right? Yet we can't have correspondence between tangents and cotangents without metric tensor? Oh, wait, the derivative, apparently, should be a $(1,1)$-tensor?! Ah, not even so, it takes one index from left manifold and the other from the right mfd_
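In coordinates, automatic differentiation hands us both operations: the push-forward of a tangent vector under $F$ is a Jacobian-vector product, and the pull-back of a covector is a vector-Jacobian product. A sketch for a map written in charts (the particular $F$, point, and covector are arbitrary choices):

```python
import jax.numpy as jnp
from jax import jvp, vjp

# F: R^2 -> R^3, thought of as a map between manifolds written in charts.
F = lambda u: jnp.array([u[0] * u[1], jnp.sin(u[0]), u[1] ** 2])

p  = jnp.array([0.7, -1.2])        # point in a chart of M
Xp = jnp.array([1.0, 2.0])         # tangent vector at p

# Push-forward F_* X_p: the Jacobian of F at p acting on the column vector X_p.
_, FstarX = jvp(F, (p,), (Xp,))

# Pull-back F^* omega of a covector omega at F(p): the row vector omega
# times the same Jacobian, i.e. a vector-Jacobian product.
omega = jnp.array([1.0, 0.0, -3.0])
_, pullback = vjp(F, p)
(Fstar_omega,) = pullback(omega)

# Consistency check: (F^* omega)(X_p) == omega(F_* X_p).
assert jnp.isclose(Fstar_omega @ Xp, omega @ FstarX)
```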
Coordinate frame in $T_pM$
---
_Push-forwards of inverse charts form a basis in $T_pM$.
This also allows us to determine the dimension of the tangent bundle as a manifold_
Given a chart $(U, x)$, we can construct a basis in $T_pM,~p\in\operatorname{dom}x$ from the canonical basis in $T_{x(p)}\mathbb{R}^n \cong \mathbb{R}^n$,
by pushing it forward via the inverse chart:
$$E_i = \frac{\partial}{\partial x_i} = x^{-1}_* \partial_i,$$
where $\partial_i$ is the $i$'th partial derivative operator acting on functions on $\mathbb{R}^n$.
This is visualized as follows: one can construct a curve $\gamma_i$ emanating from $p$
by following the $i$-th coordinate line in the chart $x$:
$$\gamma_i(t)=x^{-1}(x(p) + t e_i).$$
Then $$\frac{\partial}{\partial x_i}(f) = \partial_i (f \circ x^{-1})(x(p)) = (t\mapsto f(x^{-1}(x(p) + te_i)))'(0) = X_{\gamma_i, p}(f).$$
In other words, $\frac{\partial}{\partial x_i}$ is represented by the curve $\gamma_i$. It now takes just a little exercise to show that they
are linearly independent and form a basis of $T_pM$.
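A numerical sketch on the unit sphere $S^2\subset\mathbb{R}^3$: take the spherical-coordinate chart, whose inverse is the usual parametrization; the columns of the Jacobian of the inverse chart are the coordinate basis vectors $\partial/\partial\theta$, $\partial/\partial\varphi$, represented as ambient vectors tangent to the sphere (the chart conventions are our own choice):

```python
import jax.numpy as jnp
from jax import jacobian

# Inverse chart x^{-1}: (theta, phi) -> S^2 ⊂ R^3 (spherical coordinates).
def chart_inv(u):
    theta, phi = u
    return jnp.array([jnp.sin(theta) * jnp.cos(phi),
                      jnp.sin(theta) * jnp.sin(phi),
                      jnp.cos(theta)])

u = jnp.array([0.9, 2.3])                  # chart coordinates of a point p
p = chart_inv(u)

J = jacobian(chart_inv)(u)                 # 3x2 Jacobian of the inverse chart
E_theta, E_phi = J[:, 0], J[:, 1]          # coordinate basis of T_p S^2, pushed into R^3

# Both basis vectors are tangent to the sphere at p (orthogonal to the position vector).
assert jnp.allclose(jnp.array([p @ E_theta, p @ E_phi]), 0.0, atol=1e-6)
```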
Distances, angles, isometry of $T_pM$ and $T_p^*M$
---
_So far, we've only built machinery for reasoning
about **continuity** and **smoothness**. We still
don't have any tools for discussing **geometry**.
Now we introduce the notion of **Riemannian metric**,
which will allow us to measure lengths of curves,
compute angles, and, most importantly,
convert the derivative of our objective function,
which is a cotangent vector, into element
of tangent space -- the **gradient**,
that is, a direction to make a step in._
Roughly, a **metric** on $M$ is a smooth assignment
of a symmetric "_inner product_" $g_p = \langle\cdot,\cdot\rangle_p$ to each $T_pM$.
A metric is **Riemannian** if it is positive definite ($g(X, X) > 0$ for $X \neq 0$). _In fact, in the second talk on applications, we'll be more interested in **Lorentzian** metrics of signature $(+--\cdots-)$, which correspond to hyperbolic spaces (via the hyperboloid model)._
TLDR:
- Now, finally, given a curve, we can measure its _speed_: $\|X\|_p^2 = \langle X, X \rangle_p,$ for $X\in\operatorname{T}_pM$.
- We can measure angles (via cosines).
- We can now measure lengths: $L_{[a,b]}(\gamma) = \int_a^b \|\gamma'(t)\|_{\gamma(t)}\operatorname{d}t.$
- We can now convert elements of tangent and cotangent bundle back and forth:
- $X\in T_pM$ now defines a linear functional $X^\flat=Y\mapsto \langle X, Y \rangle_p,$
- and a cotangent $\omega\in T_p^*M$ corresponds to tangent vector $\omega^\sharp\in T_pM$ satisfying equation $\langle \omega^\sharp, Y\rangle=\omega(Y)$ (i.e. resulting from application of inverse metric tensor)
- TODO: Show in a chart and under change of coordinates?
- This allows us to define, for $f:M\to\mathbb{R}$, $$\operatorname{grad}f = \operatorname{d}f^\sharp \in \Gamma(\operatorname{T}M).$$
Now, by a "smooth assignment" we mean that
a **metric** on a smooth manifold $M$ is a smooth map
$$g:\Gamma(\operatorname{T}M)\times\Gamma(\operatorname{T}M)\to C^\infty(M)$$
that takes two vector fields
and produces a scalar function.
_In other words, it's a smooth $(0,2)$-tensor field._
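To make the $\sharp$ conversion from the TLDR above concrete: on the Poincaré ball the metric is conformal, $g_p = \lambda_p^2 I$ with $\lambda_p = 2/(1-\|p\|^2)$, so applying the inverse metric to the Euclidean differential just rescales it. A sketch (the conformal factor is the standard one for the curvature $-1$ ball; the objective is an arbitrary illustration):

```python
import jax.numpy as jnp
from jax import grad

def riemannian_grad(f, p):
    """grad f = (df)^sharp on the Poincaré ball: apply the inverse metric to the differential."""
    lam = 2.0 / (1.0 - p @ p)          # conformal factor lambda_p
    return grad(f)(p) / lam**2         # g_p^{-1} = lambda_p^{-2} * I

f = lambda p: jnp.sum(p**2) + p[0]     # some smooth objective on the ball
p = jnp.array([0.3, -0.2])             # a point with ||p|| < 1
rgrad = riemannian_grad(f, p)
```

This rescaling is the main ingredient that distinguishes Riemannian gradient steps on the ball from Euclidean ones; the other is the exponential map (or retraction) used to actually take the step.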
Comparing vectors from different tangent spaces
---
_In this section we talk about differentiation of vector fields,
linear connection, zero-acceleration, and parallel translation_
**NB**: in some of the following pictures we temporarily
use the sometimes abusive and mostly obsolete notation from Absil et al.
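As a preview of where this section is heading: on the unit sphere, parallel translation along the minimizing great-circle arc from $x$ to $y$ has a closed form, and one can check numerically that the result lands in $T_yS^{n-1}$ and preserves inner products. A sketch using that standard formula (not taken from the slides):

```python
import jax.numpy as jnp

def transport_sphere(x, y, v):
    """Parallel transport of v in T_x S^{n-1} to T_y S^{n-1} along the minimizing geodesic."""
    return v - (v @ y) / (1.0 + x @ y) * (x + y)

x = jnp.array([1.0, 0.0, 0.0])
y = jnp.array([0.0, 1.0, 0.0])
v = jnp.array([0.0, 0.5, 0.5])          # tangent at x: orthogonal to x
w = transport_sphere(x, y, v)

assert jnp.isclose(w @ y, 0.0)          # the result is tangent at y
assert jnp.isclose(w @ w, v @ v)        # parallel transport is an isometry
```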







Newton method
---
_Now that we've learned to differentiate vector fields,
we can generalize Newton's method to manifolds_
$$\nabla_{\eta_t}\operatorname{grad}f(x_t)
= -\operatorname{grad}f(x_t),$$
which we solve for the Newton direction $\eta_t \in T_{x_t}M$ and then step along (via the exponential map or a retraction, introduced below).
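A sketch of this iteration on the unit sphere $S^2$ for $f(x)=x^\top A x$, using the standard embedded-sphere formulas $\operatorname{grad}f(x)=P_x\nabla \bar f(x)$ and $\nabla_\eta\operatorname{grad}f = P_x\nabla^2 \bar f(x)\,\eta - (x^\top\nabla \bar f(x))\,\eta$; the least-squares solve is just one convenient way of restricting the linear system to the tangent space (the matrix $A$ is an arbitrary example):

```python
import jax.numpy as jnp

A = jnp.array([[2.0, 0.3, 0.0],
               [0.3, 1.0, 0.1],
               [0.0, 0.1, 0.5]])
f_grad = lambda x: 2.0 * A @ x                 # Euclidean gradient of f(x) = x^T A x
f_hess = lambda x: 2.0 * A                     # Euclidean Hessian

def newton_step(x):
    P = jnp.eye(3) - jnp.outer(x, x)           # projector onto T_x S^2
    rgrad = P @ f_grad(x)                      # Riemannian gradient
    # Riemannian Hessian, written as a matrix acting on the tangent space:
    rhess = P @ f_hess(x) @ P - (x @ f_grad(x)) * P
    # Solve  rhess @ eta = -rgrad  (the minimum-norm solution stays in T_x S^2).
    eta, *_ = jnp.linalg.lstsq(rhess, -rgrad)
    x_new = x + eta
    return x_new / jnp.linalg.norm(x_new)      # map back to the sphere (a retraction)

x = jnp.array([1.0, 1.0, 1.0]) / jnp.sqrt(3.0)
for _ in range(10):
    x = newton_step(x)
# if all goes well, x approaches an eigenvector of A, i.e. a critical point of f on the sphere
```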
Curvature
---
_Not even sure if I should include it...
would be useful to prepare for the second talk_
Hard truth -- we don't know closed-form expmaps and parallel translations
---
_Here we discuss **retractions** and **vector transport** -- first-order approximations to the exponential map and parallel transport._
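For instance, on the unit sphere one often replaces the exponential map by the cheap retraction $R_x(v)=(x+v)/\|x+v\|$; a minimal Riemannian gradient-descent sketch with this retraction (the objective is an arbitrary illustration):

```python
import jax.numpy as jnp
from jax import grad

A = jnp.diag(jnp.array([3.0, 1.0, 0.5]))
f = lambda x: x @ A @ x                        # objective restricted to the unit sphere

def retraction(x, v):
    """First-order approximation of the exponential map on S^{n-1}: step, then renormalize."""
    y = x + v
    return y / jnp.linalg.norm(y)

x, alpha = jnp.array([1.0, 1.0, 1.0]) / jnp.sqrt(3.0), 0.1
for _ in range(100):
    egrad = grad(f)(x)                         # Euclidean gradient
    rgrad = egrad - (x @ egrad) * x            # project onto the tangent space T_x S^2
    x = retraction(x, -alpha * rgrad)          # replace the exponential map by a retraction
# x approaches the minimizer of f on the sphere: the eigenvector of the smallest eigenvalue
```

Replacing the retraction by the exact exponential map of the sphere, $\exp_x(v) = \cos(\|v\|)\,x + \sin(\|v\|)\,v/\|v\|$, would give the same first-order behaviour.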
{"metaMigratedAt":"2023-06-15T04:47:45.475Z","metaMigratedFrom":"YAML","title":"Riemannopt@HSE - NOTES, pt. 1/2","breaks":false,"description":"https://hackmd.io/@newkozlukov/manopt-slides","contributors":"[{\"id\":\"5df6d742-95e4-4a3e-8568-902ecaabcfca\",\"add\":18831,\"del\":2630},{\"id\":null,\"add\":1,\"del\":0}]"}