Riemannopt@HSE

https://hackmd.io/@SomeoneS/manopt-slides

It's a two-part talk about optimization on manifolds and hyperbolic embeddings in deep learning.

  • The preliminary talk, given by @SomeoneS, takes place on March 6 and March 13 and covers the basics.
    • What are manifolds? Why Riemannian optimization?
    • What does it mean to make a "step in a direction" on a manifold?
    • How to differentiate functions between manifolds?
    • What's tricky about higher-order derivatives, and what do "acceleration" and "parallel" mean?
    • How to measure distances?
    • How to represent manifolds with numerical arrays? Embedded manifolds
    • How to get by without closed-form expressions? Retractions
    • How to speed up? Adaptive methods on product manifolds and discussion.
  • The second talk, given by @ferrine96, is devoted to applications in deep learning, in particular to graph embeddings. Dates are March 13 and March 20.

Notes and slides:

  • Notes and slides for the first talk. The notes are a work in progress; many sections are still incomplete (some lack visualizations, some lack accompanying text, some both). We'll be updating the script as we approach D-day.
  • Notes and slides for the second talk: coming soon

References:

Additional:

Starting point: RAdam

Let us first set the goals for today. Somewhat unconventionally, we jump off with the algorithm from "Riemannian Adaptive Optimization Methods" by Becigneul and Ganea. It might look scary at first, but we're going to iterate over it, decrypting it bit by bit.
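To see where we're headed, here's a minimal sketch of such an optimizer at work, assuming the geoopt library and its `PoincareBall`, `ManifoldParameter`, and `RiemannianAdam` APIs (an illustration, not the paper's reference code):

```python
# A minimal sketch: minimize squared geodesic distance to a target point on
# the Poincare ball with Riemannian Adam (geoopt implements the Becigneul &
# Ganea algorithm; API details assumed, check your geoopt version).
import torch
import geoopt

ball = geoopt.PoincareBall()
target = ball.expmap0(torch.tensor([0.3, -0.2]))

# ManifoldParameter ties a tensor to its manifold, so the optimizer knows how
# to turn Euclidean gradients into Riemannian ones and how to stay on the ball.
x = geoopt.ManifoldParameter(torch.zeros(2), manifold=ball)

opt = geoopt.optim.RiemannianAdam([x], lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = ball.dist(x, target) ** 2   # squared geodesic distance
    loss.backward()
    opt.step()                         # the update never leaves the manifold
```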

Manifolds

An \(n\)-dimensional manifold \(M\) is a topological space \((M, \theta)\) that

  • is "nice":
    • Hausdorff (points have disjoint neighbourhoods)
    • second-countable (has countable basis for topology \(\theta\))
  • at every point \(p\in M\) has neighbourhood \(U\subset M\) in which it looks like Euclidean space: there is a homeomorphism \(x:U \to x(U)\subset \mathbb{R}^n\).
    • \(x\) is called a (local) "chart" at \(p\); it provides "coordinates" on that piece of manifold
    • "homeromorphism" means: continuous, invertible, with continuous inverse

Roughly speaking, it's a "locally-Euclidean space".

For instance, \(\mathbb{R}^n\) itself is a manifold, with the (global) chart being the identity function.

An additional piece of structure we're going to need (on top of topology and local Euclidean-ness) is that of a smooth manifold. Consider two charts \((U, x)\), \((V, y)\) that share some part of their domains: \(U\cap V \neq \varnothing.\) Both charts being invertible functions, we can construct a "change of coordinates", or "transition map", from \(x\) to \(y\) as \(y\circ x^{-1}\). We say that \(x\) and \(y\) are smoothly compatible if either \(U\cap V = \varnothing\) or the transition map \[y\circ x^{-1}: x(U\cap V)\subset\mathbb{R}^n \to y(U\cap V)\subset\mathbb{R}^n\] is a diffeomorphism (in the Euclidean sense). Now, to endow \(M\) with a smooth structure means to provide a maximal atlas \(A\) on \(M\): a collection of smoothly compatible charts whose domains cover all of \(M\), and such that this collection cannot be extended with a new chart in a smoothly compatible way.
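To make the definition concrete, here is the standard textbook example on the circle \(S^1\). Take two angle charts: \(x:U\to(-\pi,\pi)\) on \(U = S^1\setminus\{(-1,0)\}\) and \(y:V\to(0,2\pi)\) on \(V = S^1\setminus\{(1,0)\}\), each sending a point to its angle. On the overlap (the circle minus those two points) the transition map is \[(y\circ x^{-1})(\theta) = \begin{cases} \theta, & \theta\in(0,\pi),\\ \theta + 2\pi, & \theta\in(-\pi,0), \end{cases}\] which is a diffeomorphism on each connected component of \(x(U\cap V)\), so the two charts are smoothly compatible.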


Now that we've done some hard work, we probably want to see some benefits. What the smooth structure allows us to define is what it means for a function between manifolds to be smooth. Given two smooth manifolds \((M, \theta_M, A_M)\) and \((N, \theta_N, A_N)\), we say that a function \(F: M\to N\) is smooth (resp. \(C^k\)) if \[y\circ F\circ x^{-1}: x(U)\subset\mathbb{R}^m \to y(V)\subset\mathbb{R}^n\] is smooth (resp. \(C^k\)) for every \((U, x)\in A_M\), \((V, y) \in A_N\).
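Continuing the circle example above: the map \(F:\mathbb{R}\to S^1\), \(t\mapsto(\cos t, \sin t)\), read through the identity chart on \(\mathbb{R}\) and either angle chart on \(S^1\), becomes \(t\mapsto t + \mathrm{const}\) on each piece of its domain, which is plainly smooth; hence \(F\) is a smooth map of manifolds.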


Submanifolds, product manifolds, bundles, fibers, sections

On one hand, I want to introduce bundles &c. after tangent spaces, in the context of vector fields &c.; on the other hand, it seems meaningful to introduce submanifolds &c. at the very beginning so that we have examples early on. This section will probably be split and moved later; it follows Frederic Schuller's lecture.

I also need bundles to try and introduce curvature learning on stereographic model spaces

One way to construct manifolds, and the one that we're going to use the most, is from other manifolds.

Given a topological \(n\)-manifold \((M, \theta)\), we call \((N, \left.\theta\right|_N)\), \(N\subset M\), a submanifold of \(M\) if \((N, \left.\theta\right|_N)\) is itself a manifold.

Given an \(m\)-manifold \(M\) and an \(n\)-manifold \(N\), their product manifold is the manifold whose underlying set is the Cartesian product, endowed with the product topology: \((M{\times}N, \theta_{M{\times}N})\).

  • The \(n\)-torus \(T^n\) is the product \(S^1{\times}\cdots{\times}S^1\) (\(n\) factors).

A bundle of topological manifolds is a triple \((E, \pi, M)\) of a total space \(E\), a continuous surjective projection map \(\pi:E\to M\), and a base space \(M\). A bundle is a weaker version of a product: instead of saying that \(E\) can be constructed from its components, we say that locally \(E\) can be projected onto components.

  • \(\pi: E\to M\)
  • A fiber of \(E\) at \(p\in M\) is: \(\pi^{-1}(p)\)
  • Future uses: vector bundles, tangent/cotangent bundle, sections

Example: product manifold is a bundle (REMOVE?)

  • \(E = M{\times}N,\)
  • \(\pi: M{\times}N \to M,\)
  • \((p, q) \mapsto p\) (continuous w.r.t. product topology),

Example: the Möbius strip is a bundle, but not a product

  • TODO: PICTURE
  • \(E\) = the Möbius strip,
  • \(M = S^1\),
  • \(\pi^{-1}(p) \cong [-1, 1].\)

A fiber bundle is a bundle \(\pi:E\to M\) s.t. \(\pi^{-1}(p) \cong F\) for all \(p\in M\) and some ("typical fiber") manifold \(F\).

Lo and behold, given a bundle \(\pi:E\to M,\) a section \(\sigma: M\to E\) assigns to each point \(p\) of the manifold \(M\) a point from its fiber, \(\sigma(p)\in \pi^{-1}(p)\subset E\). In other words, it's a map \(\sigma: M\to E\) such that \(\pi \circ \sigma = \operatorname{id}_M\). Example: vector fields.
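A tiny sketch of the \(\pi \circ \sigma = \operatorname{id}_M\) condition for a trivial bundle \(E = M{\times}F\) (names and types here are illustrative, not from the talk):

```python
# For the trivial bundle E = R x R with pi(p, f) = p, any function s: M -> F
# induces a section sigma(p) = (p, s(p)); pi . sigma = id_M by construction.
import math

def pi(e):
    p, _ = e                  # project a point of E onto the base M
    return p

def sigma(p):
    return (p, math.sin(p))   # attach the value s(p) = sin(p) over p

assert all(pi(sigma(p)) == p for p in [0.0, 1.0, 2.5])
```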

TODO

  • Bundle morphisms
  • Bundle isomorphisms
  • Locally isomorphic bundles
  • Trivial bundles, locally trivial bundles, and local trivialization
  • Pullback bundle

Gradient descent, "step in a direction", and differentiation

In this section we talk about line-search along geodesics, derivations and differentiation of functions between manifolds

TODO: actually, we need a Riemannian metric before we can finish this section, in order to convert the derivative into a tangent vector.

One of the simplest optimization methods one can think of is gradient descent: given a (piecewise-)differentiable objective function \(f:X\subset\mathbb{R}^n \to \mathbb{R}\), an initial estimate \(x_0 \in X\), and a learning rate \(\alpha\in\mathbb{R}\), we gradually improve our estimate by making a small linear step in the direction of steepest descent of the loss function: \[x_t = x_{t-1} - \alpha \operatorname{grad} f (x_{t-1}).\]
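As a warm-up, a minimal numpy sketch of this iteration on a toy objective \(f(x) = \|x - a\|^2\) (illustrative, not from the slides):

```python
# Gradient descent on f(x) = ||x - a||^2, whose gradient is 2 (x - a).
import numpy as np

a = np.array([1.0, -2.0])          # the (known) minimizer
x = np.zeros(2)                    # initial estimate x_0
alpha = 0.1                        # learning rate

for _ in range(100):
    x = x - alpha * 2.0 * (x - a)  # x_t = x_{t-1} - alpha * grad f(x_{t-1})

print(x)                           # converges to ~ [1.0, -2.0]
```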

One way to interpret this iteration is that at step \(t\) we approximate our objective \(f\) around \(x_{t-1}\) by its linearization and take a small step that decreases this linear model.

Let's begin with the notion of "direction". The key observation for generalizing "directions" and "direction fields" to manifolds is that a "direction" attached to a point acts on scalar functions as a directional derivative at that point. Specifically, given a "test function" \(f:\mathbb{R}^n \to \mathbb{R}\), a vector \(X\in\mathbb{R}^n\) attached to a point \(p\in\mathbb{R}^n\) acts on \(f\) by computing \[(\lambda\mapsto f(p + \lambda X))'(0).\]

In fact, for a fixed \(p\), there's a one-to-one correspondence between such linear functionals and vectors in \(\mathbb{R}^n\).

For an abstract smooth manifold \(M\), we can take \(C^\infty(M)\) for the space of test functions, while "directional derivatives" would be derivatives along smooth curves.

Given a smooth curve \(\gamma: I\subset\mathbb{R}\to M\), passing through a point \(p\in M\) (s.t. \(\gamma(0)=p\)), the derivative of a test function \(f:M\to \mathbb{R}\) at \(p\) along \(\gamma\) is \[X_{\gamma, p}(f) = (f\circ \gamma)'(0).\]

We could read this formula as "initial velocity of \(f\) along \(\gamma\)".
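A quick numeric sanity check of this definition on the circle (an illustrative example; the function and curve are ours):

```python
# Check X_{gamma,p}(f) = (f . gamma)'(0): gamma(t) = (cos t, sin t) passes
# through p = (1, 0) at t = 0; for the "height" function f(a, b) = b the
# derivative along gamma at p should equal cos(0) = 1.
import numpy as np

def gamma(t):
    return np.array([np.cos(t), np.sin(t)])

def f(point):
    return point[1]

h = 1e-6
X_gamma_p = (f(gamma(h)) - f(gamma(-h))) / (2 * h)  # central difference
print(X_gamma_p)  # ~ 1.0
```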

There's a handful of observations to be made about this concept:

  • Only the local behaviour of \(\gamma\) matters. Two different curves may represent the same directional derivative.
  • \(X_{\gamma, p}\) is a linear functional
  • It satisfies Leibniz rule: \(X(fg) = X(f)g(p) + f(p)X(g)\). Such linear functionals are called derivations.
  • The set \(T_pM = \{X_{\gamma, p}:~\gamma~\text{smooth curve through}~p\}\) actually forms a linear space: the action of linear combinations of derivations is defined pointwise, and an example curve representing the combination may be constructed in a chart.

Leibniz rule and derivations

Derivatives

Total derivative, pullback, pushforward

A \(C^k\) function \(F:M\to N\) between manifolds induces maps between (co-)tangent bundles of \(M\) and \(N\).

The push-forward \(F_*X_p\) of \(X_p\in T_pM\) ("on the left") under \(F\) is a tangent vector in \(T_{F(p)}N\) such that for \(f\in C^\infty(N)\) \[(F_*X_p)(f) = X_p(f\circ F).\]

Similarly, for a co-tangent \(\omega\in T^*_{F(p)}N\) ("on the right"), its pull-back under \(F\) is \(F^*\omega \in T^*_pM\) such that for \(X\in T_pM\) \[(F^*\omega)(X) = \omega(F_*X).\]

The notion of push-forward generalizes that of a derivative. In coordinate bases, the push-forward corresponds to the Jacobian matrix acting on column vectors (tangents), while the pull-back corresponds to the same Jacobian acting on row vectors (equivalently, its transpose acting on column vectors); neither requires a metric, so there is no conflict with the fact that tangents and cotangents cannot be identified without a metric tensor. Note also that the derivative is not a \((1,1)\)-tensor on a single manifold: it takes one (lower) index from the left manifold \(M\) and one (upper) index from the right manifold \(N\).

Coordinate frame in \(T_pM\)

Push-forwards of inverse charts form a basis in \(T_pM\). This also lets us determine the dimension of the tangent bundle as a manifold.

Given a chart \((U, x)\), we can construct a basis in \(T_pM,~p\in\operatorname{dom}x\), from the canonical basis in \(T_{x(p)}\mathbb{R}^n \cong \mathbb{R}^n\), by pushing it forward via the inverse chart: \[E_i = \frac{\partial}{\partial x_i} = x^{-1}_* \partial_i,\] where \(\partial_i\) is the \(i\)'th partial derivative operator acting on functions on \(\mathbb{R}^n\).

This can be visualized as follows: one can construct a curve \(\gamma_i\) emanating from \(p\) by following the \(i\)'th coordinate in the chart \(x\): \[\gamma_i(t)=x^{-1}(x(p) + t e_i).\]

Then \[\frac{\partial}{\partial x_i}(f) = \partial_i (f \circ x^{-1})(x(p)) = (t\mapsto f(x^{-1}(x(p) + te_i)))'(0) = X_{\gamma_i, p}(f).\]

In other words, \(\frac{\partial}{\partial x_i}\) is represented by the curve \(\gamma_i\). It now takes just a little exercise to show that they are linearly independent and form a basis in \(T_pM\).
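A standard concrete instance: on \(M = \mathbb{R}^2\setminus\{0\}\) with the polar chart \(x(p) = (r, \varphi)\), the coordinate curve for \(\varphi\) is \[\gamma_\varphi(t) = x^{-1}(r, \varphi + t) = (r\cos(\varphi + t),\, r\sin(\varphi + t)),\] so in Cartesian coordinates \((a, b)\) the frame vector acts as \[\frac{\partial}{\partial\varphi}(f) = (f\circ\gamma_\varphi)'(0) = -r\sin\varphi\,\frac{\partial f}{\partial a} + r\cos\varphi\,\frac{\partial f}{\partial b}.\]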

Distances, angles, isometry of \(T_pM\) and \(T_p^*M\)

So far, we've only built machinery for reasoning about continuity and smoothness. We still don't have any tools for discussing geometry. Now we introduce the notion of a Riemannian metric, which will allow us to measure lengths of curves, compute angles, and, most importantly, convert the derivative of our objective function, which is a cotangent vector, into an element of the tangent space: the gradient, that is, a direction to make a step in.

Roughly, a metric on \(M\) is a smooth assignment of a symmetric "inner product" \(g_p = \langle\cdot,\cdot\rangle_p\) to each \(T_pM\). A metric is Riemannian if it's positive definite (\(g(X, X) > 0\) for \(X \neq 0\)). In fact, in the second talk on applications, we'll be more interested in Lorentzian metrics, which have signature \((+--\cdots-)\) and correspond to hyperbolic spaces.

TLDR:

  • Now, finally, given a curve, we can measure its speed: \(\|X\|_p^2 = \langle X, X \rangle_p,\) for \(X\in\operatorname{T}_pM\).
  • We can measure angles: \(\cos\angle_p(X, Y) = \frac{\langle X, Y\rangle_p}{\|X\|_p\,\|Y\|_p}.\)
  • We can now measure lengths: \(L_{[a,b]}(\gamma) = \int_a^b \|\gamma'(t)\|_{\gamma(t)}\operatorname{d}t.\)
  • We can now convert elements of tangent and cotangent bundle back and forth:
    • \(X\in T_pM\) now defines a linear functional \(X^\flat=Y\mapsto \langle X, Y \rangle_p,\)
    • and a cotangent \(\omega\in T_p^*M\) corresponds to the tangent vector \(\omega^\sharp\in T_pM\) satisfying \(\langle \omega^\sharp, Y\rangle_p=\omega(Y)\) for all \(Y\) (i.e. resulting from application of the inverse metric tensor)
    • TODO: Show in a chart and under change of coordinates?
  • This allows us to define, for \(f:M\to\mathbb{R}\), \[\operatorname{grad}f = \operatorname{d}f^\sharp \in \Gamma(\operatorname{T}M).\]
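For a concrete feel of the \(\sharp\) conversion: on the Poincaré ball the metric is conformal, \(g_x = \lambda_x^2 I\) with \(\lambda_x = 2/(1 - \|x\|^2)\), so applying the inverse metric just rescales the Euclidean gradient. A minimal numpy sketch (a standard fact about the ball model; the code is ours, not from the talk):

```python
# Convert the Euclidean gradient (the coordinates of df, a cotangent) into the
# Riemannian gradient on the Poincare ball by applying the inverse metric:
# for g_x = lambda_x^2 * I this is division by lambda_x^2.
import numpy as np

def riemannian_grad(x, egrad):
    lam = 2.0 / (1.0 - np.dot(x, x))   # conformal factor lambda_x
    return egrad / lam**2              # = ((1 - ||x||^2)^2 / 4) * egrad

x = np.array([0.5, 0.0])
print(riemannian_grad(x, np.array([1.0, 1.0])))  # shrinks near the boundary
```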

Now, by a "smooth assignment" we mean that a metric on a smooth manifold \(M\) is a smooth map \[g:\Gamma(\operatorname{T}M)\times\Gamma(\operatorname{T}M)\to C^\infty(M)\] that takes two vector fields and produces a scalar function. In other words, it's a \(\operatorname{T}_0^2M\) field.

Comparing vectors from different tangent spaces

In this section we talk about differentiation of vector fields, linear connection, zero-acceleration, and parallel translation

NB: in some of the following pictures we temporarily use the sometimes abusive and mostly obsolete notation from Absil et al.

Newton's method

Now that we've learned to differentiate vector fields, we can generalize Newton's method to manifolds: find the Newton direction \(\eta\in T_{x_t}M\) by solving the linear equation

\[\nabla_{\eta}\operatorname{grad}f\,(x_t) = -\operatorname{grad}f(x_t),\]

and then step along it (e.g. \(x_{t+1} = \exp_{x_t}(\eta)\)).
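In the Euclidean special case \(\nabla_\eta\operatorname{grad}f = \operatorname{Hess}f\cdot\eta\), and the iteration reduces to the familiar Newton's method; a minimal numpy sketch of that special case (toy objective, ours):

```python
# Newton's method in R^n: solve Hess f(x) @ eta = -grad f(x) for the Newton
# direction eta, then step (in R^n the exponential map is just x + eta).
import numpy as np

grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])  # f(x) = x0^4 + x1^2
hess = lambda x: np.diag([12 * x[0]**2, 2.0])

x = np.array([1.0, 1.0])
for _ in range(20):
    eta = np.linalg.solve(hess(x), -grad(x))
    x = x + eta

print(x)  # close to the minimizer [0, 0]
```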

Curvature

Not even sure if I should include this; it would be useful as preparation for the second talk.

Hard truth: we don't know closed-form expmaps and parallel translations

Here we discuss retractions and vector transports: first-order approximations to the exponential map and parallel transport.
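As an illustration of why retractions are good enough, here are the exact exponential map on the unit sphere and the simple "take a linear step, then renormalize" retraction; they agree to first order in \(\|v\|\) (standard sphere formulas, code ours):

```python
import numpy as np

def expmap(x, v):
    # Exact exponential map on the unit sphere; v is tangent at x (x . v = 0).
    nv = np.linalg.norm(v)
    return x if nv < 1e-15 else np.cos(nv) * x + np.sin(nv) * (v / nv)

def retract(x, v):
    # First-order approximation: take a linear step, then project back.
    y = x + v
    return y / np.linalg.norm(y)

x = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1e-3, 0.0])
print(np.linalg.norm(expmap(x, v) - retract(x, v)))  # tiny: o(||v||) error
```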
