https://hackmd.io/@SomeoneS/manopt-slides
It's a two-part talk about optimization on manifolds and hyperbolic embeddings in deep learning.
Notes and slides:
References:
Additional:
Let us first set the goals for today. Somewhat unconventionally, we jump off with the algorithm from "Riemannian Adaptive Optimization Methods" by Becigneul and Ganea. It might look scary at first, but what we're going to do now is iterate over it, decoding it bit by bit.
An \(n\)-dimensional (topological) manifold \(M\) is a topological space \((M, \theta)\) that is Hausdorff, second countable, and locally homeomorphic to \(\mathbb{R}^n\): every point \(p\in M\) has an open neighbourhood \(U\ni p\) and a homeomorphism ("chart") \(x: U\to x(U)\subset\mathbb{R}^n\).
Roughly speaking, it's a "locally-Euclidean space".
For instance, \(\mathbb{R}^n\) itself is a manifold, with the (global) chart being the identity function.
An additional piece of structure we're going to need (in addition to topology and local Euclidean-ness) is that of a smooth manifold. Consider two charts \((U, x)\), \((V, y)\) that share some part of their domains: \(U\cap V \neq \varnothing.\) Both charts being invertible functions, we can construct a "change of coordinates", or "transition map", from \(x\) to \(y\) as \(y\circ x^{-1}\). We say that \(x\) and \(y\) are smoothly compatible if either \(U\cap V = \varnothing\) or the transition map \[y\circ x^{-1}: x(U\cap V)\subset\mathbb{R}^n \to y(U\cap V)\subset\mathbb{R}^n\] is a diffeomorphism (in the Euclidean sense). Now, to endow \(M\) with a smooth structure means to provide a maximal atlas \(A\) on \(M\) – a collection of smoothly compatible charts whose domains cover all of \(M\), and such that this collection cannot be extended by a new chart in a smoothly compatible way.
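A quick illustration (an example I'm adding here, not from the slides): take \(M=\mathbb{R}\) with the two global charts \(x(t)=t\) and \(y(t)=t^3\). Each is a homeomorphism \(\mathbb{R}\to\mathbb{R}\), but the transition map \[x\circ y^{-1}: s\mapsto s^{1/3}\] is not differentiable at \(0\), so these two charts are not smoothly compatible and cannot belong to the same maximal atlas.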
Now that we've done some hard work, we'd probably like to see some benefits. What the smooth structure allows us to define is what it means for a function between manifolds to be smooth. Given two smooth manifolds \((M, \theta_M, A_M)\) and \((N, \theta_N, A_N)\), we say that a function \(F: M\to N\) is smooth (resp. \(C^k\)) if \[y\circ F\circ x^{-1}: x(U)\subset\mathbb{R}^m \to y(V)\subset\mathbb{R}^n\] is smooth (resp. \(C^k\)) for every \((U, x)\in A_M\), \((V, y) \in A_N\).
On one hand, I want to introduce bundles etc. after tangent spaces, in the context of vector fields; but on the other hand, it seems meaningful to introduce submanifolds at the very beginning so that we have examples early on. This section will probably be split and moved later; it follows Frederic Schuller's lectures.
I also need bundles to try and introduce curvature learning on stereographic model spaces
One way to construct manifolds, and the one that we're going to use the most, is from other manifolds.
Given a topological \(n\)-manifold \((M, \theta)\), we call \((N, \left.\theta\right|_N)\), \(N\subset M\), a submanifold of \(M\) if \((N, \left.\theta\right|_N)\) is itself a manifold.
Given an \(m\)-manifold \(M\) and an \(n\)-manifold \(N\), their product manifold is the \((m{+}n)\)-manifold whose underlying set is the Cartesian product, endowed with the product topology: \((M{\times}N, \theta_{M{\times}N})\).
A bundle of topological manifolds is a triple \((E, \pi, M)\) of a total space \(E\), a continuous surjective projection map \(\pi:E\to M\), and a base space \(M\). A bundle is a weaker version of a product: instead of saying that \(E\) can be assembled from its components, we only say that \(E\) can be projected onto a base, and only locally looks like a product.
Example: product manifold is a bundle (REMOVE?)
Example: the Möbius strip is a bundle, but not a product
A fiber bundle is a bundle \(\pi:E\to M\) s.t. \(\pi^{-1}(p) \cong F\) for all \(p\in M\) and some ("typical fiber") manifold \(F\).
Lo and behold, given a bundle \(\pi:E\to M,\) a section \(\sigma: M\to E\) assigns to each point \(p\) in manifold \(M\) a point from its fiber \(\sigma(p)\in \pi^{-1}(p)\subset E\). In other words, it's a map \(\sigma: M\to E\) such that \(\pi \circ \sigma = \operatorname{id}_M\). Example: vector fields.
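A minimal example to keep in mind (my own, for orientation): for a trivial bundle \(E = M\times F\) with \(\pi = \operatorname{pr}_1\), a section is exactly \(\sigma(p) = (p, f(p))\) for some map \(f: M\to F\), i.e. sections of a trivial bundle are just \(F\)-valued functions on \(M\). In the same spirit, vector fields are sections of the tangent bundle \(\operatorname{T}M\), which we'll construct shortly.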
TODO
In this section we talk about line-search along geodesics, derivations and differentiation of functions between manifolds
TODO: actually, we need a Riemannian metric before we can finish this section – in order to convert the derivative into a tangent vector.
One of the simplest optimization methods one can think of is gradient descent: given a (piecewise-)differentiable objective function \(f:X\subset\mathbb{R}^n \to \mathbb{R}\), an initial estimate \(x_0 \in X\), and a learning rate \(\alpha > 0\), we gradually improve our estimate by making a small linear step in the direction of steepest descent of the objective: \[x_t = x_{t-1} - \alpha \operatorname{grad} f (x_{t-1}).\]
One way to interpret this iteration is that at step \(t\) we approximate our objective \(f\) around \(x_{t-1}\) by its first-order Taylor expansion and take a small step that decreases this linear model.
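As a reference point before we move to manifolds, here is a minimal sketch of the Euclidean update above (plain NumPy; `grad_f`, `alpha`, and the quadratic test objective are placeholders of my choosing):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, n_steps=100):
    """Plain Euclidean gradient descent: x_t = x_{t-1} - alpha * grad f(x_{t-1})."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - alpha * grad_f(x)  # small step against the gradient
    return x

# Example: minimize f(x) = ||x||^2 / 2, whose gradient is simply x.
x_min = gradient_descent(lambda x: x, x0=[3.0, -2.0])
```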
Let's begin with the notion of "direction". The key observation for generalizing "directions" and "direction fields" to manifolds is that a "direction" attached to a point acts on scalar functions as a directional derivative at that point. Specifically, given a "test function" \(f:\mathbb{R}^n \to \mathbb{R}\), a vector \(X\in\mathbb{R}^n\) attached to a point \(p\in\mathbb{R}^n\) acts on \(f\) by computing \[(\lambda\mapsto f(p + \lambda X))'(0).\]
In fact, for a fixed \(p\), there's a one-to-one correspondence between such linear functionals and vectors in \(\mathbb{R}^n\).
For an abstract smooth manifold \(M\), we can take \(C^\infty(M)\) for the space of test functions, while "directional derivatives" would be derivatives along smooth curves.
Given a smooth curve \(\gamma: I\subset\mathbb{R}\to M\), passing through a point \(p\in M\) (s.t. \(\gamma(0)=p\)), the derivative of a test function \(f:M\to \mathbb{R}\) at \(p\) along \(\gamma\) is \[X_{\gamma, p}(f) = (f\circ \gamma)'(0).\]
We could read this formula as "initial velocity of \(f\) along \(\gamma\)".
There's a handful of observations to be made about this concept:
Leibniz rule and derivations
Total derivative, pullback, pushforward
A \(C^k\) function \(F:M\to N\) between manifolds induces maps between (co-)tangent bundles of \(M\) and \(N\).
The push-forward \(F_*X_p\) of \(X_p\in T_pM\) ("on the left") under \(F\) is a tangent vector in \(T_{F(p)}N\) such that for \(f\in C^\infty(N)\) \[(F_*X_p)(f) = X_p(f\circ F).\]
Similarly, for a co-tangent \(\omega\in T^*_{F(p)}N\) ("on the right"), its pull-back under \(F\) is \(F^*\omega \in T^*_pM\) such that for \(X\in T_pM\) \[(F^*\omega)(X) = \omega(F_*X).\]
The notion of push-forward generalizes that of a derivative: in coordinate bases, the push-forward is represented by the Jacobian matrix of the coordinate representation of \(F\), acting on column-vectors of tangent components. The pull-back is represented by the very same Jacobian, acting on row-vectors of cotangent components from the other side; no metric tensor is needed, since push-forward sends tangents to tangents and pull-back sends cotangents to cotangents, and we never identify the two. Note also that the differential of \(F\) is not a \((1,1)\)-tensor on a single manifold: it takes one index from the left manifold \(M\) and the other from the right manifold \(N\).
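Concretely (a standard computation, spelled out here for reference): in charts \(x\) around \(p\) and \(y\) around \(F(p)\), writing \(\hat F = y\circ F\circ x^{-1}\) for the coordinate representation of \(F\), both maps are given by the same Jacobian acting on components from opposite sides: \[(F_*X)_j = \sum_i \frac{\partial \hat F_j}{\partial x_i}\, X_i, \qquad (F^*\omega)_i = \sum_j \frac{\partial \hat F_j}{\partial x_i}\, \omega_j.\]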
Push-forwards of the canonical basis vectors under inverse charts form a basis of \(T_pM\). This also lets us read off the dimension of the tangent bundle as a manifold.
Given a chart \((U, x)\), we can construct a basis of \(T_pM\), \(p\in\operatorname{dom}x\), from the canonical basis of \(T_{x(p)}\mathbb{R}^n \cong \mathbb{R}^n\) by pushing it forward via the inverse chart: \[E_i = \frac{\partial}{\partial x_i} = (x^{-1})_* \partial_i,\] where \(\partial_i\) is the \(i\)'th partial derivative operator acting on functions on \(\mathbb{R}^n\).
This is visualized as follows: one can construct a curve \(\gamma_i\) emanating from \(p\) by following the \(i\)'th coordinate in the chart \(x\): \[\gamma_i(t)=x^{-1}(x(p) + t e_i).\]
Then \[\frac{\partial}{\partial x_i}(f) = \partial_i (f \circ x^{-1})(x(p)) = (t\mapsto f(x^{-1}(x(p) + te_i)))'(0) = X_{\gamma_i, p}(f).\]
In other words, \(\frac{\partial}{\partial x_i}\) is represented by the curve \(\gamma_i\). It now takes just a little exercise to show that these vectors are linearly independent and form a basis of \(T_pM\).
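A small numerical sanity check (my own sketch, using the polar chart on a neighbourhood of \(p=(1,1)\) in \(\mathbb{R}^2\) as the assumed example): we can evaluate \(\frac{\partial}{\partial x_i}(f)\) directly from the curve definition by finite differences.

```python
import numpy as np

# chart:         x(p) = (r, theta)
# inverse chart: x_inv(r, theta) = (r cos(theta), r sin(theta))
def chart(p):
    a, b = p
    return np.array([np.hypot(a, b), np.arctan2(b, a)])

def chart_inv(c):
    r, theta = c
    return np.array([r * np.cos(theta), r * np.sin(theta)])

def coordinate_derivative(f, p, i, eps=1e-6):
    """E_i(f) at p: derivative of t -> f(x_inv(x(p) + t e_i)) at t = 0."""
    e_i = np.eye(2)[i]
    c = chart(p)
    return (f(chart_inv(c + eps * e_i)) - f(chart_inv(c - eps * e_i))) / (2 * eps)

# Test function f(a, b) = a = r cos(theta), so d/dr(f) = cos(theta), d/dtheta(f) = -r sin(theta).
f = lambda p: p[0]
p = np.array([1.0, 1.0])
print(coordinate_derivative(f, p, 0))  # ~ cos(pi/4) ≈ 0.707
print(coordinate_derivative(f, p, 1))  # ~ -sqrt(2) * sin(pi/4) = -1.0
```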
So far, we've only built machinery for reasoning about continuity and smoothness; we still don't have any tools for discussing geometry. Now we introduce the notion of a Riemannian metric, which will allow us to measure lengths of curves, compute angles, and, most importantly, convert the derivative of our objective function, which is a cotangent vector, into an element of the tangent space – the gradient, that is, a direction in which to make a step.
Roughly, a metric on \(M\) is a smooth assignment of a symmetric "inner product" \(g_p = \langle\cdot,\cdot\rangle_p\) to each \(T_pM\). The metric is Riemannian if it is positive definite (\(g_p(X, X) > 0\) for \(X\neq 0\)). In fact, in the second talk on applications, we'll also be interested in Lorentzian metrics of signature \((+,-,\dots,-)\): the hyperboloid model of hyperbolic space sits inside an ambient space carrying such a metric.
TLDR:
Now, by a "smooth assignment" we mean that a metric on a smooth manifold \(M\) is a smooth map \[g:\Gamma(\operatorname{T}M)\times\Gamma(\operatorname{T}M)\to C^\infty(M)\] that takes two vector fields and produces a scalar function. In other words, it's a \(\operatorname{T}_0^2M\) field.
In this section we talk about differentiation of vector fields, linear connections, zero-acceleration curves (geodesics), and parallel translation
NB: in some of the following pictures we temporarily use the sometimes abusive and mostly obsolete notation from Absil et al.
Now that we've learned to differentiate vector fields, we can generalize Newton's method to manifolds: at step \(t\), we look for a tangent vector \(\eta_t\in T_{x_t}M\) solving the Newton equation
\[\nabla_{\eta_t}\operatorname{grad}f(x_t) = -\operatorname{grad}f(x_t).\]
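Spelled out as an iteration (following the standard Riemannian Newton scheme; whether the update uses the exponential map or a retraction is my assumption here, the slides may pick either): \[\text{solve } \nabla_{\eta_t}\operatorname{grad}f(x_t) = -\operatorname{grad}f(x_t) \text{ for } \eta_t\in T_{x_t}M, \qquad x_{t+1} = \exp_{x_t}(\eta_t) \text{ or } R_{x_t}(\eta_t).\]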
Not even sure if I should include it… would be useful to prepare for the second talk
Here we discuss retractions and vector transport – first-order approximations to the exponential map and parallel transport.
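To tie this back to the gradient-descent sketch above, here is a minimal retraction-based Riemannian gradient descent on the unit sphere (my own sketch; the metric-projection retraction and the eigenvector test problem are choices made purely for illustration):

```python
import numpy as np

def project_to_tangent(x, g):
    """Orthogonal projection of an ambient vector g onto T_x S^{n-1} = {v : <v, x> = 0}."""
    return g - np.dot(x, g) * x

def retract(x, v):
    """Metric-projection retraction: step in the ambient space, renormalize onto the sphere."""
    y = x + v
    return y / np.linalg.norm(y)

def riemannian_gd(egrad, x0, lr=0.1, n_steps=100):
    """Riemannian GD on S^{n-1}: project the Euclidean gradient, step, retract."""
    x = np.asarray(x0, dtype=float)
    x = x / np.linalg.norm(x)
    for _ in range(n_steps):
        rgrad = project_to_tangent(x, egrad(x))   # Riemannian gradient
        x = retract(x, -lr * rgrad)               # retraction step
    return x

# Example: minimize f(x) = -<x, Ax> on the sphere; the minimizer is
# (up to sign) the leading eigenvector of the symmetric matrix A.
A = np.diag([3.0, 1.0, 0.5])
egrad = lambda x: -2.0 * A @ x
x_star = riemannian_gd(egrad, np.ones(3))
```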