Optimization on manifolds
References
- Differential Geometry
- Frederic Schuller's lectures on YouTube
- Lee's series
- Optimization: Absil et al.
- Details in notes
- This talk fails to provide examples (yet)
- Aims at stating the problems
- …and actually asking questions
"Riemannian Adaptive Optimization Methods"
(Becigneul, Ganea)
Gradient descent
- \(p_0\in K,~\alpha>0,~f: K\subset \mathbb{R}^d\to \mathbb{R},\)
- \(f(p)\to \min.\)
- \(g_t = f'(p_t)^\top\)
- \(p_{t+1} = p_t - \alpha g_t\)
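A minimal sketch of this iteration in numpy; the quadratic objective and the step size are illustrative choices, not from the talk.

```python
import numpy as np

def gradient_descent(grad, p0, alpha=0.1, steps=100):
    """p_{t+1} = p_t - alpha * g_t in flat R^d."""
    p = np.asarray(p0, dtype=float)
    for _ in range(steps):
        p = p - alpha * grad(p)
    return p

# Illustrative objective f(p) = |p|^2, gradient 2p; minimum at the origin.
print(gradient_descent(lambda p: 2 * p, [3.0, -2.0]))  # -> approximately [0, 0]
```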

Manifolds
An \(n\)-manifold: "a 'nice' top. space \((M, \Theta)\)
in which every \(p\in M\) has an open neighbourhood
homeomorphic to (a piece of) \(\mathbb{R}^n\)"

\[\phi: U\xrightarrow{\sim} \phi(U)\subset \mathbb{R}^n\]
- Descent direction: no derivatives
- No notion of differentiability
- Step along direction: no addition
Smooth manifolds
"A manifold covered with a maximal atlas of smoothly-compatible charts", \((M, \Theta, \mathcal{A})\)

Smooth functions
\(f:M\to\mathbb{R}\) is smooth iff \(f\circ x^{-1}\) is smooth (as a map \(x(U)\subset\mathbb{R}^n\to\mathbb{R}\)) for every chart \((U,x)\in\mathcal{A}\).
Directional derivatives
- \((M, \Theta, \mathcal{A}),\)
- \(p\in M,\)
- Smooth curve \(\gamma:I\subset\mathbb{R}\to M,\) \(\gamma(0)=p,\)
- \(f: M\to \mathbb{R}\) smooth.
\[(f\circ \gamma)'(0).\]
Directional derivatives
\[X_{\gamma, p}f = (f\circ \gamma)'(0)\]
"Directional derivative of \(f\) along \(\gamma\) at \(p\)"

Directional derivatives
\[X_{\gamma, p}f = (f\circ \gamma)'(0).\]
- \(X_{\gamma,p}:C^\infty(M)\to\mathbb{R}\) is a linear map.
- Leibniz rule:
\(X_{\gamma, p}(fg) = f(p)X_{\gamma, p}g + X_{\gamma, p}(f)g(p).\)
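A numeric check of the definition, sketched for the unit circle \(M=S^1\subset\mathbb{R}^2\); the curve, the function, and the finite-difference step are my illustrative choices.

```python
import numpy as np

# Curve on S^1 through p = (1, 0): gamma(t) = (cos t, sin t), gamma(0) = p.
gamma = lambda t: np.array([np.cos(t), np.sin(t)])

# Smooth f: M -> R, here the restriction of (x, y) -> x * y.
f = lambda q: q[0] * q[1]

# X_{gamma,p} f = (f o gamma)'(0), approximated by central differences.
h = 1e-6
X_f = (f(gamma(h)) - f(gamma(-h))) / (2 * h)
print(X_f)  # analytic: d/dt [cos t sin t] at t = 0 is cos(0)^2 - sin(0)^2 = 1
```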

Tangent space
\[T_pM = \left\{
X_{\gamma,p}\left|
\gamma:I\to M~\text{smooth},
~\gamma(0)=p
\right.
\right\}\]
- "Tangent space to \(M\) at \(p\)"
- Coincides with the set of all derivations at \(p\)
- Derivations \(\leftrightarrow\) curve segments via charts
- Forms a linear space
Coordinate basis
A chart \((U,x)\), \(p\in U\), corresponds to a basis in \(T_p M\),
represented by coordinate curves.

- Denoted: \(\frac{\partial}{\partial x^1},~\ldots,~\frac{\partial}{\partial x^n}.\)
- Acts: \(\frac{\partial}{\partial x^k}f
= \partial_k (\underbrace{f\circ x^{-1}}_{x(U)\to\mathbb{R}})(\underbrace{x(p)}_{\in~\mathbb{R}^n})\)
where \(\partial_k\) is the \(k\)-th partial derivative
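An illustrative check on \(M=\mathbb{R}^2\setminus\{0\}\) with the polar chart \(x=(r,\theta)\) (my choice of example): the action of \(\partial/\partial\theta\) at \(p\) is \(\partial_2(f\circ x^{-1})(x(p))\).

```python
import numpy as np

# Chart inverse x^{-1}: (r, theta) -> (r cos theta, r sin theta).
x_inv = lambda c: np.array([c[0] * np.cos(c[1]), c[0] * np.sin(c[1])])

f = lambda q: q[0]                       # f(p) = first Cartesian coordinate
p_coords = np.array([2.0, np.pi / 4])    # x(p) = (r, theta)

# (d/d theta) f = partial_2 (f o x^{-1}) at x(p), via central differences.
h = 1e-6
e2 = np.array([0.0, 1.0])
df = (f(x_inv(p_coords + h * e2)) - f(x_inv(p_coords - h * e2))) / (2 * h)
print(df)  # analytic: -r sin(theta) = -2 sin(pi/4) ~ -1.4142
```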
Exponential map
TODO: move to after geodesics
- "Direction" means \(X\in T_pM\)
- \(\exists\) open \(C(p)\subset M\) (cut locus) s.t.
curve geodesic \(\gamma\) representing \(X\) is unique
\[\exp_pX = \gamma(1)~\text{if defined}.\]

- Before: \(p_{t+1} = p_t - \alpha g_{t},\)
- After: \(p_{t+1} = \exp_{p_t}(-\alpha X_t).\)
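Closed forms for \(\exp\) are rare; the unit sphere \(S^{n-1}\subset\mathbb{R}^n\) is one case where it exists (the great-circle geodesic), so here is a sketch of the new update's main ingredient.

```python
import numpy as np

def sphere_exp(p, X):
    """exp_p(X) on the unit sphere: follow the great circle with
    initial point p and initial velocity X (tangency: <p, X> = 0)."""
    norm = np.linalg.norm(X)
    if norm < 1e-12:                      # exp_p(0) = p
        return p
    return np.cos(norm) * p + np.sin(norm) * X / norm

p = np.array([1.0, 0.0, 0.0])
X = np.array([0.0, np.pi / 2, 0.0])       # tangent at p
print(sphere_exp(p, X))                   # -> [0, 1, 0]: a quarter great circle
```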

Given objective \(f:M\to\mathbb{R}\) and \(p_t\in M\),
how to pick \(X_t\)?
Derivative of a function
Given smooth \(F:M\to N,\)
"the push-forward of \(F\) at \(p\)" is the linear map
\(F_{*,p}:T_p{M}\to T_{F(p)}N\)
\[(\underbrace{F_{*,p}X}_{T_{F(p)}N})\underbrace{g}_{C^\infty(N)} = \underbrace{X}_{T_p M}(\underbrace{g\circ F}_{C^\infty(M)}).\]
- In coordinate basis: Jacobian matrix
Derivative of objective
Given objective \(f:M\to \mathbb{R}\)
its derivative at point \(p\in M\)
is a linear functional \[T_p{M}\to T_{f(p)}\mathbb{R}\cong \mathbb{R}.\]
- \(f_{*,p}\in T_p^* M\) – a covector,
- but the update needs a direction \(X_t\in T_p M\).
Metric manifolds
"A smooth manifold endowed with smooth assignment of (almost) 'inner products' \(g:T_p M\times T_p M\to \mathbb{R}\) at each \(p\)". Notes for details
- Riesz in \(\mathbb{R}^n\): vectors \(\leftrightarrow\) linear functionals via inner product
- \(\omega = g(X, \cdot) = X^\flat\)
- \(\omega \mapsto X=\omega^\sharp\) as unique \(X\) s.t. \(\omega(Y) = g(X, Y)\).
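In finite dimensions the musical isomorphisms are plain linear algebra once a basis is fixed: with metric matrix \(G\), \(X^\flat = GX\) and \(\omega^\sharp = G^{-1}\omega\). A sketch with an arbitrary positive-definite \(G\) of my choosing:

```python
import numpy as np

G = np.array([[2.0, 0.5],
              [0.5, 1.0]])                # metric in some basis of T_p M

flat = lambda X: G @ X                    # X -> g(X, .)
sharp = lambda w: np.linalg.solve(G, w)   # unique X with g(X, Y) = w(Y)

omega = np.array([1.0, -1.0])             # components of a covector
X = sharp(omega)
print(np.allclose(flat(X), omega))        # True: flat and sharp are inverses
```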
RGD
\[X_t = \operatorname{grad}f(p_t) \equiv f_{*,p_t}^\sharp,\]
\[p_{t+1} = \exp_{p_t}(-\alpha X_t).\]
Ready to implement?
Extrinsic representation
Immersion: smooth \(\phi: M\to\mathbb{R}^n\) such that \(\phi_{*,p}\) is injective for every \(p\in M\).
Embedding: an immersion that is a homeomorphism onto its image.
[Strong] Whitney: can embed in \(\mathbb{R}^{2\operatorname{dim}M}.\)
Extrinsic representation
Let \(M\subset\mathbb{R}^d\) be a level set \(\{p~|~F(p)=0\}\) of a smooth \(F\), with \(0\) a regular value. Then:
- \(\left.\operatorname{id}\right|_{M}\) is an embedding,
- can represent points and tangents via (redundant) \(\mathbb{R}^d\) coordinates,
- can induce the metric from the Euclidean one using \(F\),
- need a projection op.
- (prevalent approach, including geoopt)
- (examples using local charts?)
(Becigneul, Ganea)
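A sketch combining the pieces for a concrete level set, the unit sphere \(F(p)=\|p\|^2-1\): with the induced metric, \(\operatorname{grad}f(p)\) is the Euclidean gradient projected onto \(T_pM=\{X~|~\langle p,X\rangle=0\}\). The Rayleigh-quotient objective is my illustrative choice.

```python
import numpy as np

A = np.diag([3.0, 1.0, 0.5])              # illustrative symmetric matrix

f_grad = lambda p: 2 * A @ p              # Euclidean gradient of f(p) = p^T A p
proj = lambda p, v: v - np.dot(p, v) * p  # orthogonal projection onto T_p M

def sphere_exp(p, X):
    n = np.linalg.norm(X)
    return p if n < 1e-12 else np.cos(n) * p + np.sin(n) * X / n

p = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)
for _ in range(200):
    X = proj(p, f_grad(p))                # Riemannian gradient via projection
    p = sphere_exp(p, -0.1 * X)           # RGD step
print(p)  # -> (up to sign) the eigenvector of the smallest eigenvalue, ~[0, 0, 1]
```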
Retractions
- Closed-form \(\exp\) and \(\sharp\)?
- Usually unknown or infeasible
- First-order approximations to \(\exp\): retractions
\[R_p:T_p M \to M,\]
\[R_p(0) = p,\]
\[(R_p)_{*,0} "=" \operatorname{id}_{T_p M}\quad\text{(identifying}~T_0(T_p M)\cong T_p M\text{)}.\]
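For the sphere, the cheapest retraction is metric projection, \(R_p(X) = (p+X)/\|p+X\|\); a sketch comparing it with the exact exponential for a small step (both functions as defined above):

```python
import numpy as np

retract = lambda p, X: (p + X) / np.linalg.norm(p + X)   # projective retraction

def sphere_exp(p, X):
    n = np.linalg.norm(X)
    return p if n < 1e-12 else np.cos(n) * p + np.sin(n) * X / n

p = np.array([1.0, 0.0])
X = np.array([0.0, 0.01])                 # small tangent step
print(retract(p, X) - sphere_exp(p, X))   # agreement to first order in |X|
```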
Faster methods
- In vector spaces:
- second-order methods (Newton)
- dynamic approximations of Hessian (quasi-Newton)
- "momentum", "adaptivity term"
- On manifold:
- First derivative is a vector field
- How to differentiate vector fields?
Vector fields
A bundle is a triple \((E, \pi, M)\), \[E\xrightarrow{\pi} M\]
- \(E\) – "total space" (smooth mfd),
- \(\pi\) – "projection" (smooth surjective map),
- \(M\) – "base space" (smooth mfd).
Vector fields
Tangent bundle:
\[E = TM = \bigsqcup_p T_p M\]
\[\pi : X_{\gamma, p} \mapsto p.\]
Vector fields
A smooth section of a bundle \(E\xrightarrow\pi M\)
is a smooth map \[\sigma:M\to E\] such that \[\pi\circ\sigma = \operatorname{id}_M\]
Set of all smooth sections is denoted \(\Gamma(E)\).
Vector fields
A vector field is a smooth section of the tangent bundle \(TM\).
- To each \(p\in M\) assigns \(X_p\in T_p M\)
- but in a "smooth way"
Another example: metric \(g\) is a smooth section of the bundle of bilinear forms (\((0,2)\)-tensor field)
Covariant derivative
"Covariant derivative of vf \(Y\) along \(X\)":
\[\nabla_XY\]
Wishlist:
- \(\nabla_{fX+Y}Z = f\nabla_XZ + \nabla_Y Z,\)
- \(\nabla_X(Y+Z) = \nabla_XY + \nabla_X Z,\)
- \(\nabla_X(fY) = X(f)Y + f\nabla_XY.\)

Covariant derivative
In chart \((U, x)\):
\[\nabla_XY = \left(X^i\,\partial_i Y^k + \Gamma^k_{ij}\,X^iY^j\right)\frac{\partial}{\partial x^k}\]
(\(\Gamma^k_{ij}\) – the Christoffel symbols of the chart)
Levi-Civita connection
The unique torsion-free connection compatible with \(g\) (\(\nabla g = 0\)):
\[\Gamma^k_{ij} = \tfrac{1}{2}\,g^{kl}\left(\partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij}\right).\]
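For an embedded submanifold with the induced metric, the Levi-Civita covariant derivative is the ambient directional derivative followed by tangent projection: \(\nabla_XY|_p = P_p\,(DY(p)[X_p])\). A finite-difference sketch on the unit sphere; the vector field and point are my illustrative choices.

```python
import numpy as np

proj = lambda p, v: v - np.dot(p, v) * p      # projection onto T_p S^2

a = np.array([1.0, 0.0, 1.0])
Y = lambda q: proj(q, a)                      # tangent field Y(q) = a - <q, a> q

def cov_deriv(Y, p, Xp, h=1e-6):
    """(nabla_X Y)_p = P_p(D Y [X_p]), differentiated along a curve on the sphere."""
    qp = (p + h * Xp) / np.linalg.norm(p + h * Xp)
    qm = (p - h * Xp) / np.linalg.norm(p - h * Xp)
    return proj(p, (Y(qp) - Y(qm)) / (2 * h))

p = np.array([0.0, 0.0, 1.0])
Xp = np.array([0.0, 1.0, 0.0])                # tangent at p
print(cov_deriv(Y, p, Xp))                    # analytic: -<p, a> Xp = [0, -1, 0]
```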

Vector field along a curve
- \(\gamma:I\to M\)
- \(X:I\to TM\) s.t. \(X_\lambda \in T_{\gamma(\lambda)}M\)
Extendible if there is \(\tilde{X}\in\Gamma(TM)\) s.t.
\(X_\lambda = \tilde{X}_{\gamma(\lambda)}.\)
Parallel translate
If \(X:I \to TM\) is a vf along \(\gamma\)
extendible to \(\tilde{X}\)
and
\[\nabla_{\dot\gamma}\tilde{X} = 0\quad\text{along}~\gamma,\]
then \(X\) is called autoparallely transported
- "Transports" \(X_0\) in \(T_{\gamma(0)}M\) to \(X_\lambda\in T_{\gamma(\lambda)}M\)

(Becigneul, Ganea)
Vector transport
- No closed-form parallel translation either
- "Vector transport" – 1st-order approx (see Absil)
(below: a qualitative sketch of a projection-based vector transport for level sets)
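For a level set \(M=\{F=0\}\subset\mathbb{R}^d\), the simplest vector transport moves the ambient coordinates of \(X\) to the new point and projects onto the new tangent space \(T_qM=\ker F_{*,q}\). A sketch for a scalar constraint, so the projection uses \(\nabla F(q)\); the sphere constraint is my illustrative choice.

```python
import numpy as np

grad_F = lambda q: 2 * q                      # F(q) = |q|^2 - 1 (unit sphere)

def transport(q, X):
    """Move the ambient coordinates of X to q, project onto T_q M = ker dF_q."""
    n = grad_F(q)
    return X - (np.dot(n, X) / np.dot(n, n)) * n

p = np.array([1.0, 0.0, 0.0])                 # old point
q = np.array([0.0, 1.0, 0.0])                 # new point
X = np.array([0.0, 1.0, 1.0])                 # tangent at p (<p, X> = 0)
Xq = transport(q, X)
print(Xq, np.dot(q, Xq))                      # -> [0, 0, 1], 0 (tangent at q)
```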
Newton
later addition
\(f: \mathbb{R} \to \mathbb{R},\) find \(x\) s.t. \(f(x)=0\).
- Instead, replace \(f\) with linear \(h(x) = f(x_t) + f'(x_t)(x - x_t)\)
- and solve \(h(x)=0\):
- \(x = x_t-f'(x_t)^{-1}f(x_t)\)
Newton
later addition
\(f: \mathbb{R}\to \mathbb{R},\) find \(x\) s.t. \(f'(x)=0\).
- Apply Newton's iteration to \(f'\)
- \(x_{t+1} = x_t - f''(x_t)^{-1} f'(x_t)\)
Newton
later addition
\(f: \mathbb{R}^n\to\mathbb{R}^m\), find \(x\) s.t. \(f(x)=0\).
- \(\tilde{f}(x) = f(x_t) + f'(x_t)(x-x_t),\)
- \(\tilde{f}(x_{t+1}) = 0.\)
Newton
later addition
\(f: \mathbb{R}^n\to\mathbb{R}\), find \(x\) s.t. \(f'(x)=0\).
- \(\tilde{f'}(x) = f'(x_t) + \underbrace{f''(x_t)(x - x_t)}_{\in~{^n}\mathbb{R}},\)
- \(f''(x_t)(x - x_t) = -f'(x_t)\)
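A sketch of this last variant (minimization in \(\mathbb{R}^n\)) with an illustrative objective; the hand-coded gradient and Hessian are my choices.

```python
import numpy as np

# Illustrative objective: f(x) = (x1 - 1)^4 + x2^2, minimum at (1, 0).
grad = lambda x: np.array([4 * (x[0] - 1) ** 3, 2 * x[1]])
hess = lambda x: np.array([[12 * (x[0] - 1) ** 2, 0.0],
                           [0.0, 2.0]])

x = np.array([3.0, 2.0])
for _ in range(30):
    # Solve f''(x_t)(x_{t+1} - x_t) = -f'(x_t) for the next iterate.
    x = x + np.linalg.solve(hess(x), -grad(x))
print(x)  # -> approximately [1, 0]
```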
Newton
Solve
\[\underbrace{\nabla_{X_t}\operatorname{grad}f(p_t)}_{\operatorname{Hess}f(p_t)[X_t]}
= -\operatorname{grad}f(p_t)\]
for \(X_t\in T_{p_t}M\).
\[p_{t+1} = \exp_{p_t}X_t.\]
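A sketch on the unit sphere, using the known expression for the Riemannian Hessian of a restricted function \(\bar f\): \(\operatorname{Hess}f(p)[X] = P_p\nabla^2\bar f(p)X - \langle p,\nabla\bar f(p)\rangle X\). The Rayleigh-quotient objective is again my illustrative choice.

```python
import numpy as np

A = np.diag([3.0, 1.0, 0.5])
grad_bar = lambda p: 2 * A @ p            # Euclidean gradient of p^T A p
hess_bar = lambda p: 2 * A                # Euclidean Hessian

def sphere_exp(p, X):
    n = np.linalg.norm(X)
    return p if n < 1e-12 else np.cos(n) * p + np.sin(n) * X / n

p = np.array([1.0, 1.0, 2.0])
p /= np.linalg.norm(p)
for _ in range(20):
    P = np.eye(3) - np.outer(p, p)        # projection onto T_p M
    g = P @ grad_bar(p)                   # Riemannian gradient
    H = P @ hess_bar(p) @ P - np.dot(p, grad_bar(p)) * P  # Hessian on T_p M
    X, *_ = np.linalg.lstsq(H, -g, rcond=None)  # solve within the tangent space
    p = sphere_exp(p, P @ X)
print(p)  # -> a critical point of f: some eigenvector of A (up to sign)
```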
Accelerated first-order methods
- Stochastic optimization
- Accelerated SGD?
- Adam: coordinate-wise adaptive learning rate (more or less)
- Parallel translation of "momentum"
"Coordinates" on manifolds?
Product manifolds
\[M = M_1\times \cdots\times M_n,\]
Product topology, product metric, …
RAdam: adaptivity term for each product component!
(Becigneul, Ganea)
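A sketch of the (Becigneul, Ganea) setup with geoopt, assuming its current API (`geoopt.PoincareBall`, `geoopt.ManifoldParameter`, `geoopt.optim.RiemannianAdam`); the toy objective stands in for the actual WordNet loss.

```python
import torch
import geoopt

ball = geoopt.PoincareBall(c=1.0)             # curvature hyperparameter c
# A batch of embeddings living on the ball: one product component each.
emb = geoopt.ManifoldParameter(torch.randn(10, 2) * 1e-3, manifold=ball)
opt = geoopt.optim.RiemannianAdam([emb], lr=1e-2)

for _ in range(100):
    opt.zero_grad()
    # Toy stand-in for the WordNet loss: push points 0, 1 to distance 1.
    loss = (ball.dist(emb[0], emb[1]) - 1.0) ** 2
    loss.backward()
    opt.step()                                # retraction + adaptivity per component
```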
\(\varkappa\)-Stereographic space
- (Becigneul, Ganea) use RAdam to embed WordNet in a Poincaré ball (constant curvature \(c\))
- Hyperbolic neural networks
- \(c\) – hyperparameter
- Optimize for \(c\)?
\(\varkappa\)-Stereographic space
- \(S^\varkappa\) – \(\varkappa\)-stereographic space, universal for positive and negative curvature (Bloch)
- "Almost \(\mathbb{R}\)-line bundle"
- \(E\xrightarrow\pi \mathbb{R}\)
- \(\pi^{-1}(\varkappa) \simeq S^\varkappa\)
- Can we induce metric, etc? OK for optimization?