
Optimization on manifolds


References

  • Differential Geometry
    • Frederic Schuller's lectures on YouTube
    • Lee's series
  • Optimization: Absil et al.
  • Details in notes

Metagoals

  • This talk fails to provide examples (yet)
  • Aims at stating the problems
  • and actually asking questions

"Riemannian Adaptive Optimization Methods"

(Becigneul, Ganea)

Gradient descent

  • \(p_0\in K,~\alpha\in\mathbb{R},~f: K\subset \mathbb{R}^d\to \mathbb{R},\)
  • \(f(p)\to \min.\)
  1. \(g_t = f'(p_t)^\top\)
  2. \(p_{t+1} = p_t - \alpha g_t\)
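
A minimal NumPy sketch of this loop (the function names are mine):

```python
import numpy as np

def gradient_descent(grad_f, p0, alpha=0.1, steps=100):
    """Plain gradient descent in R^d: p_{t+1} = p_t - alpha * grad_f(p_t)."""
    p = np.asarray(p0, dtype=float)
    for _ in range(steps):
        p = p - alpha * grad_f(p)
    return p

# Example: f(p) = ||p||^2 has gradient 2p; descent converges to the origin.
p_star = gradient_descent(lambda p: 2 * p, p0=[3.0, -4.0])
```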

Manifolds

An \(n\)-manifold: "a 'nice' topological space \((M, \Theta)\) in which every point \(p\in M\) has an open neighbourhood homeomorphic to (a piece of) \(\mathbb{R}^n\)"

\[\phi: U\xrightarrow{\sim} \phi(U)\subset \mathbb{R}^n\]


Manifolds. Missing tools

  • Descent direction: no derivatives
    • No notion of differentiability
  • Step along direction: no addition

Smooth manifolds

"A manifold covered with a maximal atlas of smoothly-compatible charts", \((M, \Theta, \mathcal{A})\)


Smooth functions

\(f:M\to\mathbb{R}\) is smooth if \(f\circ x^{-1}: x(U)\to\mathbb{R}\) is smooth for every chart \((U,x)\in\mathcal{A}\).

Directional derivatives

  • \((M, \Theta, \mathcal{A}),\)
  • \(p\in M,\)
  • Smooth curve \(\gamma:I\subset\mathbb{R}\to M,\) \(\gamma(0)=p,\)
  • \(f: M\to \mathbb{R}\) smooth.

\[(f\circ \gamma)'(0).\]


Directional derivatives

\[X_{\gamma, p}f = (f\circ \gamma)'(0)\]

"Directional derivative of \(f\) along \(\gamma\) at \(p\)"



Directional derivatives

\[X_{\gamma, p}f = (f\circ \gamma)'(0).\]

  • \(X_{\gamma,p}:C^\infty(M)\to\mathbb{R}\) is a linear operator.
  • Leibniz rule: \(X_{\gamma, p}(fg) = f(p)X_{\gamma, p}g + X_{\gamma, p}(f)g(p).\)


Tangent space

\[T_pM = \left\{ X_{\gamma,p}\left| \gamma:I\to M~\text{smooth}, ~\gamma(0)=p \right. \right\}\]

  • "Tangent space to \(M\) at \(p\)"
  • Coincides with the set of all derivations
  • Derivations :left_right_arrow: curve segments via charts
  • Forms a linear space

Coordinate basis

A chart \((U,x)\), \(p\in U\), corresponds to a basis in \(T_p M\), represented by coordinate curves.

  • Denoted: \(\frac{\partial}{\partial x^1},~\ldots,~\frac{\partial}{\partial x^n}.\)
  • Acts: \(\frac{\partial}{\partial x^k}f = \partial_k (\underbrace{f\circ x^{-1}}_{\mathbb{R}^n\to\mathbb{R}})(\underbrace{x(p)}_{\in\mathbb{R}^n})\) where \(\partial_k\) is the \(k\)-th partial derivative

Exponential map

TODO: move to after geodesics

  • "Direction" means \(X\in T_pM\)
  • \(\exists\) open \(C(p)\subset M\) (bounded by the cut locus) on which the geodesic \(\gamma\) with \(\gamma(0)=p,~\dot\gamma(0)=X\) is unique

\[\exp_pX = \gamma(1)~\text{if defined}.\]
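
For instance, on the unit sphere \(S^{n-1}\subset\mathbb{R}^n\) with the round metric, the exponential map has a closed form: \[\exp_p X = \cos(\|X\|)\,p + \frac{\sin(\|X\|)}{\|X\|}\,X.\]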


Smooth manifolds. Missing tools

  • Before: \(p_{t+1} = p_t - \alpha g_{t},\)
  • After: \(p_{t+1} = \exp_{p_t}(-\alpha X_t).\)

:confused:

Given objective \(f:M\to\mathbb{R}\) and \(p_t\in M\), how to pick \(X_t\)?


Derivative of a function

Given smooth \(F:M\to N,\)

"the push-forward of \(F\) at \(p\)" is the linear map \(F_{*,p}:T_p{M}\to T_{F(p)}N\)

\[(\underbrace{F_{*,p}X}_{T_{F(p)}N})\underbrace{g}_{C^\infty(N)} = \underbrace{X}_{T_p M}(\underbrace{g\circ F}_{C^\infty(M)}).\]

  • In coordinate basis: Jacobian matrix
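
Concretely: pick charts \((U, x)\) on \(M\) and \((V, y)\) on \(N\); then \[\left(F_{*,p}\right)^i_{~j} = \partial_j\!\left(y^i\circ F\circ x^{-1}\right)\!\left(x(p)\right).\]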

Derivative of objective

Given objective \(f:M\to \mathbb{R}\), its derivative at a point \(p\in M\) is a linear functional \[T_p{M}\to T_{f(p)}\mathbb{R}\cong \mathbb{R}.\]

  • \(f_{*,p}\in T_p^* M\),
  • but a descent direction must be a vector in \(T_p M\) :confused: .

Metric manifolds

"A smooth manifold endowed with smooth assignment of (almost) 'inner products' \(g:T_p M\times T_p M\to \mathbb{R}\) at each \(p\)". Notes for details

  • Riesz in \(\mathbb{R}^n\): vectors :left_right_arrow: linear functionals via inner product
  • \(\omega = g(X, \cdot) = X^\flat\)
  • \(\omega \mapsto X=\omega^\sharp\) as unique \(X\) s.t. \(\omega(Y) = g(X, Y)\).
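  • In a coordinate basis this is index raising/lowering with \(g_{ij} = g\left(\frac{\partial}{\partial x^i}, \frac{\partial}{\partial x^j}\right)\) and its inverse \(g^{ij}\): \((X^\flat)_i = g_{ij}X^j\), \((\omega^\sharp)^i = g^{ij}\omega_j\).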

RGD

\[X_t = \operatorname{grad}f(p_t) \equiv f_{*,p_t}^\sharp,\] \[p_{t+1} = \exp_{p_t}(-\alpha X_t).\]

Ready to implement?
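
As a sanity check, here is a self-contained NumPy sketch for the unit sphere, where \(\sharp\) and \(\exp\) both have closed forms (a toy, not geoopt's implementation):

```python
import numpy as np

def sphere_rgd(egrad_f, p0, alpha=0.1, steps=100):
    """Riemannian GD on the unit sphere S^{n-1}; egrad_f returns the
    Euclidean gradient of f at a point of R^n."""
    p = np.asarray(p0, dtype=float)
    p /= np.linalg.norm(p)
    for _ in range(steps):
        g = egrad_f(p)
        X = g - np.dot(g, p) * p  # grad f(p): project onto T_p S^{n-1}
        v = -alpha * X
        t = np.linalg.norm(v)
        if t < 1e-12:
            break
        p = np.cos(t) * p + np.sin(t) / t * v  # exp_p(v) on the sphere
    return p

# Example: minimizing f(p) = p^T A p over the sphere recovers the
# eigenvector of A with the smallest eigenvalue.
A = np.diag([3.0, 2.0, 1.0])
p_min = sphere_rgd(lambda p: 2 * A @ p, p0=np.array([1.0, 1.0, 1.0]))
```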


Extrinsic representation

Immersion: smooth \(\phi: M\to\mathbb{R}^n\) such that \(\phi_{*,p}\) is injective for every \(p\in M\). Embedding: an immersion that is a homeomorphism onto its image.

[Strong] Whitney: can embed in \(\mathbb{R}^{2\operatorname{dim}M}.\)


Extrinsic representation

Let \(M\subset\mathbb{R}^d\) be a level set \(\{p \mid F(p)=0\}\) of a smooth map \(F\).

  • \(\left.\operatorname{id}\right|_{M}\) is an embedding and

  • can represent points and tangents via (redundant) \(\mathbb{R}^d\) coordinates,

  • can induce metric from Euclidean using \(F\),

  • need projection op.

  • (prevalent approach, including geoopt)

  • (examples using local charts?)
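
For a hypersurface \(\{F=0\}\) with \(F:\mathbb{R}^d\to\mathbb{R}\), the projection just removes the normal component \(\nabla F(p)\). A minimal NumPy sketch (names are mine):

```python
import numpy as np

def tangent_project(grad_F, p, v):
    """Orthogonal projection of an ambient vector v onto T_p{F=0},
    assuming grad_F(p) != 0 (so the level set is a hypersurface)."""
    n = grad_F(p)
    return v - (np.dot(n, v) / np.dot(n, n)) * n

# Example: the sphere as {||p||^2 - 1 = 0}, grad F(p) = 2p.
p = np.array([0.0, 0.0, 1.0])
v_tan = tangent_project(lambda q: 2 * q, p, np.array([1.0, 2.0, 3.0]))  # -> [1, 2, 0]
```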


(Becigneul, Ganea)

Retractions

  • Closed-form \(\exp\) and \(\sharp\)?
    • Usually unknown or infeasible
  • First-order approximations to \(\exp\): retractions

\[R_p:T_p M \to M,\] \[R_p(0) = p,\] \[(R_p)_{*,0} "=" \operatorname{id}_{T_p M}.\]
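
On a level set, metric projection back onto \(M\) typically gives such a retraction. For the sphere (a sketch):

```python
import numpy as np

def retract_sphere(p, v):
    """R_p(v) = (p + v) / ||p + v||: maps T_p S^{n-1} back to the sphere,
    fixes p at v = 0, and agrees with exp_p to first order in v."""
    q = p + v
    return q / np.linalg.norm(q)
```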


Faster methods

  • In vector spaces:
    • second-order methods (Newton)
    • dynamic approximations of Hessian (quasi-Newton)
      • "momentum", "adaptivity term"
  • On manifold:
    • First derivative is a vector field
    • How to differentiate vector fields?

Vector fields

A bundle is a triple \((E, \pi, M)\), \[E\xrightarrow{\pi} M\]

  • \(E\) "total space" (smooth mfd),
  • \(\pi\) "projection" (smooth surjective map),
  • \(M\) "base space" (smooth mfd).

Vector fields

Tangent bundle: \[E = TM = \bigcup_{p\in M} T_p M,\] \[\pi: X_{\gamma, p} \mapsto p.\]


Vector fields

A smooth section of a bundle \(E\xrightarrow\pi M\) is a smooth map \[\sigma:M\to E\] such that \[\pi\circ\sigma = \operatorname{id}_M\]

Set of all smooth sections is denoted \(\Gamma(E)\).


Vector fields

A vector field is a smooth section of the tangent bundle \(TM\).

  • To each \(p\in M\) assigns \(X_p\in T_p M\)
  • but in a "smooth way"

Another example: metric \(g\) is a smooth section of the bundle of bilinear forms (\((0,2)\)-tensor field)


Covariant derivative

"Covariant derivative of vf \(Y\) along \(X\)": \[\nabla_XY\] Wishlist:

  1. \(\nabla_{fX+Y}Z = f\nabla_XZ + \nabla_Y Z,\)
  2. \(\nabla_X(Y+Z) = \nabla_XY + \nabla_X Z,\)
  3. \(\nabla_X(fY) = X(f)Y + f\nabla_XY.\)


Covariant derivative

In chart:

\[\nabla_XY = \left(X^i\,\partial_i Y^k + \Gamma^k_{ij}\,X^iY^j\right)\frac{\partial}{\partial x^k}\]

(Christoffel symbols: \(\nabla_{\frac{\partial}{\partial x^i}}\frac{\partial}{\partial x^j} = \Gamma^k_{ij}\,\frac{\partial}{\partial x^k}\))


Levi-Civita connection

The unique connection that is torsion-free, \[\nabla_XY - \nabla_YX = [X, Y],\] and compatible with the metric, \[X\,g(Y,Z) = g(\nabla_XY, Z) + g(Y, \nabla_XZ).\] In chart: \(\Gamma^k_{ij} = \frac{1}{2}\,g^{kl}\left(\partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij}\right).\)


Vector field along a curve

  • \(\gamma:I\to M\)
  • \(X:I\to TM\) s.t. \(X_\lambda \in T_{\gamma(\lambda)}M\)

Extendible if there is \(\tilde{X}\in\Gamma(TM)\) s.t. \(X_\lambda = \tilde{X}_{\gamma(\lambda)}.\)

  • Thm: locally unique

Parallel translate

If \(X:I \to TM\) is a vf along \(\gamma\) extendible to \(\tilde{X}\) and \[\nabla_{\dot\gamma}\tilde{X} = 0\] along \(\gamma\), then \(X\) is called autoparallely-transported.

  • "Transports" \(X_0\) in \(T_{\gamma(0)}M\) to \(X_\lambda\in T_{\gamma(\lambda)}M\)


(Becigneul, Ganea)

Vector transport

  • No closed-form parallel translation either
  • "Vector transport" 1st-order approx (see Absil)

(here should go a qualitative derivation of vector transports for level sets; a sketch follows)
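
One standard construction for level sets (see Absil et al.): move the vector to the new point unchanged in ambient \(\mathbb{R}^d\) coordinates, then project it onto the new tangent space. A sketch for hypersurfaces:

```python
import numpy as np

def projection_vector_transport(grad_F, q, w):
    """Transport w to T_q{F=0} by orthogonal projection at the new point q;
    agrees with parallel translation to first order."""
    n = grad_F(q)
    return w - (np.dot(n, w) / np.dot(n, n)) * n
```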


Newton

later addition

\(f: \mathbb{R} \to \mathbb{R},\) find \(x\) s.t. \(f(x)=0\).

  • Instead, replace \(f\) with linear \(h(x) = f(x_t) + f'(x_t)(x - x_t)\)
  • and solve \(h(x)=0\):
  • \(x = x_t-f'(x_t)^{-1}f(x_t)\)
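
A minimal sketch of the iteration:

```python
def newton_root(f, df, x0, steps=20, tol=1e-12):
    """Newton's iteration for f(x) = 0: solve the linearization at x_t."""
    x = x0
    for _ in range(steps):
        step = f(x) / df(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# Example: sqrt(2) as the root of x^2 - 2.
r = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```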

Newton

later addition

\(f: \mathbb{R}\to \mathbb{R},\) find \(x\) s.t. \(f'(x)=0\).

  • Apply Newton's iteration to \(f'\)
  • \(x_{t+1} = x_t - f''(x_t)^{-1} f'(x_t)\)

Newton

later addition

\(f: \mathbb{R}^n\to\mathbb{R}^m\), find \(x\) s.t. \(f(x)=0\).

  • \(\tilde{f}(x) = f(x_t) + f'(x_t)(x-x_t),\)
  • \(\tilde{f}(x_{t+1}) = 0.\)

Newton

later addition

\(f: \mathbb{R}^n\to\mathbb{R}\), find \(x\) s.t. \(f'(x)=0\).

  • \(\tilde{f'}(x) = f'(x_t) + \underbrace{f''(x_t)(x - x_t)}_{\in~{^n}\mathbb{R}},\)
  • \(f''(x_t)(x - x_t) = -f'(x_t)\)
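
In code the Newton step is a linear solve against the Hessian; a NumPy sketch:

```python
import numpy as np

def newton_opt(grad_f, hess_f, x0, steps=20):
    """Newton's method for minimizing f on R^n:
    solve f''(x_t) d = -f'(x_t), then x_{t+1} = x_t + d."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        d = np.linalg.solve(hess_f(x), -grad_f(x))
        x = x + d
    return x

# Example: a quadratic f(x) = 0.5 x^T A x - b^T x is solved in one step.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = newton_opt(lambda x: A @ x - b, lambda x: A, x0=np.zeros(2))
```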

Newton

Solve \[\nabla_{X_t}\operatorname{grad}f(p_t) = -\operatorname{grad}f(p_t)\] for \(X_t\in T_{p_t}M\).

\[p_{t+1} = \exp_{p_t}X_t.\]


Accelerated first-order methods

  • Stochastic optimization
  • Accelerated SGD?
  • Adam: coordinate-wise adaptive learning rate (more or less)
    • Parallel translation of "momentum"

"Coordinates" on manifolds?


Product manifolds

\[M = M_1\times \cdots\times M_n,\]

Product topology, product metric.

RAdam: adaptivity term for each product component!


(Becigneul, Ganea)

\(\varkappa\)-Stereographic space

  • (Becigneul, Ganea) use RAdam to embed WordNet in a Poincaré ball (constant curvature \(c\))
  • Hyperbolic neural networks
  • \(c\) hyperparameter
  • Optimize for \(c\)?
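
A minimal sketch of what this looks like with geoopt (the toy objective and initialization are illustrative, not the paper's setup):

```python
import torch
import geoopt

ball = geoopt.PoincareBall(c=1.0)  # curvature c as a fixed hyperparameter

# 2D embeddings near the origin; ManifoldParameter records the manifold
# so the optimizer knows to take Riemannian steps.
emb = geoopt.ManifoldParameter(torch.randn(10, 2) * 1e-3, manifold=ball)
opt = geoopt.optim.RiemannianAdam([emb], lr=1e-2)

for _ in range(100):
    opt.zero_grad()
    loss = ball.dist(emb[0], emb[1])  # toy loss: pull two points together
    loss.backward()
    opt.step()  # rescaled gradient + retraction + transport of Adam state
```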

\(\varkappa\)-Stereographic space

  • \(S^\varkappa\) \(\varkappa\)-stereographic space, universal for positive and negative curvature (Bloch)
  • "Almost \(\mathbb{R}\)-line bundle"
    • \(E\xrightarrow\pi \mathbb{R}\)
    • \(\pi^{-1}(\varkappa) \simeq S^\varkappa\)
  • Can we induce metric, etc? OK for optimization?