# Optimization on manifolds
- [geoopt/geoopt](https://github.com/geoopt/geoopt) ( [@ferrine](https://twitter.com/ferrine96/), [@rrkarim](https://github.com/rrkarim), [@SomeoneS](https://twitter.com/SomeoneSerge))
- **March 13**: an overview, @SomeoneS
- Notes: https://hackmd.io/@SomeoneS/S1gDMMFV8
- **March 20**: Hyperbolic deep learning, graph embeddings, @ferrine96
---
References
---
- Differential Geometry
- Frederic Schuller's lectures on YouTube
- Lee's series
- Optimization: Absil et al.
- Details in notes
---
Metagoals
---
- This talk does not (yet) provide worked examples
- It aims at stating the problems
- ...and actually asking questions
---
"Riemannian Adaptive Optimization Methods"
---
<center><img src="https://i.imgur.com/Wd5ofKf.png" class="img-responsive" width="75%" />
(Becigneul, Ganea)</center>
---
Gradient descent
---
- $p_0\in K,~\alpha\in\mathbb{R},~f: K\subset \mathbb{R}^d\to \mathbb{R},$
- $f(p)\to \min.$
> 1. $g_t = f'(p_t)^\top$
> 2. $p_{t+1} = p_t - \alpha g_t$
<center><img src="https://i.imgur.com/ZgWHr6T.jpg" class="img-responsive" width="25%" /></center>
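A minimal sketch of this iteration in plain NumPy, with a made-up quadratic objective for illustration:

```python
import numpy as np

def grad_descent(f_grad, p0, alpha=0.1, steps=100):
    """Plain gradient descent: p_{t+1} = p_t - alpha * g_t."""
    p = np.asarray(p0, dtype=float)
    for _ in range(steps):
        p = p - alpha * f_grad(p)
    return p

# toy objective f(p) = |p - c|^2 with known gradient 2 (p - c)
c = np.array([1.0, -2.0])
print(grad_descent(lambda p: 2 * (p - c), p0=np.zeros(2)))  # converges towards c
```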
---
Manifolds
---
An $n$-manifold: "a ['nice'](https://hackmd.io/@newkozlukov/manopt-notes#Manifolds) top. space $(M, \Theta)$
in which every $p\in M$ has an open neighbourhood
homeomorphic to (a piece of) $\mathbb{R}^n$"
<center><img src="https://i.imgur.com/kzdBEOI.jpg" class="img-responsive" width="45%"/></center>
$$\phi: U\xrightarrow{\sim} \phi(U)\subset \mathbb{R}^n$$
---
Manifolds. Missing tools
---
- Descent direction: **no derivatives**
- **No** notion of **differentiability**
- Step along direction: **no addition**
---
Smooth manifolds
---
"A manifold covered with a maximal atlas of smoothly-compatible charts", $(M, \Theta, \mathcal{A})$
<center><img src="https://i.imgur.com/F0NBjOs.jpg" class="img-responsive" width="40%" /></center>
---
Smooth *functions*
---
<center><img src="https://i.imgur.com/8BHgHRv.jpg" class="img-responsive" width="75%" /></center>
---
Directional derivatives
---
- $(M, \Theta, \mathcal{A}),$
- $p\in M,$
- Smooth curve $\gamma:I\subset\mathbb{R}\to M,$ $\gamma(0)=p,$
- $f: M\to \mathbb{R}$ smooth.
$$(f\circ \gamma)'(0).$$
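The number $(f\circ \gamma)'(0)$ is an ordinary derivative of a real function of a real variable, so it can be checked numerically; a toy sketch with a made-up curve on $S^2\subset\mathbb{R}^3$:

```python
import numpy as np

# a curve on S^2 through p = (1, 0, 0): gamma(t) = (cos t, sin t, 0)
gamma = lambda t: np.array([np.cos(t), np.sin(t), 0.0])
f = lambda p: p[0] + 2 * p[1] ** 2        # a smooth f, restricted to the sphere

# (f o gamma)'(0) by central differences
h = 1e-5
X_f = (f(gamma(h)) - f(gamma(-h))) / (2 * h)
print(X_f)  # ~ 0: d/dt [cos t + 2 sin^2 t] at t=0 is -sin 0 + 4 sin 0 cos 0 = 0
```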
---
Directional derivatives
---
$$X_{\gamma, p}f = (f\circ \gamma)'(0)$$
"Directional derivative of $f$ along $\gamma$ at $p$"
---
<center><img src="https://i.imgur.com/4Zu3yoc.jpg" style="border-radius: 1em; margin: 1em;" width="45%" class="img-responsive"/></center>
---
Directional derivatives
---
$$X_{\gamma, p}f = (f\circ \gamma)'(0).$$
- $X_{\gamma,p}:C^\infty(M)\to\mathbb{R}$ is a **linear operator**.
- **Leibniz rule**:
$X_{\gamma, p}(fg) = f(p)X_{\gamma, p}g + X_{\gamma, p}(f)g(p).$
---
<center><img src="https://i.imgur.com/mSv7diH.jpg" class="img-responsive" width="45%" style="border-radius: 1em; margin: 1em" /></center>
---
Tangent space
---
$$T_pM = \left\{
X_{\gamma,p}\left|
\gamma:I\to M~\text{smooth},
~\gamma(0)=p
\right.
\right\}$$
- "Tangent space to $M$ at $p$"
- Coincides with the set of all _derivations_
- _Derivations_ :left_right_arrow: _curve_ segments via *charts*
- Forms a linear space
---
Coordinate basis
---
A chart $(U,x)$, $p\in U$, corresponds to a basis in $T_p M$,
represented by _coordinate curves_.
<center><img src="https://i.imgur.com/wypzdvi.jpg" class="img-responsive" width="30%" /></center>
- Denoted: $\frac{\partial}{\partial x^1},~\ldots,~\frac{\partial}{\partial x^n}.$
- Acts: $\frac{\partial}{\partial x^k}f
= \partial_k (\underbrace{f\circ x^{-1}}_{\mathbb{R}^n\to\mathbb{R}})(\underbrace{x(p)}_{\mathbb{R}^n}),$
where $\partial_k$ is the $k$-th partial derivative in $\mathbb{R}^n$
---
Exponential map
---
_TODO: move to after geodesics_
- "_Direction_" means $X\in T_pM$
- $\exists$ open $C(p)\subset M$ (bounded by the _cut locus_) s.t. the ~~curve~~ geodesic $\gamma$ representing $X$ is **unique**
$$\exp_pX = \gamma(1)~\text{if defined}.$$
<center><img src="https://i.imgur.com/d7wI2OA.jpg" class="img-responsive" width="30%" /></center>
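For the unit sphere $S^{n-1}\subset\mathbb{R}^n$ the exponential map is a great-circle arc and has a closed form; a minimal sketch (assuming $p$ is a unit vector and $X\in T_pS^{n-1}$, i.e. $\langle p, X\rangle=0$):

```python
import numpy as np

def sphere_exp(p, X):
    """Exponential map on the unit sphere: follow the great circle
    through p with initial velocity X for unit time."""
    norm = np.linalg.norm(X)
    if norm < 1e-12:
        return p
    return np.cos(norm) * p + np.sin(norm) * X / norm

p = np.array([1.0, 0.0, 0.0])
X = np.array([0.0, np.pi / 2, 0.0])       # tangent at p: <p, X> = 0
print(sphere_exp(p, X))                   # ~ (0, 1, 0), a quarter of the great circle
```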
---
Smooth manifolds. Missing tools
---
- Before: $p_{t+1} = p_t - \alpha g_{t},$
- After: $p_{t+1} = \exp_{p_t}(-\alpha X_t).$
:confused:
Given objective $f:M\to\mathbb{R}$ and $p_t\in M$,
**how to pick $X_t$**?
---
Derivative of a function
---
Given smooth $F:M\to N,$
"the _push-forward_ of $F$ at $p$" is the linear map
$F_{*,p}:T_p{M}\to T_{F(p)}N$
$$(\underbrace{F_{*,p}X}_{T_{F(p)}N})\underbrace{g}_{C^\infty(N)} = \underbrace{X}_{T_p M}(\underbrace{g\circ F}_{C^\infty(M)}).$$
- In coordinate basis: Jacobian matrix
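In coordinates the push-forward is exactly a Jacobian-vector product, so autodiff computes it directly; a toy sketch with `torch.autograd.functional.jvp` and a made-up map:

```python
import torch
from torch.autograd.functional import jvp

# a smooth map F: R^2 -> R^3 standing in for a chart expression of F: M -> N
def F(x):
    return torch.stack([x[0] * x[1], torch.sin(x[0]), x[1] ** 2])

p = torch.tensor([1.0, 2.0])
X = torch.tensor([0.1, -0.3])             # tangent vector at p, in coordinates

_, FX = jvp(F, (p,), (X,))                # push-forward F_{*,p} X = J_F(p) X
print(FX)
```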
---
Derivative of objective
---
Given objective $f:M\to \mathbb{R}$
its derivative at point $p\in M$
is a linear functional $$T_p{M}\to T_{f(p)}\mathbb{R}\cong \mathbb{R}.$$
- $f_{*,p}\in T_p^* M$,
- but for a descent step we need an element of $T_p M$ :confused:
---
Metric manifolds
---
"A smooth manifold endowed with _smooth_ assignment of (almost) '_inner products_' $g:T_p M\times T_p M\to \mathbb{R}$ at each $p$". Notes for details
- Rieszc in $\mathbb{R}^n$: vectors :left_right_arrow: linear functionals via _inner product_
- $\omega = g(X, \cdot) = X^\flat$
- $\omega \mapsto X=\omega^\sharp$ as unique $X$ s.t. $\omega(Y) = g(X, Y)$.
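In a chart the metric at $p$ is a symmetric positive-definite matrix $G$, and $\sharp$ is just a linear solve; a toy sketch with made-up numbers:

```python
import numpy as np

def sharp(G, omega):
    """Riesz / musical isomorphism in coordinates: the unique X with G X = omega,
    i.e. g(X, Y) = omega(Y) for all Y."""
    return np.linalg.solve(G, omega)

# made-up metric and differential at some point, for illustration
G = np.array([[2.0, 0.5],
              [0.5, 1.0]])                # g_{ij}(p), symmetric positive definite
df = np.array([1.0, -1.0])                # components of f_{*,p} = df_p
grad_f = sharp(G, df)                     # Riemannian gradient: G^{-1} df
print(grad_f, G @ grad_f)                 # the second print recovers df
```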
---
RGD
---
$$X_t = \operatorname{grad}f(p_t) \equiv f_{*,p_t}^\sharp,$$
$$p_{t+1} = \exp_{p_t}(-\alpha X_t).$$
Ready to implement?
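For the unit sphere, yes: project the Euclidean gradient onto $T_pM$ to get $\operatorname{grad}f$, then step with the closed-form exp; a toy sketch minimizing a made-up quadratic (its minimizer is the eigenvector of the smallest eigenvalue):

```python
import numpy as np

A = np.diag([3.0, 1.0, 0.5])              # toy objective f(p) = p^T A p on S^2

def egrad(p):                             # Euclidean gradient of f
    return 2 * A @ p

def rgrad(p):                             # Riemannian gradient: project onto T_p S^2
    g = egrad(p)
    return g - np.dot(g, p) * p

def sphere_exp(p, X):                     # closed-form exp, as in the earlier sketch
    n = np.linalg.norm(X)
    return p if n < 1e-12 else np.cos(n) * p + np.sin(n) * X / n

p = np.array([0.5, 0.5, 0.5]); p /= np.linalg.norm(p)
for _ in range(200):
    p = sphere_exp(p, -0.1 * rgrad(p))    # p_{t+1} = exp_{p_t}(-alpha X_t)
print(p)                                  # converges to +-(0, 0, 1)
```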
---
Extrinsic representation
---
_Immersion_: smooth $\phi: M\to\mathbb{R}^n$ such that $\phi_{*,p}$ is injective for every $p\in M$.
_Embedding_: an injective immersion that is a homeomorphism onto its image.
[Strong] Whitney: can embed in $\mathbb{R}^{2\operatorname{dim}M}.$
---
Extrinsic representation
---
Let $M\subset\mathbb{R}^d$ be a level set, $M = \{p \mid F(p)=0\}$.
- $\left.\operatorname{id}\right|_{M}$ is an embedding, so
- we can represent points and tangents via (redundant) $\mathbb{R}^d$ coordinates,
- we can induce the metric from the Euclidean one via $F$,
- we need a __projection__ op (see the sketch below).
- (prevalent approach, including in geoopt)
- (examples using local charts?)
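For the sphere, viewed as the level set of $F(p)=\|p\|^2-1$, both projections are one-liners; a minimal sketch:

```python
import numpy as np

# Sphere as the level set F(p) = |p|^2 - 1 = 0; grad F(p) = 2p spans the normal space.
def proj_point(p):                        # map an ambient point back onto M
    return p / np.linalg.norm(p)

def proj_tangent(p, v):                   # orthogonal projection of v onto T_p M
    n = p / np.linalg.norm(p)             # unit normal (direction of grad F)
    return v - np.dot(v, n) * n

p = proj_point(np.array([3.0, 4.0, 0.0]))
v = np.array([1.0, 1.0, 1.0])
print(proj_tangent(p, v), np.dot(proj_tangent(p, v), p))  # second value ~ 0
```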
---
<center><img src="https://i.imgur.com/Wd5ofKf.png" class="img-responsive" width="75%" />
(Becigneul, Ganea)</center>
---
Retractions
---
- Closed-form $\exp$ and $\sharp$?
- Usually unknown or infeasible
- First-order approximations to $\exp$: **retractions**
$$R_p:T_p M \to M,$$
$$R_p(0) = p,$$
$$(R_p)_{*,0} "=" \operatorname{id}_{T_p M}.$$
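For the sphere, "step in the ambient space and project back" is such a retraction; a toy sketch comparing it against the exact exponential map for a small step:

```python
import numpy as np

def sphere_retract(p, X):
    """First-order retraction: ambient step, then projection back onto the sphere."""
    q = p + X
    return q / np.linalg.norm(q)

def sphere_exp(p, X):
    n = np.linalg.norm(X)
    return p if n < 1e-12 else np.cos(n) * p + np.sin(n) * X / n

p = np.array([1.0, 0.0, 0.0])
X = np.array([0.0, 1e-2, 0.0])            # small tangent step
print(np.linalg.norm(sphere_retract(p, X) - sphere_exp(p, X)))
# ~3e-7: this particular retraction agrees with exp up to third order in |X|
```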
---
Faster methods
---
- In vector spaces:
- second-order methods (Newton)
- dynamic approximations of Hessian (quasi-Newton)
- "momentum", "adaptivity term"
- On manifold:
- First derivative is a vector field
- **How to differentiate vector fields**?
---
Vector fields
---
A **Bundle** is a triple $(E, \pi, M)$, $$E\xrightarrow{\pi} M$$
- $E$ -- "total space" (smooth mfd),
- $\pi$ -- "projection" (smooth map),
- $M$ -- "base space" (smooth mfd),
---
Vector fields
---
Tangent bundle:
$$E = TM = \bigcup_p T_p M,$$
$$\pi : X_{\gamma, p} \mapsto p.$$
---
Vector fields
---
A smooth **section** of a bundle $E\xrightarrow\pi M$
is a smooth map $$\sigma:M\to E$$ such that $$\pi\circ\sigma = \operatorname{id}_M$$
Set of all smooth sections is denoted $\Gamma(E)$.
---
Vector fields
---
A **vector field** is a smooth section of the tangent bundle $TM$.
- To each $p\in M$ assigns $X_p\in T_p M$
- but in a "smooth way"

Another example: the metric $g$ is a smooth section of the bundle of bilinear forms (a $(0,2)$-tensor field)
---
Covariant derivative
---
"Covariant derivative of vf $Y$ along $X$":
$$\nabla_XY$$
Wishlist:
1. $\nabla_{fX+Y}Z = f\nabla_XZ + \nabla_Y Z,$
2. $\nabla_X(Y+Z) = \nabla_XY + \nabla_X Z,$
3. $\nabla_X(fY) = X(f)Y + f\nabla_XY.$
---
Covariant derivative
---
In a chart $(U, x)$:
$$\nabla_X Y = \left(X^i\,\partial_i Y^k + \Gamma^k_{ij}\,X^i Y^j\right)\frac{\partial}{\partial x^k}$$
(the $\Gamma^k_{ij}$ are the **Christoffel symbols** of the connection)
---
Levi-Civita connection
---
- The unique **torsion-free**, **metric-compatible** connection:
- $\nabla_X Y - \nabla_Y X = [X, Y]$ (no torsion),
- $X\,g(Y, Z) = g(\nabla_X Y, Z) + g(Y, \nabla_X Z)$ (compatibility),
- in a chart: $\Gamma^k_{ij} = \frac{1}{2} g^{kl}\left(\partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij}\right).$
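Numerically, the Christoffel symbols of the Levi-Civita connection can be obtained from the metric alone; a toy sketch using finite differences, checked on the round metric of $S^2$ in spherical coordinates:

```python
import numpy as np

def christoffel(metric, x, h=1e-5):
    """Gamma^k_{ij} = 1/2 g^{kl} (d_i g_{jl} + d_j g_{il} - d_l g_{ij}),
    with metric derivatives taken by central differences (illustrative only)."""
    n = len(x)
    g_inv = np.linalg.inv(metric(x))
    dg = np.zeros((n, n, n))              # dg[l, i, j] = d_l g_{ij}
    for l in range(n):
        e = np.zeros(n); e[l] = h
        dg[l] = (metric(x + e) - metric(x - e)) / (2 * h)
    Gamma = np.zeros((n, n, n))           # Gamma[k, i, j]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                Gamma[k, i, j] = 0.5 * sum(
                    g_inv[k, l] * (dg[i, j, l] + dg[j, i, l] - dg[l, i, j])
                    for l in range(n))
    return Gamma

# round metric of S^2 in spherical coordinates (theta, phi): diag(1, sin^2 theta)
sphere_metric = lambda x: np.diag([1.0, np.sin(x[0]) ** 2])
G = christoffel(sphere_metric, np.array([np.pi / 3, 0.0]))
print(G[0, 1, 1], -np.sin(np.pi / 3) * np.cos(np.pi / 3))  # Gamma^theta_{phi phi}
print(G[1, 0, 1], 1 / np.tan(np.pi / 3))                   # Gamma^phi_{theta phi}
```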
---
Vector field along a curve
---
- $\gamma:I\to M$
- $X:I\to TM$ s.t. $X_\lambda \in T_{\gamma(\lambda)}M$
**Extendible** if there is $\tilde{X}\in\Gamma(TM)$ s.t.
$X_\lambda = \tilde{X}_{\gamma(\lambda)}.$
- Thm: $\nabla_{\dot\gamma}\tilde{X}$ along $\gamma$ does not depend on the choice of local extension $\tilde{X}$
---
Parallel translate
---
If $X:I \to TM$ is a vf along $\gamma$
extendible to $\tilde{X}$
and
$$\nabla_{\dot\gamma}\tilde{X} = 0~\text{along}~\gamma$$
then $X$ is called **autoparallely-transported**
- "Transports" $X_0$ in $T_{\gamma(0)}M$ to $X_\lambda\in T_{\gamma(\lambda)}M$
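On the sphere, parallel transport along a geodesic also has a closed form: the component of $X_0$ orthogonal to the velocity stays put, the rest rotates in the plane of the geodesic; a toy sketch:

```python
import numpy as np

def sphere_transport(p, u, v):
    """Parallel-transport v in T_p S^{n-1} along the geodesic exp_p(t u), t in [0, 1]."""
    theta = np.linalg.norm(u)
    if theta < 1e-12:
        return v
    u_hat = u / theta
    a = np.dot(u_hat, v)                   # component of v along the velocity
    return v + a * ((np.cos(theta) - 1.0) * u_hat - np.sin(theta) * p)

p = np.array([1.0, 0.0, 0.0])
u = np.array([0.0, np.pi / 2, 0.0])        # quarter-circle geodesic towards e_2
v = np.array([0.0, 0.3, 0.4])              # tangent at p
q = np.array([0.0, 1.0, 0.0])              # exp_p(u)
w = sphere_transport(p, u, v)
print(w, np.dot(w, q))                     # w is tangent at q: <w, q> ~ 0
print(np.linalg.norm(w), np.linalg.norm(v))  # norms agree: transport is an isometry
```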
---
<center><img src="https://i.imgur.com/Wd5ofKf.png" class="img-responsive" width="75%" />
(Becigneul, Ganea)</center>
---
Vector transport
---
- In general, no closed-form parallel translation either
- "Vector transport" -- 1st-order approx (see Absil)
---
(here should go a qualitative derivation of vector transports for level sets; a sketch below)
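A minimal sketch of the cheapest such transport for the sphere-as-level-set: just orthogonally project the vector onto the tangent space at the target point (first-order accurate, does not preserve norms):

```python
import numpy as np

# projection-based vector transport for the sphere (level set of F(p) = |p|^2 - 1):
# move v in T_p M to T_q M by orthogonally projecting it onto the new tangent space
def transport_by_projection(q, v):
    n = q / np.linalg.norm(q)             # unit normal at q
    return v - np.dot(v, n) * n

p = np.array([1.0, 0.0, 0.0])
q = np.array([0.0, 1.0, 0.0])             # another point on the sphere
v = np.array([0.0, 0.3, 0.4])             # tangent at p
print(transport_by_projection(q, v))      # (0, 0, 0.4): tangent at q, norm not preserved
```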
---
Newton
---
_later addition_
> $f: \mathbb{R} \to \mathbb{R},$ find $x$ s.t. $f(x)=0$.
- Instead, replace $f$ with linear $h(x) = f(x_t) + f'(x_t)(x - x_t)$
- and solve $h(x)=0$:
- $x = x_t-f'(x_t)^{-1}f(x_t)$
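A toy sketch of this iteration (root of a made-up scalar equation):

```python
def newton_root(f, fprime, x, steps=10):
    """Newton's iteration for f(x) = 0: repeatedly solve the linearization."""
    for _ in range(steps):
        x = x - f(x) / fprime(x)
    return x

# toy example: the square root of 2 as the positive root of x^2 - 2
print(newton_root(lambda x: x ** 2 - 2, lambda x: 2 * x, x=1.0))  # ~ 1.41421356
```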
---
Newton
---
_later addition_
> $f: \mathbb{R}\to \mathbb{R},$ find $x$ s.t. $f'(x)=0$.
- Apply Newton's iteration to $f'$
- $x_{t+1} = x_t - f''(x_t)^{-1} f'(x_t)$
---
Newton
---
_later addition_
> $f: \mathbb{R}^n\to\mathbb{R}^m$, find $x$ s.t. $f(x)=0$.
- $\tilde{f}(x) = f(x_t) + f'(x_t)(x-x_t),$
- $\tilde{f}(x_{t+1}) = 0.$
---
Newton
---
_later addition_
> $f: \mathbb{R}^n\to\mathbb{R}$, find $x$ s.t. $f'(x)=0$.
- $\tilde{f'}(x) = f'(x_t) + \underbrace{f''(x_t)(x - x_t)}_{\in~{^n}\mathbb{R}},$
- $f''(x_t)(x - x_t) = -f'(x_t)$
---
Newton
---
Solve
$$\nabla_{X_t}\operatorname{grad}f(p_t)
= -\operatorname{grad}f(p_t)$$
for $X_t\in T_{p_t}M$.
$$p_{t+1} = \exp_{p_t}X_t.$$
---
Accelerated first-order methods
---
- Stochastic optimization
- Accelerated SGD?
- Adam: coordinate-wise adaptive learning rate (more or less)
- Parallel translation of "momentum"
"Coordinates" on manifolds?
---
Product manifolds
---
$$M = M_1\times \cdots\times M_n,$$
Product topology, product metric, ...
**RAdam**: adaptivity term for each product component!
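A rough usage sketch with geoopt; the task, sizes and hyperparameters are made up, and the exact API (`PoincareBall`, `ManifoldParameter`, `RiemannianAdam`) should be double-checked against the geoopt docs:

```python
import torch
import geoopt

# assumption: geoopt exposes PoincareBall, ManifoldParameter and RiemannianAdam
# roughly as below -- check the current geoopt documentation
ball = geoopt.PoincareBall(c=1.0)                        # constant-curvature component
emb = geoopt.ManifoldParameter(torch.randn(100, 2) * 1e-3, manifold=ball)
bias = torch.nn.Parameter(torch.zeros(2))                # an ordinary Euclidean part

opt = geoopt.optim.RiemannianAdam([emb, bias], lr=1e-2)

for _ in range(10):                                      # dummy training loop
    loss = ball.dist(emb[0], emb[1]).pow(2) + bias.pow(2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()                                           # per-component Riemannian update
```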
---
<center><img src="https://i.imgur.com/Wd5ofKf.png" class="img-responsive" width="75%" />
(Becigneul, Ganea)</center>
---
$\varkappa$-Stereographic space
---
- (Becigneul, Ganea) use RAdam to embed WordNet in a Poincaré ball (constant curvature $c$)
- Hyperbolic neural networks
- $c$ -- hyperparameter
- Optimize for $c$?
---
$\varkappa$-Stereographic space
---
- $S^\varkappa$ -- $\varkappa$-stereographic space, universal for positive and negative curvature ([Bloch](https://andbloch.github.io/K-Stereographic-Model/))
- "Almost $\mathbb{R}$-line bundle"
- $E\xrightarrow\pi \mathbb{R}$
- $\pi^{-1}(\varkappa) \simeq S^\varkappa$
- Can we induce metric, etc? OK for optimization?
{"metaMigratedAt":"2023-06-15T04:45:26.758Z","metaMigratedFrom":"YAML","title":"Riemannopt@HSE - SLIDES, pt. 1/2","breaks":false,"description":"Sketch of slides for the Riemann opts talk at HSE. View the slide with \"Slide Mode\".","slideOptions":"{\"transition\":\"fade\",\"allottedMinutes\":60,\"theme\":\"white\"}","contributors":"[{\"id\":\"5df6d742-95e4-4a3e-8568-902ecaabcfca\",\"add\":15267,\"del\":4364}]"}