<!-- .slide: data-transition="zoom" data-background-image="https://64.media.tumblr.com/8b7097b708d0091f3cb2391ad8819e68/tumblr_nlj1m60ATR1uqy1pwo2_400.gifv"-->
## A Wrapped Normal Distribution on Hyperbolic Space for Gradient-Based Learning
[`Yoshihiro Nagano`](https://ganow.me/),
[`Shoichiro Yamaguchi`](https://scholar.google.com/citations?user=YuBhA1wAAAAJ&hl=ja),
_et al_.
_presented by [Albert M. Orozco Camacho](https://x.com/tropdeep___)_
---
<!-- .slide: data-transition="zoom" data-background="red"-->
## Motivation
----
- **Hyperbolic geometry** has been shown helpful when capturing fundamental _structural properties_ of data.
- In various tasks, such as _machine translation_, it seems to improve **generalization performance** ([Gülçehre et al., 2019](https://arxiv.org/abs/1805.09786)).
- Euclidean space often fails to capture the inherent structural information of data;
- especially when a hierarchy of features is present.
Note:
- In the past, people have successfully embedded things like **WordNet** and other **tree-like structures**
in hyperbolic space.
- The number of features often grows exponentially in such hierarchies
- The main area of focus for the authors is the **target space**
----
- _Hyperbolic space_ is well suited to some high-dimensional tasks ([Sarkar, 2012](https://link.springer.com/chapter/10.1007/978-3-642-25878-7_34);
[Sala et al., 2018](https://arxiv.org/abs/1804.03329))!
- Hence, the authors' goal is to benefit from these ideas to design more informative _prior distributions_ in the Bayesian inference setting;
- namely, helping a Variational Auto-encoder (VAE) learn hierarchical structures.
----
<!-- .slide: data-background="white"-->
<img src="https://i.imgur.com/PP10FkM.png"
alt="clip performance"
width="40%"
height="50%"> <img src="https://thdaily.s3-us-west-1.amazonaws.com/gif_20200719232646.gif"
alt="clip performance"
width="40%"
height="50%">
Note:
- Here is to illustrate what kind of data/tasks we care about modelling.
----
<!-- .slide: data-background="yellow"-->
### Paper Contributions
----
- <!-- .element: class="fragment" --> Novel <b>Pseudo-hyperbolic Gaussian</b> distribution
- defined on tangent space and projected onto hyperbolic space
- <!-- .element: class="fragment" --> Usage of <i>pseudo-hyperbolic Gaussian</i> in gradient-based learning
- in particular, as a prior of a VAE
- <!-- .element: class="fragment" --> Applications to <i>word embeddings</i>
- along with competitive benchmarks on MNIST, Atari 2600 Breakout, and WordNet
---
<!-- .slide: data-transition="zoom" data-background="blue"-->
## Background
----
### Hyperbolic Geometry
----
<!-- .slide: data-background="white"-->
- Non-Euclidean geometry with a constant negative _Gaussian curvature_
- Can be visualized as the forward sheet of a two-sheeted hyperboloid
- Four common equivalent models:
1. Klein model
2. Poincaré disk model
3. Lorentz (hyperboloid) model
4. Poincaré half-plane
Note:
- Gaussian curvature: product of two principal curvatures
- In particular, there is a lot of previous work using the Poincaré disk as subject of study
----
<!-- .slide: data-transition="fade" data-background="white"-->
<p><a href="https://commons.wikimedia.org/wiki/File:Cylinder_-_hyperboloid_-_cone.gif#/media/File:Cylinder_-_hyperboloid_-_cone.gif"><img src="https://upload.wikimedia.org/wikipedia/commons/5/59/Cylinder_-_hyperboloid_-_cone.gif" alt="Cylinder - hyperboloid - cone.gif"></a></p>
----
<!-- .slide: data-background="white"-->
<img src="https://upload.wikimedia.org/wikipedia/commons/1/13/Uniform_tiling_73-t1_klein.png"
alt="clip performance"
width="30%"
height="50%"> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Poincare_disc_hyperbolic_parallel_lines.svg/300px-Poincare_disc_hyperbolic_parallel_lines.svg.png"
alt="clip performance"
width="30%"
height="50%">
<img src="https://upload.wikimedia.org/wikipedia/commons/1/1f/HyperboloidProjection.png"
alt="clip performance"
width="30%"
height="50%"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Poincare_halfplane_heptagonal_hb.svg/1920px-Poincare_halfplane_heptagonal_hb.svg.png"
alt="clip performance"
width="35%"
height="50%">
Note:
- In particular, there is a lot of previous work using the Poincaré disk as subject of study
----
> <!-- .element: class="fragment" --> This study uses the <b>Lorentz model</b>
> <!-- .element: class="fragment" --> <img src="https://upload.wikimedia.org/wikipedia/commons/1/1f/HyperboloidProjection.png" alt="clip performance" width="50%" height="50%">
Note:
- The Lorentz model comes with a simpler closed form of the _geodesics_
----
<!-- .slide: data-transition="zoom" data-background="white"-->
#### The Lorentz Model
$$
\mathbb{H}^n = \left\{\mathbf{z} \in \mathbb{R}^{n+1}:
\langle \mathbf{z}, \mathbf{z} \rangle_\mathcal{L} = -1,
z_0 > 0
\right\}
$$
where
$$
\langle \mathbf{z}, \mathbf{z'} \rangle_\mathcal{L} =
-z_0z_0' + \sum_{i=1}^{n}z_iz_i'
$$
----
<!-- .slide: data-transition="zoom" data-background="white"-->
### The Lorentz Model
- The Lorentzian inner product also functions as the metric tensor on hyperbolic space.
- The one-hot vector $\mathbf{\mu}_0 = [1, 0, \ldots, 0] \in \mathbb{H}^n \subset \mathbb{R}^{n+1}$
will be referred to as the origin of the hyperbolic space.
- The distance between two points $\mathbf{z}, \mathbf{z'} \in \mathbb{H}^n$ is given by
$$
d_\ell(\mathbf{z}, \mathbf{z'}) =
\text{arccosh}(-\langle \mathbf{z}, \mathbf{z'} \rangle_\mathcal{L})
$$
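To make the Lorentz-model operations concrete, here is a minimal NumPy sketch (our own helpers, not the authors' code) of the Lorentzian inner product and the induced distance:
```python
# A minimal NumPy sketch of the Lorentzian inner product and the induced
# distance on the Lorentz model H^n; helper names are our own.
import numpy as np

def lorentz_inner(z, zp):
    """<z, z'>_L = -z_0 * z'_0 + sum_i z_i * z'_i."""
    return -z[0] * zp[0] + np.dot(z[1:], zp[1:])

def lorentz_distance(z, zp):
    """d_l(z, z') = arccosh(-<z, z'>_L)."""
    # Clip guards against values slightly below 1 due to floating-point error.
    return np.arccosh(np.clip(-lorentz_inner(z, zp), 1.0, None))

# The origin mu_0 = [1, 0, ..., 0] satisfies <mu_0, mu_0>_L = -1.
mu0 = np.array([1.0, 0.0, 0.0])
assert np.isclose(lorentz_inner(mu0, mu0), -1.0)
assert np.isclose(lorentz_distance(mu0, mu0), 0.0)
```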
----
<!-- .slide: data-transition="zoom" data-background="white"-->
### Overview of _Pseudo-hyperbolic Gaussian_ Construction
1. Sample a vector from $\mathcal{N}(\mathbf{0}, \Sigma)$
2. _Transport_ the vector from $\mathbf{\mu}_0$ to $\mathbf{\mu}$ along the geodesic
3. Project the vector onto the surface
The result is the desired distribution $\mathcal{G}(\mathbf{\mu}, \Sigma)$.
Note:
- _Transportation_ of the tangent vector requires **parallel transport**
- Projection of the tangent vector to the surface requires the definition of _exponential map_
----
#### The Tangent Space (at $\mathbf{\mu}$) of a Hyperbolic Space
- Characterized as the set of points satisfying the orthogonality relation with respect to
the Lorentzian product:
$$
T_\mathbf{\mu}\mathbb{H}^n :=
\{\mathbf{u}: \langle \mathbf{u}, \mathbf{\mu} \rangle_\mathcal{L}
= 0\}
$$
- $T_{\mathbf{\mu}_0}\mathbb{H}^n$ consists of $\mathbf{v} \in \mathbb{R}^{n+1}$ with $v_0 = 0$, and
$$
\|\mathbf{v}\|_\mathcal{L} := \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle_\mathcal{L}}
= \|\mathbf{v}\|_2
$$
Note:
- $T_\mathbf{\mu}\mathbb{H}^n$ can be thought of as the tangent space of the
forward hyperboloid sheet at $\mathbf{\mu}$.
----
<!-- .slide: data-background="brown"-->
### Parallel Transport
----
<!-- .slide: data-transition="zoom" data-background="white"-->
For an arbitrary pair of points $\mathbf{\mu}, \mathbf{\nu} \in \mathbb{H}^n$:
- the **parallel transport** from $\mathbf{\nu}$ to $\mathbf{\mu}$ is defined as a map $\text{PT}_{\mathbf{\nu}\to\mathbf{\mu}}$,
- from $T_\mathbf{\nu}\mathbb{H}^n$ to $T_\mathbf{\mu}\mathbb{H}^n$,
- which carries a vector from $\mathbf{\nu}$ to $\mathbf{\mu}$ along the geodesic in a parallel manner, without changing its metric tensor, i.e.:
$$
\left\langle
\text{PT}_{\mathbf{\nu}\to\mathbf{\mu}}(\mathbf{v}),
\text{PT}_{\mathbf{\nu}\to\mathbf{\mu}}(\mathbf{v'})
\right\rangle_\mathcal{L} =
\langle \mathbf{v}, \mathbf{v'} \rangle_\mathcal{L}
$$
Note:
- Metric tensor: contains information on distances and angles, as well as inner products
----
<!-- .slide: data-transition="zoom" data-background="white"-->
The formula for the parallel transport on the Lorentz model is given by:
![](https://i.imgur.com/XtCcNDs.png)
where $\alpha = -\langle \mathbf{\nu}, \mathbf{\mu} \rangle_\mathcal{L}$.
_Inverse parallel transport_ $\text{PT}_{\mathbf{\nu}\to\mathbf{\mu}}^{-1}$ carries a vector in $T_\mathbf{\mu}\mathbb{H}^n$ back to $T_\mathbf{\nu}\mathbb{H}^n$ along the geodesic, that is,
![](https://i.imgur.com/dwAyQg5.png)
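A hedged sketch of the closed form above, reusing `lorentz_inner` from the earlier snippet; `parallel_transport` is our own helper name, not the authors' code:
```python
# Parallel transport on the Lorentz model:
# PT_{nu->mu}(v) = v + <mu - alpha * nu, v>_L / (alpha + 1) * (nu + mu),
# with alpha = -<nu, mu>_L.
def parallel_transport(v, nu, mu):
    """Carry tangent vector v from T_nu H^n to T_mu H^n along the geodesic."""
    alpha = -lorentz_inner(nu, mu)
    coef = lorentz_inner(mu - alpha * nu, v) / (alpha + 1.0)
    return v + coef * (nu + mu)

# The inverse transport simply swaps the roles of nu and mu:
# PT^{-1}_{nu->mu}(u) == parallel_transport(u, mu, nu).
```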
----
<!-- .slide: data-transition="zoom" data-background="white"-->
![](https://i.imgur.com/GoV48ne.png)
----
<!-- .slide: data-background="magenta"-->
### Exponential Map
----
<!-- .slide: data-transition="zoom" data-background="white"-->
- Recall that every $\mathbf{u} \in T_\mathbf{\mu}\mathbb{H}^n$ determines a unique maximal geodesic $\gamma_\mathbf{\mu}: [0,1] \to \mathbb{H}^n$
- with $\gamma_\mathbf{\mu}(0) = \mathbf{\mu}$
- and $\gamma_{\mathbf{\mu}}'(0) = \mathbf{u}$
- **Exponential Map** $\exp_\mathbf{\mu}: T_\mathbf{\mu}\mathbb{H}^n \to \mathbb{H}^n$ is defined by
- $\exp_\mathbf{\mu}(\mathbf{u}) = \gamma_{\mathbf{\mu}}(1)$
- the distance from $\mathbf{\mu}$ to the destination coincides with $\|\mathbf{u}\|_\mathcal{L}$
Note:
- A curve $\gamma$ is a geodesic if its derivative wrt domain is 0.
- An exponential map will help us to project a vector in a tangent space to its surface.
- Exponential map: from tangent space to the manifold.
----
<!-- .slide: data-transition="zoom" data-background="white"-->
For hyperbolic space, the map is given by
![](https://i.imgur.com/QVjrJzo.png)
- It can be shown that this map is _norm preserving_ in the sense that
- $d_\ell(\mathbf{\mu}, \exp_\mathbf{\mu}(\mathbf{u})) = \text{arccosh}\left(-\langle \mathbf{\mu}, \exp_\mathbf{\mu}(\mathbf{u})\rangle_\mathcal{L}\right) = \|\mathbf{u}\|_\mathcal{L}$
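A sketch of the exponential map, using the standard Lorentz-model closed form $\exp_\mathbf{\mu}(\mathbf{u}) = \cosh(\|\mathbf{u}\|_\mathcal{L})\mathbf{\mu} + \sinh(\|\mathbf{u}\|_\mathcal{L})\frac{\mathbf{u}}{\|\mathbf{u}\|_\mathcal{L}}$; the implementation is our own, not the authors' code:
```python
# Exponential map: project a tangent vector u at mu onto the hyperboloid H^n.
def exp_map(u, mu):
    r = np.sqrt(np.clip(lorentz_inner(u, u), 1e-12, None))  # r = ||u||_L
    return np.cosh(r) * mu + np.sinh(r) * u / r
```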
Note:
- Euler's formula!
----
<!-- .slide: data-transition="zoom" data-background="white"-->
To evaluate the density of a point on hyperbolic space we need to compute the inverse of the exponential map, that is, the _logarithmic map_:
![](https://i.imgur.com/KsKBWty.png)
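A matching sketch of the logarithmic map (our own code, under the usual closed form): with $\alpha = -\langle \mathbf{\mu}, \mathbf{z} \rangle_\mathcal{L}$, $\log_\mathbf{\mu}(\mathbf{z}) = \frac{\text{arccosh}(\alpha)}{\sqrt{\alpha^2 - 1}}(\mathbf{z} - \alpha\mathbf{\mu})$:
```python
# Logarithmic map: map a point z in H^n back to a tangent vector at mu.
def log_map(z, mu):
    alpha = np.clip(-lorentz_inner(mu, z), 1.0 + 1e-12, None)
    return np.arccosh(alpha) / np.sqrt(alpha ** 2 - 1.0) * (z - alpha * mu)
```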
Note:
- Logarithmic map is obtained by solving Equation 5 for $\mathbf{u}$.
----
<!-- .slide: data-transition="zoom" data-background="white"-->
Putting it all together...
![](https://i.imgur.com/tOIPu3t.png)
Note:
- (a) 1-dimensional Lorentz model with its tangent space
- (b) parallel transport carries a vector from the tangent space at the origin (in green) to a target tangent space (in blue) while preserving its magnitude
- (c) the exponential map projects the transported vector (in blue) to a target point on the hyperbolic model
---
<!-- .slide: data-transition="zoom" data-background="yellow"-->
## Contributions
----
### The Pseudo-Hyperbolic Gaussian Distribution
----
> Goal: construct $\mathcal{G}(\mathbf{\mu}, \Sigma)$ on hyperbolic space with $\mathbf{\mu} \in \mathbb{H}^n$ and positive definite $\Sigma$
1. Sample a vector $\mathbf{\tilde{v}} \in \mathbb{R}^n$ from $\mathcal{N}(\mathbf{0}, \Sigma)$.
2. Interpret $\mathbf{\tilde{v}}$ as an element of $T_{\mathbf{\mu}_0}\mathbb{H}^n \subset \mathbb{R}^{n+1}$ by rewriting $\mathbf{\tilde{v}}$ as $\mathbf{v} = [0, \mathbf{\tilde{v}}]$.
3. Parallel transport the vector $\mathbf{v}$ to $\mathbf{u} \in T_\mathbf{\mu}\mathbb{H}^n \subset \mathbb{R}^{n+1}$ along the geodesic from $\mathbf{\mu}_0$ to $\mathbf{\mu}$.
4. Map $\mathbf{u}$ to $\mathbb{H}^n$ by $\exp_\mathbf{\mu}$.
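Putting the pieces together, a hedged sketch of the sampling procedure, reusing the helpers sketched earlier (`lorentz_inner`, `parallel_transport`, `exp_map`); this is not the authors' code:
```python
# Sample from the pseudo-hyperbolic Gaussian G(mu, Sigma).
def sample_pseudo_hyperbolic_gaussian(mu, Sigma, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = mu.shape[0] - 1
    mu0 = np.zeros(n + 1); mu0[0] = 1.0                    # origin of H^n
    v_tilde = rng.multivariate_normal(np.zeros(n), Sigma)  # step 1: v~ ~ N(0, Sigma)
    v = np.concatenate([[0.0], v_tilde])                   # step 2: lift to T_{mu0} H^n
    u = parallel_transport(v, mu0, mu)                     # step 3: transport to T_mu H^n
    return exp_map(u, mu)                                  # step 4: project onto H^n
```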
----
<!-- .slide: data-transition="zoom" data-background="white"-->
![](https://i.imgur.com/mmFQDPI.png)
Note:
- The most prominent advantage of this construction is that we can compute the density of the probability distribution.
- Equations in the algorithm point to content in the Appendix.
----
#### Probability Density Function
----
- Note that both $\text{PT}_{\mathbf{\mu_0}\to\mathbf{\mu}}$ and $\exp_\mathbf{\mu}$ are differentiable functions that can be evaluated analytically.
- Thus, we can compute the density of $\mathcal{G}(\mathbf{\mu}, \Sigma)$ at $\mathbf{z} \in \mathbb{H}^n$ using a composition of differentiable functions:
- $\text{proj}_\mathbf{\mu} := \exp_\mathbf{\mu} \circ \text{PT}_{\mathbf{\mu_0}\to\mathbf{\mu}}$
----
<!-- .slide: data-transition="zoom" data-background="white"-->
- If $\mathbf{X}$ is a random variable with PDF $p(\mathbf{x})$, the $\log$-likelihood of $\mathbf{Y} = f(\mathbf{X})$ at $\mathbf{y} = f(\mathbf{x})$ can be expressed as
$$
\log p(\mathbf{y}) = \log p(\mathbf{x}) -
\log\det \left(
\frac{\partial f}
{\partial \mathbf{x}}
\right)
$$
- Hence we only need to evaluate the determinant of the Jacobian of the defined projection:
![](https://i.imgur.com/51hpUa4.png)
Note:
- $f$ is an invertible and continuous map.
----
<!-- .slide: data-background="white"-->
![](https://i.imgur.com/NyTB88w.png)
----
<!-- .slide: data-background="cyan"-->
![](https://i.pinimg.com/564x/95/2d/a0/952da03b73f0810c8d58c4087bccb509.jpg)
----
<!-- .slide: data-transition="fade"-->
- Note that by the chain rule property of determinants:
![](https://i.imgur.com/zQggud3.png)
- To evaluate $\partial \exp_\mathbf{\mu}(\mathbf{u})/\partial \mathbf{u}$, we take an orthonormal basis of the tangent space at $\mathbf{\mu}$ and compute the directional derivative of $\exp_\mathbf{\mu}$ with respect to it:
![](https://i.imgur.com/UAln0yo.png)
where $r = \|\mathbf{u}\|_\mathcal{L}$
----
<!-- .slide: data-transition="fade" data-background="white"-->
- Because the directional derivative in the direction of $\vec{\mathbf{u}}$ has magnitude 1, and because the components are orthogonal to each other, the norm of the directional derivative is given by
![](https://i.imgur.com/ga5RX12.png)
- Finally, observe that parallel transport is itself a _norm preserving map_; thus, we have that
$$
\det\left(
\frac{\partial \text{PT}_{\mathbf{\mu_0}\to\mathbf{\mu}}(\mathbf{v})}
{\partial \mathbf{v}}
\right) = 1.
$$
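A hedged sketch of the resulting log-density (our own code): invert the projection with the logarithmic map and the inverse parallel transport, then apply the change of variables, assuming the determinant works out to $(\sinh r / r)^{n-1}$ with $r = \|\mathbf{u}\|_\mathcal{L}$ as in the derivation above:
```python
from scipy.stats import multivariate_normal

def log_prob_pseudo_hyperbolic_gaussian(z, mu, Sigma):
    n = mu.shape[0] - 1
    mu0 = np.zeros(n + 1); mu0[0] = 1.0
    u = log_map(z, mu)                          # tangent vector at mu
    v = parallel_transport(u, mu, mu0)          # back to T_{mu0} H^n
    r = np.sqrt(np.clip(lorentz_inner(u, u), 1e-12, None))
    log_det = (n - 1) * np.log(np.sinh(r) / r)  # log|det| of the projection
    log_normal = multivariate_normal.logpdf(v[1:], mean=np.zeros(n), cov=Sigma)
    return log_normal - log_det
```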
Note:
- The whole evaluation of the log determinant can be computed in $O(n)$.
----
<!-- .slide: data-background="white"-->
![](https://i.imgur.com/Cd1pYlE.png)
Note:
- heatmaps of log-likelihood of the pseudo-Gaussian hyperbolic distribution, with various values of $\mathbf{\mu}$ and $\Sigma$
- the $\times$ mark designates the origin of the hyperbolic space
---
<!-- .slide: data-transition="zoom" data-background="green"-->
## Applications of $\mathcal{G}(\mathbf{\mu}, \Sigma)$
----
### Hyperbolic VAE
----
Consider the typical VAE setting, where the decoder $p_\theta(x \mid z)$ is trained along with the encoder $q_\phi(z \mid x)$ by maximizing the sum of the ELBO over the dataset:
![](https://i.imgur.com/4U1nafa.png)
Note:
- Recall that prior $p_\theta$ is the standard normal
- Posterior is _variationally_ estimated by a Gaussian
----
- **Hyperbolic VAE** sets the prior $p_\theta = \mathcal{G}(\mathbf{\mu_0}, \mathbf{I})$
- also _reparameterizes_ the posterior $q_\phi = \mathcal{G}(\mathbf{\mu}, \Sigma)$
- The output $\mathbf{\mu}$ of the encoder is guaranteed to lie in $\mathbb{H}^n$ by applying $\exp_\mathbf{\mu_0}$ to the final layer of the encoder (see the sketch after the figure):
![](https://i.imgur.com/M5S9oXa.png)
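A minimal sketch (not the authors' implementation) of the two hyperbolic ingredients of the VAE: wrapping the encoder's mean output onto $\mathbb{H}^n$ via $\exp_{\mathbf{\mu}_0}$, and a single-sample Monte-Carlo estimate of the KL term that relies on the analytic density sketched earlier; `wrap_encoder_mean` and `mc_kl_term` are our own names:
```python
def wrap_encoder_mean(h_last):
    """h_last: raw n-dimensional output of the encoder's final linear layer."""
    v = np.concatenate([[0.0], h_last])        # treat it as a tangent vector at mu_0
    mu0 = np.zeros(v.shape[0]); mu0[0] = 1.0
    return exp_map(v, mu0)                     # mu now lies on H^n

def mc_kl_term(z, mu, Sigma):
    """One-sample estimate of KL(q_phi || p_theta), with p_theta = G(mu_0, I)."""
    n = mu.shape[0] - 1
    mu0 = np.zeros(n + 1); mu0[0] = 1.0
    return (log_prob_pseudo_hyperbolic_gaussian(z, mu, Sigma)
            - log_prob_pseudo_hyperbolic_gaussian(z, mu0, np.eye(n)))
```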
----
### Word Embeddings
----
<!-- .slide: data-transition="zoom" data-background="white"-->
- The authors extend work by [Vilnis & McCallum (2015)](https://arxiv.org/abs/1412.6623) by changing the destination of the embedding map to the family of $\mathcal{G}$.
- Such a map attempts to capture the linguistic and contextual properties of the words in a dictionary.
<img src="https://i.imgur.com/wanS2yJ.png)"
alt="clip performance"
width="80%"
height="50%">
Note:
- The authors replace the Gaussians used in the KL-divergence term of the loss function with pseudo-hyperbolic Gaussians.
- Hence, words are projected to hyperbolic space and similarity is measured over there.
---
<!-- .slide: data-background="purple"-->
### Experiments
----
### Synthetic Binary Tree
----
<!-- .slide: data-transition="zoom" data-background="white"-->
- Data was constructed from a binary tree of depth 8.
- Given an initial set of binary representations $A_0$, a noisy set $A$ was generated by randomly flipping each coordinate value with probability $\epsilon = 0.1$ (a rough sketch follows below).
- Finally, these representations were projected to $\mathbb{R}^d$ using MLPs of depth 3 with 100 hidden units at each layer for the encoder and decoder of the VAE.
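A rough sketch of the noisy-copy step described above; the exact binary encoding of tree nodes and the number of copies per node are our own assumptions:
```python
def noisy_copies(A0, eps=0.1, n_copies=10, rng=None):
    """A0: (num_nodes, dim) binary array of tree-node representations."""
    rng = np.random.default_rng() if rng is None else rng
    copies = []
    for _ in range(n_copies):
        flips = rng.random(A0.shape) < eps                   # flip mask, prob. eps
        copies.append(np.logical_xor(A0, flips).astype(int))  # flip = XOR with mask
    return np.concatenate(copies, axis=0)
```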
----
<!-- .slide: data-transition="zoom" data-background="white"-->
![](https://i.imgur.com/IHMmGhk.png)
----
<!-- .slide: data-transition="zoom" data-background="white"-->
![](https://i.imgur.com/NwkuGUJ.png)
----
<!-- .slide: data-transition="zoom" data-background="white"-->
![](https://i.imgur.com/oZb9JIL.png)
Note:
- This one is visualized using the Poincaré ball.
----
### MNIST
----
<!-- .slide: data-background="white"-->
- A Hyperbolic VAE was applied to a binarized version of MNIST.
- An MLP of depth 3 with 500 hidden units at each layer was used for both the encoder and the decoder.
- This method outperformed the vanilla VAE when the latent dimension was small.
----
<!-- .slide: data-background="white"-->
![](https://i.imgur.com/7SXJTmg.png)
Note:
- Poincaré ball representations of the interpolations were produced on $\mathbb{H}^2$
----
<!-- .slide: data-background="white"-->
![](https://i.imgur.com/pYuHxeN.png)
----
### Atari 2600 Breakout
----
![](https://media.tenor.com/fbLp1bdUvqQAAAAC/heblushabus-atari.gif)
----
<!-- .slide: data-transition="zoom" data-background="white"-->
- In Deep RL the number of possible state-action trajectories grows exponentially with the time horizon.
- Such trajectories may be thought of as having a _tree-like structure_
- Hyperbolic VAE was applied to a set of trajectories that were explored by a trained policy during multiple episodes of Breakout.
- A Deep Q-Network was used with an epsilon-greedy ($\epsilon = 0.1$) strategy.
----
<!-- .slide: data-transition="zoom" data-background="white"-->
- Dataset consisted of 100K trajectories
- 80K were used for _training_
- 10K were used for _validation_
- 10K were used for _testing_
- Samples from the dataset:
![](https://i.imgur.com/zg4uHu6.png)
----
<!-- .slide: data-transition="zoom" data-background="white"-->
![](https://i.imgur.com/2znBPkG.png)
Note:
- TASK: Representation learning of state-action pairs.
- Number of blocks decreases gradually and consistently for Hyperbolic VAE
----
<!-- .slide: data-transition="zoom" data-background="white"-->
![](https://i.imgur.com/g26Ek2x.png)
Note:
- The cumulative reward up to a given state can be approximated well by the norm of a Hyperbolic VAE
- Correlation of latent representation for each state in the test set was computed w.r.t. cumulative reward, giving **0.8460**.
----
### Word Embeddings
----
<!-- .slide: data-transition="zoom" data-background="white"-->
- The authors trained on the WordNet nouns benchmark and evaluated reconstruction performance following the procedure proposed for Poincaré embeddings ([Nickel & Kiela, 2017](https://arxiv.org/abs/1705.08039)).
- They present a comparison with word embeddings using a Gaussian distribution on Euclidean space.
- The hyperbolic model outperformed its Euclidean counterpart when the latent space was _low dimensional_.
----
<!-- .slide: data-transition="zoom" data-background="white"-->
![](https://i.imgur.com/tFNSjjD.png)
---
<!-- .slide: data-transition="zoom" data-background="pink"-->
## Conclusions
----
- <!-- .element: class="fragment" --> A novel parameterization for the Gaussian density on hyperbolic space was proposed.
- <!-- .element: class="fragment" --> It can be differentiated and evaluated analytically.
- <!-- .element: class="fragment" --> This enabled gradient-based learning.
----
- <!-- .element: class="fragment" --> Experiments were successful on some word embedding settings.
- <!-- .element: class="fragment" --> Yet the last experiments suggest this approach could fail to scale to higher dimensions.
- <!-- .element: class="fragment" --> Much room left for the application of hyperbolic space!
{"metaMigratedAt":"2023-06-17T11:27:42.125Z","metaMigratedFrom":"YAML","title":"A Wrapped Normal Distribution on Hyperbolic Space for Gradient-Based Learning","breaks":true,"description":"Yoshihiro Nagano,Shoichiro Yamaguchi,et al.","contributors":"[{\"id\":\"adb0403f-b4e6-4ebc-be17-cc638e9f5cfe\",\"add\":27242,\"del\":6987}]"}