A Wrapped Normal Distribution on Hyperbolic Space for Gradient-Based Learning

--- title: A Wrapped Normal Distribution on Hyperbolic Space for Gradient-Based Learning tags: hyperbolic geometry, generative models, word embeddings, reinforcement learning ---  ## A Wrapped Normal Distribution on Hyperbolic Space for Gradient-Based Learning [`Yoshihiro Nagano`](https://ganow.me/), [`Shoichiro Yamaguchi`](https://scholar.google.com/citations?user=YuBhA1wAAAAJ&hl=ja), _et al_. _presented by [Albert M. Orozco Camacho](https://x.com/tropdeep___)_ ---  ## Motivation ---- - **Hyperbolic geometry** has been shown helpful when capturing fundamental _structural properties_ of data. - Various tasks, such as _machine translation_ seem to improve **generalization performance** ([Gülçehre et al., 2019](https://arxiv.org/abs/1805.09786)). - Euclidean space often fails to capture inherent structural information of data; - especially when a hierarchical number of features is present. Note: - People have embedded things like **WordNet**, or **tree-like structures** with success, in the past on hyperbolic space. - Number of features often grows exponentially - The main area of focus for the authors is the **target space** ---- - _Hyperbolic space_ comes adequate for some high-dimensional tasks ([Sarkar, 2012](https://link.springer.com/chapter/10.1007/978-3-642-25878-7_34); [Sala et al., 2018](https://arxiv.org/abs/1804.03329))! - Hence, the authors' goal is to benefit from these ideas to design more informative _prior distributions_ on the Bayesian inference setting; - namely, helping a Variational Auto-encoder (VAE) learn hierarchical structures. ----  <img src="https://i.imgur.com/PP10FkM.png" alt="clip performance" width="40%" height="50%"> <img src="https://thdaily.s3-us-west-1.amazonaws.com/gif_20200719232646.gif" alt="clip performance" width="40%" height="50%"> Note: - Here is to illustrate what kind of data/tasks we care about modelling. ----  ### Paper Contributions ---- -  Novel Pseudo-hyperbolic Gaussian distribution - defined on tangent space and projected onto hyperbolic space -  Usage of pseudo-hyperbolic Gaussian in gradient-based learning - in particular, as a prior of a VAE -  Applications to word embeddings - also competitive benchmarks against MNIST, Atari 2600 Breakout, and WordNet ---  ## Background ---- ### Hyperbolic Geometry ----  - Non-Euclidean geometry with a constant negative _Gaussian curvature_ - Can be visualized as the forward sheet of a two-sheeted hyperboloid - Four common equivalent models: 1. Klein model 2. Poincaré disk model 3. Lorentz (hyperboloid) model 4. Poincaré half-plane Note: - Gaussian curvature: product of two principal curvatures - In particular, there is a lot of previous work using the Poincaré disk as subject of study ----  <a href="https://commons.wikimedia.org/wiki/File:Cylinder_-_hyperboloid_-_cone.gif#/media/File:Cylinder_-_hyperboloid_-_cone.gif"><img src="https://upload.wikimedia.org/wikipedia/commons/5/59/Cylinder_-_hyperboloid_-_cone.gif" alt="Cylinder - hyperboloid - cone.gif"></a> ----  <img src="https://upload.wikimedia.org/wikipedia/commons/1/13/Uniform_tiling_73-t1_klein.png" alt="clip performance" width="30%" height="50%"> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Poincare_disc_hyperbolic_parallel_lines.svg/300px-Poincare_disc_hyperbolic_parallel_lines.svg.png" alt="clip performance" width="30%" height="50%"> <img src="https://upload.wikimedia.org/wikipedia/commons/1/1f/HyperboloidProjection.png" alt="clip performance" width="30%" height="50%"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Poincare_halfplane_heptagonal_hb.svg/1920px-Poincare_halfplane_heptagonal_hb.svg.png" alt="clip performance" width="35%" height="50%"> Note: - In particular, there is a lot of previous work using the Poincaré disk as subject of study ---- >  This study uses the Lorentz model >  <img src="https://upload.wikimedia.org/wikipedia/commons/1/1f/HyperboloidProjection.png" alt="clip performance" width="50%" height="50%"> Note: - The Lorentz model comes with a simpler closed form of the _geodesics_ ----  #### The Lorentz Model $$ \mathbb{H}^n = \left\{\mathbf{z} \in \mathbb{R}^{n+1}: \langle \mathbf{z}, \mathbf{z} \rangle_\mathcal{L} = -1, z_0 > 0 \right\} $$ where $$ \langle \mathbf{z}, \mathbf{z'} \rangle_\mathcal{L} = -z_0z_0' + \sum_{i=1}^{n}z_iz_i' $$ ----  ### The Lorentz Model - The Lorentzian inner product also functions as metric tensor on hyperbolic space. - The one-hot vector $\mathbf{\mu}_0 = [1, 0, \ldots, 0] \in \mathbb{H}^n \subset \mathbb{R}^{n+1}$ will be referred as the origin of the hyperbolic space. - The distance between two points $\mathbf{z}, \mathbf{z'} \in \mathbb{H}^n$ is given by $$ d_\mathcal{l}(\mathbf{z}, \mathbf{z'}) = \text{arccosh}(-\langle \mathbf{z}, \mathbf{z'} \rangle_\mathcal{L}) $$ ----  ### Overview of of _Pseudo-hyperbolic Gaussian_ Construction 1. Sample a vector from $\mathcal{N}(\mathbf{0}, \Sigma)$ 2. _Transport_ the vector from $\mathbf{\mu}_0$ to $\mathbf{\mu}$ along the geodesic 3. Project the vector onto the surface The result is the desired distribution $\mathcal{G}(\mathbf{\mu}, \Sigma)$. Note: - _Transportation_ of the tangent vector requires **parallel transport** - Projection of the tangent vector to the surface requires the definition of _exponential map_ ---- #### The Tangent Space (at $\mathbf{\mu}$) of a Hyperbolic Space - Characterized as the set of points satisfying the orthogonality relation with respect to the Lorentzian product: $$ T_\mathbf{\mu}\mathbb{H}^n := \{\mathbf{u}: \langle \mathbf{u}, \mathbf{\mu} \rangle_\mathcal{L} = 0\} $$ - $T_\mathbf{\mu}\mathbb{H}^n$ consists of $\mathbf{v} \in \mathbb{R}^{n+1}$ with $\mathbf{v}_0 = 0$, and $$ \|v\|_\mathcal{L} := \sqrt{\langle \mathbf{v}, \mathbf{v}_\mathcal{L} \rangle} = \|v\|_2 $$ Note: - $T_\mathbf{\mu}\mathbb{H}^n$ can be thought as the tangent space of the forward hyperboloid sheet at $\mathbf{\mu}$. ----  ### Parallel Transport ----  For an arbitrary pair of point $\mathbf{\mu}, \mathbf{\nu} \in \mathbb{H}^n$: - the **parallel transport** from $\mathbf{\nu}$ to $\mathbf{\mu}$ is defined as a map $\text{PT}_{\mathbf{\nu}\to\mathbf{\mu}}$, - from $T\mathbf{\nu}\mathbb{H}^n$ to $T\mathbf{\mu}\mathbb{H}^n$, - by carrying a vector from $\mathbf{\mu}$ to $\mathbf{\nu}$ along the geodesic in a parallel manner without changing its metric tensor, i.e.: $$ \left\langle \text{PT}_{\mathbf{\nu}\to\mathbf{\mu}}(\mathbf{v}), \text{PT}_{\mathbf{\nu}\to\mathbf{\mu}}(\mathbf{v'}) \right\rangle_\mathcal{L} = \langle \mathbf{v}, \mathbf{v'} \rangle_\mathcal{L} $$ Note: - Metric tensor: contains information on distances and angles, as well as, inner products ----  The formula for the parallel transport on the Lorentz model is given by: ![](https://i.imgur.com/XtCcNDs.png) where $\alpha = -\langle \mathbf{\nu}, \mathbf{\mu} \rangle_\mathcal{L}$. _Inverse parallel transport_ $\text{PT}_{\mathbf{\nu}\to\mathbf{\mu}}^{-1}$ carries the vector in $T\mathbf{\nu}\mathbb{H}^n$ back to $T\mathbf{\mu}\mathbb{H}^n$ along the geodesic, that is, ![](https://i.imgur.com/dwAyQg5.png) ----  ![](https://i.imgur.com/GoV48ne.png) ----  ### Exponential Map ----  - Recall that every $\mathbf{u} \in T\mathbf{\mu}\mathbb{H}^n$ determines a unique maximal geodesic $\gamma_\mathbf{\mu}: [0,1] \to \mathbb{H}^n$ - with $\gamma_\mathbf{\mu}(0) = \mathbf{\mu}$ - and $\gamma_{\mathbf{\mu}}'(0) = \mathbf{u}$ - **Exponential Map** $\exp_\mathbf{\mu}: T\mathbf{\mu}\mathbb{H}^n \to \mathbb{H}^n$ is defined by - $\exp_\mathbf{\mu}(\mathbf{u}) = \gamma_{\mathbf{\mu}}(1)$ - projection from $\mathbf{\mu}$ to destination coincides with $\|\mathbf{v}\|_\mathcal{L}$ Note: - A curve $\gamma$ is a geodesic if its derivative wrt domain is 0. - An exponential map will help us to project a vector in a tangent space to its surface. - Exponential map: from tangent space to the manifold. ----  For hyperbolic space, the map is given by ![](https://i.imgur.com/QVjrJzo.png) - It can be shown that this map is _norm preserving_ in the sense that - $d_\mathcal{l}(\mathbf{\mu}, \exp_\mathbf{\mu}(\mathbf{u})) = \text{arcosh}\left(-\langle \mathbf{\mu}, \exp_\mathbf{\mu}(\mathbf{u})\rangle_\mathcal{L}\right)$ Note: - Euler's formula! ----  To evaluate the density of a point on hyperbolic space we need to compute the inverse of the exponential map, that is, the _logarithmic map_: ![](https://i.imgur.com/KsKBWty.png) Note: - Logarithmic map is obtained by solving Equation 5 for $\mathbf{u}$. ----  Putting all together... ![](https://i.imgur.com/tOIPu3t.png) Note: - (a) 1-dimentional Lorentz model with its tangent space - (b) parallel transport carries a vector from in the origin tangent space (in green) to a target tangent space (in blue) while preserving the magnitude - ( c ) exponential map projects the mapped vector (in blue) to a target point in the hyperbolic model ---  ## Contributions ---- ### The Pseudo-Hyperbolic Gaussian Distribution ---- > Goal: construct $\mathcal{G}(\mathbf{\mu}, \Sigma)$ on hyperbolic space with $\mathbf{\mu} \in \mathbb{H}^n$ and positive definite $\Sigma$ 1. Sample a vector $\mathbf{\tilde{v}} \in \mathbb{R}^n$ from $\mathcal{N}(\mathbf{0}, \Sigma)$. 2. Interpret $\mathbf{\tilde{v}}$ as an element of $T\mathbf{\mu_0}\mathbb{H}^n \subset \mathbb{R}^{n+1}$ by rewritting $\mathbf{\tilde{v}}$ as $\mathbf{v} = [0, \mathbf{\tilde{v}}]$. 3. Parallel transport the vector $\mathbf{v}$ to $\mathbf{u} \in T\mathbf{\mu}\mathbb{H}^n \subset \mathbb{R}^{n+1}$ along the geodesic from $\mathbf{\mu_0}$ to $\mu$. 4. Map $\mathbf{u}$ to $\mathbb{H}^n$ by $\exp_\mathbf{\mu}$. ----  ![](https://i.imgur.com/mmFQDPI.png) Note: - The most prominent advantage of this construction is that we can compute the density of the probability distribution. - Equations in the algorithm point to content in the Appendix. ---- #### Probability Density Function ---- - Note that both $\text{PT}_{\mathbf{\mu_0}\to\mathbf{\mu}}$ and $\exp_\mathbf{\mu}$ are differentiable functions that can be evaluated analytically. - Thus, we can compute the density of $\mathcal{G}(\mathbf{\mu}, \Sigma)$ at $\mathbf{z} \in \mathbb{H}^n$ using a composition of differenctiable functions: - $\text{proj}_\mathbf{\mu} := \exp_\mathbf{\mu} \circ \text{PT}_{\mathbf{\mu_0}\to\mathbf{\mu}}$ ----  - If $\mathbf{X}$ is a random variable with PDF $p(\mathbf{x})$, the $\log$-likelihood of $Y = f(\mathbf{X})$ at $\mathbf{y}$ can be expressed as $$ \log_\mathbf{y} = \log p(\mathbf{x}) - \log\det \left( \frac{\partial f} {\partial \mathbf{x}} \right) $$ - Hence we only need to evaluate the determinant of the gradient of the defined projection: ![](https://i.imgur.com/51hpUa4.png) Note: - $f$ is an invertible and continuous map. ----  ![](https://i.imgur.com/NyTB88w.png) ----  ![](https://i.pinimg.com/564x/95/2d/a0/952da03b73f0810c8d58c4087bccb509.jpg) ----  - Note that by the chain rule property of determinants: ![](https://i.imgur.com/zQggud3.png) - To evaluate $\partial \exp_\mathbf{\mu}(\mathbf{u})/\partial \mathbf{u}$, we take an orthonormal basis of the tangent space of $\mathbf{\mu}$ and compute the directional derivative of $\exp_\mathbf{\mu}$ w.r.t to it (obs. $r = \|\mathbf{u}\|_\mathcal{L}$): ![](https://i.imgur.com/UAln0yo.png) where $r = \|\mathbf{u}\|_\mathcal{L}$ ----  - Because the directional derivative in the direction of $\vec{\mathbf{u}}$ has magnitude 1, and because each components are orthogonal to each other, the norm of the directional derivative is given by ![](https://i.imgur.com/ga5RX12.png) - Finally, observe that parallel transport is itself a _norm preserving map_, thus, we have that $$ \partial \text{PT}_{\mathbf{\mu_0}\to\mathbf{\mu}}(\mathbf{v})/ {\partial \mathbf{v}} = 1. $$ Note: - The whole evaluation of the log determinant can be computed on $O(n)$. ----  ![](https://i.imgur.com/Cd1pYlE.png) Note: - heatmaps of log-likelihood of the pseudo-Gaussian hyperbolic distribution, with various values of $\mathbf{\mu}$ and $\Sigma$ - the $\times$ mark designates the origin of the hyperbolic space ---  ## Applications of $\mathcal{G}(\mathbf{\mu}, \Sigma)$ ---- ### Hyperbolic VAE ---- Consider the typical VAE setting where decoder $p_\theta(x \mid z)$ is trained along encoder $q_\phi(z \mid x)$ by maximizing the sum of the ELBO: ![](https://i.imgur.com/4U1nafa.png) Note: - Recall that prior $p_\theta$ is the standard normal - Posterior is _variationally_ estimated by a Gaussian ---- - **Hyperbolic VAE** sets the prior $p_\theta = \mathcal{G}(\mathbf{\mu_0}, \mathbf{I})$ - also _reparameterizes_ the posterior $q_\phi = \mathcal{G}(\mathbf{\mu}, \Sigma)$ - It is assured that output $\mu$ of the encoder lives in $\mathbb{H}^n$ by applying $\exp_\mathbf{\mu_0}$ to the final layer of the encoder: ![](https://i.imgur.com/M5S9oXa.png) ---- ### Word Embeddings ----  - The authors extend work by [Vilnis & McCallum (2015)](https://arxiv.org/abs/1412.6623) by changing the destination of the defined map to the family of $\mathcal{G}$. - Such map attempts to extract linguistic and contextual properties of words into a dictionary. <img src="https://i.imgur.com/wanS2yJ.png)" alt="clip performance" width="80%" height="50%"> Note: - The authors change the distributions used to compute the KL-divergence to evaluate the learned loss function, into a pseudo-hyperbolic Gaussian. - Hence, words are projected to hyperbolic space and similarity is measured over there. ---  ### Experiments ---- ### Synthetic Binary Tree ----  - Data was constructed from a binary tree of depth 8. - Given a (initial) set of binary representations $A_0$, a generated set $A$ would be generated by randomly flipping each coordinate value with probability $\epsilon = 0.1$. - Finally, these probabilities are projected to a $\mathbb{R}^d$ space using a MLP of depth 3, and 100 hidden variables at each layer for encoder and decoder of the VAE. ----  ![](https://i.imgur.com/IHMmGhk.png) ----  ![](https://i.imgur.com/NwkuGUJ.png) ----  ![](https://i.imgur.com/oZb9JIL.png) Note: - This one is been visualized using a Poincaré ball. ---- ### MNIST ----  - A Hyperbolic VAE was applied to a binarized version of MNIST. - An MLP of depth 3 and 500 hidden units at each layer was used for both, encoder and decoder. - This method outperformed Vanilla VAE with small latent dimension. ----  ![](https://i.imgur.com/7SXJTmg.png) Note: - Poincaré ball representations of the interpolations were produced on $\mathbb{H}^2$ ----  ![](https://i.imgur.com/pYuHxeN.png) ---- ### Atari 2600 Breakout ---- ![](https://media.tenor.com/fbLp1bdUvqQAAAAC/heblushabus-atari.gif) ----  - In Deep RL the number of possible state-action trajectories grows exponentially with the time horizon. - Such trajectories may be thought as having a _tree-like structure_ - Hyperbolic VAE was applied to a set of trajectories that were explored by a trained policy during multiple episodes of Breakout. - A Deep Q-Network was used with an epsilon-greedy ($\epsilon = 0.1$) strategy. ----  - Dataset consisted of 100K trajectories - 80K were used for _training_ - 10K were used for _validation_ - 10K were used for _testing_ - Samples from the dataset: ![](https://i.imgur.com/zg4uHu6.png) ----  ![](https://i.imgur.com/2znBPkG.png) Note: - TASK: Representation learning of state-action pairs. - Number of blocks decreases gradually and consistently for Hyperbolic VAE ----  ![](https://i.imgur.com/g26Ek2x.png) Note: - The cumulative reward up to a given state can be approximated well by the norm of a Hyperbolic VAE - Correlation of latent representation for each state in the test set was computed w.r.t. cumulative reward, giving **0.8460**. ---- ### Word Embeddings ----  - Authors trained WordNet noun benchmarks and evaluated reconstruction performance following the procedure proposed with Poincaré embeddings ([Nickel & Kiela, 2017](https://arxiv.org/abs/1705.08039)). - The authors present a comparison with word embeddings with Gaussian distribution on Euclidean space. - Hyperbolic model outperformed the Euclidean counterpart when the latent space was _low dimensional_ ----  ![](https://i.imgur.com/tFNSjjD.png) ---  ## Conclusions ---- -  A novel parameterization for the Gaussian density on hyperbolic space was proposed. -  It can be differentiated and evaluated analytically. -  This enabled gradient-based learning. ---- -  Experiments were successful on some word embedding settings. -  Yet the last experiments suggest this approach could fail to scale to higher dimensions. -  Much room left for the application of hyperbolic space!

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.