The Lorentzian inner product, restricted to tangent spaces, also functions as the metric tensor on hyperbolic space.
The one-hot vector \(\mathbf{\mu}_0 = [1, 0, \ldots, 0] \in \mathbb{H}^n \subset \mathbb{R}^{n+1}\) will be referred to as the origin of the hyperbolic space.
The distance between two points \(\mathbf{z}, \mathbf{z'} \in \mathbb{H}^n\) is given by \[
d_\mathcal{L}(\mathbf{z}, \mathbf{z'}) =
\text{arccosh}(-\langle \mathbf{z}, \mathbf{z'} \rangle_\mathcal{L})
\]
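For example, for a point at geodesic distance \(t\) from the origin along the first coordinate axis, \(\mathbf{z} = [\cosh t, \sinh t, 0, \ldots, 0]\), we get \(\langle \mathbf{\mu}_0, \mathbf{z} \rangle_\mathcal{L} = -\cosh t\), so \(d_\mathcal{L}(\mathbf{\mu}_0, \mathbf{z}) = \text{arccosh}(\cosh t) = t\).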
Overview of the Pseudo-hyperbolic Gaussian Construction
Sample a vector from \(\mathcal{N}(\mathbf{0}, \Sigma)\)
Transport the vector from \(\mathbf{\mu}_0\) to \(\mathbf{\mu}\) along the geodesic
Project the vector onto \(\mathbb{H}^n\) via the exponential map
The result is the desired distribution \(\mathcal{G}(\mathbf{\mu}, \Sigma)\).
The Tangent Space (at \(\mathbf{\mu}\)) of a Hyperbolic Space
Characterized as the set of vectors satisfying the orthogonality relation with respect to the Lorentzian product: \[
T_\mathbf{\mu}\mathbb{H}^n :=
\{\mathbf{u}: \langle \mathbf{u}, \mathbf{\mu} \rangle_\mathcal{L}
= 0\}
\]
\(T_{\mathbf{\mu}_0}\mathbb{H}^n\) consists of \(\mathbf{v} \in \mathbb{R}^{n+1}\) with \(v_0 = 0\), and \[
\|\mathbf{v}\|_\mathcal{L} := \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle_\mathcal{L}}
= \|\mathbf{v}\|_2
\]
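Indeed, for \(\mathbf{v} = [0, v_1, \ldots, v_n]\): \[
\langle \mathbf{v}, \mathbf{\mu}_0 \rangle_\mathcal{L} = -0 \cdot 1 + \sum_{i=1}^{n} v_i \cdot 0 = 0,
\qquad
\langle \mathbf{v}, \mathbf{v} \rangle_\mathcal{L} = -0^2 + \sum_{i=1}^{n} v_i^2 = \|\mathbf{v}\|_2^2.
\]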
Parallel Transport
For an arbitrary pair of points \(\mathbf{\mu}, \mathbf{\nu} \in \mathbb{H}^n\):
the parallel transport from \(\mathbf{\nu}\) to \(\mathbf{\mu}\) is defined as a map \(\text{PT}_{\mathbf{\nu}\to\mathbf{\mu}}\),
from \(T_{\mathbf{\nu}}\mathbb{H}^n\) to \(T_{\mathbf{\mu}}\mathbb{H}^n\),
that carries a vector from \(\mathbf{\nu}\) to \(\mathbf{\mu}\) along the geodesic while preserving the metric, i.e. Lorentzian inner products.
On the Lorentz model, parallel transport is given by \[
\text{PT}_{\mathbf{\nu}\to\mathbf{\mu}}(\mathbf{v}) = \mathbf{v} + \frac{\langle \mathbf{\mu} - \alpha\mathbf{\nu}, \mathbf{v} \rangle_\mathcal{L}}{\alpha + 1}(\mathbf{\nu} + \mathbf{\mu})
\]
where \(\alpha = -\langle \mathbf{\nu}, \mathbf{\mu} \rangle_\mathcal{L}\).
The inverse parallel transport \(\text{PT}_{\mathbf{\nu}\to\mathbf{\mu}}^{-1}\) carries a vector in \(T_{\mathbf{\mu}}\mathbb{H}^n\) back to \(T_{\mathbf{\nu}}\mathbb{H}^n\) along the geodesic; that is, \(\text{PT}_{\mathbf{\nu}\to\mathbf{\mu}}^{-1} = \text{PT}_{\mathbf{\mu}\to\mathbf{\nu}}\).
Exponential Map
Recall that every \(\mathbf{u} \in T_{\mathbf{\mu}}\mathbb{H}^n\) determines a unique maximal geodesic \(\gamma_\mathbf{\mu}: [0,1] \to \mathbb{H}^n\)
with \(\gamma_\mathbf{\mu}(0) = \mathbf{\mu}\)
and \(\gamma_{\mathbf{\mu}}'(0) = \mathbf{u}\).
The Exponential Map \(\exp_\mathbf{\mu}: T_{\mathbf{\mu}}\mathbb{H}^n \to \mathbb{H}^n\) is defined by \(\exp_\mathbf{\mu}(\mathbf{u}) = \gamma_\mathbf{\mu}(1)\), which on the Lorentz model takes the form \[
\exp_\mathbf{\mu}(\mathbf{u}) = \cosh\left(\|\mathbf{u}\|_\mathcal{L}\right)\mathbf{\mu} + \sinh\left(\|\mathbf{u}\|_\mathcal{L}\right)\frac{\mathbf{u}}{\|\mathbf{u}\|_\mathcal{L}}
\]
To evaluate the density of a point on hyperbolic space we need to compute the inverse of the exponential map, that is, the logarithmic map: \[
\log_\mathbf{\mu}(\mathbf{z}) = \frac{\text{arccosh}(\alpha)}{\sqrt{\alpha^2 - 1}}\left(\mathbf{z} - \alpha\mathbf{\mu}\right), \qquad \alpha = -\langle \mathbf{\mu}, \mathbf{z} \rangle_\mathcal{L}
\]
Putting it all together…
Contributions
The Pseudo-Hyperbolic Gaussian Distribution
Goal: construct \(\mathcal{G}(\mathbf{\mu}, \Sigma)\) on hyperbolic space with \(\mathbf{\mu} \in \mathbb{H}^n\) and positive definite \(\Sigma\)
Sample a vector \(\mathbf{\tilde{v}} \in \mathbb{R}^n\) from \(\mathcal{N}(\mathbf{0}, \Sigma)\).
Interpret \(\mathbf{\tilde{v}}\) as an element of \(T_{\mathbf{\mu}_0}\mathbb{H}^n \subset \mathbb{R}^{n+1}\) by rewriting it as \(\mathbf{v} = [0, \mathbf{\tilde{v}}]\).
Parallel transport the vector \(\mathbf{v}\) to \(\mathbf{u} \in T_{\mathbf{\mu}}\mathbb{H}^n \subset \mathbb{R}^{n+1}\) along the geodesic from \(\mathbf{\mu}_0\) to \(\mathbf{\mu}\).
Map \(\mathbf{u}\) to \(\mathbb{H}^n\) by \(\exp_\mathbf{\mu}\) (sketched below).
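A minimal NumPy sketch of this sampling procedure, assuming the Lorentz-model formulas above (all function names here are illustrative, not from the authors' code):

```python
import numpy as np

def lorentz_inner(x, y):
    # Lorentzian inner product: <x, y>_L = -x_0 y_0 + sum_{i>=1} x_i y_i
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def parallel_transport(v, nu, mu):
    # PT_{nu -> mu}(v) = v + <mu - alpha*nu, v>_L / (alpha + 1) * (nu + mu)
    alpha = -lorentz_inner(nu, mu)
    return v + lorentz_inner(mu - alpha * nu, v) / (alpha + 1.0) * (nu + mu)

def exp_map(u, mu):
    # exp_mu(u) = cosh(r) mu + sinh(r) u / r, with r = ||u||_L
    r = np.sqrt(max(lorentz_inner(u, u), 1e-15))
    return np.cosh(r) * mu + np.sinh(r) * u / r

def sample_g(mu, sigma, rng=None):
    # Draw one sample from G(mu, sigma) following steps 1-4 above.
    rng = rng or np.random.default_rng()
    n = sigma.shape[0]
    v_tilde = rng.multivariate_normal(np.zeros(n), sigma)  # 1. sample in R^n
    v = np.concatenate(([0.0], v_tilde))                   # 2. lift to T_{mu0} H^n
    mu0 = np.concatenate(([1.0], np.zeros(n)))
    u = parallel_transport(v, mu0, mu)                     # 3. transport to T_mu H^n
    return exp_map(u, mu)                                  # 4. project onto H^n
```

A quick sanity check: any sample \(\mathbf{z}\) returned this way should satisfy \(\langle \mathbf{z}, \mathbf{z} \rangle_\mathcal{L} = -1\) up to numerical error.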
Probability Density Function
Note that both \(\text{PT}_{\mathbf{\mu_0}\to\mathbf{\mu}}\) and \(\exp_\mathbf{\mu}\) are differentiable functions that can be evaluated analytically.
Thus, we can compute the density of \(\mathcal{G}(\mathbf{\mu}, \Sigma)\) at \(\mathbf{z} \in \mathbb{H}^n\) through the composition of differentiable functions \(\text{proj}_\mathbf{\mu} = \exp_\mathbf{\mu} \circ \text{PT}_{\mathbf{\mu}_0\to\mathbf{\mu}}\):
If \(\mathbf{X}\) is a random variable with PDF \(p(\mathbf{x})\), the \(\log\)-likelihood of \(\mathbf{Y} = f(\mathbf{X})\) at \(\mathbf{y} = f(\mathbf{x})\) can be expressed as \[
\log p(\mathbf{y}) = \log p(\mathbf{x}) -
\log\det \left(
\frac{\partial f}
{\partial \mathbf{x}}
\right)
\]
Hence we only need to evaluate the determinant of the Jacobian of the defined projection:
Note that, by the chain rule and the multiplicativity of determinants, \[
\det\left(\frac{\partial \text{proj}_\mathbf{\mu}(\mathbf{v})}{\partial \mathbf{v}}\right) =
\det\left(\frac{\partial \exp_\mathbf{\mu}(\mathbf{u})}{\partial \mathbf{u}}\right)
\det\left(\frac{\partial \text{PT}_{\mathbf{\mu}_0\to\mathbf{\mu}}(\mathbf{v})}{\partial \mathbf{v}}\right)
\]
To evaluate \(\partial \exp_\mathbf{\mu}(\mathbf{u})/\partial \mathbf{u}\), we take an orthonormal basis of the tangent space at \(\mathbf{\mu}\) and compute the directional derivatives of \(\exp_\mathbf{\mu}\) along it, which yields \[
\det\left(\frac{\partial \exp_\mathbf{\mu}(\mathbf{u})}{\partial \mathbf{u}}\right) = \left(\frac{\sinh r}{r}\right)^{n-1}
\]
where \(r = \|\mathbf{u}\|_\mathcal{L}\).
Because the directional derivative in the direction of \(\vec{\mathbf{u}}\) has magnitude 1, and the remaining \(n-1\) orthogonal directions each contribute magnitude \(\sinh(r)/r\), the determinant above follows as the product of these magnitudes.
Finally, observe that parallel transport is itself a norm-preserving map; thus, \[
\det\left(\frac{\partial \text{PT}_{\mathbf{\mu}_0\to\mathbf{\mu}}(\mathbf{v})}{\partial \mathbf{v}}\right) = 1.
\]
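Continuing the NumPy sketch from the sampling section (reusing `lorentz_inner` and `parallel_transport`), the log-density can then be evaluated as follows; `log_map` implements the logarithmic map given earlier:

```python
from scipy.stats import multivariate_normal

def log_map(z, mu):
    # log_mu(z) = arccosh(alpha) / sqrt(alpha^2 - 1) * (z - alpha*mu),
    # with alpha = -<mu, z>_L (assumes z != mu; clamp alpha in practice).
    alpha = -lorentz_inner(mu, z)
    return np.arccosh(alpha) / np.sqrt(alpha**2 - 1.0) * (z - alpha * mu)

def log_prob_g(z, mu, sigma):
    # log G(z; mu, sigma) = log N(v; 0, sigma) - (n - 1) log(sinh(r) / r)
    n = sigma.shape[0]
    mu0 = np.concatenate(([1.0], np.zeros(n)))
    u = log_map(z, mu)                        # invert exp_mu
    v = parallel_transport(u, mu, mu0)        # invert PT_{mu0 -> mu}
    r = np.sqrt(lorentz_inner(u, u))
    base = multivariate_normal.logpdf(v[1:], mean=np.zeros(n), cov=sigma)
    return base - (n - 1) * np.log(np.sinh(r) / r)
```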
Applications of \(\mathcal{G}(\mathbf{\mu}, \Sigma)\)
Hyperbolic VAE
Consider the typical VAE setting, where a decoder \(p_\theta(x \mid z)\) is trained jointly with an encoder \(q_\phi(z \mid x)\) by maximizing the sum of the ELBO over the data: \[
\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p_\theta(z))
\]
Hyperbolic VAE sets the prior \(p_\theta(z) = \mathcal{G}(\mathbf{\mu}_0, \mathbf{I})\)
and reparameterizes the posterior as \(q_\phi(z \mid x) = \mathcal{G}(\mathbf{\mu}, \Sigma)\).
The output \(\mathbf{\mu}\) of the encoder is guaranteed to lie in \(\mathbb{H}^n\) by applying \(\exp_{\mathbf{\mu}_0}\) to the final layer of the encoder, as in the sketch below.
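A minimal sketch of such an encoder head, reusing the NumPy helpers above (`encoder_head` is an assumed name, not the authors'):

```python
def encoder_head(h):
    # Interpret the final-layer output h in R^n as a tangent vector at mu0;
    # at the origin, exp_{mu0} has the simple closed form below.
    r = max(np.linalg.norm(h), 1e-15)
    return np.concatenate(([np.cosh(r)], np.sinh(r) * h / r))
```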
Word Embeddings
The authors extend the work of Vilnis & McCallum (2015) by changing the codomain of the embedding map to the family of \(\mathcal{G}\) distributions.
Such a map attempts to capture linguistic and contextual properties of the words in a dictionary.
Experiments
Synthetic Binary Tree
Data was constructed from a binary tree of depth 8.
Given an initial set of binary representations \(A_0\), a set \(A\) was generated by randomly flipping each coordinate value with probability \(\epsilon = 0.1\) (see the sketch below).
Finally, these representations were projected into \(\mathbb{R}^d\) using an MLP of depth 3 with 100 hidden units at each layer for both the encoder and decoder of the VAE.
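A minimal sketch of the coordinate-flip corruption described above (assumed implementation, not from the paper):

```python
def corrupt(a0, eps=0.1, rng=None):
    # Flip each binary coordinate independently with probability eps.
    rng = rng or np.random.default_rng()
    flips = rng.random(a0.shape) < eps
    return np.where(flips, 1 - a0, a0)
```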
MNIST
A Hyperbolic VAE was applied to a binarized version of MNIST.
An MLP of depth 3 with 500 hidden units at each layer was used for both the encoder and the decoder.
This method outperformed the vanilla VAE at small latent dimensions.
Atari 2600 Breakout
In Deep RL the number of possible state-action trajectories grows exponentially with the time horizon.
Such trajectories may be thought of as having a tree-like structure.
Hyperbolic VAE was applied to a set of trajectories that were explored by a trained policy during multiple episodes of Breakout.
A Deep Q-Network was used with an epsilon-greedy (\(\epsilon = 0.1\)) strategy.
The dataset consisted of 100K trajectories:
80K were used for training
10K were used for validation
10K were used for testing
Samples from the dataset (figure omitted).
Word Embeddings
The authors trained on WordNet noun benchmarks and evaluated reconstruction performance following the procedure proposed for Poincaré embeddings (Nickel & Kiela, 2017).
They present a comparison with word embeddings based on Gaussian distributions in Euclidean space.
The hyperbolic model outperformed its Euclidean counterpart when the latent space was low-dimensional.
Conclusions
A novel parameterization for the Gaussian density on hyperbolic space was proposed.
It can be differentiated and evaluated analytically.
This enabled gradient-based learning.
Experiments were successful in some word-embedding settings.
Yet the word-embedding results suggest this approach could fail to scale to higher dimensions.
Much room left for the application of hyperbolic space!
A Wrapped Normal Distribution on Hyperbolic Space for Gradient-Based Learning. Yoshihiro Nagano, Shoichiro Yamaguchi, et al. Presented by Albert M. Orozco Camacho.
{"metaMigratedAt":"2023-06-17T11:27:42.125Z","metaMigratedFrom":"YAML","title":"A Wrapped Normal Distribution on Hyperbolic Space for Gradient-Based Learning","breaks":true,"description":"Yoshihiro Nagano,Shoichiro Yamaguchi,et al.","contributors":"[{\"id\":\"adb0403f-b4e6-4ebc-be17-cc638e9f5cfe\",\"add\":27242,\"del\":6987}]"}