# Reparametrising HMC
We have a variable $q$ whose distribution has covariance $\Sigma$. You can imagine a Gaussian for simplicity if you'd like. In HMC, the usual choice for the distribution of the auxiliary variable $p$ would be the isotropic Gaussian (with mass matrix $M = I$). To sample efficiently, we would like to either
1. sample a whitened $\tilde q$ with $\tilde p \sim N(0,I)$, then transform the samples back to $q$, or
2. sample $q$ directly, with a well-chosen $M$ so that $p \sim N(0,M)$ to produce efficient proposals.
In this note, we show that approach 1 gives the same result as approach 2.
## Position variable
Let us first establish the relationship between these transformed and non-transformed variables. We can transform our original variable with the following relation
$$q = L \tilde q \quad \text{or} \quad \tilde q = L^{-1} q,$$
where $L L^T = \Sigma$ from a Cholesky decomposition or otherwise. The whitened $\tilde q$ should then resemble an isotropic Gaussian.
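To make this concrete, here is a minimal numpy sketch (the covariance values are illustrative, not from any real model): whitened isotropic draws pushed through $L$ recover the target covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-d covariance with strong correlation.
Sigma = np.array([[4.0, 1.9],
                  [1.9, 1.0]])
L = np.linalg.cholesky(Sigma)  # lower triangular, L @ L.T == Sigma

# Whitened draws are isotropic; q = L @ q_tilde recovers the target scale.
q_tilde = rng.standard_normal((100_000, 2))
q = q_tilde @ L.T  # row-wise application of q = L @ q_tilde

print(np.cov(q, rowvar=False))  # approximately Sigma
```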
## Momentum variable
Given knowledge of $\Sigma$, or some approximation of it, from Neal's paper on HMC (page 22) we know the best choice of mass matrix is $M = \Sigma^{-1}$. The intuition for this choice is that a small mass in the direction of large variance allows for better exploration (see Stuart's Hilbert space HMC paper).
From our previous decomposition of $\Sigma$, we have $\Sigma^{-1} = (LL^T)^{-1} = L^{-T} L^{-1} = L^{-T} (L^{-T})^T$. If you'd like, you can call $C = L^{-T}$ and say that $M = C C^T$. Here, I will stick to just using $L$. Then, the momentum transform reads
$$ p = L^{-T} \tilde p \quad \text{or} \quad \tilde p = L^T p.$$
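A short numpy check of this transform (again with an illustrative $\Sigma$): pushing isotropic $\tilde p$ through $L^{-T}$ gives momenta whose covariance is approximately $\Sigma^{-1} = M$.

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[4.0, 1.9],
                  [1.9, 1.0]])
L = np.linalg.cholesky(Sigma)

# p = L^{-T} p_tilde, so p ~ N(0, Sigma^{-1}) when p_tilde ~ N(0, I).
p_tilde = rng.standard_normal((100_000, 2))
p = np.linalg.solve(L.T, p_tilde.T).T  # row-wise p = L^{-T} p_tilde

print(np.cov(p, rowvar=False))  # approximately Sigma^{-1} = M
print(np.linalg.inv(Sigma))
```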
## Target density and gradient
We want $q$ and $\tilde q$ to have proportional values under their respective target densities, that is to say
$$ \pi(q) =\pi(L \tilde q) \propto \tilde \pi (\tilde q).$$
The constant factor can be ignored, as it cancels out in the Hamiltonian (and hence the acceptance probability) calculations. For a Gaussian target, we have
$$-\log \pi(q) = \frac{1}{2} q^T \Sigma^{-1} q = \frac{1}{2} (L^{-1} q)^T (L^{-1}q) = \frac{1}{2} \tilde q^T \tilde q = -\log \tilde \pi(\tilde q).$$
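This identity is easy to verify numerically; a small sketch with the same illustrative $\Sigma$ (normalising constants dropped, as above):

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[4.0, 1.9],
                  [1.9, 1.0]])
L = np.linalg.cholesky(Sigma)

q = rng.standard_normal(2)
q_tilde = np.linalg.solve(L, q)  # q_tilde = L^{-1} q

neg_log_pi = 0.5 * q @ np.linalg.solve(Sigma, q)  # (1/2) q^T Sigma^{-1} q
neg_log_pi_tilde = 0.5 * q_tilde @ q_tilde        # (1/2) q_tilde^T q_tilde

print(np.isclose(neg_log_pi, neg_log_pi_tilde))   # True
```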
Similarly, we want both gradients to check out. Since the Jacobian of the transform is $\nabla_q \tilde q = L^{-1}$, the chain rule gives
$$- \nabla_q \log \pi(q) = - \nabla_q \log \tilde \pi(\tilde q) = (\nabla_q \tilde q)^T \left(-\nabla_{\tilde q} \log \tilde \pi(\tilde q)\right) = L^{-T} \left(-\nabla_{\tilde q} \log \tilde \pi(\tilde q)\right).$$
Note the extra $L^{-T}$ factor in front of the whitened target gradient. This means we should not expect the gradient values to match if we naively compare the gradients of the two targets above. Nevertheless, we do not require them to be equal, as gradients do not enter the Metropolis-Hastings decision directly. We will see that this all works out nicely in the leapfrog integrator.
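For the Gaussian target, both gradients are available in closed form ($\nabla_q \log \pi(q) = -\Sigma^{-1} q$ and $\nabla_{\tilde q} \log \tilde \pi(\tilde q) = -\tilde q$), so we can check the extra $L^{-T}$ factor directly; a sketch with the same illustrative setup:

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[4.0, 1.9],
                  [1.9, 1.0]])
L = np.linalg.cholesky(Sigma)

q = rng.standard_normal(2)
q_tilde = np.linalg.solve(L, q)

grad = np.linalg.solve(Sigma, q)  # -grad_q log pi(q) = Sigma^{-1} q
grad_tilde = q_tilde              # -grad_{q_tilde} log pi_tilde(q_tilde) = q_tilde

# The two gradients differ, but exactly by the L^{-T} factor.
print(np.allclose(grad, np.linalg.solve(L.T, grad_tilde)))  # True
```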
## Leapfrog Integrator
Let us check that in each step of the leapfrog integrator, both approaches agree with each other. We first look at the momentum half-step update
\begin{align}
p_{1/2} &= p_0 + \frac{\epsilon}{2} \nabla_q \log \pi (q_0) \\
\Rightarrow
p_{1/2} &= p_0 + \frac{\epsilon}{2} (L^{-T}) \nabla_{\tilde q} \log \tilde \pi (\tilde q_0) \\
\Rightarrow
L^T p_{1/2} &= L^T p_0 + \frac{\epsilon}{2} \nabla_{\tilde q} \log \tilde \pi (\tilde q_0)\\
\Rightarrow
\tilde p_{1/2} &= \tilde p_0 + \frac{\epsilon}{2} \nabla_{\tilde q} \log \tilde \pi (\tilde q_0)
\end{align}
And so, we see that the momentum half-step update for $(q, p)$ (approach 2) implies the momentum half-step update for $(\tilde q, \tilde p)$ (approach 1). The derivation can be read in reverse, so the two are equivalent. Next, we consider the position full-step update
\begin{align}
q_1 &= q_0 + \epsilon \, \Sigma \, p_{1/2} \\
\Rightarrow
q_1 &= q_0 + \epsilon \, \Sigma \, L^{-T} \tilde p_{1/2} \\
\Rightarrow
L^{-1} q_1 &= L^{-1} q_0 + \epsilon \, L^{-1} \Sigma \, L^{-T} \tilde p_{1/2} \\
\Rightarrow
L^{-1} q_1 &= L^{-1} q_0 + \epsilon \, L^{-1}(L L^T) \, L^{-T} \tilde p_{1/2} \\
\Rightarrow
\tilde q_1 &= \tilde q_0 + \epsilon \, \tilde p_{1/2}. \\
\end{align}
The last momentum half-step works out in the same way as the first. Thus, we have shown that the two leapfrog integrators agree at every step, in the sense that we can transform between approaches and recover either parametrisation.
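As a sanity check, here is one full leapfrog step in both parametrisations for the illustrative Gaussian target used above (the step size is arbitrary); transforming the whitened trajectory back recovers the direct one.

```python
import numpy as np

rng = np.random.default_rng(4)
Sigma = np.array([[4.0, 1.9],
                  [1.9, 1.0]])
L = np.linalg.cholesky(Sigma)
eps = 0.1

q0 = rng.standard_normal(2)
p_tilde0 = rng.standard_normal(2)
q_tilde0 = np.linalg.solve(L, q0)    # q_tilde = L^{-1} q
p0 = np.linalg.solve(L.T, p_tilde0)  # p = L^{-T} p_tilde

def grad_log_pi(q):        # closed-form gradient of the Gaussian target
    return -np.linalg.solve(Sigma, q)

def grad_log_pi_tilde(qt): # gradient of the whitened (isotropic) target
    return -qt

# Approach 2: leapfrog in q with mass matrix M = Sigma^{-1} (so M^{-1} = Sigma).
p_half = p0 + 0.5 * eps * grad_log_pi(q0)
q1 = q0 + eps * Sigma @ p_half
p1 = p_half + 0.5 * eps * grad_log_pi(q1)

# Approach 1: leapfrog in the whitened variables with identity mass matrix.
pt_half = p_tilde0 + 0.5 * eps * grad_log_pi_tilde(q_tilde0)
qt1 = q_tilde0 + eps * pt_half
pt1 = pt_half + 0.5 * eps * grad_log_pi_tilde(qt1)

print(np.allclose(q1, L @ qt1))                    # True
print(np.allclose(p1, np.linalg.solve(L.T, pt1)))  # True
```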
## Bounce operator
In the bouncy sampler, there is an additional bounce operator. We label $\lambda := -\nabla_q \log \pi (q)$ and $\tilde \lambda := -\nabla_{\tilde q} \log \tilde \pi (\tilde q)$. From before, we know that $L^T \lambda = \tilde \lambda$. We start from the whitened parametrisation (which I find easier). The reflection operator reads
\begin{align}
\tilde R(\tilde \lambda, \tilde p) &= \tilde p - 2\frac{\langle \tilde \lambda, \tilde p \rangle}{ \langle \tilde \lambda, \tilde \lambda \rangle} \tilde \lambda \\
&= L^{T} p - 2\frac{\langle L^{T} \lambda, L^{T} p \rangle}{ \langle L^{T} \lambda, L^{T} \lambda \rangle} L^{T} \lambda \\
&= L^{T} \Bigg\{ p - 2\frac{\langle L^{T} \lambda, L^{T} p \rangle}{ \langle L^{T} \lambda, L^{T} \lambda \rangle} \lambda \Bigg\} \\
&= L^{T} \Bigg\{ p - 2\frac{ \lambda^T L L^T p }{ \lambda^T L L^T \lambda} \lambda \Bigg\} \\
&= L^{T} \Bigg\{ p - 2\frac{ \lambda^T \Sigma p }{ \lambda^T \Sigma \lambda} \lambda \Bigg\} \\
&= L^{T} \Bigg\{ p - 2\frac{ \langle \lambda, \Sigma p \rangle}{ \langle \Sigma \lambda, \lambda \rangle} \lambda \Bigg\} \\
&= L^{T} \Bigg\{ p - 2\frac{ \langle \Sigma \lambda, p \rangle}{ \langle \Sigma \lambda, \lambda \rangle} \lambda \Bigg\} \\
&=: L^{T} R(\lambda, p).
\end{align}
Note that the penultimate equality holds since $\Sigma$ is symmetric. The reflection operator produces a momentum (or direction), and we see that $\tilde R = L^T R$ does obey the momentum transformation $\tilde p = L^T p$.
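A quick numerical check of this identity, using the closed-form Gaussian gradient for $\lambda$ and the same illustrative $\Sigma$ (the helper names are mine, not from any particular sampler implementation):

```python
import numpy as np

rng = np.random.default_rng(5)
Sigma = np.array([[4.0, 1.9],
                  [1.9, 1.0]])
L = np.linalg.cholesky(Sigma)

q = rng.standard_normal(2)
lam = np.linalg.solve(Sigma, q)    # lambda = -grad_q log pi(q) for the Gaussian
lam_tilde = L.T @ lam              # lambda_tilde = L^T lambda
p_tilde = rng.standard_normal(2)
p = np.linalg.solve(L.T, p_tilde)  # p = L^{-T} p_tilde

def reflect_whitened(lt, pt):
    # R~: ordinary Euclidean reflection in the whitened space.
    return pt - 2 * (lt @ pt) / (lt @ lt) * lt

def reflect(lam, p):
    # R: Sigma-weighted reflection in the original space (last line above).
    Sl = Sigma @ lam
    return p - 2 * (Sl @ p) / (Sl @ lam) * lam

print(np.allclose(reflect_whitened(lam_tilde, p_tilde),
                  L.T @ reflect(lam, p)))  # True
```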