# Tuning $l$ in HMC
Beskos et al (2013) state that the cost of computing a single proposal is $\mathcal{O}(d^{5/4})$, where $d$ is the number of dimensions. This cost is minimized (over $l$) by maximising the efficiency function $eff(l) = a(l)\, l$, where $a$ is the acceptance probability and $l$ is the number of leapfrog steps. They show that in the $d$-asymptotic regime $a(l_{opt}) = 0.651$, which is target independent and in practice may be higher. (Betancourt et al (2015) show that this value can be relaxed to $0.6 < a(l_{opt}) < 0.9$.) Using the expected squared jumping distance (ESJD) as a measure of efficiency, they show that maximising ESJD is equivalent to maximising $eff$. The paper offers no practical suggestions for tuning $l$.
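For concreteness, here is a minimal sketch of the two quantities above: ESJD is just the mean squared distance between successive samples, and $eff(l) = a(l)\,l$ can be compared over a grid of $l$ values. The helper `run_hmc(l)` is hypothetical (anything that returns a chain and its empirical acceptance rate for a fixed $l$ would do); this is not a procedure from the paper, just an illustration of the definitions.

```python
import numpy as np

def esjd(chain):
    """Expected squared jumping distance of a chain of shape (n, d):
    the mean of ||theta_{t+1} - theta_t||^2 over successive samples."""
    diffs = np.diff(chain, axis=0)            # (n-1, d) successive jumps
    return np.mean(np.sum(diffs ** 2, axis=1))

def tune_l_by_efficiency(run_hmc, l_grid):
    """Pick l maximising eff(l) = a(l) * l, with a(l) the empirical acceptance rate.

    `run_hmc(l)` is a hypothetical helper returning (chain, acceptance_rate)
    for HMC run with a fixed number of leapfrog steps l.
    """
    effs = []
    for l in l_grid:
        _, accept_rate = run_hmc(l)
        effs.append(accept_rate * l)
    return l_grid[int(np.argmax(effs))]
```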
The NUTS algorithm (refresher [here](#NUT-sampler)) is known to compute an integration path longer than the one actually used. Wu et al (2019) aim to reduce this cost with their empirical HMC (eHMC), in which, at each iteration, $l$ is sampled from the empirical distribution of the longest NUTS path (details [here](#Learning-the-empirical-distribution-of-longest-NUTS-path)).
Hoffman et al (2021) point out the complexity of NUTS and the difficulty of running it in parallel on GPUs. By analysing the first-order autocorrelations of a jittered chain on a Gaussian example, they show that maximising ESJD detracts from the goal of minimizing autocorrelations: it spends extra work making sure estimates of high-variance parameters are better than those of low-variance parameters. Also,
> ESJD is also fooled by long trajectories’ ability to produce anti-correlated samples, which are good for estimating means but bad for estimating variances.
(this sentence is a little confusing to me)
Their results also suggest that we should jitter the total integration time $T$ (sometimes called the integration length), and they propose choosing $T$ in (jittered) HMC by maximising the Change in the Estimator of the Expected Square (ChEES) criterion. They propose using Adam for this optimisation and quasi-random Halton sequences for jittering $T$. The resulting ChEES-HMC algorithm adaptively tunes $T$ and the step size $\epsilon$, where $l = \lceil T / \epsilon \rceil$; these are kept the same across however many chains are being run. ChEES-HMC often beats NUTS and is competitive with grid-searching $T$.
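A rough sketch of how the ChEES objective might be estimated across chains is below. As I read Hoffman et al (2021), the criterion is $\tfrac{1}{4}\,\mathbb{E}\big[(\|\theta' - \mathbb{E}\theta\|^2 - \|\theta - \mathbb{E}\theta\|^2)^2\big]$, with $\theta'$ the end point of a jittered trajectory started at $\theta$; the centring mean here is just the cross-chain mean of the current states, an illustrative simplification, and the `states`/`proposals` arrays and the way the Halton values scale $T$ are my assumptions, not their exact implementation.

```python
import numpy as np

def halton(n, base=2):
    """First n terms of the base-2 Halton (van der Corput) sequence in (0, 1)."""
    seq = np.zeros(n)
    for i in range(1, n + 1):
        f, x, k = 1.0, 0.0, i
        while k > 0:
            f /= base
            x += f * (k % base)   # digit-reversal in the given base
            k //= base
        seq[i - 1] = x
    return seq

def chees(states, proposals):
    """ChEES estimate: 0.25 * E[(||theta' - mean||^2 - ||theta - mean||^2)^2].

    states, proposals: arrays of shape (num_chains, d) holding the current
    states and the end points of the jittered trajectories.
    """
    center = states.mean(axis=0)                        # cross-chain mean
    r_new = np.sum((proposals - center) ** 2, axis=1)   # squared radii after
    r_old = np.sum((states - center) ** 2, axis=1)      # squared radii before
    return 0.25 * np.mean((r_new - r_old) ** 2)

# Jittered trajectory lengths for one adaptation step, e.g. T_i = h_i * T,
# with h_i drawn from the Halton sequence (an illustrative choice).
T = 1.0
trajectory_lengths = halton(8) * T
```

In the paper this criterion is then optimised with Adam; the sketch only shows the objective estimate and the Halton jitter.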
Yu et al (2020) suggest maximising the conditional entropy (CE) of a reversible Markov chain as a criterion for choosing the mass matrix and $T$. Further, they show empirically on a 1D toy example that maximising ESJD leads to periodic chains, while the CE criterion maintains ergodicity. For near-Gaussian cases, they prove that the optimal mass matrix under the CE criterion is indeed the inverse covariance of the target, and that the corresponding optimal integration time is $T_\text{opt} = (2m+1) \pi / 2,\ m \in \mathbb{N}$. They argue that reasonably Gaussian targets should benefit from this choice, and they choose $l$ with an adaptive HMC method: with $T = \pi/2$ fixed, $l$ is increased until the average acceptance per unit of computation stops improving.
Some questions I have about this work: the counterexample against ESJD, though it demonstrates periodicity of the chain, is not in fact a proper sampling method, so it is unclear whether the argument applies to properly designed samplers that use the ESJD criterion. Also, the choice of $T = \pi/2$ rests on the claim that a smaller integration time is preferable, which I find unmotivated. Lastly, maximising acceptance per unit of computation as a rule for choosing $l$ is also not very convincing.
## Appendix
### NUT Sampler
The No-U-Turn Sampler (NUTS) is an HMC method with tuning of the step size $\epsilon$ and of $l$ built in. The step size $\epsilon$ is tuned during a warm-up chain via dual averaging (Nesterov, 2009), after which it is fixed.
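The update equations and constants below are as I recall the dual-averaging scheme in the NUTS paper (Hoffman and Gelman), so treat this as a sketch rather than the canonical algorithm; `accept_stats` is the stream of per-iteration acceptance statistics $\alpha_m$, `target_accept` the desired value $\delta$, and `eps0` the initial step size.

```python
import numpy as np

def dual_averaging_step_size(accept_stats, eps0, target_accept=0.65,
                             gamma=0.05, t0=10, kappa=0.75):
    """Adapt the step size from per-iteration acceptance statistics.

    Follows (my reading of) the dual-averaging updates in Hoffman & Gelman:
        Hbar_m       = (1 - 1/(m+t0)) Hbar_{m-1} + (delta - alpha_m) / (m + t0)
        log eps_m    = mu - sqrt(m)/gamma * Hbar_m
        log epsbar_m = m^{-kappa} log eps_m + (1 - m^{-kappa}) log epsbar_{m-1}
    Returns the averaged step size epsbar at the end of the warm-up.
    """
    mu = np.log(10.0 * eps0)          # shrinkage target for log eps
    h_bar, log_eps_bar = 0.0, 0.0
    for m, alpha in enumerate(accept_stats, start=1):
        eta = 1.0 / (m + t0)
        h_bar = (1.0 - eta) * h_bar + eta * (target_accept - alpha)
        log_eps = mu - np.sqrt(m) / gamma * h_bar
        w = m ** (-kappa)
        log_eps_bar = w * log_eps + (1.0 - w) * log_eps_bar
    return np.exp(log_eps_bar)
```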
To tune $l$, NUTS looks to maximise ESJD by monitoring the U-turn criterion $(\theta' - \theta)^\top M^{-1} v' < 0$. Fulfilment of the U-turn criterion means that further leapfrog integration is undesirable, as the sampler would tend to revisit regions it has already explored, i.e. the sampler is doing a U-turn.
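In code, the check is just a dot product. A minimal sketch, with `inv_mass` standing in for $M^{-1}$ (a full NUTS implementation applies a check like this at both ends of the trajectory):

```python
import numpy as np

def is_u_turn(theta_start, theta_end, v_end, inv_mass):
    """U-turn criterion: (theta_end - theta_start)^T M^{-1} v_end < 0."""
    return (theta_end - theta_start) @ inv_mass @ v_end < 0.0

# Illustrative call with an identity mass matrix in 2D.
print(is_u_turn(np.zeros(2), np.array([1.0, 0.0]), np.array([-0.5, 0.2]), np.eye(2)))
```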
NUTS generates a leapfrog integration trajectory in the form of a balanced binary tree. To uphold reversibility, the tree is built (recursively) via repeated doubling, and construction terminates once the U-turn criterion is fulfilled. The final HMC proposal is then one of the leaves, chosen at random with weight proportional to the target density.
We particularly note that NUTS is rather complex to implement for non-experts (actually for experts too), which also means that adapting it is challenging. Moreover, its recursive nature makes parallelisation less than easy. (There is an iterative implementation of NUTS in the JAX-based library 'BlackJax', which ought to be more amenable to parallelisation, but it is still quite new/in development and I am not familiar with it.)
### Learning the empirical distribution of longest NUTS path
The empirical distribution of the longest NUTS path is learnt during a warm-up chain. At every sampler iteration, leapfrog integration continues until the U-turn criterion is fulfilled, and the number of leapfrog steps taken, $l^*$, contributes to the empirical distribution of $l$. Regardless of $l^*$, the next HMC proposal is computed using $l_0$ leapfrog steps: if $l^* < l_0$, then $l_0 - l^*$ further leapfrog steps must be run to arrive at the proposal; if $l^* > l_0$, then the last $l^* - l_0$ leapfrog steps simply do not contribute to it. (Using $l^*$ steps directly to generate the proposal would result in a biased sampler.) Overall, the warm-up chain generates HMC proposals with a fixed $l_0$ leapfrog steps while collecting empirical samples $l^*$ of $l$.
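A sketch of that warm-up loop is below, under the assumption that `sample_momentum()`, `leapfrog(theta, v)`, `accept_reject(theta, v, theta_prop, v_prop)` and `u_turned(theta_start, theta, v)` exist as the usual HMC building blocks (all hypothetical helpers, with the step size baked into `leapfrog`); a practical version would also cap the number of integration steps.

```python
import numpy as np

def ehmc_warmup(theta0, l0, num_warmup,
                sample_momentum, leapfrog, accept_reject, u_turned):
    """Warm-up chain: run plain HMC with a fixed l0 leapfrog steps while
    recording the U-turn path length l* at every iteration."""
    theta = theta0
    l_star_samples = []
    for _ in range(num_warmup):
        v = sample_momentum()
        theta_cur, v_cur = theta, v
        steps, l_star, proposal = 0, None, None
        # Keep integrating until BOTH the U-turn criterion has fired (giving l*)
        # AND at least l0 steps have been taken (giving the fixed-l0 proposal).
        while l_star is None or proposal is None:
            theta_cur, v_cur = leapfrog(theta_cur, v_cur)
            steps += 1
            if proposal is None and steps == l0:
                proposal = (theta_cur, v_cur)        # state after exactly l0 steps
            if l_star is None and u_turned(theta, theta_cur, v_cur):
                l_star = steps                       # sample for the empirical distribution of l
        l_star_samples.append(l_star)
        theta = accept_reject(theta, v, *proposal)   # ordinary Metropolis accept/reject
    return np.array(l_star_samples), theta
```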
## BibTeX citations
@misc{betancourt2015optimizing,
title={Optimizing The Integrator Step Size for Hamiltonian Monte Carlo},
author={M. J. Betancourt and Simon Byrne and Mark Girolami},
year={2015},
eprint={1411.6669},
archivePrefix={arXiv},
primaryClass={stat.ME}
}
@misc{yu2020maximum,
title={Maximum conditional entropy Hamiltonian Monte Carlo sampler},
author={Tengchao Yu and Hongqiao Wang and Jinglai Li},
year={2020},
eprint={1910.05275},
archivePrefix={arXiv},
primaryClass={stat.CO}
}
@Article{Nesterov2009,
author={Nesterov, Yurii},
title={Primal-dual subgradient methods for convex problems},
journal={Mathematical Programming},
year={2009},
month={Aug},
day={01},
volume={120},
number={1},
pages={221-259},
issn={1436-4646},
doi={10.1007/s10107-007-0149-x},
url={https://doi.org/10.1007/s10107-007-0149-x}
}
@misc{hoffman2011no,
title={The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo},
author={Matthew D. Hoffman and Andrew Gelman},
year={2011},
eprint={1111.4246},
archivePrefix={arXiv},
primaryClass={stat.CO}
}
@inproceedings{hoffman21adaptive,
title={An Adaptive-MCMC Scheme for Setting Trajectory Lengths in Hamiltonian Monte Carlo},
author={Hoffman, Matthew and Radul, Alexey and Sountsov, Pavel},
booktitle={Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
pages={3907--3915},
year={2021},
editor={Banerjee, Arindam and Fukumizu, Kenji},
volume={130},
series={Proceedings of Machine Learning Research},
month={13--15 Apr},
publisher={PMLR},
pdf={http://proceedings.mlr.press/v130/hoffman21a/hoffman21a.pdf},
url={http://proceedings.mlr.press/v130/hoffman21a.html}
}
@article{beskos13optimal,
author = {Alexandros Beskos and Natesh Pillai and Gareth Roberts and Jesus-Maria Sanz-Serna and Andrew Stuart},
title = {{Optimal tuning of the hybrid Monte Carlo algorithm}},
volume = {19},
journal = {Bernoulli},
number = {5A},
publisher = {Bernoulli Society for Mathematical Statistics and Probability},
pages = {1501 -- 1534},
keywords = {Hamiltonian dynamics, high dimensions, leapfrog scheme, optimal acceptance probability, squared jumping distance},
year = {2013},
doi = {10.3150/12-BEJ414},
URL = {https://doi.org/10.3150/12-BEJ414}
}
@unpublished{wu2019faster,
title={{Faster Hamiltonian Monte Carlo by Learning Leapfrog Scale}},
author={Wu, Changye and Stoehr, Julien and Robert, Christian},
url={https://hal.archives-ouvertes.fr/hal-01968770},
note={21 pages},
year={2019},
month=Jan,
hal_id={hal-01968770},
hal_version={v1},
}