###### tags: `one-offs` `sampling` `mcmc` `subsampling`
# Minibatch Markov Chain Monte Carlo
**Overview**: In this note, I informally review some aspects of the class of algorithms which combine standard Markov Chain Monte Carlo (MCMC) methodologies with minibatching techniques.
## Introduction
Markov Chain Monte Carlo (MCMC) is an algorithmic framework for approximately sampling from probability measures to which we only have limited access. For example, Random Walk Metropolis only requires pointwise access to the density of the target measure, up to a normalising constant. The Metropolis-Adjusted Langevin Algorithm and Hamiltonian Monte Carlo assume the same access, as well as pointwise access to the gradient of the logarithm of this density. MCMC algorithms differ in what form of access to the target they assume, and how they process that information. These 'modes of access' are often referred to as *oracles*.
As in many numerical endeavours, there are tradeoffs involved in the design of MCMC algorithms. For example, even when relatively strong oracles are available in principle, they may be quite computationally expensive to implement. As such, it is of interest to synthesise new oracles which are in some sense weaker, but cheaper.
## Minibatching Oracles
Going forward, we will assume that the density of the target measure admits some factorisation into individual components, i.e.
\begin{align}
\pi (\mathrm{d} x) = \mu (\mathrm{d} x) \cdot \prod_{i \in [N]} \psi_i (x),
\end{align}
where $\mu$ is some simple reference measure, and each $\psi_i$ is individually inexpensive to evaluate. For $B \subseteq [N]$, it will be useful to abbreviate $\psi_B (x) = \prod_{i \in B} \psi_i (x)$, and for $B = [N]$, we can simply write $\Psi(x)$.
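To fix ideas, here is a minimal sketch of such a factorised target in Python, assuming a one-dimensional parameter, a flat reference measure, and Gaussian likelihood factors for synthetic observations; all names here (`data`, `log_psi_i`, etc.) are illustrative choices, not part of any particular library.

```python
import numpy as np

# Hypothetical product-form target: each psi_i is a Gaussian likelihood
# factor for one synthetic observation, and mu is taken to be flat.
rng = np.random.default_rng(0)
N = 1000
data = rng.normal(loc=1.0, scale=2.0, size=N)  # synthetic observations

def log_psi_i(x, i):
    # log psi_i(x): the log-contribution of observation i
    return -0.5 * (data[i] - x) ** 2

def log_psi_B(x, B):
    # log psi_B(x) = sum_{i in B} log psi_i(x)
    return sum(log_psi_i(x, i) for i in B)

def log_Psi(x):
    # the full product, i.e. B = [N]
    return log_psi_B(x, range(N))
```

Working on the log scale, products of factors become sums, which is both numerically safer and more convenient for subsampling.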
In standard MCMC algorithms, proposal moves and accept-reject decisions are made on the basis of $\Psi$. In Minibatch MCMC, these operations are made on the basis of $\psi_B$ for 'minibatches' $B$ which satisfy $|B| \ll N$.
### Proposals
Langevin Monte Carlo is based around the idea of constructing a Markov chain whose dynamics emulate the Overdamped Langevin Diffusion,
\begin{align}
\mathrm{d} X_t = \nabla_x \log \left( \frac{\mathrm{d} \pi}{\mathrm{d} \lambda} \right) (X_t) \, \mathrm{d} t + \sqrt{2} \, \mathrm{d} W_t,
\end{align}
where $\lambda$ denotes Lebesgue measure. Assuming that $\mu = \lambda$, the drift term of this SDE then has the form
\begin{align}
\nabla_x \log \left( \frac{\mathrm{d} \pi}{\mathrm{d} \lambda} \right) (x) &= \nabla_x \log \left( \prod_{i \in [N]} \psi_i (x) \right) \\
&= \sum_{i \in [N]} \nabla_x \log \psi_i (x).
\end{align}
When $N$ is large, it is natural to approximate this term by subsampling, i.e.
\begin{align}
\sum_{i \in [N]} \nabla_x \log \psi_i (x) &\approx \frac{N}{|B|} \cdot \sum_{i \in B} \nabla_x \log \psi_i (x),
\end{align}
which allows for approximate proposals to be generated at a greatly-reduced cost. These approximations are typically relatively easy to analyse, as they are unbiased at the level of the gradient. The extra variance of the gradient estimator acts like additional injected noise, so this proposal corresponds roughly to inflating the diffusivity of the underlying process, by a factor which depends on both the minibatch size and the discretisation step-size. It bears mentioning that modifying the process in this way will typically impact its equilibrium properties, and this impact is not always easy to assess a priori.
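The subsampled-gradient proposal above can be sketched as a stochastic gradient Langevin step. The target below is an illustrative Gaussian-factor example (all names are assumptions for the sketch): the drift uses the unbiased estimate $(N/|B|) \sum_{i \in B} \nabla_x \log \psi_i(x)$, plus $\sqrt{2h}$-scaled Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative Gaussian-factor target; for this choice the posterior
# mean is simply data.mean().
N = 1000
data = rng.normal(loc=1.0, scale=2.0, size=N)

def grad_log_psi_i(x, i):
    # gradient of log psi_i(x) = -0.5 * (data[i] - x)^2
    return data[i] - x

def sgld_step(x, h, batch_size):
    # One unadjusted Langevin step with a subsampled, unbiased gradient:
    #   x' = x + h * (N/|B|) * sum_{i in B} grad log psi_i(x) + sqrt(2h) * xi
    B = rng.choice(N, size=batch_size, replace=False)
    grad_est = (N / batch_size) * sum(grad_log_psi_i(x, i) for i in B)
    return x + h * grad_est + np.sqrt(2.0 * h) * rng.normal()

# Run a short chain; after burn-in it should hover near data.mean().
x = 0.0
samples = []
for _ in range(2000):
    x = sgld_step(x, h=1e-4, batch_size=100)
    samples.append(x)
```

Note that no accept-reject correction is applied here, so the chain's invariant measure differs from the target, in a way depending on both $h$ and the minibatch size.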
Similar ideas are naturally adapted to other proposal schemes based on simulation of stochastic processes which involve gradient information, with similar associated risks.
### Accept-Reject Decisions
In Metropolis-Hastings algorithms, proposed moves are evaluated on the basis of a so-called acceptance probability, which typically requires computing the ratio of the density of $\pi$ at the current location $x$ to its density at the proposed new location $x'$. As with proposals, this can be prohibitively expensive for large $N$, and so various works have considered reduced-cost approximations. The simplest such approach would be to approximate
\begin{align}
\Psi(y) / \Psi (x) \approx \psi_B (y) / \psi_B (x)
\end{align}
for some $B \subseteq [N]$. While similarly natural, the analysis of this approximation is more challenging, due to the additional nonlinearity involved in converting this ratio into a binary accept-reject decision.
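The naive scheme can be sketched as follows, again for an illustrative Gaussian-factor target (all names are assumptions): a random-walk proposal is accepted using the minibatch log-ratio $\log \psi_B(y) - \log \psi_B(x)$ in place of the full $\log \Psi(y) - \log \Psi(x)$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative Gaussian-factor target, as elsewhere in this note.
N = 1000
data = rng.normal(loc=1.0, scale=2.0, size=N)

def log_psi_B(x, B):
    return -0.5 * np.sum((data[B] - x) ** 2)

def naive_minibatch_mh_step(x, step_size, batch_size):
    # Random-walk proposal; the full log acceptance ratio is replaced
    # by its minibatch counterpart. NB: this chain is biased, and does
    # not leave the target measure invariant.
    y = x + step_size * rng.normal()
    B = rng.choice(N, size=batch_size, replace=False)
    log_ratio = log_psi_B(y, B) - log_psi_B(x, B)
    if np.log(rng.uniform()) < log_ratio:
        return y  # accept
    return x      # reject

x = 0.0
chain = []
for _ in range(500):
    x = naive_minibatch_mh_step(x, step_size=0.1, batch_size=100)
    chain.append(x)
```

The nonlinearity mentioned above is visible in the comparison with the uniform variate: an unbiased estimate of the log-ratio does not yield an unbiased accept-reject decision.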
There are also methods which grow the set $B$ adaptively until the accept-reject decision can be made with sufficient confidence; one can imagine that such approaches are yet more challenging to analyse.
Additionally, there are refined analyses which tend to suggest that in order to control the bias induced by the additional stochasticity, this ratio should be penalised by an additional factor which explicitly accounts for the noisiness of the ratio estimator. So, at some level, it appears that there is something qualitatively different between using minibatching to approximate proposal moves and using it to approximate accept-reject decisions.
### Other Forms of Minibatching
There are forms of MCMC which do not use gradient-based proposals and which do not involve accept-reject decisions, e.g. Gibbs sampling. Minibatching techniques can still be used in these settings, but tend to involve more case-specific analyses. As such, I do not discuss them further at present.
## Costs and Tradeoffs
Without serious care, typical minibatching MCMC methods will not admit the desired invariant measure, even when some form of accept-reject filter is incorporated. As such, it is relatively common to cast the accept-reject step aside entirely, enabling a much simpler mathematical analysis, while perhaps further inflating the error of the method.
Typically, for a given minibatch size $|B|$ and desired error tolerance $\varepsilon$, theory suggests that one can find a discretisation step-size $h$ such that the process converges to an invariant measure $\hat{\pi}_{h, B}$ which is $\varepsilon$-close to $\pi$ in some distance of interest. I do not have a good sense for how relevant such existence results are in practice, or how they have been used to inform practical implementations of such methods.
A separate approach is to run coupled copies of the process with different minibatch sizes and step-sizes, and use the difference between the processes as a coarse estimator of the error relative to the full-batch, perfectly-discretised process. This is related to methods like Multilevel Monte Carlo.
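A toy version of this coupling idea can be sketched as follows (a crude illustration under stated assumptions, not Multilevel Monte Carlo proper): two stochastic-gradient Langevin chains share their Brownian increments and use nested minibatches, and the gap between them gives a coarse indication of the minibatch-induced error.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative Gaussian-factor target, as elsewhere in this note.
N = 1000
data = rng.normal(loc=1.0, scale=2.0, size=N)

def grad_est(x, batch):
    # subsampled gradient of the Gaussian-factor log-density
    return (N / len(batch)) * np.sum(data[batch] - x)

def coupled_gap(x0, h, n_steps, b_small, b_large):
    # Two SGLD chains driven by shared Brownian increments and nested
    # minibatches (a synchronous coupling); their final gap crudely
    # indicates the error attributable to the smaller minibatch.
    xs = xl = x0
    for _ in range(n_steps):
        B_large = rng.choice(N, size=b_large, replace=False)
        B_small = B_large[:b_small]     # nested: small batch inside large
        xi = rng.normal()               # shared Brownian increment
        xs = xs + h * grad_est(xs, B_small) + np.sqrt(2 * h) * xi
        xl = xl + h * grad_est(xl, B_large) + np.sqrt(2 * h) * xi
    return abs(xs - xl)

gap = coupled_gap(0.0, h=1e-4, n_steps=500, b_small=10, b_large=100)
```

Because the noise is shared, differences between the chains are driven by the gradient estimators alone, which is what makes the gap informative about the subsampling error rather than the Monte Carlo noise.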
## Outlook
Minibatch methods occupy an interesting place in the landscape of MCMC algorithms. It appears that they are able to offer affordable solutions to certain sampling-related problems, but perhaps not exactly the problems with which people have historically been concerned.
Roughly speaking, conventional MCMC research focuses carefully on ensuring convergence to the desired invariant measure, and prefers to sidestep approximations which induce a bias where possible. It seems that satisfactorily achieving these goals with minibatching methods is not so straightforward, despite substantial efforts.
In contrast, as solutions to statistical problems, things may be more positive than the MCMC perspective suggests. For context, it is well-understood that when using stochastic gradient methods for parameter estimation, it is only really interesting to solve the optimisation problem to within an error which matches the statistical error involved with the ideal estimator; high-accuracy estimators do not actually improve the situation so much, at least at the statistical level. One hopes that for minibatching MCMC methods, there may be a similar story to tell.
It also appears relatively clear that whether biased or otherwise, these methods are practically interesting, and will be used. If 'exact' methods are simply not feasible, but inexact, cheap methods are feasible, it is difficult to recommend the former class. As such, it seems inevitably relevant for theorists to characterise and control these biases, as pertains to practical estimation problems.
