###### tags: `sampling` `monte carlo` `one-offs`
# From Adaptive MCMC to Interacting Particle Systems
**Overview**: In this note, I will sketch out a way of mapping a conventional adaptive Markov Chain Monte Carlo algorithm onto an interacting particle system with similar features.
## Brisk Outline of MCMC
Markov Chain Monte Carlo (MCMC) is a strategy for approximately sampling from a probability measure $\pi$ of interest. In short, MCMC proceeds as follows:
1. Identify a Markov kernel $K$ which leaves $\pi$ invariant, in the sense that
\begin{align}
\int \pi(x) K (x \to y) dx = \pi(y).
\end{align}
2. Select an initial point in the state space, $x$.
3. Use $K$ to generate a Markov chain starting at $X_0 = x$, i.e. for $t \geqslant 1$, sample $X_t \sim K (X_{t-1} \to X_t)$.
Under appropriate assumptions, as $t$ grows, the marginal distribution of $X_t$ will tend to that of $\pi$, allowing for the estimation of expectations under $\pi$, as well as the resolution of other related tasks.
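As a minimal sketch of this recipe (all names here are illustrative), one can iterate a kernel which leaves a standard Gaussian invariant — here an autoregressive update $X_t = \rho X_{t-1} + \sqrt{1 - \rho^2} Z_t$, for which invariance is easy to verify directly:

```python
import numpy as np

def run_chain(kernel, x0, n_steps, rng):
    """Generate a Markov chain X_0, ..., X_n by iterating a transition kernel."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(kernel(xs[-1], rng))
    return np.array(xs)

# Toy kernel leaving N(0, 1) invariant: if X ~ N(0, 1), then
# rho * X + sqrt(1 - rho^2) * Z ~ N(0, 1) as well.
rho = 0.5
kernel = lambda x, rng: rho * x + np.sqrt(1 - rho**2) * rng.standard_normal()

rng = np.random.default_rng(0)
chain = run_chain(kernel, x0=0.0, n_steps=10_000, rng=rng)
```

Time-averages along `chain` then approximate expectations under the standard Gaussian.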
## Adaptive MCMC
As in many Monte Carlo tasks, even when one can establish strong theoretical guarantees on the limiting behaviour of an MCMC algorithm, the finite-sample behaviour may not always be satisfactory. In MCMC, the challenge is usually 'mixing', i.e. guaranteeing that the Markov chain $\{ X_t \}_{t \geqslant 0}$ explores the relevant support of the measure $\pi$ sufficiently well. A priori, a Markov kernel $K$ which leaves $\pi$ invariant need not do this especially well - for example, the trivial kernel $K (x \to y) = \delta (x, y)$ is certainly $\pi$-invariant, but does not move at all!
As such, instead of working with a fixed kernel $K$, it is common to consider a parametrised family of kernels $\mathbf{K} = \{ K_\theta \}_{\theta \in \Theta}$, each of which leaves $\pi$ invariant, and to identify the best kernel within that family as the chain proceeds. That is,
at each time $t \geqslant 1$:
1. Move to $X_t \sim K_{\theta_{t-1}} (X_{t-1} \to X_t)$.
2. Set $\theta_t = \text{Update} (\theta_{t-1}; \{X_s\}_{0\leqslant s \leqslant t} )$.
The $\text{Update}$ rule is typically designed so that $\theta_t$ converges as $t$ grows, ideally to the 'optimal' value of $\theta$, which will be task-dependent.
It bears noting that even if this convergence is successful, it does not immediately follow that the output of the algorithm is suitable for approximating the measure $\pi$. The sequence of $X_t$ no longer forms a Markov chain, and so the standard theory underpinning MCMC approaches does not directly apply. Fortunately, there is by now a well-developed theory which provides guidance on how to perform adaptation while retaining the desired guarantees. I will not address this further in this note, and will simply assume that 'all is well'.
## An Example: Adaptive Random Walk Metropolis-Hastings
It is useful to first consider a simple example of a parametrised family of kernels, and some associated update rules.
The standard Random Walk Metropolis-Hastings (RWMH) algorithm applies to sampling from measures $\pi$ which admit a density with respect to the Lebesgue measure on $\mathbf{R}^d$ for some $d \in \mathbf{N}$. The approach is to construct a $\pi$-invariant kernel as follows: let $\theta = (h, \Sigma)$, where $h > 0$, and $\Sigma$ is a positive-definite matrix. Then,
1. At $x$, 'propose a move to' $y \sim \mathcal{N} (y | x, h \cdot \Sigma)$.
2. Compute $\alpha (x \to y) = \min \left( 1, \pi(y) / \pi(x) \right)$; the proposal densities cancel from the usual Metropolis-Hastings ratio because the Gaussian proposal is symmetric.
3. With probability $\alpha (x \to y)$, move to $y$; otherwise, remain at $x$.
In basic implementations of the algorithm, one might fix $\Sigma$ to be the identity matrix.
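A hedged sketch of one such kernel in Python (the function and argument names are my own; passing a Cholesky factor of $\Sigma$ avoids refactorising at every step, and working with log-densities avoids underflow):

```python
import numpy as np

def rwmh_step(x, log_pi, h, Sigma_chol, rng):
    """One RWMH step. log_pi is the log-density of the target (up to an
    additive constant); Sigma_chol is a Cholesky factor of Sigma."""
    d = x.shape[0]
    # Propose y ~ N(x, h * Sigma).
    y = x + np.sqrt(h) * Sigma_chol @ rng.standard_normal(d)
    # log of min(1, pi(y) / pi(x)).
    log_alpha = min(0.0, log_pi(y) - log_pi(x))
    accepted = np.log(rng.uniform()) < log_alpha
    return (y if accepted else x), accepted
```

Iterating this function from any starting point yields a $\pi$-invariant chain for any fixed $(h, \Sigma)$.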
To tune this algorithm, there are two standard heuristics which are applied:
1. The matrix $\Sigma$ should approximate the covariance matrix of $x$ under the law of $\pi$, and
2. The step-size $h$ should be set so that a proportion $\alpha_* \approx 1/4$ of proposed moves are accepted.
Both of these heuristics are valid in some idealised settings, can be made to fail in stylised settings, and are seen to be broadly useful in general practice.
There are thus some natural update rules which one can consider to implement these heuristics (which can of course be tweaked and improved to varying degrees):
1. Use the output of the Markov chain so far to estimate the covariance matrix of $X$ under the law of $\pi$, and use this estimate as $\Sigma$.
2. If the average acceptance rate of the chain exceeds $\alpha_*$, then increase $h$. Likewise, if the average acceptance rate is less than $\alpha_*$, then decrease $h$.
* A common strategy here would be to set $\log h_t = \log h_{t-1} + \gamma_t ( \alpha_t - \alpha_*)$, with $\alpha_t = \alpha (X_t \to Y_t)$, and $\gamma_t$ some positive, decaying sequence.
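As an illustrative sketch of this feedback rule (the choice $\gamma_t = t^{-0.6}$ and the one-dimensional Gaussian target are assumptions for the example, not part of the recipe above), the update does stabilise the acceptance rate around $\alpha_*$ in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
log_pi = lambda x: -0.5 * x ** 2  # standard Gaussian target in one dimension
alpha_star = 0.25
log_h, x = 0.0, 0.0
accepts = []
for t in range(1, 5001):
    h = np.exp(log_h)
    # One RWMH step with the current step-size.
    y = x + np.sqrt(h) * rng.standard_normal()
    alpha = min(1.0, np.exp(log_pi(y) - log_pi(x)))
    if rng.uniform() < alpha:
        x = y
    # gamma_t = t^{-0.6}: positive, decaying, non-summable.
    log_h += t ** -0.6 * (alpha - alpha_star)
    accepts.append(alpha)
```

After a burn-in period, the running acceptance rate hovers around $\alpha_* = 0.25$, with $h$ settling accordingly.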
The first technique can be roughly characterised as estimating the target measure $\pi$. In this case, it takes place by estimating the moments of $\pi$. In other cases, it may be more elaborate, e.g. fitting a Gaussian mixture to $\pi$, expressing $\pi$ as a transformation of a simple distribution, and so on.
The second technique can be roughly characterised as an instance of feedback control. The parameter $h$ functions as a discretisation parameter, and observations from the output of the Markov chain are used to assess whether it is set appropriately, i.e. to assess whether the acceptance rate is at the right level. The mismatch between the observed acceptance rate and desired acceptance rate is then used to modify the parameter $h$, with an eye towards stabilising the acceptance rate at the desired level.
Broadly speaking, most adaptive MCMC strategies are based on some combination of these two principles. Sometimes, the distinction between the principles is not so clear. Sometimes, only one of the principles is relevant. Other times, it is not so clear how to use either. Nevertheless, they are a useful pair of tools to consider when thinking about how to adapt and improve an MCMC algorithm.
## Adaptation via Particle Systems
When the adaptation scheme is based around estimating the target measure $\pi$ in some way, it is usually the case that the optimal parameter $\theta$ can be represented as some functional $\Phi$ of the target measure $\pi$, i.e.
\begin{align}
\theta_* = \Phi ( \pi).
\end{align}
For example, in the RWMH case detailed above, one has that
\begin{align}
\Sigma_* = \text{Cov}_\pi \left[ X \right].
\end{align}
There, one makes the approximation
\begin{align}
\hat{\Sigma} = \text{Cov}_{\hat{\pi}} \left[ X \right],
\end{align}
where $\hat{\pi}$ is the empirical measure of the Markov chain path thus far. Indeed, with any approximation of $\pi$, generated by any means, this could be a sensible strategy. Using the empirical measure of the Markov chain is natural for an approximation which is formed sequentially, but this is not the only approach.
## Adaptive RWMH via Particle Systems
Consider the task of approximately sampling from $\pi$, given some approximation $\pi_0$. The following strategy might then be reasonable:
1. Draw samples $\{ x_0^a \}_{a = 1}^N \sim \pi_0$.
2. Define $\hat{\pi}_0$ to be the empirical measure of $\{ x_0^a \}_{a = 1}^N$.
3. Compute $\hat{\Sigma}_0 = \text{Cov}_{\hat{\pi}_0} \left[ X \right]$, initialise $h_0 > 0$, and write $\theta_0 = (h_0, \hat{\Sigma}_0)$.
4. Move the particles $\{ x_0^a \}_{a = 1}^N$ using the kernel $K \left(x \to y; \theta_0 \right)$ to obtain $\{ x_1^a \}_{a = 1}^N$.
5. For $t \geqslant 1$,
1. Define $\hat{\pi}_t$ to be the empirical measure of $\{ x_t^a \}_{a = 1}^N$.
2. Compute $\hat{\Sigma}_t = \text{Cov}_{\hat{\pi}_t} \left[ X \right]$.
3. Update $\log h_t = \log h_{t-1} + \gamma_t ( \alpha_t - \alpha_*)$, with $\alpha_t$ the average acceptance rate from the previous step.
4. Write $\theta_t = (h_t, \hat{\Sigma}_t)$.
5. Move the particles $\{ x_t^a \}_{a = 1}^N$ using the kernel $K \left(x \to y; \theta_t \right)$ to obtain $\{ x_{t + 1}^a \}_{a = 1}^N$.
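The steps above can be sketched in Python as follows (the function name, the jitter term, and the step-size schedule $\gamma_t = t^{-0.6}$ are illustrative choices of mine, not prescriptions):

```python
import numpy as np

def particle_arwmh(log_pi, x0, n_steps, h0=1.0, alpha_star=0.25, rng=None):
    """Adaptive RWMH on a population of particles. x0 is an (N, d) array of
    initial particles drawn from the approximation pi_0."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.array(x0, dtype=float)
    N, d = x.shape
    log_h = np.log(h0)
    for t in range(1, n_steps + 1):
        # Empirical covariance of the particle cloud, with a small jitter
        # to keep it positive-definite.
        Sigma = np.cov(x.T) + 1e-6 * np.eye(d)
        L = np.linalg.cholesky(Sigma)
        # Propose a move for every particle under the current kernel.
        y = x + np.exp(0.5 * log_h) * rng.standard_normal((N, d)) @ L.T
        log_alpha = np.minimum(0.0,
                               np.array([log_pi(yi) for yi in y])
                               - np.array([log_pi(xi) for xi in x]))
        accept = np.log(rng.uniform(size=N)) < log_alpha
        x[accept] = y[accept]
        # Feedback control of the step-size towards alpha_star.
        log_h += t ** -0.6 * (np.exp(log_alpha).mean() - alpha_star)
    return x
```

On a correlated or anisotropic Gaussian target, the adapted covariance allows the particle cloud to spread along the principal directions of $\pi$ far faster than an isotropic proposal would.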
Essentially, instead of running a single Markov chain which incrementally constructs an estimate of the target measure, one works with a population of particles from the outset, and tries to use them to approximate the measure as a collective. There are some subtleties associated with this approach (e.g. one ought to be careful to ensure that $\Sigma_t$ is positive-definite when the ambient dimension is large), but with fairly minor modifications, this resembles a functional sampling algorithm.
As ever, one should be cautious about what one can reasonably expect from such approaches in practice. It is best to focus on the path which led us to this method: we wanted to estimate the covariance matrix of our target measure while sampling from it. Some simple remarks are in order:
* If our estimate of the covariance matrix is very poor, then we should not expect to explore the target measure efficiently.
* If we are not exploring the target measure well, then we should not expect to form an accurate estimate of the covariance matrix.
In many cases, the feedback loop works in our favour, in the sense that an 'OK' estimate of the covariance matrix leads to reasonable exploration, and the quality of the covariance estimate improves accordingly. This is not always the case: for inhomogeneous targets and multimodal targets, the situation can be far less favourable. This is not so surprising, in the sense that these are generally difficult problems.
## Measure-Dependent Markov Processes
It is sometimes interesting to consider continuum limits of sampling algorithms. For example, in the algorithm described above, as the step-size $h$ tends to $0$, the dynamics of each particle can be approximated by the stochastic differential equation (SDE)
\begin{align}
dX_t^a = \Sigma_t \nabla \log \pi (X_t^a) \, dt + \sqrt{2 \Sigma_t} \, dW_t^a,
\end{align}
where $\{W_t^a\}_{a = 1}^N$ are independent Brownian motions of appropriate dimension. One might prefer to write $\mu_t$ for the empirical measure of the particles, which allows for the representation
\begin{align}
dX_t = \Sigma ( \mu_t ) \nabla \log \pi (X_t) \, dt + \sqrt{2 \Sigma ( \mu_t )} dW_t.
\end{align}
This framing emphasises that the evolution is 'measure-dependent' in this sense. The equations which govern the evolution of $\mu_t$ are then nonlinear in $\mu_t$ (in contrast to the case for standard SDEs), which is part of why SDEs of this form are sometimes referred to as 'nonlinear SDEs'.
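A hedged sketch of an Euler–Maruyama discretisation of this measure-dependent system (the function name and the jitter term are my own; note that the noise over a step of length $\mathrm{d}t$ has covariance $2 \Sigma(\mu_t) \, \mathrm{d}t$):

```python
import numpy as np

def nonlinear_sde_step(x, grad_log_pi, dt, rng):
    """One Euler-Maruyama step for all N particles: each drifts along
    Sigma(mu_t) grad log pi, with Sigma(mu_t) the empirical covariance."""
    N, d = x.shape
    # Sigma(mu_t), with jitter to keep the Cholesky factorisation stable.
    Sigma = np.cov(x.T) + 1e-8 * np.eye(d)
    # Noise with covariance 2 * Sigma * dt, i.e. sqrt(2 Sigma) dW per step.
    L = np.linalg.cholesky(2.0 * dt * Sigma)
    drift = np.array([Sigma @ grad_log_pi(xi) for xi in x])
    return x + dt * drift + rng.standard_normal((N, d)) @ L.T
```

For a standard Gaussian target ($\nabla \log \pi(x) = -x$), iterating this step drives the empirical covariance of the cloud towards the identity, as the continuum analysis suggests.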
## Conclusion
In this note, I have sketched out how a certain adaptive MCMC scheme can be mapped onto a corresponding interacting particle system with similar properties. One can carry out similar mappings for other adaptive schemes by identifying which functionals of the target measure are sufficient for the optimal parametrisation of the kernels in question, and then approximating those functionals using adaptive particle systems. It is an informative exercise to carry out some of these computations for oneself.
I have not said much here about the practical benefits of such approaches, nor of the theoretical properties. These have been explored to some extent in some particular cases, but are arguably less well-understood than conventional MCMC and Sequential Monte Carlo approaches. They are also not typically framed in terms of Adaptive MCMC, as the technical challenges are in some sense distinct.

Published on **HackMD**