###### tags: `one-offs` `monte carlo` `sampling` `multiple try`
# Multiple-Try Metropolis via Jump Processes
**Overview**: In this note, I describe the standard Multiple-Try Metropolis (MTM) algorithm for MCMC sampling, as well as a modified implementation with some desirable properties.
## Introduction
There is a neat { implementation / interpretation } of MTM which has a reduced cost relative to the standard version, is rejection-free, and is based around a continuous-time formulation involving jump processes. While the implementation which I have in mind is related to the paper of Gagnon-Maire-Zanella ([here](https://arxiv.org/abs/2211.11613)), it is not contained in it per se. I highlight also that this paper is very nice, and goes a decent way towards making MTM as good as it { can, should } be (which is still not necessarily fantastic in all cases).
## Basic Algorithm
A bit of recap first: MTM is an MCMC approach in which one makes a number of proposals, and then tries to select a 'real' proposal from among them, which is somehow of improved quality. In principle, this allows for the synthesis of better samplers from simple base proposals. It bears mentioning that without clever computing tricks (e.g parallelism) or other structure, this deserves skepticism; if you are only ever using the same, simple base proposals, then processing them by a selection step should not obviously be able to give you a sampler which is substantially more efficient. Nevertheless, sometimes these tricks do exist, and so the method is nominally of interest.
The simplest rule is to select a proposal move from $x$ to $y$ with probability proportional to $w\left(x,y\right)$ for some function $w$, independent of the other proposals, or how many there are. Thinking carefully, as the number of proposals grows, this is akin to an importance resampling approach to emulating the proposal which has density
\begin{align}
Q^{w}\left(x,\mathrm{d}y\right)=\frac{q\left(x,\mathrm{d}y\right)\cdot w\left(x,y\right)}{qw\left(x\right)}
\end{align}
Thus, the Multiple-Try approach allows for approximate sampling from an intractable proposal. There are then two key questions for the algorithm designer about the construction of $w$:
1. How do we make a good modified proposal $Q^{w}$?
2. How do we ensure that the importance resampling approach gives a good approximation to $Q^{w}$?
Both are addressed in the linked paper of Gagnon-Maire-Zanella.
Anyways, to study this algorithm, it is useful to consider the limiting MH algorithm which has proposal given by $Q^{w}$. Observe that the acceptance ratio
\begin{align}
r\left(x\to y\right) &=\frac{\pi\left(y\right)\cdot Q^{w}\left(y,x\right)}{\pi\left(x\right)\cdot Q^{w}\left(x,y\right)} \\
&=\frac{\pi\left(y\right)\cdot q\left(y,x\right)\cdot w\left(y,x\right)\cdot qw\left(x\right)}{\pi\left(x\right)\cdot q\left(x,y\right)\cdot w\left(x,y\right)\cdot qw\left(y\right)}
\end{align}
involves unknown quantities like $qw\left(x\right)$ and $qw\left(y\right)$. When implementing an MTM approximation of this algorithm, it will turn out to be natural to approximate these quantities by Monte Carlo simulation from $q\left(x,\cdot\right)$ and $q\left(y,\cdot\right)$, e.g.
\begin{align}
\left\{ y^{i}\right\} _{i\in\left[M\right]}\overset{\mathrm{iid}}{\sim}q\left(x,\cdot\right)\leadsto qw\left(x\right)\approx\frac{1}{M}\sum_{i\in\left[M\right]}w\left(x,y^{i}\right).
\end{align}
This then leads to an algorithm with a peculiar forward-backward structure (which is in fact common to a number of advanced MH algorithms; see [here](https://arxiv.org/abs/2012.14881) for more detail),
1. At $x$,
a) For $i\in\left[M\right]$, independently sample $y^{i}\sim q\left(x,\cdot\right)$.
b) Sample $\iota\sim\mathrm{Cat}\left(\left\{ w\left(x,y^{i}\right)\right\} \right)$ and set $y=y^{\iota}$.
c) For $i\in\left[M\right]\setminus\left\{ \iota\right\}$, independently sample $z^{i}\sim q\left(y,\cdot\right)$; set $z^{\iota}=x$.
d) Compute acceptance ratio
\begin{align}
r\left(x\to y;y^{\left[M\right]},z^{\left[M\right]}\right)=\frac{\pi\left(y\right)\cdot q\left(y,x\right)\cdot w\left(y,x\right)}{\pi\left(x\right)\cdot q\left(x,y\right)\cdot w\left(x,y\right)}\cdot\frac{\sum_{i\in\left[M\right]}w\left(x,y^{i}\right)}{\sum_{i\in\left[M\right]}w\left(y,z^{i}\right)}.
\end{align}
e) With probability $\alpha=\min\left\{ 1,r\right\}$, move to $y$, otherwise, remain at $x$.
If $w$ is chosen so that the map $\left(x,y\right)\mapsto\pi\left(x\right)\cdot q\left(x,y\right)\cdot w\left(x,y\right)$ is symmetric in its arguments, then the acceptance probability simplifies somewhat. The paper of Gagnon-Maire-Zanella (and Zanella's work elsewhere on 'balancing functions') provides further detail on why this is a desirable property for weights $w$ more broadly, i.e. it is not only nice because the formula becomes nicer.
## Modified Algorithm
An alternative perspective on the MTM approach is the following: suppose that the proposals can be reparametrised, in the following sense: for some probability measure $r$ which is independent of $x$,
\begin{align}
u\sim r, \, y=F\left(x,u\right)\implies\mathrm{Law}\left(y\mid x\right)=q\left(x,\cdot\right)
\end{align}
where $F$ is some function which is 'nice enough' in terms of smoothness, invertibility, etc. Then, instead of sampling the values of $y$, one can think of sampling the values of $u$, and then conduct a random walk on a synthetic graph, on which one can make moves of the form $x\mapsto F\left(x,u^{i}\right)$ or $x\mapsto F\left(\cdot,u^{i}\right)^{-1}\left(x\right)$, combined with periodically refreshing the values of the $u^{i}$ for the sake of ergodicity, etc. With the values of $u$ treated as fixed, this is then a random walk on a discrete graph.
This 'conditional discreteness' makes it much easier to put things into a continuous-time, rejection-free framework. Aside from the conceptual neatness, this allows for a 'forward-only' algorithm which costs only $N$ evaluations of the target per step, rather than the $2N$ which are required of the MTM, where $N$ is the number of candidate new states at any given time (note that $M$ and $N$ may be distinct in specific examples).
For example, in the MTM with random walk proposals, let $u$ denote the set of available increments for the random walk, with $F\left(x,u\right)=x+u$. Then, for each $u$ in the current pool of increments, jump from $x$ to $x\pm u$ at rate $\lambda \left(x;u,\pm\right) := \beta\left(\frac{\pi\left(x\pm u\right)}{\pi\left(x\right)}\right)$, where $\beta$ is a suitable 'balancing function' (see the paper for details.
As with standard jump process-based MH algorithms, there is no longer a rejection step per se; one simply waits a while, until a suitable new state reveals itself, and one then moves there. For standard MH, this is already a bit nice; for MTM it arguably gets nicer and nicer as $N$ grows. Additionally, it is not particularly challenging to enhance this methodology by introducing some { persistence / non-reversibility / 'momentum' } to the dynamics. Again, the perspective of a random walk on a synthetic discrete graph is helpful here. Some related ideas are explored in [a work of mine](https://arxiv.org/abs/1912.04681) together with Jacob Vorstrup Goldman, without an explicit connection to MTM.

Overview: In this note, I log some basic observations about diffusion-based generative models.

8/14/2023Overview: In this note, I describe some aspects of hierarchical structure in MCMC algorithms, and how they can be of theoretical and practical relevance.

8/9/2023Overview: In this note, I discuss a recurrent question which can be used to generate research questions about methods of all sorts. I then discuss a specific instance of how this question has proved fruitful in the theory of optimisation algorithms. Methods and Approximations A nice story is that when Brad Efron derived the bootstrap, it was done in service of the question “What is the jackknife an approximation to?”. I can't help but agree that there's something quite exciting about research questions which have this same character of ''What is (this existing thing) an approximation to?''. One bonus tilt on this which I appreciate is that there can be multiple levels of approximation, and hence many answers to the same question. One well-known example is gradient descent, which can be viewed as an approximation to the proximal point method, which can then itself be viewed as an approximation to a gradient flow. There are probably even more stops along the way here. In this case, there is even the perspective that from the perspective of mathematical theory, there may be at least as much to be gained by stopping off at the proximal point interpretation, as there is from the gradient flow perspective. My experience is that generalist applied mathematicians get to grips with the gradient flow quickly, but optimisation theorists can squeeze more out of the PPM formulation. There is thus some hint that using this 'intermediate' approximation can be particularly insightful in its own right. It would be interesting to collect more examples with this character.

5/22/2023Overview: In this note, I prove Hoeffding's inequality from the perspectives of martingales and convex ordering. The Basic Construction Let $-\infty<a<x<b<\infty$, and define a random variable $M$ with law $M\left(x;a,b\right)$ by \begin{align} M=\begin{cases} a & \text{w.p. }\frac{b-x}{b-a}\ b & \text{w.p. }\frac{x-a}{b-a}. \end{cases}

5/22/2023
Published on ** HackMD**