# Markov chain Monte Carlo algorithms with sequential proposals <small>(Park & Atchadé, 2019)</small>
$\newcommand{\diff}{\mathop{}\!\mathrm{d}}\newcommand{\gvn}{\,|\,}\newcommand{\set}[1]{#1}\newcommand{\space}[1]{\mathbb{#1}}\newcommand{\algebra}[1]{\mathcal{X}}\newcommand{\expc}{\mathbb{E}}\newcommand{\var}{\mathbb{V}}\newcommand{\range}[2][1]{#1{:}\mkern0.5mu#2}\newcommand{\indc}{𝟙}$
## Preliminaries
### Setup
Target distribution on a measurable space $(\space{X}, \algebra{X})$
$$
\bar{\pi}(\diff x) := \frac{1}{Z}\pi(x) \diff x
$$
Typically we wish to estimate expectations of one or more $\algebra{X}$-measurable functions $f$
$$ \bar{\pi}[f] = \int_{\mathbb{X}} f(x)\, \bar{\pi}(\diff x) $$
### Markov chain Monte Carlo
Let $P : \space{X} \times \algebra{X} \to [0,1]$ be a $\bar{\pi}$-invariant Markov kernel, i.e. satisfying
$$
\bar{\pi}(A) =
\int_{\mathbb{X}} P(x, A)\,\bar{\pi}(\diff x)
\quad \forall A \in \algebra{X}.
$$
A common special case satisfying this generalised balance condition is a $\bar{\pi}$-reversible Markov kernel which satisfies
$$
\int_{\set{B}} P(x, A)\,\bar{\pi}(\diff x) =
\int_{\set{A}} P(x, B)\,\bar{\pi}(\diff x)
\quad \forall A,B \in \algebra{X}.
$$
If for an arbitrary initial state distribution $\mu$ the Markov chain
$$
X_1 \sim \mu,
\quad X_m \sim P(X_{m-1}, \cdot) ~ \forall m \in \range[2]{M}
$$
is *ergodic*, then for any $\bar{\pi}$-integrable function $f$ the Markov chain Monte Carlo (MCMC) estimator
$$\hat{I}(f, X_{\range{M}}) = \frac{1}{M}\sum_{i=1}^M f(X_i)$$
is a consistent estimator of $\bar{\pi}[f]$, i.e.
$$\lim_{M\to\infty} \hat{I}(f, X_{\range{M}}) = \bar{\pi}[f] \quad \text{almost surely}.$$
For a Markov chain $X_{\range{M}}$ with kernel $P$ initialised from its stationary distribution, i.e. $\mu = \bar{\pi}$, we define the *asymptotic variance* for a $\algebra{X}$-measurable function $f$ as
$$\textrm{var}_\infty[f, P] := \lim_{M\to\infty} M \var\left[\frac{1}{M}\sum_{i=1}^M f(X_i)\right]$$
Note that the asymptotic variance, as a measure of the performance of an MCMC method, is distinct from measures of distributional convergence, which assess how quickly a chain converges to stationarity: a chain with low asymptotic variance may converge very slowly (or not at all) to stationarity, and vice versa.
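As a simple sanity check on this definition (my own illustration, not from the paper): for the ideal kernel $P(x, \cdot) = \bar{\pi}(\cdot)$ the chain is iid, so $\textrm{var}_\infty[f, P]$ equals the ordinary variance $\var_{\bar{\pi}}[f]$. This can be verified empirically by simulating many independent replicate chains:

```python
import random
import statistics

M, R = 200, 2000          # chain length, number of replicate chains
f = lambda x: x           # test function; target is a standard normal
rng = random.Random(0)

# "Perfect" kernel: P(x, .) = pi-bar(.), so X_1, ..., X_M are iid N(0, 1)
# and M * Var[mean of f(X_i)] should approach Var_pi[f] = 1.
means = []
for _ in range(R):
    chain = [rng.gauss(0.0, 1.0) for _ in range(M)]
    means.append(sum(f(x) for x in chain) / M)

avar_est = M * statistics.pvariance(means)   # should be close to 1 here
```

For correlated chains the same replicate-based recipe gives a crude empirical estimate of $\textrm{var}_\infty[f, P]$ at large $M$.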
### Metropolis--Hastings algorithm <small>(Metropolis et al., 1953; Hastings, 1970)</small>
Let $Q$ be a Markov kernel on $(\space{X},\algebra{X})$ with density $q : \space{X}\times\space{X} \to [0, \infty)$ with respect to the reference measure $\diff x$, i.e. $Q(y, \diff x) = q(y, x) \diff x$. Then the Markov kernel
$$
P_{MH}(x, \diff y) = Q(x, \diff y) \alpha(x, y) + \delta_x(\diff y)\underbrace{\int_{\space{X}} (1 - \alpha(x,u)) Q(x, \diff u)}_{\textrm{probability of rejecting proposal}}
$$
with
$$
\alpha(x, y) = \min\left(1, \frac{\pi(y)\,q(y,x)}{\pi(x)\,q(x,y)}\right)
$$
is a $\bar{\pi}$-reversible kernel for an arbitrary choice of the proposal kernel $Q$.
This can easily be proven by first observing that
$$
\alpha(x, y) q(x,y) \pi(x) =
\min\left(q(x,y) \pi(x), q(y,x) \pi(y) \right) =
\alpha(y, x) q(y,x) \pi(y)
$$
and so
\begin{align}
\int_B P_{MH}(x, A) \bar{\pi}(\diff x) &=
\int_B \int_A
q(x, y) \alpha(x, y) \, \frac{1}{Z}\pi(x) \diff y\diff x +
\int_{B \cap A}
\int_{\space{X}} (1 - \alpha(x,u)) Q(x, \diff u)
\,\frac{1}{Z}\pi(x)\diff x\\
&=
\int_A \int_B
q(y, x) \alpha(y, x) \, \frac{1}{Z}\pi(y) \diff x\diff y +
\int_{B \cap A}
\int_{\space{X}} (1 - \alpha(x,u)) Q(x, \diff u)
\,\frac{1}{Z}\pi(x)\diff x\\
&=
\int_A \int_B
q(x, y) \alpha(x, y) \, \frac{1}{Z}\pi(x) \diff y\diff x +
\int_{B \cap A}
\int_{\space{X}} (1 - \alpha(x,u)) Q(x, \diff u)
\,\frac{1}{Z}\pi(x)\diff x\\
&= \int_A P_{MH}(x, B) \bar{\pi}(\diff x)
\end{align}
We can generate a sample $X' \sim P_{MH}(X, \cdot)$ in practice as follows:
$Y \sim Q(X, \cdot)$
$\Lambda \sim \mathrm{unif}(0,1)$
if $\Lambda < \alpha(X, Y)$
$\quad X' \gets Y$
else
$\quad X' \gets X$
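As a concrete illustration, here is a minimal Python sketch of this accept/reject step (my own, not from the paper), using a symmetric Gaussian random-walk proposal so that $q(y, x) / q(x, y) = 1$ and the acceptance ratio reduces to $\pi(y)/\pi(x)$:

```python
import math
import random

def mh_step(x, log_pi, step=0.5, rng=random):
    # One Metropolis-Hastings update with a symmetric Gaussian
    # random-walk proposal, so the q(y, x) / q(x, y) factor cancels.
    y = x + rng.gauss(0.0, step)            # Y ~ Q(X, .)
    log_alpha = min(0.0, log_pi(y) - log_pi(x))
    if rng.random() < math.exp(log_alpha):  # Lambda < alpha(X, Y)
        return y                            # accept
    return x                                # reject: chain stays at x

# Example: sample from a standard normal target, pi(x) ∝ exp(-x^2 / 2).
log_pi = lambda x: -0.5 * x * x
x, chain = 0.0, []
for _ in range(1000):
    x = mh_step(x, log_pi)
    chain.append(x)
```

Working with $\log \pi$ rather than $\pi$ avoids numerical under/overflow for peaked or high-dimensional targets.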
### Metropolis--Hastings-Green algorithm <small>(Green, 1995; Geyer, 2003)</small>
If $\mathcal{F} : \mathbb{X} \to \mathbb{X}$ is an involution, i.e. $\mathcal{F} \circ \mathcal{F} = \textrm{id}$, then
$P_{MHG}(x, \diff y) = \delta_{\mathcal{F}(x)}(\diff y)\alpha(x) + \delta_x(\diff y) (1 - \alpha(x))$
with
$\alpha(x) =
\min\left(1, \frac{\pi(\mathcal{F}(x))}{\pi(x)} \left| \partial\mathcal{F}(x) \right| \right)$
where $\left|\partial\mathcal{F}(x)\right|$ denotes the Jacobian determinant of $\mathcal{F}$ at $x$, is a $\bar{\pi}$-reversible kernel. We can generate a sample $X' \sim P_{MHG}(X, \cdot)$ as follows:
$Y \gets \mathcal{F}(X)$
$\Lambda \sim \textrm{unif}(0,1)$
if $\Lambda < \frac{\pi(Y)}{\pi(X)}\left| \partial\mathcal{F}(X) \right|$ then
$\quad X' \gets Y$
else
$\quad X' \gets X$
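A minimal Python sketch of this update (illustrative, not from the paper). The example involution is $\mathcal{F}(x) = 1/x$ on $(0, \infty)$, which satisfies $\mathcal{F}(\mathcal{F}(x)) = x$ and has Jacobian $|\partial\mathcal{F}(x)| = 1/x^2$; note a single deterministic involution is not ergodic on its own, so in practice such kernels are composed with other kernels:

```python
import math
import random

def mhg_step(x, pi, F, jac_F, rng=random):
    # One Metropolis-Hastings-Green update for a deterministic involution F
    # with Jacobian determinant magnitude jac_F.
    y = F(x)
    alpha = min(1.0, pi(y) / pi(x) * abs(jac_F(x)))
    return y if rng.random() < alpha else x

# Example involution on (0, inf): F(x) = 1/x, so F(F(x)) = x,
# with Jacobian determinant |dF/dx| = 1/x^2.
F = lambda x: 1.0 / x
jac_F = lambda x: 1.0 / (x * x)
pi = lambda x: math.exp(-x)  # unnormalised Exp(1) density on (0, inf)

x_new = mhg_step(2.0, pi, F, jac_F)  # either 0.5 (accepted) or 2.0 (rejected)
```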
### Peskun-Tierney ordering <small>(Peskun, 1973; Tierney, 1998)</small>
Let $P_1$ and $P_2$ be two $\bar{\pi}$-reversible Markov kernels on $(\space{X}, \algebra{X})$. Then for any function $f$ that is square-integrable with respect to $\bar{\pi}$,
$$
P_1(x, A \setminus \lbrace x \rbrace) \geq
P_2(x, A \setminus \lbrace x \rbrace)
\quad \forall x \in \mathbb{X}, \forall A \in \mathcal{X}
\implies
\textrm{var}_\infty(f, P_1) \leq \textrm{var}_\infty(f, P_2)
$$
For a Metropolis--Hastings kernel $P_{MH}$,
$$
P_{MH}(x, A \setminus \lbrace x \rbrace) = \int_{A\setminus \lbrace x\rbrace} \alpha(x,y) Q(x, \diff y) =
\int_{A} \alpha(x,y) Q(x, \diff y)
$$
(assuming $Q(x, \lbrace x \rbrace) = 0$) is the probability of proposing and accepting a move from $x$ into $A$.
We can therefore improve the asymptotic variance of an MCMC estimator, relative to a base Metropolis--Hastings kernel, by redistributing the probability mass the base kernel assigns to rejections onto accepted proposals, by 'trying again' with further proposals.
## Delayed rejection method <small>(Tierney and Mira, 1999; Mira, 2001)</small>
An early instantiation of the idea of improving the asymptotic variance of a Metropolis--Hastings kernel by creating extra chances to accept with further proposals is the delayed rejection algorithm of Tierney and Mira.
Given a proposal kernel $Q(y, \diff x) = q(y, x) \diff x$, an initial proposal is first made as in the standard Metropolis--Hastings algorithm, and accepted with the same probability:
$Y_1 \sim Q(X, \cdot)$
$\Lambda_1 \sim \textrm{unif}(0,1)$
if $\Lambda_1 < \alpha(X, Y_1)$
$\quad X' \gets Y_1$
If the first proposal is rejected, however, rather than setting the next state to the current state, another proposal is sampled from $Q$, this time starting from the first proposal $Y_1$:
$Y_1 \sim Q(X, \cdot)$
$\Lambda_1 \sim \textrm{unif}(0,1)$
if $\Lambda_1 < \alpha(X, Y_1)$
$\quad X' \gets Y_1$
else
$\quad Y_2 \sim Q(Y_1, \cdot)$
$\quad\Lambda_2 \sim \textrm{unif}(0,1)$
$\quad$if $\Lambda_2 < \min\left( 1, \frac{\pi(Y_2) q(Y_2, Y_1) q(Y_1, X)}{\pi(X) q(X, Y_1) q(Y_1, Y_2)} \frac{1 - \alpha(Y_2, Y_1)}{1 - \alpha(X, Y_1)}\right)$
$\qquad X' \gets Y_2$
This procedure of making a new proposal when the previous one is rejected can be iterated as many times as desired. The probability of accepting the $n$th proposal given a sequence of proposed values $y_{\range{n}}$ and initial state $x = y_0$ is then given by
$$
\alpha_n(y_{\range[0]{n}}) =
\min\left(
1,
\frac{\pi(y_n)}{\pi(y_0)}
\frac{\prod_{i=1}^n q(y_i, y_{i-1})}{\prod_{i=1}^n q(y_{i-1}, y_{i})}
\frac{\prod_{j=1}^{n-1} (1 - \alpha_j(y_{\range[n]{n-j}}))}{\prod_{j=1}^{n-1} (1 - \alpha_j(y_{\range[0]{j}}))}
\right)
$$
The overall delayed rejection algorithm is then
$Y_0 \gets X$, accepted $\gets \texttt{False}$, $n \gets 1$
while not accepted and $n \le N$
$\quad Y_n \sim Q(Y_{n-1}, \cdot)$
$\quad \Lambda_n \sim \textrm{unif}(0,1)$
$\quad$if $\Lambda_n < \alpha_n(Y_{\range[0]{n}})$
$\quad\quad X' \gets Y_n$, accepted $\gets \texttt{True}$
$\quad n \gets n + 1$
if not accepted
$\quad X' \gets X$
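A direct Python transcription of this algorithm (my own sketch, not from the paper). The acceptance probability $\alpha_n$ is computed by naive recursion on the formula above, whose cost grows exponentially in $n$, so this is only practical for small $N$:

```python
import math
import random

def dr_alpha(ys, pi, q):
    # Delayed rejection acceptance probability alpha_n(y_0, ..., y_n),
    # computed by direct recursion on its defining formula.
    n = len(ys) - 1
    num, den = pi(ys[-1]), pi(ys[0])
    for i in range(1, n + 1):
        num *= q(ys[i], ys[i - 1])   # reverse-path proposal densities
        den *= q(ys[i - 1], ys[i])   # forward-path proposal densities
    for j in range(1, n):
        num *= 1.0 - dr_alpha(ys[::-1][: j + 1], pi, q)  # alpha_j(y_n, ..., y_{n-j})
        den *= 1.0 - dr_alpha(ys[: j + 1], pi, q)        # alpha_j(y_0, ..., y_j)
    # den = 0 only on probability-zero paths; return 1 as a harmless convention.
    return min(1.0, num / den) if den > 0.0 else 1.0

def dr_step(x, pi, q, sample_q, N, rng=random):
    # One transition of the delayed rejection kernel with at most N proposals.
    ys = [x]
    for n in range(1, N + 1):
        ys.append(sample_q(ys[-1], rng))       # Y_n ~ Q(Y_{n-1}, .)
        if rng.random() < dr_alpha(ys, pi, q):
            return ys[-1]                      # accept the nth proposal
    return x                                   # all N proposals rejected

# Example: standard normal target, unit-variance Gaussian random-walk proposal.
pi = lambda x: math.exp(-0.5 * x * x)
q = lambda y, x: math.exp(-0.5 * (x - y) ** 2)  # density of Q(y, dx), up to a constant
sample_q = lambda y, rng: y + rng.gauss(0.0, 1.0)

x = 0.0
for _ in range(100):
    x = dr_step(x, pi, q, sample_q, N=3)
```

For $n = 1$ the recursion reduces to the standard Metropolis--Hastings acceptance probability, since both correction products are empty.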
The overall kernel $P_{DR}$ is
\begin{align}
P_{DR}(x, A) &=
\sum_{m=1}^N \mathbb{P}(\text{accepted} = \texttt{True}, X' \in \set{A}, n = m \gvn X = x)\\
&\phantom{=}\, + \mathbb{P}(\text{accepted} = \texttt{False}, X' \in \set{A} \gvn X = x)
\end{align}
The probability of accepting a proposal in a set $A$ on the $m$th attempt given $X = y_0$ is then
\begin{multline}
\mathbb{P}(\text{accepted} = \texttt{True}, X' \in \set{A}, n = m \gvn X = y_0)=\\
\int_{\space{X}^m}
\indc_A(y_m)\,
\prod_{j=1}^m q(y_{j-1}, y_j)\,
\prod_{j=1}^{m-1} (1 - \alpha_j(y_{\range[0]{j}}))\,
\alpha_m(y_{\range[0]{m}})
\diff y_{\range{m}}
\end{multline}
Integrating this over initial states $y_0 \in \set{B}$ with respect to $\bar{\pi}$ gives
\begin{multline}
\int_B
\mathbb{P}(\text{accepted} = \texttt{True}, X' \in \set{A}, n = m \gvn X = y_0)
\,\bar{\pi}(\diff y_0)
=\\
\int_{\space{X}^{m+1}}
\indc_A(y_m)\,\indc_B(y_0)\,
\frac{1}{Z}\pi(y_0)
\prod_{j=1}^m q(y_{j-1}, y_j)\,
\prod_{j=1}^{m-1} (1 - \alpha_j(y_{\range[0]{j}}))\,
\alpha_m(y_{\range[0]{m}})
\diff y_{\range[0]{m}}
\end{multline}
By the definition of $\alpha_m$, and analogously to the Metropolis--Hastings case, the integrand can be rewritten as a $\min$ of the forward- and reversed-path densities:
\begin{multline}
\int_B
\mathbb{P}(\text{accepted} = \texttt{True}, X' \in \set{A}, n = m \gvn X = y_0)
\,\bar{\pi}(\diff y_0)
=\\
\int_{\space{X}^{m+1}}
\indc_A(y_m)\,\indc_B(y_0)\,
\frac{1}{Z}
\underbrace{
\pi(y_0)
\prod_{j=1}^m q(y_{j-1}, y_j)\,
\prod_{j=1}^{m-1} (1 - \alpha_j(y_{\range[0]{j}}))\,
\alpha_m(y_{\range[0]{m}})
}_
{
\min\Big(
\pi(y_0)
\prod_{j=1}^m q(y_{j-1}, y_j)\,
\prod_{j=1}^{m-1} (1 - \alpha_j(y_{\range[0]{j}})),\,
\pi(y_m)
\prod_{j=1}^m q(y_{j}, y_{j-1})\,
\prod_{j=1}^{m-1} (1 - \alpha_j(y_{\range[m]{m-j}}))
\Big)
}
\diff y_{\range[0]{m}}
\end{multline}
Relabelling $y_i \mapsto y_{m-i}$, under which this $\min$ is invariant, swaps the roles of $A$ and $B$, so
\begin{multline}
\int_B
\mathbb{P}(\text{accepted} = \texttt{True}, X' \in \set{A}, n = m \gvn X = y_0) \,\bar{\pi}(\diff y_0) =\\
\int_A
\mathbb{P}(\text{accepted} = \texttt{True}, X' \in \set{B}, n = m \gvn X = y_0)\,\bar{\pi}(\diff y_0)
\end{multline}
for each $m \in \range{N}$. As the rejection term of $P_{DR}$ contributes $\int_{A \cap B} \mathbb{P}(\text{accepted} = \texttt{False} \gvn X = x)\,\bar{\pi}(\diff x)$, which is symmetric in $A$ and $B$, it follows that $P_{DR}$ is $\bar{\pi}$-reversible.
## Sequential proposal Metropolis--Hastings algorithm
The sequential proposal variant of Park and Atchadé replaces the per-proposal uniform variables $\Lambda_n$ of the delayed rejection method with a single uniform variable $\Lambda$ shared across all proposal attempts: the first proposal $Y_n$ whose path-wise Metropolis--Hastings ratio
$$
\beta(y_{\range[0]{n}}) =
\min\left(
1,
\frac{\pi(y_n)}{\pi(y_0)}
\frac{\prod_{i=1}^n q(y_i, y_{i-1})}{\prod_{i=1}^n q(y_{i-1}, y_{i})}
\right)
$$
exceeds $\Lambda$ is accepted.
$\Lambda \sim \textrm{unif}(0,1)$
$Y_0 \gets X$, accepted $\gets \texttt{False}$, $n \gets 1$
while not accepted and $n \le N$
$\quad Y_n \sim Q(Y_{n-1}, \cdot)$
$\quad$if $\Lambda < \beta(Y_{\range[0]{n}})$
$\quad\quad X' \gets Y_n$, accepted $\gets \texttt{True}$
$\quad n \gets n + 1$
if not accepted
$\quad X' \gets X$
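A minimal Python sketch of this transition (illustrative; function names are my own):

```python
import math
import random

def sp_mh_step(x, pi, q, sample_q, N, rng=random):
    # One sequential proposal Metropolis-Hastings transition: a single
    # uniform Lambda is shared across all proposal attempts, and the
    # first Y_n with beta(Y_0, ..., Y_n) > Lambda is accepted.
    lam = rng.random()                  # Lambda ~ unif(0, 1), drawn once
    y_prev, ratio = x, 1.0              # running value of the beta ratio
    for _ in range(N):
        y = sample_q(y_prev, rng)       # Y_n ~ Q(Y_{n-1}, .)
        ratio *= (pi(y) * q(y, y_prev)) / (pi(y_prev) * q(y_prev, y))
        if lam < min(1.0, ratio):
            return y                    # accept the nth proposal
        y_prev = y
    return x                            # all N proposals rejected

# Example: standard normal target with Gaussian random-walk proposals.
pi = lambda x: math.exp(-0.5 * x * x)
q = lambda y, x: math.exp(-0.5 * (x - y) ** 2)  # density of Q(y, dx), up to a constant
sample_q = lambda y, rng: y + rng.gauss(0.0, 1.0)

x, chain = 0.0, []
for _ in range(500):
    x = sp_mh_step(x, pi, q, sample_q, N=3)
    chain.append(x)
```

Unlike delayed rejection, the acceptance threshold here is the plain Metropolis--Hastings ratio along the proposal path, with no correction factors for earlier rejections; it is the shared $\Lambda$ that preserves $\bar{\pi}$-invariance.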