# Markov chain Monte Carlo algorithms with sequential proposals <small>(Park & Atchadé, 2019)</small>
$\newcommand{\diff}{\mathop{}\!\mathrm{d}}\newcommand{\gvn}{\,|\,}\newcommand{\set}[1]{#1}\newcommand{\space}[1]{\mathbb{#1}}\newcommand{\algebra}[1]{\mathcal{X}}\newcommand{\expc}{\mathbb{E}}\newcommand{\var}{\mathbb{V}}\newcommand{\range}[2][1]{#1{:}\mkern0.5mu#2}\newcommand{\indc}{𝟙}$
## Preliminaries
### Setup
Target distribution on a measurable space $(\space{X}, \algebra{X})$
$$
\bar{\pi}(\diff x) := \frac{1}{Z}\pi(x) \diff x
$$
Typically we wish to estimate expectations of one or more $\algebra{X}$-measurable functions $f$
$$ \bar{\pi}[f] = \int_{\mathbb{X}} f(x)\, \bar{\pi}(\diff x) $$
### Markov chain Monte Carlo
Let $P : \space{X} \times \algebra{X} \to [0,1]$ be a $\bar{\pi}$-invariant Markov kernel, i.e. satisfying
$$
\bar{\pi}(A) =
\int_{\mathbb{X}} P(x, A)\,\bar{\pi}(\diff x)
\quad \forall A \in \algebra{X}.
$$
A common special case satisfying this generalised balance condition is a $\bar{\pi}$-reversible Markov kernel which satisfies
$$
\int_{\set{B}} P(x, A)\,\bar{\pi}(\diff x) =
\int_{\set{A}} P(x, B)\,\bar{\pi}(\diff x)
\quad \forall A,B \in \algebra{X}.
$$
If for an arbitrary initial state distribution $\mu$ the Markov chain
$$
X_1 \sim \mu,
\quad X_m \sim P(X_{m-1}, \cdot) ~ \forall m \in \range[2]{M}
$$
is *ergodic*, then for any $\bar{\pi}$-integrable function $f$ the Markov chain Monte Carlo (MCMC) estimator
$$\hat{I}(f, X_{\range{M}}) = \frac{1}{M}\sum_{i=1}^M f(X_i)$$
is a consistent estimator of $\bar{\pi}[f]$, i.e.
$$\lim_{M\to\infty} \hat{I}(f, X_{\range{M}}) = \bar{\pi}[f] \quad \text{almost surely}.$$
For a Markov chain $X_{\range{M}}$ with kernel $P$ initialised from its stationary distribution, i.e. $\mu = \bar{\pi}$, we define the *asymptotic variance* for a $\algebra{X}$-measurable function $f$ as
$$\textrm{var}_\infty[f, P] := \lim_{M\to\infty} M \var\left[\frac{1}{M}\sum_{i=1}^M f(X_i)\right]$$
Note that the asymptotic variance, as a measure of the performance of an MCMC method, is distinct from measures of distributional convergence, which assess how quickly a chain converges to stationarity: a chain with low asymptotic variance may converge very slowly (or not at all) to stationarity, and vice versa.
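As a simple sanity check on this definition (my own illustration, not from the paper): for the ideal kernel $P(x, \cdot) = \bar{\pi}(\cdot)$ the chain is iid, so $\textrm{var}_\infty[f, P]$ equals the ordinary variance $\var_{\bar{\pi}}[f]$. This can be verified empirically by simulating many independent replicate chains:

```python
import random
import statistics

M, R = 200, 2000          # chain length, number of replicate chains
f = lambda x: x           # test function; target is a standard normal
rng = random.Random(0)

# "Perfect" kernel: P(x, .) = pi-bar(.), so X_1, ..., X_M are iid N(0, 1)
# and M * Var[mean of f(X_i)] should approach Var_pi[f] = 1.
means = []
for _ in range(R):
    chain = [rng.gauss(0.0, 1.0) for _ in range(M)]
    means.append(sum(f(x) for x in chain) / M)

avar_est = M * statistics.pvariance(means)   # should be close to 1 here
```

For correlated chains the same replicate-based recipe gives a crude empirical estimate of $\textrm{var}_\infty[f, P]$ at large $M$.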
### Metropolis--Hastings algorithm <small>(Metropolis et al., 1953; Hastings, 1970)</small>
Let $Q$ be a Markov kernel on $(\space{X},\algebra{X})$ with density $q : \space{X}\times\space{X} \to [0, \infty)$ with respect to the reference measure $\diff x$, i.e. $Q(y, \diff x) = q(y, x) \diff x$. Then the Markov kernel
$$
P_{MH}(x, \diff y) = Q(x, \diff y) \alpha(x, y) + \delta_x(\diff y)\underbrace{\int_{\space{X}} (1 - \alpha(x,u)) Q(x, \diff u)}_{\textrm{probability of rejecting proposal}}
$$
with
$$
\alpha(x, y) = \min\left(1, \frac{\pi(y)\,q(y,x)}{\pi(x)\,q(x,y)}\right)
$$
is a $\bar{\pi}$-reversible kernel for an arbitrary choice of the proposal kernel $Q$.
This can easily be proven by first observing that
$$
\alpha(x, y) q(x,y) \pi(x) =
\min\left(q(x,y) \pi(x), q(y,x) \pi(y) \right) =
\alpha(y, x) q(y,x) \pi(y)
$$
and so
\begin{align}
\int_B P_{MH}(x, A) \bar{\pi}(\diff x) &=
\int_B \int_A
q(x, y) \alpha(x, y) \, \frac{1}{Z}\pi(x) \diff y\diff x +
\int_{B \cap A}
\int_{\space{X}} (1 - \alpha(x,u)) Q(x, \diff u)
\,\frac{1}{Z}\pi(x)\diff x\\
&=
\int_A \int_B
q(y, x) \alpha(y, x) \, \frac{1}{Z}\pi(y) \diff x\diff y +
\int_{B \cap A}
\int_{\space{X}} (1 - \alpha(x,u)) Q(x, \diff u)
\,\frac{1}{Z}\pi(x)\diff x\\
&=
\int_A \int_B
q(x, y) \alpha(x, y) \, \frac{1}{Z}\pi(x) \diff y\diff x +
\int_{B \cap A}
\int_{\space{X}} (1 - \alpha(x,u)) Q(x, \diff u)
\,\frac{1}{Z}\pi(x)\diff x\\
&= \int_A P_{MH}(x, B) \bar{\pi}(\diff x)
\end{align}
We can generate a sample $X' \sim P_{MH}(X, \cdot)$ in practice as follows:
$Y \sim Q(X, \cdot)$
$\Lambda \sim \mathrm{unif}(0,1)$
if $\Lambda < \alpha(X, Y)$
$\quad X' \gets Y$
else
$\quad X' \gets X$
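As a concrete illustration, here is a minimal Python sketch of this accept/reject step (my own, not from the paper), using a symmetric Gaussian random-walk proposal so that $q(y, x) / q(x, y) = 1$ and the acceptance ratio reduces to $\pi(y)/\pi(x)$:

```python
import math
import random

def mh_step(x, log_pi, step=0.5, rng=random):
    # One Metropolis-Hastings update with a symmetric Gaussian
    # random-walk proposal, so the q(y, x) / q(x, y) factor cancels.
    y = x + rng.gauss(0.0, step)            # Y ~ Q(X, .)
    log_alpha = min(0.0, log_pi(y) - log_pi(x))
    if rng.random() < math.exp(log_alpha):  # Lambda < alpha(X, Y)
        return y                            # accept
    return x                                # reject: chain stays at x

# Example: sample from a standard normal target, pi(x) ∝ exp(-x^2 / 2).
log_pi = lambda x: -0.5 * x * x
x, chain = 0.0, []
for _ in range(1000):
    x = mh_step(x, log_pi)
    chain.append(x)
```

Working with $\log \pi$ rather than $\pi$ avoids numerical under/overflow for peaked or high-dimensional targets.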
### Metropolis--Hastings-Green algorithm <small>(Green, 1995; Geyer, 2003)</small>
If $\mathcal{F} : \mathbb{X} \to \mathbb{X}$ is an involution, i.e. $\mathcal{F} \circ \mathcal{F} = \textrm{id}$, then
$P_{MHG}(x, \diff y) = \delta_{\mathcal{F}(x)}(\diff y)\alpha(x) + \delta_x(\diff y) (1 - \alpha(x))$
with
$\alpha(x) =
\min\left(1, \frac{\pi(\mathcal{F}(x))}{\pi(x)} \left| \partial\mathcal{F}(x) \right| \right)$
where $\left|\partial\mathcal{F}(x)\right|$ denotes the Jacobian determinant of $\mathcal{F}$ at $x$, is a $\bar{\pi}$-reversible kernel. We can generate a sample $X' \sim P_{MHG}(X, \cdot)$ as follows:
$Y \gets \mathcal{F}(X)$
$\Lambda \sim \textrm{unif}(0,1)$
if $\Lambda < \frac{\pi(Y)}{\pi(X)}\left| \partial\mathcal{F}(X) \right|$ then
$\quad X' \gets Y$
else
$\quad X' \gets X$
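A minimal Python sketch of this update (illustrative, not from the paper). The example involution is $\mathcal{F}(x) = 1/x$ on $(0, \infty)$, which satisfies $\mathcal{F}(\mathcal{F}(x)) = x$ and has Jacobian $|\partial\mathcal{F}(x)| = 1/x^2$; note a single deterministic involution is not ergodic on its own, so in practice such kernels are composed with other kernels:

```python
import math
import random

def mhg_step(x, pi, F, jac_F, rng=random):
    # One Metropolis-Hastings-Green update for a deterministic involution F
    # with Jacobian determinant magnitude jac_F.
    y = F(x)
    alpha = min(1.0, pi(y) / pi(x) * abs(jac_F(x)))
    return y if rng.random() < alpha else x

# Example involution on (0, inf): F(x) = 1/x, so F(F(x)) = x,
# with Jacobian determinant |dF/dx| = 1/x^2.
F = lambda x: 1.0 / x
jac_F = lambda x: 1.0 / (x * x)
pi = lambda x: math.exp(-x)  # unnormalised Exp(1) density on (0, inf)

x_new = mhg_step(2.0, pi, F, jac_F)  # either 0.5 (accepted) or 2.0 (rejected)
```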
### Peskun-Tierney ordering <small>(Peskun, 1973; Tierney, 1998)</small>
Let $P_1$ and $P_2$ be two $\bar{\pi}$-reversible Markov kernels on $(\space{X}, \algebra{X})$. Then for any function $f$ that is square-integrable with respect to $\bar{\pi}$,
$$
P_1(x, A \setminus \lbrace x \rbrace) \geq
P_2(x, A \setminus \lbrace x \rbrace)
\quad \forall x \in \mathbb{X}, \forall A \in \mathcal{X}
\implies
\textrm{var}_\infty(f, P_1) \leq \textrm{var}_\infty(f, P_2)
$$
For a Metropolis--Hastings kernel $P_{MH}$,
$$
P_{MH}(x, A \setminus \lbrace x \rbrace) = \int_{A\setminus \lbrace x\rbrace} \alpha(x,y) Q(x, \diff y) =
\int_{A} \alpha(x,y) Q(x, \diff y)
$$
(assuming $Q(x, \lbrace x \rbrace) = 0$) is the probability of proposing and accepting a move from $x$ into $A$.
We can therefore improve the asymptotic variance of an MCMC estimator, relative to a base Metropolis--Hastings kernel, by redistributing the probability mass the base kernel assigns to rejections onto accepted proposals, by 'trying again' with further proposals.
## Delayed rejection method <small>(Tierney and Mira, 1999; Mira, 2001)</small>
An early instantiation of the idea of improving the asymptotic variance of a Metropolis--Hastings kernel by creating extra chances to accept with further proposals is the delayed rejection algorithm of Tierney and Mira.
Given a proposal kernel $Q(y, \diff x) = q(y, x) \diff x$, an initial proposal is first made as in the standard Metropolis--Hastings algorithm, and accepted with the same probability:
$Y_1 \sim Q(X, \cdot)$
$\Lambda_1 \sim \textrm{unif}(0,1)$
if $\Lambda_1 < \alpha(X, Y_1)$
$\quad X' \gets Y_1$
If the first proposal is rejected, however, rather than setting the next state to the current state, another proposal is sampled from $Q$, this time starting from the first proposal $Y_1$:
$Y_1 \sim Q(X, \cdot)$
$\Lambda_1 \sim \textrm{unif}(0,1)$
if $\Lambda_1 < \alpha(X, Y_1)$
$\quad X' \gets Y_1$
else
$\quad Y_2 \sim Q(Y_1, \cdot)$
$\quad\Lambda_2 \sim \textrm{unif}(0,1)$
$\quad$if $\Lambda_2 < \min\left( 1, \frac{\pi(Y_2) q(Y_2, Y_1) q(Y_1, X)}{\pi(X) q(X, Y_1) q(Y_1, Y_2)} \frac{1 - \alpha(Y_2, Y_1)}{1 - \alpha(X, Y_1)}\right)$
$\qquad X' \gets Y_2$
This procedure of making a new proposal when the previous one is rejected can be iterated as many times as desired. The probability of accepting the $n$th proposal given a sequence of proposed values $y_{\range{n}}$ and initial state $x = y_0$ is then given by
$$
\alpha_n(y_{\range[0]{n}}) =
\min\left(
1,
\frac{\pi(y_n)}{\pi(y_0)}
\frac{\prod_{i=1}^n q(y_i, y_{i-1})}{\prod_{i=1}^n q(y_{i-1}, y_{i})}
\frac{\prod_{j=1}^{n-1} (1 - \alpha_j(y_{\range[n]{n-j}}))}{\prod_{j=1}^{n-1} (1 - \alpha_j(y_{\range[0]{j}}))}
\right)
$$
The overall delayed rejection algorithm is then
$Y_0 \gets X$, accepted $\gets \texttt{False}$, $n \gets 1$
while not accepted and $n \le N$
$\quad Y_n \sim Q(Y_{n-1}, \cdot)$
$\quad \Lambda_n \sim \textrm{unif}(0,1)$
$\quad$if $\Lambda_n < \alpha_n(Y_{\range[0]{n}})$
$\quad\quad X' \gets Y_n$, accepted $\gets \texttt{True}$
$\quad n \gets n + 1$
if not accepted
$\quad X' \gets X$
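A direct Python transcription of this algorithm (my own sketch, not from the paper). The acceptance probability $\alpha_n$ is computed by naive recursion on the formula above, whose cost grows exponentially in $n$, so this is only practical for small $N$:

```python
import math
import random

def dr_alpha(ys, pi, q):
    # Delayed rejection acceptance probability alpha_n(y_0, ..., y_n),
    # computed by direct recursion on its defining formula.
    n = len(ys) - 1
    num, den = pi(ys[-1]), pi(ys[0])
    for i in range(1, n + 1):
        num *= q(ys[i], ys[i - 1])   # reverse-path proposal densities
        den *= q(ys[i - 1], ys[i])   # forward-path proposal densities
    for j in range(1, n):
        num *= 1.0 - dr_alpha(ys[::-1][: j + 1], pi, q)  # alpha_j(y_n, ..., y_{n-j})
        den *= 1.0 - dr_alpha(ys[: j + 1], pi, q)        # alpha_j(y_0, ..., y_j)
    # den = 0 only on probability-zero paths; return 1 as a harmless convention.
    return min(1.0, num / den) if den > 0.0 else 1.0

def dr_step(x, pi, q, sample_q, N, rng=random):
    # One transition of the delayed rejection kernel with at most N proposals.
    ys = [x]
    for n in range(1, N + 1):
        ys.append(sample_q(ys[-1], rng))       # Y_n ~ Q(Y_{n-1}, .)
        if rng.random() < dr_alpha(ys, pi, q):
            return ys[-1]                      # accept the nth proposal
    return x                                   # all N proposals rejected

# Example: standard normal target, unit-variance Gaussian random-walk proposal.
pi = lambda x: math.exp(-0.5 * x * x)
q = lambda y, x: math.exp(-0.5 * (x - y) ** 2)  # density of Q(y, dx), up to a constant
sample_q = lambda y, rng: y + rng.gauss(0.0, 1.0)

x = 0.0
for _ in range(100):
    x = dr_step(x, pi, q, sample_q, N=3)
```

For $n = 1$ the recursion reduces to the standard Metropolis--Hastings acceptance probability, since both correction products are empty.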
The overall kernel $P_{DR}$ is
\begin{align}
P_{DR}(x, A) &=
\sum_{m=1}^N \mathbb{P}(\text{accepted} = \texttt{True}, X' \in \set{A}, n = m \gvn X = x)\\
&\phantom{=}\, + \mathbb{P}(\text{accepted} = \texttt{False}, X' \in \set{A} \gvn X = x)
\end{align}
The probability of accepting a proposal in a set $A$ on the $m$th attempt given $X = y_0$ is then
\begin{multline}
\mathbb{P}(\text{accepted} = \texttt{True}, X' \in \set{A}, n = m \gvn X = y_0)=\\
\int_{\space{X}^m}
\indc_A(y_m)\,
\prod_{j=1}^m q(y_{j-1}, y_j)\,
\prod_{j=1}^{m-1} (1 - \alpha_j(y_{\range[0]{j}}))\,
\alpha_m(y_{\range[0]{m}})
\diff y_{\range{m}}
\end{multline}
Integrating this over initial states $y_0 \in \set{B}$ with respect to $\bar{\pi}$ gives
\begin{multline}
\int_B
\mathbb{P}(\text{accepted} = \texttt{True}, X' \in \set{A}, n = m \gvn X = y_0)
\,\bar{\pi}(\diff y_0)
=\\
\int_{\space{X}^{m+1}}
\indc_A(y_m)\,\indc_B(y_0)\,
\frac{1}{Z}\pi(y_0)
\prod_{j=1}^m q(y_{j-1}, y_j)\,
\prod_{j=1}^{m-1} (1 - \alpha_j(y_{\range[0]{j}}))\,
\alpha_m(y_{\range[0]{m}})
\diff y_{\range[0]{m}}
\end{multline}
By the definition of $\alpha_m$, and analogously to the Metropolis--Hastings case, the integrand can be rewritten as a $\min$ of the forward- and reversed-path densities:
\begin{multline}
\int_B
\mathbb{P}(\text{accepted} = \texttt{True}, X' \in \set{A}, n = m \gvn X = y_0)
\,\bar{\pi}(\diff y_0)
=\\
\int_{\space{X}^{m+1}}
\indc_A(y_m)\,\indc_B(y_0)\,
\frac{1}{Z}
\underbrace{
\pi(y_0)
\prod_{j=1}^m q(y_{j-1}, y_j)\,
\prod_{j=1}^{m-1} (1 - \alpha_j(y_{\range[0]{j}}))\,
\alpha_m(y_{\range[0]{m}})
}_
{
\min\Big(
\pi(y_0)
\prod_{j=1}^m q(y_{j-1}, y_j)\,
\prod_{j=1}^{m-1} (1 - \alpha_j(y_{\range[0]{j}})),\,
\pi(y_m)
\prod_{j=1}^m q(y_{j}, y_{j-1})\,
\prod_{j=1}^{m-1} (1 - \alpha_j(y_{\range[m]{m-j}}))
\Big)
}
\diff y_{\range[0]{m}}
\end{multline}
Relabelling $y_i \mapsto y_{m-i}$, under which this $\min$ is invariant, swaps the roles of $A$ and $B$, so
\begin{multline}
\int_B
\mathbb{P}(\text{accepted} = \texttt{True}, X' \in \set{A}, n = m \gvn X = y_0) \,\bar{\pi}(\diff y_0) =\\
\int_A
\mathbb{P}(\text{accepted} = \texttt{True}, X' \in \set{B}, n = m \gvn X = y_0)\,\bar{\pi}(\diff y_0)
\end{multline}
for each $m \in \range{N}$. As the rejection term of $P_{DR}$ contributes $\int_{A \cap B} \mathbb{P}(\text{accepted} = \texttt{False} \gvn X = x)\,\bar{\pi}(\diff x)$, which is symmetric in $A$ and $B$, it follows that $P_{DR}$ is $\bar{\pi}$-reversible.
## Sequential proposal Metropolis--Hastings algorithm
The sequential proposal variant of Park and Atchadé replaces the per-proposal uniform variables $\Lambda_n$ of the delayed rejection method with a single uniform variable $\Lambda$ shared across all proposal attempts: the first proposal $Y_n$ whose path-wise Metropolis--Hastings ratio
$$
\beta(y_{\range[0]{n}}) =
\min\left(
1,
\frac{\pi(y_n)}{\pi(y_0)}
\frac{\prod_{i=1}^n q(y_i, y_{i-1})}{\prod_{i=1}^n q(y_{i-1}, y_{i})}
\right)
$$
exceeds $\Lambda$ is accepted.
$\Lambda \sim \textrm{unif}(0,1)$
$Y_0 \gets X$, accepted $\gets \texttt{False}$, $n \gets 1$
while not accepted and $n \le N$
$\quad Y_n \sim Q(Y_{n-1}, \cdot)$
$\quad$if $\Lambda < \beta(Y_{\range[0]{n}})$
$\quad\quad X' \gets Y_n$, accepted $\gets \texttt{True}$
$\quad n \gets n + 1$
if not accepted
$\quad X' \gets X$
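A minimal Python sketch of this transition (illustrative; function names are my own):

```python
import math
import random

def sp_mh_step(x, pi, q, sample_q, N, rng=random):
    # One sequential proposal Metropolis-Hastings transition: a single
    # uniform Lambda is shared across all proposal attempts, and the
    # first Y_n with beta(Y_0, ..., Y_n) > Lambda is accepted.
    lam = rng.random()                  # Lambda ~ unif(0, 1), drawn once
    y_prev, ratio = x, 1.0              # running value of the beta ratio
    for _ in range(N):
        y = sample_q(y_prev, rng)       # Y_n ~ Q(Y_{n-1}, .)
        ratio *= (pi(y) * q(y, y_prev)) / (pi(y_prev) * q(y_prev, y))
        if lam < min(1.0, ratio):
            return y                    # accept the nth proposal
        y_prev = y
    return x                            # all N proposals rejected

# Example: standard normal target with Gaussian random-walk proposals.
pi = lambda x: math.exp(-0.5 * x * x)
q = lambda y, x: math.exp(-0.5 * (x - y) ** 2)  # density of Q(y, dx), up to a constant
sample_q = lambda y, rng: y + rng.gauss(0.0, 1.0)

x, chain = 0.0, []
for _ in range(500):
    x = sp_mh_step(x, pi, q, sample_q, N=3)
    chain.append(x)
```

Unlike delayed rejection, the acceptance threshold here is the plain Metropolis--Hastings ratio along the proposal path, with no correction factors for earlier rejections; it is the shared $\Lambda$ that preserves $\bar{\pi}$-invariance.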