# Notes for reading group - 15/11/2018

We will be reviewing the paper

> Maxime Rischard, Pierre E. Jacob and Natesh Pillai. 'Unbiased estimation of log normalizing constants with applications to Bayesian cross-validation'. [*arXiv preprint* 1810.01382](https://arxiv.org/abs/1810.01382) (October 2018).

## Notation

$\newcommand{\dr}{\mathrm{d}} \newcommand{\set}[1]{\mathcal{#1}} \newcommand{\gvn}{\,|\,} \newcommand{\grad}[2]{\nabla_{#2} #1} \newcommand{\expc}{\mathbb{E}}$The set of integers from $A$ to $B$ inclusive will be denoted $A{:}B$. Symbols subscripted by a set will be used to indicate indexed sets, e.g. $x_{\set{S}} = \lbrace x_s \rbrace_{s\in\set{S}}$, and the cardinality of a set $\set{S}$ is $|\set{S}|$.

## Preliminaries

Let $\pi_0$ and $\pi_1$ be two distributions on a space $\set{X}$ with

$$ \pi_i(\dr x) = \frac{\tilde{\pi}_i(x)}{Z_i} \,\dr x = \frac{\exp(-U_i(x))}{Z_i} \,\dr x, \quad Z_i = \int_{\set{X}} \tilde{\pi}_i(x)\,\dr x < \infty \quad \forall i \in \lbrace 0, 1 \rbrace. $$

The task is to estimate the log normalising constant ratio

$$ r_{01} = \log \frac{Z_1}{Z_0} = \log Z_1 - \log Z_0. $$

## Example application: Bayesian cross-validation

Let $\set{X}$ represent the parameter space of a generative model, $\set{Y}$ the observation space, $p(\dr x) \propto \tilde{p}(x)\,\dr x$ a prior distribution on $\set{X}$ and $\ell(y\gvn x)$ the conditional density of observing $y \in\set{Y}$ given parameters $x \in \set{X}$. We aim to model a set of $N$ observations $y_{1{:}N}$ which we assume are independent and identically distributed. Given a model defined by $\lbrace p, \ell \rbrace$ we would like to be able to define a measure of 'goodness' for the model; the idea of cross-validation is to use the predictive performance of the model on 'unseen' data.

Let $\set{S}_{N,T} = \lbrace \set{T} \subseteq 1{:}N \gvn |\set{T}| = T < N\rbrace$, i.e. the set of all subsets of the indices $1{:}N$ of size $T < N$. Using a logarithmic scoring rule, a Bayesian cross-validation objective for the model can be defined

$$ C_{T}(y_{1{:}N}, \ell, p) = \frac{1}{|\set{S}_{N,T}|} \sum_{\set{T}\in\set{S}_{N,T}} \log\frac {\int_{\set{X}}\prod_{n\in 1{:}N} \ell(y_n \gvn x) \, \tilde{p}(x)\,\dr x} {\int_{\set{X}}\prod_{t\in\set{T}}\ell(y_t \gvn x') \, \tilde{p}(x')\,\dr x'} $$

which corresponds to the average, over all training sets of size $T$, of the log posterior predictive density of the data *not* in the training set (under the posterior given the training set). Note that higher values correspond to a larger average log posterior predictive density, and so we want to maximise this objective (the paper defines the objective as the negative of this term, with the aim then to minimise).

If for a given $\set{T} \in \set{S}_{N,T}$ we define

$$ \tilde{\pi}_0(x) = \tilde{p}(x) \prod_{t\in\set{T}} \ell(y_t\gvn x), \qquad \tilde{\pi}_1(x) = \tilde{p}(x) \prod_{n\in 1{:}N} \ell(y_n\gvn x) $$

then $r_{01}$ as defined above corresponds to the term being averaged over in the cross-validation objective. If we can estimate such log ratios $r_{01}$, then to estimate the cross-validation objective we can sample one or more training subsets uniformly from $\set{S}_{N,T}$, estimate the corresponding log ratio for each, and average over these estimates.
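To make the last step concrete, below is a minimal sketch (in Python / NumPy, not code from the paper) of estimating the cross-validation objective by averaging log-ratio estimates over randomly drawn training subsets. The function `estimate_log_ratio` is a hypothetical placeholder for whatever estimator of $r_{01}$ is available, e.g. the unbiased path sampling estimator developed in the remainder of these notes.

```python
import numpy as np


def estimate_cv_objective(y, T, estimate_log_ratio, num_subsets=100, rng=None):
    """Monte Carlo estimate of the cross-validation objective C_T.

    Averages estimates of the log ratio r_01 = log Z_1 - log Z_0 over
    training subsets drawn uniformly at random from S_{N,T}.

    `estimate_log_ratio(train_indices, y)` is a hypothetical stand-in for any
    estimator of the log normalising constant ratio between the posterior
    given all of `y` and the posterior given only `y[train_indices]`.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(y)
    log_ratio_estimates = []
    for _ in range(num_subsets):
        # draw a training subset of size T uniformly from S_{N,T}
        train_indices = rng.choice(N, size=T, replace=False)
        log_ratio_estimates.append(estimate_log_ratio(train_indices, y))
    # unbiasedness of each per-subset estimate carries over to the average
    return np.mean(log_ratio_estimates)
```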
## Paths between distributions

We define a *path* between $\pi_0$ and $\pi_1$ as a family of distributions $\pi_\lambda$ for $\lambda \in [0,1]$ with

$$ \pi_\lambda(\dr x) = \frac{\tilde{\pi}_\lambda(x)}{Z_\lambda} \,\dr x = \frac{\exp(-U_\lambda(x))}{Z_\lambda} \,\dr x, \quad Z_\lambda = \int_{\set{X}} \tilde{\pi}_\lambda(x)\,\dr x < \infty \quad \forall \lambda \in [0, 1]. $$

A simple example is the *geometric path* for which

$$ \tilde{\pi}_\lambda(x) := \tilde{\pi}_0(x)^{1 - \lambda} \tilde{\pi}_1(x)^{\lambda} \iff U_\lambda(x) := (1-\lambda)U_0(x) + \lambda U_1(x) \quad \forall \lambda \in [0, 1]. $$

## Thermodynamic integration identity

The estimator used in the paper for $r_{01}$ relies on what is sometimes termed the *thermodynamic integration identity* from its origins in statistical physics, where $r_{01}$ would typically correspond to a difference in the *Helmholtz free energy* of two states of a system<sup><a href='#note-1' id='link-1'>1</a></sup>. We derive the identity below, expanding upon the derivation in equations (1) to (3) in the paper.

From the chain rule we have that

$$ \grad{\log Z_\lambda}{\lambda} = \frac{1}{Z_\lambda}\grad{Z_\lambda}{\lambda}$$

and from the definition of $Z_\lambda$ therefore

$$ \grad{\log Z_\lambda}{\lambda} = \frac{1}{Z_\lambda} \grad{}{\lambda} \int_{\set{X}} \tilde{\pi}_\lambda(x)\,\dr x. $$

Subject to some weak conditions on $\tilde{\pi}_\lambda$<sup><a href='#note-2' id='link-2'>2</a></sup>, we can bring the derivative inside the integral and so

$$ \grad{\log Z_\lambda}{\lambda} = \frac{1}{Z_\lambda} \int_{\set{X}} \grad{\tilde{\pi}_\lambda(x)}{\lambda} \,\dr x. $$

Ignoring<sup><a href='#note-3' id='link-3'>3</a></sup> any points for which $\tilde{\pi}_\lambda(x) = 0$ we can multiply the integrand on the RHS by $\frac{\tilde{\pi}_\lambda(x)}{\tilde{\pi}_\lambda(x)} = 1$

$$ \grad{\log Z_\lambda}{\lambda} = \frac{1}{Z_\lambda} \int_{\set{X}} \frac{\tilde{\pi}_\lambda(x)}{\tilde{\pi}_\lambda(x)}\grad{\tilde{\pi}_\lambda(x)}{\lambda} \,\dr x. $$

Recognising that $\frac{1}{\tilde{\pi}_\lambda}\grad{\tilde{\pi}_\lambda}{\lambda} = \grad{\log\tilde{\pi}_\lambda}{\lambda}$ and rearranging we have

$$ \grad{\log Z_\lambda}{\lambda} = \int_{\set{X}} \grad{\log\tilde{\pi}_\lambda(x)}{\lambda} \frac{\tilde{\pi}_\lambda(x)}{Z_\lambda}\,\dr x, $$

and using $\log \tilde{\pi}_\lambda(x) = -U_\lambda(x)$ and $\pi_\lambda(\dr x) = \frac{\tilde{\pi}_\lambda(x)}{Z_\lambda} \,\dr x$ we have

$$ \grad{\log Z_\lambda}{\lambda} = -\int_{\set{X}} \grad{U_\lambda(x)}{\lambda} \,\pi_\lambda(\dr x). $$

Integrating the LHS over $\lambda \in [0, 1]$ and applying the fundamental theorem of calculus we have that

$$ \int_{0}^1 \grad{\log Z_\lambda}{\lambda} \,\dr \lambda = \log Z_1 - \log Z_0 = r_{01}. $$

Finally, putting this together with the above, we have the following identity for the log ratio $r_{01}$ we wish to estimate

$$ r_{01} = -\int_{0}^1 \int_{\set{X}} \grad{U_\lambda(x)}{\lambda} \pi_\lambda(\dr x) \, \dr\lambda. $$

## Path sampling

The thermodynamic integration identity can be used to estimate $r_{01}$ by nesting approximate integration of the inner and outer integrals. In statistical physics settings, typically the one-dimensional outer integral over $\lambda \in [0,1]$ is approximated using a quadrature method (e.g. the trapezium rule) on a set of $M + 1$ points $0 = \lambda_0 < \lambda_1 \dots < \lambda_{M-1} < \lambda_M = 1$, with the inner integrals with respect to $\pi_{\lambda_m} ~~\forall m\in 0{:}M$ estimated using a Monte Carlo method.
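As a point of comparison for the approach introduced next, here is a minimal sketch (in Python / NumPy, not code from the paper) of this quadrature-plus-Monte-Carlo scheme for the geometric path, for which $\grad{U_\lambda(x)}{\lambda} = U_1(x) - U_0(x)$. The routine `sample_pi_lambda` is a hypothetical placeholder for a sampler (e.g. MCMC) targeting $\pi_\lambda$.

```python
import numpy as np


def path_sampling_quadrature(U_0, U_1, sample_pi_lambda,
                             num_steps=20, num_samples=1000):
    """Quadrature + Monte Carlo estimate of r_01 = log(Z_1 / Z_0).

    Uses the geometric path, for which dU_lambda/dlambda = U_1(x) - U_0(x).
    `sample_pi_lambda(lam, num_samples)` is a hypothetical stand-in for a
    routine returning (approximate) samples from pi_lambda, e.g. by MCMC.
    """
    lambdas = np.linspace(0.0, 1.0, num_steps + 1)
    # inner integrals: Monte Carlo estimates of the expectation of
    # dU_lambda/dlambda under pi_lambda at each quadrature node
    inner = [
        np.mean([U_1(x) - U_0(x) for x in sample_pi_lambda(lam, num_samples)])
        for lam in lambdas
    ]
    # outer integral over lambda in [0, 1] by the trapezium rule,
    # with the overall minus sign from the identity above
    return -np.trapz(inner, lambdas)
```

Note that such an estimator is biased both by the quadrature discretisation in $\lambda$ and by any lack of convergence of the MCMC samples to $\pi_\lambda$.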
Alternatively we can estimate both inner and outer integrals using Monte Carlo methods. If we define an importance distribution $q(\dr\lambda) = q(\lambda)\,\dr \lambda$ with support on $[0, 1]$ then

$$ r_{01} = -\int_{0}^1 \int_{\set{X}} \frac{1}{q(\lambda)} \grad{U_\lambda(x)}{\lambda} \pi_\lambda(\dr x) \, q(\dr\lambda) $$

and if we generate $\Lambda$ from $q$ and then, given $\Lambda$, generate $X$ from $\pi_\Lambda$, then $-\grad{U_\Lambda(X)}{\Lambda} / q(\Lambda)$ will be an unbiased estimator of $r_{01}$. The main problem with this approach is that generating an exact independent sample from $\pi_\lambda$ for arbitrary $\lambda$ will generally be intractable. We could instead use a Markov chain Monte Carlo (MCMC) method to generate approximate samples from $\pi_{\lambda}$; however, MCMC only gives exact samples asymptotically, and so the resulting estimator would in general be biased. The key idea in the paper is to apply recent developments in methods for computing *unbiased estimators* using MCMC<sup><a href='#note-4' id='link-4'>4</a></sup> to the path sampling problem, with the simple Monte Carlo estimate of $\int_{\set{X}}\grad{U_\lambda(x)}{\lambda} \pi_\lambda(\dr x)$ replaced with an unbiased MCMC estimator.

## Unbiased MCMC

The standard MCMC approach for approximating expectations with respect to a distribution $\pi_\lambda$ on a space $\set{X}$ is to define a Markov kernel $P$ such that $\pi_\lambda$ is its unique invariant distribution,

$$ \int_{\set{X}} P(\set{A}\gvn x)\,\pi_\lambda(\dr x) = \pi_\lambda(\set{A}) \quad \forall \textrm{ measurable } \set{A}\subseteq \set{X}, $$

and to define a Markov chain $X_{1:M}$ by $X_1 \sim \varpi$ for some arbitrary initial distribution $\varpi$ on $\set{X}$ and then, for each $m \in 2{:}M$, $X_m \sim P(\cdot \gvn X_{m-1})$. Then for all $K < M$ and all $\pi_\lambda$-integrable functions $h$ we have that

$$ \lim_{M\to\infty} \expc\left[ \frac{1}{M-(K-1)}\sum_{m\in K{:}M} h(X_m) \right] = \int_{\set{X}} h(x)\,\pi_\lambda(\dr x) $$

i.e. we can form *consistent* estimates of integrals with respect to $\pi_\lambda$. For a finite chain length $M$, however, the estimate will in general be *biased*: it will not be equal in expectation to the true value of the integral. The integer $K$ here represents a tunable *burn-in* or *warm-up* parameter, with the first $K-1$ steps of the chain not included in the estimate to reduce the bias, for a finite chain length $M$, arising from the initial transient stage of the chain where it is typically 'far from stationarity'.

The starting point of unbiased MCMC methods is to define a pair of *coupled* Markov chains, each marginally having $\pi_\lambda$ as its unique invariant distribution. Let $P$ be a $\pi_\lambda$ invariant Markov kernel on $\set{X}$ as previously and $\bar{P}$ be a *coupled* Markov kernel on $\set{X}\times\set{X}$ satisfying

\begin{gather} \bar{P}((\set{A},\set{X})\gvn (x,x')) = P(\set{A}\gvn x) ~~ \textrm{and} ~~ \bar{P}((\set{X},\set{A})\gvn (x,x')) = P(\set{A}\gvn x')\\ \quad \forall (x, x') \in\set{X}\times\set{X},\,\textrm{ measurable } \set{A} \subseteq \set{X}.
\end{gather}

We additionally require that:

* if $(X, X') \sim \bar{P}(\cdot \gvn (x, x'))$ there is a non-zero probability that $X = X'$ for $(x, x')$ in a non-$\pi_\lambda$-null subset of $\set{X}\times\set{X}$,
* and $(X, X') \sim \bar{P}(\cdot \gvn (x, x)) \implies X = X'$ for $\pi_\lambda$ almost all $x$,

which together mean there is a non-zero probability of the two coupled chains *meeting* in some finite time, and that once the chains have met they remain identical thereafter. We can then construct a pair of coupled chains $X_{0:\bar{M}+1}$ and $X'_{0:\bar{M}}$ by drawing $X_0 \sim \varpi$ and $X_0' \sim \varpi$, $X_1 \sim P(\cdot \gvn X_0)$ and, for all $m \geq 1$, $(X_{m+1}, X'_m) \sim \bar{P}(\cdot \gvn (X_m, X'_{m-1}))$, where the number of chain iterations is $\bar{M} = \max(M, \tau)$ with $M$ a user-chosen integer and $\tau$ the random chain meeting time. By construction the chain with state $X_m$ at step $m$ will converge in distribution to $\pi_{\lambda}$ as $m \to \infty$ and so

$$ \int_{\set{X}} h(x)\,\pi_\lambda(\dr x) = \lim_{m\to\infty}\expc[h(X_m)]. $$

We can expand the limit on the RHS as a telescoping sum

$$ \int_{\set{X}} h(x)\,\pi_\lambda(\dr x) = \expc[h(X_k)] + \sum_{m=k+1}^\infty \left( \expc[h(X_m)] - \expc[h(X_{m-1})] \right). $$

Using that $X_m$ and $X'_m$ both have the same marginal distribution (and so the same expectations) by construction we have

$$ \int_{\set{X}} h(x)\,\pi_\lambda(\dr x) = \expc[h(X_k)] + \sum_{m=k+1}^\infty \left( \expc[h(X_m)] - \expc[h(X'_{m-1})] \right). $$

As the meeting time $\tau$ of the coupled chains is such that $X_{m} = X'_{m-1}$ almost surely for all $m \geq \tau$, the terms inside the summation for $m \geq \tau$ vanish and so

$$ \int_{\set{X}} h(x)\,\pi_\lambda(\dr x) = \expc[h(X_k)] + \sum_{m=k+1}^{\tau - 1} \left( \expc[h(X_m)] - \expc[h(X'_{m-1})] \right). $$

Using linearity of expectation we further have that

$$ \int_{\set{X}} h(x)\,\pi_\lambda(\dr x) = \expc\left[ h(X_k) + \sum_{m=k+1}^{\tau - 1} \left( h(X_m) - h(X'_{m-1}) \right) \right]. $$

We therefore have that

$$ H_k = h(X_k) + \sum_{m=k+1}^{\tau - 1} \left( h(X_m) - h(X'_{m-1})\right) $$

is an unbiased estimator for $\int_{\set{X}} h(x)\,\pi_\lambda(\dr x)$ for all $k \geq 0$. We can further use a construction similar to the standard MCMC estimator to define a time-averaged estimator

$$ H_{K,M} = \frac{1}{M-(K-1)}\sum_{k=K}^M H_k $$

which will also necessarily be an unbiased estimator for $\int_{\set{X}} h(x)\,\pi_\lambda(\dr x)$. Substituting the definition of $H_k$ and simplifying gives

$$ H_{K,M} = \frac{1}{M-(K-1)}\sum_{k=K}^M h(X_k) + \sum_{m=K+1}^{\tau-1} \min\left\lbrace 1, \frac{m - K}{M - (K - 1)}\right\rbrace \left(h(X_m) - h(X'_{m-1})\right) $$

corresponding to equation (6) in the paper.
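To make the construction concrete, below is a minimal sketch (in Python / NumPy, not code from the paper) of how $H_{K,M}$ might be computed from the coupled-chain construction above. The arguments `sample_initial`, `sample_kernel` and `sample_coupled_kernel` are hypothetical placeholders for a draw from the initial distribution $\varpi$, the $\pi_\lambda$-invariant kernel $P$, and a coupled kernel $\bar{P}$ satisfying the conditions listed earlier.

```python
import numpy as np


def unbiased_mcmc_estimate(h, sample_initial, sample_kernel,
                           sample_coupled_kernel, K=10, M=100):
    """Time-averaged unbiased estimator H_{K,M} of the expectation of h
    under pi_lambda, computed from a pair of coupled chains.

    Hypothetical placeholder arguments:
      sample_initial()                  -- draw X ~ varpi
      sample_kernel(x)                  -- draw X ~ P(. | x), pi_lambda-invariant
      sample_coupled_kernel(x, x_prime) -- draw (X, X') ~ P_bar(. | (x, x'))
    """
    # initialise both chains from varpi and advance the first chain one step
    xs = [sample_initial()]              # X_0, X_1, ... (first chain)
    xs_prime = [sample_initial()]        # X'_0, X'_1, ... (lagged chain)
    xs.append(sample_kernel(xs[0]))      # X_1 ~ P(. | X_0)

    tau = None                           # meeting time: smallest m with X_m = X'_{m-1}
    m = 1
    # run for max(M, tau) iterations of the coupled kernel
    while tau is None or m < M:
        x_next, x_prime_next = sample_coupled_kernel(xs[m], xs_prime[m - 1])
        xs.append(x_next)                # X_{m+1}
        xs_prime.append(x_prime_next)    # X'_m
        m += 1
        if tau is None and np.all(xs[m] == xs_prime[m - 1]):
            tau = m

    # standard MCMC-style average over k = K, ..., M
    estimate = np.mean([h(xs[k]) for k in range(K, M + 1)])
    # bias-correction term over m = K + 1, ..., tau - 1
    for m in range(K + 1, tau):
        weight = min(1.0, (m - K) / (M - K + 1))
        estimate += weight * (h(xs[m]) - h(xs_prime[m - 1]))
    return estimate
```

In the path sampling application, $h(x)$ would be taken to be $\grad{U_\lambda(x)}{\lambda}$, so that $H_{K,M}$ provides the unbiased replacement for the inner integral estimate in the single-sample estimator of $r_{01}$ described above.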
---

## Footnotes

1. <a id='note-1' href='#link-1'>^</a> For those interested in reading more about the links between statistical physics / thermodynamics and Bayesian inference, the article *A correspondence between thermodynamics and inference* by Colin LaMont and Paul Wiggins ([arXiv preprint 1706.01428](https://arxiv.org/abs/1706.01428)) is a good introduction.
2. <a id='note-2' href='#link-2'>^</a> We require that the derivative $\grad{\tilde{\pi}_\lambda}{\lambda}$ exists for all $\lambda \in [0,1]$ and is bounded in absolute value, $\left|\grad{\tilde{\pi}_\lambda}{\lambda}(x)\right| \leq \bar{\pi}(x)$, by an integrable function $\bar{\pi}$ for all $x \in \set{X}, ~ \lambda \in [0,1]$.
3. <a id='note-3' href='#link-3'>^</a> We can ignore the points for which $\tilde{\pi}_\lambda(x) = 0$ as the set $\lbrace x \in \set{X} : \tilde{\pi}_\lambda(x) = 0 \rbrace$ necessarily has zero measure under $\pi_\lambda$ and so makes no contribution to the integral.
4. <a id='note-4' href='#link-4'>^</a> See the article *Unbiased Markov chain Monte Carlo with couplings* by Pierre Jacob, John O'Leary, and Yves Atchadé ([arXiv preprint 1708.03625](https://arxiv.org/abs/1708.03625)) and the accompanying [blog post](https://statisfaction.wordpress.com/2017/08/14/unbiased-mcmc-with-couplings/) for more details about unbiased MCMC.