###### tags: `integration` `monte carlo` `expository`

# Likelihood Estimation in Latent Variable Models

**Overview**: In this note, I will describe some simple, approximate methods for estimating likelihoods in latent variable models, highlighting how they lie on a spectrum of precision.

## Latent Variable Models

In a number of statistical tasks of interest, it is desirable to model observations as being just one part of a generative process, with other components remaining unobserved. Broadly speaking, the setting is that we have some relatively tractable model $p(x, z | \theta)$, where $\theta$ is the parameter of interest, $x$ is our observation, and $z$ are the unobserved quantities which fill out the joint model.

Note that while the 'actual' likelihood $p(x|\theta)$ does exist, and can be used to define statistical procedures and estimators, it is often unavailable in closed form, which can be frustrating computationally. While the $z$ variables are often of interest in their own right, the 'hard part' of these problems is usually to estimate $\theta$, and when the $z$ are unknown, this can be challenging.

## Likelihood Estimation in Latent Variable Models

For many models, estimation of $\theta$ can be carried out efficiently when $p(x|\theta)$ is known, and so a natural strategy is to form some tractable approximation of $p(x|\theta)$. Here, we look at some approaches to this problem.

We begin by writing out the definition of $p(x | \theta)$ as the marginal density of $x$ under the joint law of $(x, z)$ given $\theta$:

\begin{align}
p(x | \theta) = \int_\mathcal{Z} p (x, z | \theta) \, \mathrm{d}z.
\end{align}

This is an integral over $\mathcal{Z}$, and so we might try to understand it as an expectation over a $\mathcal{Z}$-valued random variable. Introduce an auxiliary density $q(z) = q(z | x, \theta)$ to write

\begin{align}
p(x | \theta) &= \int_\mathcal{Z} p (x, z | \theta) \, \mathrm{d}z \\
&= \int_\mathcal{Z} q(z | x, \theta) \cdot \frac{p (x, z | \theta)}{q(z | x, \theta)} \, \mathrm{d}z \\
&= \mathbf{E}_{q(z|x,\theta)} \left[ \frac{p (x, z | \theta)}{q(z | x, \theta)} \right].
\end{align}

We now detail some strategies for turning this observation into an approximation.

### Constant Integrand

A particularly easy case of this representation would be when the integrand $\frac{p (x, z | \theta)}{q(z | x, \theta)}$ is constant, since then its evaluation at *any* value of $z$ would return an exact estimate of $p(x | \theta)$. It is a useful exercise to check that the $q$ which guarantees this property is given by

\begin{align}
q(z | x, \theta) = p(z | x, \theta),
\end{align}

i.e. the true posterior distribution for the latent variable $z$ under the model. This observation can equally be written as

\begin{align}
\text{for all}\, z \in \mathcal{Z}, \quad p(x|\theta) = \frac{p (x, z | \theta)}{p(z | x, \theta)},
\end{align}

which is known in some circles as the 'candidate's formula', owing to its (apparent) discovery in the examination script of an anonymous university student, as reported by Julian Besag.
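To make the constant-integrand property concrete, here is a minimal sketch in Python. The conjugate Gaussian model (and all of the numbers in it) is a toy example of my own choosing, not drawn from any of the references; the point is simply to check numerically that the ratio $p(x, z | \theta) / p(z | x, \theta)$ does not depend on $z$, and recovers the exact marginal likelihood.

```python
# Toy conjugate Gaussian model (illustrative assumption, not from the references):
#     z ~ N(0, tau^2),   x | z ~ N(z, sigma^2),
# so that both p(x | theta) and p(z | x, theta) are available in closed form.
import numpy as np
from scipy.stats import norm

tau, sigma = 1.5, 0.7   # fixed 'theta' for this illustration
x = 0.83                # a single observation

# Closed-form marginal likelihood: x ~ N(0, tau^2 + sigma^2).
log_marginal = norm.logpdf(x, 0.0, np.sqrt(tau**2 + sigma**2))

# Closed-form posterior: z | x ~ N(m, s2).
s2 = 1.0 / (1.0 / tau**2 + 1.0 / sigma**2)
m = s2 * x / sigma**2

def log_joint(z):
    # log p(x, z | theta) = log p(z | theta) + log p(x | z, theta)
    return norm.logpdf(z, 0.0, tau) + norm.logpdf(x, z, sigma)

def log_posterior(z):
    # log p(z | x, theta)
    return norm.logpdf(z, m, np.sqrt(s2))

# The candidate's formula: the ratio is constant in z and equals p(x | theta).
for z in [-2.0, 0.0, 0.5, 3.0]:
    print(log_joint(z) - log_posterior(z), log_marginal)
```

Swapping `log_posterior` for any density other than the true posterior would make the printed ratio vary with $z$, which is precisely the situation that the approximate strategies below are designed to handle.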
### General $q$, Single Deterministic Point

For $q \neq p$, one might reasonably use this as inspiration to fit a $q(z | x, \theta) \approx p(z | x, \theta)$, and thus argue that

\begin{align}
\text{for nice enough}\, z \in \mathcal{Z}, \quad p(x|\theta) \approx \frac{p (x, z | \theta)}{q(z | x, \theta)}.
\end{align}

Since one might expect such an approximation to be most faithful in the bulk of the two distributions, a typical strategy would be to evaluate this expression at e.g. the mode of $q$, say $z = m_* (x, \theta)$. This gives rise to the approximation

\begin{align}
p(x|\theta) \approx \frac{p (x, z | \theta)}{q(z | x, \theta)} \Bigg\rvert_{z = m_* (x, \theta)}.
\end{align}

One can also view this as approximating

\begin{align}
p(x | \theta) = \mathbf{E}_{q(z|x,\theta)} \left[ \frac{p (x, z | \theta)}{q(z | x, \theta)} \right] \approx \mathbf{E}_{\delta_{m_*(x,\theta)}(\mathrm{d}z)} \left[ \frac{p (x, z | \theta)}{q(z | x, \theta)} \right].
\end{align}

It should be noted that when $q$ is the Gaussian approximation to $p(z|x,\theta)$ with the same mode, and with covariance given by the inverse of the negative Hessian of the joint log-density at that mode, then this is the well-known _Laplace approximation_ to the likelihood. See e.g. [1](https://users.wpi.edu/~balnan/Tierney-Kadane.pdf), [2](http://www.stat.uchicago.edu/~pmcc/pubs/paper26.pdf) for some relevant statistical references.

### General $q$, Multiple Deterministic Points

For structured approximations $q$, more refined deterministic quadrature strategies suggest themselves. For example, when $q$ is a Gaussian distribution, one can change coordinates into a basis in which $q$ is of product form, and apply quadrature rules derived by taking products of 1-dimensional Gauss-Hermite rules. A strategy to this effect is pursued and evaluated in [this work](https://arxiv.org/abs/2102.06801).

### General $q$, Multiple Randomised Points

Another strategy is to use the integral representation of the likelihood to motivate a Monte Carlo estimate, i.e. draw repeated samples from the approximation $q$, evaluate the integrand at those values, and then return the empirical average of these values. This is essentially an _importance sampling_ approach. While asymptotically exact as the number of samples tends to infinity, this strategy can suffer from prohibitively high variance in some practical scenarios, particularly in high-dimensional problems, unless the approximation $q$ is of exceptionally high quality.

## Downstream Applications

My background is largely in Monte Carlo methods as applied to Bayesian inference, so my comments here are confined primarily to this domain.

For the deterministic strategies, one can use the approximations to define a new 'likelihood', and apply conventional Monte Carlo strategies to sample from the corresponding 'posterior', either accepting the bias in full, or using this approximate posterior as the basis for a more involved subsequent approach to the 'true' posterior. See e.g. [1](https://arxiv.org/abs/2004.12550), [2](https://arxiv.org/abs/1701.07844) for applications of the former approach.

For the randomised strategies, the unbiasedness of the importance sampling estimate unlocks the possibility of asymptotically exact Monte Carlo inference via the so-called _Pseudo-Marginal MCMC_ approach (see e.g. [the original paper](https://www.jstor.org/stable/30243645)). As one might expect, the quality of such approaches is contingent on the variability of the inner Monte Carlo estimates, and so there is no free lunch.
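As a concrete illustration, here is a minimal pseudo-marginal sketch in Python, pairing a random-walk Metropolis move on $\theta$ with an importance sampling likelihood estimate of the kind described above. The toy model, the crude proposal $q$, and all tuning constants are my own illustrative assumptions rather than a recipe from the cited papers.

```python
# Pseudo-marginal random-walk Metropolis with an importance sampling estimate
# of p(x | theta). Toy model (illustrative assumption):
#     theta ~ N(0, 5^2),  z | theta ~ N(theta, 1),  x | z ~ N(z, sigma^2).
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
x, sigma = 1.2, 0.5                       # a single observation; x | z ~ N(z, sigma^2)

def log_p_hat(theta, n=64):
    # Log of an unbiased importance sampling estimate of p(x | theta),
    # using the (deliberately crude) proposal q(z) = N(x, 1).
    z = rng.normal(loc=x, scale=1.0, size=n)
    log_w = (norm.logpdf(z, theta, 1.0)   # prior term p(z | theta)
             + norm.logpdf(x, z, sigma)   # observation term p(x | z)
             - norm.logpdf(z, x, 1.0))    # proposal density q(z)
    return logsumexp(log_w) - np.log(n)

def log_prior(theta):
    return norm.logpdf(theta, 0.0, 5.0)   # prior on the parameter theta

# The key detail: the likelihood estimate attached to the current state is
# stored and re-used, *not* refreshed at every iteration.
theta, cur_est = 0.0, log_p_hat(0.0)
samples = []
for _ in range(5000):
    prop = theta + 0.5 * rng.standard_normal()
    prop_est = log_p_hat(prop)
    log_alpha = (prop_est + log_prior(prop)) - (cur_est + log_prior(theta))
    if np.log(rng.uniform()) < log_alpha:
        theta, cur_est = prop, prop_est
    samples.append(theta)

print(np.mean(samples[1000:]), np.std(samples[1000:]))
```

Re-using the current state's estimate is what makes the scheme target the exact posterior over $\theta$. If the importance weights have high variance (a poorly matched $q$, or higher-dimensional $z$), the chain will occasionally accept a grossly over-estimated likelihood and then 'stick' there for many iterations, which is the practical face of the no-free-lunch remark above.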
There are related strategies for the approximate marginalisation of latent variables within MCMC. The idea is to replace the 'pure' importance sampling inner loop (as sketched above) with a Sequential Monte Carlo inner loop. When implemented appropriately, this can mitigate the effects of dimensionality and reduce the variance of the likelihood estimates substantially, though it does not come without additional cost and coding effort.

Ultimately, there is a trade-off between the implementation-side simplicity of running full inference on the joint distribution of $(z, \theta)$, and the potential improvements in posterior exploration which can be enabled by marginalisation.

## Conclusion

In this note, I have described some elementary strategies for the approximate computation of likelihoods in latent variable models. I then described some strategies for embedding these approximations into conventional Monte Carlo methods, and alluded to some of the trade-offs which are associated with approximations of this form.