###### tags: `unnormalised models` `one-offs` `vi`
# Approximate Inference in Unnormalised Models by an Expanded Variational Representation
**Overview**: The goal of this note is to describe how the Zellner-Gibbs-Donsker-Varadhan(-...) representation of the Bayesian posterior can be harnessed to provide a new objective function for (approximate) inference in models with unnormalised likelihoods, a notoriously challenging scenario in the Bayesian paradigm. Some incomplete strategies for achieving this are sketched out, with some potential redeeming approaches hinted at.
## A Variational Representation of the Bayesian Posterior Measure
A 1988 paper of Zellner offers the following presentation of Bayes' rule, where $\pi$ is the prior measure and $L(x_*; \theta)$ is the likelihood of the observed data $x_*$: the negative log marginal likelihood of the data can be obtained by solving the minimisation problem
\begin{align}
-\log P(x_*) &= \inf_{\Pi \in \mathcal{P}(\Theta)} \left\{ \mathbf{E}_\Pi \left[ \ell (x_*; \theta) \right] + \text{KL} \left( \Pi (d\theta), \pi (d\theta) \right) \right\} \\
\text{where}\, \ell (x; \theta) &= - \log L (x;\theta),
\end{align}
and the infimum is achieved by taking $\Pi$ to be the standard Bayesian posterior measure.
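This identity can be checked numerically on a finite parameter space, where the infimum is available in closed form. The following sketch (all names and the discrete setup are illustrative assumptions, not from the note) computes $-\log P(x_*)$ directly, then evaluates the variational objective at the Bayesian posterior and confirms the two agree.

```python
import numpy as np

# Numerical check of the Zellner/Gibbs representation on a discrete
# parameter space Theta = {1, ..., K}.
rng = np.random.default_rng(0)

K = 5                                  # number of parameter values
prior = rng.dirichlet(np.ones(K))      # prior pi(theta)
loglik = rng.normal(size=K)            # log L(x_*; theta) for fixed data x_*

# Direct computation: -log P(x_*) = -log sum_theta L(x_*; theta) pi(theta)
neg_log_evidence = -np.log(np.sum(np.exp(loglik) * prior))

# The infimum is attained at the Bayesian posterior Pi ∝ L * pi, where the
# objective is E_Pi[-log L] + KL(Pi, pi).
post = np.exp(loglik) * prior
post /= post.sum()
objective = np.sum(post * (-loglik)) + np.sum(post * np.log(post / prior))

assert np.isclose(neg_log_evidence, objective)
```

Evaluating the objective at any other $\Pi$ (e.g. the uniform measure) gives a strictly larger value, consistent with the infimum being attained at the posterior.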
Note that the mathematical content of this statement predates Zellner, being known variously as the Gibbs variational principle, Donsker-Varadhan variational principle, etc. Zellner's contribution was essentially to make the connection with the form of the Bayesian update.
It should be recognised that many have found great inspiration in this representation of the Bayesian update, and used it as the basis for various generalisations, interpretations and extensions of the standard framework. Some very interesting work has taken place under this umbrella.
## Troubles with Unnormalised Likelihoods
Consider a model in which the likelihood term is *unnormalised*, i.e. takes the form
\begin{align}
L (x; \theta) dx &= \frac{f (x, \theta)}{Z(\theta)} m(dx) \\
\ell(x; \theta) &= \mathcal{V}(x; \theta) - \mathcal{F}(\theta),
\end{align}
where $f$ (resp. $\mathcal{V} = -\log f$) can be evaluated easily, but $Z$ (resp. $\mathcal{F} = -\log Z$) is expensive / intractable to compute. This is a challenging scenario for inference in general, but it deservedly has a reputation for being especially challenging under the Bayesian paradigm, wherein the likelihood is in some sense sacrosanct.
Plugging this expression into Zellner's representation, we see that
\begin{align}
-\log P(x_*) &= \inf_{\Pi \in \mathcal{P}(\Theta)} \left\{ \mathbf{E}_\Pi \left[ \mathcal{V} (x_*; \theta) - \mathcal{F}(\theta) \right] + \text{KL} \left( \Pi (d\theta), \pi (d\theta) \right) \right\}
\end{align}
This hints at the computational challenges of Bayesian inference in this setting, as the expectation with respect to $\Pi$ involves the computation of $\mathcal{F}$, which is assumed to be beyond our reach.
## Partial Tractability via Extended Variational Representation
To restore some degree of tractability, we apply the variational principle again by noting that
\begin{align}
\mathcal{F}(\theta) = \inf_{\Lambda \in \mathcal{P}(\mathcal{X})} \left\{ \mathbf{E}_\Lambda \left[ \mathcal{V} (x; \theta) \right] + \text{KL} \left( \Lambda (dx), m (dx)\right) \right\}.
\end{align}
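This is again the Donsker-Varadhan identity, now applied on the data space, with the infimum attained at the Gibbs measure $\Lambda(dx) \propto e^{-\mathcal{V}(x;\theta)} m(dx)$. A quick discrete-space check (the setup is an illustrative assumption), using the conventions $\mathcal{V} = -\log f$ and $\mathcal{F} = -\log Z$:

```python
import numpy as np

# Check F(theta) = inf_Lambda { E_Lambda[V] + KL(Lambda, m) } on a discrete
# state space, at a single fixed theta.
rng = np.random.default_rng(1)

n = 8
m = rng.dirichlet(np.ones(n))          # reference measure m(dx)
V = rng.normal(size=n)                 # potential V(x; theta) at fixed theta

F = -np.log(np.sum(np.exp(-V) * m))    # F(theta) = -log Z(theta)

# The infimum is attained at the Gibbs measure Lambda ∝ exp(-V) m.
lam = np.exp(-V) * m
lam /= lam.sum()
objective = np.sum(lam * V) + np.sum(lam * np.log(lam / m))

assert np.isclose(F, objective)
```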
Nesting this within our original problem allows us to write
\begin{align}
-\log P(x_*) &= \inf_{\Pi \in \mathcal{P}(\Theta)} \sup_{\Lambda(\cdot | \theta) \in \mathcal{P}(\mathcal{X})} \left\{ \mathbf{E}_\Pi \left[ \mathcal{V} (x_*; \theta) - \left( \mathbf{E}_\Lambda \left[ \mathcal{V} (x; \theta) \right] + \text{KL} \left( \Lambda (dx | \theta), m (dx)\right) \right) \right] + \text{KL} \left( \Pi (d\theta), \pi (d\theta) \right) \right\} \\
&= \inf_{\Pi \in \mathcal{P}(\Theta)} \sup_{\Lambda(\cdot | \theta) \in \mathcal{P}(\mathcal{X})} \left\{ \mathbf{E}_\Pi \left[ \mathcal{V} (x_*; \theta)\right] -\mathbf{E}_{\Pi \otimes \Lambda} \left[ \mathcal{V} (x; \theta)\right] - \mathbf{E}_\Pi \left[ \text{KL} \left( \Lambda (dx | \theta), m (dx)\right) \right] + \text{KL} \left( \Pi (d\theta), \pi (d\theta) \right) \right\} \\
&= \inf_{\Pi \in \mathcal{P}(\Theta)} \sup_{\Lambda(\cdot | \theta) \in \mathcal{P}(\mathcal{X})} \left\{ \mathbf{E}_\Pi \left[ \mathcal{V} (x_*; \theta)\right] - \mathbf{E}_{\Pi \otimes \Lambda} \left[ \mathcal{V} (x; \theta)\right] - \text{KL} \left( \Pi \otimes \Lambda, \Pi \otimes m \right)+ \text{KL} \left( \Pi (d\theta), \pi (d\theta) \right) \right\} \\
&=: \inf_{\Pi \in \mathcal{P}(\Theta)} \sup_{\Lambda(\cdot | \theta) \in \mathcal{P}(\mathcal{X})} \mathcal{S} \left( \Pi, \Lambda \right)
\end{align}
where $\left( \Pi \otimes \Lambda \right)(d\theta, dx) := \Pi(d\theta) \Lambda(dx | \theta)$. Note that the observed data $x_*$ is held fixed in the first expectation, but not the second.
With this representation established, the problem of characterising the Bayesian posterior for this model has shifted from a highly-intractable minimisation problem to a mildly-tractable saddle-point problem. It is up for debate whether this represents a real improvement. Optimistically, one might hope that techniques for the solution of variational inequalities could be of some use here.
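As a sanity check, the saddle-point value can be verified to recover $-\log P(x_*)$ on a small discrete model, by plugging in the known optimisers: the Gibbs conditionals for the inner supremum and the Bayesian posterior for the outer infimum. The discrete model below is an illustrative assumption.

```python
import numpy as np

# Verify inf_Pi sup_Lambda S(Pi, Lambda) = -log P(x_*) on a discrete model.
rng = np.random.default_rng(2)
K, n = 4, 6
prior = rng.dirichlet(np.ones(K))       # pi(theta)
m = rng.dirichlet(np.ones(n))           # reference measure on X
V = rng.normal(size=(K, n))             # V(x; theta) = -log f(x, theta)
x_star = 0                              # index of the observed data point

Z = np.exp(-V) @ m                      # Z(theta) = E_m[exp(-V)]
lik = np.exp(-V[:, x_star]) / Z         # L(x_*; theta), density w.r.t. m
neg_log_evidence = -np.log(np.sum(prior * lik))

# Inner sup: attained at Lambda(.|theta) ∝ exp(-V(., theta)) m.
lam = np.exp(-V) * m
lam /= lam.sum(axis=1, keepdims=True)

# Outer inf: attained at the Bayesian posterior Pi ∝ L * pi.
post = lik * prior
post /= post.sum()

# inner(theta) = E_Lambda[V] + KL(Lambda(.|theta), m), which equals F(theta)
inner = np.sum(lam * V, axis=1) + np.sum(lam * np.log(lam / m), axis=1)
S = np.sum(post * (V[:, x_star] - inner)) + np.sum(post * np.log(post / prior))

assert np.isclose(S, neg_log_evidence)
```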
## Wasserstein Saddle-Point Flows, and Steps towards Particle Implementations
One can compute informally that
\begin{align}
\frac{\delta S}{\delta \Pi} (\theta) &= \mathcal{V} (x_*; \theta) - \mathbf{E}_{\Lambda (dx | \theta)} \left[ \mathcal{V} (x; \theta) \right] - \text{KL} \left( \Lambda (dx | \theta), m (dx)\right) + \log \frac{\Pi (\theta)}{\pi (\theta)} \\
\frac{\delta S}{\delta \Lambda (\cdot | \theta)} (x, \theta) &= - \mathcal{V} (x; \theta) - \log \frac{\Lambda (x | \theta)}{m (x)}.
\end{align}
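These first variations (which are defined only up to additive constants) can be spot-checked by finite differences on a discrete space; the raw partial derivative of the entropy term picks up an extra $+1$ relative to the formula above. The setup below is an illustrative assumption.

```python
import numpy as np

# Finite-difference check of the first variation dS/dPi on a discrete space.
rng = np.random.default_rng(4)
K, n = 4, 6
prior = rng.dirichlet(np.ones(K))
m = rng.dirichlet(np.ones(n))
V = rng.normal(size=(K, n))
x_star = 0

Pi = rng.dirichlet(np.ones(K))            # an arbitrary Pi
Lam = rng.dirichlet(np.ones(n), size=K)   # arbitrary conditionals Lambda(.|theta)

def S(Pi):
    kl_inner = np.sum(Lam * np.log(Lam / m), axis=1)
    inner = V[:, x_star] - np.sum(Lam * V, axis=1) - kl_inner
    return np.sum(Pi * inner) + np.sum(Pi * np.log(Pi / prior))

# The formula from the note (up to an additive constant):
kl_inner = np.sum(Lam * np.log(Lam / m), axis=1)
formula = (V[:, x_star] - np.sum(Lam * V, axis=1) - kl_inner
           + np.log(Pi / prior))

# Central differences in each unnormalised coordinate direction.
eps = 1e-6
fd = np.empty(K)
for k in range(K):
    up, dn = Pi.copy(), Pi.copy()
    up[k] += eps
    dn[k] -= eps
    fd[k] = (S(up) - S(dn)) / (2 * eps)

assert np.allclose(fd, formula + 1.0, atol=1e-5)
```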
Suppose that we try to solve this saddle-point problem through a continuous-time ascent-descent dynamics, akin to a gradient flow, but targeted towards solving saddle-point problems. At the PDE level, this would look like
\begin{align}
\partial_t \pi (t, \theta) &- \text{div}_\theta \left( \pi (t, \theta) \nabla_\theta \frac{\delta S}{\delta \Pi} (\pi(t, \theta), \theta)\right) = 0 \\
\partial_t \lambda(t, x, \theta) &+ \text{div}_x \left( \lambda (t, x, \theta) \nabla_x \frac{\delta S}{\delta \Lambda (\cdot | \theta)} (\lambda (t, x, \theta), (x, \theta) ) \right) = 0,
\end{align}
noting the sign difference between the two rows.
At the particle level, one might consider approximating
\begin{align}
\pi(t, d\theta) &\approx \frac{1}{N} \sum_{a \in [N]} \delta (\theta_t^a, d\theta) \\
\lambda (t, dx, \theta^a) &\approx \frac{1}{M} \sum_{b \in [M]} \delta (x_t^{a, b}, dx), \quad \text{for} \, a \in [N]
\end{align}
and evolve the particles according to the dynamics
\begin{align}
d\theta_t^a &= - \left.\nabla_\theta \left( \mathcal{V} (x_*; \theta) - \mathbf{E}_{\Lambda (dx | \theta)} \left[ \mathcal{V} (x; \theta) \right] - \text{KL} \left( \Lambda (dx | \theta), m (dx)\right) + \log \frac{\Pi (\theta)}{\pi (\theta)} \right) \right\vert_{\theta = \theta_t^a, \Pi = \pi, \Lambda = \lambda(\cdot|\theta_t^a)} \, dt\\
&\sim \left( -\nabla_\theta \mathcal{V} (x_*; \theta_t^a) + \frac{1}{M} \sum_{b \in [M]} \nabla_\theta \mathcal{V} (x_t^{a, b}; \theta_t^a) \right) dt + \nabla_\theta \text{KL} \left( \Lambda (dx | \theta_t^a ), m (dx)\right) dt + \nabla_\theta \log \pi (\theta_t^a) dt + \sqrt{2} dW_t^a \\
dx_t^{a, b} &= \left. \nabla_x \left( - \mathcal{V} (x; \theta) - \log \frac{\Lambda (x | \theta)}{m (x)} \right) \right\rvert_{x = x_t^{a, b}, \theta = \theta_t^a, \Lambda = \lambda(\cdot|\theta^a)} dt \\
&\sim \left( - \nabla_x \mathcal{V} (x_t^{a, b}; \theta_t^a) + \nabla_x \log m (x_t^{a,b} ) \right) dt + \sqrt{2} dW_t^{a, b}.
\end{align}
Unfortunately, the $\text{KL}$ term in the $\theta$ dynamics is unavailable, and so this is not quite implementable! Nevertheless, there may be favourable modifications to the dynamics which resolve this issue.
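One naive modification is simply to drop the $\nabla_\theta \text{KL}$ term and simulate the remaining coupled Langevin dynamics. The sketch below does exactly this, so it implements a *modified* dynamics rather than the exact saddle-point flow; the model ($\mathcal{V}(x;\theta) = e^\theta x^2 / 2$, standard normal prior on $\theta$, flat reference measure $m$ so that $\nabla_x \log m = 0$) and all parameter values are illustrative assumptions.

```python
import numpy as np

# Euler-Maruyama discretisation of the coupled particle dynamics, with the
# intractable grad_theta KL term dropped (a naive, uncorrected modification).
rng = np.random.default_rng(3)

def grad_V_theta(x, theta):   # d/dtheta of exp(theta) x^2 / 2
    return np.exp(theta) * x**2 / 2

def grad_V_x(x, theta):       # d/dx of exp(theta) x^2 / 2
    return np.exp(theta) * x

def grad_log_prior(theta):    # standard normal prior on theta
    return -theta

N, M = 50, 20                 # outer (theta) and inner (x) particle counts
h, T = 1e-3, 2000             # step size and number of steps
x_star = 1.0                  # the observed data point (illustrative)

theta = rng.normal(size=N)    # theta particles
x = rng.normal(size=(N, M))   # x particles, one cloud per theta particle

for _ in range(T):
    # descent in theta: -grad_theta V(x_*) + mean_b grad_theta V(x^{a,b})
    # + grad log prior, plus Langevin noise
    inner_mean = grad_V_theta(x, theta[:, None]).mean(axis=1)
    drift = -grad_V_theta(x_star, theta) + inner_mean + grad_log_prior(theta)
    theta = theta + h * drift + np.sqrt(2 * h) * rng.normal(size=N)
    # ascent in x: -grad_x V plus Langevin noise (flat m, so no m-term)
    x = x - h * grad_V_x(x, theta[:, None]) + np.sqrt(2 * h) * rng.normal(size=(N, M))
```

Because the $\text{KL}$ term is missing from the $\theta$ drift, the $\theta$-marginal of the stationary distribution will generally be biased away from the true posterior; the sketch is intended only to show the shape of a particle implementation.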
## Conclusion
In this note, I have sketched out how certain variational representations of statistical quantities can be used to construct a saddle-point problem whose solution recovers the Bayesian posterior measure, with the key feature that it is in some sense 'tractable', even when the likelihood of the data comes with an unknown normalising constant. Unfortunately, while this representation comes tantalisingly close to providing some simple avenues towards efficient posterior inference, it does not quite get there. It will be interesting to see whether any of the strategies sketched herein can be developed into a fully-fledged method for Bayesian inference (approximate or otherwise) in models with unnormalised likelihoods.
Published on **HackMD**