###### tags: `unnormalised models` `one-offs` `vi`
# Approximate Inference in Unnormalised Models by an Expanded Variational Representation
**Overview**: The goal of this note is to describe how the Zellner-Gibbs-Donsker-Varadhan(-...) representation of the Bayesian posterior can be harnessed to provide a new objective function for (approximate) inference in models with unnormalised likelihoods, a notoriously challenging scenario in the Bayesian paradigm. Some incomplete strategies for achieving this are sketched out, with some potential redeeming approaches hinted at.
## A Variational Representation of the Bayesian Posterior Measure
A 1988 paper of Zellner offers the following presentation of Bayes' rule, where $\pi$ is the prior measure and $L(x_*; \theta)$ is the likelihood of the observed data $x_*$: the negative log marginal likelihood of the data can be obtained by solving the minimisation problem
\begin{align}
-\log P(x_*) &= \inf_{\Pi \in \mathcal{P}(\Theta)} \left\{ \mathbf{E}_\Pi \left[ \ell (x_*; \theta) \right] + \text{KL} \left( \Pi (d\theta), \pi (d\theta) \right) \right\} \\
\text{where}\, \ell (x; \theta) &= - \log L (x;\theta),
\end{align}
and the infimum is achieved by taking $\Pi$ to be the standard Bayesian posterior measure.
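This identity can be checked numerically on a finite parameter space, where the infimum is available in closed form. The following sketch (all names and the discrete setup are illustrative assumptions, not from the note) computes $-\log P(x_*)$ directly, then evaluates the variational objective at the Bayesian posterior and confirms the two agree.

```python
import numpy as np

# Numerical check of the Zellner/Gibbs representation on a discrete
# parameter space Theta = {1, ..., K}.
rng = np.random.default_rng(0)

K = 5                                  # number of parameter values
prior = rng.dirichlet(np.ones(K))      # prior pi(theta)
loglik = rng.normal(size=K)            # log L(x_*; theta) for fixed data x_*

# Direct computation: -log P(x_*) = -log sum_theta L(x_*; theta) pi(theta)
neg_log_evidence = -np.log(np.sum(np.exp(loglik) * prior))

# The infimum is attained at the Bayesian posterior Pi ∝ L * pi, where the
# objective is E_Pi[-log L] + KL(Pi, pi).
post = np.exp(loglik) * prior
post /= post.sum()
objective = np.sum(post * (-loglik)) + np.sum(post * np.log(post / prior))

assert np.isclose(neg_log_evidence, objective)
```

Evaluating the objective at any other $\Pi$ (e.g. the uniform measure) gives a strictly larger value, consistent with the infimum being attained at the posterior.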
Note that the mathematical content of this statement predates Zellner, being known variously as the Gibbs variational principle, Donsker-Varadhan variational principle, etc. Zellner's contribution was essentially to make the connection with the form of the Bayesian update.
It should be recognised that many have found great inspiration in this representation of the Bayesian update, and used it as the basis for various generalisations, interpretations and extensions of the standard framework. Some very interesting work has taken place under this umbrella.
## Troubles with Unnormalised Likelihoods
Consider a model in which the likelihood term is *unnormalised*, i.e. takes the form
\begin{align}
L (x; \theta) dx &= \frac{f (x, \theta)}{Z(\theta)} m(dx) \\
\ell(x; \theta) &= \mathcal{V}(x; \theta) - \mathcal{F}(\theta),
\end{align}
where $f$ (resp. $\mathcal{V} = -\log f$) can be evaluated easily, but $Z$ (resp. $\mathcal{F} = -\log Z$) is expensive / intractable to compute. This is a challenging scenario for inference in general, but it deservedly has a reputation for being especially challenging under the Bayesian paradigm, wherein the likelihood is in some sense sacrosanct.
Plugging this expression into Zellner's representation, we see that
\begin{align}
-\log P(x_*) &= \inf_{\Pi \in \mathcal{P}(\Theta)} \left\{ \mathbf{E}_\Pi \left[ \mathcal{V} (x_*; \theta) - \mathcal{F}(\theta) \right] + \text{KL} \left( \Pi (d\theta), \pi (d\theta) \right) \right\}
\end{align}
This hints at the computational challenges of Bayesian inference in this setting, as the expectation with respect to $\Pi$ involves the computation of $\mathcal{F}$, which is assumed to be beyond our reach.
## Partial Tractability via Extended Variational Representation
To restore some degree of tractability, we apply the variational principle again by noting that
\begin{align}
\mathcal{F}(\theta) = \inf_{\Lambda \in \mathcal{P}(\mathcal{X})} \left\{ \mathbf{E}_\Lambda \left[ \mathcal{V} (x; \theta) \right] + \text{KL} \left( \Lambda (dx), m (dx)\right) \right\}.
\end{align}
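This is again the Donsker-Varadhan identity, now applied on the data space, with the infimum attained at the Gibbs measure $\Lambda(dx) \propto e^{-\mathcal{V}(x;\theta)} m(dx)$. A quick discrete-space check (the setup is an illustrative assumption), using the conventions $\mathcal{V} = -\log f$ and $\mathcal{F} = -\log Z$:

```python
import numpy as np

# Check F(theta) = inf_Lambda { E_Lambda[V] + KL(Lambda, m) } on a discrete
# state space, at a single fixed theta.
rng = np.random.default_rng(1)

n = 8
m = rng.dirichlet(np.ones(n))          # reference measure m(dx)
V = rng.normal(size=n)                 # potential V(x; theta) at fixed theta

F = -np.log(np.sum(np.exp(-V) * m))    # F(theta) = -log Z(theta)

# The infimum is attained at the Gibbs measure Lambda ∝ exp(-V) m.
lam = np.exp(-V) * m
lam /= lam.sum()
objective = np.sum(lam * V) + np.sum(lam * np.log(lam / m))

assert np.isclose(F, objective)
```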
Nesting this within our original problem allows us to write
\begin{align}
-\log P(x_*) &= \inf_{\Pi \in \mathcal{P}(\Theta)} \sup_{\Lambda(\cdot | \theta) \in \mathcal{P}(\mathcal{X})} \left\{ \mathbf{E}_\Pi \left[ \mathcal{V} (x_*; \theta) - \left( \mathbf{E}_\Lambda \left[ \mathcal{V} (x; \theta) \right] + \text{KL} \left( \Lambda (dx | \theta), m (dx)\right) \right) \right] + \text{KL} \left( \Pi (d\theta), \pi (d\theta) \right) \right\} \\
&= \inf_{\Pi \in \mathcal{P}(\Theta)} \sup_{\Lambda(\cdot | \theta) \in \mathcal{P}(\mathcal{X})} \left\{ \mathbf{E}_\Pi \left[ \mathcal{V} (x_*; \theta)\right] -\mathbf{E}_{\Pi \otimes \Lambda} \left[ \mathcal{V} (x; \theta)\right] - \mathbf{E}_\Pi \left[ \text{KL} \left( \Lambda (dx | \theta), m (dx)\right) \right] + \text{KL} \left( \Pi (d\theta), \pi (d\theta) \right) \right\} \\
&= \inf_{\Pi \in \mathcal{P}(\Theta)} \sup_{\Lambda(\cdot | \theta) \in \mathcal{P}(\mathcal{X})} \left\{ \mathbf{E}_\Pi \left[ \mathcal{V} (x_*; \theta)\right] - \mathbf{E}_{\Pi \otimes \Lambda} \left[ \mathcal{V} (x; \theta)\right] - \text{KL} \left( \Pi \otimes \Lambda, \Pi \otimes m \right)+ \text{KL} \left( \Pi (d\theta), \pi (d\theta) \right) \right\} \\
&=: \inf_{\Pi \in \mathcal{P}(\Theta)} \sup_{\Lambda(\cdot | \theta) \in \mathcal{P}(\mathcal{X})} \mathcal{S} \left( \Pi, \Lambda \right)
\end{align}
where $\left( \Pi \otimes \Lambda \right)(d\theta, dx) := \Pi(d\theta) \Lambda(dx | \theta)$. Note that the observed data $x_*$ is held fixed in the first expectation, but not the second.
With this representation established, the problem of characterising the Bayesian posterior for this model has shifted from a highly-intractable minimisation problem to a mildly-tractable saddle-point problem. It is up for debate whether this represents a real improvement. Optimistically, one might hope that techniques for the solution of variational inequalities could be of some use here.
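As a sanity check, the saddle-point value can be verified to recover $-\log P(x_*)$ on a small discrete model, by plugging in the known optimisers: the Gibbs conditionals for the inner supremum and the Bayesian posterior for the outer infimum. The discrete model below is an illustrative assumption.

```python
import numpy as np

# Verify inf_Pi sup_Lambda S(Pi, Lambda) = -log P(x_*) on a discrete model.
rng = np.random.default_rng(2)
K, n = 4, 6
prior = rng.dirichlet(np.ones(K))       # pi(theta)
m = rng.dirichlet(np.ones(n))           # reference measure on X
V = rng.normal(size=(K, n))             # V(x; theta) = -log f(x, theta)
x_star = 0                              # index of the observed data point

Z = np.exp(-V) @ m                      # Z(theta) = E_m[exp(-V)]
lik = np.exp(-V[:, x_star]) / Z         # L(x_*; theta), density w.r.t. m
neg_log_evidence = -np.log(np.sum(prior * lik))

# Inner sup: attained at Lambda(.|theta) ∝ exp(-V(., theta)) m.
lam = np.exp(-V) * m
lam /= lam.sum(axis=1, keepdims=True)

# Outer inf: attained at the Bayesian posterior Pi ∝ L * pi.
post = lik * prior
post /= post.sum()

# inner(theta) = E_Lambda[V] + KL(Lambda(.|theta), m), which equals F(theta)
inner = np.sum(lam * V, axis=1) + np.sum(lam * np.log(lam / m), axis=1)
S = np.sum(post * (V[:, x_star] - inner)) + np.sum(post * np.log(post / prior))

assert np.isclose(S, neg_log_evidence)
```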
## Wasserstein Saddle-Point Flows, and Steps towards Particle Implementations
One can compute informally that
\begin{align}
\frac{\delta S}{\delta \Pi} (\theta) &= \mathcal{V} (x_*; \theta) - \mathbf{E}_{\Lambda (dx | \theta)} \left[ \mathcal{V} (x; \theta) \right] - \text{KL} \left( \Lambda (dx | \theta), m (dx)\right) + \log \frac{\Pi (\theta)}{\pi (\theta)} \\
\frac{\delta S}{\delta \Lambda (\cdot | \theta)} (x, \theta) &= - \mathcal{V} (x; \theta) - \log \frac{\Lambda (x | \theta)}{m (x)}.
\end{align}
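These first variations (which are defined only up to additive constants) can be spot-checked by finite differences on a discrete space; the raw partial derivative of the entropy term picks up an extra $+1$ relative to the formula above. The setup below is an illustrative assumption.

```python
import numpy as np

# Finite-difference check of the first variation dS/dPi on a discrete space.
rng = np.random.default_rng(4)
K, n = 4, 6
prior = rng.dirichlet(np.ones(K))
m = rng.dirichlet(np.ones(n))
V = rng.normal(size=(K, n))
x_star = 0

Pi = rng.dirichlet(np.ones(K))            # an arbitrary Pi
Lam = rng.dirichlet(np.ones(n), size=K)   # arbitrary conditionals Lambda(.|theta)

def S(Pi):
    kl_inner = np.sum(Lam * np.log(Lam / m), axis=1)
    inner = V[:, x_star] - np.sum(Lam * V, axis=1) - kl_inner
    return np.sum(Pi * inner) + np.sum(Pi * np.log(Pi / prior))

# The formula from the note (up to an additive constant):
kl_inner = np.sum(Lam * np.log(Lam / m), axis=1)
formula = (V[:, x_star] - np.sum(Lam * V, axis=1) - kl_inner
           + np.log(Pi / prior))

# Central differences in each unnormalised coordinate direction.
eps = 1e-6
fd = np.empty(K)
for k in range(K):
    up, dn = Pi.copy(), Pi.copy()
    up[k] += eps
    dn[k] -= eps
    fd[k] = (S(up) - S(dn)) / (2 * eps)

assert np.allclose(fd, formula + 1.0, atol=1e-5)
```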
Suppose that we try to solve this saddle-point problem through a continuous-time ascent-descent dynamics, akin to a gradient flow, but targeted towards solving saddle-point problems. At the PDE level, this would look like
\begin{align}
\partial_t \pi (t, \theta) &- \text{div}_\theta \left( \pi (t, \theta) \nabla_\theta \frac{\delta S}{\delta \Pi} (\pi(t, \theta), \theta)\right) = 0 \\
\partial_t \lambda(t, x, \theta) &+ \text{div}_x \left( \lambda (t, x, \theta) \nabla_x \frac{\delta S}{\delta \Lambda (\cdot | \theta)} (\lambda (t, x, \theta), (x, \theta) ) \right) = 0,
\end{align}
noting the sign difference between the two rows.
At the particle level, one might consider approximating
\begin{align}
\pi(t, d\theta) &\approx \frac{1}{N} \sum_{a \in [N]} \delta (\theta_t^a, d\theta) \\
\lambda (t, dx, \theta^a) &\approx \frac{1}{M} \sum_{b \in [M]} \delta (x_t^{a, b}, dx), \quad \text{for} \, a \in [N]
\end{align}
and evolve the particles according to the dynamics
\begin{align}
d\theta_t^a &= - \left.\nabla_\theta \left( \mathcal{V} (x_*; \theta) - \mathbf{E}_{\Lambda (dx | \theta)} \left[ \mathcal{V} (x; \theta) \right] - \text{KL} \left( \Lambda (dx | \theta), m (dx)\right) + \log \frac{\Pi (\theta)}{\pi (\theta)} \right) \right\vert_{\theta = \theta_t^a, \Pi = \pi, \Lambda = \lambda(\cdot|\theta_t^a)} \, dt\\
&\sim \left( -\nabla_\theta \mathcal{V} (x_*; \theta_t^a) + \frac{1}{M} \sum_{b \in [M]} \nabla_\theta \mathcal{V} (x_t^{a, b}; \theta_t^a) \right) dt + \nabla_\theta \text{KL} \left( \Lambda (dx | \theta_t^a ), m (dx)\right) dt + \nabla_\theta \log \pi (\theta_t^a) dt + \sqrt{2} dW_t^a \\
dx_t^{a, b} &= \left. \nabla_x \left( - \mathcal{V} (x; \theta) - \log \frac{\Lambda (x | \theta)}{m (x)} \right) \right\rvert_{x = x_t^{a, b}, \theta = \theta_t^a, \Lambda = \lambda(\cdot|\theta^a)} dt \\
&\sim \left( - \nabla_x \mathcal{V} (x_t^{a, b}; \theta_t^a) + \nabla_x \log m (x_t^{a,b} ) \right) dt + \sqrt{2} dW_t^{a, b}.
\end{align}
Unfortunately, the $\text{KL}$ term in the $\theta$ dynamics is unavailable, and so this is not quite implementable! Nevertheless, there may be favourable modifications to the dynamics which resolve this issue.
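One naive modification is simply to drop the $\nabla_\theta \text{KL}$ term and simulate the remaining coupled Langevin dynamics. The sketch below does exactly this, so it implements a *modified* dynamics rather than the exact saddle-point flow; the model ($\mathcal{V}(x;\theta) = e^\theta x^2 / 2$, standard normal prior on $\theta$, flat reference measure $m$ so that $\nabla_x \log m = 0$) and all parameter values are illustrative assumptions.

```python
import numpy as np

# Euler-Maruyama discretisation of the coupled particle dynamics, with the
# intractable grad_theta KL term dropped (a naive, uncorrected modification).
rng = np.random.default_rng(3)

def grad_V_theta(x, theta):   # d/dtheta of exp(theta) x^2 / 2
    return np.exp(theta) * x**2 / 2

def grad_V_x(x, theta):       # d/dx of exp(theta) x^2 / 2
    return np.exp(theta) * x

def grad_log_prior(theta):    # standard normal prior on theta
    return -theta

N, M = 50, 20                 # outer (theta) and inner (x) particle counts
h, T = 1e-3, 2000             # step size and number of steps
x_star = 1.0                  # the observed data point (illustrative)

theta = rng.normal(size=N)    # theta particles
x = rng.normal(size=(N, M))   # x particles, one cloud per theta particle

for _ in range(T):
    # descent in theta: -grad_theta V(x_*) + mean_b grad_theta V(x^{a,b})
    # + grad log prior, plus Langevin noise
    inner_mean = grad_V_theta(x, theta[:, None]).mean(axis=1)
    drift = -grad_V_theta(x_star, theta) + inner_mean + grad_log_prior(theta)
    theta = theta + h * drift + np.sqrt(2 * h) * rng.normal(size=N)
    # ascent in x: -grad_x V plus Langevin noise (flat m, so no m-term)
    x = x - h * grad_V_x(x, theta[:, None]) + np.sqrt(2 * h) * rng.normal(size=(N, M))
```

Because the $\text{KL}$ term is missing from the $\theta$ drift, the $\theta$-marginal of the stationary distribution will generally be biased away from the true posterior; the sketch is intended only to show the shape of a particle implementation.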
## Conclusion
In this note, I have sketched out how certain variational representations of statistical quantities can be used to construct a saddle-point problem whose solution recovers the Bayesian posterior measure, with the key feature that it is in some sense 'tractable', even when the likelihood of the data comes with an unknown normalising constant. Unfortunately, while this representation comes tantalisingly close to providing some simple avenues towards efficient posterior inference, it does not quite get there. It will be interesting to see whether any of the strategies sketched herein can be developed into a fully-fledged method for Bayesian inference (approximate or otherwise) in models with unnormalised likelihoods.
Published on **HackMD**