###### tags: `one-offs` `likelihood-free inference` `surrogate models`

# Iterative Likelihood Approximation

**Overview**: In this note, I describe some strategies for carrying out approximate Bayesian inference in implicit models, and some possible formalisations of what they are hoping to achieve.

## Problem Setting

There is an interesting abstract approach to approximate Bayesian inference in 'intractable' models (broadly construed) which comes up repeatedly in various forms, and which raises some interesting questions. Here, I'll describe the method, and then the questions.

The setting is that we have a nice prior $\eta\left(\theta\right)$ and some less-nice likelihood $L\left(x,\theta\right)$. At best, this likelihood is not conjugate to $\eta$; at worst, it may be defined completely implicitly, and evaluating it exactly is not on the cards.

## A Priori Approximation

A natural computational strategy is to form a friendly approximation $\hat{L}$ to this likelihood, and then use this as the basis for inference. If this approximation is designed before seeing the data, then this is conceptually quite straightforward; it is simply the realm of approximate models.

## Targeted Approximation

A more involved strategy is to design an approximate likelihood *after* seeing the data. Statistically speaking, this is appealing because the approximation does not need to be accurate for all values of $x$, but only for those values of $x$ which lie near the observed data. Similarly, it only needs to be accurate for values of $\theta$ which are relevant under the posterior distribution.

With all of this in mind, what might be a good criterion for obtaining this approximation?
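Before turning to criteria, it may help to fix a concrete picture of the fully implicit case. The following toy stochastic volatility simulator is my own illustrative choice, not an example from any particular application: sampling $x$ given $\theta$ is cheap, but evaluating $L\left(x,\theta\right)$ would require integrating over the whole latent path, which is not on the cards.

```python
import numpy as np

def simulate(theta, T=50, rng=None):
    """Sample x given theta from a toy stochastic volatility model.

    The likelihood of x given theta is an integral over the latent
    AR(1) path z, so it can be sampled from but not evaluated.
    """
    if rng is None:
        rng = np.random.default_rng()
    z = 0.0
    xs = []
    for _ in range(T):
        z = theta * z + rng.standard_normal()              # latent AR(1) state
        xs.append(rng.standard_normal() * np.exp(z / 2.0))  # state-dependent scale
    return np.array(xs)

x_star = simulate(0.7, rng=np.random.default_rng(1))  # "observed" data
```

Any scheme for inferring $\theta$ here must get by with draws from `simulate` alone.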
A common choice is to argue that the approximation should be accurate when averaged over the posterior, leading to objectives like

\begin{align}
\min_{\hat{L}} D\left(\pi\left(\theta\mid x_{*}\right)\cdot L\left(x,\theta\right),\pi\left(\theta\mid x_{*}\right)\cdot\hat{L}\left(x,\theta\right)\right)
\end{align}

for some divergence $D$. Now, we have opened a loop: the posterior distribution is not available to us, but we can construct an approximation to it, perhaps inexactly, by building on our approximate likelihood, and solving

\begin{align}
\min_{\hat{\pi}} \tilde{D}\left(\hat{\pi}\left(\theta\right),\frac{\eta\left(\theta\right)\cdot\hat{L}\left(x_{*},\theta\right)}{Z\left(\hat{L},x_{*}\right)}\right)
\end{align}

with respect to $\hat{\pi}$, where $\tilde{D}$ is again some divergence, perhaps distinct from the previous divergence $D$.

In the conjugacy-centric case, this is similar to the idea of posterior linearisation: with a Gaussian prior, one approximates the likelihood in linear-Gaussian form, and then the approximate posterior can be constructed exactly. In the intractable case, this resembles certain approaches to ABC which form approximations to the likelihood using powerful function approximators.

## Fixed-Point Formulation

Carefully considering the iterations above, one realises that it is possible to write $\left(\hat{\pi},\hat{L}\right)$ as the solution to a fixed-point equation. While this is conceptually nice at first pass, it leaves certain questions open, such as the uniqueness of solutions. Statistically speaking, this seems important. I am then curious as to whether there is more at play, e.g.

1. Can one reformulate this fixed-point equation as a 'natural' minimisation problem, as in variational inference?
2. Are there reasonable conditions on the problem and approximating families which will guarantee existence and uniqueness of solutions?
3. Can one attempt to solve the fixed-point equation directly, and is this a worthwhile approach?
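As a concrete instance of such a fixed-point iteration, here is a minimal numerical sketch in the posterior-linearisation spirit. All specifics (the noisy sine simulator, the Gaussian families, the sample size) are my own illustrative choices: the linear-Gaussian surrogate $\hat{L}$ is refit by least squares on draws from the current approximation $\hat{\pi}$, and $\hat{\pi}$ is then recomputed in closed form by conjugacy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: Gaussian prior, implicit likelihood via a simulator.
mu0, s0 = 0.0, 1.0   # prior eta(theta) = N(mu0, s0^2)
sigma = 0.1          # simulator noise scale
simulate = lambda theta: np.sin(theta) + sigma * rng.standard_normal(theta.shape)

theta_true = 0.5
x_star = np.sin(theta_true)  # "observed" data (noise-free, for clarity)

m, s = mu0, s0  # current Gaussian posterior approximation pi_hat = N(m, s^2)
for _ in range(5):
    # 1. Draw parameters from the current posterior approximation.
    thetas = m + s * rng.standard_normal(500)
    xs = simulate(thetas)
    # 2. Refit the surrogate L_hat(x, theta) = N(x; a*theta + b, r2) by least
    #    squares, so it is accurate where pi_hat currently places its mass.
    a, b = np.polyfit(thetas, xs, 1)
    r2 = np.mean((xs - (a * thetas + b)) ** 2)
    # 3. Conjugate update: Gaussian prior times linear-Gaussian surrogate
    #    gives the new Gaussian approximation in closed form.
    prec = 1.0 / s0**2 + a**2 / r2
    m = (mu0 / s0**2 + a * (x_star - b) / r2) / prec
    s = prec ** -0.5

# After a few sweeps, m sits near theta_true and s is far below the prior scale.
print(m, s)
```

Each sweep couples the two minimisation problems above: the surrogate is fit where $\hat{\pi}$ places mass, and $\hat{\pi}$ is in turn defined through the surrogate, so a stationary pair $\left(\hat{\pi},\hat{L}\right)$ is precisely a fixed point.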
For me, these approaches correspond to a very natural and intuitive process of model refinement and analysis, and so one would like to believe that something sensible is also taking place at the mathematical level. It would be rewarding to establish this with some rigour.