###### tags: `generative models` `comments and reviews` `sampling` `one-offs`
# Forward-Backward Markov Models for Density Estimation
**Overview**: The following exposition aims to describe the methods presented in these two papers ([1](http://proceedings.mlr.press/v37/sohl-dickstein15.html), [2](https://hojonathanho.github.io/diffusion/)) in terms of standard Monte Carlo techniques.
## Introduction
Consider the task of fitting a generative model $P_\theta (x)$ to data $x^1, \ldots, x^N \sim P_{\text{Data}}$. Fitting complex distributions to data "in one shot" is often challenging, and so it is useful to develop some sort of bridging strategy, i.e. to transform noise into samples in a guided way.
The approach I will describe here starts by constructing a bridge from the distribution of our samples towards a known, tractable target, and then seeks to form an approximation to the backward path.
## Forward Markov Model
Fix an "easy" target distribution $\gamma$ on the state space of interest. Specify a Markov transition semigroup $( K_t )_{t \geqslant 0}$ which generates an ergodic Markov process which converges in law to $\gamma$ as $t \to \infty$. For sufficiently nice initial distributions $\mu$ and sufficiently large $T$, it should then hold that $\mu K_T \approx \gamma$.
With this in mind, define $\mu_t = P_{\text{Data}} K_t$. Under sensible assumptions, it holds that $\mu_0 = P_{\text{Data}}$, $\mu_\infty = \gamma$, and for large $T$, one expects that $\mu_T \approx \gamma$. This will be the basis of our (approximate) bridge between $P_{\text{Data}}$ and $\gamma$.
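As a quick numerical illustration of this bridge (a sketch, anticipating the Ornstein-Uhlenbeck kernel used in the concrete example below), one can push samples from a "data" distribution through an ergodic kernel whose stationary law is $\gamma = \mathcal{N}(0, 1)$ and watch the marginal $\mu_t$ converge:

```python
# Push an arbitrary data distribution through an ergodic Markov kernel and
# watch the marginal mu_t = P_Data K_t approach the target gamma = N(0, 1).
# The kernel here is the exact Ornstein-Uhlenbeck transition (see below).
import numpy as np

rng = np.random.default_rng(0)

def ou_step(x, delta, rng):
    """Exact OU transition: x -> exp(-delta) x + sqrt(1 - exp(-2 delta)) * noise."""
    return np.exp(-delta) * x + np.sqrt(1.0 - np.exp(-2.0 * delta)) * rng.standard_normal(x.shape)

# "Data": a bimodal 1-d distribution, far from N(0, 1).
x = np.concatenate([rng.normal(-5.0, 0.1, 5000), rng.normal(5.0, 0.1, 5000)])

for _ in range(50):          # 50 steps of size 0.1, i.e. T = 5
    x = ou_step(x, 0.1, rng)

# For large T, the sample mean and variance should be close to (0, 1).
print(x.mean(), x.var())
```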
## Backward Markov Model
Let the time-reversal of $( K_t )_{t \geqslant 0}$ with respect to $\gamma$ be given by $( L_t )_{t \geqslant 0}$, i.e. for any $t > 0$,
\begin{align}
\gamma (dx) K_t (x \to dy) = \gamma (dy) L_t (y \to dx).
\end{align}
For $0 = t_0 < t_1 < \ldots < t_K = T < \infty$, consider now the joint distribution
\begin{align}
Q (dx_{0:K}) \triangleq P_{\text{Data}} (dx_0) \cdot \prod_{0 < k \leqslant K} K_{t_k - t_{k-1}} (x_{k - 1} \to dx_k).
\end{align}
Note that if $P_{\text{Data}} \ll \gamma$, then one can also write
\begin{align}
Q (dx_{0:K}) = \gamma (dx_K) \cdot \prod_{0 \leqslant k < K} L_{t_{k+1} - t_k} (x_{k + 1} \to dx_k) \cdot \frac{d P_{\text{Data}}}{d \gamma} (x_0).
\end{align}
## Twisted Backward Markov Model
In practice, we should expect $P_{\text{Data}}$ and $\gamma$ to be mutually singular, or close to it. However, by modifying the $L$ kernels appropriately, it should be possible to guide the $x_0$-marginal of the process closer to $P_{\text{Data}}$.
Suppose that for $K > k \geqslant 0$, we define kernels $L_{t_{k+1} - t_k}^{\theta, t_{k+1}, t_k} (x_{k+1} \to d x_k)$ which are absolutely continuous with respect to $L_{t_{k+1} - t_k} (x_{k + 1} \to dx_k)$ with explicitly-available Radon-Nikodym derivative, and from which we can sample. One can then define
\begin{align}
P_\theta (dx_{0:K}) = \gamma (dx_K) \cdot \prod_{0 \leqslant k < K} L_{t_{k+1} - t_k}^{\theta, t_{k+1}, t_k} (x_{k + 1} \to dx_k).
\end{align}
For example, if $( K_t )_{t \geqslant 0}, ( L_t )_{t \geqslant 0}$ are the transition operators of a diffusion, then one could define the $L^\theta$ by adding an additional drift term to the SDE. Note that the process in $P_\theta$ need not be stationary, and indeed, it should not be.
## Concrete Example
Let $\gamma (dx) = \mathcal{N} ( dx | 0, I_d )$, and let $K = L$ be the Ornstein-Uhlenbeck semigroup (which is reversible with respect to $\gamma$, so the forward and backward kernels coincide), corresponding to the diffusion
\begin{align}
dX_t = - X_t dt + \sqrt{2} dW_t.
\end{align}
Then
\begin{align}
L_\delta (x \to dy) = \mathcal{N} \left( dy | \exp(-\delta) x, \left(1 - \exp(-2\delta) \right) I \right),
\end{align}
and so one might define
\begin{align}
L_{t_{k+1} - t_k}^{\theta, t_{k+1}, t_k} (x_{k + 1} \to dx_k) &= \mathcal{N} \left( dx_k |\mu^\theta_k ( x_{k+1}, t_{k+1} ), \left(1 - \exp(-2(t_{k+1} - t_k)) \right) I \right) \\
\mu^\theta_k ( x_{k+1}, t_{k+1} ) &= \exp(-(t_{k+1} - t_k)) x_{k+1} + (t_{k+1} - t_k) \beta^\theta (x_{k+1}, t_{k+1} )
\end{align}
where $\beta^\theta (x, t)$ is a parametrised "extra drift" function. This can be viewed as a discretisation of
\begin{align}
dX_t = \left( -X_t - \beta^\theta (X_t, t) \right)dt + \sqrt{2} dW_t
\end{align}
backwards in time (note the "sign error" in $\beta$). The $L^\theta$ can be sampled from because they are still conditionally Gaussian, and the Radon-Nikodym derivatives are available due to Girsanov's theorem.
## Training Objective
One approach to training these models is by maximum likelihood, i.e.
\begin{align}
\max_{\theta \in \Theta} \sum_{i = 1}^N \log P_\theta ( x^i ),
\end{align}
where $P_\theta$ here denotes the $x_0$-marginal of the joint $P_\theta$ distribution. Of course, this marginal is not analytically available in general, and so a standard trick is to apply Importance Sampling / Jensen's inequality to note that
\begin{align}
\log P_\theta ( x_0 ) = \log \left( \int_{x_{1:K}} P_\theta ( dx_{0:K} ) \right) \geqslant \mathbf{E}_{Q ( dx_{1:K} | x_0)} \left[ \log \left( \frac{P_\theta ( x_{0:K})}{ Q (x_{1:K} | x_0 )} \right) \right] = \mathcal{L} (\theta)
\end{align}
where the ratio should be interpreted as a suitable modification of a Radon-Nikodym derivative.
In [2](https://hojonathanho.github.io/diffusion/), the authors optimise $\mathcal{L}$ with respect to $\theta$. They also use the following additional simplification
\begin{align}
\mathbf{E}_{Q ( dx_{1:K} | x_0)} \left[ \log \left( \frac{P_\theta ( x_{0:K})}{ Q (x_{1:K} | x_0 )} \right) \right] &= \mathbf{E}_{Q ( dx_{K} | x_0 )} \left[ \log \left( \frac{ P_\theta (x_K)}{ Q(x_K | x_0 )} \right) \right] \\
&+ \sum_{K > k > 0} \mathbf{E}_{Q ( dx_k, dx_{k+1} | x_0)} \left[ \log \left( \frac{ P_\theta (x_k | x_{k + 1})}{ Q(x_k | x_{k + 1}, x_0 )} \right) \right] \\
&+ \mathbf{E}_{Q ( dx_1 | x_0)} \left[ \log P_\theta ( x_0 | x_1 ) \right].
\end{align}
Note that since $Q$, conditionally on $x_0$, is a Gaussian process with known covariance, the conditional distributions of $Q$ are known exactly. Moreover, the $(K - 1)$ terms in the sum can be expressed in terms of KL divergences between Gaussian measures, and so can be estimated and optimised with low variance. Note also that the first term is independent of $\theta$, as $P_\theta (dx_K) = \gamma (dx_K)$.
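The Gaussian KL divergences appearing in the middle terms have a simple closed form. For isotropic covariances,
$\mathrm{KL}(\mathcal{N}(m_1, s_1^2 I) \,\|\, \mathcal{N}(m_2, s_2^2 I)) = \tfrac{d}{2}\left( \tfrac{s_1^2}{s_2^2} - 1 + \log \tfrac{s_2^2}{s_1^2} \right) + \tfrac{\| m_1 - m_2 \|^2}{2 s_2^2}$,
which can be checked against a Monte Carlo estimate (a standalone sketch, not code from either paper):

```python
# Closed-form KL between two isotropic Gaussians, verified by Monte Carlo.
import numpy as np

rng = np.random.default_rng(3)

def kl_isotropic(m1, s1, m2, s2):
    """KL(N(m1, s1^2 I) || N(m2, s2^2 I)) in closed form."""
    d = m1.shape[0]
    return (0.5 * d * (s1**2 / s2**2 - 1.0 + 2.0 * np.log(s2 / s1))
            + 0.5 * np.sum((m1 - m2) ** 2) / s2**2)

d = 3
m1, s1 = np.full(d, 0.5), 0.8
m2, s2 = np.zeros(d), 1.0

# Monte Carlo estimate: E_{x ~ N(m1, s1^2 I)}[log q1(x) - log q2(x)].
x = m1 + s1 * rng.standard_normal((200_000, d))
log_q1 = -0.5 * np.sum((x - m1) ** 2, axis=1) / s1**2 - d * np.log(s1)
log_q2 = -0.5 * np.sum((x - m2) ** 2, axis=1) / s2**2 - d * np.log(s2)
mc = np.mean(log_q1 - log_q2)

print(kl_isotropic(m1, s1, m2, s2), mc)   # the two should agree closely
```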
There are similarities here with the training of variational autoencoders, though note that in a VAE, one would also train the $Q$. Here, the $Q$ remains fixed, though in principle, one could also adapt it to be closer to the path of $P$, provided the appropriate conditional distributions remain tractable. Some other links between diffusion-based generative models, variational inference, and control problems are elucidated in [3](http://proceedings.mlr.press/v99/tzen19a.html), [4](https://arxiv.org/abs/1905.09883).
Note also that this final training scheme is not "true" maximum likelihood, as an imperfect surrogate objective is maximised instead. In principle, "true" maximum likelihood is possible by (stochastic) gradient ascent on $\theta$, though this could be challenging. Successful estimation of $\nabla_\theta \log P_\theta (x)$ might require some combination of importance sampling, sequential Monte Carlo, or Markov chain Monte Carlo at each step.