# 1D Denoising Diffusion

## Intro
We consider the Markov chain with transitions
$$
x_t \mid x_{t-1} \sim N(\sqrt{1-\beta_t}x_{t-1}, \beta_tI), \quad t=1,...,T
$$
and initial state $x_0 \sim p(x_0)$. With suitable choices of noise-scales $\beta_t$ and sufficiently large $T$ we have that the final state $x_T$ is approximately Gaussian and independent of $x_0$, i.e.,
$$
p(x_T \mid x_0) \approx p(x_T) \approx N(0,I).
$$
Additionally, for small noise steps the reverse transitions approximately preserve the functional form of the forward process, that is,
$$
x_{t-1} \mid x_t \sim N(\mu_\theta(x_t,t),\Sigma_\theta(x_t,t)).
$$
Generative modeling based on these observations seeks to learn the reverse transition distributions $Q_{\theta,t} = N(x; \mu_\theta(x_t,t),\Sigma_\theta(x_t,t))$. After training, approximate samples from the prior $p(x_0)$ can then be generated by sampling $x_T \sim N(0,I)$ and iteratively sampling $x_{t-1} \sim Q_{\theta,t}$ for $t = T,\dots,1$.
There are many possible parametrisations of the reverse transition distributions. We follow the one suggested in [Ho et al. 2020, Denoising Diffusion Probabilistic Models](https://arxiv.org/pdf/2006.11239).
## Toy problem
We define a simple Gaussian mixture prior on $x_0$ by
$$
x_0 \sim \frac 1 2 N(-1,\sigma^2=0.1) + \frac 1 2 N(1,\sigma^2=0.1).
$$

For the forward transitions we use $\beta_t$ linearly increasing from 0.001 to 0.02, and $T=100$ timesteps. Sampling the forward process for 1000 points from the prior looks like this.
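As a concrete sketch (NumPy; the variable names are our own), the forward chain for this toy prior can be simulated like so:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100
betas = np.linspace(0.001, 0.02, T)          # beta_1, ..., beta_T

# Sample 1000 points from the Gaussian mixture prior on x_0.
n = 1000
comp = rng.integers(0, 2, size=n)            # mixture component (weights 1/2, 1/2)
x0 = np.where(comp == 0, -1.0, 1.0) + np.sqrt(0.1) * rng.standard_normal(n)

# Run the forward chain x_t | x_{t-1} ~ N(sqrt(1 - beta_t) x_{t-1}, beta_t).
xs = [x0]
x = x0
for beta in betas:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(n)
    xs.append(x)
xs = np.stack(xs)                            # shape (T + 1, n): one row per timestep
```

The last row `xs[-1]` holds the (approximately standard normal) samples of $x_T$.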

One can derive the distribution of $x_t$ conditioned on $x_0$ in closed form. Let $\alpha_t := 1-\beta_t$ and $\bar\alpha_t := \prod_{s=1}^t\alpha_s$. Then
$$
x_t \mid x_0 \sim N(\sqrt{\bar\alpha_t}x_0, (1-\bar\alpha_t)I)
$$
(Note: for our choice of $T$ and $\beta_t$ we do not have $\bar\alpha_T \approx 0$; it nevertheless holds that $x_T$ is approximately $N(0,1)$ due to the choice of prior on $x_0$. A typical choice of $T$ in the literature is 1000.)
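This identity is easy to verify numerically. A small deterministic sketch, checking that the step-wise mean-scale and variance recursions of the forward chain reproduce $\sqrt{\bar\alpha_t}$ and $1-\bar\alpha_t$:

```python
import numpy as np

T = 100
betas = np.linspace(0.001, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)              # alpha_bar_t = prod_{s <= t} alpha_s

# x_t | x_0 has mean scale_t * x_0 and variance var_t, with the recursions
# scale_t = sqrt(alpha_t) * scale_{t-1}  and  var_t = alpha_t * var_{t-1} + beta_t,
# starting from scale_0 = 1, var_0 = 0. Check against the closed form.
scale, var = 1.0, 0.0
for t in range(T):
    scale *= np.sqrt(alphas[t])
    var = alphas[t] * var + betas[t]
    assert np.isclose(scale, np.sqrt(alpha_bars[t]))
    assert np.isclose(var, 1.0 - alpha_bars[t])

print(alpha_bars[-1])                        # roughly 0.35: not close to 0 for T = 100
```

The printout also illustrates the note above: with these noise scales, $\bar\alpha_T$ is far from zero.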
It can be shown that the reverse mean $\mu_\theta$ must predict
$$
\frac{1}{\sqrt{\alpha_t}}(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon)
$$
given $x_t$, where $\epsilon\sim N(0,I)$ is the noise used to produce $x_t$ from $x_0$. Reparametrising to instead predict $\epsilon$ with a network $\epsilon_\theta(x_t,t)$ for a given timestep can be shown to lead to the following variational lower bound (derivation omitted, see e.g. [Ho et al. 2020][1])
$$
\mathbb{E}_{x_0,\epsilon,t}\left[ \frac{\beta^2_t}{2\sigma^2_t\alpha_t(1-\bar\alpha_t)} \Vert \epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t}x_0 + {\sqrt{1-\bar\alpha_t}}\epsilon, t)\Vert^2 \right].
$$
In practice, a reweighted version of this is used as a loss for $\epsilon_\theta$, namely
$$
L_{\text{simple}} := \mathbb{E}_{x_0,\epsilon,t}\left[ \Vert \epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t}x_0 + {\sqrt{1-\bar\alpha_t}}\epsilon, t)\Vert^2 \right].
$$
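As a sketch (NumPy, with `eps_model` standing in for any noise predictor; the names are our own), a Monte Carlo estimate of $L_{\text{simple}}$ can be written as:

```python
import numpy as np

def l_simple(eps_model, x0, t, eps, alpha_bars):
    """Monte Carlo estimate of L_simple for a batch.

    eps_model(x_t, t) is any callable predicting the noise; x0, t, eps are
    batches of data points, timesteps (1-indexed) and N(0, 1) noise draws.
    """
    ab = alpha_bars[t - 1]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps    # forward-noised sample
    return np.mean((eps - eps_model(x_t, t)) ** 2)

# Sanity check: an oracle that recovers eps exactly gives (numerically) zero loss.
T = 100
alpha_bars = np.cumprod(1.0 - np.linspace(0.001, 0.02, T))
rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
t = rng.integers(1, T + 1, size=8)
eps = rng.standard_normal(8)
oracle = lambda x_t, t: (x_t - np.sqrt(alpha_bars[t - 1]) * x0) / np.sqrt(1.0 - alpha_bars[t - 1])
loss = l_simple(oracle, x0, t, eps, alpha_bars)
```

In training, `x0`, `t` and `eps` are drawn fresh for every batch.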
## Sampling
The previous loss can be used for training a neural network $\epsilon_\theta$ using stochastic gradient descent. After training, samples can be generated by first sampling
$$
x_T \sim N(0,I)
$$
and then for $t = T,...,2$
$$
z \sim N(0,1) \\
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\right) + \sigma_tz
$$
and finally
$$
x_{0} = \frac{1}{\sqrt{\alpha_1}}\left(x_1 - \frac{1-\alpha_1}{\sqrt{1-\bar\alpha_1}}\epsilon_\theta(x_1,1)\right).
$$
Here there are some choices for $\sigma_t$. [Ho et al. 2020][1] test two different choices with no clear difference in results, so let's use the simpler one, $\sigma^2_t=\beta_t$.
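The full procedure can be sketched as follows (NumPy; `eps_model` stands in for the trained network, and the function name is our own):

```python
import numpy as np

def sample(eps_model, n, betas, rng):
    """Ancestral sampling per the scheme above, with sigma_t^2 = beta_t.

    eps_model(x_t, t) is a trained noise predictor (here, any callable).
    """
    T = len(betas)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(n)                        # x_T ~ N(0, 1)
    for t in range(T, 0, -1):                         # t = T, ..., 1
        mean = (x - (1.0 - alphas[t - 1]) / np.sqrt(1.0 - alpha_bars[t - 1])
                * eps_model(x, t)) / np.sqrt(alphas[t - 1])
        z = rng.standard_normal(n) if t > 1 else 0.0  # no noise on the final step
        x = mean + np.sqrt(betas[t - 1]) * z
    return x

# Smoke test with a stand-in predictor (the trained network would go here):
rng = np.random.default_rng(0)
betas = np.linspace(0.001, 0.02, 100)
samples = sample(lambda x, t: np.zeros_like(x), 1000, betas, rng)
```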
Training a small MLP (2 => 16, relu, 16 => 4, relu, 4 => 1, taking $(x_t, t)$ as input) on the problem defined above using Adam and a batch size of 1024 yields the following loss curve.

Sampling using the trained network yields the blue distribution, where the red line is the true prior.

## Connection to Langevin dynamics and likelihood guiding
Up to this point we have seen how we can approximate a prior on $x_0$. We now consider the process where we additionally have a lossy observation $y$ of $x_0$, and the problem of sampling the posterior $p(x_0 \mid y)$.

First, we sketch a connection between the previous sampling procedure and Langevin dynamics. Let $\pi(x)$ be a differentiable probability density. [Apparently](https://abdulfatir.com/blog/2020/Langevin-Monte-Carlo/), $\pi$ is the stationary distribution of the SDE
$$
dX_t = \nabla\log\pi(X_t)\,dt + \sqrt{2}\,dB_t.
$$
Discretising this using the Euler-Maruyama method yields the following approximation
$$
X_{t+\tau} = X_t + \tau\nabla\log\pi(X_t)+\sqrt{2\tau}\xi, \quad \xi\sim N(0,I).
$$
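For concreteness, a minimal unadjusted Langevin sketch targeting $\pi = N(0,1)$, where $\nabla\log\pi(x) = -x$ (step size and chain counts are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 0.01                                   # step size
x = np.zeros(2000)                           # 2000 parallel chains, started at 0

# Iterate X <- X + tau * grad log pi(X) + sqrt(2 tau) * xi with pi = N(0, 1).
for _ in range(1000):
    x = x + tau * (-x) + np.sqrt(2.0 * tau) * rng.standard_normal(x.shape)

# After many steps the chains are approximately N(0, 1) distributed.
```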
Comparing this with
$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\right) + \sigma_tz, \quad z \sim N(0,1),
$$
and noting that the time indices here are arbitrary, the methods seem... related.
Assuming, for the sake of argument, that they in fact *are* the same, i.e.,
$$
\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\right) = x_t + \tau_t\nabla\log\pi(x_t), \\
\sigma_t = \sqrt{2\tau_t},
$$
one might be tempted to use the fact that, by Bayes' rule,
$$
\nabla\log p(x_0 \mid y) = \nabla\log p(x_0) + \nabla\log p(y \mid x_0)
$$
to perturb the reverse iterations by $\tau_t\nabla\log p(y\mid x_t)$.
*Note that score-based generative diffusion models, e.g. [Song and Ermon 2019](https://arxiv.org/abs/1907.05600), are trained to estimate $\nabla\log\pi$ directly.*
## Likelihood guiding experiment
We continue with the previous example, and add an observation $y=0.3$ assumed to be sampled from
$$
y \mid x_0 \sim N(x_0, \sigma^2=0.5^2).
$$
In this case the true posterior density is available (up to normalisation) in closed form, and looks like this.

The log likelihood of $y$ given $x=x_0$ is then
$$
\log p(y \mid x) = -\frac{(x-y)^2}{2\cdot 0.5^2} + C
$$
so
$$
\nabla\log p(y \mid x) = - \frac{x-y}{0.5^2}.
$$
Starting with $x_T\sim N(0,1)$ as before and iterating
$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\right) - \frac{\beta_t}{2}\frac{x_t-y}{0.5^2} + \sqrt{\beta_t}z, \quad z \sim N(0,1),
$$
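A sketch of this guided sampler (a heuristic; the names are our own, and the stand-in predictor would be replaced by the trained network):

```python
import numpy as np

def guided_sample(eps_model, y, sigma_y, n, betas, rng, scale=1.0):
    """Likelihood-guided ancestral sampling with sigma_t^2 = beta_t.

    Adds scale * (beta_t / 2) * grad log p(y | x_t) to each reverse step,
    where p(y | x) = N(x, sigma_y^2), so grad log p(y | x) = -(x - y) / sigma_y^2.
    """
    T = len(betas)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(n)                        # x_T ~ N(0, 1)
    for t in range(T, 0, -1):
        mean = (x - (1.0 - alphas[t - 1]) / np.sqrt(1.0 - alpha_bars[t - 1])
                * eps_model(x, t)) / np.sqrt(alphas[t - 1])
        guide = -scale * betas[t - 1] / 2.0 * (x - y) / sigma_y ** 2
        z = rng.standard_normal(n) if t > 1 else 0.0
        x = mean + guide + np.sqrt(betas[t - 1]) * z
    return x

rng = np.random.default_rng(0)
betas = np.linspace(0.001, 0.02, 100)
posterior = guided_sample(lambda x, t: np.zeros_like(x), 0.3, 0.5, 1000, betas, rng)
```

The `scale` parameter corresponds to the gradient scaling tried below.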
yields the following, where red is the true posterior.

Scaling the gradient term by 2, i.e.,
$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\right) - \beta_t\frac{x_t-y}{0.5^2} + \sqrt{\beta_t}z, \quad z \sim N(0,1),
$$
yields the following,

which pulls more mass towards the observation $y=0.3$.
The results are not very satisfactory; the posterior modes are, however, well separated, which is a known difficulty for gradient-based sampling methods.
### References
[1]: https://arxiv.org/pdf/2006.11239 "Denoising Diffusion Probabilistic Models. Ho, Jain, Abbeel 2020"
### End
Code: https://gist.github.com/vincentmolin/1243cca04a111f6f28ca17017e5e2692