# 1D Denoising Diffusion

## Intro
We consider the Markov chain with transitions
$$
x_t \mid x_{t-1} \sim N(\sqrt{1-\beta_t}x_{t-1}, \beta_tI), \quad t=1,...,T
$$
and initial state $x_0 \sim p(x_0)$. With suitable choices of noise-scales $\beta_t$ and sufficiently large $T$ we have that the final state $x_T$ is approximately Gaussian and independent of $x_0$, i.e.,
$$
p(x_T \mid x_0) \approx p(x_T) \approx N(0,I).
$$
Additionally, for small noise steps the reverse transitions approximately preserve the functional form of the forward process, that is,
$$
x_{t-1} \mid x_t \sim N(\mu_\theta(x_t,t),\Sigma_\theta(x_t,t)).
$$
Generative modeling based on these observations seeks to learn the reverse transition distributions $Q_{\theta,t} = N(x; \mu_\theta(x_t,t),\Sigma_\theta(x_t,t))$. After training, approximate samples from the prior $p(x_0)$ can then be generated by sampling $x_T \sim N(0,I)$ and iteratively sampling $x_{t-1} \sim Q_{\theta,t}$ for $t = T,\dots,1$.
There are many possible parametrisations of the reverse transition distributions. We follow the one suggested in [Ho et al. 2020, Denoising Diffusion Probabilistic Models](https://arxiv.org/pdf/2006.11239).
## Toy problem
We define a simple Gaussian mixture prior on $x_0$ by
$$
x_0 \sim \frac 1 2 N(-1,\sigma^2=0.1) + \frac 1 2 N(1,\sigma^2=0.1).
$$

For the forward transitions we use $\beta_t$ linearly increasing from 0.001 to 0.02, and $T=100$ timesteps. Sampling the forward process for 1000 points from the prior looks like this.
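As a concrete sketch (NumPy; the variable names are our own), the forward chain for this toy prior can be simulated like so:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100
betas = np.linspace(0.001, 0.02, T)          # beta_1, ..., beta_T

# Sample 1000 points from the Gaussian mixture prior on x_0.
n = 1000
comp = rng.integers(0, 2, size=n)            # mixture component (weights 1/2, 1/2)
x0 = np.where(comp == 0, -1.0, 1.0) + np.sqrt(0.1) * rng.standard_normal(n)

# Run the forward chain x_t | x_{t-1} ~ N(sqrt(1 - beta_t) x_{t-1}, beta_t).
xs = [x0]
x = x0
for beta in betas:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(n)
    xs.append(x)
xs = np.stack(xs)                            # shape (T + 1, n): one row per timestep
```

The last row `xs[-1]` holds the (approximately standard normal) samples of $x_T$.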

One can derive the distribution of $x_t$ conditioned on $x_0$ in closed form. Let $\alpha_t := 1-\beta_t$ and $\bar\alpha_t := \prod_{s=1}^t\alpha_s$. Then
$$
x_t \mid x_0 \sim N(\sqrt{\bar\alpha_t}x_0, (1-\bar\alpha_t)I)
$$
(Note: for our choice of $T$ and $\beta_t$ we do not have $\bar\alpha_T \approx 0$; it nevertheless holds that $x_T$ is approximately $N(0,1)$ due to the choice of prior on $x_0$. A typical choice of $T$ in the literature is 1000.)
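This identity is easy to verify numerically. A small deterministic sketch, checking that the step-wise mean-scale and variance recursions of the forward chain reproduce $\sqrt{\bar\alpha_t}$ and $1-\bar\alpha_t$:

```python
import numpy as np

T = 100
betas = np.linspace(0.001, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)              # alpha_bar_t = prod_{s <= t} alpha_s

# x_t | x_0 has mean scale_t * x_0 and variance var_t, with the recursions
# scale_t = sqrt(alpha_t) * scale_{t-1}  and  var_t = alpha_t * var_{t-1} + beta_t,
# starting from scale_0 = 1, var_0 = 0. Check against the closed form.
scale, var = 1.0, 0.0
for t in range(T):
    scale *= np.sqrt(alphas[t])
    var = alphas[t] * var + betas[t]
    assert np.isclose(scale, np.sqrt(alpha_bars[t]))
    assert np.isclose(var, 1.0 - alpha_bars[t])

print(alpha_bars[-1])                        # roughly 0.35: not close to 0 for T = 100
```

The printout also illustrates the note above: with these noise scales, $\bar\alpha_T$ is far from zero.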
It can be shown that the reverse mean $\mu_\theta$ must predict
$$
\frac{1}{\sqrt{\alpha_t}}(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon)
$$
given $x_t$, where $\epsilon\sim N(0,I)$ is the noise used to produce $x_t$ from $x_0$. Reparametrising to instead predict $\epsilon$ with a network $\epsilon_\theta(x_t,t)$ for a given timestep can be shown to lead to the following variational lower bound (derivation omitted, see e.g. [Ho et al. 2020][1])
$$
\mathbb{E}_{x_0,\epsilon,t}\left[ \frac{\beta^2_t}{2\sigma^2_t\alpha_t(1-\bar\alpha_t)} \Vert \epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t}x_0 + {\sqrt{1-\bar\alpha_t}}\epsilon, t)\Vert^2 \right].
$$
In practice, a reweighted version of this is used as a loss for $\epsilon_\theta$, namely
$$
L_{\text{simple}} := \mathbb{E}_{x_0,\epsilon,t}\left[ \Vert \epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t}x_0 + {\sqrt{1-\bar\alpha_t}}\epsilon, t)\Vert^2 \right].
$$
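As a sketch (NumPy, with `eps_model` standing in for any noise predictor; the names are our own), a Monte Carlo estimate of $L_{\text{simple}}$ can be written as:

```python
import numpy as np

def l_simple(eps_model, x0, t, eps, alpha_bars):
    """Monte Carlo estimate of L_simple for a batch.

    eps_model(x_t, t) is any callable predicting the noise; x0, t, eps are
    batches of data points, timesteps (1-indexed) and N(0, 1) noise draws.
    """
    ab = alpha_bars[t - 1]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps    # forward-noised sample
    return np.mean((eps - eps_model(x_t, t)) ** 2)

# Sanity check: an oracle that recovers eps exactly gives (numerically) zero loss.
T = 100
alpha_bars = np.cumprod(1.0 - np.linspace(0.001, 0.02, T))
rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
t = rng.integers(1, T + 1, size=8)
eps = rng.standard_normal(8)
oracle = lambda x_t, t: (x_t - np.sqrt(alpha_bars[t - 1]) * x0) / np.sqrt(1.0 - alpha_bars[t - 1])
loss = l_simple(oracle, x0, t, eps, alpha_bars)
```

In training, `x0`, `t` and `eps` are drawn fresh for every batch.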
## Sampling
The previous loss can be used for training a neural network $\epsilon_\theta$ using stochastic gradient descent. After training, samples can be generated by first sampling
$$
x_T \sim N(0,I)
$$
and then for $t = T,...,2$
$$
z \sim N(0,1) \\
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\right) + \sigma_tz
$$
and finally
$$
x_{0} = \frac{1}{\sqrt{\alpha_1}}\left(x_1 - \frac{1-\alpha_1}{\sqrt{1-\bar\alpha_1}}\epsilon_\theta(x_1,1)\right).
$$
Here there are some choices for $\sigma_t$. [Ho et al. 2020][1] test two different choices with no clear difference in results, so let's use the simpler one, $\sigma^2_t=\beta_t$.
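The full procedure can be sketched as follows (NumPy; `eps_model` stands in for the trained network, and the function name is our own):

```python
import numpy as np

def sample(eps_model, n, betas, rng):
    """Ancestral sampling per the scheme above, with sigma_t^2 = beta_t.

    eps_model(x_t, t) is a trained noise predictor (here, any callable).
    """
    T = len(betas)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(n)                        # x_T ~ N(0, 1)
    for t in range(T, 0, -1):                         # t = T, ..., 1
        mean = (x - (1.0 - alphas[t - 1]) / np.sqrt(1.0 - alpha_bars[t - 1])
                * eps_model(x, t)) / np.sqrt(alphas[t - 1])
        z = rng.standard_normal(n) if t > 1 else 0.0  # no noise on the final step
        x = mean + np.sqrt(betas[t - 1]) * z
    return x

# Smoke test with a stand-in predictor (the trained network would go here):
rng = np.random.default_rng(0)
betas = np.linspace(0.001, 0.02, 100)
samples = sample(lambda x, t: np.zeros_like(x), 1000, betas, rng)
```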
Training a small MLP (2 => 16, relu, 16 => 4, relu, 4 => 1, taking $(x_t, t)$ as input) on the problem defined above using Adam and a batch size of 1024 yields the following loss curve.

Sampling using the trained network yields the blue distribution, where the red line is the true prior.

## Connection to Langevin dynamics and likelihood guiding
Up to this point we have seen how we can approximate a prior on $x_0$. We now consider the process where we additionally have a lossy observation $y$ of $x_0$, and the problem of sampling the posterior $p(x_0 \mid y)$.

First, we sketch a connection between the previous sampling procedure and Langevin dynamics. Let $\pi(x)$ be a differentiable probability density. [Apparently](https://abdulfatir.com/blog/2020/Langevin-Monte-Carlo/), $\pi$ is the stationary distribution of the SDE
$$
dX_t = \nabla\log\pi(X_t)\,dt + \sqrt{2}\,dB_t.
$$
Discretising this using the Euler-Maruyama method yields the following approximation
$$
X_{t+\tau} = X_t + \tau\nabla\log\pi(X_t)+\sqrt{2\tau}\xi, \quad \xi\sim N(0,I).
$$
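For concreteness, a minimal unadjusted Langevin sketch targeting $\pi = N(0,1)$, where $\nabla\log\pi(x) = -x$ (step size and chain counts are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 0.01                                   # step size
x = np.zeros(2000)                           # 2000 parallel chains, started at 0

# Iterate X <- X + tau * grad log pi(X) + sqrt(2 tau) * xi with pi = N(0, 1).
for _ in range(1000):
    x = x + tau * (-x) + np.sqrt(2.0 * tau) * rng.standard_normal(x.shape)

# After many steps the chains are approximately N(0, 1) distributed.
```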
Comparing this with
$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\right) + \sigma_tz, \quad z \sim N(0,1),
$$
and noting that the time indices here are arbitrary, the methods seem... related.
Assuming, for the sake of argument, that they in fact *are* the same, i.e.,
$$
\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\right) = x_t + \tau_t\nabla\log\pi(x_t), \\
\sigma_t = \sqrt{2\tau_t},
$$
one might be tempted to use the fact that, by Bayes' rule,
$$
\nabla\log p(x_0 \mid y) = \nabla\log p(x_0) + \nabla\log p(y \mid x_0)
$$
to perturb the reverse iterations by $\tau_t\nabla\log p(y\mid x_t)$.
*Note that score-based generative diffusion models, e.g. [Song and Ermon 2019](https://arxiv.org/abs/1907.05600), are trained to estimate $\nabla\log\pi$ directly.*
## Likelihood guiding experiment
We continue with the previous example, and add an observation $y=0.3$ assumed to be sampled from
$$
y \mid x_0 \sim N(x_0, \sigma^2=0.5^2).
$$
In this case the true posterior density is available (up to normalisation) in closed form, and looks like this.

The log likelihood of $y$ given $x=x_0$ is then
$$
\log p(y \mid x) = -\frac{(x-y)^2}{2\cdot 0.5^2} + C
$$
so
$$
\nabla\log p(y \mid x) = - \frac{x-y}{0.5^2}.
$$
Starting with $x_T\sim N(0,1)$ as before and iterating
$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\right) - \frac{\beta_t}{2}\frac{x_t-y}{0.5^2} + \sqrt{\beta_t}z, \quad z \sim N(0,1),
$$
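A sketch of this guided sampler (a heuristic; the names are our own, and the stand-in predictor would be replaced by the trained network):

```python
import numpy as np

def guided_sample(eps_model, y, sigma_y, n, betas, rng, scale=1.0):
    """Likelihood-guided ancestral sampling with sigma_t^2 = beta_t.

    Adds scale * (beta_t / 2) * grad log p(y | x_t) to each reverse step,
    where p(y | x) = N(x, sigma_y^2), so grad log p(y | x) = -(x - y) / sigma_y^2.
    """
    T = len(betas)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(n)                        # x_T ~ N(0, 1)
    for t in range(T, 0, -1):
        mean = (x - (1.0 - alphas[t - 1]) / np.sqrt(1.0 - alpha_bars[t - 1])
                * eps_model(x, t)) / np.sqrt(alphas[t - 1])
        guide = -scale * betas[t - 1] / 2.0 * (x - y) / sigma_y ** 2
        z = rng.standard_normal(n) if t > 1 else 0.0
        x = mean + guide + np.sqrt(betas[t - 1]) * z
    return x

rng = np.random.default_rng(0)
betas = np.linspace(0.001, 0.02, 100)
posterior = guided_sample(lambda x, t: np.zeros_like(x), 0.3, 0.5, 1000, betas, rng)
```

The `scale` parameter corresponds to the gradient scaling tried below.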
yields the following, where red is the true posterior.

Scaling the gradient term by 2, i.e.,
$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\right) - \beta_t\frac{x_t-y}{0.5^2} + \sqrt{\beta_t}z, \quad z \sim N(0,1),
$$
yields the following,

which pulls more mass towards the observation $y=0.3$.
The results are not very satisfactory; the posterior modes are, however, well separated, which is a known difficulty for gradient-based sampling methods.
### References
[1]: https://arxiv.org/pdf/2006.11239 "Denoising Diffusion Probabilistic Models. Ho, Jain, Abbeel 2020"
### End
Code: https://gist.github.com/vincentmolin/1243cca04a111f6f28ca17017e5e2692