# Deep generative model - Diffusion model
A diffusion model is like a Markovian hierarchical VAE (HVAE), but with three key restrictions:
1. The latent dimension is exactly equal to the data dimension
2. The structure of the latent encoder at each timestep is not learned; it is pre-defined as a linear Gaussian model. In other words, it is a Gaussian distribution centered around the output of the previous timestep.
3. The Gaussian parameters of the latent encoders vary over time in such a way that the distribution of the latent at final timestep $T$ is a standard Gaussian.

For the second assumption, the Gaussian encoder is a Gaussian distribution with mean $\mu_t(x_t)=\sqrt{\alpha_t}x_{t-1}$ and variance $\Sigma_t(x_t)=(1-\alpha_t)I_n.$ Mathematically, it is denoted as: $$q(x_t|x_{t-1})=\mathcal N(x_t;\sqrt{\alpha_t}x_{t-1},(1-\alpha_t)I_n).$$And the diffusion model posterior is the same as in the Markovian HVAE, $$q(x_{1:T}\mid x_0)=\frac{q(x_{0:T})}{q(x_0)}=\frac{q(x_T\mid x_{0:T-1})q(x_{0:T-1})}{q(x_0)}=\ldots=\prod^T_{t=1}q(x_t\mid x_{t-1}).$$
The form of the coefficients is chosen such that the variance of the latent variables stays at a similar scale; in other words, the encoding process is *variance-preserving*.
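As a quick illustration of the variance-preserving property, here is a minimal sketch of one forward step (with an arbitrary toy value of $\alpha_t$, not DDPM's actual schedule):
```python
import math
import torch

def forward_step(x_prev: torch.Tensor, alpha_t: float) -> torch.Tensor:
    """One step of the forward (noising) process q(x_t | x_{t-1})."""
    noise = torch.randn_like(x_prev)
    return math.sqrt(alpha_t) * x_prev + math.sqrt(1 - alpha_t) * noise

# Variance-preserving check: if Var[x_{t-1}] = 1, then
# Var[x_t] = alpha_t * 1 + (1 - alpha_t) = 1.
x_prev = torch.randn(100_000)            # unit-variance "data"
x_t = forward_step(x_prev, alpha_t=0.98)
print(x_t.var())                         # ~1.0
```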
For the third assumption, $$p_\theta(x_{0:T})=p(x_T)\prod_{t=1}^Tp_\theta(x_{t-1}\mid x_{t}),$$where $p(x_T)=\mathcal N(x_T;0,I_n)$ is the prior distribution.
Now the ELBO becomes: $$\begin{align}\log p_\theta(x_0)&=\mathbb{KL}(q(x_{1:T}|x_0)\ ||\ p_{\theta}(x_{1:T}|x_0))+\mathbb{E}_{q(x_{1:T}|x_0)}[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}]\\&=\mathbb{KL}(q(x_{1:T}|x_0)\ ||\ p_{\theta}(x_{1:T}|x_0))+\mathcal L(\theta,x_0)\\&\geq\mathcal L(\theta,x_0).\end{align}$$And the loss function can be expanded as: $$\begin{align}\mathcal L(\theta,x_0)&=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right]\\&=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)\prod^T_{t=1}p_\theta(x_{t-1}\mid x_t)}{\prod^T_{t=1}q(x_t\mid x_{t-1})}\right]\\&=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log p(x_T)+\sum^T_{t=1}\log\frac{p_\theta(x_{t-1}\mid x_t)}{q(x_t\mid x_{t-1})}\right],\ \text{(equ. (3) in DDPM)}\\&=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_{T})p_\theta(x_0\mid x_1)\prod_{t=1}^{T-1}p_\theta(x_t\mid x_{t+1})}{q(x_T\mid x_{T-1})\prod_{t=1}^{T-1}q(x_t\mid x_{t-1})}\right]\\&=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log p_{\theta}(x_0\mid x_1)\right]+\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T)}{q(x_T\mid x_{T-1})}\right]+\sum_{t=1}^{T-1}\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p_\theta(x_t\mid x_{t+1})}{q(x_t\mid x_{t-1})}\right]\\&=\mathbb{E}_{q(x_{1}|x_0)}\left[\log p_{\theta}(x_0\mid x_1)\right]+\mathbb{E}_{q(x_{T-1},x_{T}|x_0)}\left[\log \frac{p(x_T)}{q(x_T\mid x_{T-1})}\right]+\sum_{t=1}^{T-1}\mathbb{E}_{q(x_{t-1},x_t,x_{t+1}\mid x_0)}\left[\log\frac{p_\theta(x_t\mid x_{t+1})}{q(x_t\mid x_{t-1})}\right]\\&=\color{skyblue}{\mathbb{E}_{q(x_{1}|x_0)}\left[\log p_{\theta}(x_0\mid x_1)\right]}\color{Red}{\,-\,\mathbb E_{q(x_{T-1}\mid x_0)}\left[\mathbb{KL}({q(x_{T}\mid x_{T-1})}\ \Vert\ {p(x_T)})\right]}\\&\qquad\color{GreenYellow}{-\sum_{t=1}^{T-1}\mathbb{E}_{q(x_{t-1},x_{t+1}\mid x_0)}\left[\mathbb{KL}(q(x_t\mid x_{t-1})\ \Vert\ p_\theta(x_t\mid x_{t+1}))\right]}\end{align}$$
This is a first decomposition of the loss function. It can be interpreted in terms of three components:
1. $\color{skyblue}{\mathbb{E}_{q(x_{1}|x_0)}\left[\log p_{\theta}(x_0\mid x_1)\right]}$ is the *reconstruction term*: the expected log-likelihood of the original data sample given the first-step latent. The same term appears in a vanilla VAE.
2. $\color{Red}{-\mathbb E_{q(x_{T-1}\mid x_0)}\left[\mathbb{KL}({q(x_{T}\mid x_{T-1})}\ \Vert\ {p(x_T)})\right]}$ is the *prior matching term*. It requires no optimization, as it has no trainable parameters; assuming $T$ is large enough that the final distribution becomes a standard Gaussian, this term effectively goes to zero.
3. $\color{GreenYellow}{-\sum_{t=1}^{T-1}\mathbb{E}_{q(x_{t-1},x_{t+1}\mid x_0)}\left[\mathbb{KL}(q(x_t\mid x_{t-1})\ \Vert\ p_\theta(x_t\mid x_{t+1}))\right]}$ is the *consistency term*. It can be read as an extension of prior matching, or as making the distribution at $x_t$ consistent whether it is reached from the forward (noising) or the backward (denoising) direction.
![[Fig 4. diffusion model optimized.png]]
These three terms can be approximated with Monte Carlo estimates. However, the variance of the Monte Carlo estimate of the consistency term may be high, since it is an expectation over two random variables $\{x_{t-1},x_{t+1}\}$ for every timestep.
Let us derive it again, but rewrite $q(x_t\mid x_{t-1})=q(x_t\mid x_{t-1},x_0)$ (the extra conditioning on $x_0$ is redundant by the Markov property): $$\begin{align}\mathcal L(\theta,x_0)&=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right]\\&=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)p_\theta(x_0\mid x_1)\prod^T_{t=2}p_\theta(x_{t-1}\mid x_t)}{q(x_1\mid x_0)\prod^T_{t=2}q(x_t\mid x_{t-1})}\right]\\&=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)p_\theta(x_0\mid x_1)\prod^T_{t=2}p_\theta(x_{t-1}\mid x_t)}{q(x_1\mid x_0)\prod^T_{t=2}q(x_t\mid x_{t-1},{x_0})}\right]\\&=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)p_\theta(x_0\mid x_1)}{q(x_1\mid x_0)}+\sum_{t=2}^T\log\frac{p_\theta(x_{t-1}\mid x_t)}{q(x_t\mid x_{t-1},{x_0})}\right]\\&=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)p_\theta(x_0\mid x_1)}{q(x_1\mid x_0)}+\sum_{t=2}^T\log\frac{p_\theta(x_{t-1}\mid x_t)}{\frac{q(x_{t-1}\mid x_t,x_0)q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)}}\right]\\&=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)p_\theta(x_0\mid x_1)}{q(x_1\mid x_0)}+\sum_{t=2}^T\log\frac{p_\theta(x_{t-1}\mid x_t)}{q(x_{t-1}\mid x_t,x_0)}\cdot\frac{q(x_{t-1}\mid x_0)}{q(x_t\mid x_0)}\right]\\&=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)p_\theta(x_0\mid x_1)}{q(x_1\mid x_0)}+\log \frac{q(x_{1}\mid x_0)}{q(x_T\mid x_0)}+\sum_{t=2}^T\log\frac{p_\theta(x_{t-1}\mid x_t)}{q(x_{t-1}\mid x_t,x_0)}\right]\\&=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)p_\theta(x_0\mid x_1)}{q(x_T\mid x_0)}+\sum_{t=2}^T\log\frac{p_\theta(x_{t-1}\mid x_t)}{q(x_{t-1}\mid x_t,x_0)}\right]\\&=\mathbb{E}_{q(x_{1}|x_0)}\left[\log p_\theta(x_0\mid x_1)\right]+\mathbb{E}_{q(x_{T}|x_0)}\left[\log\frac{p(x_T)}{q(x_T\mid x_0)}\right]+\sum^T_{t=2}\mathbb{E}_{q(x_{t},x_{t-1}|x_0)}\left[\log\frac{p_\theta(x_{t-1}\mid x_t)}{q(x_{t-1}\mid x_t,x_0)}\right]\\&={\mathbb{E}_{q(x_{1}|x_0)}\left[\log p_\theta(x_0\mid x_1)\right]}{\,-\,\mathbb{KL}(q(x_T\mid x_0)\ \Vert\ p(x_T))}\\&\qquad{-\sum^T_{t=2}\mathbb{E}_{q(x_{t}|x_0)}\left[\mathbb{KL}(q(x_{t-1}\mid x_t,x_0)\ \Vert\ p_\theta(x_{t-1}\mid x_t))\right]}.\end{align}$$
Again, these three terms can be interpreted:
1. ${\mathbb{E}_{q(x_{1}|x_0)}\left[\log p_\theta(x_0\mid x_1)\right]}$ is the *reconstruction term* (it appeared above).
2. ${\,-\,\mathbb{KL}(q(x_T\mid x_0)\ \Vert\ p(x_T))}$ is the *prior matching term* (also as above; it has no trainable parameters and vanishes for large enough $T$).
3. ${-\sum^T_{t=2}\mathbb{E}_{q(x_{t}|x_0)}\left[\mathbb{KL}(q(x_{t-1}\mid x_t,x_0)\ \Vert\ p_\theta(x_{t-1}\mid x_t))\right]}$ is the *denoising matching term*. Maximizing it forces the learned denoising transition $p_\theta(x_{t-1}\mid x_t)$ to approach the ground-truth denoising transition $q(x_{t-1}\mid x_t,x_0)$.
If $T=1$, the denoising matching term is eliminated and we recover a vanilla VAE.
The key lies in the third term: $q(x_{t-1}\mid x_t,x_0)$ can be written in closed form as a Gaussian.
We already know that $q(x_{t}\mid x_{t-1},x_0)=q(x_{t}\mid x_{t-1})=\mathcal N(x_t;\sqrt{\alpha_t}x_{t-1},(1-\alpha_t)I_n).$ With the reparameterization trick, we can sample $x_t\sim q(x_t\mid x_0)$ at an arbitrary timestep directly from $x_0$:
$$\begin{align}x_t&=\sqrt{\alpha_t}x_{t-1}+\sqrt{1-\alpha_t}\epsilon_{t-1}\\&=\sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}x_{t-2}+\sqrt{1-\alpha_{t-1}}\epsilon_{t-2}\right)+\sqrt{1-\alpha_t}\epsilon_{t-1}\\&=\sqrt{\alpha_t\alpha_{t-1}}x_{t-2}+\sqrt{\alpha_t-\alpha_t\alpha_{t-1}}\epsilon_{t-2}+\sqrt{1-\alpha_t}\epsilon_{t-1}\\&=\sqrt{\alpha_t\alpha_{t-1}}x_{t-2}+\sqrt{\sqrt{\alpha_{t}-\alpha_t\alpha_{t-1}}^2+\sqrt{1-\alpha_t}^2}\epsilon^*_{t-2}\\&=\sqrt{\alpha_t\alpha_{t-1}}x_{t-2}+\sqrt{1-\alpha_t\alpha_{t-1}}\epsilon^*_{t-2}\\&=\ldots\\&=\sqrt{\prod^t_{i=1}\alpha_i}x_0+\sqrt{1-\prod^t_{i=1}\alpha_i}\epsilon^*_0\\&=\sqrt{\bar\alpha_t}x_0+\sqrt{1-\bar\alpha_t}\epsilon^*_0\\&\sim\mathcal N(x_t;\sqrt{\bar\alpha_t}x_0,(1-\bar\alpha_t) I_n)=q(x_t\mid x_0).\end{align}$$
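As a numerical sanity check of this closed form (a sketch with an arbitrary toy schedule, not DDPM's), sampling $x_t$ by iterating the one-step update and sampling it directly from $q(x_t\mid x_0)$ should agree in mean and variance:
```python
import torch

torch.manual_seed(0)
alphas = torch.linspace(0.999, 0.98, 50)        # toy schedule (assumption, not DDPM's)
alpha_bar = torch.cumprod(alphas, dim=0)

x0 = torch.full((100_000,), 2.0)                # a batch of identical scalar "images"

# iterative: apply q(x_t | x_{t-1}) fifty times
x = x0.clone()
for a in alphas:
    x = torch.sqrt(a) * x + torch.sqrt(1 - a) * torch.randn_like(x)

# closed form: x_t = sqrt(alpha_bar_t) x0 + sqrt(1 - alpha_bar_t) eps
x_direct = torch.sqrt(alpha_bar[-1]) * x0 + torch.sqrt(1 - alpha_bar[-1]) * torch.randn_like(x0)

print(x.mean(), x_direct.mean())                # both ~ sqrt(alpha_bar_T) * 2
print(x.var(),  x_direct.var())                 # both ~ 1 - alpha_bar_T
```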
So $$\begin{align}q(x_{t-1}\mid x_t,x_0)&=\frac{q(x_t\mid x_{t-1},x_0)q(x_{t-1}\mid x_0)}{q(x_t\mid x_0)}\\&=\frac{\mathcal N(x_t;\sqrt{\alpha_t}x_{t-1},(1-\alpha_t)I_n)\cdot\mathcal N(x_{t-1};\sqrt{\bar\alpha_{t-1}}x_0,(1-\bar\alpha_{t-1}) I_n)}{\mathcal N(x_t;\sqrt{\bar\alpha_t}x_0,(1-\bar\alpha_t) I_n)}\\&\propto\exp\left\{-\left[\frac{(x_t-\sqrt{\alpha_t}x_{t-1})^2}{2(1-\alpha_t)}+\frac{(x_{t-1}-\sqrt{\bar\alpha_{t-1}}x_0)^2}{2(1-\bar\alpha_{t-1})}-\frac{(x_t-\sqrt{\bar\alpha_t}x_0)^2}{2(1-\bar\alpha_t)}\right]\right\}\\&\propto\exp\left\{-\frac12\left[-\frac{2\sqrt{\alpha_t}x_tx_{t-1}}{1-\alpha_t}+\frac{\alpha_t x^2_{t-1}}{1-\alpha_t}+\frac{x^2_{t-1}}{1-\bar\alpha_{t-1}}-\frac{2\sqrt{\bar\alpha_{t-1}}x_{t-1}x_0}{1-\bar\alpha_{t-1}}\right]\right\}\\&\propto\mathcal N(x_{t-1};{\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})x_t+\sqrt{\bar\alpha_{t-1}}(1-\alpha_t)x_0}{1-\bar\alpha_t}},{\frac{(1-\alpha_t)(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}I_n})\\&=\mathcal N(x_{t-1};{\mu_q(x_t,x_0)},{\color{VioletRed}\Sigma_q(t)}).\end{align}$$
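These posterior parameters translate directly into code. Below is a small helper (the function and argument names are mine, not from any library), a sketch that will also be handy for sampling later:
```python
def q_posterior(x_t, x_0, alpha_t, alpha_bar_t, alpha_bar_prev):
    """Mean and variance of the ground-truth denoising posterior q(x_{t-1} | x_t, x_0).

    alpha_t is the per-step coefficient, alpha_bar_t / alpha_bar_prev are the
    cumulative products at t and t-1; x_t and x_0 are tensors of the same shape.
    """
    mu = (alpha_t**0.5 * (1 - alpha_bar_prev) * x_t
          + alpha_bar_prev**0.5 * (1 - alpha_t) * x_0) / (1 - alpha_bar_t)
    var = (1 - alpha_t) * (1 - alpha_bar_prev) / (1 - alpha_bar_t)   # sigma_q^2(t)
    return mu, var
```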
In DDPM, in order to match the approximate denoising transition $p_\theta(x_{t-1}\mid x_t)$ to the ground-truth denoising transition $q(x_{t-1}\mid x_t,x_0)$ as closely as possible, we model it as a Gaussian with the same variance: $$p_\theta(x_{t-1}\mid x_t)=\mathcal N(x_{t-1};\mu_\theta,\Sigma_q(t)).$$
⭐*Actually, this is a reasonable choice, and it makes the KL divergence have a closed-form solution. However, a Gaussian is not the only possible choice; successors of DDPM modify this.*
Recall the KL divergence between two Gaussian distributions: $$\mathbb{KL}(\mathcal N(x;\mu_x,\Sigma_x)\ \Vert\ \mathcal N(x;\mu_y,\Sigma_y))=\frac12\left[\log\frac{|\Sigma_y|}{|\Sigma_x|}-d+\text{tr}(\Sigma_y^{-1}\Sigma_x)+(\mu_y-\mu_x)^T\Sigma_y^{-1}(\mu_y-\mu_x)\right].$$Since both Gaussians here share the covariance $\Sigma_q(t)$, the log-determinant, dimension, and trace terms cancel, and the denoising matching term simplifies to $$\begin{align}\mathbb{KL}(q(x_{t-1}\mid x_t,x_0)\ \Vert\ p_\theta(x_{t-1}\mid x_t))&=\frac{1}{2}\left[(\mu_\theta-\mu_q)^\top\Sigma_q(t)^{-1}(\mu_\theta-\mu_q)\right]\\&=\frac12\left[(\mu_\theta-\mu_q)^\top(\sigma_q^2(t)I_n)^{-1}(\mu_\theta-\mu_q)\right]\\&=\frac1{2\sigma_q^2(t)}\left[\lVert\mu_\theta-\mu_q\rVert_2^2\right],\end{align}$$where $\sigma_q^2(t)=\frac{(1-\alpha_t)(1-\bar\alpha_{t-1})}{1-\bar{\alpha_t}}.$ To optimize it, we seek a $\theta$ such that: $$\arg\min_\theta\frac1{2\sigma_q^2(t)}\left[\lVert\mu_\theta-\mu_q\rVert_2^2\right].$$
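A quick numerical check of this simplification (a sketch using `torch.distributions` with arbitrary toy values): when the two Gaussians share the covariance $\sigma^2 I$, the KL reduces to the scaled squared distance between the means.
```python
import torch
from torch.distributions import MultivariateNormal
from torch.distributions.kl import kl_divergence

d, sigma2 = 4, 0.3                                   # toy dimension and shared variance
mu_q = torch.randn(d)
mu_p = torch.randn(d)
cov = sigma2 * torch.eye(d)

kl = kl_divergence(MultivariateNormal(mu_q, cov), MultivariateNormal(mu_p, cov))
closed_form = (mu_p - mu_q).pow(2).sum() / (2 * sigma2)
print(kl.item(), closed_form.item())                 # the two values agree
```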
Expand $\mu_q,$ $$\mu_q(x_t,x_0)=\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})x_t+\sqrt{\bar\alpha_{t-1}}(1-\alpha_t)x_0}{1-\bar\alpha_t}.$$
We can match the form of $\mu_\theta(x_t,t)$ as closely as possible: $$\mu_\theta(x_t,t)=\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})x_t+\sqrt{\bar\alpha_{t-1}}(1-\alpha_t){\hat x_\theta(x_t,t)}}{1-\bar\alpha_t},$$where $\hat x_\theta(x_t,t)$ is parameterized by a neural network that predicts $x_0$ from the noisy image $x_t$ and the time index $t.$ Therefore the denoising matching term optimization problem simplifies to $$\arg\min_\theta \mathbb{KL}(q(x_{t-1}\mid x_t,x_0)\ \Vert\ p_\theta(x_{t-1}\mid x_t))=\arg\min_\theta\frac1{2\sigma^2_q(t)}\frac{\bar\alpha_{t-1}(1-\alpha_t)^2}{(1-\bar\alpha_t)^2}\left[\lVert\hat x_\theta(x_t,t)-x_0\rVert^2_2\right].$$We conclude that optimizing a diffusion model amounts to learning a neural network that predicts the original ground-truth image from an arbitrarily noisified version of it. Furthermore, minimizing the denoising matching term across all noise levels can be approximated by minimizing the expectation over all timesteps: $$\arg\min_\theta\mathbb E_{t\sim U[2,T],\,x_t\sim q(x_t\mid x_0)}[\ \mathbb{KL}(q(x_{t-1}\mid x_t,x_0)\ \Vert\ p_\theta(x_{t-1}\mid x_t))\ ].$$
The diffusion noise parameters $\alpha_t$ can also be learned by a neural network $\hat\alpha_\eta(t)$ with parameters $\eta.$ The derived denoising matching term then reduces to: $$\frac1{2{\sigma^2_q(t)}}\frac{\bar\alpha_{t-1}(1-\alpha_t)^2}{(1-\bar\alpha_t)^2}\left[\lVert\hat x_\theta(x_t,t)-x_0\rVert^2_2\right]=\frac12\left(\frac{\bar\alpha_{t-1}}{1-\bar\alpha_{t-1}}-\frac{\bar\alpha_t}{1-\bar\alpha_t}\right)\left[\lVert\hat x_\theta(x_t,t)-x_0\rVert^2_2\right].$$
Recall that $q(x_t\mid x_0)=\mathcal N(x_t;\sqrt{\bar\alpha_t}x_0,(1-\bar\alpha_t)I_n).$ Following the definition of signal-to-noise ratio (SNR) as $\text{SNR}=\frac{\mu^2}{\sigma^2},$ we can write the SNR at each timestep $t$ as: $$\text{SNR}(t)=\frac{\bar\alpha_t}{1-\bar\alpha_t}.$$
The denoising matching term then becomes: $$\frac12\left(\text{SNR}(t-1)-\text{SNR}(t)\right)\left[\lVert\hat x_\theta(x_t,t)-x_0\rVert^2_2\right].$$
*In DDPM, the noise schedule is linear in $\beta_t=1-\alpha_t$, from $\beta_1=10^{-4}$ to $\beta_T=0.02$.*
*In Improved DDPM (Nichol & Dhariwal, 2021), a cosine schedule is used instead.*
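A sketch of both schedules and the resulting weight $\frac12(\text{SNR}(t-1)-\text{SNR}(t))$; the linear schedule is expressed in terms of $\beta_t=1-\alpha_t$ as in DDPM, and the cosine schedule follows the $\bar\alpha_t$ formula of Improved DDPM with offset $s=0.008$:
```python
import math
import torch

T = 1000

# Linear schedule (DDPM): beta_t = 1 - alpha_t goes linearly from 1e-4 to 0.02.
betas_linear = torch.linspace(1e-4, 0.02, T)
alpha_bar_linear = torch.cumprod(1 - betas_linear, dim=0)

# Cosine schedule (Improved DDPM): alpha_bar_t = f(t)/f(0), f(t) = cos((t/T + s)/(1 + s) * pi/2)^2.
s = 0.008
t = torch.arange(T + 1)
f = torch.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
alpha_bar_cosine = f[1:] / f[0]

def snr(alpha_bar):                  # SNR(t) = alpha_bar_t / (1 - alpha_bar_t)
    return alpha_bar / (1 - alpha_bar)

# Weight of the denoising matching term at each t: 0.5 * (SNR(t-1) - SNR(t)).
weight = 0.5 * (snr(alpha_bar_linear[:-1]) - snr(alpha_bar_linear[1:]))
print(weight[:3], weight[-3:])       # huge at small t, tiny at large t
```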
### Diffusion model as a denoising model
A diffusion model can be trained simply by learning a neural network to predict the original natural image $x_0$ from an arbitrarily noised version $x_t$ and the timestep $t.$
We know that $$x_t=\sqrt{\bar\alpha_t}x_0+\sqrt{1-\bar\alpha_t}\epsilon\iff x_0=\frac{x_t-\sqrt{1-\bar\alpha_t}\epsilon}{\sqrt{\bar\alpha_t}}.$$
Plugging this into $$\begin{align}\mu_q(x_t,x_0)&=\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})x_t+\sqrt{\bar\alpha_{t-1}}(1-\alpha_t){x_0}}{1-\bar\alpha_t}\\&=\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})x_t+\sqrt{\bar\alpha_{t-1}}(1-\alpha_t){\frac{x_t-\sqrt{1-\bar\alpha_t}\epsilon}{\sqrt{\bar\alpha_t}}}}{1-\bar\alpha_t}\\&=\frac1{\sqrt{\alpha_t}}x_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\sqrt{\alpha_t}}{\epsilon}.\end{align}$$To match the denoising transition mean: $$\mu_\theta(x_t,t)=\frac{1}{\sqrt{\alpha_t}}x_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\sqrt{\alpha_t}}{\hat\epsilon_\theta(x_t,t)}.$$
The corresponding optimization problem becomes: $$\arg\min_\theta \mathbb{KL}(q(x_{t-1}\mid x_t,x_0)\ \Vert\ p_\theta(x_{t-1}\mid x_t))=\arg\min_\theta\frac1{2\sigma^2_q(t)}\frac{(1-\alpha_t)^2}{(1-\bar\alpha_t)\alpha_t}\left[\lVert\epsilon_0-\hat\epsilon_\theta(x_t,t)\rVert^2_2\right].$$
Here $\hat\epsilon_\theta(x_t,t)$ is a neural network that learns to predict the noise added to $x_t.$ Although predicting $x_0$ is mathematically equivalent to predicting the noise, the noise-prediction parameterization usually performs better empirically.
Now we conclude that the DDPM objective is to maximize $$\begin{align}\mathcal L(\theta;x_0)&={\mathbb{E}_{q(x_{1}|x_0)}\left[\log p_\theta(x_0\mid x_1)\right]}{\,-\,\mathbb{KL}(q(x_T\mid x_0)\ \Vert\ p(x_T))}{-\sum^T_{t=2}\mathbb{E}_{q(x_{t}|x_0)}\left[\mathbb{KL}(q(x_{t-1}\mid x_t,x_0)\ \Vert\ p_\theta(x_{t-1}\mid x_t))\right]}\\&={-\text{MSE}(x_0,\hat x_0)}{-\mathbb E_{t\sim U[2,T],\,x_t\sim q(x_t\mid x_0),\,\epsilon\sim\mathcal N(0,I_n)}\left[\frac1{2\sigma^2_q(t)}\frac{(1-\alpha_t)^2}{(1-\bar\alpha_t)\alpha_t}\left[\lVert\epsilon-\hat\epsilon_\theta(x_t,t)\rVert^2_2\right]\right]}{+C},\end{align}$$where $q(x_t\mid x_0)=\mathcal N(x_t;\sqrt{\bar\alpha_t}x_0,(1-\bar\alpha_t) I_n).$
However, Ho et al. found that a simplified loss function leads to better sample quality, since it makes the network focus on the more difficult denoising tasks at larger $t$.
This can be seen from the curve of the weight:
![[The weight of DDPM primitive loss function.png]]
This extreme weighting causes the network to put too much effort into learning to denoise the easier (less noisy) images.
Dropping the weights, the simplified loss function becomes $$\mathcal L_{\text{simple}}(\theta;x_0)=\mathbb E_{t\sim U[1,T],\,x_t\sim q(x_t\mid x_0),\,\epsilon\sim\mathcal N(0,I_n)}\left[\lVert\epsilon-\hat\epsilon_\theta(x_t,t)\rVert_2^2\right].$$
Inference works as follows: starting from $x_T\sim\mathcal N(0,I_n),$ repeatedly predict the noise and take a reverse step until reaching $x_0$: $$x_{t-1}\sim p_\theta(x_{t-1}\mid x_t)=\mathcal N(\mu_\theta(x_t,t),\sigma^2_q(t)I_n),$$where $\mu_\theta(x_t,t)=\frac{1}{\sqrt{\alpha_t}}x_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\sqrt{\alpha_t}}{\hat\epsilon_\theta(x_t,t)},\,\sigma_q^2(t)=\frac{(1-\alpha_t)(1-\bar\alpha_{t-1})}{1-\bar{\alpha_t}}.$
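Putting the formulas together, one reverse step can be sketched from scratch as follows (`eps_model`, `alphas`, and `alpha_bars` are placeholder names for a trained noise-prediction network and its schedule):
```python
import torch

@torch.no_grad()
def ddpm_step(x_t, t, eps_model, alphas, alpha_bars):
    """One ancestral sampling step x_t -> x_{t-1} with the noise-prediction parameterization."""
    a_t, abar_t = alphas[t], alpha_bars[t]
    abar_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
    eps = eps_model(x_t, t)
    # mu_theta(x_t, t) = x_t / sqrt(alpha_t) - (1 - alpha_t) / (sqrt(1 - abar_t) sqrt(alpha_t)) * eps
    mean = (x_t - (1 - a_t) / (1 - abar_t).sqrt() * eps) / a_t.sqrt()
    if t == 0:
        return mean                                    # no noise is added at the last step
    var = (1 - a_t) * (1 - abar_prev) / (1 - abar_t)   # sigma_q^2(t)
    return mean + var.sqrt() * torch.randn_like(x_t)
```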
In DDPM, a U-Net is used to implement the denoising network:
```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel, DDPMScheduler

T = 1000
model = UNet2DModel(
    down_block_types=("DownBlock2D", "DownBlock2D", "DownBlock2D", "DownBlock2D"),
    up_block_types=("UpBlock2D", "UpBlock2D", "UpBlock2D", "UpBlock2D"),
)
noise_scheduler = DDPMScheduler(num_train_timesteps=T)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# train: learn to predict the noise that was added to x0 at a random timestep
# (tot_epoch, train_dataloader, image_shape are assumed to be defined elsewhere)
for epoch in range(tot_epoch):
    for step, batch in enumerate(train_dataloader):
        x0 = batch
        timesteps = torch.randint(0, T, (x0.shape[0],))
        noise = torch.randn_like(x0)
        xt = noise_scheduler.add_noise(x0, noise, timesteps)  # x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps
        noise_pred = model(xt, timesteps).sample
        loss = F.mse_loss(noise_pred, noise)                  # L_simple
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# generate/sample: start from pure noise and denoise step by step
x = torch.randn(image_shape)
for t in noise_scheduler.timesteps:                           # T-1, T-2, ..., 0
    noise_pred = model(x, t).sample
    x = noise_scheduler.step(noise_pred, t, x).prev_sample    # one reverse step x_t -> x_{t-1}
predict_x0 = x  # now we get the clean image!
```
### Denoising Diffusion Implicit Model (DDIM)
Notice that the loss function of DDPM depends only on the marginals $q(x_t\mid x_0),$ not on the joint distribution $q(x_{1:T}\mid x_0).$ If we can find an alternative inference process that is non-Markovian but has the same marginals, we obtain a more efficient way to generate.
In other words, we only need to match the marginal $q(x_t\mid x_0)=\mathcal N(\sqrt{\bar\alpha_t}x_0,(1-\bar\alpha_t)I_n).$
In order to break the Markov property, we **assume that $\pmb{x_t}$ depends not only on $\pmb{x_{t-1}}$ but also on $\pmb{x_0}$.** Therefore, the joint distribution in DDPM: $$q(x_{1:T}\mid x_0)=\frac{q(x_{0:T})}{q(x_0)}=\frac{q(x_T\mid x_{0:T-1})q(x_{0:T-1})}{q(x_0)}=\ldots=\prod^T_{t=1}q(x_t\mid x_{t-1})$$becomes $$\begin{align}q(x_{1:T}\mid x_0)&=\frac{q(x_{0:T})}{q(x_0)}\\&=\frac{q(x_{T}\mid x_{0:T-1})q(x_{0:T-1})}{q(x_0)}\\&=\frac{q(x_T\mid x_{0:T-1})q(x_{T-1}\mid x_{0:T-2})\ldots q(x_1\mid x_0)q(x_0)}{q(x_0)}\\&=q(x_T\mid x_{0:T-1})q(x_{T-1}\mid x_{0:T-2})\ldots q(x_1\mid x_0)\\&=q(x_T\mid x_0,x_{T-1})q(x_{T-1}\mid x_0,x_{T-2})\ldots q(x_1\mid x_0),\quad\text{(By assumption)}\\&=\frac{q(x_{T-1}\mid x_0,x_T)q(x_T\mid x_0)}{q(x_{T-1}\mid x_0)}\frac{q(x_{T-2}\mid x_0,x_{T-1})q(x_{T-1}\mid x_0)}{q(x_{T-2}\mid x_0)}\ldots\frac{q(x_{1}\mid x_0,x_2)q(x_2\mid x_0)}{q(x_{1}\mid x_0)}q(x_1\mid x_0)\\&=q(x_T\mid x_0)\prod_{t=2}^T q(x_{t-1}\mid x_0,x_t).\end{align}$$
We first require that $q(x_T\mid x_0)$ matches the marginal $q(x_T\mid x_0)=\mathcal N(\sqrt{\bar\alpha_T}x_0,(1-\bar\alpha_T)I_n),$ then let $X\sim q(x_t\mid x_0)$ and take the conditional $P(Y\mid X)=q(x_{t-1}\mid x_t,x_0),$ so that $Y\sim q(x_{t-1}\mid x_0).$
(Mathematical induction: $x_T$ meets the requirement; prove that if $x_t$ meets it, then $x_{t-1}$ does too.)
If $P(X)=q(x_t\mid x_0)=\mathcal N(\mu,\Lambda),$ and $P(Y\mid X)=q(x_{t-1}\mid x_t,x_0)=\mathcal N(AX+b,L),$ then $P(Y)=q(x_{t-1}\mid x_0)=\mathcal N(A\mu+b,L+A\Lambda A^T).$
We want $P(Y)=q(x_{t-1}\mid x_0):=\mathcal N(\sqrt{\bar\alpha_{t-1}}x_0,(1-\bar\alpha_{t-1})I_n).$ Therefore, we have to solve $(L,A,b)$ such that $$\begin{align}A\mu+b&=\sqrt{\bar\alpha_{t-1}}x_0,\\L+A\Lambda A^T&=(1-\bar\alpha_{t-1})I_n,\end{align}$$where $\mu=\sqrt{\bar\alpha_t}x_0,\Lambda=(1-\bar\alpha_t)I_n.$
Let $L=\sigma_t^2 I_n,$ $$\begin{align}&\begin{pmatrix}A\mu+b=\sqrt{\bar\alpha_{t-1}}x_0,\\L+A\Lambda A^T=(1-\bar\alpha_{t-1})I_n\end{pmatrix}\\\Rightarrow&\begin{pmatrix}A(\sqrt{\bar\alpha_t}x_0)+b=\sqrt{\bar\alpha_{t-1}}x_0,\\ \sigma_t^2I_n+(1-\bar\alpha_t)AA^T=(1-\bar\alpha_{t-1})I_n\end{pmatrix}\\\Rightarrow&\begin{pmatrix}A(\sqrt{\bar\alpha_t}x_0)+b=\sqrt{\bar\alpha_{t-1}}x_0,\\(1-\bar\alpha_t)AA^T=(1-\bar\alpha_{t-1}-\sigma_t^2)I_n\end{pmatrix}\\\Rightarrow&\begin{pmatrix}A(\sqrt{\bar\alpha_t}x_0)+b=\sqrt{\bar\alpha_{t-1}}x_0,\\A=\sqrt{\frac{1-\bar\alpha_{t-1}-\sigma_t^2}{1-\bar\alpha_t}}I_n\end{pmatrix}\\\Rightarrow&\begin{pmatrix}b=\sqrt{\bar\alpha_{t-1}}x_0-\sqrt{\frac{1-\bar\alpha_{t-1}-\sigma_t^2}{1-\bar\alpha_t}}(\sqrt{\bar\alpha_t}x_0),\\A=\sqrt{\frac{1-\bar\alpha_{t-1}-\sigma_t^2}{1-\bar\alpha_t}}I_n\end{pmatrix}.\end{align}$$
So $$\begin{align}P(Y\mid X)&=q(x_{t-1}\mid x_t,x_0)=\mathcal N(Ax_t+b,L)\\&=\mathcal N(\sqrt{\frac{1-\bar\alpha_{t-1}-\sigma_t^2}{1-\bar\alpha_t}}x_t+\sqrt{\bar\alpha_{t-1}}x_0-\sqrt{\frac{1-\bar\alpha_{t-1}-\sigma_t^2}{1-\bar\alpha_t}}(\sqrt{\bar\alpha_t}x_0),\sigma_t^2I_n)\\&=\mathcal N(\sqrt{\bar\alpha_{t-1}}x_0+\sqrt{\frac{1-\bar\alpha_{t-1}-\sigma_t^2}{1-\bar\alpha_t}}(x_t-\sqrt{\bar\alpha_t}x_0),\sigma_t^2I_n).\end{align}$$
We conclude that if $q(x_{t-1}\mid x_t,x_0)$ takes the above form, the marginals are preserved.
Now apply the same trick as before: let $$p_{\theta}(x_{t-1}\mid x_t)=\mathcal N(\sqrt{\bar\alpha_{t-1}}\hat x_0+\sqrt{\frac{1-\bar\alpha_{t-1}-\sigma_t^2}{1-\bar\alpha_t}}(x_t-\sqrt{\bar\alpha_t}\hat x_0),\sigma_t^2I_n)=q(x_{t-1}\mid x_t,\hat x_0),$$where $$\hat x_0=\frac{x_t-\sqrt{1-\bar\alpha_t}\hat\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}}.$$
By the posterior, the sampling process is $$\begin{align}x_{t-1}&=\sqrt{\bar\alpha_{t-1}}\hat x_0+\sqrt{\frac{1-\bar\alpha_{t-1}-\sigma_t^2}{1-\bar\alpha_t}}(x_t-\sqrt{\bar\alpha_t}\hat x_0)+\sigma_t\epsilon^*_t\\&=\sqrt{\bar\alpha_{t-1}}\left(\frac{x_t-\sqrt{1-\bar\alpha_t}\hat\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}}\right)+\sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\hat\epsilon_\theta(x_t,t)+\sigma_t\epsilon_t^*.\end{align}$$
In particular, when $\sigma_t^2=\frac{(1-\bar\alpha_{t-1})(1-\alpha_t)}{1-\bar\alpha_t},$ this recovers DDPM. When $\sigma_t=0,$ the update becomes deterministic and we obtain DDIM.
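The general reverse step with a free $\sigma_t$ can be sketched as below (again with a placeholder noise-prediction network `eps_model`); setting $\sigma_t=0$ gives the deterministic DDIM update, while $\sigma_t^2=\frac{(1-\bar\alpha_{t-1})(1-\alpha_t)}{1-\bar\alpha_t}$ recovers the DDPM step:
```python
import torch

@torch.no_grad()
def generalized_step(x_t, eps_model, t, t_prev, alpha_bars, sigma_t):
    """x_t -> x_{t_prev} using the non-Markovian posterior q(x_{t-1} | x_t, x_0)."""
    abar_t, abar_prev = alpha_bars[t], alpha_bars[t_prev]
    eps = eps_model(x_t, t)
    x0_hat = (x_t - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()       # predicted x_0
    mean = abar_prev.sqrt() * x0_hat + (1 - abar_prev - sigma_t**2).sqrt() * eps
    return mean + sigma_t * torch.randn_like(x_t)
```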
If we change the assumption from **$\pmb{x_t}$ depends on $\pmb{x_{t-1}}$ and $\pmb{x_0}$** to **$\pmb{x_{\tau_i}}$ depends on $\pmb{x_{\tau_{i-1}}}$ and $\pmb{x_0}$**, where $\tau$ is a sub-sequence of $[1,\ldots,T]$ of length $S,$ then
**the derivation above does not change except for the timesteps** ($t\rightarrow \tau_{i},\,t-1\rightarrow\tau_{i-1}$).
This means the sampling process can "jump" over timesteps, since the Markov dependency is broken. And because the loss function is unchanged, a DDPM model trained as before can also use this sampling process for accelerated generation.
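A sketch of such accelerated sampling over an evenly strided sub-sequence $\tau$ of length $S$, reusing `generalized_step` and the placeholder `eps_model` from the sketch above, with $\sigma_t=0$ (DDIM):
```python
import torch

T, S = 1000, 50
alpha_bars = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)   # toy DDPM-style schedule
tau = torch.linspace(0, T - 1, S).long().tolist()                      # increasing sub-sequence of timesteps

x = torch.randn(1, 3, 32, 32)                       # start from pure noise (toy image shape)
for i in reversed(range(1, S)):
    t, t_prev = tau[i], tau[i - 1]
    x = generalized_step(x, eps_model, t, t_prev, alpha_bars, sigma_t=0.0)  # deterministic jumps
x0_hat = x                                          # approximate clean sample after S steps instead of T
```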