> **author**: Gan Chee Kim 顏子鈞 (r11942161@ntu.edu.tw) GICE, NTU
> Version 1.0, June 2023
> discussion here (Mandarin): https://www.dcard.tw/f/graduate_school/p/242606013

<details close>
<summary>Table of Contents</summary>

- Chapter 1 Introduction to Diffusion Model
- Chapter 2 Mathematical Theory of Diffusion Model
- 2.1 Forward diffusion process
- 2.2 Backward diffusion process
- 2.3 Training of Diffusion Model
- 2.4 Improvements on training
- 2.4.1 Noise Scheduling Optimization
- 2.4.2 Speeding up Diffusion Model’s sampling
- Chapter 3 Network Architecture
- 3.1 U-Net architecture
- 3.2 U-ViT architecture
- Chapter 4 Conditioned Generation of Diffusion Model
- 4.1 Classifier guided diffusion
- 4.2 Classifier-Free guidance
- 4.3 Cross-Attention
- Chapter 5 Applications of Diffusion Model
- 5.1 Text-to-Image Generation
- 5.2 Super-Resolution
- 5.3 Inpainting
- 5.4 Semantic Segmentation
- 5.5 Image Translation
- Chapter 6 Diffusion Model on Video Generation
- 6.1 Video Diffusion Model (VDM)
- 6.2 Video Probabilistic Diffusion Models in Projected Latent Space (PVDM)
- 6.3 VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation
- 6.4 Conditional Image-to-Video Generation with Latent Flow Diffusion Models (LFDM)
- Chapter 7 Conclusion
- Reference

</details>

## Chapter 1 Introduction to Diffusion Model

The diffusion model is a generative model proposed by [1] in 2015 and later carried forward by DDPM [2] in 2020. It achieves SOTA quality in many modern image generation tasks (DALL-E 2, StableDiffusion, Imagen, etc.). The high-level concept of the diffusion model can be described as two <u>Markov chain</u> processes *(the state of each element in the chain depends solely on the previous element)*:

- A forward diffusion process that gradually adds Gaussian noise to the real data.
- A backward diffusion process that learns to denoise noisy data.

![](https://hackmd.io/_uploads/H1uRvupw2.jpg)
<p style="text-align: center">Fig. 1: overview of the forward and backward diffusion processes (modified from DDPM [2])</p>

We will explain more details regarding Fig. 1, but first let us elaborate the notations:

$x_0$: real data (original clean image)
$x_T$: latent variable (pure noise image, obtained by repeatedly adding Gaussian noise $T$ times)
$q(x_t|x_{t-1})$: the forward process, the operation that outputs $x_t$ given $x_{t-1}$
$p_\theta(x_{t-1}|x_t)$: the backward process, the operation that outputs $x_{t-1}$ given $x_t$ ($\theta$ indicates a neural network)

In the forward diffusion process, given a clean image $x_0$, we iteratively add a small amount of Gaussian noise to the image. Eventually the image becomes an indistinguishable $x_T$, which is nearly an isotropic Gaussian if $T$ is large enough. In the backward diffusion process, the diffusion model learns the mechanism of denoising a noisy image. It works by predicting the noise added to $x_t$, so that by removing this noise we obtain a less noisy image $x_{t-1}$. In the inference/testing phase, given a random Gaussian noise image $x_T$, the trained diffusion model is able to output a clean image $x_0$ by iteratively denoising the input image.

In this tutorial, we will thoroughly review the mathematical theory behind the diffusion model **(Chapter 2)** and the network architectures **(Chapter 3)** suitable for implementing it. Apart from these, we will also discuss conditioned generation **(Chapter 4)** as well as the diffusion model’s applications **(Chapter 5)** in computer vision.
At the end, I will talk about some research in the most trending field of diffusion models – video generation **(Chapter 6)**.

## Chapter 2 Mathematical Theory of Diffusion Model

Acknowledgement: Some of the contents in this chapter are referenced from [33].

### 2.1 Forward Diffusion Process

As we mentioned, the forward diffusion process is a Markov chain that gradually adds Gaussian noise to the data. More specifically, the Gaussian noise is controlled by a scheduled variance sequence $\left\{\beta_t\in(0,1)\right\}_{t=1}^{T}$. Recall that a Gaussian distribution can be expressed as:

$N(\mu,\sigma^2)=c\ e^{-\frac{1}{2}{(\frac{x-\mu}{\sigma})}^2}$

where $c$ is some normalizing constant. Thus, the forward diffusion process can be expressed as:

$q\left(x_t\middle| x_{t-1}\right)=N(x_t\ ;\ \sqrt{1-\beta_t}x_{t-1},\ \beta_tI)$

where $\mu=\sqrt{1-\beta_t}x_{t-1}$ and $\sigma^2=\beta_tI$. The entire Markov chain can be expressed as:

$q\left(x_{1:T}\middle| x_0\right)=\prod_{t=1}^{T}{q\left(x_t\middle| x_{t-1}\right)}$

For training efficiency (which we will discuss later), we need to be able to sample $x_t$ at any arbitrary timestep $t$; otherwise we would have to apply the transition above $t$ times. Fortunately, this can be done in closed form using <u>reparameterization</u> *(express a random variable $Z$ as a deterministic variable $Z(x,\ \epsilon)$, where $x$ is determined and $\epsilon$ is an auxiliary independent random variable)*:

Let $\alpha_t=1-\beta_t$ and $\bar{\alpha_t}=\prod_{i=1}^{t}\alpha_i$

$x_t=\sqrt{1-\beta_t}x_{t-1}+\sqrt{\beta_t}\epsilon_{t-1}$
$=\sqrt{\alpha_t}x_{t-1}+\sqrt{1-\alpha_t}\epsilon_{t-1}$
$=\sqrt{\alpha_t\alpha_{t-1}}x_{t-2}+\sqrt{1-\alpha_t\alpha_{t-1}}{\bar{\epsilon}}_{t-2}$ (merging two independent Gaussians)
$=\sqrt{\alpha_t\alpha_{t-1}\cdots\alpha_1}x_0+\sqrt{1-\alpha_t\alpha_{t-1}\cdots\alpha_1}\epsilon$
$=\sqrt{\bar{\alpha_t}}x_0+\sqrt{1-\bar{\alpha_t}}\epsilon$

where every $\epsilon\sim N(0,I)$. Hence the forward process conditioned on $x_0$ can be rewritten as:

$q\left(x_t\middle| x_0\right)=N(x_t;\ \sqrt{{\bar{\alpha}}_t}x_0,\ (1-{\bar{\alpha}}_t)I)$

Also, because the schedule is designed so that $\beta_1<\beta_2<\cdots<\beta_T$, we have ${\bar{\alpha}}_1>{\bar{\alpha}}_2>\cdots>{\bar{\alpha}}_T$, and when $T$ is large enough ${\bar{\alpha}}_T\rightarrow0$ and $x_T\approx\epsilon$ (pure Gaussian noise). In short, the forward diffusion process aims to convert the real data distribution into a <u>latent space</u> *(an embedded feature space)* of a simple Gaussian distribution.

### 2.2 Backward Diffusion Process

Recall that the forward process is $q\left(x_t\middle| x_{t-1}\right)$, so if we could obtain $q\left(x_{t-1}\middle| x_t\right)$ we could recreate (or denoise) the data. Unfortunately, we cannot estimate an arbitrary $q\left(x_{t-1}\middle| x_t\right)$ without knowing the entire real data distribution. Therefore, we need to train a neural network to approximate the conditional probabilities $p_\theta(x_{t-1}|x_t)$:

$p_\theta\left(x_{t-1}\middle| x_t\right)=N(x_{t-1};\ \mu_\theta(x_t,t),\ \mathrm{\Sigma}_\theta(x_t,t))$

where $\mu_\theta$ and $\mathrm{\Sigma}_\theta$ are predicted by our neural network, since our goal is to model the Gaussian distribution $N(\mu_\theta,\mathrm{\Sigma}_\theta)$ that removes the noise added on $x_t$.
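Before going further, the closed-form result of section 2.1 is worth seeing in code. Below is a minimal sketch of sampling $x_t$ directly from $x_0$ (my own illustration, not code from any of the cited papers); the linear schedule values follow the DDPM defaults mentioned later in section 2.4, while the tensor shapes and function names are assumptions.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear schedule used by DDPM [2]
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t = prod_{i<=t} alpha_i

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I) in one shot."""
    if noise is None:
        noise = torch.randn_like(x0)
    abar_t = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over a (B, C, H, W) batch
    return torch.sqrt(abar_t) * x0 + torch.sqrt(1.0 - abar_t) * noise

# usage: push a batch of "images" to random timesteps in a single step
x0 = torch.randn(4, 3, 32, 32)               # stand-in for real images
t = torch.randint(0, T, (4,))
x_t = q_sample(x0, t)
```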
Again, the entire Markov chain of the backward process can be expressed as:

$p_\theta(x_{0:T})=p_\theta(x_T)\prod_{t=1}^{T}{p_\theta\left(x_{t-1}\middle| x_t\right)}$

For now, the objective of the backward diffusion process is to approximate the reverse conditional probability $q\left(x_{t-1}\middle| x_t\right)$ with $p_\theta\left(x_{t-1}\middle| x_t\right)$. The term itself is intractable, but it becomes tractable once we additionally condition on $x_0$:

$q\left(x_{t-1}\middle| x_t\ ,\ x_0\right)=N(x_{t-1}\ ;\ \widetilde{\mu}(x_t,x_0),\ {\widetilde{\beta}}_tI)$

and using Bayes' rule, we can derive $\widetilde{\mu}(x_t,x_0)$ and ${\widetilde{\beta}}_t$ as shown below:

$q\left(x_{t-1}\middle| x_t\ ,\ x_0\right)=q\left(x_t\middle| x_{t-1}\ ,\ x_0\right)\ \frac{q\left(x_{t-1}\middle|\ x_0\right)}{q\left(x_t\middle|\ x_0\right)}$

$=N\left(x_t\ ;\ \sqrt{\alpha_t}x_{t-1},\ \beta_tI\right)\ \cdot\ N\left(x_{t-1}\ ;\ \sqrt{{\bar{\alpha}}_{t-1}}x_0,\ (1-{\bar{\alpha}}_{t-1})I\right)\ \cdot\ \left[N\left(x_t\ ;\ \sqrt{{\bar{\alpha}}_t}x_0,\ (1-{\bar{\alpha}}_t)I\right)\right]^{-1}$

$=C_0 \cdot exp\left(-\frac{1}{2}\left(\frac{\left(x_t-\sqrt{\alpha_t}x_{t-1}\right)^2}{\beta_t}+\frac{\left(x_{t-1}-\sqrt{{\bar{\alpha}}_{t-1}}x_0\right)^2}{1-{\bar{\alpha}}_{t-1}}-\frac{\left(x_t-\sqrt{{\bar{\alpha}}_t}x_0\right)^2}{1-{\bar{\alpha}}_t}\right)\right)$

$=C_0\cdot exp\left(-\frac{1}{2}\left(\frac{{x_t}^2-2\sqrt{\alpha_t}x_t x_{t-1}+\alpha_t{x_{t-1}}^2}{\beta_t}+\frac{{x_{t-1}}^2-2\sqrt{{\bar{\alpha}}_{t-1}}x_{t-1}x_0+{\bar{\alpha}}_{t-1}{x_0}^2}{1-{\bar{\alpha}}_{t-1}}-\frac{\left(x_t-\sqrt{{\bar{\alpha}}_t}x_0\right)^2}{1-{\bar{\alpha}}_t}\right)\right)$

$=C_0\cdot exp\left(-\frac{1}{2}\left(\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-{\bar{\alpha}}_{t-1}}\right){x_{t-1}}^2-\left(\frac{2\sqrt{\alpha_t}}{\beta_t}x_t+\frac{2\sqrt{{\bar{\alpha}}_{t-1}}}{1-{\bar{\alpha}}_{t-1}}x_0\right)x_{t-1}+C_1(x_t,\ x_0)\right)\right)$

Recall that $N\left(\mu,\sigma^2\right)=c\ e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}=c\cdot\exp\left(-\frac{1}{2}\left(\frac{1}{\sigma^2}x^2-\frac{2\mu}{\sigma^2}x+\frac{\mu^2}{\sigma^2}\right)\right)$. By comparing the terms, we get:

${\widetilde{\beta}}_t=\frac{1}{\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-{\bar{\alpha}}_{t-1}}\right)}=\frac{1-{\bar{\alpha}}_{t-1}}{1-{\bar{\alpha}}_t}\beta_t$

$\widetilde{\mu}\left(x_t,x_0\right)$
$=\frac{{\widetilde{\beta}}_t}{2}\left(\frac{2\sqrt{\alpha_t}}{\beta_t}x_t+\frac{2\sqrt{{\bar{\alpha}}_{t-1}}}{1-{\bar{\alpha}}_{t-1}}x_0\right)$
$=\left(\frac{\sqrt{\alpha_t}}{\beta_t}x_t+\frac{\sqrt{{\bar{\alpha}}_{t-1}}}{1-{\bar{\alpha}}_{t-1}}x_0\right)\ \cdot\ \frac{1-{\bar{\alpha}}_{t-1}}{1-{\bar{\alpha}}_t}\beta_t$
$=\frac{\sqrt{\alpha_t}(1-{\bar{\alpha}}_{t-1})}{1-{\bar{\alpha}}_t}x_t+\frac{\sqrt{{\bar{\alpha}}_{t-1}}\beta_t}{1-{\bar{\alpha}}_t}x_0$
$=\frac{\sqrt{\alpha_t}(1-{\bar{\alpha}}_{t-1})}{1-{\bar{\alpha}}_t}x_t+\frac{\sqrt{{\bar{\alpha}}_{t-1}}\beta_t}{1-{\bar{\alpha}}_t}\ \cdot\ \frac{1}{\sqrt{{\bar{\alpha}}_t}}\left(x_t-\sqrt{1-{\bar{\alpha}}_t}\epsilon_t\right)$
$=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{1-\alpha_t}{\sqrt{1-{\bar{\alpha}}_t}}\epsilon_t)$

where the second-to-last step substitutes $x_0=\frac{1}{\sqrt{{\bar{\alpha}}_t}}\left(x_t-\sqrt{1-{\bar{\alpha}}_t}\epsilon_t\right)$ from section 2.1. This result is important for deriving the loss function used to train the diffusion model.

### 2.3 Training of Diffusion Model

![](https://hackmd.io/_uploads/rkSoGK6Dh.png)
<p style="text-align: center">Fig. 2: Comparisons between the flow of VAE and DM.</p>

The diffusion model’s setup is actually very similar to a VAE; thus the Variational Lower Bound (VLB) is introduced to minimize the negative log-likelihood:

![](https://hackmd.io/_uploads/S1wK9cav2.png)

[1] also showed that the same bound can be obtained using Jensen’s inequality, i.e. $L_{CE}\le L_{VLB}$, where $CE$ indicates cross entropy. In addition to that, [1] rewrites $L_{VLB}$ into a combination of several KL-divergence and entropy terms:

![](https://hackmd.io/_uploads/BJEacqpP3.png)

$\therefore L_{VLB}=L_T+L_{T-1}+\cdots+L_1+L_0$

$L_T$ is a known constant (the forward diffusion process conditioned on $x_0$) with no parameters to be trained, and $L_0$ can also be ignored ([2] models it using a separate discrete decoder). This leaves only the terms $L_t$ to be trained, each of which is the KL divergence between the real backward diffusion process and the predicted backward diffusion process's Gaussian distribution. Recall that the KL divergence between two Gaussian distributions is:

$D_{KL}\left(N(x;\mu_x,\ \mathrm{\Sigma}_x)\ ||\ N(y;\mu_y,\ \mathrm{\Sigma}_y)\right)=\frac{1}{2}\left[\log{\frac{|\mathrm{\Sigma}_y|}{|\mathrm{\Sigma}_x|}}-d+tr\left(\mathrm{\Sigma}_y^{-1}\mathrm{\Sigma}_x\right)+\left(\mu_y-\mu_x\right)^T\mathrm{\Sigma}_y^{-1}(\mu_y-\mu_x)\right]$

The naive approach is to compute this KL divergence directly, but that comes with a heavy computation cost. There is actually a cleverer way to calculate $L_t$, since our objective is to minimize the KL divergence between two Gaussian distributions. The mean and variance of the real Gaussian distribution are known, and we assume that the predicted Gaussian distribution's variance is also fixed to ${\widetilde{\beta}}_t$. Thus, the objective can be simplified into minimizing the difference between the means of the two Gaussian distributions.

> Simply put, we want to **"pull"** the **predicted** Gaussian distribution (see blue group in Fig. 3) closer to the **real** Gaussian distribution (green group in Fig. 3), by only changing the **mean**.

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/H1u9Kcaw3.png"> </p>
<p style="text-align: center"> Fig. 3: Conceptual diagram of the simplified KL divergence. </p>
Recall from above that the mean of the real distribution $q$ is $\widetilde{\mu}\left(x_t,x_0\right)=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{1-\alpha_t}{\sqrt{1-{\bar{\alpha}}_t}}\epsilon_t)$. We now parameterize the mean $\mu_\theta\left(x_t,t\right)$ of the predicted distribution $p_\theta$ as:

$\mu_\theta\left(x_t,t\right)=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{1-\alpha_t}{\sqrt{1-{\bar{\alpha}}_t}}\epsilon_\theta\left(x_t,t\right))$

Then, $L_t$ is reparameterized to minimize the difference between $\widetilde{\mu}\left(x_t,x_0\right)$ and $\mu_\theta\left(x_t,t\right)$:

$L_t=\mathbb{E}_{x_0,\ \epsilon}[\ \frac{1}{2||\mathrm{\Sigma}_\theta(x_t,t)||_2^2}\ ||\widetilde{\mu}\left(x_t,x_0\right)-\mu_\theta\left(x_t,t\right)||^2\ ]$
$=\mathbb{E}_{x_0,\ \epsilon}[\ \frac{(1-\alpha_t)^2}{2\alpha_t(1-{\bar{\alpha}}_t)||\mathrm{\Sigma}_\theta||_2^2}\ ||\epsilon_t-\epsilon_\theta\left(\sqrt{{\bar{\alpha}}_t}x_0+\sqrt{1-{\bar{\alpha}}_t}\epsilon_t,t\right)||^2\ ]$

According to [2], training works better if the weighting term is ignored:

$L_t^{simple}=\mathbb{E}_{x_0,\ \epsilon}[\ ||\epsilon_t-\epsilon_\theta\left(\sqrt{{\bar{\alpha}}_t}x_0+\sqrt{1-{\bar{\alpha}}_t}\epsilon_t,t\right)||^2\ ]$

In other words, to optimize the neural network of the diffusion model, the loss function is designed as the **Mean Squared Error (MSE)** between the Gaussian noise added in the forward diffusion process and the noise predicted (to be removed) in the backward diffusion process.

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/HJzNccTPn.png"> </p>
<p style="text-align: center"> Fig. 4: Pseudocode of the training and sampling algorithms proposed by DDPM[2]. </p>

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/B1VU9qTP3.png"> </p>
<p style="text-align: center"> Fig. 5: Overview diagram of one training iteration of the diffusion model,</br> the source of the German Shepherd image is ImageNet [3] </p>

### 2.4 Improvements on Training

### 2.4.1 Noise Scheduling Optimization

### (I) Improvement on $\Sigma_\theta$

In the previous part, we only discussed $\mu_\theta$ and assumed $\mathrm{\Sigma}_\theta$ is set to $\sigma_t^2I$. According to [2], setting $\sigma_t^2=\beta_t$ or $\sigma_t^2={\widetilde{\beta}}_t$ yields roughly the same generation quality (where $\beta_t$ and ${\widetilde{\beta}}_t$ are predefined, not learnable). This is because learning a diagonal variance $\mathrm{\Sigma}_\theta$ might encounter instabilities. [4] concludes that as we increase the number of diffusion steps (i.e. $T$), the choice of $\sigma_t^2$ might not matter at all, as $\mu_\theta$ determines the Gaussian distribution much more than $\mathrm{\Sigma}_\theta$. Despite that, fixing $\sigma_t^2$ does nothing to further minimize the negative log-likelihood *(i.e. to optimize the diffusion model)*. Thus, [4] proposed that $\mathrm{\Sigma}_\theta$ should be parameterized as an interpolation between $\beta_t$ and ${\widetilde{\beta}}_t$ in the log domain. To achieve this, the neural network is designed to output an extra vector $v$, such that:

$\mathrm{\Sigma}_\theta=\exp(v\cdot \log{\beta_t}+(1-v)\cdot \log{{\widetilde{\beta}}_t})$

However, $L_{simple}$ has no $\mathrm{\Sigma}_\theta$ term, hence a new hybrid objective is defined as:

$L_{hybrid}=L_{simple}+\lambda L_{VLB}$

where $\lambda = 0.001$ (small enough to prevent $L_{VLB}$ from overwhelming $L_{simple}$), and the gradient of $\mu_\theta$ in $L_{VLB}$ is frozen (i.e. $L_{VLB}$ only guides the learning of $\mathrm{\Sigma}_\theta$).
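As a concrete illustration of the interpolation above, here is a minimal sketch (my own, not code from [4]) of turning the network's extra output $v$ into a variance. It assumes $v$ has already been squashed into $[0,1]$ and that ${\widetilde{\beta}}_t$ has been clipped away from zero so the logarithm is defined.

```python
import torch

def learned_variance(v, beta_t, beta_tilde_t):
    """Sigma_theta = exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t)):
    an interpolation between the two fixed choices, done in the log domain [4]."""
    return torch.exp(v * torch.log(beta_t) + (1.0 - v) * torch.log(beta_tilde_t))
```

In the hybrid objective, $L_{VLB}$ would then be evaluated with this variance while the predicted mean is detached (stop-gradient), matching the description above.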
### (II) Improvement on $\beta_t$

[4] found that the linear noise schedule proposed by [2] (i.e. $\beta_t$ increasing linearly from ${10}^{-4}$ to $0.02$) makes the end of the forward diffusion process too noisy, so the last steps do not contribute much to the generation quality. To address this problem, [4] constructs a cosine-based noise schedule:

$\beta_t=clip(1-\frac{{\bar{\alpha}}_t}{{\bar{\alpha}}_{t-1}},0.999)$, where ${\bar{\alpha}}_t=\frac{f(t)}{f(0)}$ , $f(t)=\cos^2{\left(\frac{t/T+s}{1+s}\cdot\frac{\pi}{2}\right)}$

where $s=0.008$ is a small offset to prevent $\beta_t$ from being too small close to $t=0$. This adjustment decreases the amount of denoising required in each timestep of the backward diffusion process, allowing more error tolerance for the neural network's prediction $\epsilon_\theta(x_t,\ t)$.

![](https://hackmd.io/_uploads/ryTQ8YaD3.png)
<p style="text-align: center">Fig. 6: Linear schedule (top) and cosine schedule (bottom), with t evenly spaced from 0 to T. The red border indicates where the forward diffusion process has already made x_t an isotropic Gaussian. The blue arrow indicates the effective denoising in the backward diffusion process.</br> (image source from [4]) </p>

### 2.4.2 Speeding up Diffusion Model’s Sampling

The inference/sampling of DDPM is very slow compared to GANs. One way to speed it up is to run a strided sampling schedule, taking a sample only every $\lceil T/S\rceil$ steps. Another approach is DDIM[5]. Recall that in DDPM the variance is $\sigma_t^2={\widetilde{\beta}}_t=\frac{1-{\bar{\alpha}}_{t-1}}{1-{\bar{\alpha}}_t}\beta_t$, whereas DDIM proposes to let:

$\sigma_t^2=\eta\cdot{\widetilde{\beta}}_t$

where $\eta$ is an adjustable hyperparameter controlling the sampling stochasticity, and DDPM is the special case of $\eta=1$. Note that $\eta=0$ makes the sampling process deterministic. During inference, DDIM ($\eta=0$) can sample only a small subset of $S$ diffusion steps and still produce good quality, while DDPM underperforms for small $S$. However, DDPM does perform better if the inference is run on the full reverse Markov chain. There is much research on speeding up the diffusion model's sampling; please refer to a recent example, DPM-Solver[6].

## Chapter 3 Network Architecture

### 3.1 U-Net architecture

As we discussed in section 2.3 and illustrated in Fig. 5, the diffusion model is a neural network that takes a noisy image as input and outputs a predicted noise of the same size as the input. To achieve this kind of fine-grained task, DDPM[2] adopted the classical U-Net[7], which is well known for fine-grained image segmentation tasks.

![](https://hackmd.io/_uploads/SJ6wPKTPn.png)
<p style="text-align: center"> Fig. 7: The architecture of U-Net[7] </p>

The original U-Net uses a stack of residual layers and downsampling convolutions, followed by a stack of residual layers with upsampling convolutions and skip connections. Apart from these, [2] introduced a global attention (single head) layer at the 16x16 resolution level, and added a projection (which I would rather call a linear layer) of the timestep embedding into each residual block. In [8], the authors further explore the following architectural changes, which give a substantial boost to sample quality:

- Increase depth versus width, holding model size relatively constant.
- Increase the number of attention heads.
- Add attention layers at 32x32, 16x16, and 8x8 resolutions (compared to only 16x16).
- Use the BigGAN[9] residual block rather than conventional ones.
- Rescale residual connections with a factor of $\frac{1}{\sqrt2}$.
- Introduce adaptive group normalization (AdaGN), which incorporates the timestep and class embedding into each residual block after a group normalization operation.

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/rJHpwYTDn.png"> </p>
<p style="text-align: center"> Fig. 8: Diagram of AdaGN, where h is the intermediate activation of the residual block, </br> and ys and yb are outputs of the linear projections of the time and class embeddings. </p>

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/BkqKutpwh.png"> </p>
<p style="text-align: center"> Fig. 9: The residual block proposed by BigGAN[9]. </p>

### 3.2 U-ViT architecture

Up until 2023, most diffusion models used U-Net as the network backbone. Meanwhile, the Transformer has shown dominance in the computer vision field. Coincidentally, **U-ViT**[10] and **DiTs**[11] proposed to combine the Vision Transformer (ViT) with the diffusion model (here we will only discuss U-ViT).

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/BySyYFTD3.png"> </p>
<p style="text-align: center"> Fig. 10: The U-ViT[10] architecture for diffusion models. </p>

As illustrated in Fig. 10, U-ViT splits the noisy image $x_t$ into patches and feeds them, along with the timestep $t$ and condition $c$, as tokens into the embedding layer. The rest of the network is generally similar to the original ViT; here we discuss some implementation details:

- **Long skip connection** Let $h_m$, $h_s$ be the embeddings from the main branch and the long skip branch. They are combined by concatenating them and then performing a linear projection.
  - $Linear(Concat(h_m,h_s))$
- **Patchifying input image** Same as the original ViT, using a 2D convolution block with kernelSize k (same as patchSize) to produce a feature map (default dimension is 768), which is then flattened.
  - $Flatten(Conv2d(inChannels=3,outChannels=768,kernelSize=k,stride=k))$
- **Position embedding** Same as the original ViT, using a 1-dimensional learnable position embedding.
- **Patch embedding** Same as the original ViT, using a linear projection that maps a patch to a token embedding.
- **The last convolution block before output** Adds a 3x3 convolution block after the linear projection that maps the token embeddings back to image patches.
- **Conditioning** Any type of conditioning (e.g. class label, cross-attention, layer normalization, etc.) is fine, because U-ViT only requires the conditioning information as a token.

Here we also list the hyperparameters that performed best in the ablation study conducted by [10]:

- Network depth (i.e. number of Transformer blocks): 13
- Network width (i.e. the Transformer block’s hidden dimension): 512
- Patch size: 2 (note that further decreasing it to 1 brings no gain)

## Chapter 4 Conditioned Generation of Diffusion Model

This chapter discusses some techniques used to condition the generation results of diffusion models. By contrast, conditional image generation with the **GAN** architecture makes heavy use of class labels, because discriminators are designed to behave like classifiers $p(y|x)$.

### 4.1 Classifier guided diffusion

Inspired by GANs, [8] exploited a classifier $p(y|x)$ to achieve conditioned generation for the diffusion model. First, they pretrain a classifier to predict the label of a noisy image $x_t$, and then train the diffusion model afterwards. The conditioning happens at the inference stage, as sketched below.
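To make this concrete before the formal derivation that follows, here is a rough sketch of a guided reverse step (my own illustration under an assumed `classifier(x, t)` interface and tensor shapes, not the exact code of [8]): the classifier's gradient with respect to the noisy input shifts the mean predicted by the diffusion model.

```python
import torch

def classifier_guided_mean(mu, sigma, classifier, x_t, t, y, s=1.0):
    """Shift the reverse-step mean by s * Sigma * grad_x log p_phi(y | x_t).

    mu, sigma : mean and (diagonal) variance predicted by the diffusion model
    classifier: a network trained on noisy images, returning class logits
    y         : target class indices, shape (B,)
    s         : guidance scale
    """
    x = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x, t), dim=-1)
    log_p_y = log_probs[torch.arange(len(y)), y].sum()   # sum of log p_phi(y | x_t) over the batch
    grad = torch.autograd.grad(log_p_y, x)[0]
    return mu + s * sigma * grad
```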
<p style="text-align: center"> <img src="https://hackmd.io/_uploads/HkeCYtTw2.png"> </p>
<p style="text-align: center"> Fig. 11: Typical approach to pretrain a classifier </p>

The authors derived a conditional transition operator that can be approximated by a Gaussian similar to the unconditional transition operator, but with its mean shifted by $s\cdot\mathrm{\Sigma}_\theta\cdot\mathrm{\nabla}_{x_t}\log{p_\varphi(y|x_t)}$, where $s$ is a scaling factor. In conclusion, this method relies on gradients from the image classifier to guide the conditional generation.

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/HyT49tTP2.png"> </p>
<p style="text-align: center"> Fig. 12: The inference stage of conditional generation using classifier guidance. </p>

However, there are some cons to this type of conditioning: (I) you need to train an extra classifier network, which is time-consuming; (II) the classifier is trained on noisy images, which might make it easy to attack adversarially; (III) there are no standard pretrained weights, so you have to train it yourself.

### 4.2 Classifier-Free guidance

Instead of training a separate classifier network, [12] proposed to train a single neural network for both the unconditional model $p_\theta(x)$, parameterized through a score estimator $\epsilon_\theta(x_t)$, and the conditional model $p_\theta(x|y)$, parameterized through $\epsilon_\theta(x_t,y)$. The mean-shifting term derived by [8] can be further reparameterized into:

$\because p_\varphi(y|x_t) \propto p(x_t|y) / p(x_t)$
$\therefore\mathrm{\nabla}_{x_t}\log{p_\varphi(y|x_t)}=\mathrm{\nabla}_{x_t}\log{p(x_t|y)}\ -\ \mathrm{\nabla}_{x_t}\log{p(x_t)}$
$=-\frac{1}{\sqrt{1-{\bar{\alpha}}_t}}(\epsilon_\theta(x_t,y)-\epsilon_\theta(x_t))$

Modifying the score estimator $\epsilon_\theta(x_t,y)$ into ${\bar{\epsilon}}_\theta(x_t,y)$:

${\bar{\epsilon}}_\theta(x_t,y)=\epsilon_\theta(x_t,y)-\sqrt{1-{\bar{\alpha}}_t}\ w\ \mathrm{\nabla}_{x_t}\log{p_\varphi(y|x_t)}$
$=\epsilon_\theta\left(x_t,y\right)+w\left(\epsilon_\theta\left(x_t,y\right)-\epsilon_\theta\left(x_t\right)\right)$
$=(w+1)\epsilon_\theta(x_t,y)-w\epsilon_\theta(x_t)$

where $w$ is a parameter that controls the strength of the guidance. From the above derivation, the gradient of an extra classifier $\varphi$ can be represented equivalently by a conditional and an unconditional score estimator. Note that when we want to generate unconditionally, we can input $y=\emptyset$, a special null token. At training time, $y$ is randomly set to null with a probability set as a hyperparameter (default = 0.1), so the network learns to do both unconditional and conditional generation. To put it in simpler words, we train the diffusion model on paired data $(x,y)$ and randomly discard the conditioning information $y$. As the authors stated:

> “It is only a one-line change of code during training—to randomly drop out the conditioning—and during sampling—to mix the conditional and unconditional score estimates.”

Increasing $w$ produces images that are more typical, but less diverse, as illustrated in Fig. 13.

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/H1Jwst6vh.png"> </p>
<p style="text-align: center"> Fig. 13: MNIST generation results under different guidance strengths w. </p>

### 4.3 Cross-Attention

The **Latent Diffusion Model (LDM)** [13] proposed a more flexible conditioning method with the cross-attention mechanism, which was originally introduced in **“Attention is all you need”** [14] back in 2017.
It is very effective for using various input modalities (class label, text, speech, image, etc.) as the condition for generation. The cross-attention layer is plugged into the U-Net backbone at the end of every resolution level. (Actually, cross-attention is so flexible that you can insert it almost anywhere in any existing backbone.) Let’s discuss cross-attention in more depth.

Given two sequences of vectors, $s_1$ *(the condition)* and $s_2$ *(the intermediate result of generation)*, what cross-attention does is map $s_2$ onto $s_1$. Note that these two sequences must have the same embedding dimension; the simplest way to ensure this is to apply a linear projection. First, compute the query $(Q)$, key $(K)$ and value $(V)$:

$Q=W_Q^T\cdot s_2$
$K=W_K^T\cdot s_1$
$V=W_V^T\cdot s_1$

where $W_Q, W_K$ and $W_V$ are trainable parameters. Then, compute the attention matrix $A$ using $K$ and $Q$:

$A=softmax(\frac{Q\cdot K^T}{\sqrt d})$

where $d$ is the key dimension. Finally, the output is obtained by the dot product of $A$ and $V$, which produces a sequence with the same length as $s_2$. The implementation is fairly simple, as shown at [[link]](https://github.com/CompVis/latent-diffusion/blob/main/ldm/modules/attention.py).

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/H1T42F6v2.png"> </p>
<p style="text-align: center"> Fig. 14: The cross-attention combined with U-Net by LDM [13] </p>

## Chapter 5 Applications of Diffusion Model

Below we introduce some popular applications of the diffusion model, specifically in computer vision. Some of the contents below are referenced from [15].

### 5.1 Text-to-Image Generation

Google Research’s **Imagen**[20] proposes a state-of-the-art text-to-image diffusion model and a comprehensive benchmark for performance evaluation. Besides that, some other approaches, including **LDM**[13] and OpenAI’s **DALL-E 2**[19], can also generate photorealistic results from freely entered text prompts. Currently the text-to-image task is advancing at a crazy rate, so you might see a new SOTA work every month, or even every week.

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/SkpThF6v2.png"> </p>
<p style="text-align: center"> Fig. 15: Text-to-image generation results from Imagen[20]. </p>

### 5.2 Super-Resolution

Super-resolution is a computer vision task that restores a high-resolution image from a low-resolution input. **Super-Resolution via Repeated Refinement (SR3)**[17] uses DDPM to enable conditional image generation; SR3 conducts super-resolution through a stochastic, iterative denoising process. The **Cascaded Diffusion Model (CDM)**[18] consists of multiple diffusion models in sequence, each generating images of increasing resolution. Both SR3 and CDM directly apply the diffusion process to input images, which leads to a heavy evaluation cost. The **Latent Diffusion Model (LDM)** [13] shifts the diffusion process to a latent space using pre-trained autoencoders, which reduces the computational resources required for training the diffusion model.

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/HymE6FTDn.png"> </p>
<p style="text-align: center"> Fig. 16: CDM[18] comprising a base model and 2 super-resolution models. </p>

### 5.3 Inpainting

Image inpainting is the process of reconstructing missing or damaged regions in an image. With a diffusion model, the missing pixels of an image are predicted using a mask as the condition. **RePaint** [16] modifies the sampling of the diffusion model by taking the known region from the (noised) input and the inpainted part from the DDPM output, as sketched below.
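The core combination step can be summarized with a short sketch (my paraphrase under assumed helper functions `q_sample` and `p_sample`, not the authors' code; RePaint additionally resamples timesteps, which is omitted here):

```python
def repaint_step(x_t, x_known, mask, t, q_sample, p_sample):
    """One RePaint-style reverse step [16] on image tensors.

    mask     : 1 where pixels are known, 0 where they must be inpainted
    q_sample : noises the known image to level t-1 via the forward process
    p_sample : draws x_{t-1} from the trained reverse (denoising) model
    """
    known_part = q_sample(x_known, t - 1)        # known region, re-noised to level t-1
    inpainted_part = p_sample(x_t, t)            # model's denoised proposal for the whole image
    return mask * known_part + (1.0 - mask) * inpainted_part
```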
<p style="text-align: center"> <img src="https://hackmd.io/_uploads/ryqOTtpPn.png"> </p>
<p style="text-align: center"> Fig. 17: Overview of RePaint[16]. </p>

### 5.4 Semantic Segmentation

Semantic segmentation's goal is to predict the corresponding object label of every pixel in an image. **Recent work**[21] has shown that the representations learned through DDPM contain high-level semantic information. Besides that, **Decoder Denoising Pretraining (DDeP)**[22] integrates diffusion models with the denoising autoencoder[23] and achieves promising results on label-efficient semantic segmentation.

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/S1u3pFpv3.png"> </p>
<p style="text-align: center"> Fig. 18: Overview of Label-Efficient Semantic Segmentation with Diffusion Models[21]. </p>

### 5.5 Image Translation

Image translation refers to transforming an image into another visual content. Typical applications of the diffusion model in this domain include:

**(I) Style Transfer - Inversion-Based Style Transfer with Diffusion Models (InST)**[24]

**InST** uses **LDM**[13] as the generative backbone and proposes an attention-based textual inversion module. During image synthesis, the inversion module takes the **CLIP**[25] image embedding of an artistic image and produces the learned corresponding text embedding, which is then encoded into the standard form of caption conditioning (for **StableDiffusion**). After that, the LDM generative backbone uses the encoded latent and random noise, conditioned on the caption, to generate new images.

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/rkoZRFTPn.png"> </p>
<p style="text-align: center"> Fig. 19: Overview of InST[24]. </p>

**(II) Image-to-image translation - Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation (PnP-DFs)**[26]

[26] proposes to take an input image and a guidance text prompt describing the desired translation. First, the image is inverted to its initial noise and progressively denoised using **DDIM**[5] sampling. During this process, the **Plug-and-Play Diffusion Features (PnP-DFs)** - the spatial features $(f_t)$ from the decoder layers, and the queries $(Q_t)$ and keys $(K_t)$ from their self-attention layers - are extracted and then injected to guide the image translation along with the guidance text prompt.

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/ByZICF6w2.png"> </p>
<p style="text-align: center"> Fig. 20: Overview of PnP-DFs[26] </p>

## Chapter 6 Diffusion Model on Video Generation

Starting from 2022, the diffusion model community found ways to generate high-fidelity, temporally coherent videos rather than just independent images using diffusion models. Below we will introduce the pioneering work **VDM**[27] and several subsequent works.

### 6.1 Video Diffusion Model (VDM)[27]

VDM is a natural extension of the standard image diffusion model, and it enables joint training on image and video data, which the authors found to reduce the variance of minibatch gradients and speed up optimization. Their modifications are listed below:

**(1)** Use a 3D U-Net to factorize over space and time:
- change each 2D convolution (3x3) into a space-only 3D convolution (1x3x3, corresponding to frame, spatial height, spatial width);
- after each spatial attention block (which remains the same), add a temporal attention block that performs attention over the frame axis.

**(2)** Train on 16-frame data.
To produce longer sequences, the authors proposed two approaches:

- Method 1: first generate a 16-frame sequence $x^a$, then autoregressively generate the next 16 frames $x^b$ conditioned on $x^a$: $x^b \sim p_\theta(x^b|x^a)$
- Method 2: generate $x^a$ to represent a low-frame-rate video, then generate $x^b$ as the frames in between those of $x^a$.

**(3)** Following (2), the conditional sampling is modified from [31] into the reconstruction guidance:

${\widetilde{x}}_\theta^b(z_t)={\hat{x}}_\theta^b(z_t)-\frac{w_r\alpha_t}{2}\mathrm{\nabla}_{z_t^b}||x^a-\hat{x}^a_\theta(z_t)||^2_2$

where $z_t$ is the latent, $w_r$ is the weighting factor of the guidance, ${\hat{x}}_\theta^a(z_t)$ is a reconstruction of the conditioning data $x^a$ provided by the denoising model, and $\alpha_t=1-\beta_t$ is the reparameterized variance mentioned in section 2.1.

**(4)** Joint training on video and images is implemented by randomly concatenating independent images to the end of each video, and then masking the attention in the temporal attention blocks to prevent the mixing of information between video frames and the individual images.

### 6.2 Video Probabilistic Diffusion Models in Projected Latent Space (PVDM)[28]

PVDM proposes an autoencoder that represents a video with three 2D latent vectors (i.e. 3D-to-2D projections of the video), as illustrated in Fig. 21:

- The first latent vector, taken across the temporal dimension (i.e. the frame axis), parameterizes the common content of the video (e.g. the background).
- The latter two latent vectors, taken across the spatial height and width, encode the motion of the video.

Besides that, PVDM designs a new video diffusion model architecture based on this 2D image-like latent space, avoiding the computation-heavy 3D U-Net. A single 2D U-Net is trained to denoise the three latent vectors $z^s,z^h,z^w$, where the dependencies among $z^s,z^h,z^w$ are modeled jointly by attention layers.

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/BJdKr9awh.png"> </p>
<p style="text-align: center"> Fig. 21: The autoencoder architecture of PVDM. </p>

### 6.3 VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation[29]

VideoFusion observes that previous works usually generate video frames with independent noises, so the temporal correlations are also destroyed in the latent space. They propose to decompose the per-frame noise into a base noise that is shared among all frames and a residual noise that varies along the time axis. The noisy latent $z_t^i$ can be formulated as:

$z_t^i=\sqrt{{\hat{\alpha}}_t}x^i+\sqrt{1-{\hat{\alpha}}_t}\epsilon_t^i$

where $x^i$ denotes the $i$-th frame of a video. [29] re-formulates $x^i$ to utilize the similarity between frames as:

$x^i=\sqrt{\lambda^i}x^0+\sqrt{1-\lambda^i}{\mathrm{\Delta x}}^i$

where $x^0$ is the base frame (i.e. the common part of the video) and ${\mathrm{\Delta x}}^i$ is the difference between the $i$-th frame and the base frame.
Plugging the re-formulated $x^i$ into $z_t^i$, we obtain:

$z_t^i=\sqrt{{\hat{\alpha}}_t}(\sqrt{\lambda^i}x^0+\sqrt{1-\lambda^i}{\mathrm{\Delta x}}^i)+\sqrt{1-{\hat{\alpha}}_t}\epsilon_t^i$

and we split the noise $\epsilon_t^i$ into a base noise $b_t$ (shared among all frames) and a residual noise $r_t^i$ as follows:

$\epsilon_t^i=\sqrt{\lambda^i}b_t+\sqrt{1-\lambda^i}r_t^i$

Combining everything, we get:

$z_t^i=\sqrt{\lambda^i}(\sqrt{{\hat{\alpha}}_t}x^0+\sqrt{1-{\hat{\alpha}}_t}b_t)+\sqrt{1-\lambda^i}(\sqrt{{\hat{\alpha}}_t}{\mathrm{\Delta x}}^i+\sqrt{1-{\hat{\alpha}}_t}r_t^i)$

We can further simplify this using the fact that $b_t$ is shared across all frames:

$z_t^i=\sqrt{{\hat{\alpha}}_t}x^i+\sqrt{1-{\hat{\alpha}}_t}(\sqrt{\lambda^i}b_t+\sqrt{1-\lambda^i}r_t^i)$

And we can also derive a recursive form between $z_t^i$ and $z_{t-1}^i$:

$z_t^i=\sqrt{\alpha_t}z_{t-1}^i+\sqrt{1-\alpha_t}(\sqrt{\lambda^i}b_t+\sqrt{1-\lambda^i}r_t^i)$

VideoFusion’s architecture is designed as two neural networks: a base generator $z_\emptyset^b$ to predict $b_t$ and a residual generator $z_\varphi^r$ to predict $r_t^i$. The base generator can be a pretrained image diffusion model (e.g. DALL-E 2 or Imagen) to speed up the training process. During inference, VideoFusion first uses $z_\emptyset^b$ to estimate $b_t$, then removes it from all frames. Afterwards, each frame's latent $z_t^i$ is fed into $z_\varphi^r$ to estimate the corresponding $r_t^i$.

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/HJH689Tvh.png"> </p>
<p style="text-align: center"> Fig. 22: The overview of VideoFusion. </p>

### 6.4 Conditional Image-to-Video Generation with Latent Flow Diffusion Models (LFDM)[30]

LFDM is designed to better synthesize the temporal dynamics corresponding to the given image and condition, which is one of the biggest challenges of the video generation task. LFDM consists of two stages:

**(1)** Unsupervised training of a **latent flow auto-encoder** (**LFAE**) by predicting the <u>*latent optical flow*</u> between two frames of a video: a reference ($ref$) frame and a driving ($dri$) frame. The $ref$ frame is then warped with the predicted flow, and LFAE is optimized to minimize the loss between the $dri$ frame and the warped $ref$ frame (perceptual loss[32] is selected here). Besides that, LFAE also outputs an occlusion map $m$ to tell the diffusion model (in stage 2) to generate the invisible (occluded) parts in $z$.

> *Optical flow is the relative motion of a visual scene, which can be represented as an (H,W,2)-size array corresponding to every pixel’s xy-offsets. In LFDM, this is implemented through a differentiable bilinear sampling operation.*
> ![](https://hackmd.io/_uploads/SyvvD5aPn.png)

**(2)** Train a 3D U-Net based diffusion model, using a paired condition $y$ and the latent flow sequence predicted by LFAE, for temporal latent flow generation. The diffusion model in LFDM operates in a simple and low-dimensional latent flow space which only describes motion dynamics.

<p style="text-align: center"> <img src="https://hackmd.io/_uploads/H1iFP5TPn.png"> </p>
<p style="text-align: center"> Fig. 23: Overview of LFDM. </p>

## Chapter 7 Conclusion

In this tutorial, we provided a comprehensive overview of the diffusion model, delving into its fundamental theory and discussing the network architectures commonly used in diffusion models. We also explored the conditioning methods for generating desired contents and examined various applications of diffusion models.
Lastly, we focused on the latest trend in diffusion model research, specifically video generation. While the diffusion model has already achieved state-of-the-art results in many domains, its true potential lies in its remarkable capabilities for generative tasks. We highlighted several cutting-edge papers that have further enhanced the performance of the diffusion model. It is worth noting that most of the references cited in this tutorial are up until June 2023 (CVPR 2023). Therefore, if you are reading this from a future time beyond the present, I recommend conducting additional literature surveys to stay updated. Thank you for taking the time to read through this tutorial. I sincerely hope that it has been informative and beneficial to you. **Acknowledgement**: special thanks to [33] for the awesome blog. ## Reference [1] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli, "Deep Unsupervised Learning using Nonequilibrium Thermodynamics," arXiv:1503.03585 [cs.LG], Mar. 2015. [2] J. Ho, A. Jain, P. Abbeel, “Denoising Diffusion Probabilistic Models”, arXiv:2006.11239v2 [cs.LG], Dec 2020. [3] ImageNet, "German shepherd", available at http://vision.stanford.edu/aditya86/ImageNetDogs/n02106662.html [4] A. Nichol and P. Dhariwal, "Improved Denoising Diffusion Probabilistic Models," arXiv:2102.09672 [cs.LG], Feb. 2021. [5] J. Song, C. Meng, and S. Ermon, "Denoising Diffusion Implicit Models," arXiv:2010.02502 [cs.LG], Oct. 2022. [6] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps," arXiv:2206.00927 [cs.LG], Jun. 2022. [7] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," arXiv:1505.04597 [cs.CV], May 2015. [8] P. Dhariwal and A. Nichol, "Diffusion Models Beat GANs on Image Synthesis," arXiv:2105.05233 [cs.LG], May 2021. [9] A. Brock, J. Donahue, and K. Simonyan, "Large Scale GAN Training for High Fidelity Natural Image Synthesis," arXiv:1809.11096 [cs.LG], Sep. 2019. [10] F. Bao, S. Nie, K. Xue, Y. Cao, C. Li, H. Su, and J. Zhu, "All are Worth Words: A ViT Backbone for Diffusion Models," arXiv:2209.12152 [cs.CV], Sep. 2023. [11] W. Peebles and S. Xie, "Scalable Diffusion Models with Transformers," arXiv:2212.09748 [cs.CV], Dec. 2023. [12] J. Ho and T. Salimans, "Classifier-Free Diffusion Guidance," arXiv:2207.12598 [cs.LG], Jul. 2022. [13] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-Resolution Image Synthesis with Latent Diffusion Models," arXiv:2112.10752 [cs.CV], Dec. 2022. [14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention Is All You Need," arXiv:1706.03762 [cs.CL], Jun. 2017. [15] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, "Diffusion Models: A Comprehensive Survey of Methods and Applications," arXiv:2209.00796 [cs.LG], Sep. 2023. [16] A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, "RePaint: Inpainting using Denoising Diffusion Probabilistic Models," arXiv:2201.09865 [cs.CV], Jan. 2022. [17] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, "Image Super-Resolution via Iterative Refinement," arXiv:2104.07636 [eess.IV], Apr. 2021. [18] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, "Cascaded Diffusion Models for High Fidelity Image Generation," arXiv:2106.15282 [cs.CV], Jun. 2021. [19] A. Ramesh, P. Dhariwal, A. Nichol, C. 
Chu, and M. Chen, "Hierarchical Text-Conditional Image Generation with CLIP Latents," arXiv:2204.06125 [cs.CV], Apr. 2022. [20] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding," arXiv:2205.11487 [cs.CV], May 2022. [21] D. Baranchuk, I. Rubachev, A. Voynov, V. Khrulkov, and A. Babenko, "Label-Efficient Semantic Segmentation with Diffusion Models," arXiv:2112.03126 [cs.CV], Dec. 2022. [22] E. B. Asiedu, S. Kornblith, T. Chen, N. Parmar, M. Minderer, and M. Norouzi, "Decoder Denoising Pretraining for Semantic Segmentation," arXiv:2205.11423 [cs.CV], May 2022. [23] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and Composing Robust Features with Denoising Autoencoders," in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008, pp. 1096-1103, doi: 10.1145/1390156.1390294. [24] Y. Zhang, N. Huang, F. Tang, H. Huang, C. Ma, W. Dong, and C. Xu, "Inversion-Based Style Transfer with Diffusion Models," arXiv:2211.13203 [cs.CV], Nov. 2023. [25] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning Transferable Visual Models From Natural Language Supervision," arXiv:2103.00020 [cs.CV], Mar. 2021. [26] N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel, "Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 1921-1930. [27] J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, "Video Diffusion Models," arXiv:2204.03458 [cs.CV], 2022. [28] J. Shin, K. Sohn, S. Yu, and S. Kim, "Video Probabilistic Diffusion Models in Projected Latent Space," in Proceedings of the 2023 IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2023. [29] Z. Luo, D. Chen, Y. Zhang, Y. Huang, L. Wang, Y. Shen, D. Zhao, J. Zhou, and T. Tan, "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation," arXiv preprint arXiv:2303.08320, 2023. [30] H. Ni, C. Shi, K. Li, S. X. Huang, and M. R. Min, "Conditional Image-to-Video Generation with Latent Flow Diffusion Models," arXiv preprint arXiv:2303.13744, 2023. [31] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, "Score-Based Generative Modeling through Stochastic Differential Equations," arXiv preprint arXiv:2011.13456, 2021. [32] J. Justin, A. Alexandre, and F. Li, “Perceptual losses for real-time style transfer and super-resolution,” In European conference on computer vision, pages 694–711. Springer, 2016 [33] L. Weng, "What are diffusion models?," lilianweng.github.io, Jul. 2021. [Online]. Available: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/ ## Citation ``` @article{Gan_2023, title = "A Tutorial for Diffusion Model in 2023", author = "Gan, Chee Kim", year = 2023, month = Jun, url = "https://hackmd.io/@GanCheeKim99/tutorial-DM-2023", } ```