# An overview of Diffusion Probabilistic Models

We have all used image generators these days. Even though the general public has grown quite used to them, these are fantastic amalgamations of math and engineering that continue to amaze me. As an engineer in the making :P, I am not quite satisfied until I know how things work behind the scenes. So with some basic knowledge of PyTorch and deep learning, I set out to code it up, as I prefer Python over math (sorry). <a href="https://emoji.gg/emoji/7689-vsign"><img src="https://cdn3.emoji.gg/emojis/7689-vsign.png" width="25px" height="25px" alt="vsign"></a>

NOTE: This post will have some code blocks as well, and a little bit of knowledge about neural networks may be required. I will try my best to keep it as simple as possible.

## Architecture

![image](https://hackmd.io/_uploads/SkLqhkQRp.png)

This seemingly complex diagram is, roughly, the architecture behind Stable Diffusion, DALL-E, Midjourney, etc. The concept behind all of these image generators, diffusion, was introduced in the paper [DDPM](https://arxiv.org/abs/2006.11239) - Denoising Diffusion Probabilistic Models.

What is diffusion? Diffusion is the process of transforming noise into meaningful content. We can divide it into a forward process and a backward process.

### The Forward Process:

![Pasted image 20240303210331](https://hackmd.io/_uploads/SJmj3UZkA.png)

As we can see, at each timestep some noise is added evenly throughout the image, and the image gradually loses information until it becomes pure noise. The increments of noise added follow a particular distribution, in this case a normal distribution with mean 0 and standard deviation 1, as indicated by **z ~ N(0, 1)**.

### The Backward Process:

The backward process is the more complex one, since the model has to predict the noise that was added at each step. This can be done by predicting either:

1. The mean of the noise,
2. The original image, or
3. The noise added to the original image.

As you can guess, the second option gives pretty bad results when tried: a step-wise corruption process is being reversed in a single shot, and undoing many noising steps at once will without a doubt lose detail. Options 1 and 3 more or less mean the same thing, and are what diffusion models implement to get discernible images.

Let's assume the original data (image) we start with is x₀, and we add noise with variance βₜ at each step t, for t from 1 to T, where T is the number of noising steps. Each noisier sample is drawn as

xₜ ~ N(√(1-βₜ) xₜ₋₁, βₜI)

here, I is the [Identity Matrix](https://dcvp84mxptlac.cloudfront.net/diagrams2/equation-2-examples-of-identity-matrices-of-different-dimensions.png).

We define the cumulative signal factor α̃ₜ = Π₁≤ₛ≤ₜ(1 - βₛ) for t = 1, 2, ..., T. Here Π means taking the product of the factors (1 - βₛ) at each s from 1 to t. Each (1 - βₛ) keeps track of how much of the uncorrupted data survives that step of the forward process. The scheduler helps us decide how much noise to add or remove at each timestep.

We run a backward loop from T to 1, and at each step we do the following:

1. The learned denoising model (a U-NET in practice) predicts ε̃ₜ, the noise present in the current sample xₜ. This is the noise that needs to be removed at this step.
2. Compute x̂₀ = (xₜ - √(1-α̃ₜ)ε̃ₜ) / √α̃ₜ. Here we compute an estimate x̂₀ of the fully denoised data. This is done by taking the current noisy data xₜ, removing the predicted noise √(1-α̃ₜ)ε̃ₜ, and rescaling by √α̃ₜ, which represents how much of the signal remains. The scheduler supplies α̃ₜ here.
3. Predict xₜ₋₁ = x̃ₜ₋₁ + σₜz. Finally, we form the sample at the previous step: x̃ₜ₋₁ is the schedule-weighted combination of the estimate x̂₀ and the current xₜ, and z ~ N(0, I) is fresh noise added back to the estimate (skipped at the very last step), which keeps the sampling stochastic.

The scaling by √α̃ₜ and σₜ controls the amount of noise removed/added according to the diffusion schedule. So this is basically how diffusion works, at least as far as I could interpret it; a small code sketch of this loop follows below.

Paper for reference: [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239)
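To make the two processes concrete, here is a minimal PyTorch sketch, not the exact code from my implementation. The β range (1e-4 to 0.02 over 1000 steps) is the linear schedule from the DDPM paper; `model` stands in for any network trained to predict the noise (its `model(x, t)` signature is an assumption of this sketch), and `forward_noise`/`sample` are names I made up for illustration. Step 3 below uses the full posterior mean from the DDPM paper, which is the precise form of the "estimate plus σₜ-noise" step described above.

```
import torch

# -- Schedule setup: linear beta schedule from the DDPM paper --
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # β₁ ... β_T
alphas = 1.0 - betas                         # 1 - βₜ
alpha_bars = torch.cumprod(alphas, dim=0)    # α̃ₜ = Π (1 - βₛ)

def forward_noise(x0, t):
    # Jump straight to timestep t: x_t = √α̃ₜ·x₀ + √(1-α̃ₜ)·ε
    eps = torch.randn_like(x0)
    ab = alpha_bars[t]
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps, eps

@torch.no_grad()
def sample(model, shape):
    # Reverse loop from T to 1, mirroring the three steps above
    x = torch.randn(shape)                   # start from pure noise
    for t in reversed(range(T)):
        # 1. the model predicts the noise present in x_t
        eps_pred = model(x, torch.tensor([t]))
        ab = alpha_bars[t]
        # 2. estimate the fully denoised data x̂₀
        x0_hat = (x - (1 - ab).sqrt() * eps_pred) / ab.sqrt()
        x0_hat = x0_hat.clamp(-1, 1)
        # 3. step to x_{t-1}: posterior mean around x̂₀ plus fresh noise σₜ·z
        if t > 0:
            ab_prev = alpha_bars[t - 1]
            beta_t = betas[t]
            mean = (ab_prev.sqrt() * beta_t / (1 - ab)) * x0_hat \
                 + (alphas[t].sqrt() * (1 - ab_prev) / (1 - ab)) * x
            sigma_t = (beta_t * (1 - ab_prev) / (1 - ab)).sqrt()
            x = mean + sigma_t * torch.randn_like(x)
        else:
            x = x0_hat                       # no noise at the final step
    return x
```

Note how the scheduler tensors are precomputed once with `torch.cumprod`; the loop itself only does cheap indexing into them, so all the heavy lifting at sampling time is the model's noise prediction.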
## Implementation:

There were several parts to be built, namely the following:

1. Encoder
2. Decoder
3. Attention
4. CLIP Encoder
5. U-NET
6. DDPM Module - Timesteps and others
7. Pipeline to combine all of them

### Attention:

I implemented self-attention and cross-attention. Attention was first introduced in the paper [Attention is All You Need](https://arxiv.org/abs/1706.03762).

![image](https://hackmd.io/_uploads/B1e1L5My0.png)

The core idea is that the attention mechanism allows the model to dynamically prioritize and combine different parts of the input sequence based on their relevance to the current target output, rather than treating the entire input equally. This has proven to be a powerful technique for handling long-range dependencies and capturing contextual information in sequence data.

The main difference between cross-attention and self-attention in transformer models lies in the sources of the query, key, and value vectors used for calculating the attention weights.

1. **Self-Attention**: In self-attention, the query, key, and value vectors are all derived from the same input sequence. This means the attention mechanism attends to different positions of the same sequence to capture the internal relationships and dependencies within that sequence.
2. **Cross-Attention**: In cross-attention, the query vectors are derived from one input sequence, while the key and value vectors come from a different sequence. This allows the attention to relate different input sequences, such as attending to the relevant parts of the source sequence when generating the target sequence in a sequence-to-sequence task like machine translation.

Example (projecting the input once, then splitting it into q, k, and v):

```
q, k, v = self.in_proj(x).chunk(3, dim=-1)
```
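To show where that line lives, here is a minimal self-attention module sketch. This is my own simplified version for illustration (the class name and layout are assumptions, not necessarily the exact block from the codebase):

```
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, n_heads: int, d_embed: int):
        super().__init__()
        # one projection produces q, k and v in a single matmul
        self.in_proj = nn.Linear(d_embed, 3 * d_embed)
        self.out_proj = nn.Linear(d_embed, d_embed)
        self.n_heads = n_heads
        self.d_head = d_embed // n_heads

    def forward(self, x):
        # x: (batch, seq_len, d_embed)
        b, s, d = x.shape
        q, k, v = self.in_proj(x).chunk(3, dim=-1)
        # split into heads: (batch, n_heads, seq_len, d_head)
        q = q.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        # scaled dot-product attention over the sequence
        weights = q @ k.transpose(-1, -2) / math.sqrt(self.d_head)
        weights = F.softmax(weights, dim=-1)
        out = (weights @ v).transpose(1, 2).reshape(b, s, d)
        return self.out_proj(out)
```

For cross-attention, the only structural change is that k and v are projected from a second sequence (e.g., the CLIP text embeddings) while q still comes from x.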
### Encoder

The Encoder borrows the concepts of dimensionality reduction and feature expansion introduced in the VAE (Variational Autoencoder). The encoder architecture is designed to map the input data (e.g., images) to a lower-dimensional latent representation, which can then be used by the decoder part of the VAE to reconstruct the input or generate new samples. The reparameterization trick and the splitting of the output into mean and variance tensors are characteristic of the VAE framework, which aims to learn a continuous latent distribution that can be sampled from during generation.

Example:

```
# (Batch_size, 3, Height, Width) -> (Batch_size, 128, Height, Width)
nn.Conv2d(3, 128, kernel_size=3, padding=1),

# (Batch_size, 128, Height, Width) -> (Batch_size, 128, Height, Width)
VAE_Residual(128, 128),
# same as above
VAE_Residual(128, 128),

# (Batch_size, 128, Height, Width) -> (Batch_size, 128, Height/2, Width/2)
# (an asymmetric pad in the forward pass makes the stride-2 halving exact)
nn.Conv2d(128, 128, kernel_size=3, stride=2, padding=0),

# feature increase from 128 to 256 channels
VAE_Residual(128, 256),
VAE_Residual(256, 256),

# (Batch_size, 256, Height/2, Width/2) -> (Batch_size, 256, Height/4, Width/4)
nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=0),

# feature increase from 256 to 512 channels
VAE_Residual(256, 512),
VAE_Residual(512, 512),

# 8 times reduction from the original size
nn.Conv2d(512, 512, kernel_size=3, stride=2, padding=0),

VAE_Residual(512, 512),
VAE_Residual(512, 512),
VAE_Residual(512, 512),
VAE_AttentionBlock(512),
VAE_Residual(512, 512),

nn.GroupNorm(32, 512),
nn.SiLU(),
# 8 output channels: 4 for the mean and 4 for the log-variance of the latent
nn.Conv2d(512, 8, kernel_size=3, padding=1)
```

### Decoder

The Decoder architecture is designed to take a low-dimensional latent representation (combined with additional conditioning information) and generate a high-dimensional output tensor, such as an image. It uses upsampling to grow the encoder bottleneck back into an image-like tensor of the original size.

Constructor example:

```
def __init__(self):
    super().__init__(
        # (Batch_size, 4, Height/8, Width/8), the latent from the encoder
        nn.Conv2d(4, 4, kernel_size=1, padding=0),
        # expand the latent from 4 to 512 channels
        nn.Conv2d(4, 512, kernel_size=3, padding=1),

        VAE_Residual(512, 512),
        VAE_AttentionBlock(512),
        VAE_Residual(512, 512),
        VAE_Residual(512, 512),
        VAE_Residual(512, 512),
        VAE_Residual(512, 512),

        # (Height/8, Width/8) -> (Height/4, Width/4)
        nn.Upsample(scale_factor=2),
        nn.Conv2d(512, 512, kernel_size=3, padding=1),
        VAE_Residual(512, 512),
        VAE_Residual(512, 512),
        VAE_Residual(512, 512),

        # (Height/4, Width/4) -> (Height/2, Width/2)
        nn.Upsample(scale_factor=2),
        nn.Conv2d(512, 512, kernel_size=3, padding=1),
        VAE_Residual(512, 256),
        VAE_Residual(256, 256),
        VAE_Residual(256, 256),

        # (Height/2, Width/2) -> (Height, Width)
        nn.Upsample(scale_factor=2),
        nn.Conv2d(256, 256, kernel_size=3, padding=1),
        VAE_Residual(256, 128),
        VAE_Residual(128, 128),
        VAE_Residual(128, 128),

        nn.GroupNorm(32, 128),
        nn.SiLU(),
        # back to 3 channels (RGB)
        nn.Conv2d(128, 3, kernel_size=3, padding=1)
    )
```

### Conclusion of Part 1

The rest of the architecture and code blocks go beyond the scope of this blog post, and will be published in a Part 2. Thanks for reading!