
---
tags: paper-reading
---

# 📃 Diffusion Models

### Papers
- [Classic: 📃 DDPM paper (Jonathan Ho et al., 2020)](https://arxiv.org/pdf/2006.11239.pdf)
- [Paper selected for the NCKU CSIE graduate deep learning course report: 📃 improved DDPM (Alex Nichol et al., 2021)](https://arxiv.org/pdf/2102.09672v1.pdf)

### Notes
- [💡 Slides from a senior labmate](https://share.goodnotes.com/s/c4tGeZcokEkNLN9fXm5euB)
- [塗塗 (scribbles)](https://share.goodnotes.com/s/lkv15Lg7Uarw29zTB8PWSG)

### Other References
- [Learning diffusion models by implementing them: understanding DDPM through simplified concepts (Medium, in Chinese)](https://medium.com/ai-blog-tw/邊實作邊學習diffusion-model-從ddpm的簡化概念理解-4c565a1c09c)
- [What are Diffusion Models?](https://www.youtube.com/watch?v=fbLgFrlTnGU)
- [ML2021 HW6 GAN (anime face generation)](https://colab.research.google.com/github/ga642381/ML2021-Spring/blob/main/HW06/HW06.ipynb#scrollTo=dgkqPih1o5Az)

---

### Q1. Forward Process (Diffusion Process)
This describes how $x_{t-1}$ moves to $x_t$ (the process of adding Gaussian noise). But what are the two terms that follow, $\sqrt{1-\beta_t}\,x_{t-1}$ and $\beta_t I$? Are they the mean and variance of the Gaussian distribution?

![](https://i.imgur.com/rMGxTEq.png =200x)
![](https://i.imgur.com/rI4qX26.png =300x)

- Yes. The first image denotes a normal distribution (each $x_t$ follows the distribution defined by that mean and variance).
- The mean and variance are defined by hand.
- In the second image, $\alpha_t$ and $\overline{\alpha}_t$ are defined by hand, while the $x_t$ below them is obtained with the reparameterization trick mentioned last time.

#### Q1-1. What is $I$?
- $I$ is the identity matrix, so the resulting Gaussian is an isotropic normal distribution.
- Isotropic means that all elements on the main diagonal of the matrix (the $i=j$ line from top left to bottom right) are equal, and all other elements are 0. So in theory the sentence above would look more reasonable written with matrices XD, but it also works with vectors; the toy example I gave uses vectors. ![](https://i.imgur.com/RCvXLtd.png =300x)
- ![](https://i.imgur.com/uk8EKKb.png =500x)
- So, to be precise, we want not only that each $x_t$ can be sampled from a Gaussian distribution, but also that this distribution is an **isotropic** Gaussian distribution.
- Therefore, when deriving $q(x_{t-1}|x_t)$ in the DDPM reverse process, the same formula is applied (which is why the variance is assumed to have this form, and is finally derived to be a t-unrelated term). ![](https://i.imgur.com/NgaI3kH.png)

#### Q1-2. What does $\overline{z}_{t-2}$ mean?
It is a random number sampled from the normal distribution $\mathcal{N}(0, \sigma^2 + \sigma'^2)$. What is the meaning of the merged $z$ in the last line? Is it a single value drawn from a normal distribution? (The sum of two independent Gaussian variables is still Gaussian, and the variances add.)

![](https://i.imgur.com/0qEoff4.png =400x)

### Q2. KL-Divergence
- Asymmetric
- ([Some blog](https://glassboxmedicine.com/2019/12/07/connections-log-likelihood-cross-entropy-kl-divergence-logistic-regression-and-neural-networks/)) Quantifies how much one probability distribution differs from another probability distribution.
- ([Wiki](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)) In other words, it is the expectation of the logarithmic difference between the probabilities P and Q, where the expectation is taken using the probabilities P.
- KL divergence appears in two forms: (a) an integral and (b) an expectation. Can (b) be derived from (a)?
    - Yes: with $p$ and $q$ the densities of P and Q, $\int p(x)\log\frac{p(x)}{q(x)}\,dx = \mathbb{E}_{x\sim P}\left[\log\frac{p(x)}{q(x)}\right]$ is just the definition of an expectation under P.

![](https://i.imgur.com/bpITXRt.png)

### Q3. Reverse Process
Is the proportionality at the bottom obtained by plugging in a formula that was derived somewhere earlier?

![](https://i.imgur.com/83bdc3o.jpg)

In other words, how is the simplification described in this Medium blog post carried out?

![](https://i.imgur.com/FXYryE3.png)

### Q4. Clarifying the wording of the DDPM paper
![](https://i.imgur.com/GKTaOoj.png)

$\sigma^2_t$: the variance of the sampling distribution at timestep $t$ in the reverse process.

If $x_0$ is sampled (?) each time, $\sigma^2_t = \beta_t$ works better; if $x_0$ is fixed to a single point (i.e. looking at only one image), $\sigma^2_t = \tilde{\beta}_t$ works better as the variance.

---

## improved DDPM
### 1. Intro (skipped)
### 2. DDPM overview (skipped)
### 3. Improving the Log-likelihood
- DDPM: First, we set ... to ... to untrained time dependent constants.
- iDDPM: $L_{simple}$ provides no learning signal for variance. This is irrelevant, since DDPM achieves their best results by fixing the variance rather than learning it. **They set it to $\beta_t$ and $\tilde{\beta}_t$, which are the upper and lower bounds on the variance given by ...** (Both are fixed, not learned: $\beta_t$ is the upper bound and $\tilde{\beta}_t$ the lower bound, i.e. the two extremes.)

![](https://i.imgur.com/X9KfeB3.png)
![](https://i.imgur.com/tu2D1cJ.png)

#### [EXTRA] FID: Frechet Inception Distance
Rather than directly comparing images pixel by pixel (for example, as done by the L2 norm), the FID compares **the mean and covariance of the activations of the deepest layer in Inception v3**.
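These layers are closer to output nodes that correspond to real-world objects such as a specific breed of dog or an airplane, and further from the shallow layers near the input image. As a result, they tend to mimic human perception of similarity.

Lower FID is better. Suppose we want to evaluate the diversity of the generated images $x'$, so we classify them (classifier: Inception v3); before the classification step, we take out the feature vector $h(x')$ just before softmax(). We extract this feature vector for every generated image $x' \in X'$; likewise, every real training image $x \in X$ is passed through Inception v3 to obtain $h(x)\ \forall x \in X$.

What Inception v3 does here: ImageNet has 1000 classes, e.g. cars, houses, animals (?). We want the images produced by the generator / diffusion model / ... to be classified correctly into these 1000 classes, and to cover enough of the classes in the original data.

## Eval Metrics
IS (Inception Score): the numerator measures how confident the class prediction is, and the denominator measures class coverage. However, IS never looks at the ground-truth images; the score depends only on the pretrained Inception v3, which seems a bit risky.

FID: conceptually similar to IS, but it directly measures the distance between the ground-truth distribution and the distribution of DDPM-generated images, so it does not have the problem IS has. Both sets of feature vectors are assumed to be Gaussian, and FID is essentially the difference between these two Gaussians (reference: [Hung-yi Lee's ML 2021](https://www.youtube.com/watch?v=MP0BnVH2yOo)).

![](https://i.imgur.com/WW6ow8C.png)

Although DDPM generates high-quality images (a good, i.e. low, FID), it does poorly on the log-likelihood metric. Log-likelihood is generally regarded as an indicator of whether a model captures all the modes of the data distribution. Using the original setup from the DDPM paper, the improved DDPM authors obtain a log-likelihood of 3.99 bits/dim on ImageNet; increasing T (the number of timesteps) improves this to 3.77 bits/dim. Fig. 1 shows that the two variances ($\beta_t$ and $\tilde{\beta}_t$) are already very close as soon as $t > 0$ and stay close for the rest of the process. The more diffusion steps there are, the more the sampling distribution is determined by the mean, and the less it is affected by the variance.

Q: This figure presumably shows the (forward) diffusion steps, so does the model walk back in reverse (right to left) during training and prediction?

![](https://i.imgur.com/yWWfNPz.jpg)

### 3.1 Learning variance
Because the reasonable range of the variance is very small ($\beta_t \in [0,1]$), the variance is expressed as an interpolation between $\beta_t$ and $\tilde{\beta}_t$, and the model learns to predict the interpolation weight $v$. $v$ is a vector (how?)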
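A minimal sketch of the interpolation described above, assuming the log-domain form from the iDDPM paper, $\Sigma = \exp(v \log \beta_t + (1-v)\log\tilde{\beta}_t)$. The function name and the numbers are made up for illustration. One plausible reading of "$v$ is a vector" is that the network outputs one $v$ component per data dimension, so every dimension gets its own variance between the two bounds.

```python
import math

def interpolate_variance(v, beta_t, beta_tilde_t):
    """Per-dimension variance exp(v*log(beta_t) + (1-v)*log(beta_tilde_t)).

    v is a list with one interpolation weight per dimension (hypothetical
    model output); beta_t is the upper bound, beta_tilde_t the lower bound.
    """
    log_hi = math.log(beta_t)        # log of the upper bound
    log_lo = math.log(beta_tilde_t)  # log of the lower bound
    return [math.exp(v_i * log_hi + (1.0 - v_i) * log_lo) for v_i in v]

# Illustrative values only: v = 0 recovers beta_tilde_t, v = 1 recovers beta_t,
# and values in between give a geometric interpolation of the two.
variances = interpolate_variance([0.0, 0.5, 1.0], beta_t=0.02, beta_tilde_t=0.01)
```

Interpolating in the log domain keeps the predicted variance positive for any $v$, which is presumably why the model predicts $v$ rather than the variance itself.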
![](https://i.imgur.com/osu4mq4.png)

$L_{simple}$ is also turned into a hybrid form:

![](https://i.imgur.com/MBEbhXu.png)
![](https://i.imgur.com/7qV0Y0e.png)

The reasoning behind this: the mean is still driven by $L_{simple}$, while the prediction of the variance (i.e. of $v$) is guided by $L_{vlb}$.

$L_0$ seems to be an unknown term that can only be approximated in some way; DDPM simply ignores this term.

![](https://i.imgur.com/uRES5oK.png)

### 3.2 Improving the noise schedule
**where we see that a model trained with the linear schedule does not get much worse (as measured by FID) when we skip up to 20% of the reverse diffusion process.** <- I don't quite follow what this sentence means.

DDPM: Linear variance ($\beta_t$) schedule

iDDPM: Cosine variance schedule + clipping

Issue: For certain image resolutions, e.g. 32 x 32 or 64 x 64, the end of the forward process under the linear schedule is too noisy (it is already close to pure $\mathcal{N}(0, I)$ noise), so information is destroyed more quickly than necessary.

Solution: Design the schedule so that it changes very little at the start, has a linear drop-off in the middle, and changes very little at the end. A function that is close to this shape is $f(x) = \cos(x)$. To be specific, ![](https://i.imgur.com/21cp0VN.png =400x). $s$ is a hyperparameter. Choose $s$ s.t. $\sqrt{\beta_0}$ is slightly smaller than the pixel bin size 1/127.5.
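The schedule described above can be sketched as follows. This assumes the squared-cosine form from the iDDPM paper, $\bar{\alpha}_t = f(t)/f(0)$ with $f(t) = \cos\left(\frac{t/T + s}{1 + s}\cdot\frac{\pi}{2}\right)^2$, the offset $s = 0.008$, and clipping $\beta_t$ at 0.999; these values come from the paper rather than from these notes, and the function names are made up.

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level alpha_bar_t = f(t) / f(0) for the cosine schedule."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def cosine_betas(T, s=0.008, max_beta=0.999):
    """beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped to avoid a singularity near t = T."""
    betas = []
    for t in range(1, T + 1):
        beta = 1 - cosine_alpha_bar(t, T, s) / cosine_alpha_bar(t - 1, T, s)
        betas.append(min(beta, max_beta))
    return betas

# T = 1000 is an illustrative choice, not a value taken from the notes.
betas = cosine_betas(T=1000)
first_noise_std = math.sqrt(betas[0])  # compare against the pixel bin size 1/127.5
```

The clipping matters only at the very end of the schedule, where $\bar{\alpha}_t$ approaches 0 and the unclipped $\beta_t$ would approach 1.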
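Q: Where does the pixel bin size 1/127.5 used to choose $s$ come from? (Probably: pixel values in {0, ..., 255} are scaled linearly to [-1, 1], so one bin has width 2/255 = 1/127.5.)

### 3.3 Reducing Gradient Noise
- The true (mathematically derived) loss: $L_{vlb}$
- The loss used in DDPM: $L_{simple}$
- The loss used in 3.1: ![](https://i.imgur.com/MBEbhXu.png)

![](https://i.imgur.com/eVV97sk.png)

For the true loss, the gradient noise scale is too large, so using it as the training objective actually works worse. The true loss is the sum of the per-step losses ![](https://i.imgur.com/1NJQ2UI.png), and the magnitudes of these terms differ greatly (iDDPM's Figure 2 shows that the loss terms at the earliest diffusion steps are much larger than the later ones), so summing them and optimizing everything at once works poorly. iDDPM therefore proposes importance sampling over the loss terms:

![](https://i.imgur.com/GOM2Kxe.png)

That is, for each step $t$, keep the 10 most recent values of the loss $L_t$, take the mean of their squares, and use the square root of that value (normalized over all $t$) as the sampling probability $p_t$. This modified $L_{vlb}$ is somewhat like a weighted mean of the original $L_{vlb}$. iDDPM calls it $L_{vlb}$ **(resample)**.

#### [EXTRA] Gradient noise scale
https://arxiv.org/abs/1812.06162

### 3.4 Results and Ablations
$L_{hybrid}$ and the cosine schedule lower the negative log-likelihood (NLL) while keeping FID on par. $L_{vlb}$ (resample) lowers the NLL further, but FID rises (gets worse). Table 3 compares against other likelihood-based models (judging by NLL alone, the transformer-family models are all better than DDPM/iDDPM).

### 4. Improving Sampling Speed
T = 4000. Generating one sample requires one model evaluation per step $t$, so 4000 steps means 4000 sequential evaluations, which makes sampling very slow. iDDPM finds that sampling can be sped up by using fewer sampling steps (take K = 25, 50, 100, ..., T and sample once every T/K steps). DDPM, iDDPM and DDIM are compared on FID with different losses. The result is that when K is small, $L_{simple}$ with the variance fixed to $\beta_t$ gives the highest (worst) FID, presumably as a consequence of fixing the variance. iDDPM can compress 4000 steps into 100 steps (K=100) and still get near-optimal performance.

#### [EXTRA] DDIM: (skipped)

### 5. Comparison to GANs
To compare the mode coverage / data distribution coverage of generated images against models that are not likelihood-based, other eval metrics are needed, such as Precision/Recall. Diffusion models beat the GAN (BigGAN-deep) on recall.

### 6. Scaling Model Size
The trend in machine learning is to make models larger and train them longer (~= increase training compute) to obtain stronger models (Kaplan et al., 2020; Chen et al., 2020a; Brown et al., 2020). This paper examines how FID and NLL change as training compute increases. The first-layer channel count (#channels) is set to 64, 96, 128, 192, and the Adam learning rate is scaled with it (a base value multiplied by $\frac{1}{\sqrt{\#channel}}$). The FID and NLL curves are then plotted against a power law $f(x) = kx^\alpha$, and the authors claim that the FID curve follows the power law: "model improves in a predictable way as training compute increases."

### 7. Related Work (skipped)

### 8. Conclusion & Contribution
iDDPM achieves better NLL than DDPM and samples much faster than DDPM while retaining sample quality (surpassing GAN).

#### Notes
![](https://i.imgur.com/CkWBGc2.png)

Terminology
- iterations: the number of updates (one update per batch seen)
- epochs: one full pass over the dataset counts as one epoch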
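A small worked example of the two terms above, assuming the last partial batch in an epoch still triggers an update. The dataset size and batch size are made-up numbers, and the function name is hypothetical.

```python
import math

def total_iterations(num_examples, batch_size, epochs):
    """Number of parameter updates: one per batch, over the given number of epochs."""
    iterations_per_epoch = math.ceil(num_examples / batch_size)
    return iterations_per_epoch * epochs

# 50,000 images with batch size 128 gives 391 iterations per epoch,
# so 10 epochs correspond to 3910 iterations.
n_iters = total_iterations(num_examples=50_000, batch_size=128, epochs=10)
```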
