
# Denoising Diffusion Probabilistic Model

[TOC]

**Paper** : https://arxiv.org/abs/2006.11239

**Example** : https://github.com/jason19990305/Denoising-Diffusion-Probabilistic-Model.git

## Introduction

The original **DDPM (Denoising Diffusion Probabilistic Model)** paper targets image generation, but here we use `sklearn` to generate a **2D** Swiss roll for the demonstration, and the **Neural Network** is the simplest possible **MLP (Multilayer Perceptron)**. One idea behind **DDPM** is to replace one-shot generation with generation over many steps (**Timesteps**), which takes pressure off the **Neural Network** and makes generation easier.

How are the multiple steps realized? Imagine piling sand into the shape of a Swiss roll and then letting it blow away. That process is **Diffusion**, also called the **Forward Process**. If we play it backwards and let a **Neural Network** learn how a random pile of sand flies back into the Swiss roll shape, we get the **Reverse Process**.

## Example

This section uses **sklearn** to generate **2D** points shaped like a Swiss roll, and demonstrates the diffusion and denoising processes.

### Forward Process

No Neural Network is involved at this stage. At every Timestep $t$ a little noise is added to the data. The real data is $x_0$; after $T$ rounds of adding noise it becomes $x_T$, which looks completely unstructured. This is a fixed and known process, so the distribution of the data at any time $t$ can be computed in closed form. Below, a Swiss roll of 40 points goes through a diffusion process with $T=200$, with a trajectory drawn for each point :

![ddpm_part1_forward_traj](https://hackmd.io/_uploads/H1A8YgnGbe.gif =70%x)

### Reverse Process

This step is the heart of **DDPM**. The **Model** focuses on predicting the noise of a single **Step**. Our goal is to start from $x_T$ and remove noise step by step until we recover $x_0$. We build a **Noise Predictor** that predicts the noise contained in the current noisy data $x_t$, and use it to remove the noise between $x_{t-1}$ and $x_t$. Repeating this denoising step eventually restores the original Swiss roll.

![ddpm_training_reverse](https://hackmd.io/_uploads/ByvwYgnM-e.gif =70%x)

## Detail

Implementation details follow.

### Forward Process

In the **Forward Process (Diffusion Process)**, the $t$ in $x_t$ is the number of times noise has been added. Each **step** adds noise like this :

$$
x_t=\sqrt{1-\beta_t}\cdot x_{t-1}+\sqrt{\beta_t}\cdot \epsilon_t
$$

where

* $\beta_t$ : the noise schedule. A preset value between 0 and 1 that sets the noise strength of step $t$. The larger $\beta_t$ is, the more noise this step adds and the less of the original data is kept.
* $\epsilon_t$ : the noise **Sampled** at **Timestep** $t$, drawn from the normal distribution $\epsilon_t\sim\mathcal{N}(0,I)$
* $x_t$ : the current state of the Swiss roll, a mix of the previous state $x_{t-1}$ and noise
* $x_{t-1}$ : the previous state of the Swiss roll

```python=
def q_step(self, x_prev, t, noise=None):
    """
    One step of forward diffusion (Markov Chain) for visualization.
    x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * epsilon
    """
    if noise is None:
        noise = torch.randn_like(x_prev)

    beta_t = self.betas[t]    # beta_t
    alpha_t = self.alphas[t]  # alpha_t = 1 - beta_t

    return torch.sqrt(alpha_t) * x_prev + torch.sqrt(beta_t) * noise
```

In the Example we draw 40 points. When `q_step` is called, `x_prev` is $x_{t-1}$. Every $x_t$ consists of 40 **2D** points describing the Swiss roll, so its shape is `[40,2]`, and the matching Noise is also `[40,2]`.

### Reverse Process

#### Neural Network

The training goal is to remove noise by predicting it with a **Neural Network**, which we call the **Noise Predictor**. Because the Swiss roll is just a set of **2D** points, the **Input** of the Noise Predictor is $x_t$. During training we randomly pick `batch_size` points, so $x_t$ has shape `[batch_size,2]`. The Noise Predictor predicts the **Noise** of each point, so its **Input Size** is 2 and its **Output Size** is also **2**. In practice a whole **batch** is fed in and `batch_size` noise vectors are **predicted** at once. We then use **MSE (Mean Square Error)** to measure how far the predicted **Noise** is from the **Noise** actually mixed into $x_t$; the closer, the better.

```python=
import torch
import torch.nn as nn


class NoisePredictor(nn.Module):
    def __init__(self, input_dim=2, time_dim=32):
        super().__init__()
        self.time_dim = time_dim

        # Time Embedding
        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbeddings(time_dim),
            nn.Linear(time_dim, time_dim),
            nn.ReLU()
        )

        # MLP Backbone
        # Increased hidden dim 128 -> 256 for better capacity
        self.model = nn.Sequential(
            nn.Linear(input_dim + time_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim)
        )

    def forward(self, x, t):
        t_emb = self.time_mlp(t)                # [batch_size, time_dim]
        x_input = torch.cat([x, t_emb], dim=1)  # [batch_size, input_dim + time_dim]
        return self.model(x_input)
```

Another detail is the time **Embedding**. With **200** time steps we have to tell the **Noise Predictor** which **step** it is at, but a raw value such as **200** is too large for a **Neural Network**, and keeping the value range small helps training. So the **Timestep** is mapped to bounded values (sines and cosines in $[-1,1]$) and lifted to a **32**-dimensional vector, which gives the **Neural Network** a clearer signal. A **Linear Layer** is added on top so the network can freely adjust this **Vector** and amplify useful features. For `T=200`, the bounded and lifted embedding with `dim=32` looks like this :

![position_embedding_vis](https://hackmd.io/_uploads/HyFCeB3zWg.gif =70%x)

```python=
import math


class SinusoidalPositionEmbeddings(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        device = time.device
        half_dim = self.dim // 2
        # Geometric progression of frequencies from 1 down to 1/10000
        embeddings = math.log(10000) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
        # Outer product: [batch_size, 1] * [1, half_dim] -> [batch_size, half_dim]
        embeddings = time[:, None] * embeddings[None, :]
        # Concatenate sin and cos -> [batch_size, dim]
        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
        return embeddings
```

### Training

The **Training** steps are :

1. **Randomly sample** `indices`, used to pick `batch_size` random points from $x_0$
2. **Randomly sample** `t`, `batch_size` random timesteps in $[0, T)$
3. Generate **Noise** from a normal distribution and use this **Noise** together with `t` to obtain $x_t$. Note that this $x_t$ mixes many different timesteps: since `t` is random, the batch contains `batch_size` (x,y) points, each noised with its own `t`
4. Feed $x_t$ and `t` into the **Noise Predictor** to predict the noise
5. Use **MSE** to measure the difference between the actual **noise** and the predicted **noise**

```python=
def fit(self, dataset, epochs, batch_size, lr, forward_diffusion, device):
    optimizer = optim.Adam(self.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    timesteps = forward_diffusion.timesteps

    print(f"Start Training ({epochs} steps)...")
    for epoch in range(epochs):
        # 1. Sample Batch
        indices = torch.randint(0, len(dataset), (batch_size,))
        x_0 = dataset[indices]  # dataset should already be on device

        # 2. Sample Random Timesteps t
        t = torch.randint(0, timesteps, (batch_size,)).long().to(device)

        # 3. Add Noise (Forward Process)
        noise = torch.randn_like(x_0)
        x_t = forward_diffusion.q_sample(x_0, t, noise)

        # 4. Predict Noise
        predicted_noise = self(x_t, t)

        # 5. Optimization
        loss = loss_fn(predicted_noise, noise)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if epoch % 2000 == 0:
            print(f"Step {epoch}, Loss: {loss.item():.6f}")

    print("Training Complete!")
```

### Forward (q_sample)

The **function** `q_sample` takes $x_0$ and a timestep $t$ and returns the corresponding $x_t$. Normally, for a given $t$, we would need a **for loop** that adds noise to $x_0$ one step at a time, but for training efficiency this can be simplified. Start from the original one-step rule :

$$
x_t = \sqrt{1-\beta_t}\cdot x_{t-1}+\sqrt{\beta_t}\cdot\epsilon_{t-1}
$$

Replace $1-\beta_t$ with $\alpha_t$ :

$$
x_t = \sqrt{\alpha_t}\cdot x_{t-1}+\sqrt{1-\alpha_t}\cdot\epsilon_{t-1}
$$

Expand once more :

$$
x_t=\sqrt{\alpha_t}\cdot(\sqrt{\alpha_{t-1}}\cdot x_{t-2}+\sqrt{1-\alpha_{t-1}}\cdot\epsilon_{t-2})+\sqrt{1-\alpha_t}\cdot\epsilon_{t-1}
$$

Multiply out :

$$
x_t=\sqrt{\alpha_t\alpha_{t-1}}\cdot x_{t-2}+\sqrt{\alpha_t(1-\alpha_{t-1})}\cdot\epsilon_{t-2}+\sqrt{1-\alpha_t}\cdot\epsilon_{t-1}
$$

Here $\sqrt{\alpha_t(1-\alpha_{t-1})}$ and $\sqrt{1-\alpha_t}$ can be seen as the **Standard Deviations** of two Gaussian noise terms. Normal distributions are closed under addition: the sum of two independent zero-mean Gaussians with **Standard Deviations** $A$ and $B$ is again Gaussian, with **Standard Deviation** $\sqrt{A^2+B^2}$.

![gaussian_combination](https://hackmd.io/_uploads/SJX3bP3MWl.png)

$$
A=\sqrt{\alpha_t}\sqrt{1-\alpha_{t-1}}
$$

$$
B=\sqrt{1-\alpha_{t}}
$$

$$
A^2+B^2=[\alpha_t(1-\alpha_{t-1})]+[1-\alpha_t]
$$

$$
=\alpha_t-\alpha_t\alpha_{t-1}+1-\alpha_t
$$

$$
=1-\alpha_t\alpha_{t-1}
$$

so $\sqrt{A^2+B^2}=\sqrt{1-\alpha_t\alpha_{t-1}}$, and we can write

$$
x_t=\sqrt{\alpha_t\alpha_{t-1}}\cdot x_{t-2}+\sqrt{1-\alpha_t\alpha_{t-1}}\cdot\epsilon
$$

Because $\epsilon_{t-1}$ and $\epsilon_{t-2}$ are independent **Samples** from the same normal distribution, a weighted sum of them, even of 1000 of them, is statistically the same as a single freshly **Sampled** $\epsilon$ scaled by the combined **Standard Deviation**.

The **Equation** above still has to be expanded all the way down to `t=0`, but it can be written as a cumulative product :

$$
\bar{\alpha}_t = \prod^t_{i=1}\alpha_i
$$

$$
x_t=\sqrt{\bar{\alpha}_t}\cdot x_0+\sqrt{1-\bar{\alpha}_t}\cdot \epsilon
$$

This way all $\bar{\alpha}_t$ can be precomputed.

```python=
def q_sample(self, x_0, t, noise=None):
    """
    Input data x_0 and timestep t, return noised x_t
    Formula: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
    """
    if noise is None:
        noise = torch.randn_like(x_0)

    # Get coefficients for timestep t (handling batch t)
    # Using the precomputed self.sqrt_alphas_cumprod; reshape to [batch_size, 1] for broadcasting
    sqrt_alpha_bar_t = self.sqrt_alphas_cumprod[t].reshape(-1, 1)
    sqrt_one_minus_alpha_bar_t = self.sqrt_one_minus_alphas_cumprod[t].reshape(-1, 1)

    # Reparameterization trick
    return sqrt_alpha_bar_t * x_0 + sqrt_one_minus_alpha_bar_t * noise
```

### Inference (p_sample)

We now have a trained **Neural Network** that predicts **Noise**. Next is how to use the predicted noise to denoise $x_t$ into $x_{t-1}$ :

1. $x_t$ : the data currently being denoised
2. $t$ : the current timestep
3. $\beta_t$ : the predefined parameter, which must match the **Forward Process**
4. $\epsilon_\theta$ : the **Noise** predicted by the **Model**
5. $\mu$ : the `mean` of $x_{t-1}$ in the **Reverse Process**
6. $\sigma_t$ : the **Standard Deviation**, here simply the $\sqrt{\beta_t}$ used when adding noise in the **Forward Process**
7. $z$ : random noise, $z\sim\mathcal{N}(0,I)$

The **Equation** of the **Reverse Diffusion Process** :

$$
\mu=\frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta\right)
$$

$$
x_{t-1}=\mu+\sigma_t z
$$

> The derivation is given later. $\mu$ is the mean of $x_{t-1}$, that is, what $x_t$ looks like after denoising and before the fresh noise $\sigma_t z$ is added

```python=
def p_sample(model, x_t, t, betas, sqrt_one_minus_alphas_cumprod):
    """
    Reverse process step: Predict mean and variance to sample x_{t-1} from x_t
    Formula: x_{t-1} = 1/sqrt(alpha_t) * (x_t - (beta_t / sqrt(1-alpha_bar_t)) * epsilon_theta) + sigma_t * z
    """
    # 1. Get coefficients
    beta_t = betas[t]
    alpha_t = 1 - beta_t
    sqrt_alpha_t = torch.sqrt(alpha_t)

    # 2. Predict Noise (epsilon_theta)
    # t is scalar, so create a tensor batch
    t_tensor = torch.tensor([t] * len(x_t)).long().to(x_t.device)
    epsilon_theta = model(x_t, t_tensor)

    # 3. Compute Mean
    # This formula removes the predicted noise to estimate the previous state
    sqrt_one_minus_alpha_bar_t = sqrt_one_minus_alphas_cumprod[t]
    mean = (1 / sqrt_alpha_t) * (x_t - (beta_t / sqrt_one_minus_alpha_bar_t) * epsilon_theta)

    # 4. Add Variance (Sigma * z)
    if t > 0:
        z = torch.randn_like(x_t)
        sigma_t = torch.sqrt(beta_t)  # Simple option for variance
        return mean + sigma_t * z
    else:
        return mean  # No noise added at the final step t=0
```

The result was shown above. Below is a simplified version.

---

Unofficial simplified version :

$$
x_t = \sqrt{\alpha_t}\cdot x_{t-1}+\sqrt{1-\alpha_t}\cdot\epsilon_{t-1}
$$

Rearrange directly (subtract the right-hand noise term, divide by $\sqrt{\alpha_t}$) :

$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}} (x_t - \sqrt{1-\alpha_t} \epsilon_{t-1})
$$

However, $\epsilon_{t-1}$ is the exact **Noise** that was added to $x_{t-1}$, which is unknown at inference time. We substitute the noise $\epsilon_\theta$ predicted by the **Noise Predictor** instead, which gives a similar effect.

> This is not mathematically exact and differs from the original **DDPM**: $\epsilon_\theta$ was trained to predict the total noise in $x_t$ relative to $x_0$, not the single-step noise $\epsilon_{t-1}$

It looks like this :

![ddpm_training_reverse_simple](https://hackmd.io/_uploads/BJrRbspGbg.gif =70%x)

In my experiments, if the $\sigma_t z$ term is dropped (returning `mean` instead of `mean + sigma_t * z`), the simplified version collapses into a single point. The added noise acts a bit like resistance: it keeps the sample from removing all noise before **t** gets close to 0. Since the noise scale follows the same $\beta_t$ schedule as the **Forward Process**, larger noise is injected early in the reverse process and smaller noise later, so while the final details are being resolved the points can still stay a bit spread out instead of collapsing onto one point, that is, being pinned to the mean. My guess is that when t is large the x and y directions are essentially destroyed and the **Noise Predictor** cannot predict a **Noise** with a specific direction, whereas when t is close to 0 the **Noise Predictor** can work out where the original shape is, so the generated **Noise** pushes x and y to the right positions. It looks like this :

![ddpm_training_reverse](https://hackmd.io/_uploads/rkKz_oazWe.gif =70%x)

> The noise added in the **Forward Process** is always $\mathcal{N}(0,I)$, so the **Noise Predictor** does not output **Noise** at a different scale just because **t** differs. The amount of change is controlled by the **Standard Deviation**, which is determined by $\beta_t$. At timestep **t** the **Neural Network** only predicts the expected value of the **Noise** at that moment

## Generate Image

### CNN

**Convolution Neural Network**

The fully connected layer (**Fully Connected Layer**) or multi-layer perceptron (**Multi-Layer Perceptron**) commonly used in **Deep Learning** looks like this :

![image](https://hackmd.io/_uploads/B10HWFK-Ze.png =70%x)

A **CNN** is wired differently. It is modeled on the convolution operation common in image processing, so calling it a "connection" is somewhat misleading. Image processing with a **Gaussian Filter** is a good example :

![1_Ra4DG6PT0hxnvH2aW2OUKw](https://hackmd.io/_uploads/BygFG3FtZbe.gif =70%x)

The trainable parts in **Deep Learning** are the **Weight** and the **Bias**. In a **CNN** the **Weight** is the set of values in the **Filter**, and the **Bias** is added after the convolution is computed :

![image](https://hackmd.io/_uploads/H1jppYKZ-e.png)

Taking **Pytorch**'s **Conv2d** as the example, the parameters mean :

1. `in_channels` : the **Input Channel** count. A color image usually has three **RGB** **Channels**, so `in_channels` is **3**
2. `out_channels` : the **Output Channel** count, that is, how many **Channels** there are after this **Layer**. Apart from the original **Input** image, the other images are usually called **Feature Maps**, and the stack of **Feature Maps** gets thicker as the layers go deeper
3. `kernel_size` : the size of the **Filter**. In the figure above it is **3**, meaning a **3x3** **Filter**
4. `stride` : the **Step** size of the convolution, **1** by default
5. `padding` : how much the **Width** and **Height** of the image are extended. A convolution shrinks the image (each of **Width/Height** loses `kernel_size - 1` pixels at `stride = 1`), so `padding` is often used to fill the border with **0** and keep the image the same size through **Conv2d**. `padding` defaults to 0

![1_O06nY1U7zoP4vE5AZEnxKA](https://hackmd.io/_uploads/rJ7xE9KWWl.gif =70%x)

The number of **Weights** and **Biases** is `in_channels x kernel_size x kernel_size x out_channels + out_channels`

### Convolution Transpose 2D

This is used for **Deconvolution** (transposed convolution).

**ConvTranspose2d** : `stride = 1`

![upload_4acc4dabe15b606153a23b5a50552d71](https://hackmd.io/_uploads/ryokn3rVyl.gif)

`Image size : 2x2 -> 4x4`

**ConvTranspose2d** : `stride = 2`

![upload_67d4127c4e659fdf3204eb3b7f4d3b30](https://hackmd.io/_uploads/HyVJh3BNkg.gif)

`Image size : 2x2 -> 5x5`

### Max Pooling

**Max Pooling** works on a window of a given size, for example **2x2**, that moves over the **Feature Map** with **Stride = 2**. Each position keeps the largest value inside the window, so the **Feature Map** ends up half the size.

![image](https://hackmd.io/_uploads/SJDIWs0Z-e.png)

**Max Pooling** extracts the most important features and reduces computation. It also summarizes the **Features** inside each **2x2** region. Do not underestimate a **2x2** window: after several rounds of **Max Pooling**, each value in the final **Feature Map** covers a very large area, so as long as the wanted **Feature** appears somewhere in the image, the final high-level feature will capture it. This is especially important in image recognition.

![image](https://hackmd.io/_uploads/Bk68JYeGZe.png)

### U-Net

**U-Net** is split into an **Encoder** and a **Decoder**. The left half is the **Encoder** and the right half is the **Decoder** :

![image](https://hackmd.io/_uploads/Sy61AqF-bx.png)

* In the **Encoder**, before each reduction of the **Image** size, the data passes through two **Convolution Layers** that keep the number of **Feature Maps** the same
* The **Encoder** gradually shrinks the image while the number of **Feature Maps** grows; the **Decoder** gradually enlarges the image while the number of **Feature Maps** shrinks
* The gray arrows are where the **Feature Maps** of the **Encoder** and the **Decoder** are combined
* The last layer reduces all the **Feature Maps** to a single map (**Binary Mask**), which is the final image or **Mask**

**Mask Example** :

![image](https://hackmd.io/_uploads/SyFHMjT-be.png =60%x)

## Detail

### DiffUNet

Reference :
https://learnopencv.com/denoising-diffusion-probabilistic-models/

Here the **Forward Process** is the process of adding noise to a color image :

![image](https://hackmd.io/_uploads/SyGmgXS7-e.png)

The **Reverse Process** predicts the noise and gradually removes it :

![image](https://hackmd.io/_uploads/HkPBlmHQWx.png)

I demonstrate this with a fairly plain **UNet**, except that instead of generating a Mask it is used to predict **Noise**.

Predicting the **Noise** of an image also requires an **Embedding** of the integer **t**, turning **t** into a unique **Vector**.

**Position Embedding** :

```python!
import torch
import torch.nn as nn


class SinusoidalPositionEmbeddings(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        device = time.device
        half_dim = self.dim // 2
        embeddings = torch.log(torch.tensor(10000.)) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
        embeddings = time[:, None] * embeddings[None, :]
        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
        return embeddings
```

---

The **UNet** contains an **Encoder** and a **Decoder**. At every level, both parts go through two rounds of **Convolution** + **ReLu**. We also add the **Vector** that represents **t**, and pass the **feature map** output by each **Convolution** through **Group Normalization** :

```python!
# ---------------------------------
# Block Module
# ---------------------------------
class Block(nn.Module):
    def __init__(self, in_channels, out_channels, time_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # GroupNorm for better training stability
        # Using 32 groups for GroupNorm (out_channels must be divisible by 32)
        self.gn1 = nn.GroupNorm(32, out_channels)

        # Project time embeddings to the output channel dimension
        # so they can be added to the feature map
        self.time_mlp = nn.Linear(time_dim, out_channels)

        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.gn2 = nn.GroupNorm(32, out_channels)

    def forward(self, x, t):
        # 1. Conv + GN + ReLU
        x = self.conv1(x)
        x = self.gn1(x)
        x = torch.relu(x)

        # 2. Add time embedding
        t_emb = self.time_mlp(t)
        # Expand t_emb to match x's dimensions
        t_emb = t_emb.view(t_emb.shape[0], t_emb.shape[1], 1, 1)  # [batch_size, out_channels, 1, 1]
        x = x + t_emb

        # 3. Conv + GN + ReLU
        x = self.conv2(x)
        x = self.gn2(x)
        x = torch.relu(x)
        return x
```

---

`Image Size : 256x256`

The body of **DiffUNet** can then be defined as follows. The biggest difference from a plain UNet is the extra **Time Embedding** in every **Encoder** and **Decoder** **Block** :

```python!
# ---------------------------------
# DiffUNet Model
# ---------------------------------
class DiffUNet(nn.Module):
    def __init__(self, input_channels=3, time_dim=32):
        super().__init__()

        self.time_embedding = nn.Sequential(
            SinusoidalPositionEmbeddings(time_dim),
            nn.Linear(time_dim, time_dim),
            nn.ReLU()
        )

        # Encoder
        # 256 -> 128
        self.encoder1 = Block(input_channels, 64, time_dim)
        self.pool1 = nn.MaxPool2d(2)
        # 128 -> 64
        self.encoder2 = Block(64, 128, time_dim)
        self.pool2 = nn.MaxPool2d(2)
        # 64 -> 32
        self.encoder3 = Block(128, 256, time_dim)
        self.pool3 = nn.MaxPool2d(2)
        # 32 -> 16
        self.encoder4 = Block(256, 512, time_dim)
        self.pool4 = nn.MaxPool2d(2)
        # 16 -> 8
        self.encoder5 = Block(512, 512, time_dim)
        self.pool5 = nn.MaxPool2d(2)

        # Bottleneck
        # feature map : 512x8x8 -> 1024x8x8
        self.bottleneck = Block(512, 1024, time_dim)

        # Decoder
        # 8 -> 16
        self.up5 = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)
        # concat with encoder5 output : 512 + 512 = 1024
        self.decoder5 = Block(1024, 512, time_dim)
        # 16 -> 32
        self.up4 = nn.ConvTranspose2d(512, 512, kernel_size=2, stride=2)
        # concat with encoder4 output : 512 + 512 = 1024
        self.decoder4 = Block(1024, 512, time_dim)
        # 32 -> 64
        self.up3 = nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2)
        # concat with encoder3 output : 256 + 256 = 512
        self.decoder3 = Block(512, 256, time_dim)
        # 64 -> 128
        self.up2 = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
        # concat with encoder2 output : 128 + 128 = 256
        self.decoder2 = Block(256, 128, time_dim)
        # 128 -> 256
        self.up1 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        # concat with encoder1 output : 64 + 64 = 128
        self.decoder1 = Block(128, 64, time_dim)

        # Output layer
        # feature map : 64x256x256 -> 3x256x256
        self.out_conv = nn.Conv2d(64, input_channels, kernel_size=1)

    def forward(self, x, t):
        # Time embedding
        t_emb = self.time_embedding(t)

        # Encoder
        e1 = self.encoder1(x, t_emb)
        p1 = self.pool1(e1)
        e2 = self.encoder2(p1, t_emb)
        p2 = self.pool2(e2)
        e3 = self.encoder3(p2, t_emb)
        p3 = self.pool3(e3)
        e4 = self.encoder4(p3, t_emb)
        p4 = self.pool4(e4)
        e5 = self.encoder5(p4, t_emb)
        p5 = self.pool5(e5)

        # Bottleneck
        b = self.bottleneck(p5, t_emb)

        # Decoder (upsample, concat the matching encoder output, then Block)
        u5 = self.up5(b)
        d5 = self.decoder5(torch.cat([u5, e5], dim=1), t_emb)
        u4 = self.up4(d5)
        d4 = self.decoder4(torch.cat([u4, e4], dim=1), t_emb)
        u3 = self.up3(d4)
        d3 = self.decoder3(torch.cat([u3, e3], dim=1), t_emb)
        u2 = self.up2(d3)
        d2 = self.decoder2(torch.cat([u2, e2], dim=1), t_emb)
        u1 = self.up1(d2)
        d1 = self.decoder1(torch.cat([u1, e1], dim=1), t_emb)

        out = self.out_conv(d1)
        return out
```

### EMA (Exponential Moving Average)

To make the **Model** more stable, we apply a **Moving Average** to the changes of the **Model**'s **Parameters**, so that the **Weights** do not change too abruptly.

```python!
# ---------------------------------
# EMA (Exponential Moving Average) Class
# ---------------------------------
import copy

import torch


class EMA:
    def __init__(self, model, beta=0.995):
        self.beta = beta
        self.step = 0
        # Create a copy of the model for EMA
        self.ema_model = copy.deepcopy(model)
        # Freeze the EMA model parameters
        for param in self.ema_model.parameters():
            param.requires_grad_(False)

    def update(self, model):
        self.step += 1
        for current_param, ema_param in zip(model.parameters(), self.ema_model.parameters()):
            # Update EMA parameter: ema = beta * ema + (1 - beta) * current
            ema_param.data.mul_(self.beta)
            ema_param.data.add_(current_param.data * (1.0 - self.beta))

    def copy_to(self, model):
        model.load_state_dict(self.ema_model.state_dict())

    def save_pretrained(self, path):
        torch.save(self.ema_model.state_dict(), path)
```

At the start a copy of **DiffUNet** is made, and after every parameter update we **Call** `update()` to refresh `ema_model`. The update rule is simple: the old `ema_model` weights count for `0.995` and the freshly updated **model** counts for `0.005`. So when saving the **Model** or running **Inference**, use this `ema_model`.

### Dataset

This time I only train on three cat photos picked from the `Oxford-IIIT Pet Dataset`. The thing to watch is that, for the **Noise** to behave properly, the whole image should be **Normalized** (here to $[-1,1]$), so we use `torchvision.transforms` for preprocessing :

```python!
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


class OxfordPetLoader:
    def __init__(self, root='./data', batch_size=8, image_size=256, download=True, cat_only=True):
        self.root = root
        self.batch_size = batch_size

        self.transform = transforms.Compose([
            transforms.Resize(image_size),
            transforms.CenterCrop(image_size),
            transforms.ToTensor(),
            # Map pixel values from [0, 1] to [-1, 1]
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ])

        full_dataset = datasets.OxfordIIITPet(
            root=root,
            split='trainval',
            target_types='category',
            download=download,
            transform=self.transform
        )

        if cat_only:
            # 1. Define cat breeds
            cat_breeds = [
                "Abyssinian", "Bengal", "Birman", "Bombay", "British Shorthair",
                "Egyptian Mau", "Maine Coon", "Persian", "Ragdoll", "Russian Blue",
                "Siamese", "Sphynx"
            ]

            # 2. Get corresponding label IDs
            # full_dataset.class_to_idx is a dict {'Abyssinian': 0, ...}
            cat_ids = set()
            for cat in cat_breeds:
                if cat in full_dataset.class_to_idx:
                    cat_ids.add(full_dataset.class_to_idx[cat])
            print(f"Cat Label IDs: {cat_ids}")

            # 3. Filter dataset for cat images only
            # OxfordIIITPet does not provide direct access to labels, so we access the private attribute
            # full_dataset._labels is a list of labels corresponding to each image
            all_labels = full_dataset._labels
            cat_indices = [i for i, label in enumerate(all_labels) if label in cat_ids]

            self.dataset = torch.utils.data.Subset(full_dataset, cat_indices)
            print(f"Filtered Oxford-Pet for Cats. Total Images: {len(self.dataset)}")
        else:
            self.dataset = full_dataset

    def get_loader(self):
        return DataLoader(
            self.dataset,
            batch_size=self.batch_size,
            shuffle=True,                           # Shuffle for training
            num_workers=get_optimal_num_workers(),  # Helper not shown here: CPU cores - 1
            pin_memory=True                         # Pin memory for faster transfers
        )
```

### p_sample

To improve the quality of the generated images, I added an $x_0$ **Predict** step and a **Clipping** step :

`x0_pred`

$$
x'_0 = \frac{(x_t-\sqrt{1-\bar\alpha_t}\cdot \epsilon_\theta)}{\sqrt{\bar\alpha_t}}
$$

**clipping**

$$
x'_0 = clip(x'_0,-1,1)
$$

Recover $\epsilon$ from the clipped $x'_0$

$$
\epsilon=\frac{(x_t-\sqrt{\bar\alpha_t}\cdot x'_0)}{\sqrt{1-\bar\alpha_t}}
$$

Compute the **mean**

$$
\mu=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\cdot \epsilon)
$$

```python!
class ReverseDiffusion:
    @staticmethod
    @torch.no_grad()
    def p_sample(model, x_t, t, betas, clip_range=(-1.0, 1.0), clip_denoised=True):
        # Compute alphas and alpha_bars
        alphas = 1.0 - betas
        beta_t = betas[t]
        alpha_t = alphas[t]
        alpha_bar_t = torch.prod(alphas[:t+1])

        # Pre-calculate square roots
        sqrt_alpha_t = torch.sqrt(alpha_t)
        sqrt_one_minus_alpha_bar_t = torch.sqrt(1.0 - alpha_bar_t)

        # 1. Predict noise using the model
        t_tensor = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
        epsilon_theta = model(x_t, t_tensor)

        # 2. Estimate x0 from x_t and predicted noise
        # x0_pred = (x_t - sqrt(1-alpha_bar_t) * eps) / sqrt(alpha_bar_t)
        sqrt_alpha_bar_t = torch.sqrt(alpha_bar_t)
        x0_pred = (x_t - sqrt_one_minus_alpha_bar_t * epsilon_theta) / sqrt_alpha_bar_t

        # 3. Clip x0_pred to specified range
        if clip_denoised:
            low, high = clip_range
            x0_pred = torch.clamp(x0_pred, low, high)

        # 4. Compute epsilon used for posterior mean calculation
        epsilon_used = (x_t - sqrt_alpha_bar_t * x0_pred) / sqrt_one_minus_alpha_bar_t

        # 5. Compute the mean of the posterior q(x_{t-1} | x_t, x_0)
        mean = (1.0 / sqrt_alpha_t) * (x_t - (beta_t / sqrt_one_minus_alpha_bar_t) * epsilon_used)

        if t > 0:
            z = torch.randn_like(x_t)
            return mean + torch.sqrt(beta_t) * z
        else:
            return mean
```

### Training loop

```python!
import time

import torch

# `model`, `imgs`, `optimizer`, `scheduler`, `loss_function`, `forward_diffusion`,
# `device` and the upper-case constants are created earlier in the script (not shown here).

# Initialize a copy of the model for EMA (Exponential Moving Average)
ema_model = EMA(model, beta=0.995)

# Training Loop
print(f"Start Training ({EPOCHS} steps)...")
for epoch in range(EPOCHS):
    time_start = time.time()
    total_loss = 0.0

    # 1. Sample random timesteps
    t = torch.randint(0, TIMESTEPS, (BATCH_SIZE,), device=device)

    # 2. Forward diffusion process
    noise = torch.randn_like(imgs)
    x_t = forward_diffusion.q_sample(imgs, t, noise=noise)

    # 3. Predict noise
    predicted_noise = model(x_t, t)

    # 4. Compute loss
    loss = loss_function(predicted_noise, noise)

    # Backpropagation
    optimizer.zero_grad()
    loss.backward()

    # Gradient clipping (optional, can help with stability)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Update parameters
    optimizer.step()

    # Update EMA model
    ema_model.update(model)

    # Step the learning rate scheduler
    scheduler.step()

    total_loss += loss.item()
    avg_loss = total_loss / N_IMAGE
    time_end = time.time()

    if (epoch + 1) % 500 == 0 or epoch == 0:
        print(f"Epoch [{epoch + 1}/{EPOCHS}], Loss: {avg_loss:.4f}, Time: {time_end - time_start:.2f}s")
```

### GIF

![ddpm_cat_reverse_process_1](https://hackmd.io/_uploads/S1O2EErXWe.gif)

### Reverse Process Derivation

**Forward Process** update (**One step**) :

$$
x_t=\sqrt{1-\beta_t}\cdot x_{t-1}+\sqrt{\beta_t}\cdot \epsilon
$$

![image](https://hackmd.io/_uploads/HkPBlmHQWx.png)

* The **Forward Process** can be written in **Gaussian Distribution** form :

$$
q(x_t|x_{t-1})=\mathcal{N}(x_t;\sqrt{1-\beta_t}x_{t-1},\beta_tI)
$$

That is, the $\mu$ of the **Normal Distribution** is $\sqrt{1-\beta_t}x_{t-1}$ and the $\sigma$ is $\sqrt{\beta_t}$; $x_t$ is **Sampled** using these two parameters.

---

Suppose we know the original image $x_0$ and the current noisy image $x_t$. The **Distribution** from which $x_{t-1}$ is **Sampled** in this case can be assembled from three **Distributions** using Bayes' theorem.

By the definition of conditional probability, $P(A|B)=\frac{P(A,B)}{P(B)}$ :

$$
q(x_{t-1}|x_t,x_0)=\frac{q(x_t,x_{t-1},x_0)}{q(x_t,x_0)}
$$

By the chain rule $P(C,B,A)=P(C|A,B)P(B|A)P(A)$, we get $q(x_t,x_{t-1},x_0)=q(x_t|x_{t-1},x_0)\cdot q(x_{t-1}|x_0)\cdot q(x_0)$. Reading from right to left: the probability of $x_0$, then the probability of **Sampling** $x_{t-1}$ given $x_0$, then the probability of **Sampling** $x_t$ given both $x_0$ and $x_{t-1}$. In $q(x_t|x_{t-1},x_0)$, $x_t$ is produced from $x_{t-1}$ alone, so $x_0$ is not needed and the term simplifies to $q(x_t|x_{t-1})$. By the chain rule $P(B,A)=P(B|A)P(A)$, $q(x_t,x_0)=q(x_t|x_0)\cdot q(x_0)$, so we can write :

$$
q(x_{t-1}|x_t,x_0)=\frac{q(x_t|x_{t-1})\cdot q(x_{t-1}|x_0)\cdot q(x_0)}{q(x_t|x_0)\cdot q(x_0)}
$$

$q(x_0)$ appears in both the numerator and the denominator, so it cancels :

$$
q(x_{t-1}|x_t,x_0)=\frac{q(x_t|x_{t-1})\cdot q(x_{t-1}|x_0)}{q(x_t|x_0)}
$$

Next, since the **Equation** we want to solve is a **Distribution** over $x_{t-1}$, $x_{t-1}$ is our variable, while $x_t$ and $x_0$
are treated as known constants, so the denominator can be dropped for now and restored later, which turns the equation into a proportionality :

$$
q(x_{t-1}|x_t,x_0)\propto q(x_t|x_{t-1})\cdot q(x_{t-1}|x_0)
$$

---

Write the right-hand side as **Gaussian Distributions** :

$$
q(x_t|x_{t-1})=\mathcal{N}(x_t;\sqrt{\alpha_t}x_{t-1},\beta_tI)
$$

$$
q(x_{t-1}|x_0)=\mathcal{N}(x_{t-1};\sqrt{\bar\alpha_{t-1}}x_0,(1-\bar\alpha_{t-1})I)
$$

---

To find the $\mu$ of the resulting **Gaussian Distribution**, we take the exponent of the **PDF (Probability Density Function)**, that is, of $e^{-\frac{1}{2\sigma^2}(x_{t-1}-\mu)^2}$, and simplify it, using $e^a\cdot e^b=e^{a+b}$ :

$$
\mathrm{Exponent} = -\frac{1}{2}\left[\frac{(x_t-\sqrt{\alpha_t}x_{t-1})^2}{\beta_t}+\frac{(x_{t-1}-\sqrt{\bar\alpha_{t-1}}x_0)^2}{1-\bar\alpha_{t-1}}\right]
$$

Expand the left term :

$$
\frac{x_t^2 - 2\sqrt{\alpha_t}x_t x_{t-1} + \alpha_t x_{t-1}^2}{\beta_t}
$$

Expand the right term :

$$
\frac{x_{t-1}^2 - 2\sqrt{\bar{\alpha}_{t-1}}x_0 x_{t-1} + \bar{\alpha}_{t-1}x_0^2}{1-\bar{\alpha}_{t-1}}
$$

---

Coefficient of the quadratic term $x_{t-1}^2$ from both sides :

$$
\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}
$$

Coefficient of the linear term $x_{t-1}$ :

$$
-2 \left( \frac{\sqrt{\alpha_t} x_t}{\beta_t} + \frac{\sqrt{\bar{\alpha}_{t-1}} x_0}{1-\bar{\alpha}_{t-1}} \right)
$$

---

For a **Gaussian Distribution** $\mathcal{N}(x;\mu,\Sigma)$, the exponent expands to $-\frac{1}{2} (\Sigma^{-1} x^2 - 2\Sigma^{-1}\mu x + \dots)$, so the **Mean** $\mu$ = (linear coefficient divided by -2) / (quadratic coefficient) :

$$
\tilde{\mu}_t (x_t, x_0) = \frac{ \frac{\sqrt{\alpha_t} x_t}{\beta_t} + \frac{\sqrt{\bar{\alpha}_{t-1}} x_0}{1-\bar{\alpha}_{t-1}} }{ \frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}} }
$$

---

Simplify the denominator (the quadratic coefficient), given $\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}$ and $\beta_t = 1 - \alpha_t$ :

$$
\begin{aligned}
\text{Denominator} &= \frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}} \\
&= \frac{\alpha_t(1-\bar{\alpha}_{t-1}) + \beta_t}{\beta_t(1-\bar{\alpha}_{t-1})} \\
&= \frac{\alpha_t - \alpha_t \bar{\alpha}_{t-1} + 1 - \alpha_t}{\beta_t(1-\bar{\alpha}_{t-1})} \quad (\text{substituting } \beta_t = 1-\alpha_t \text{ in the numerator}) \\
&= \frac{1 - \alpha_t \bar{\alpha}_{t-1}}{\beta_t(1-\bar{\alpha}_{t-1})} \\
&= \frac{1 - \bar{\alpha}_t}{\beta_t(1-\bar{\alpha}_{t-1})} \quad
(\text{since } \alpha_t \bar{\alpha}_{t-1} = \bar{\alpha}_t)
\end{aligned}
$$

---

Multiply by the reciprocal of the denominator :

$$
\tilde{\mu}_t = \left( \frac{\sqrt{\alpha_t} x_t}{\beta_t} + \frac{\sqrt{\bar{\alpha}_{t-1}} x_0}{1-\bar{\alpha}_{t-1}} \right) \cdot \frac{\beta_t(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}
$$

* Coefficient of $x_t$ :

$$
\frac{\sqrt{\alpha_t}}{\beta_t} \cdot \frac{\beta_t(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}
$$

* Coefficient of $x_0$ :

$$
\frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}} \cdot \frac{\beta_t(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}
$$

This gives the **Mean** :

$$
\tilde{\mu}_t = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}x_0
$$

---

To substitute $x_0$ away, start from the **Forward Process** **Equation** :

$$
x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon
$$

Rearrange. The true $\epsilon$ is unknown at inference time, so it is replaced by the prediction $\epsilon_\theta$, which makes this an estimate of $x_0$ :

$$
x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\epsilon_\theta}{\sqrt{\bar{\alpha}_t}}
$$

Substitute into the **Mean** :

$$
\tilde{\mu}_t = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t} \left[ \frac{x_t - \sqrt{1-\bar{\alpha}_t}\epsilon_\theta}{\sqrt{\bar{\alpha}_t}} \right]
$$

$\sqrt{\bar{\alpha}_t}$ can be split into $\sqrt{\alpha_t}\sqrt{\bar{\alpha}_{t-1}}$, so $\frac{\sqrt{\bar{\alpha}_{t-1}}}{\sqrt{\bar{\alpha}_t}} = \frac{1}{\sqrt{\alpha_t}}$. Tidying up :

$$
\tilde{\mu}_t = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t + \frac{\beta_t}{\sqrt{\alpha_t}(1-\bar{\alpha}_t)} (x_t - \sqrt{1-\bar{\alpha}_t}\epsilon_\theta)
$$

which can also be written as :

$$
\tilde{\mu}_t = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t + \frac{\beta_t}{\sqrt{\alpha_t}(1-\bar{\alpha}_t)}x_t -\frac{\beta_t\sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta}{\sqrt{\alpha_t}(1-\bar{\alpha}_t)}
$$

Then merge the coefficients of $x_t$ :

$$
\begin{aligned}
\text{Coef}(x_t) &=
\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} + \frac{\beta_t}{\sqrt{\alpha_t}(1-\bar{\alpha}_t)} \\
&= \frac{1}{\sqrt{\alpha_t}(1-\bar{\alpha}_t)} \left[ \alpha_t(1-\bar{\alpha}_{t-1}) + \beta_t \right] \quad (\text{common denominator, factoring out } \frac{1}{\sqrt{\alpha_t}(1-\bar{\alpha}_t)}) \\
&= \frac{1}{\sqrt{\alpha_t}(1-\bar{\alpha}_t)} \left[ \alpha_t - \bar{\alpha}_t + 1 - \alpha_t \right] \quad (\beta_t=1-\alpha_t, \alpha_t\bar{\alpha}_{t-1}=\bar{\alpha}_t) \\
&= \frac{1-\bar{\alpha}_t}{\sqrt{\alpha_t}(1-\bar{\alpha}_t)} \\
&= \frac{1}{\sqrt{\alpha_t}}
\end{aligned}
$$

Now look at the rightmost term that was left aside, the one containing $\epsilon_\theta$ (it enters the **Mean** with a minus sign) :

$$
\frac{\beta_t}{\sqrt{\alpha_t}(1-\bar{\alpha}_t)} \cdot \sqrt{1-\bar{\alpha}_t} \cdot \epsilon_\theta
$$

It can be written as :

$$
\frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar\alpha_t}\sqrt{1-\bar\alpha_t}} \cdot \sqrt{1-\bar{\alpha}_t} \cdot \epsilon_\theta
$$

so one $\sqrt{1-\bar\alpha_t}$ cancels :

$$
\frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar\alpha_t}}\cdot \epsilon_\theta
$$

Merging back into the **Mean** :

$$
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta \right)
$$

## DDIM

**Github Example** : https://github.com/jason19990305/Denoising-Diffusion-Implicit-Models.git

**DDIM (Denoising Diffusion Implicit Models)** is an improved version of **DDPM** whose main benefit is a faster **Reverse Process**. Although **DDIM** is mathematically defined as a non-**Markov Chain** process, the **Forward Process Equation** is the same :

* The **One Step Equation** of the **Forward Process** :

$$
x_t = \sqrt{1-\beta_t}\cdot x_{t-1}+\sqrt{\beta_t}\cdot\epsilon_{t-1}
$$

* **Diffusion** directly from $x_0$ to $x_t$

$$
x_t=\sqrt{\bar{\alpha}_t}\cdot x_0+\sqrt{1-\bar{\alpha}_t}\cdot \epsilon
$$

* The **Equation** of the **DDPM Reverse Process**

$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta) + \sigma_t z
$$

---

Derivation of the **DDIM Reverse Process** :

**Forward Process** :

$$
x_t=\sqrt{\bar{\alpha}_t}\cdot x_0+\sqrt{1-\bar{\alpha}_t}\cdot \epsilon_t
$$

Replacing $t$ with $t-1$ :

$$
x_{t-1}=\sqrt{\bar{\alpha}_{t-1}}\cdot
x_0+\sqrt{1-\bar{\alpha}_{t-1}}\cdot \epsilon_{t-1}
$$

Rearrange the **Forward Process** to solve for $x_0$ :

$$
x_0=\frac{x_t-\sqrt{1-\bar\alpha_t}\cdot\epsilon_\theta}{\sqrt{\bar\alpha_t}}
$$

Since the **Noise Predictor** can predict $\epsilon_t$, it is replaced here by $\epsilon_\theta$, the predicted noise. This **Equation** says $x_0$ can be guessed back from any $x_t$, but in practice it cannot be reached accurately in a single step.

Substitute $x_0$ into the **Equation** for $x_{t-1}$ :

$$
x_{t-1}=\sqrt{\bar{\alpha}_{t-1}}\cdot \frac{x_t-\sqrt{1-\bar\alpha_t}\cdot\epsilon_\theta}{\sqrt{\bar\alpha_t}}+\sqrt{1-\bar{\alpha}_{t-1}}\cdot \epsilon_\theta
$$

With the direct prediction of $x_0$ substituted away, the goal becomes predicting $x_{t-1}$, which is much easier. We can also make each step a little harder without hurting the result much by starting from a noisier $x_t$, for example $x_{t+10}$, but the cumulative **Variance** terms (the $\bar\alpha$ values) must change accordingly.

---

Another feature of **DDIM** is that it allows skipping timesteps.

**Example** : suppose we generate an **Array** that lists the $t$ values to jump between

```python!
[0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 69, 73, 77, 81, 85, 89, 93, 97, 101, 105, 109, 113, 117, 121, 125, 129, 134, 138, 142, 146, 150, 154, 158, 162, 166, 170, 174, 178, 182, 186, 190, 194, 199]
```

If $t=12$, the image is updated by :

$$
x_{8}=\sqrt{\bar{\alpha}_{8}}\cdot \frac{x_{12}-\sqrt{1-\bar\alpha_{12}}\cdot\epsilon_\theta}{\sqrt{\bar\alpha_{12}}}+\sqrt{1-\bar{\alpha}_{8}}\cdot \epsilon_\theta
$$

and the **Input** of the **Noise Predictor** is :

$$
\epsilon_\theta(x_{12},t=12)
$$

### Result

![ddim_forward_traj](https://hackmd.io/_uploads/HkBHmf6mbe.gif)

![ddim_reverse_process](https://hackmd.io/_uploads/BkHrmG6XWx.gif)

![ddim_cat_reverse_process_0](https://hackmd.io/_uploads/S1irAMaXZe.gif)
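
### DDIM step sketch

The skip-step **DDIM** update described in this section can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the code from the repository: it works on plain Python floats instead of tensors, the linear $\beta$ schedule (`1e-4` to `0.02`) and the helper names `make_alpha_bars`, `make_schedule` and `ddim_step` are made up for illustration, and the true noise stands in for the **Noise Predictor**. The evenly spaced, truncated schedule is one way to obtain 50 timesteps out of 200; whether the repository builds its array this way is not confirmed here.

```python
import math


def make_alpha_bars(num_steps=200, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of alpha_t = 1 - beta_t (assumed linear beta schedule)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def make_schedule(num_steps=200, num_samples=50):
    """Evenly spaced timesteps from 0 to num_steps - 1, truncated to integers."""
    return [int(i * (num_steps - 1) / (num_samples - 1)) for i in range(num_samples)]


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update from timestep t to an earlier timestep."""
    # Guess x_0 from x_t and the predicted noise
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Re-noise the guess to the earlier timestep with the same predicted noise
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred


alpha_bars = make_alpha_bars()
schedule = make_schedule()

# Sanity check for the jump t=12 -> t=8, using the true noise as a "perfect" predictor
x0, eps = 0.7, -1.3
t, t_prev = 12, 8
x_t = math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps
x_prev = ddim_step(x_t, eps, alpha_bars[t], alpha_bars[t_prev])
expected = math.sqrt(alpha_bars[t_prev]) * x0 + math.sqrt(1.0 - alpha_bars[t_prev]) * eps
print(schedule[:5], x_prev, expected)
```

With an exact noise estimate, the update lands on the forward-process sample at the earlier timestep, whatever the size of the jump. In real sampling `eps_pred` would come from the trained **Noise Predictor** evaluated at $(x_t, t)$.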