The Common Recipe Behind Diffusion

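# The Common Recipe Behind Diffusion

###### tags: `CV`

Based on [大金's video](https://youtu.be/JbfcAaBT66U)

## Framework

![](https://hackmd.io/_uploads/r1W_lDhS3.png)

1. The text input is encoded into a vector (blue).
2. Noise (the pink square) and the text encoder's output go through a generation model (e.g., a diffusion model), which produces an intermediate product: a compressed version of the image.
3. A decoder restores the intermediate product to the full image.

### Decoder

Training the decoder requires no text data; images alone are enough!

How the decoder is trained depends on what the intermediate product is. If the intermediate product is a small image, train the decoder with the small image as input and the large image as output. If the intermediate product is a latent representation, train an autoencoder (encoder-decoder) instead, with the goal of making the output image as close as possible to the input image. If the input is H × W × 3, the latent is typically h × w × c, where h and w are smaller than H and W.

![](https://hackmd.io/_uploads/S1ZGMwnH2.png)

### Generation model

* Input: the text representation
* Output: the intermediate product of the image
* Training:
    1. The encoder takes an image and produces the intermediate product.
    2. Sample noise and add it to the intermediate product, repeating over several steps.
    ![](https://hackmd.io/_uploads/ByFOtOnrh.png)
    3. Train the noise predictor:
        * Input: the text representation, the intermediate product at step x, and the step index x
        * Output: the noise at step x
    ![](https://hackmd.io/_uploads/SkSKK_2H3.png)
* Testing: sample noise from a normal distribution and feed it, together with the text representation, into the denoise module (the noise predictor), which removes noise step by step to produce the intermediate product.
    ![](https://hackmd.io/_uploads/rJRP5O3r2.png)

## DALL-E

The architecture is the same as the one described above. The generation model is either autoregressive or a diffusion model. Autoregressive generation would take too long for a full image, but it is feasible here because only the compressed image has to be generated.

## Imagen

A diffusion model first generates a small 64x64 image, and a final super-resolution diffusion model then turns it into a large image.

## Metric

Text encoder:

* The lower the FID, the better the generated images.
* The higher the CLIP score, the more closely the text and the image are related.
* Experiments show that the larger the text encoder, the better the results; by contrast, the U-Net size (the diffusion model size) has little effect.

## FID

FID is a metric for evaluating how good an image generation model is.

The generated images and the real images are each fed into a pretrained CNN to obtain hidden representations. The distributions of the generated and real hidden representations are then compared by computing a distance between them; the closer the two distributions, the better the generation result.

This method needs a large number of samples to give a result. For example, the result on the previous slide required 10k images.

## CLIP

![](https://hackmd.io/_uploads/r1oybDhrh.png)

The text encoder reads a piece of text and produces a corresponding vector. The image encoder reads an image and produces a corresponding vector. If the text and the image form a matching pair, the two vectors should be close to each other; otherwise they should be far apart.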
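The "close if paired, far otherwise" objective from the CLIP section can be sketched in plain Python. This is only an illustration under assumptions: the toy vectors stand in for encoder outputs, the temperature of 0.1 is an arbitrary choice, and the helper names (`cosine_similarity`, `contrastive_loss`) are made up for this note. It uses cosine similarity as the closeness measure and a symmetric cross-entropy over the similarity matrix.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def _cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against the index `target`."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(text_vecs, image_vecs, temperature=0.1):
    """Symmetric contrastive loss; text_vecs[i] is paired with image_vecs[i]."""
    n = len(text_vecs)
    # sims[i][j]: scaled similarity between text i and image j
    sims = [[cosine_similarity(t, v) / temperature for v in image_vecs]
            for t in text_vecs]
    # Each text should pick out its own image, and each image its own text.
    text_to_image = sum(_cross_entropy(sims[i], i) for i in range(n))
    image_to_text = sum(
        _cross_entropy([sims[r][j] for r in range(n)], j) for j in range(n)
    )
    return (text_to_image + image_to_text) / (2 * n)

# Hypothetical toy vectors: correctly paired vs. swapped images.
texts = [[1.0, 0.0], [0.0, 1.0]]
images_matched = [[0.9, 0.1], [0.1, 0.9]]
images_swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_matched = contrastive_loss(texts, images_matched)
loss_swapped = contrastive_loss(texts, images_swapped)
```

The loss is lower when each text vector sits next to its own image vector than when the pairs are swapped. The cosine similarity of a single text-image pair is also the kind of quantity a CLIP score is built on.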
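The distance described in the FID section can be illustrated with a simplified sketch. FID fits a Gaussian to each set of hidden representations and takes the Fréchet distance between the two Gaussians. The version below assumes the feature dimensions are independent (diagonal covariance), which reduces the distance to a sum of squared mean gaps and squared standard-deviation gaps. A full FID uses the complete covariance matrices and features from a pretrained CNN; the lists here are hypothetical stand-ins for those features, and `frechet_distance_diagonal` is a name invented for this note.

```python
from statistics import mean, pstdev

def frechet_distance_diagonal(real_feats, gen_feats):
    """Squared Fréchet distance between per-dimension Gaussians.

    Assumes independent feature dimensions (diagonal covariance),
    which is a simplification of the full FID formula.
    """
    dims = len(real_feats[0])
    total = 0.0
    for d in range(dims):
        real_col = [f[d] for f in real_feats]
        gen_col = [f[d] for f in gen_feats]
        mean_gap = mean(real_col) - mean(gen_col)
        std_gap = pstdev(real_col) - pstdev(gen_col)
        total += mean_gap ** 2 + std_gap ** 2
    return total

# Hypothetical 2-D "hidden representations".
real = [[0.0, 0.0], [2.0, 2.0]]
generated_same = [[0.0, 0.0], [2.0, 2.0]]
generated_shifted = [[3.0, 0.0], [5.0, 2.0]]

distance_same = frechet_distance_diagonal(real, generated_same)
distance_shifted = frechet_distance_diagonal(real, generated_shifted)
```

Identical feature sets give a distance of zero, and shifting the generated features away from the real ones increases it. With only a handful of samples the estimated means and variances are unreliable, which is why the metric needs many images.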
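The training steps listed under Generation model (add noise to the intermediate product, then train a predictor to output that noise) can be sketched as follows. Several things here are assumptions for illustration only: the linear `alpha_bar` schedule, the closed-form mixing of signal and noise in one shot instead of adding noise step by step, and the flat lists used as stand-ins for the latent and the text representation. No actual network is trained; the code only builds the (input, target) pair that the noise predictor would see.

```python
import math
import random

def alpha_bar(step, total_steps):
    """Fraction of the original signal kept at `step` (assumed linear schedule)."""
    return 1.0 - step / (total_steps + 1)

def add_noise(latent, step, total_steps, rng):
    """Return (noisy_latent, noise) for the given step."""
    keep = alpha_bar(step, total_steps)
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [math.sqrt(keep) * x + math.sqrt(1.0 - keep) * e
             for x, e in zip(latent, noise)]
    return noisy, noise

def make_training_example(latent, text_vec, total_steps, rng):
    """Build one (input, target) pair for the noise predictor."""
    step = rng.randint(1, total_steps)  # pick a random step x
    noisy, noise = add_noise(latent, step, total_steps, rng)
    # Input: text representation, step-x intermediate product, and x.
    # Target: the noise that was mixed in at step x.
    return (text_vec, noisy, step), noise

def mean_squared_error(predicted, target):
    """Loss a noise predictor would minimise against the true noise."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

rng = random.Random(0)
TOTAL_STEPS = 10
latent = [0.5, -0.2, 0.1, 0.8]   # hypothetical encoder output
text_vec = [1.0, 0.0]            # hypothetical text representation
(inputs, target_noise) = make_training_example(latent, text_vec, TOTAL_STEPS, rng)
```

At test time the same predictor is applied in the other direction: start from pure noise and repeatedly subtract the predicted noise, conditioned on the text representation, until the intermediate product is recovered.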