# Imagen

## Resources

* https://imagen.research.google/
* https://www.assemblyai.com/blog/minimagen-build-your-own-imagen-text-to-image-model/
* https://www.bilibili.com/video/BV1uv4y1o7VB/?spm_id_from=333.337.search-card.all.click&vd_source=796c7564c6b769675a4a711e201da355

## Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Saharia, Chitwan, et al. "Photorealistic text-to-image diffusion models with deep language understanding." Advances in neural information processing systems 35 (2022): 36479-36494.

[toc]

Imagen is a new model that makes major progress in generating images from text. The paper also proposes DrawBench as a new way to evaluate text-to-image models.

## Introduction

Text-to-image generation has advanced considerably over the past few years. Imagen is a new model that combines transformer language models with diffusion models for better image generation. Rather than crudely stitching elements together, it can combine unrelated concepts and objects in semantically plausible ways to produce new images.

### Contributions

The paper makes five main contributions:

1. It uses a frozen large language model as the text encoder and shows that scaling the text encoder helps more than scaling the diffusion model.
2. With dynamic thresholding, high guidance weights can be used to generate more realistic and detailed images.
3. It proposes Efficient U-Net as the diffusion architecture, which is simpler and more efficient.
4. In human evaluation, images generated by Imagen match the meaning of the text better.
5. It proposes the DrawBench evaluation method, under which Imagen outperforms DALL-E 2.

## Method

Imagen consists of an encoder, a generation model, and a decoder.

![image](https://hackmd.io/_uploads/B1cUBfFJJg.png)

The text encoder embeds the meaning of the caption into vectors. These vectors are fed to both the generation model and the decoder, telling the model what is in the caption. The generation model then produces a small image, and the decoder upscales it.

![image](https://hackmd.io/_uploads/rJABcqQ1kx.png)

:::spoiler
word -> text encoder -> meaning packed into vectors -> image-generation model (telling the model what is in the caption) -> super-resolution model (to a higher resolution) -> another super-resolution model (high-resolution image)

![image](https://hackmd.io/_uploads/SyMdqqmkJl.png)
:::

### Pretrained text encoders

Goal: understand how the words within the caption relate to one another, rather than simply concatenating them, which produces odd results.

![image](https://hackmd.io/_uploads/S1t8i5m1yg.png)

The caption is embedded by a pretrained, frozen text encoder (the T5 model), and the embedding is passed to the image generator (diffusion models).

![image](https://hackmd.io/_uploads/B1h-NoQkJe.png)

#### T5 model

Text-To-Text Framework

![image](https://hackmd.io/_uploads/rkoUdo7yJg.png)

An LLM such as T5 converts the text into vectors. These vectors capture the semantic relationships between words and strengthen the links within the text, so that the generated image matches the description.

* Scene: a cat wearing glasses on a table.
* Words: cat, wearing glasses, on the table.

The encoder converts these words into vectors and models the relationships between them, for example the link between "cat" and "wearing glasses", and produces the corresponding vectors.

#### Why T5, and why scale the language model?
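Both questions in this section come down to measuring how well an image matches its caption. As a rough sketch of that kind of measurement, assuming the image and the caption have already been embedded into the same vector space, the match can be scored with cosine similarity. The function name and the three-dimensional vectors below are made up for illustration; they are not real CLIP outputs or the CLIP API.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical embeddings, chosen by hand for the example.
caption_vec = np.array([0.9, 0.1, 0.3])      # caption embedding
matching_img = np.array([0.8, 0.2, 0.4])     # image that fits the caption
unrelated_img = np.array([-0.2, 0.9, -0.1])  # image that does not

print("matching :", cosine_similarity(caption_vec, matching_img))
print("unrelated:", cosine_similarity(caption_vec, unrelated_img))
```

A higher score means the image embedding points in nearly the same direction as the caption embedding, which is the sense in which a generated image "matches" its text.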
##### CLIP Score (Contrastive Language-Image Pretraining)

Does the image match the text?

* Training: uses a large number of image-text pairs.
* Evaluation: feed in the caption and the generated image, then compute the distance between their embedding vectors to judge how well they match.

![image](https://hackmd.io/_uploads/Sy-ae7t1kx.png)

##### Why T5

![image](https://hackmd.io/_uploads/Bk_jk7KkJl.png)

T5-XXL and CLIP perform equally well on CLIP score, but in the DrawBench evaluation T5-XXL agrees better with human perception.

##### Why scale the language model

![image](https://hackmd.io/_uploads/rJ2rxQKykl.png)

Scaling the language model has a more pronounced effect than scaling the diffusion model.

### Generation model: Diffusion Models and classifier-free guidance

![image](https://hackmd.io/_uploads/BycUgjm1Jx.png)

![image](https://hackmd.io/_uploads/HJaiejQy1l.png)

Forward process $X_{t-1}→X_t$: Gaussian noise is added (the encoder side).

![image](https://hackmd.io/_uploads/BJNnlsX1ke.png)

Reverse process $X_{t+1}→X_t$: a neural network (the denoising U-Net) removes the noise.

The model $\hat{x}_\theta$ is trained with the objective

$E_{x,c,ϵ,t}[w_t∥\hat{x}_\theta(α_tx+σ_tϵ,c)−x∥^2_2]$

This formula measures how well $\hat{x}_\theta$ performs at denoising; in other words, it is the loss function.

* $\hat{x}_\theta$: the neural network
* $w_t$: a weight that changes with the time step, balancing the learning difficulty across steps
* $x$: the image data
* $c$: the input text
* $α_t$: the scale applied to the clean image $x$ at step $t$, i.e. how much of the signal remains (it does not set how strongly the text influences the image)
* $ϵ$: the noise, $ϵ ∼ N(0, I)$
* $σ_t$: the scale applied to the noise at step $t$, i.e. how much noise and randomness is mixed in
* $z_t:=α_tx+σ_tϵ$: the noised image, which can also be seen as an intermediate state
* $z_1$: pure noise, $z_1 ∼ N(0, I)$
* $z_{t_1},...,z_{t_T}$, where $1 = t_1>...>t_T=0$: the noise decreases step by step

In other words, the loss is the squared error of denoising $z_t:=α_tx+σ_tϵ$ back to $x$ in the reverse process, i.e. the squared $L_2$ norm:

$\|v\|_2 = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2}$

$\|v\|_2^2 = v_1^2 + v_2^2 + \dots + v_n^2$

$\|\hat{x}_\theta - x\|_2^2 = (\hat{x}_1 - x_1)^2 + (\hat{x}_2 - x_2)^2 + \dots + (\hat{x}_n - x_n)^2$

:::spoiler $w_t$ is a weight that changes with the time step
* Small $t$: little noise; the data is close to the real data.
* Large $t$: a lot of noise; the data is close to pure noise.

The weight is designed to balance the learning difficulty across time steps and to control what the model focuses on at each step.

$E_{x,c,ϵ,t}$ is an expectation. Averaging over all combinations of time step $t$, noise $ϵ$, data point $x$, and condition $c$ ensures that the model works well under all of these input conditions. Taking the expectation over these random variables keeps the model from overfitting to one particular condition, so it can handle varied data and noise, which makes the generated samples more stable and of higher quality.

$\|\cdot\|$ denotes the norm, which measures the length of a vector.

$x$ is the original image, that is, the training image the model should recover.

$c$ stands for "condition" or "context": additional information that tells the model what kind of image it should generate.

$t$ is the step of the diffusion process.

$α_t$ and $σ_t$ are parameters that control the process. In $z_t = α_t x + σ_t ϵ$, $α_t$ scales the clean image and $σ_t$ scales the noise. Neither controls how strongly $c$ influences the image; at sampling time that is the role of the guidance weight $w$.

$\hat{x}_\theta$ is the neural network: a function that takes inputs such as $α_tx+σ_tϵ$ and $c$ and outputs a predicted image.

$\hat{x}_\theta(α_tx+σ_tϵ,c)−x$ is the difference between the model's output and the original image $x$. It shows how well the model is performing: the closer the output is to $x$, the better.

$\|\cdot\|^2_2$ is the squared $L_2$ norm (defined above). Squaring emphasizes large differences: a big gap between the generated image and the original image is penalized even more.

https://ithelp.ithome.com.tw/articles/10221132

The model is trained by minimizing the difference between the generated image and the original image; this is how it learns to create more accurate and realistic images from text descriptions.

$\alpha_t x + \sigma_t \epsilon$ is the forward process that adds noise, and training $\hat{x}_\theta$ learns the reverse process. The model is evaluated by the squared Euclidean distance between its prediction $\hat{x}_\theta(\alpha_t x + \sigma_t \epsilon, c)$ and the real data $x$. The expectation $E$ averages over all data samples $x$, conditions $c$, noise $\epsilon$, and times $t$. In the loss, the term $\|\hat{x}_\theta(\alpha_t x + \sigma_t \epsilon, c) - x\|_2^2$ is the squared difference between the prediction and the true value $x$, that is, the squared Euclidean distance between the two points in a high-dimensional space. Here $(x, c)$ are data-condition pairs, $t$ is drawn from a uniform distribution, $\epsilon$ is drawn from a standard normal distribution, and $\alpha_t$, $\sigma_t$, and $w_t$ are functions of $t$ that influence sample quality.
:::

##### Classifier guidance and classifier-free guidance

During diffusion training, to make the model more flexible, the text condition or label is dropped with some probability (for example 10% of the time), so that the model also learns to generate images from the noise alone.

https://arxiv.org/abs/2207.12598

#### Sampling

Sampling uses the adjusted $x$-prediction, which is computed from the guided noise prediction:

$\tilde{\epsilon}_\theta(z_t, c) = w \epsilon_\theta(z_t, c) + (1 - w) \epsilon_\theta(z_t)$

where $\epsilon_\theta(z_t, c)$ and $\epsilon_\theta(z_t)$ are the conditional and unconditional predictions, and $w$ is the guidance weight.

Guidance weight: how strongly the text steers generation, i.e. how closely the model should follow the text description.

* $w$: the guidance weight that combines the conditional and unconditional predictions
* $w=1$: uses the conditional noise prediction $\epsilon_\theta(z_t, c)$ unchanged
* $w>1$: relies more heavily on the condition $c$
* $w=0$: uses only the unconditional prediction of classifier-free diffusion
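$\epsilon_\theta(z_t)$

![image](https://hackmd.io/_uploads/By63Jhmyyg.png)

https://arxiv.org/pdf/2112.10741

:::spoiler Weighted combination of the conditional and unconditional noise predictions
$\tilde{\epsilon}_\theta(z_t, c) = w \epsilon_\theta(z_t, c) + (1 - w) \epsilon_\theta(z_t)$

* $\epsilon_\theta(z_t, c)$: the noise predicted with the condition
* $\epsilon_\theta(z_t)$: the noise predicted without the condition

$\epsilon_\theta(z_t, c) := \frac{z_t - \alpha_t \hat{x}_\theta}{\sigma_t}$
:::

##### Problem: Large guidance weight samplers

Increasing the classifier-free guidance weight makes the generated images follow the text more closely, but the images stop looking natural: they become overly bright and unrealistic.

The reason is that during training the predicted image $\hat{x}^t_0$ always lies in the same range as the training images $x$, namely $[-1,1]$. With a high guidance weight, predictions at test time fall outside this range, which is a train-test mismatch. Because the diffusion model is applied iteratively, the problem is amplified at every step.

##### How to keep pixel values within [-1,1] during the diffusion steps

The idea is to dynamically reduce the range of pixels during the recursive steps in the middle of the diffusion process.

1. Static thresholding: any prediction outside $[-1,1]$ is clipped to that range. The image may still be too bright or lack detail.
2. Dynamic thresholding: this is the same idea as rescaling the "full range of an arithmetic operation" to $[0,K]$ in a digital image processing course. A percentile of the absolute pixel values is chosen as the threshold $s$. If $s$ exceeds 1, the prediction is clipped to $[-s,s]$ and then divided by $s$: [min,max] = [-s,s] -> [-s/s,s/s] = [-1,1]

![image](https://hackmd.io/_uploads/S1Bn5XKkJg.png)

![image](https://hackmd.io/_uploads/rJtR5mtyJe.png)

![image](https://hackmd.io/_uploads/HkG_omFy1e.png)

### Generation model (base model architecture)

Text-to-image diffusion model

The base model adds cross-attention layers that attend to the text embeddings, so the meaning of the text is taken into account in every generation iteration. Given the time step and the text-embedding vectors as input, it generates a 64x64 image.

![image](https://hackmd.io/_uploads/r1fTNiQ1Jl.png)

### Decoder: Image Super-Resolution Diffusion Model

The improved Efficient U-Net removes the self-attention layers to make generation more efficient but keeps the cross-attention layers that carry the text information. Two Efficient U-Nets then raise the resolution in stages: Small-to-Medium (STM) produces a 256x256 image, and Medium-to-Large (MTL) produces the larger 1024x1024 image.

![image](https://hackmd.io/_uploads/SyMdqqmkJl.png)

#### Robust Cascaded Diffusion Models

## Evaluating Text-to-Image Models

### Evaluation methods

#### CLIP Score

Does the image match the text?

#### FID

How good is the quality of the generated images? FID measures the gap between generated images and real images. It takes the distributions of a set of real images and a set of generated images, assumes each is Gaussian, and computes the distance between them.

![image](https://hackmd.io/_uploads/BkynUbgRC.png)

#### DrawBench

Are the images acceptable to humans? DrawBench is an evaluation method based on human perception, with 11 categories and 201 text prompts. For each prompt, two models each generate a set of 8 images for comparison, and raters are asked two questions:

1. Which set of images is of higher quality?
2. Which set of images better represents the text caption : {Text Caption}?

Answers are restricted to three choices:

1. I prefer set A.
2. I am indifferent.
3. I prefer set B.

![image](https://hackmd.io/_uploads/BJyWiItykg.png)

### Results

#### Comparison with other models

![image](https://hackmd.io/_uploads/r1oz6ItJ1x.png)

![image](https://hackmd.io/_uploads/rJCETLKJyl.png)

#### Which component affects the model most

![image](https://hackmd.io/_uploads/SkzUpLY1yg.png)

1. Scaling up the text encoder's language model improves image-text alignment more than scaling up the generation model.
2. Using dynamic thresholding during the diffusion process helps improve image-text alignment.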
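
The guidance formula and the two thresholding schemes from the Sampling section can be sketched together in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the 99.5th percentile, the values of $α_t$ and $σ_t$, and the toy "model outputs" are all hypothetical choices made for the example.

```python
import numpy as np


def guided_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: w * eps_cond + (1 - w) * eps_uncond."""
    return w * eps_cond + (1.0 - w) * eps_uncond


def x_from_eps(z_t, eps, alpha_t, sigma_t):
    """Invert z_t = alpha_t * x + sigma_t * eps to get the x-prediction."""
    return (z_t - sigma_t * eps) / alpha_t


def static_threshold(x0):
    """Clip the x-prediction to [-1, 1]."""
    return np.clip(x0, -1.0, 1.0)


def dynamic_threshold(x0, percentile=99.5):
    """Clip to [-s, s] and divide by s, where s is a percentile of |x0|.

    The percentile value is an assumption made for this sketch.
    """
    s = float(np.percentile(np.abs(x0), percentile))
    s = max(s, 1.0)  # only rescale when the prediction leaves [-1, 1]
    return np.clip(x0, -s, s) / s


# Toy setup: a random "image" in [-1, 1] and hand-made noise predictions.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(8, 8))
eps = rng.standard_normal((8, 8))
alpha_t, sigma_t = 0.8, 0.6                 # hypothetical schedule values
z_t = alpha_t * x + sigma_t * eps

eps_cond = eps                              # pretend the conditional prediction is exact
eps_uncond = eps + 0.5 * rng.standard_normal((8, 8))  # a less accurate unconditional one

for w in (1.0, 5.0):
    x0 = x_from_eps(z_t, guided_eps(eps_cond, eps_uncond, w), alpha_t, sigma_t)
    saturated = np.mean(np.abs(static_threshold(x0)) == 1.0)
    print(f"w={w}: max|x0|={np.abs(x0).max():.2f}, "
          f"pixels saturated by static clipping={saturated:.0%}, "
          f"max after dynamic thresholding={np.abs(dynamic_threshold(x0)).max():.2f}")
```

With $w=1$ the guided prediction equals the conditional one and the recovered image stays in range. With a large $w$ the prediction overshoots $[-1,1]$: static clipping pins every out-of-range pixel at $\pm 1$, while dynamic thresholding clips only the most extreme values and rescales the rest back into range.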