Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer

---
tags: Diffusion model
---

[paper](https://arxiv.org/pdf/2303.08622.pdf) [source code](https://github.com/ZeConloss/ZeCon)

## Introduction

Conditional diffusion model: requires a paired dataset with matched source and target styles

$\downarrow$

Unconditional diffusion model: the reverse process from noise back to an image is stochastic (?), so the image content ends up inconsistent

$\downarrow$

- DDIB: two domains?
- DiffusionCLIP: does not need a large dataset, but requires fine-tuning
- DiffuseIT: preserves the image content, but sacrifices the style transfer

$\downarrow$

Zero-shot Contrastive loss: ?

## Related works

Various other methods (skipped for now)

### Diffusion models for image style transfer

noise: $\epsilon$

$\alpha_t=1-\beta_t$

$\overline{\alpha_t}=\prod^t_{i=0}{\alpha_i}$

#### DDPM

##### Image at time = t

$x_t=\sqrt{\overline{\alpha_t}}{x_0}+\sqrt{1-\overline{\alpha_t}}\epsilon$

##### Reverse sampling process

![reverse + ?](https://hackmd.io/_uploads/BJElO_krn.png)

#### DDIM

The drawback of DDPM is that some content disappears during style transfer, whereas DDIM can preserve that content (???)

![sampling process](https://hackmd.io/_uploads/BJCC0zJBn.png)

![denoised image](https://hackmd.io/_uploads/S1HVRGySn.png)

## Method of this paper

### loss function

![loss function](https://hackmd.io/_uploads/SJXstGHr2.png)

$\ell_{clip}$ denotes the style loss

### Content preservation loss

During the transfer, only the style should change; the image content should stay the same (in the example below, the straw hat the boy is wearing should still be there after style transfer, not disappear)

![style transfer example 1](https://hackmd.io/_uploads/ByVexsrrn.png)

![Zecon graph explanation](https://hackmd.io/_uploads/HkQ5jcrrn.png)

![](https://hackmd.io/_uploads/Sy6I2cHSh.png)

Randomly select one pixel and compute the cross entropy against the original image

#### Final content function

Computed between pixels of the original image and the reversed image:

1. cross entropy
2. mean square error on features extracted from VGG (subtract, square, then average)
3. L2 regularization

### Style loss

Measures the difference between the reversed image and the style we want to generate (the smaller the better)

1. global: the cosine distance between the output image and the target style, computed in CLIP's embedding space
2. dir:

## Comparison with other methods

### GAN based model

All of the methods below have a Colab

1. [StyleCLIP](https://github.com/orpatashnik/StyleCLIP)
2. [StyleGAN-NADA](https://github.com/rinongal/StyleGAN-nada)
3. [VQGAN-CLIP](https://colab.research.google.com/drive/1ZAus_gn2RhTZWzOWUpPERNC0Q8OhZRTZ#scrollTo=EXMSuW2EQWsd)
4. [CLIPstyler](https://github.com/cyclomon/CLIPstyler)

![image style transfer result - GAN based model](https://github.com/Argentum11/DE0_mcu/assets/92793837/af4a2acc-8fc9-42ab-85e7-28384a036cb1)

|style/method|StyleCLIP|StyleGAN-NADA|VQGAN-CLIP|CLIPstyler|
|-|-|-|-|-|
|Pixar|objects other than the face disappear (hat, hands)|same as StyleCLIP|failed|failed|
|Pop art|too similar to the original| |-|-|
|Gogh|too similar to the original|-|-|-|
|Ukiyo-e|too similar to the original|-|failed|failed|

![image style transfer score - GAN](https://hackmd.io/_uploads/HkkW7ZO_3.png)

The Face ID score is computed with [arcface](https://insightface.ai/arcface)

|method|drawback|
|-|-|
|StyleCLIP|preserves image semantics well (very close to the original), so the style transfer is correspondingly weaker|
|StyleGAN-NADA & CLIPstyler|the style transfer is overdone|

### diffusion model

1. [DDIB](https://github.com/suxuann/ddib)
2. [DiffusionCLIP](https://github.com/gwang-kim/DiffusionCLIP)
3. [DiffuseIT](https://github.com/cyclomon/DiffuseIT)

![image style transfer result - diffusion model](https://hackmd.io/_uploads/r1y7uWu_h.png)

#### user study

![image style transfer score - diffusion model](https://hackmd.io/_uploads/rJjE_WOu2.png)

//TODO

|method|drawback|
|-|-|
|DDIB||
|DiffusionCLIP||
|DiffuseIT||
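
To make two of the formulas in these notes concrete, here is a minimal sketch of the DDPM forward step $x_t=\sqrt{\overline{\alpha_t}}x_0+\sqrt{1-\overline{\alpha_t}}\epsilon$ and of the cosine distance used by the global style term. It is not the ZeCon implementation. The linear $\beta$ schedule, the step count `T`, the flat-list "image", and the three-dimensional stand-in embeddings are all assumptions for illustration; the real method would take its embeddings from the CLIP encoders.

```python
import math
import random


def make_alpha_bar(betas):
    """Return alpha_bar[t] = prod_{i=0..t} (1 - beta_i) for every step t."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def q_sample(x0, t, alpha_bar, eps):
    """Noise x0 to step t: sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    return [signal * x + noise * e for x, e in zip(x0, eps)]


def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Assumption: a linear beta schedule over T steps (not taken from the notes).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar = make_alpha_bar(betas)

# Stand-in for a flattened image with pixel values in [-1, 1].
rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(16)]
eps = [rng.gauss(0.0, 1.0) for _ in range(16)]
x_t = q_sample(x0, 500, alpha_bar, eps)

# Stand-ins for CLIP embeddings of the output image and the target style text.
image_emb = [0.2, 0.9, 0.1]
text_emb = [0.1, 1.0, 0.0]
style_gap = cosine_distance(image_emb, text_emb)
```

Because `alpha_bar` shrinks as `t` grows, later steps keep less of `x0` and more of the noise. `cosine_distance` is 0 for vectors pointing the same way and 1 for orthogonal ones, which matches "the smaller the better" for the style term.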