Prompt Engineering for Vision Models

### [AI / ML Study Notes Index](https://hackmd.io/@YungHuiHsu/BySsb5dfp)
#### [Deeplearning.ai GenAI/LLM Course Notes](https://learn.deeplearning.ai/)
##### AI Agents
- [AI Agents in LangGraph](https://hackmd.io/@YungHuiHsu/BJTKpkEHC)
##### RAG
- [Preprocessing Unstructured Data for LLM Applications](https://hackmd.io/@YungHuiHsu/BJDAbgpgR)
- [Building and Evaluating Advanced RAG](https://hackmd.io/@YungHuiHsu/rkqGpCDca)
- [[GenAI][RAG] Multi-Modal Retrieval-Augmented Generation and Evaluation](https://hackmd.io/@YungHuiHsu/B1LJcOlfA)
- [[GenAI][RAG] Building Multimodal Search and RAG](https://hackmd.io/8jXegjZgSECNZqcv9UdwM)
##### GenAI
- [Large Language Models with Semantic Search](https://hackmd.io/@YungHuiHsu/rku-vjhZT)
- [LangChain for LLM Application Development](https://hackmd.io/1r4pzdfFRwOIRrhtF9iFKQ)
- [Finetuning Large Language Models](https://hackmd.io/@YungHuiHsu/HJ6AT8XG6)
- [Prompt Engineering for Vision Models](https://hackmd.io/@YungHuiHsu/rkqs588d0)
#### [[Multimodal] CVPR 2023: Multimodal Foundation Models: From Specialists to General-Purpose Assistants<br>A review of multimodal foundation model research](https://hackmd.io/@YungHuiHsu/HkOjXPg46)

---

# [Prompt Engineering for Vision Models](https://learn.deeplearning.ai/courses/prompt-engineering-for-vision-models/lesson/1/introduction)![image](https://hackmd.io/_uploads/SJ2_RI8dA.png =100x) ![image](https://hackmd.io/_uploads/HyOsaB8_0.png =50x)

## Overview
:::success
1. Introduction
    - What is visual prompt engineering?
2. Image Segmentation
    - Use the Segment Anything Model (SAM) to segment images using pixel coordinates and bounding boxes as prompts.
3. Object Detection
    - Use OWL-ViT + SAM to detect and segment existing objects within images using natural language.
4. Image Generation
    - Use natural language to generate areas of an image that didn’t previously exist.
5. Fine-tuning
    - Learn how to fine-tune an image generation model on your own custom images.
:::

### What is a Prompt?
:::info
For a model, any form of input data can be treated as a prompt.
> In fact, theoretically, any kind of data can be a prompt, including text and images, but also audio and video.

A prompt (text, images, audio, video, and so on) can be viewed as a condition that restricts the distribution a generative model samples from, and thereby controls what it generates. Given a prompt, the model samples from the conditional distribution $p(x|c)$ rather than the prior distribution $p(x)$, so the output matches the features or content the prompt describes. The generative model here can be a language model or a diffusion model.
> **A prompt, is simply an input that guides the distribution of the output, and visual inputs do just that for diffusion models.**

- In probabilistic terms, the task of a generative model (such as a diffusion model) is to draw new samples $x$ from a complex data distribution $p(x)$
- A **prompt** supplies a condition $c$ that steers the model toward samples consistent with $c$, that is, samples drawn from $p(x|c)$. The prompt therefore changes the distribution used during generation so that the samples satisfy the constraints it expresses
:::

![image](https://hackmd.io/_uploads/SyBNHvLOR.png =600x)

Embedding
![image](https://hackmd.io/_uploads/H121iD8_0.png)

Prompt vs. Input
![image](https://hackmd.io/_uploads/Hys1gSwu0.png)

Prompt Engineering Workflows
![image](https://hackmd.io/_uploads/HJpBgrv_0.png =500x)
![image](https://hackmd.io/_uploads/SJ1BGrvO0.png =500x)

---

## Image Segmentation
:::success
* Segment Anything Model
* Prompting with a set of pixel coordinates
* Prompting with multiple sets of coordinates
* Prompting with bounding boxes
* Using a positive prompt along with a negative prompt
:::

- Types of Image Segmentation
    ![image](https://hackmd.io/_uploads/HJ3N5rPOR.png =600x)
    ![image](https://hackmd.io/_uploads/HyPiorPd0.png =600x)
    * Image Segmentation: works at the pixel level, labeling pixels so that pixels sharing characteristics receive the same label
        > assign a label to every pixel in an image
        > pixels with the same label share certain characteristics
    * Semantic Segmentation
        * Class label
        * Every pixel is assigned to a semantic class, without distinguishing individual instances. All pixels of the same class share the same label
    * Instance Segmentation
        * Instance label
        * Each object pixel is assigned not only to a class but also to a specific instance of that class, so every instance has its own unique label
    * Panoptic Segmentation
        * Class label + instance label
        * Every pixel is assigned both a semantic class and a specific instance. This combines semantic and instance segmentation and gives complete label information for each pixel

### Segment Anything Model
For more details, see [From Representation to Interface: 
The Evolution of Foundation for Vision Understanding - CH2 Visual Understanding (notes)](https://hackmd.io/@YungHuiHsu/HyjAklf4T)

![image](https://hackmd.io/_uploads/rkHxArPuR.png =400x)
![image](https://hackmd.io/_uploads/ByNC0BPOR.png =800x)

- Fast SAM
    - CNN based
    - Trained on only 2% of the original dataset published by the SAM authors
    - About 50x faster than SAM (measured with a 32x32 grid of point prompts)

![image](https://hackmd.io/_uploads/By37wcvuC.png)

It supports text prompts, but can it output bounding-box coordinates?
- [Fast Segment Anything Model (FastSAM)](https://docs.ultralytics.com/models/fast-sam/#val-usage)
- [FastSAM official Usage](https://github.com/CASIA-IVA-Lab/FastSAM)

```python=
# Import paths as in the ultralytics docs at the time of writing; newer releases may differ
from ultralytics import FastSAM
from ultralytics.models.fastsam import FastSAMPrompt

# Placeholder: path to the input image
source = "path/to/image.jpg"

# Create a FastSAM model
model = FastSAM("FastSAM-s.pt")  # or FastSAM-x.pt

# Run inference on an image
everything_results = model(source, device="cpu", retina_masks=True, imgsz=1024, conf=0.4, iou=0.9)

# Prepare a Prompt Process object
prompt_process = FastSAMPrompt(source, everything_results, device="cpu")

# Everything prompt
results = prompt_process.everything_prompt()

# Text prompt
results = prompt_process.text_prompt(text="a photo of a dog")

# Point prompt
# points default [[0,0]] [[x1,y1],[x2,y2]]
# point_label default [0] [1,0] 0:background, 1:foreground
results = prompt_process.point_prompt(points=[[200, 200]], pointlabel=[1])

# Save the annotated image to the output directory
prompt_process.plot(annotations=results, output="./")
```

This course was based off a set of two blog articles from Comet. 
Explore them here for more on how to use newer versions of Stable Diffusion in this pipeline, additional tricks to improve your inpainting results, and a breakdown of the pipeline architecture:
* [SAM + Stable Diffusion for Text-to-Image Inpainting](https://www.comet.com/site/blog/sam-stable-diffusion-for-text-to-image-inpainting/?utm_source=dlai&utm_medium=course&utm_campaign=prompt_engineering_for_vision_models&utm_content=dlai_L2)
    ![image](https://hackmd.io/_uploads/HkiLy5wOA.png)
* [Image Inpainting for SDXL 1.0 Base Model + Refiner](https://www.comet.com/site/blog/image-inpainting-for-sdxl-1-0-base-refiner/?utm_source=dlai&utm_medium=course&utm_campaign=prompt_engineering_for_vision_models&utm_content=dlai_L2)

Load FastSAM through the YOLO wrapper packaged by [ultralytics](https://docs.ultralytics.com/models/sam/):

```python=
from ultralytics import YOLO

model = YOLO('./FastSAM.pt')
```

- Prompting with a set of pixel coordinates
    - Given a single point, the model automatically detects the region to segment
    - `input_points = [[350, 450]]`

![image](https://hackmd.io/_uploads/SyDSe9DuA.png =300x)![image](https://hackmd.io/_uploads/rkCBg9Pu0.png =300x)

```python=
from utils import format_results, point_prompt  # helper module from the course notebook

# Define the coordinates for the point in the image
# [x_axis, y_axis]
input_points = [[350, 450]]
input_labels = [1]  # positive point

# Run the model (resized_image and device come from earlier notebook cells)
results = model(resized_image, device=device, retina_masks=True)
results = format_results(results[0], 0)

# Generate the masks
masks, _ = point_prompt(results, input_points, input_labels)
```

- Prompting with multiple sets of coordinates
    Given several points, the model automatically detects several regions to segment
- Prompting with bounding boxes
    Given a bounding box, the model automatically detects the region to segment

![image](https://hackmd.io/_uploads/BynjBcw_A.png =300x)![image](https://hackmd.io/_uploads/SynoH9vOC.png =300x)

```python=
from utils import box_prompt  # helper module from the course notebook

# Set the bounding box coordinates
# [xmin, ymin, xmax, ymax]
input_boxes = [530, 180, 780, 600]

results = model(resized_image, device=device, retina_masks=True)

# Generate the masks
masks = results[0].masks.data

# Convert to True/False boolean mask
masks = masks > 0

# Keep the mask that best matches the box
masks, _ = box_prompt(masks, input_boxes)
```

- Using a positive prompt along with a negative prompt
    Given points with positive and negative labels, the model segments the regions to include and leaves out the regions to exclude

![image](https://hackmd.io/_uploads/S1UHr9POC.png =300x)![image](https://hackmd.io/_uploads/rymISqPOA.png =300x)

```python=
# Define the coordinates for the regions to be masked
# [x_axis, y_axis]
input_points = [[350, 450], [400, 300]]
input_labels = [1, 0]  # positive prompt, negative prompt

# Run the model
results = model(resized_image, device=device, retina_masks=True)
results = format_results(results[0], 0)

# Generate the masks
masks, _ = point_prompt(results, input_points, input_labels)
```

:::info
I need to generate bounding-box coordinates from a text prompt (text2Bbox)

OWL-ViT Paper: https://arxiv.org/abs/2205.06230
[Described Object Detection on Description Detection Dataset](https://paperswithcode.com/sota/described-object-detection-on-description)
[Open Vocabulary Object Detection on LVIS v1.0](https://paperswithcode.com/sota/open-vocabulary-object-detection-on-lvis-v1-0)
:::

:::warning
FastSAM performed poorly when tested in the [playground](https://huggingface.co/spaces/An-619/FastSAM)
- Without an LLM behind it, it can only detect objects from simple text descriptions
- A VLM/MMLM that understands complex (question-style) semantics is needed as the object detector, so that it can output absolute coordinates
- Large images are affected by how gpt-4o tiles images; ideally it should detect without tiling, and the coordinates should then be scaled back to the original size
:::

---

## Object Detection
:::success
* How to use text, instead of coordinate prompts, to generate masks
* Creating a pipeline of models where the output of one model is fed into the second model
* What is zero-shot object detection?
:::

[OWL-ViT](https://arxiv.org/abs/2205.06230)
![image](https://hackmd.io/_uploads/HyF_EiPd0.png)

- Transfer to Open-Vocabulary Detection
    - The pre-trained image and text encoders are applied to the open-vocabulary object detection task
    - **Text Encoder**
        - Embeds the query strings (such as "giraffe", "tree", "car") into query embeddings used for object classification
    - **Vision Encoder**
        Splits the image into patches that are treated as tokens. The encoder's per-token outputs are kept, and lightweight classification and localization heads are attached to each output token, producing:
        - **Object image embeddings**
            - A linear projection is used to predict the class
            - The classification head outputs logits (per-class scores) over a label space defined per image by its queries, rather than over a single predefined, fixed global label space (one shared class list)
            - In open-vocabulary detection, the query label space is determined by the queries given for each image. Every image can have different queries, and the model outputs logits for those queries
            - For example, for an image queried with "giraffe", "tree", and "car", the model outputs logits only for these queries
        - **Object box embeddings**
            - An MLP (multi-layer perceptron) serves as the localizer that predicts the box coordinates (bboxes)
    - Note that the output tokens (one per image patch) are not aggregated: token pooling is removed and each patch keeps its own representation, which preserves more of the fine-grained features
        - Token pooling, for example:
            - Average pooling: take the mean of all token vectors
            - Max pooling: take the maximum along each vector dimension
            - CLS pooling: use the dedicated [CLS] token as the output representation

:::info
"**Open-Vocabulary Detection**" is a technique that uses pre-trained models to localize and classify objects. Because the pre-trained models carry knowledge about objects (in both the visual and text modalities), they can detect labels that were never defined in the detection training set
:::

![image](https://hackmd.io/_uploads/SkRw8hD_C.png)
> For text queries (not shown), the model detects the correct species only in the top example ("swallowtail butterfly") and fails in the bottom example ("luna moth").
> This suggests that image-conditioned detection can be more effective than text queries in some cases, especially when specific visual features matter.

Use Comet to log the experiments
![image](https://hackmd.io/_uploads/Sykzu3DdA.png =700x)

---

## Image Generation
:::success
Basics of Diffusion
- Diffusion models transform an image of random noise into a target image
- A U-Net is used to gradually denoise the image across many timesteps
- A text encoder is used to generate prompt embeddings, which guide the denoising
:::

![image](https://hackmd.io/_uploads/SJPpwnP_R.png)

### Image Inpainting (replacing regions of an image)
![image](https://hackmd.io/_uploads/Skdls2Dd0.png =300x)

- [diffusers](https://pypi.org/project/diffusers/)
    > Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. 

![image](https://hackmd.io/_uploads/H1lBhnPuC.png =100x)

```python=
import torch
from diffusers import StableDiffusionInpaintPipeline

# Use the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"

sd_pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "./models/stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.bfloat16,
    low_cpu_mem_usage=not torch.cuda.is_available(),
).to(device)

# Fix the seed so results are reproducible
seed = 66733
generator = torch.Generator(device).manual_seed(seed)
```

- **Inference Steps**: the number of inference steps sets how gradual the denoising process is, which affects image quality
    - More inference steps means a more gradual denoising process, leading to higher quality images
    - Too many steps is a waste of compute, and can lead to an “overly processed” image
- **Guidance Scale**: controls how closely the model follows the prompt when generating the image
    - Numeric value that determines how closely the model should follow the prompt
    - A lower guidance scale tends to improve image quality, but may reduce prompt fidelity
    - Higher guidance scale means higher prompt fidelity, but potentially lower image quality

> An oil painting of a tabby cat dressed like a character in The Matrix

![image](https://hackmd.io/_uploads/HkF5nnDdR.png)

- Negative Prompt
    - A parameter that specifies content or features the model should avoid when generating the image

---

## Fine-Tuning
- Fine-tuning challenges
    - Robustness of Data: is the fine-tuning training data enough to allow the model to generalize?
        - Data volume and diversity: fine-tuning datasets are usually smaller and more specialized than pre-training datasets. If the dataset is not large or diverse enough, the model may fail to learn sufficiently broad features and generalize poorly.
        - Bias and overfitting: if the fine-tuning dataset is biased, the model may overfit to its particular characteristics and perform poorly on other data
    - Language Drift: will fine-tuning degrade the model's performance on other tasks?
        - Model degradation: fine-tuning on a specific task can degrade performance on tasks that were not fine-tuned, because the model shifts toward the patterns and language of the new dataset and loses part of the broad knowledge learned during pre-training
        - Balancing multiple tasks: keeping performance on related tasks while fine-tuning on one specific task is a major challenge. The fine-tuning strategy has to be designed to avoid language drift

![image](https://hackmd.io/_uploads/ByqHTJu_A.png =400x)

### DreamBooth
![image](https://hackmd.io/_uploads/BknAAJudC.png =600x)

- [DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation](https://dreambooth.github.io/)
    * Semantic Prior Knowledge: the model's existing understanding of the world. 
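        * The model knows the concept of "dog": when it meets the word "dog", it can use its existing knowledge to generate dog-related images
    * Context Clues: uses context clues when encountering a new word.
- DreamBooth basics
    1. **Fine-tune the model to recognize a rarely used text token**:
        - The model is fine-tuned to recognize a rarely used text token, such as the letter "V" in square brackets, `"[V]"`, which acts as the identifier (placeholder)
        - In text space, the model learns to associate this identifier with images of one specific subject (for example, a particular person) in image space.
        - As a result, a prompt containing `"[V]"` makes the model generate images with that subject's features.
    2. **Pair the rare token with a common token**:
        - The identifier `"[V]"` is paired with a common (class) token, such as `"man"`.
    3. **Create a modified class**:
        - This creates a modified class: `"[V] man"`
        - Prompt: `"Photo of a [V] man standing"`.

    ![image](https://hackmd.io/_uploads/rJMWjeudC.png =400x)

- **Prior Preservation Loss**
    - Regularization: penalizes the model for drifting from its original knowledge during fine-tuning
- Create a dataset of image + prompt
    ![image](https://hackmd.io/_uploads/SJcU2eOdC.png =400x)
    BLIP: image captioning: auto-generate "instance prompts"
- Class Prompt
    ![image](https://hackmd.io/_uploads/B1Dk_Zud0.png =400x)
    - In the instance prompt, the unique identifier [V] stands for the subject the model should learn to recognize, which is what makes the model personalized
    - In the class prompt, many images of the same class are used to help the model keep its ability to recognize that class's features
- Prior Preservation Loss
    ![image](https://hackmd.io/_uploads/H15itZddR.png =400x)
    - The instance prompt corresponds to the standard reconstruction term in the formula. It ensures the model can accurately reconstruct images of the specific subject. This is the Instance Loss
    - The class prompt corresponds to the prior preservation term in the formula. It ensures the model keeps its understanding of the class and still generates diverse images. This is the Prior Loss

---

## Supplementary notes from the original paper, DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
![image](https://hackmd.io/_uploads/HyCddlOOR.png =800x)
![image](https://hackmd.io/_uploads/BJrmyld_C.png =800x)

### Personalization of Text-to-Image Models
#### Designing Prompts for Few-Shot Personalization
- Goal: "implant" a new (unique identifier, subject) pair into the diffusion model's "dictionary".
    > “implant” a new (unique identifier, subject) pair into the diffusion model’s “dictionary”
- Challenge: to avoid writing detailed image descriptions for a given image set, the paper takes a simpler approach and labels all input images of the subject "a [identifier] [class noun]", where [identifier] is a unique identifier linked to the subject and [class noun] is a coarse class descriptor of the subject (for example cat, dog, watch). The class descriptor can be provided by the user or obtained with a classifier.
    > In order to bypass the overhead of writing detailed image descriptions for a given image set we opt for a simpler approach and label all input images 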
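of the subject “a [identifier] [class noun]”, where [identifier] is a unique identifier linked to the subject and [class noun] is a coarse class descriptor of the subject (e.g. cat, dog, watch, etc.).
    > The class descriptor can be provided by the user or obtained using a classifier.
- Method
    1. **Identifier and class noun: [identifier] [class noun]**:
        - All input images of the subject are labeled "a [identifier] [class noun]".
        - [identifier] is the unique identifier, and [class noun] is the coarse class descriptor of the subject.
    2. **Role of the class descriptor ([class noun])**:
        - The class descriptor ties the prior knowledge of the class to the unique subject
        - Using a wrong class descriptor, or none at all, increases training time and language drift while decreasing performance
        > We use a class descriptor in the sentence in order to tether the prior of the class to our unique subject and find that using a wrong class descriptor, or no class descriptor increases training time and language drift while decreasing performance.
- Core idea: leverage the model's prior for the specific class and entangle it with the embedding of the subject's unique identifier, so that the visual prior can be used to generate new poses and articulations of the subject in different contexts
    > In essence, we seek to leverage the model’s prior of the specific class and entangle it with the embedding of our subject’s unique identifier so we can leverage the visual prior to generate new poses and articulations of the subject in different contexts.
- [identifier]: the identifier works like a placeholder, but it must not carry a strong prior meaning of its own, otherwise the model may misinterpret it
    - [identifier] as a placeholder
        - Placeholder role: in the prompt, [identifier] stands in for a specific subject or object. When the model sees the identifier, it knows it should generate content tied to that identifier.
        - For example, if [identifier] is "[V]", the prompt "a [V] dog" denotes one specific dog, so the model can recognize and generate images of that particular dog
    - Avoiding a strong semantic prior:
        - If a common English word (such as "unique" or "special") is chosen as [identifier], it already has its own meaning. The model must then separate the original meaning from the new one, which makes learning harder and can cause misinterpretation
        - Solution: choose a rare identifier with no specific prior meaning, which lowers the risk of misinterpretation. :pencil2: In the paper, the token `[V]` is used as the [identifier]: it stands in text space for the subject the model should learn to recognize, and it is tied to that subject's image features

### Class-specific Prior Preservation Loss
- Goal: the class-specific prior preservation loss addresses two problems:
    1. Language drift: while being fine-tuned for a specific task, the model gradually loses its syntactic and semantic knowledge of the language.
    2. Reduced output diversity: when fine-tuned on a small set of images, the model's outputs can lose diversity and collapse to a few viewpoints and poses.
- Method: the paper proposes an autogenous class-specific prior preservation loss that encourages diversity and counters language drift. In essence, the model is supervised with samples it generated itself, so that it retains the prior and keeps its class diversity and knowledge during fine-tuning
- Steps: to keep the class-specific prior in a diffusion model and encourage diverse outputs, sample data is generated and used to supervise the model, so that it keeps its understanding of the class and still generates diverse images during fine-tuning
    1. **Generate the prior data $x_{\text{pr}}$**: $x_{\text{pr}} = \hat{x}(z_{t_1}, c_{\text{pr}})$, where $z_{t_1} \sim N(0, I)$ and $c_{\text{pr}}$ is the conditioning vector
        - The data is generated with the frozen pre-trained diffusion model, starting from random initial noise
        - Steps:
            1. **Random initial noise $z_{t_1}$**:
                - Sampled from a standard normal distribution, $z_{t_1} \sim N(0, I)$
                - This provides the random starting noise for image generation
            2. **Conditioning vector $c_{\text{pr}}$**:
                - Uses the text prompt "a [class noun]"
                - $c_{\text{pr}} := \Gamma(f(\text{"a [class noun]"}))$, where $f$ is the tokenizer and $\Gamma$ is the text encoder that maps the tokens to the conditioning vector
                - The resulting conditioning vector carries the semantics of the class and tells the model to generate images belonging to that class
    2. **Loss function design**:
        $$
        \mathbb{E}_{\mathbf{x}, \mathbf{c}, \epsilon, \epsilon', t} \left[w_t \left\|\hat{\mathbf{x}}_\theta(\alpha_t \mathbf{x} + \sigma_t \epsilon, \mathbf{c}) - \mathbf{x}\right\|_2^2 + \lambda w_t \left\|\hat{\mathbf{x}}_\theta(\alpha_t \mathbf{x}_{\text{pr}} + \sigma_t \epsilon', \mathbf{c}_{\text{pr}}) - \mathbf{x}_{\text{pr}}\right\|_2^2\right],
        $$
        - Main symbols
            - $\mathbb{E}$: expectation, that is, the loss averaged over training images, conditions, noise samples, and timesteps.
            - $\mathbf{x}$: the original image.
            - $\mathbf{c}$: the conditioning vector (for example, from the text prompt).
            - $\epsilon, \epsilon'$: random noise sampled from a standard normal distribution.
            - $t$: the diffusion timestep.
            - $\alpha_t, \sigma_t$: coefficients for timestep $t$ that control the ratio of original image to noise.
            - $\hat{\mathbf{x}}_\theta$: the model with parameters $\theta$, which outputs the predicted (denoised) image.
            - $\lambda$: hyperparameter that controls the relative weight of the prior preservation term.
            - $w_t$: weight for timestep $t$, applied to the loss terms.
        - The two parts of the formula
            1. **Part one: standard reconstruction loss**:
                $$
                w_t \left\|\hat{\mathbf{x}}_\theta(\alpha_t \mathbf{x} + \sigma_t \epsilon, \mathbf{c}) - \mathbf{x}\right\|_2^2
                $$
                - **Explanation**: the squared L2 difference between the image predicted by $\hat{\mathbf{x}}_\theta$ and the original image $\mathbf{x}$.
                - **Purpose**: ensures the model can accurately reconstruct the original image.
            2. **Part two: prior preservation loss**:
                $$
                \lambda w_t \left\|\hat{\mathbf{x}}_\theta(\alpha_t \mathbf{x}_{\text{pr}} + \sigma_t \epsilon', \mathbf{c}_{\text{pr}}) - \mathbf{x}_{\text{pr}}\right\|_2^2
                $$
                - **Explanation**: the squared L2 difference between the image predicted by $\hat{\mathbf{x}}_\theta$ and the prior image $\mathbf{x}_{\text{pr}}$ generated by the pre-trained model.
                - **Purpose**: ensures the model keeps its prior knowledge of the class during fine-tuning and still generates diverse images.
        - Combined effect
            - **Reconstruction accuracy**: the first term minimizes the difference between the predicted image and the original image, which helps preserve image detail and quality
            - **Prior preservation**: the second term keeps the prior knowledge of the pre-trained model and the diversity of its outputs, which helps avoid language drift and overfitting

![image](https://hackmd.io/_uploads/BJN1S-O_A.png)
> 1. **Generating the prior data (left)**:
>     - **Random initial noise**: sample random initial noise from a standard normal distribution.
>     - **Text prompt**: use the text prompt "A dog"
>     - **Generated images**: generate images with the frozen pre-trained diffusion model and an ancestral sampler. These images are diverse examples of the class (for example, dogs)
> 2. **Fine-tuning the model (right)**:
>     - **Random noise**: sample random noise from a standard normal distribution (used to noise the prior images).
>     - **Text prompt**: use the same text prompt "A dog"
>     - **Generated images**: the fine-tuned model produces images that should retain the diversity and features of the previously generated ones
> 3. **Class-specific prior preservation loss**:
>     - **Compute the loss**: compare the fine-tuned model's outputs with the previously generated prior data to compute the class-specific prior preservation loss
>     - **Feedback update**: use this loss to update the model parameters, encouraging the model to stay diverse and keep its understanding of the class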
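
The two-term loss described above can be sketched in a few lines of framework-free Python. This is only an illustrative sketch under stated assumptions, not the paper's or the course's implementation: `squared_l2`, `noised`, `prior_preservation_loss`, and the stand-in `identity_denoiser` are hypothetical names, images are flat lists of floats, and a single timestep with one noise sample per term stands in for the expectation.

```python
def squared_l2(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def noised(x, eps, alpha_t, sigma_t):
    """Noised input alpha_t * x + sigma_t * eps, element by element."""
    return [alpha_t * xi + sigma_t * ei for xi, ei in zip(x, eps)]


def prior_preservation_loss(model, x, c, x_pr, c_pr, eps, eps_pr,
                            alpha_t, sigma_t, w_t=1.0, lam=1.0):
    """Single-sample estimate of the two-term loss.

    `model(z, cond)` plays the role of x_hat_theta: it receives a noised
    image and a condition, and returns a predicted clean image.
    Returns (total, instance_term, prior_term).
    """
    # Term 1: reconstruct the subject image from its noised version
    instance_term = w_t * squared_l2(
        model(noised(x, eps, alpha_t, sigma_t), c), x)
    # Term 2: reconstruct the class-prior image produced by the frozen model
    prior_term = lam * w_t * squared_l2(
        model(noised(x_pr, eps_pr, alpha_t, sigma_t), c_pr), x_pr)
    return instance_term + prior_term, instance_term, prior_term


def identity_denoiser(z, cond):
    """Hypothetical stand-in for the network: returns its input unchanged."""
    return list(z)


# Toy data: a 2-pixel "subject image" and a 2-pixel "prior image"
x, x_pr = [0.5, -0.5], [1.0, 2.0]
eps, eps_pr = [1.0, 0.0], [0.0, 2.0]

total, instance_term, prior_term = prior_preservation_loss(
    identity_denoiser, x, "a [V] dog", x_pr, "a dog", eps, eps_pr,
    alpha_t=1.0, sigma_t=0.5, w_t=1.0, lam=0.5)
print(total, instance_term, prior_term)  # 0.75 0.25 0.5
```

In a real training loop the two terms would be computed on batches of tensors, with the prior images generated once by the frozen model before fine-tuning starts. Setting `lam` to 0 recovers plain fine-tuning on the subject images, with no protection against drift.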