Tsen
# [Paper Notes] Guiding Text-to-Image Diffusion Model Towards Grounded Generation

Paper link: https://arxiv.org/abs/2301.05221

## Overview

This paper proposes a method that extends the Stable Diffusion model to perform object grounding: while generating an image, the model also segments the objects described in the text prompt. The main contributions are:

1. A pipeline for generating a dataset, which is used to train the proposed model
2. An architecture that generates an image and, at the same time, segments the objects mentioned in the text
3. An evaluation showing that the architecture can segment categories it never saw during training

![](https://hackmd.io/_uploads/By6pyrzr2.png)

These notes first introduce the model architecture, then the dataset generation method, and finally some experimental results.

## Architecture

First, the overall goal of the task. The authors start from the Stable Diffusion model and aim to build a model that takes noise and a text prompt as input and outputs an image and a mask:

$$
\{\mathcal{I}, m\} = \Phi_{\text{diffusion}^+}(\epsilon, y)
$$

where $\Phi_{\text{diffusion}^+}$ denotes the model obtained by extending Stable Diffusion.

The figure below shows the architecture of the whole grounding module. Besides the diffusion model, it has three main parts: a visual encoder, a text encoder, and a fusion module.

![](https://hackmd.io/_uploads/SypeBrzB3.png =500x)

### Text Encoder

The text encoder is the pre-trained text encoder of Stable Diffusion (the CLIP encoder). Given a text prompt as input, it produces the corresponding embedding:

$$
\mathcal{E}_{\text{obj}_i} = \Phi_\text{t-enc}(g(y_i))
$$

### Visual Encoder

The input to the visual encoder is the set of intermediate features $\{f^1_i, ..., f^n_i\}$ output by each layer of the UNet in Stable Diffusion at denoising timestep $t=5$. The authors call these features the visual representation. They chose $t=5$ because an ablation study over different timesteps showed that the output at $t=5$ gives the best results.

The visual encoder takes these features and outputs a fused visual feature:

$$
\hat{\mathcal{F}}_i = \Phi_\text{v-enc} (\{f^1_i, ..., f^n_i\})
$$

![](https://hackmd.io/_uploads/Sygj9SMrn.png =500x)

### Fusion Module

The fusion module is a 3-layer transformer decoder. The text embedding is projected to the transformer's Query, and the visual feature is projected to the Key and Value. The transformer outputs a segmentation embedding, which is passed through an MLP; its inner product with the visual feature gives the segmentation mask:

\begin{align}
&\mathcal{E}_{\text{seg}_i} = \Phi_\text{transformer-D}(W^Q \cdot \mathcal{E}_{\text{obj}_i}, W^K \cdot \hat{\mathcal{F}}_i, W^V \cdot \hat{\mathcal{F}}_i) \\
&m_i = \hat{\mathcal{F}}_i \cdot [\Phi_{\text{MLP}}(\mathcal{E}_{\text{seg}_i})]^T
\end{align}

### Training

Assume a training set of {visual feature, segmentation, text prompt} triplets is already available. The training loss is the binary cross-entropy between the predicted and ground-truth masks:

$$
\mathcal{L} = -\frac{1}{N} \sum^N_{i=1} [m_i^{gt} \cdot \log(\sigma(m_i)) + (1 - m_i^{gt}) \cdot \log (1 - \sigma(m_i))]
$$

where $\sigma(\cdot)$ is the Sigmoid function. During training the text encoder is frozen; only the visual encoder and the fusion module are trained.

## Dataset Collection

As the training procedure shows, every sample in the training set is a triplet {visual feature, segmentation, text prompt}, so the goal here is a pipeline that can generate many such triplets.

The authors first prepare a vocabulary of common categories (for example, the PASCAL VOC dataset has 20 common categories) and split it into two subsets, $\mathcal{C}_\text{seen}$ and $\mathcal{C}_\text{unseen}$. Only categories in $\mathcal{C}_\text{seen}$ are used to build the training set. They randomly sample one or two categories from $\mathcal{C}_\text{seen}$ to form a text prompt (for example, "a photograph of a dog and cat"), run it through the Stable Diffusion model to obtain the visual feature and the generated image, and then feed the generated image to a pre-trained Mask R-CNN to obtain the segmentation mask. This yields one training triplet {visual feature, segmentation, text prompt}. Repeating these steps many times produces many such triplets.

![](https://hackmd.io/_uploads/B1_2lL7r2.png)

## Experiments

### Evaluation on Grounded Generation

The first experiment tests how well the proposed architecture performs grounding. Using the categories of two datasets, PASCAL VOC and MS-COCO, the authors build two training sets with the method described above, train on each separately, and then test. The results are in the table below:

![](https://hackmd.io/_uploads/SkMn4PMrh.png)

The metric in the table is mIoU (%). "one" and "two" refer to the number of categories used to build the text prompt, and Split 1 to 3 are three different ways of dividing the categories into $\mathcal{C}_\text{seen}$ and $\mathcal{C}_\text{unseen}$.

The table shows that their method outperforms DAAM, a method based on unsupervised learning.

![](https://hackmd.io/_uploads/BJRKwDzrh.png)

The figure above shows some outputs. The authors point out that sofa, car, hot dog, and bear are categories not seen during training, yet they are still segmented correctly.

### Open-vocabulary Segmentation

To further verify that the grounding method works, the authors run a second experiment: they use the proposed grounding module to generate a synthetic image-segmentation dataset, train a semantic segmentation model on it, and measure its performance.

They first use the 20 PASCAL VOC categories to generate 10,000 synthetic image-segmentation pairs with their guided Stable Diffusion, and then train MaskFormer on this data to obtain a semantic segmentation model.

![](https://hackmd.io/_uploads/B1cRpPMr2.png =500x)

The zero-shot segmentation methods in the upper part of the table are trained on the PASCAL-VOC training set. The table shows that MaskFormer trained on the synthetic dataset outperforms most zero-shot segmentation methods. A gap remains to ZegFormer, the current state of the art, but ZegFormer is trained on real images, whereas their MaskFormer is trained only on generated synthetic images.
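
As a rough illustration of the mask prediction in the Fusion Module section and the loss in the Training section, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the tiny 4-pixel feature map, and the 2-dimensional embedding are all made up for illustration, and the vector `e` simply stands in for the MLP output $\Phi_{\text{MLP}}(\mathcal{E}_{\text{seg}_i})$. The loss uses the standard binary cross-entropy form, with $1 - \sigma(m_i)$ in the second term.

```python
import math


def sigmoid(x):
    """Logistic function sigma(x)."""
    return 1.0 / (1.0 + math.exp(-x))


def mask_logits(visual_feature, seg_embedding):
    """Per-pixel logits m = F_hat . e^T for one object.

    visual_feature: list of per-pixel feature vectors (H*W rows, D columns).
    seg_embedding:  D-dim vector standing in for MLP(E_seg) (hypothetical input).
    """
    return [sum(f * e for f, e in zip(pixel, seg_embedding))
            for pixel in visual_feature]


def bce_loss(logits, gt_mask, eps=1e-12):
    """Binary cross-entropy between sigmoid(logits) and a 0/1 ground-truth mask."""
    total = 0.0
    for m, g in zip(logits, gt_mask):
        p = sigmoid(m)
        # eps only guards against log(0) for extreme logits.
        total += g * math.log(p + eps) + (1 - g) * math.log(1 - p + eps)
    return -total / len(logits)


# Toy example: 4 "pixels" with 2-dim features (values are illustrative only).
F_hat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
e = [2.0, -2.0]
gt = [1, 0, 1, 0]

logits = mask_logits(F_hat, e)
loss = bce_loss(logits, gt)
```

In a real setup only the visual encoder and fusion module would receive gradients from this loss, since the text encoder is frozen.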

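
The triplet collection loop from the Dataset Collection section can be sketched roughly as follows. This is only a sketch under stated assumptions: `generate` and `segment` are hypothetical callables standing in for Stable Diffusion and the pre-trained Mask R-CNN, the five-category vocabulary and the number of unseen categories are arbitrary, and only the prompt template "a photograph of a ..." comes from the note.

```python
import random


def split_categories(categories, n_unseen, rng):
    """Randomly split a vocabulary into (seen, unseen) subsets."""
    shuffled = categories[:]
    rng.shuffle(shuffled)
    return shuffled[n_unseen:], shuffled[:n_unseen]


def make_prompt(seen, rng):
    """Sample one or two seen categories and build a text prompt."""
    k = rng.choice([1, 2])
    chosen = rng.sample(seen, k)
    return chosen, "a photograph of a " + " and ".join(chosen)


def collect_triplets(seen, n, generate, segment, rng):
    """Build n {visual feature, segmentation, text prompt} triplets.

    generate(prompt) -> (features, image)   # hypothetical Stable Diffusion wrapper
    segment(image, categories) -> masks     # hypothetical Mask R-CNN wrapper
    """
    triplets = []
    for _ in range(n):
        chosen, prompt = make_prompt(seen, rng)
        features, image = generate(prompt)
        masks = segment(image, chosen)
        triplets.append((features, masks, prompt))
    return triplets


# Illustrative vocabulary; the real experiments use the PASCAL VOC / MS-COCO categories.
rng = random.Random(0)
vocabulary = ["dog", "cat", "car", "sofa", "bus"]
seen, unseen = split_categories(vocabulary, 2, rng)
```

Passing the models in as callables keeps the sampling logic testable with stubs, without loading any weights.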