---
# System prepended metadata

title: OSCAR (Object-Semantics Aligned Pre-training)
tags: [V+L, Paper Notes]

---

###### tags: `Paper Notes` `V+L`
# OSCAR (Object-Semantics Aligned Pre-training)
* Paper: Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
* Affiliations: Microsoft Corporation, University of Washington
* Year: 2020

### Introduction
* Inspired by BERT, vision-language (V+L) models have recently adopted vision-language pre-training (VLP), and OSCAR is one of them.
* OSCAR feeds <word, tag, region> triples into a multi-layer Transformer for pre-training and is then fine-tuned on specific downstream tasks, as shown in Fig. 1.
  * word: the word sequence
  * tag: a set of object tags
  * region: a set of image region features

<center><img src="https://i.imgur.com/sN4w5j9.png" width=600></center>

* VLP: uses self-supervised learning to learn cross-modal representations of image-text pairs; conceptually it is much like BERT.

### Oscar Pre-training
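* Input: the text and the image are first processed as follows:
  * text: apply word embedding to obtain $w$.
  * image: run Faster R-CNN to obtain region features $v$ for several detected objects, together with their corresponding tags (in text form). Apply word embedding to the tags to obtain $q$.
* The OSCAR input can be read from two views:
  * $x$: modality view
  * $x'$: dictionary view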

<center><img src="https://i.imgur.com/nbVh5BQ.png" width=300></center>
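The two views are just two ways of grouping the same triple. The following is a minimal sketch of that grouping, not the authors' code: the class name `OscarInput`, its method names, and the dictionary keys are hypothetical, and the token order of the discrete sequence follows the definition $h = [q, w]$ used in these notes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OscarInput:
    """One <word, tag, region> triple (hypothetical container for illustration)."""
    w: List[str]          # word tokens of the text
    q: List[str]          # object tags from the detector, in text form
    v: List[List[float]]  # one feature vector per detected region

    def modality_view(self) -> Dict[str, object]:
        # x: language modality (w) versus image modality h' = [q, v]
        return {"language": self.w, "image": (self.q, self.v)}

    def dictionary_view(self) -> Dict[str, object]:
        # x': discrete tokens h = [q, w] versus continuous region features v
        return {"discrete_tokens": self.q + self.w, "region_features": self.v}


example = OscarInput(
    w=["a", "dog", "on", "a", "couch"],
    q=["dog", "couch"],
    v=[[0.1, 0.2], [0.3, 0.4]],  # toy 2-d features, real ones are much larger
)
print(example.modality_view())
print(example.dictionary_view())
```

The tags $q$ appear on the image side in one view and on the token side in the other, which is why they can serve as a bridge between the two modalities.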

* Each view leads to a different loss for self-supervised learning:
  * Modality View: $Contrastive\ Loss$
    * Treat $h' = [q, v]$ as the image modality and $w$ as the language modality.
    * With 50% probability, $q$ is replaced by a different tag sequence.
    * The replacement tag sequence is randomly sampled from the dataset $D$.
    * A fully connected (FC) layer is attached to the [CLS] output as a binary classifier $f(\cdot)$, which predicts whether $q$ is the original tag sequence.
    * The contrastive loss $L_{C}$ is:
      $$
      L_{C} = -E_{(h',w) \sim D} \log p(y|f(h',w))
      $$
      * $y = 1$ means $q$ is the original.
      * $y = 0$ means $q$ has been replaced.
  * Dictionary View: $Masked\ Token\ Loss$
    * Define the discrete token sequence $h = [q, w]$.
    * Each token in $h$ is masked with 15% probability; masked positions are marked with [MASK].
    * The model has to predict the masked tokens, trained the same way as BERT.
    * The masked token loss $L_{MTL}$ is:
      $$
      L_{MTL} = -E_{(v, h) \sim D} \log p(h_{i} | h_{\setminus i}, v)
      $$
      * $h_{i}$: the masked token.
      * $h_{\setminus i}$: the surrounding tokens that are not masked.
* Putting the two together, the OSCAR pre-training objective is (see Fig. 3):
  $$
  L_{\text{pre-training}} = L_{C} + L_{MTL}
  $$
  * $L_{C}$: binary classification of whether the original tag sequence is used.
  * $L_{MTL}$: prediction of the masked tokens.

<center><img src="https://i.imgur.com/xGG4oyu.png" width=520></center>
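The data corruption behind the two losses can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the model is replaced by hand-written probabilities, masking is simplified to always writing [MASK], and both losses are taken as negative log-likelihoods averaged over the given examples.

```python
import math
import random

MASK = "[MASK]"


def pollute_tags(q, tag_pool, rng, p=0.5):
    """With probability p, replace q by a different tag sequence from tag_pool.

    Returns (tags, y): y = 1 if q is kept, y = 0 if it was replaced.
    """
    candidates = [t for t in tag_pool if t != q]
    if candidates and rng.random() < p:
        return rng.choice(candidates), 0
    return q, 1


def mask_tokens(h, rng, p=0.15):
    """Mask each token of h independently with probability p.

    Returns the masked sequence and the indices the model must predict.
    """
    masked, targets = [], []
    for i, tok in enumerate(h):
        if rng.random() < p:
            masked.append(MASK)
            targets.append(i)
        else:
            masked.append(tok)
    return masked, targets


def contrastive_loss(p_original, y):
    """-log p(y | f(h', w)), where p_original is the classifier's p(y = 1)."""
    return -math.log(p_original if y == 1 else 1.0 - p_original)


def masked_token_loss(probs_of_true_tokens):
    """Mean of -log p(h_i | h_without_i, v) over the masked positions."""
    if not probs_of_true_tokens:
        return 0.0
    return -sum(math.log(p) for p in probs_of_true_tokens) / len(probs_of_true_tokens)


rng = random.Random(0)
tags, y = pollute_tags(["dog", "couch"], [["cat", "sofa"], ["car"]], rng)
masked, targets = mask_tokens(["dog", "couch", "a", "dog", "on", "a", "couch"], rng)
# Stand-in model outputs: p(y = 1) = 0.9 and two masked-token probabilities.
total = contrastive_loss(0.9, y) + masked_token_loss([0.5, 0.25])
print(tags, y, masked, targets, total)
```

Both terms are non-negative and shrink as the model assigns higher probability to the correct label or token, so minimizing their sum matches the objective above.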

* **Pre-training Corpus**: OSCAR is pre-trained on several V+L datasets, including COCO [21], Conceptual Captions (CC) [31], SBU captions [26], Flickr30k [44], and GQA [3]. In total there are 4.1 million images and 6.5 million text-tag-image triples.
* Model hyperparameters:
  * OSCAR~L~ uses the BERT~large~ parameters, H = 1024.
  * OSCAR~B~ uses the BERT~base~ parameters, H = 768.
  * So that $v$ has the same dimension as $w$ and $q$, $v$ goes through a linear transformation before being fed into OSCAR.
  * Optimizer: AdamW
  * OSCAR~B~ is trained for 1 million steps with learning rate = 5e-5 and batch size = 768.
  * OSCAR~L~ is trained for 900k steps with learning rate = 1e-5 and batch size = 512.
  * The sequence lengths of $h$ and $v$ are 35 and 50, respectively.
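The settings above, together with the linear projection of $v$ to the hidden size H, can be collected in a short sketch. The names `OscarConfig`, `make_projection`, and `project` are hypothetical, the toy input dimension `REGION_DIM = 8` and the initialization scale 0.02 are arbitrary choices for illustration (the real detector feature size is not stated in these notes), and nothing here is the authors' code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class OscarConfig:
    hidden_size: int          # H
    train_steps: int
    learning_rate: float
    batch_size: int
    max_token_len: int = 35   # sequence length of h
    max_region_len: int = 50  # sequence length of v
    optimizer: str = "AdamW"


OSCAR_B = OscarConfig(hidden_size=768, train_steps=1_000_000,
                      learning_rate=5e-5, batch_size=768)
OSCAR_L = OscarConfig(hidden_size=1024, train_steps=900_000,
                      learning_rate=1e-5, batch_size=512)


def make_projection(in_dim, out_dim, rng):
    """Randomly initialized weights (out_dim x in_dim) and a zero bias."""
    weights = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(v, weights, bias):
    """Apply y = W v + b to one region feature vector."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(weights, bias)]


REGION_DIM = 8  # toy value, assumed only for this example
weights, bias = make_projection(REGION_DIM, OSCAR_B.hidden_size, random.Random(0))
projected = project([1.0] * REGION_DIM, weights, bias)
print(len(projected))  # same length as the hidden size H of OSCAR_B
```

After the projection, every region vector has the same width as the word and tag embeddings, so all three can be concatenated into one Transformer input sequence.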

### Experiments & Results
(I work on VQA, so this section only covers the VQA results; see the paper for the rest.)
* Table 2 (b) compares OSCAR with other state-of-the-art VQA models.

<center><img src="https://i.imgur.com/Lesm7vO.png" width=580></center>

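* The authors feed image regions and word tokens into the model, take the output of the last layer as the learned semantic features, project them to 2D with t-SNE, and inspect their spatial distribution. As shown in Fig. 4, OSCAR places similar objects closer together than the baseline does.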

<center><img src="https://i.imgur.com/oMT3Uea.png" width=580></center>

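* The authors train two Faster R-CNN detectors, one on Visual Genome (VG) and one on Open Images (OI), to study the effect of different tag sets. The results are in Table 4. OSCAR^VG^ outperforms OSCAR^OI^, presumably because VG covers more object categories.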

<center><img src="https://i.imgur.com/zxlawVI.png" width=580></center>

### References
[3] Aligning sentences in parallel corpora.
[21] Microsoft COCO: Common objects in context.
[26] Im2text: Describing images using 1 million captioned photographs.
[31] Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning.
[44] From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions.