---
# System prepended metadata

title: MCAN (Deep Modular Co-Attention Networks)
tags: [Paper Notes]

---

###### tags: `Paper Notes`
# MCAN (Deep Modular Co-Attention Networks)

### Introduction
* Paper: [Deep Modular Co-Attention Networks for Visual Question Answering](https://arxiv.org/abs/1906.10770)
* GitHub: https://github.com/MILVLG/mcan-vqa
* Published: 2019

### Self-Attention and Guided-Attention Units
* The Modular Co-Attention Network (MCAN) is built from Modular Co-Attention (MCA) layers, and each MCA layer is in turn composed of self-attention (SA) units and guided-attention (GA) units.
* For background on attention and Transformer training, see Hung-yi Lee's lectures.
* As shown in Figure 2a, the SA unit takes a single input $X = [x_1;...;x_m] \in R^{m \times d_x}$. $X$ passes through multi-head attention, followed by a residual connection and layer normalization, and then through a feed-forward layer, again followed by a residual connection and layer normalization.
  * The feed-forward layer is FC(4d)→ReLU→Dropout(0.1)→FC(d).
  * $d$ is the dimension of the attention keys and values: $d_{key} = d_{value} = d$.
* As shown in Figure 2b, the GA unit has almost the same structure as the SA unit, except that it takes two inputs, $X \in R^{m \times d_x}$ and $Y = [y_1;...;y_n] \in R^{n \times d_y}$. $X$ provides the queries, while $Y$ provides the keys and values.
  <img src="https://i.imgur.com/BRuPZUD.png"  />
  <center>Figure 2: The two building blocks of an MCA layer: SA and GA.</center>

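To make the difference between the two units concrete, here is a minimal single-head sketch of their attention cores in plain Python. It is an illustration under simplifying assumptions, not the authors' implementation: the learned query/key/value projections, the multi-head split, the residual connections, layer normalization and the feed-forward layer are all omitted, and $X$ and $Y$ are assumed to already share the same feature dimension. The helper names (`attention`, `self_attention`, `guided_attention`) are made up for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: one output row per query row."""
    d = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(len(values[0]))])
    return attended

def self_attention(x):
    """SA core (simplified): X supplies queries, keys and values."""
    return attention(x, x, x)

def guided_attention(x, y):
    """GA core (simplified): X supplies queries, Y supplies keys and values."""
    return attention(x, y, y)

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # m = 3 toy image regions
Y = [[0.5, 0.5], [2.0, 0.0]]              # n = 2 toy question words
sa_out = self_attention(X)
ga_out = guided_attention(X, Y)
print(len(sa_out), len(ga_out))  # both outputs keep m = 3 rows
```

Because $X$ provides the queries in both units, the output always has $m$ rows. In the GA case each output row is a weighted average of the rows of $Y$.
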
### Modular Composition for VQA (MCA layer)
![](https://i.imgur.com/Q686vqB.png)
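<center>Figure 3: Three MCA layer variants. (X) denotes the image and (Y) denotes the question.</center>

* As shown in Figure 3, the authors build three MCA layer variants from the SA and GA units.
* ID(Y)-GA(X, Y): see Figure 3a. The question features pass through unchanged (identity mapping) and directly guide the attention over the image. The result is called the attended image features.
* SA(Y)-GA(X, Y): see Figure 3b. The question first goes through self-attention and then guides the attention over the image.
* SA(Y)-SGA(X, Y): see Figure 3c. Both the question and the image first go through self-attention, and guided attention is applied afterwards.
* The authors use ID(Y)-GA(X, Y) as the baseline.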

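The three variants differ only in how the units are wired together, which a small symbolic sketch can show. The `SA` and `GA` functions below are placeholders that just record the call as a string; they are not real attention units, and the function names are invented for this illustration.

```python
def SA(x):
    """Placeholder self-attention unit: records the call as a string."""
    return f"SA({x})"

def GA(x, y):
    """Placeholder guided-attention unit: x is guided by y."""
    return f"GA({x}, {y})"

def id_ga(x, y):
    # ID(Y)-GA(X, Y): the question passes through unchanged.
    return GA(x, y), y

def sa_ga(x, y):
    # SA(Y)-GA(X, Y): self-attention on the question, then guided attention.
    y = SA(y)
    return GA(x, y), y

def sa_sga(x, y):
    # SA(Y)-SGA(X, Y): self-attention on both inputs, then guided attention.
    y = SA(y)
    return GA(SA(x), y), y

for variant in (id_ga, sa_ga, sa_sga):
    print(variant.__name__, variant("X", "Y"))
```

Each function returns the pair (new image features, new question features), so the output of one layer can be fed directly into the next.
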
### Question and Image Representations
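* The image is first passed through Faster R-CNN (with a ResNet-101 backbone, pre-trained on the Visual Genome dataset), which detects $m \in [10, 100]$ objects. The features of each object are mean-pooled into a $d_x$-dimensional vector $x_i \in R^{d_x}$, giving the image feature matrix $X \in R^{m \times d_x}$ (see Figure 4).
* Each question is truncated to at most 14 words. Every word is embedded with 300-D GloVe, forming an $n \times 300$ matrix ($n$ is the number of words, $n \in [1, 14]$), which is then passed through a one-layer LSTM with $d_y$ hidden units. This gives the question feature matrix $Y \in R^{n \times d_y}$ (see Figure 4).
* Images and questions shorter than the maximum length are filled with zero-padding. During training, the padded positions are also masked to $-\infty$ before every softmax layer, so their softmax outputs are all 0.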

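The masking step can be sketched in a few lines of plain Python. This is only an illustration of the idea: `masked_softmax` and the `is_padding` flags are hypothetical names, and the sketch assumes at least one position is not padding (otherwise the normalizer would be zero).

```python
import math

def masked_softmax(scores, is_padding):
    """Softmax that assigns zero weight to padded positions."""
    masked = [-math.inf if pad else s for s, pad in zip(scores, is_padding)]
    peak = max(masked)  # finite as long as at least one position is real
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.0, 0.0]            # attention logits for 4 positions
is_padding = [False, False, True, True]  # the last two positions are zero-padding
weights = masked_softmax(scores, is_padding)
print(weights)
```

The padded positions receive exactly zero weight, and the remaining weights still sum to 1.
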
### Deep Co-Attention Learning
<img src="https://i.imgur.com/peBOO4f.png" style="zoom: 67%;" />
<center>Figure 4: Overview of the deep Modular Co-Attention Network (MCAN).</center>

![](https://i.imgur.com/ft6JFKh.png)
<center>Figure 5: Two deep co-attention models: stacking and encoder-decoder.</center>

* Once the image features $X$ and question features $Y$ are available, they are fed into $L$ MCA layers for deep co-attention learning. $MCA^{(l)}$ takes $[X^{(l-1)}, Y^{(l-1)}]$ as input and outputs $[X^{(l)}, Y^{(l)}]$.
  $$
  [X^{(l)}, Y^{(l)}] = MCA^{(l)}([X^{(l-1)}, Y^{(l-1)}])
  $$
  where $X^{(0)} = X,\ Y^{(0)} = Y$.
* As shown in Figure 5, taking SA(Y)-SGA(X,Y) as an example, there are two ways to compose the MCA layers: stacking and encoder-decoder. They differ mainly in the guided attention (GA) part. In stacking, the $X$ of each layer is guided by the $Y$ of the same layer, whereas in encoder-decoder the $X$ of every layer is guided by the $Y$ of the last layer.
* The encoder-decoder model can be understood as encoding $Y$ into $Y^{(L)}$ and then using $Y^{(L)}$ to decode $X$ into the attended image features $X^{(L)}$.
* In both cases the model contains $L$ MCA layers in total.

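The two composition strategies can be sketched as two small loops over generic `sa` and `sga` callables. This is a structural illustration only: the callables passed in below are tracers invented for the example (the question is represented by its layer index, the image by the list of question layers that guided it), not real attention units.

```python
def stacking(x, y, num_layers, sa, sga):
    """Each layer updates Y, then guides X with the Y of that same layer."""
    for _ in range(num_layers):
        y = sa(y)
        x = sga(x, y)
    return x, y

def encoder_decoder(x, y, num_layers, sa, sga):
    """Encode Y through all layers first, then guide every X layer with Y^(L)."""
    for _ in range(num_layers):  # encoder: SA units on the question
        y = sa(y)
    for _ in range(num_layers):  # decoder: SGA units on the image
        x = sga(x, y)
    return x, y

# Tracer units (hypothetical): y counts question layers, x logs which Y guided it.
trace_sa = lambda y: y + 1
trace_sga = lambda x, y: x + [y]

stacked_trace, _ = stacking([], 0, 3, trace_sa, trace_sga)
encdec_trace, _ = encoder_decoder([], 0, 3, trace_sa, trace_sga)
print(stacked_trace)  # Y layer used by X layers 1..3 under stacking
print(encdec_trace)   # Y layer used by X layers 1..3 under encoder-decoder
```

With $L = 3$, the stacking trace lists a different question layer for every image layer, while the encoder-decoder trace shows every image layer attending to $Y^{(3)}$.
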
### Multimodal Fusion and Output Classifier
* After deep co-attention learning, we obtain the image features $X^{(L)} = [x_{1}^{(L)};...;x_{m}^{(L)}] \in R^{m \times d}$ and the question features $Y^{(L)} = [y_{1}^{(L)};...;y_{n}^{(L)}] \in R^{n \times d}$.
* $X^{(L)}$ and $Y^{(L)}$ are each fed into an attentional reduction model (att. reduce), structured as follows:
  * $X^{(L)}$ is passed through a two-layer MLP: FC(d)-ReLU-Dropout(0.1)-FC(1).
  * The MLP output goes through a softmax, giving an attention weight $\alpha_i$ for each element $x_{i}^{(L)}$ of $X^{(L)}$. A weighted sum then yields the attended feature $\tilde{x}$, where $\alpha = [\alpha_1, \alpha_2, ..., \alpha_m] \in R^{m}$:
    $$
    \begin{aligned}
    \alpha &= softmax(MLP(X^{(L)})) \\
    \tilde{x} &= \sum_{i=1}^{m} \alpha_i x_{i}^{(L)}
    \end{aligned}
    $$
  * $Y^{(L)}$ is handled the same way, giving $\tilde{y}$.
* $\tilde{x}$ and $\tilde{y}$ are combined by linear multimodal fusion into $z \in R^{d_z}$:
  $$
  z = LayerNorm(W_{x}^{T}\tilde{x} + W_{y}^{T}\tilde{y})
  $$
  * $W_{x}, W_{y} \in R^{d \times d_z}$
* $z$ is passed through an FC layer to obtain an $N$-dimensional vector ($N$ = number of candidate answers), and a sigmoid then gives the probability of each candidate answer.
* Binary cross-entropy (BCE) is used as the loss function.

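The attentional reduction and fusion steps can be traced on toy numbers. The sketch below is a simplified illustration, not the authors' code: the MLP is replaced by hand-picked scores, the layer normalization has no learnable gain or bias, and all weights and inputs are made-up values with $d = 2$ and $d_z = 3$.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentional_reduce(features, scores):
    """Weighted sum of the feature rows, with softmax(scores) as weights."""
    alpha = softmax(scores)
    dim = len(features[0])
    return [sum(a * row[j] for a, row in zip(alpha, features)) for j in range(dim)]

def project(weight, vec):
    """Compute W^T vec for W of shape (d, d_z), stored as d rows."""
    d_z = len(weight[0])
    return [sum(weight[i][k] * vec[i] for i in range(len(vec))) for k in range(d_z)]

def layer_norm(vec, eps=1e-6):
    """Layer normalization without learnable gain and bias (simplification)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def fuse(x_tilde, y_tilde, w_x, w_y):
    """z = LayerNorm(W_x^T x_tilde + W_y^T y_tilde)."""
    summed = [a + b for a, b in zip(project(w_x, x_tilde), project(w_y, y_tilde))]
    return layer_norm(summed)

X_L = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # m = 3, d = 2
Y_L = [[2.0, 0.0], [0.0, 2.0]]              # n = 2, d = 2
x_scores = [0.0, 0.0, math.log(2.0)]        # stand-in for MLP(X_L)
y_scores = [0.0, 0.0]                       # stand-in for MLP(Y_L)
W_x = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]    # d x d_z
W_y = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # d x d_z

x_tilde = attentional_reduce(X_L, x_scores)
y_tilde = attentional_reduce(Y_L, y_scores)
z = fuse(x_tilde, y_tilde, W_x, W_y)
print(x_tilde, y_tilde, z)
```

Here the third image region gets twice the weight of the other two, so $\tilde{x}$ is approximately $[0.75, 0.75]$, and the fused vector $z$ has $d_z = 3$ entries with zero mean.
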
### Experiments
![](https://i.imgur.com/twbFR3b.png)
* Evaluation uses the VQA-v2 dataset.
* Implementation details:
  * $d_x = 2048,\ d_y = 512,\ d_z = 1024$
  * $d = 512,\ h = 8,\ d_h = \frac{d}{h} = 64$
  * $N = 3129$
  * $L = 6$
  * To train the MCAN model, we use the Adam solver with $\beta_1 = 0.9$ and $\beta_2 = 0.98$.
  * The base learning rate is set to $\min(2.5t \times 10^{-5}, 1 \times 10^{-4})$, where $t$ is the current epoch number starting from 1. After 10 epochs, the learning rate is decayed by 1/5 every 2 epochs.
  * All the models are trained up to 13 epochs with the same batch size 64.
* Experimental findings:
  * SA(Y)-SGA(X,Y) > SA(Y)-GA(X,Y) > ID(Y)-GA(X,Y)
  * encoder-decoder > stacking

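The learning-rate schedule in the implementation details can be written as a small function of the epoch number. The sketch rests on one reading of the description, which is an assumption: "decayed by 1/5 every 2 epochs" is taken to mean multiplying the rate by 0.2 at epoch 11 and again at epoch 13. The function name `learning_rate` is made up for this example.

```python
def learning_rate(epoch):
    """Learning rate for a 1-based epoch number (assumed reading of the schedule)."""
    base = min(2.5e-5 * epoch, 1e-4)       # linear warm-up, capped at 1e-4
    num_decays = max(0, (epoch - 9) // 2)  # assumed: decays start at epochs 11 and 13
    return base * (0.2 ** num_decays)

for epoch in range(1, 14):
    print(epoch, learning_rate(epoch))
```

Under this reading the rate warms up over the first four epochs, stays at its cap through epoch 10, and is then reduced twice before training ends at epoch 13.
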