
###### tags: `Paper Notes`

# MCAN (Deep Modular Co-Attention Networks)

### Introduction

* Paper: [Deep Modular Co-Attention Networks for Visual Question Answering](https://arxiv.org/abs/1906.10770)
* Github: https://github.com/MILVLG/mcan-vqa
* Published: 2019

### Self-Attention and Guided-Attention Units

* The Modular Co-Attention Network (MCAN) is built from Modular Co-Attention (MCA) layers, and each MCA layer is in turn built from self-attention (SA) units and guided-attention (GA) units.
    * For background on attention and Transformer training, see Hung-yi Lee's (李宏毅) lectures.
* As shown in Figure 2a, the SA unit takes a single input $X = [x_1;...;x_m] \in R^{m \times d_x}$. $X$ passes through multi-head attention, followed by a residual connection and layer normalization, and then through a feed-forward layer, again followed by a residual connection and layer normalization.
    * The feed-forward layer is FC(4d)→ReLU→Dropout(0.1)→FC(d).
    * $d$ is the dimension of the attention keys and values: $d_{key} = d_{value} = d$.
* As shown in Figure 2b, the GA unit has almost the same structure as the SA unit, except that it takes two inputs, $X \in R^{m \times d_x}$ and $Y = [y_1;...;y_n] \in R^{n \times d_y}$. $Y$ serves as the keys and values, and $X$ serves as the queries.

<img src="https://i.imgur.com/BRuPZUD.png" />
<center>Figure 2: The two building blocks of an MCA layer: SA and GA.</center>

### Modular Composition for VQA (MCA layer)

![](https://i.imgur.com/Q686vqB.png)
<center>Figure 3: Three MCA layer variants. (X) denotes the image and (Y) denotes the question.</center>

* As shown in Figure 3, the authors combine SA and GA units into three MCA layer variants.
    * ID(Y)-GA(X, Y): see Figure 3a. The question is passed through unchanged (identity mapping) and used directly for guided attention with the image. The output of the attention is called the attended image features.
    * SA(Y)-GA(X, Y): see Figure 3b. The question first goes through self-attention and is then used for guided attention with the image.
    * SA(Y)-SGA(X, Y): see Figure 3c. Both the question and the image first go through self-attention, and guided attention is applied afterwards.
* The authors use ID(Y)-GA(X, Y) as the baseline.

### Question and Image Representations

* Image: the image is passed through Faster R-CNN (ResNet-101 backbone, pretrained on the Visual Genome dataset), which detects $m \in [10, 100]$ objects. The features of each object are mean-pooled into a $d_x$-dimensional vector $x_i \in R^{d_x}$, giving the image feature matrix $X \in R^{m \times d_x}$ (see Figure 4).
* Question: at most 14 words of the question are fed to the model. Each word is embedded with 300-D GloVe, giving an $n \times 300$ matrix ($n$ is the number of words, $n \in [1, 14]$), which then passes through a one-layer LSTM with $d_y$ hidden units, finally yielding the
question feature matrix $Y \in R^{n \times d_y}$ (see Figure 4).
* Images and questions shorter than the maximum size (100 objects, 14 words) are filled up with zero-padding. During training, the padded positions are also masked to $-\infty$ before every softmax layer, so that their softmax outputs are all 0.

### Deep Co-Attention Learning

<img src="https://i.imgur.com/peBOO4f.png" style="zoom: 67%;" />
<center>Figure 4: Overview of the deep Modular Co-Attention Network (MCAN).</center>

![](https://i.imgur.com/ft6JFKh.png)
<center>Figure 5: Two deep co-attention models: stacking and encoder-decoder.</center>

* Once the image features $X$ and question features $Y$ are available, they are fed into $L$ MCA layers for deep co-attention learning. $MCA^{(l)}$ takes $[X^{(l-1)}, Y^{(l-1)}]$ as input and outputs $[X^{(l)}, Y^{(l)}]$:

$$
[X^{(l)}, Y^{(l)}] = MCA^{(l)}([X^{(l-1)}, Y^{(l-1)}])
$$

  where $X^{(0)} = X,\ Y^{(0)} = Y$.
* As shown in Figure 5, taking SA(Y)-SGA(X,Y) as an example, there are two ways to stack MCA layers: stacking and encoder-decoder. They differ mainly in the guided attention (GA) part. In stacking, the $X$ of each layer is guided by the $Y$ of that same layer, whereas in encoder-decoder the $X$ of every layer is guided by the $Y$ of the last layer.
    * Encoder-decoder can be read as first encoding $Y$ into $Y^{(L)}$, and then using $Y^{(L)}$ to decode $X$ into the attended image features $X^{(L)}$.
* The whole model stacks $L$ MCA layers in total.

### Multimodal Fusion and Output Classifier

* After deep co-attention learning, we have the image features $X^{(L)} = [x_{1}^{(L)};...;x_{m}^{(L)}] \in R^{m \times d}$ and the question features $Y^{(L)} = [y_{1}^{(L)};...;y_{n}^{(L)}] \in R^{n \times d}$.
* $X^{(L)}$ and $Y^{(L)}$ are each fed into an attentional reduction model (att.
reduce), which works as follows:
    * $X^{(L)}$ is passed through a two-layer MLP with the structure FC(d)-ReLU-Dropout(0.1)-FC(1).
    * The MLP output then goes through a softmax, which gives the attention weight $\alpha_i$ of each element $x_{i}^{(L)}$ of $X^{(L)}$. A weighted sum then gives the attended feature $\tilde{x}$, with $\alpha = [\alpha_1, \alpha_2, ..., \alpha_m] \in R^{m}$:

$$
\begin{aligned}
\alpha &= softmax(MLP(X^{(L)})) \\
\tilde{x} &= \sum_{i=1}^{m} \alpha_i x_{i}^{(L)}
\end{aligned}
$$

    * $Y^{(L)}$ is reduced in the same way, giving $\tilde{y}$.
* The linear multimodal fusion $z \in R^{d_z}$ of $\tilde{x}$ and $\tilde{y}$ is computed as:

$$
z = LayerNorm(W_{x}^{T}\tilde{x} + W_{y}^{T}\tilde{y})
$$

    * $W_{x}, W_{y} \in R^{d \times d_z}$
* $z$ is passed through an FC layer that maps it to an $N$-dimensional vector ($N$ = number of candidate answers). A final sigmoid gives the probability of each candidate answer.
* Binary cross-entropy (BCE) is used as the loss function.

### Experiments

![](https://i.imgur.com/twbFR3b.png)

* Evaluation is done on the VQA-v2 dataset.
* Implementation details:
    * $d_x = 2048,\ d_y = 512,\ d_z = 1024$
    * $d = 512,\ h = 8,\ d_h = \frac{d}{h} = 64$
    * $N = 3129$
    * $L = 6$
    * The MCAN model is trained with the Adam solver with $\beta_1 = 0.9$ and $\beta_2 = 0.98$.
    * The base learning rate is set to $\min(2.5t \times 10^{-5},\ 1 \times 10^{-4})$, where $t$ is the current epoch number starting from 1. After 10 epochs, the learning rate is decayed by 1/5 every 2 epochs.
    * All models are trained for up to 13 epochs with the same batch size of 64.
* The experiments show (where ">" means "performs better than"):
    * SA(Y)-SGA(X,Y) > SA(Y)-GA(X,Y) > ID(Y)-GA(X,Y)
    * encoder-decoder > stacking
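
The learning-rate rule listed under the implementation details can be written as a small function. This is only a minimal sketch, not the authors' code: the name `mcan_learning_rate` and its parameters are hypothetical, and it assumes that "decayed by 1/5 every 2 epochs" means the rate is multiplied by 0.2 at epoch 11 and again at epoch 13.

```python
def mcan_learning_rate(
    epoch: int,
    warmup_step: float = 2.5e-5,
    base_lr: float = 1e-4,
    decay_start: int = 10,
    decay_every: int = 2,
    decay_factor: float = 0.2,
) -> float:
    """Learning rate for a 1-based epoch number (hypothetical helper)."""
    if epoch < 1:
        raise ValueError("epoch numbers start from 1")
    # Warm-up: min(2.5e-5 * t, 1e-4), which reaches the base rate at epoch 4.
    lr = min(warmup_step * epoch, base_lr)
    # Assumed decay: multiply by 0.2 once per 2-epoch block after epoch 10.
    if epoch > decay_start:
        num_decays = (epoch - decay_start - 1) // decay_every + 1
        lr *= decay_factor ** num_decays
    return lr


# Print the schedule for the 13 training epochs mentioned in the notes.
for t in range(1, 14):
    print(t, mcan_learning_rate(t))
```

Under these assumptions the rate warms up over epochs 1 to 4, stays at the base rate through epoch 10, and is then reduced for epochs 11 to 12 and again for epoch 13.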
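
The attentional reduction and linear fusion steps from the Multimodal Fusion and Output Classifier section can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the reference implementation: all function names are hypothetical, the weights are random instead of learned, dropout is left out (inference mode), and the layer normalization has no learned gain or bias.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()


def attentional_reduce(features, w1, b1, w2, b2, pad_mask=None):
    """Collapse (num_items, d) features into a single d-dim attended vector.

    MLP = FC(d) -> ReLU -> FC(1). pad_mask is True on zero-padded rows.
    """
    hidden = np.maximum(features @ w1 + b1, 0.0)  # (num_items, d)
    scores = (hidden @ w2 + b2)[:, 0]             # (num_items,)
    if pad_mask is not None:
        # Padded rows get -inf so that their softmax weight is exactly 0.
        scores = np.where(pad_mask, -np.inf, scores)
    alpha = softmax(scores)
    return alpha @ features, alpha                # weighted sum, weights


def layer_norm(z, eps=1e-6):
    """Layer normalization without learned gain/bias (simplification)."""
    return (z - z.mean()) / np.sqrt(z.var() + eps)


def fuse(x_tilde, y_tilde, w_x, w_y):
    """z = LayerNorm(W_x^T x~ + W_y^T y~), with W_x, W_y of shape (d, d_z)."""
    return layer_norm(w_x.T @ x_tilde + w_y.T @ y_tilde)


# Toy sizes for illustration only (the notes use d = 512, d_z = 1024).
rng = np.random.default_rng(0)
d, d_z, m, n = 8, 16, 5, 3


def random_mlp():
    """Random stand-in for the learned MLP weights (assumption)."""
    return rng.normal(size=(d, d)), np.zeros(d), rng.normal(size=(d, 1)), np.zeros(1)


x_feats = rng.normal(size=(m, d))
x_feats[3:] = 0.0                                  # last two rows are padding
x_mask = np.array([False, False, False, True, True])
y_feats = rng.normal(size=(n, d))

x_tilde, alpha_x = attentional_reduce(x_feats, *random_mlp(), pad_mask=x_mask)
y_tilde, alpha_y = attentional_reduce(y_feats, *random_mlp())
z = fuse(x_tilde, y_tilde, rng.normal(size=(d, d_z)), rng.normal(size=(d, d_z)))
print(alpha_x, z.shape)
```

The image and question branches use separate MLP weights here, and the mask shows how the $-\infty$ trick keeps zero-padded rows out of the weighted sum. The sketch does not handle the degenerate case where every row is padding.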