###### tags: `Paper Notes`

# BUTD (Bottom-Up and Top-Down Attention)

### Introduction

* Paper: [Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering](https://arxiv.org/abs/1707.07998)
* Published: 2018

### Model Architecture

![](https://i.imgur.com/KNT7xDr.png)
Figure 2: Illustration of object detection.

![](https://i.imgur.com/hEwbmFm.png)
Figure 4: The Bottom-Up and Top-Down Attention model for VQA.

* First, Faster R-CNN with its RPN (the bottom-up attention model) runs object detection on the image and yields $k$ selected regions.
* Each of the $k$ selected regions is then mean-pooled to give the image features $\{ v_1, ..., v_k \}$. Each image feature is 2048-dimensional.
* The authors use a gated tanh in place of the conventional ReLU or tanh:
    * The gated tanh as a function:
    $$
    f_a: x \in \mathbb{R}^m \to y \in \mathbb{R}^n
    $$
    * How the gated tanh works:
    $$
    \begin{aligned}
    \tilde{y} &= \tanh(Wx + b) \\
    g &= \sigma(W'x + b') \\
    y &= \tilde{y} \circ g
    \end{aligned}
    $$
        * $W, W' \in \mathbb{R}^{n \times m}$: learned weights
        * $b, b' \in \mathbb{R}^{n}$: learned biases
        * $\circ$: Hadamard (element-wise) product
        * $g$: gate
* Text processing:
    * Each word in the question is mapped by a word embedding to a 300-dimensional vector.
    * Only the first 14 words are kept, giving a 14 × 300 matrix.
    * The matrix is fed into a GRU, which outputs a 512-dimensional vector $q$.
* Attention weights:
    * Each of the $k$ image features $v_i$ is concatenated with $q$, then passed through a gated tanh and a fully connected layer to give the unnormalized attention weight $a_i$ for $v_i$:
    $$
    a_i = w_{a}^{T} f_a ([v_i, q])
    $$
    * The $a_i$ are normalized with a softmax over all $k$ regions, and the normalized weights are used to compute the weighted sum of the $v_i$ over image locations, $\hat{v}$:
    $$
    \begin{aligned}
    \alpha &= \mathrm{softmax}(a) \\
    \hat{v} &= \sum_{i=1}^{k} \alpha_i v_i
    \end{aligned}
    $$
* Output distribution:
    * $q$ and $\hat{v}$ are each passed through their own gated tanh, giving two 512-dimensional vectors.
    * The two vectors are combined with an element-wise (Hadamard) product, not a dot product, to give the joint representation $h$.
    * Finally, $h$ goes through a gated tanh, a fully connected layer, and a sigmoid to produce the predicted score of each candidate answer.
    $$
    \begin{aligned}
    h &= f_q(q) \circ f_v(\hat{v}) \\
    p(y) &= \sigma(W_o f_o(h))
    \end{aligned}
    $$

### Training

* The Visual Genome dataset is used to pretrain the bottom-up attention model and as data augmentation for the VQA 2.0 dataset.
    * Visual Genome contains 108K images, each annotated with several objects, along with 1.7M visual question answers.
    * Bottom-up attention model pretraining: 5K images are held out for validation and 5K for testing; the remaining 98K are used for training.
        * Because Visual Genome partly overlaps with MS COCO 2014, the authors avoid pretraining on images from the MS COCO 2014 validation / testing sets. (MS COCO is the dataset used for the image captioning experiments; the VQA v2.0 images are also drawn from MS COCO.)
        * The 2000 object classes and 500 attribute classes in Visual Genome are pruned to 1600 and 400, because the removed classes hurt detection performance.
    * Data augmentation: only the Visual Genome question and answer pairs whose answer is among the VQA 2.0 candidate answers are used. This gives 485K pairs, about 30% of the available data.
* The VQA model is trained on the VQA v2.0 dataset (plus the data taken from Visual Genome).
    * Only answers that appear more than 8 times in the dataset are kept as the model's candidate answers.
    * The model is evaluated with the VQA metric.
* Model details:
    > In the VQA model, we use 300 dimension word embeddings, initialized with pretrained GloVe vectors [31], and we use hidden states of dimension 512. We train the VQA model using AdaDelta [50] and regularize with early stopping. The training of the model takes in the order of 12–18 hours on a single Nvidia K40 GPU. Refer to Teney et al. [38] for further details of the VQA model implementation.

### Results

* First place in the 2017 VQA Challenge.

### Reference

[31] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global Vectors for Word Representation. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2014.

[38] D. Teney, P. Anderson, X. He, and A. van den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In CVPR, 2018.
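
### Sketch: Attention and Answer Scoring

To make the equations in Model Architecture concrete, here is a minimal sketch of the gated tanh, the top-down attention over regions, and the final answer scoring. It is a toy illustration, not the authors' implementation: it uses plain Python lists instead of a tensor library, tiny dimensions instead of 2048 / 512, and randomly initialized weights. The helper names (`matvec`, `random_layer`, `attend`, `answer_scores`) are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_tanh(x, W, b, W_g, b_g):
    """y = tanh(Wx + b) * sigmoid(W'x + b'), multiplied element-wise."""
    y_tilde = [math.tanh(s + bi) for s, bi in zip(matvec(W, x), b)]
    gate = [sigmoid(s + bi) for s, bi in zip(matvec(W_g, x), b_g)]
    return [yt * g for yt, g in zip(y_tilde, gate)]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(v, q, f_a, w_a):
    """Return (alpha, v_hat): softmax over all k regions, then weighted sum."""
    # a_i = w_a^T f_a([v_i, q]); `v_i + q` is list concatenation.
    a = [sum(w * h for w, h in zip(w_a, gated_tanh(v_i + q, *f_a))) for v_i in v]
    alpha = softmax(a)
    v_hat = [sum(alpha[i] * v[i][d] for i in range(len(v)))
             for d in range(len(v[0]))]
    return alpha, v_hat

def answer_scores(q, v_hat, f_q, f_v, f_o, W_o):
    """h = f_q(q) * f_v(v_hat) element-wise; p(y) = sigmoid(W_o f_o(h))."""
    h = [a * b for a, b in zip(gated_tanh(q, *f_q), gated_tanh(v_hat, *f_v))]
    return [sigmoid(s) for s in matvec(W_o, gated_tanh(h, *f_o))]

def random_layer(rng, n_out, n_in):
    """Hypothetical initializer: random weights, zero biases (W, b, W', b')."""
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return mat(), [0.0] * n_out, mat(), [0.0] * n_out

rng = random.Random(0)
k, d_v, d_q, d_h, n_answers = 3, 6, 4, 5, 7  # toy sizes, not the paper's
v = [[rng.uniform(-1, 1) for _ in range(d_v)] for _ in range(k)]
q = [rng.uniform(-1, 1) for _ in range(d_q)]

f_a = random_layer(rng, d_h, d_v + d_q)
w_a = [rng.uniform(-0.5, 0.5) for _ in range(d_h)]
alpha, v_hat = attend(v, q, f_a, w_a)

f_q = random_layer(rng, d_h, d_q)
f_v = random_layer(rng, d_h, d_v)
f_o = random_layer(rng, d_h, d_h)
W_o = [[rng.uniform(-0.5, 0.5) for _ in range(d_h)] for _ in range(n_answers)]
p = answer_scores(q, v_hat, f_q, f_v, f_o, W_o)
print(len(alpha), len(v_hat), len(p))
```

Note that the softmax is taken once over the whole vector of $a_i$ values, so the $\alpha_i$ sum to 1, while the final sigmoid scores each candidate answer independently and the scores need not sum to 1.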
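
### Sketch: Candidate Answers and VQA Accuracy

The Training section keeps only answers that appear more than 8 times and evaluates with the VQA metric. The sketch below illustrates both steps on made-up data. The accuracy function uses the commonly cited simplified form of the metric, min(number of annotators who gave the answer / 3, 1); this formula is an assumption brought in for illustration and is not spelled out in these notes, and the official evaluation script additionally normalizes answers. The function names are hypothetical.

```python
from collections import Counter

def build_candidate_answers(train_answers, min_count=8):
    """Keep answers that appear strictly more than `min_count` times."""
    counts = Counter(train_answers)
    return sorted(ans for ans, c in counts.items() if c > min_count)

def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy (assumed form): min(#matching humans / 3, 1)."""
    matches = sum(1 for ans in human_answers if ans == predicted)
    return min(matches / 3.0, 1.0)

# Toy training answers: "2" appears exactly 8 times, so it is excluded.
train_answers = ["yes"] * 12 + ["no"] * 9 + ["2"] * 8 + ["red"] * 3
candidates = build_candidate_answers(train_answers)
print(candidates)

# Ten hypothetical human answers for a single question.
human = ["yes"] * 2 + ["no"] * 8
print(vqa_accuracy("yes", human), vqa_accuracy("no", human))
```

Under this simplified form, an answer given by at least three annotators counts as fully correct, and an answer given by fewer receives partial credit.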