# BASELINE 3: Stacked Attention Networks for Image Question Answering
> Source : https://arxiv.org/abs/1511.02274
>**The novelty of this paper is applying attention multiple times, each pass refining the query vector so that it attends more precisely to the relevant image features.**
### Abstract :
1. SANs use the semantic representation of a question as a query to search for the regions in an image that are related to the answer.
2. They argue that visual question answering (QA) often requires multiple steps of reasoning; thus, they develop a multiple-layer SAN that queries an image **multiple times** to infer the answer progressively.
### Introduction :
**Generic method till this paper :**
Extract a global image feature vector with a convolutional neural network (CNN), encode the corresponding question as a feature vector with a long short-term memory network (LSTM), and then combine the two to infer the answer (a sketch follows below).
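Below is a minimal PyTorch sketch of this generic pipeline. The class name, the dimensions (a 4096-d global image feature, 500-d hidden size, 1000 answer classes) and the element-wise fusion are illustrative assumptions, not the exact configuration of any prior model:

```python
import torch
import torch.nn as nn

class GlobalBaseline(nn.Module):
    """Sketch of the pre-SAN pipeline: one global image vector + one question vector."""
    def __init__(self, vocab_size, emb_dim=500, img_dim=4096, hidden=500, n_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)  # question encoder
        self.img_proj = nn.Linear(img_dim, hidden)               # project the global CNN feature
        self.classifier = nn.Linear(hidden, n_answers)

    def forward(self, img_feat, q_tokens):
        # img_feat: (batch, img_dim) global CNN feature; q_tokens: (batch, T) word indices
        _, (h_n, _) = self.lstm(self.embed(q_tokens))
        q_vec = h_n[-1]                                # (batch, hidden) question vector
        v = torch.tanh(self.img_proj(img_feat))        # (batch, hidden) image vector
        return self.classifier(q_vec * v)              # element-wise fusion -> answer logits
```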
**Problem with this :**
These models often fail to give precise answers when such answers are related to a set of fine-grained regions in an image.
**Major Components of SAN (Stacked Attention Network) :**
1. The image model, which uses a CNN to extract high-level image representations, e.g. one vector for each region of the image.
2. The question model, which uses a CNN or an LSTM to extract a semantic vector of the question.
3. The stacked attention model, which locates, via multi-step reasoning, the image regions that are relevant to the question for answer prediction.
> *The SAN first uses the **question vector** to **query** the **image vectors** in the first visual attention layer, then **combines the question vector with the retrieved image vectors** to form a refined query vector that queries the image vectors again in the second attention layer. Finally, the image features from the highest attention layer are combined with the last query vector to predict the answer.*
### Related works :
* [1](https://arxiv.org/abs/1410.0210) (2014) used **semantic parsers** and **image segmentation** methods to predict answers based on images and questions.
* [2](https://arxiv.org/abs/1505.01121) and [3](https://arxiv.org/abs/1505.05612) used an **encoder-decoder** framework to generate answers given images and questions: they first used an **LSTM to encode the images and questions** and then another **LSTM to decode the answers**. Both fed the image feature to every LSTM cell.
* [4](https://arxiv.org/abs/1505.02074) proposed several neural-network-based models, including encoder-decoder models that use single-direction and bi-directional LSTMs, respectively. However, the authors found that the **concatenation of image features and bag-of-words features worked best.**
* [5](https://arxiv.org/abs/1505.00468) first encoded questions with LSTMs and then combined question vectors with image vectors by **element-wise multiplication.**
* [6](https://arxiv.org/abs/1506.00333) used a **CNN for question modeling** and **used convolution operations to combine question vectors and image feature vectors.**
### Stacked Attention Networks (SANs) :
**The overall model :**
*(Figure in the paper: the overall SAN architecture, wiring together the image model, the question model, and the stacked attention model described below.)*
**Image model :**

1. The image model uses a CNN to get the representation of images; specifically, the VGGNet is used to extract the image feature map $f_I$ from a raw image $I$:
$$f_I = \text{CNN}_{\text{vgg}}(I)$$
2. We choose the features $f_I$ from the last pooling layer, which retains spatial information of the original images.
3. The images are first rescaled to 448 × 448 pixels, and the features taken from the last pooling layer therefore have a dimension of 512 × 14 × 14.
4. **14 × 14 is the number of regions in the image and 512 is the dimension of the feature vector for each region.**
5. Then, for modeling convenience, a single-layer perceptron is used to transform each feature vector into a new vector that has the same dimension as the question vector:
$$v_I = \tanh(W_I f_I + b_I)$$
where v<sub>I</sub> is a matrix and its i-th column v<sub>i</sub> is the visual feature vector for the region indexed by i.
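A minimal PyTorch sketch of this image branch; the class name and the target dimension `q_dim` are assumptions for illustration (500 would match the LSTM question vector used later, 640 would match the CNN question vector):

```python
import torch
import torch.nn as nn

class ImageModel(nn.Module):
    """VGG last-pooling-layer features -> one d-dimensional vector per 14x14 region."""
    def __init__(self, feat_dim=512, q_dim=500):
        super().__init__()
        # single-layer perceptron: v_I = tanh(W_I f_I + b_I)
        self.proj = nn.Linear(feat_dim, q_dim)

    def forward(self, f_I):
        # f_I: features from VGGNet's last pooling layer, shape (batch, 512, 14, 14)
        b, c, h, w = f_I.shape
        f_I = f_I.view(b, c, h * w).transpose(1, 2)  # (batch, 196, 512): one vector per region
        return torch.tanh(self.proj(f_I))            # (batch, 196, q_dim) = v_I
```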
**Question Model :**
The authors use two types of question model:
* LSTM based question model (self-explanatory)
* CNN based question model
**CNN based question model :**

1. Similar to the LSTM-based question model, we first embed the words to vectors x<sub>t</sub> = W<sub>e</sub>q<sub>t</sub> and obtain the question representation by concatenating the word vectors:
$$x_{1:T} = [x_1, x_2, \ldots, x_T]$$
2. Then we apply a convolution operation to the word embedding vectors.
3. We use three convolution filters, which have window sizes of one (unigram), two (bigram) and three (trigram), respectively.
4. The t-th convolution output using window size c is given by:
$$h_{c,t} = \tanh(W_c x_{t:t+c-1} + b_c)$$
5. The filter is applied only to the window x<sub>t:t+c−1</sub> of size c (1, 2 or 3 for unigram, bigram or trigram). W<sub>c</sub> is the convolution weight and b<sub>c</sub> is the bias. The feature map of the filter with convolution size c is given by:
$$h_c = [h_{c,1}, h_{c,2}, \ldots, h_{c,T-c+1}]$$
6. Then we apply max-pooling over the feature map of convolution size c and denote the result as:
$$\tilde{h}_c = \max_t \, [h_{c,1}, h_{c,2}, \ldots, h_{c,T-c+1}]$$
Here, max pooling selects, for each filter, the strongest response across all positions, so we obtain one pooled feature vector from the unigram filters, one from the bigram filters and one from the trigram filters.
7. For the convolution feature maps of different sizes c = 1, 2, 3, we concatenate them to form the feature representation vector of the whole question sentence:
$$h = [\tilde{h}_1, \tilde{h}_2, \tilde{h}_3]$$
Hence, v<sub>Q</sub> = h is the CNN based question vector.
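A sketch of this question CNN in PyTorch, using the filter counts reported later in the experiments (128/256/256, so v<sub>Q</sub> is 640-dimensional); the class and argument names are assumptions:

```python
import torch
import torch.nn as nn

class CNNQuestionModel(nn.Module):
    """Unigram/bigram/trigram convolutions over word embeddings, max-pooled and concatenated."""
    def __init__(self, vocab_size, emb_dim=500, n_filters=(128, 256, 256)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # x_t = W_e q_t
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_f, kernel_size=c)     # window sizes c = 1, 2, 3
             for c, n_f in zip((1, 2, 3), n_filters)]
        )

    def forward(self, q_tokens):
        # q_tokens: (batch, T) word indices, with T >= 3
        x = self.embed(q_tokens).transpose(1, 2)        # (batch, emb_dim, T)
        pooled = []
        for conv in self.convs:
            h_c = torch.tanh(conv(x))                   # (batch, n_f, T - c + 1)
            pooled.append(h_c.max(dim=2).values)        # max-pool over positions -> (batch, n_f)
        return torch.cat(pooled, dim=1)                 # v_Q = h, (batch, 128 + 256 + 256 = 640)
```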
**Stacked Attention Networks :**
1. Given the image feature matrix v<sub>I</sub> and the question feature vector v<sub>Q</sub>, SAN predicts the answer via multi-step reasoning.
2. In many cases, an answer is only related to a small region of an image. Therefore, using one global image feature vector to predict the answer can lead to suboptimal results, due to the noise introduced by regions that are irrelevant to the potential answer.
3. Instead, by reasoning progressively via multiple attention layers, the SAN is able to gradually filter out noise and pinpoint the regions that are highly relevant to the answer.
4. Given the image feature matrix v<sub>I</sub> and the question vector v<sub>Q</sub>, we first feed them through a single-layer neural network and then a softmax function to generate the attention distribution over the regions of the image:
$$h_A = \tanh\big(W_{I,A} v_I \oplus (W_{Q,A} v_Q + b_A)\big)$$
$$p_I = \text{softmax}(W_P h_A + b_P)$$
where:
* v<sub>I</sub> ∈ R<sup>d×m</sup>, with d the image representation dimension and m the number of image regions,
* v<sub>Q</sub> ∈ R<sup>d</sup> is the question vector,
* W<sub>I,A</sub>, W<sub>Q,A</sub> ∈ R<sup>k×d</sup> and W<sub>P</sub> ∈ R<sup>1×k</sup> are the attention parameters,
* p<sub>I</sub> ∈ R<sup>m</sup> is the attention probability of each image region given v<sub>Q</sub>.
> **Note :**
> **They denote by ⊕ the addition of a matrix and a vector. Since W<sub>I,A</sub>v<sub>I</sub> ∈ R<sup>k×m</sup> is a matrix and both W<sub>Q,A</sub>v<sub>Q</sub> and b<sub>A</sub> ∈ R<sup>k</sup> are vectors, the addition is performed by adding the vector to each column of the matrix.**
5. Based on the attention distribution, we calculate the weighted sum of the image vectors, one per region, to obtain ṽ<sub>I</sub>, and then combine ṽ<sub>I</sub> with the question vector v<sub>Q</sub> to form a refined query vector u:
$$\tilde{v}_I = \sum_i p_i v_i$$
$$u = \tilde{v}_I + v_Q$$
6. `u` is now a better query vector, since it also accounts for the visual features. However, a complicated question may still not be answerable after a single attention pass, because one layer of attention captures only one step of reasoning.
7. To overcome this, `u` is refined iteratively through multiple attention layers.
8. Formally, the SANs use the following formulation: for the k-th attention layer, we compute:
$$h_A^k = \tanh\big(W_{I,A}^k v_I \oplus (W_{Q,A}^k u^{k-1} + b_A^k)\big)$$
$$p_I^k = \text{softmax}(W_P^k h_A^k + b_P^k)$$
where u<sup>0</sup> is initialized to be v<sub>Q</sub>.
9. Then the aggregated image feature vector is added to the previous query vector to form a new query vector:
$$\tilde{v}_I^k = \sum_i p_i^k v_i$$
$$u^k = \tilde{v}_I^k + u^{k-1}$$
10. We repeat this K times and then use the final u<sup>K</sup> to infer the answer:
$$p_{\text{ans}} = \text{softmax}(W_u u^K + b_u)$$
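Putting these equations into code: a sketch of one attention layer and the K-layer stack in PyTorch (the attention hidden size `k`, the class names and the answer vocabulary size are assumptions):

```python
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    """One SAN attention step: attend over regions given the current query, then refine the query."""
    def __init__(self, d, k):
        super().__init__()
        self.W_ia = nn.Linear(d, k, bias=False)  # W_{I,A}
        self.W_qa = nn.Linear(d, k)              # W_{Q,A}, b_A
        self.W_p = nn.Linear(k, 1)               # W_P, b_P

    def forward(self, v_I, u):
        # v_I: (batch, m, d) region vectors; u: (batch, d) current query (u^0 = v_Q)
        h_A = torch.tanh(self.W_ia(v_I) + self.W_qa(u).unsqueeze(1))  # broadcast = the "⊕" addition
        p_I = torch.softmax(self.W_p(h_A).squeeze(-1), dim=1)         # (batch, m) attention weights
        v_tilde = (p_I.unsqueeze(-1) * v_I).sum(dim=1)                # weighted sum of region vectors
        return v_tilde + u                                            # refined query u^k

class SAN(nn.Module):
    """K stacked attention layers followed by answer prediction."""
    def __init__(self, d, k, n_answers, K=2):
        super().__init__()
        self.layers = nn.ModuleList([AttentionLayer(d, k) for _ in range(K)])
        self.classifier = nn.Linear(d, n_answers)  # W_u, b_u

    def forward(self, v_I, v_Q):
        u = v_Q
        for layer in self.layers:
            u = layer(v_I, u)                              # u^1, ..., u^K
        return torch.softmax(self.classifier(u), dim=1)    # p_ans
```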
### Experiments :
**Model configuration and training :**
* For the image model, they use the VGGNet to extract features; when training the SAN, the parameters of the VGGNet are kept fixed.
* They take the output of the last pooling layer as the image feature, which has a dimension of 512 × 14 × 14.
* For DAQUAR and COCO-QA, the word embedding dimension and the LSTM dimension in the question model are set to 500.
* For the CNN based question model, they use 128, 256 and 256 unigram, bigram and trigram convolution filters, respectively (concatenating these makes the question vector 640-dimensional).
* For the VQA dataset, since it is larger than the other data sets, they double the model size of the LSTM and the CNN to accommodate the larger data set and the larger number of answer classes.
* **They find that using three or more attention layers does not further improve the performance.**
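A hedged end-to-end sketch wiring the modules above together with roughly the paper's hyper-parameters (640-d CNN question vector, 14 × 14 = 196 regions, two attention layers); the attention hidden size, vocabulary size and answer count are assumptions:

```python
# Uses the ImageModel, CNNQuestionModel and SAN sketches defined earlier.
import torch

d = 640                                                            # 128 + 256 + 256 from the question CNN
image_model = ImageModel(feat_dim=512, q_dim=d)
question_model = CNNQuestionModel(vocab_size=10000, emb_dim=500)   # vocab size is an assumption
san = SAN(d=d, k=512, n_answers=1000, K=2)                         # k and n_answers are assumptions

img = torch.randn(4, 512, 14, 14)        # VGG last-pooling-layer features, batch of 4
q = torch.randint(0, 10000, (4, 12))     # a batch of 12-token questions
v_I = image_model(img)                   # (4, 196, 640)
v_Q = question_model(q)                  # (4, 640)
p_ans = san(v_I, v_Q)                    # (4, 1000) answer distribution
```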
### Results :
It is observed that the SAN with two attention layers and a CNN question encoder works better than the other SAN variants in almost all cases.
### Error analysis :
The errors are grouped into four categories:
* The SANs focus the attention on the wrong regions (22%).
* The SANs focus on the right region but predict a wrong answer (42%).
* The answer is ambiguous: the SANs give answers that differ from the labels but might be acceptable (31%).
* The labels are clearly wrong (5%).
### Conclusion :
1. The authors devise a novel SAN for VQA that outperforms the existing VQA baselines.
2. Extensive experiments on four image QA data sets demonstrate the effectiveness of SANs.
3. Visualizing the attention layers shows how the model progressively focuses on the relevant regions, and helps to diagnose where and why it goes wrong.