# Baseline 2: Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding:

> Source: https://arxiv.org/abs/1606.01847

- Argument/hypothesis: the multimodal pooling methods used in previous works (concatenation, element-wise product, sum, etc.) are not as expressive as the outer product of the visual and textual vectors.
- Problem: the outer product is computationally infeasible due to its high dimensionality.
- This paper tackles that problem with Multimodal Compact Bilinear pooling (MCB).
- Inspiration:
    - Bilinear pooling has been shown to be beneficial for fine-grained classification in vision-only tasks ([Lin et al.](https://arxiv.org/abs/1504.07889v4)).
    - [This paper](https://openaccess.thecvf.com/content_cvpr_2016/papers/Gao_Compact_Bilinear_Pooling_CVPR_2016_paper.pdf) shows how to efficiently compress bilinear pooling for a single modality by viewing the bilinear transformation as a polynomial kernel. [Pham and Pagh](https://www.itu.dk/people/pagh/papers/tensorsketch.pdf) describe a method to approximate the polynomial kernel using Count Sketches and convolutions.

### Multimodal Compact Bilinear Pooling:

![](https://i.imgur.com/OHiBARW.png)

Bilinear model: $z = W[x \otimes q]$

here, $[\cdot]$ denotes flattening the matrix into a vector and $\otimes$ denotes the outer product.

- However, the high dimensionality makes this computationally infeasible.
- Hence, to reduce dimensionality they use the Count Sketch projection function $\psi$, which projects $v \in \mathbb{R}^n$ to $y \in \mathbb{R}^d$.
    - They initialize two vectors: $s \in \{-1,1\}^n$ and $h \in \{1, \dots, d\}^n$.
    - $s$ is either 1 or -1 for each index.
    - $h$ maps each index $i$ of the input $v$ to an index $j$ of the output $y$.
    - Both $s$ and $h$ are sampled uniformly at random and remain fixed afterwards.

```python
import numpy as np

def count_sketch(v, h, s, d):
    y = np.zeros(d)              # y = zero vector of d dimensions
    for i in range(len(v)):      # for every element in v
        j = h[i]
        y[j] += s[i] * v[i]
    return y
```

- To avoid explicitly computing the outer product, they use a property of Count Sketches shown by [Pham and Pagh](https://www.itu.dk/people/pagh/papers/tensorsketch.pdf): $\psi(x \otimes q, h, s) = \psi(x, h, s) \star \psi(q, h, s)$, where $\star$ is the convolution operator.
- Convolution in the time domain is equivalent to an element-wise product in the frequency domain: $x' \star q' = \text{FFT}^{-1}(\text{FFT}(x') \odot \text{FFT}(q'))$, where $x' = \psi(x, h, s)$, $q' = \psi(q, h, s)$, and $\odot$ is the element-wise product.
- This even extends to (and remains efficient for) more than two modal inputs, since the combination happens as an element-wise product.
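Putting the pieces together, here is a minimal NumPy sketch of MCB for two vectors. In practice each modality gets its own randomly sampled $(h, s)$ pair, and the count sketch below is a vectorized form of the projection loop above; the function names and the example shapes (2048-D inputs, $d = 16000$) are illustrative, not the authors' implementation.

```python
import numpy as np

def count_sketch(v, h, s, d):
    """Vectorized form of the projection loop above: y[h[i]] += s[i] * v[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * v)
    return y

def mcb(x, q, h_x, s_x, h_q, s_q, d=16000):
    """psi(x ⊗ q) ≈ psi(x, h_x, s_x) ⋆ psi(q, h_q, s_q);
    the circular convolution is computed as an element-wise product in FFT space."""
    px = count_sketch(x, h_x, s_x, d)
    pq = count_sketch(q, h_q, s_q, d)
    return np.real(np.fft.ifft(np.fft.fft(px) * np.fft.fft(pq)))

# Example: 2048-D image and question vectors, d = 16000 as in the VQA setup below.
rng = np.random.default_rng(0)
d, n = 16000, 2048
h_x, s_x = rng.integers(0, d, n), rng.choice([-1.0, 1.0], n)  # sampled once, then fixed (0-indexed here)
h_q, s_q = rng.integers(0, d, n), rng.choice([-1.0, 1.0], n)
x, q = np.random.rand(n), np.random.rand(n)
z = mcb(x, q, h_x, s_x, h_q, s_q, d)                          # 16000-D pooled representation
```

Note that the 2048 x 2048 outer product is never materialized: only the two d-dimensional sketches and their FFTs are computed.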
### Architecture for VQA:

![](https://i.imgur.com/vwKez39.png)

1. Image Embeddings:
    - Images are resized to 448 x 448.
    - Features are extracted using a 152-layer ResNet pretrained on ImageNet.
    - The output of the "pool5" layer is taken, then $L_2$ normalization is applied to the 2048-D vector.
2. Question Embeddings:
    - word embeddings
    - tanh on the embeddings
    - these are then passed through a 2-layer LSTM with 1024 units, and the outputs are concatenated to form a 2048-D vector.
3. Pooling:
    ![](https://i.imgur.com/3mFrYND.png)
    - the two vectors are passed through MCB
    - element-wise signed square root and $L_2$ normalization
    - FC layer with input 16,000 and output 3,000 (the answer space)
4. Attention:
    ![](https://i.imgur.com/bHeyJtH.png)
    - To incorporate spatial information, they use soft attention (a minimal sketch of this step is included at the end of these notes).
    - After pooling the last conv layer of ResNet (or VGG) with the language representation, they use two conv layers to predict an attention weight for each grid location.
    - They then apply a softmax to produce a normalized soft attention map.
    - They then take a weighted sum of the spatial vectors using the attention map to create the attended visual representation.
    - They tried multiple attention maps (similar in spirit to multi-head attention; this was in 2016), which were concatenated before being merged with the language representation.
5. Answer Encoding:
    - For multiple-choice questions, they encode the candidate answers and use yet another MCB with attention.
    - ![](https://i.imgur.com/UEtQsAp.png)

### Architecture for Visual Grounding:

![](https://i.imgur.com/0hxpVSl.png)

- They base their architecture on the fully-supervised version of GroundeR ([Rohrbach et al.](https://arxiv.org/abs/1511.03745)).
- They replace the concatenation of the visual and textual representations in GroundeR with MCB.
- Apart from this, they also include a linear embedding of the visual representations and $L_2$ normalization of both input modalities, instead of batch norm.

### Evaluation of VQA:

1. Datasets:
    1. The [VQA](https://arxiv.org/abs/1505.00468) dataset.
    2. The [Visual Genome dataset](https://arxiv.org/pdf/1602.07332v1.pdf): the average question and answer lengths in Visual Genome are larger than in the VQA dataset.
        - They removed unnecessary words such as "a", "the", and "it is" from answers to shorten them, and extracted QA pairs whose answers are single words.
        - These were further filtered against the answer vocabulary built from the VQA dataset, giving them 1M extra image-QA triplets.
    3. Visual7W:
        - A subset of Visual Genome.
        - Adds a 7th "W" question category.
        - However, they only evaluate on the Telling task, which involves the 6W questions.
        - 4-choice MCQs.
        - They used answer encodings as described in the answer-encoding section above.
    - Ablation studies are reported on the test-dev split.
2. Experimental Setup:

| Setting | Value |
| -------- | -------- |
| Optimizer | Adam solver, $\epsilon = 0.0007$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ |
| Regularization | Dropout (on LSTM and FC layers) and $L_2$ normalization |
| Other | *Early stopping* if the validation score doesn't improve within 50,000 iterations; the best iteration is reported on test-dev |

**NOTE**: They used the same hyperparameters and training settings for the Visual7W tasks as for VQA.

### Ablation Results:

![](https://i.imgur.com/ZXNvC09.png)

- They also rule out the possibility that the gains are simply due to an increased parameter count.
- They compensate for the parameters by stacking FC layers with 4096 units, ReLU, and Dropout (shown in section 2 of Table 1).
- However, even with a similar parameter budget, non-bilinear methods could not reach the same accuracy as the MCB method.
- Section 2 of the table also shows that compact bilinear pooling causes no drop in accuracy compared to full bilinear pooling.
- They primarily use ResNet-152 but show that MCB also improves performance when VGG-19 is used (section 3: MCB improves performance irrespective of the image CNN used).
- Section 4 of the table shows that their soft attention model works best with MCB.
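### Sketch: Soft Attention Step

As a companion to the attention step in the VQA architecture above, here is a minimal single-glimpse sketch of the described computation: two 1x1 conv layers (written here as per-location linear maps) score each grid cell of the fused features, a softmax normalizes the scores, and the attended representation is the weighted sum of the spatial visual vectors. The function and parameter names, the ReLU between the two layers, and the shapes are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def soft_attention(visual_grid, fused_grid, W1, W2):
    """Single-glimpse soft attention over the conv-feature grid.

    visual_grid: (L, C) spatial visual vectors, one per grid location
    fused_grid:  (L, D) per-location MCB of visual and tiled question features
    W1: (D, K), W2: (K, 1) -- the two "conv" layers, acting per location (assumed shapes)
    """
    hidden = np.maximum(fused_grid @ W1, 0.0)   # first conv (+ assumed ReLU)
    logits = (hidden @ W2)[:, 0]                # second conv: one score per grid cell
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                        # softmax -> normalized soft attention map
    return alpha @ visual_grid                  # weighted sum of spatial vectors

# Example: a 14x14 grid of 2048-D visual features fused into 16000-D MCB vectors.
L, C, D, K = 14 * 14, 2048, 16000, 512
rng = np.random.default_rng(0)
att = soft_attention(rng.random((L, C)), rng.random((L, D)),
                     rng.random((D, K)) * 0.01, rng.random((K, 1)) * 0.01)
print(att.shape)   # (2048,)
```

The multi-map variant mentioned in the attention section would concatenate several such attended vectors before merging them with the language representation.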