# Baseline 2: Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding:
> Source: https://arxiv.org/abs/1606.01847
- Argument/hypothesis: the multimodal pooling methods used in previous works (concatenation, element-wise product, sum, etc.) are not as expressive as the outer product of the visual and textual vectors.
- Problem: the outer product is computationally infeasible due to its high dimensionality.
- This paper addresses the problem with Multimodal Compact Bilinear pooling (MCB).
- Inspiration:
- Bilinear pooling has been shown to be beneficial for fine-grained classification in vision-only tasks ([Lin et al.](https://arxiv.org/abs/1504.07889v4)).
- [Gao et al.](https://openaccess.thecvf.com/content_cvpr_2016/papers/Gao_Compact_Bilinear_Pooling_CVPR_2016_paper.pdf) show how to efficiently compress bilinear pooling for a single modality by viewing the bilinear transformation as a polynomial kernel. [Pham and Pagh](https://www.itu.dk/people/pagh/papers/tensorsketch.pdf) describe a method to approximate the polynomial kernel using Count Sketches and convolutions.
### Multimodal Compact Bilinear Pooling:

Bilinear model:
$z = W[x \otimes q]$
where,
$[\cdot]$ denotes flattening the matrix into a vector and
$\otimes$ denotes the outer product.
- However, the high dimensionality makes this computationally infeasible: with $x, q \in \mathbb{R}^{2048}$, the outer product already has $2048^2 \approx 4.2$M entries, and a $W$ mapping it to a 3,000-way answer space would need roughly 12.6 billion parameters.
- Hence, to reduce the dimensionality they use a count sketch projection function $\psi$ which projects $v \in \mathbb{R}^n$ to $y \in \mathbb{R}^d$.
- They initialize two vectors: $s \in \{-1,1\}^n$ and $h \in \{1, \dots, d\}^n$.
- $s$ assigns either 1 or -1 to each input index.
- $h$ maps each index $i$ of the input $v$ to an index $j$ of the output $y$.
- Both $s$ and $h$ are sampled uniformly at random once and remain fixed thereafter.
```python
import numpy as np

def count_sketch(v, h, s, d):
    # project v (length n) onto a d-dimensional sketch y
    y = np.zeros(d)
    for i in range(len(v)):
        j = h[i]              # h is 0-indexed here
        y[j] += s[i] * v[i]
    return y
```
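A usage sketch showing the one-time random initialization of $s$ and $h$ (the dimensions are just illustrative):

```python
import numpy as np

n, d = 2048, 16000                 # illustrative input / sketch dimensions
rng = np.random.default_rng(0)
s = rng.choice([-1, 1], size=n)    # random signs, kept fixed afterwards
h = rng.integers(0, d, size=n)     # random output buckets, kept fixed afterwards

v = rng.standard_normal(n)
y = count_sketch(v, h, s, d)       # y lives in R^d
```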
- To avoid explicitly computing the outer product, they use the following property of count sketches, shown by [Pham and Pagh](https://www.itu.dk/people/pagh/papers/tensorsketch.pdf):
$\psi(x \otimes q, h, s) = \psi (x, h, s) \star \psi (q, h, s)$
here,
$\star$ is the convolution operator.
- Convolution in the time domain is equivalent to element-wise product in the frequency domain, so with $x' = \psi(x, h, s)$ and $q' = \psi(q, h, s)$:
$x' \star q' = \text{FFT}^{-1}(\text{FFT}(x') \odot \text{FFT}(q'))$
where,
$\odot$ is the element-wise product.
- This extends efficiently to more than two modalities, since each additional modality only contributes another element-wise product in the frequency domain.
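Putting the pieces together, here is a minimal NumPy sketch of MCB for two vectors, reusing the `count_sketch` helper above. Each modality gets its own fixed $(h, s)$ pair here (the formula above writes a single shared pair), and the dimensions are only illustrative:

```python
import numpy as np

def mcb(x, q, hx, sx, hq, sq, d):
    # fuse x and q: count-sketch each, then circularly convolve them via FFTs
    x_sk = count_sketch(x, hx, sx, d)
    q_sk = count_sketch(q, hq, sq, d)
    # convolution in the time domain == element-wise product in the frequency domain
    return np.real(np.fft.ifft(np.fft.fft(x_sk) * np.fft.fft(q_sk)))

# example: fuse a 2048-D visual vector with a 2048-D question vector into a 16,000-D feature
rng = np.random.default_rng(1)
n, d = 2048, 16000
hx, sx = rng.integers(0, d, size=n), rng.choice([-1, 1], size=n)
hq, sq = rng.integers(0, d, size=n), rng.choice([-1, 1], size=n)
z = mcb(rng.standard_normal(n), rng.standard_normal(n), hx, sx, hq, sq, d)
```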
### Architecture for VQA:

1. Image Embeddings:
- Images are resized to 448 x 448.
- Features are extracted using a 152-layer ResNet pretrained on ImageNet.
- The output of the "pool5" layer is taken and the resulting 2048-D vector is $L_2$-normalized.
2. Question Embeddings:
- Words are mapped to learned word embeddings.
- A tanh nonlinearity is applied to the embeddings.
- These are passed through a 2-layer LSTM with 1024 units per layer; the outputs of the two layers are concatenated to form a 2048-D vector.
3. Pooling:
- The two 2048-D vectors are fused with MCB (see the code sketch after this list).
- The result is passed through an element-wise signed square root and $L_2$ normalization.
- An FC layer maps the 16,000-D MCB output to the 3,000-way answer space.
4. Attention:
- To incorporate spatial information, they use soft attention.
- After fusing the last conv-layer features of ResNet (or VGG) with the language representation via MCB, two conv layers predict an attention weight for each grid location.
- A softmax is then applied to produce a normalized soft attention map.
- A weighted sum of the spatial feature vectors, using the attention map, yields the attended visual representation.
- They also tried multiple attention maps (reminiscent of multi-head attention, though this was 2016), which are concatenated before being merged with the language representation.
5. Answer Encoding:
- For multiple-choice questions, they encode the candidate answers and use yet another MCB with attention.
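A minimal PyTorch sketch of steps 3-4 (signed square root + $L_2$ normalization, then soft attention over grid locations). The class and argument names, the hidden width, and the ReLU between the two conv layers are my own choices; `mcb` stands for a torch port of the count-sketch/FFT routine sketched earlier, applied at every grid location:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def signed_sqrt_l2(z, dim=1):
    # element-wise signed square root followed by L2 normalization (step 3)
    z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-12)
    return F.normalize(z, p=2, dim=dim)

class MCBSoftAttention(nn.Module):
    # Soft attention over spatial grid locations (step 4).
    # `mcb` is assumed to fuse (batch, 2048, H, W) visual features with the
    # tiled question features into a (batch, mcb_dim, H, W) tensor.
    def __init__(self, mcb, mcb_dim=16000, hidden=512, num_glimpses=1):
        super().__init__()
        self.mcb = mcb
        self.conv1 = nn.Conv2d(mcb_dim, hidden, kernel_size=1)
        self.conv2 = nn.Conv2d(hidden, num_glimpses, kernel_size=1)

    def forward(self, visual, question):
        # visual:   (batch, 2048, H, W) last-conv-layer features
        # question: (batch, 2048) LSTM representation, tiled over the grid
        b, c, hgt, wdt = visual.shape
        q_tiled = question[:, :, None, None].expand(b, question.shape[1], hgt, wdt)
        fused = signed_sqrt_l2(self.mcb(visual, q_tiled), dim=1)
        logits = self.conv2(F.relu(self.conv1(fused)))              # (b, glimpses, H, W)
        attn = F.softmax(logits.flatten(2), dim=-1).reshape(b, -1, hgt, wdt)
        # weighted sum of spatial vectors per attention map, then concatenate
        attended = torch.einsum('bghw,bchw->bgc', attn, visual).flatten(1)
        return attended                                             # (b, glimpses * 2048)
```

The attended visual representation is then merged with the language representation as in step 3.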
### Architecture for Visual Grounding:

- They base their architecture on the fully supervised version of GroundeR ([Rohrbach et al.](https://arxiv.org/abs/1511.03745)).
- They replace the concatenation of the visual and textual representations in GroundeR with MCB.
- In addition, they include a linear embedding of the visual representations and $L_2$ normalization of both input modalities instead of batch norm.
### Evaluation of VQA:
1. Datasets:
- Three datasets are used: VQA and Visual7W for evaluation, plus Visual Genome for additional training data.
1. [VQA](https://arxiv.org/abs/1505.00468) dataset.
2. [Visual Genome dataset](https://arxiv.org/pdf/1602.07332v1.pdf): the average question and answer lengths are larger than in the VQA dataset.
- They removed filler words such as "a", "the", and "it is " from the answers to shorten them, and extracted QA pairs whose answers are a single word.
- These were then filtered against the answer vocabulary built from the VQA dataset, yielding roughly 1M extra image-QA triplets.
3. Visual7W:
- A subset of Visual Genome.
- Adds a 7th "W" question category.
- However, they only evaluate on the Telling task, which involves the 6 "W" questions.
- 4-choice multiple-choice questions.
- They use the answer encodings described in the Answer Encoding section above.
- Ablation studies are reported on the test-dev split (of the VQA dataset).
2. Experimental Setup:
| Setting | Value |
| -------- | -------- |
| Optimizer | Adam, $\epsilon$ = 0.0007, $\beta_1$ = 0.9, $\beta_2$ = 0.999 |
| Regularization | Dropout (on LSTM and FC layers) and $L_2$ normalization |
| Other | *Early stopping* if the validation score does not improve for 50,000 iterations; the best iteration is reported on test-dev |
**NOTE**: They used the same hyperparameters and training settings for Visual7W as for VQA.
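A minimal sketch of this optimizer setup in PyTorch, assuming the reported $\epsilon$ = 0.0007 is the base learning rate (a guess on my part; it could instead be Adam's numerical-stability constant) and using a placeholder module:

```python
import torch

model = torch.nn.Linear(16000, 3000)   # placeholder standing in for the full VQA network
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.0007,                          # assumes epsilon = 0.0007 denotes the base learning rate
    betas=(0.9, 0.999),                 # beta_1 and beta_2 as reported
)
```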
### Ablation Results:

- They also rule out the possibility that the gains merely come from the increased number of parameters.
- They match the parameter budget of the non-bilinear baselines by stacking FC layers with 4096 units, ReLU, and Dropout (shown in Section 2 of Table 1).
- However, even with a similar parameter budget, the non-bilinear methods could not reach the accuracy of the MCB method.
- Section 2 of the table also shows that compact bilinear pooling loses no accuracy compared to full bilinear pooling.
- They primarily use ResNet-152, but Section 3 shows MCB also improves performance when VGG-19 is used, i.e., the gain holds irrespective of the image CNN.
- Section 4 of the table shows that their soft attention model works best with MCB.