# VQA : RAMEN
###### tags: `VQA`, `paper`
> [paper link](https://arxiv.org/abs/1903.00366)
According to the authors of this paper, VQA datasets contain two kinds of scenes:
- Natural scenes - real-world images with complex visual content, and
- Synthetic scenes - rendered images that require compositional reasoning.

Most methods perform well in only one of the two domains.
The authors argue that a good VQA algorithm must perform well in both domains.
## Contributions:
1. Comparison of 5 SOTA algorithms across 8 VQA datasets.
2. Standardization of model components such as visual features and answer vocabularies.
3. Finding that most VQA algorithms struggle to both answer questions about real-world images and perform compositional reasoning, and that they generalize poorly, which indicates they are still exploiting dataset biases. (Q: This may also indicate that the two kinds of questions come from different domains.)
4. A new VQA algorithm (RAMEN).
## Model (Recurrent Aggregation of Multimodal Embeddings Network - RAMEN):
There are 3 phases:
1. Early fusion of language and visual features:

- Early fusion has been shown to help with compositional reasoning.
- Hence, early fusion is performed here by concatenating the spatially localized visual features with the question features.
2. Learning bimodal embeddings:

- The concatenated visual+question features are passed through a shared network, producing spatially localized bimodal embeddings. This network learns the inter-relationships between the visual and textual features.
3. Recurrent aggregation of the learned bimodal embeddings:

- The bimodal embeddings are aggregated using a bidirectional GRU network.
- No external attention is used in the model.
- The authors do not use pre-defined modules or reasoning cells (however, they claim their experiments demonstrate that the network is capable of compositional reasoning).
### Model in detail:
- Input:
    - Question embedding ($q \in \mathbb{R}^d$)
    - A set of N region proposals $r_i \in \mathbb{R}^m$, where each $r_i$ contains both visual appearance features and spatial position. (Q: how do they include spatial information?)
- Concatenate each proposal with the question vector and batch-normalize:
$$c_i = \text{BatchNorm}(r_i \oplus q)$$
where $\oplus$ denotes concatenation.
- All N vectors are passed through a function $F$ to produce bimodal embeddings $b_i = F(c_i)$. $F$ is modeled as a multi-layer perceptron with residual connections.
- Perform late fusion by concatenating each bimodal embedding with the original question embedding, and aggregate the collection using a function $A$:
$$a = A(b_1 \oplus q,\ b_2 \oplus q,\ \ldots,\ b_N \oplus q)$$
- $A$ is modeled as a bi-directional GRU; its output is the concatenation of the final forward and backward GRU states.
- $a$ is referred to as the RAMEN embedding.
- It is fed to a classification layer that predicts the answer.
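A minimal PyTorch sketch of this pipeline (layer sizes, the SiLU/swish activation, and the exact residual wiring are assumptions for illustration, not the authors' released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RamenSketch(nn.Module):
    """Sketch: early fusion -> shared projector F -> late fusion -> bi-GRU aggregator A -> classifier."""
    def __init__(self, region_dim=2560, q_dim=1024, hidden=1024, num_answers=3000):
        super().__init__()
        self.bn = nn.BatchNorm1d(region_dim + q_dim)
        # 4-layer projector: the first layer changes dimensionality, the rest keep it (residuals there)
        self.proj = nn.ModuleList(
            [nn.Linear(region_dim + q_dim, hidden)] +
            [nn.Linear(hidden, hidden) for _ in range(3)])
        self.agg = nn.GRU(hidden + q_dim, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, 2 * hidden), nn.SiLU(),
            nn.Linear(2 * hidden, num_answers))

    def forward(self, regions, q):
        # regions: (B, N, region_dim) visual+spatial features, q: (B, q_dim) question embedding
        B, N, _ = regions.shape
        q_rep = q.unsqueeze(1).expand(-1, N, -1)
        c = torch.cat([regions, q_rep], dim=-1)              # early fusion: r_i ⊕ q
        c = self.bn(c.reshape(B * N, -1)).reshape(B, N, -1)  # c_i = BatchNorm(r_i ⊕ q)
        b = c
        for i, layer in enumerate(self.proj):                # b_i = F(c_i)
            out = F.silu(layer(b))
            b = out + b if i > 0 else out                    # residual connections in layers 2-4
        late = torch.cat([b, q_rep], dim=-1)                 # late fusion: b_i ⊕ q
        _, h = self.agg(late)                                # bi-GRU over the N bimodal embeddings
        a = torch.cat([h[0], h[1]], dim=-1)                  # RAMEN embedding (forward ⊕ backward state)
        return self.classifier(a)
```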
## Implementation, model, and training details:
| Model component | Description |
| -------- | -------- |
| question word embeddings | 300-dim pretrained GloVe vectors |
| hidden size of the question GRU | 1024 |
| visual features | 2048-dim CNN features produced by the bottom-up architecture based on Faster R-CNN |
| spatial info | each proposal is divided into a 16x16 grid of (x, y) coordinates, which is flattened into a 512-dim vector |
| $r_i$ | concatenation of the visual and spatial info, forming a 2560-dim vector |
| F (projector) | 4-layer MLP with 1024 units per layer, swish activations, and residual connections in layers 2, 3 and 4 |
| A (aggregator) | single-layer bi-GRU with a 1024-dim hidden state; the concatenated forward and backward GRU states form a 2048-dim vector, which is projected through a 2048-dim fully connected swish layer followed by an output classification layer with one unit per possible answer in the dataset |
| optimizer | Adamax |
| learning rate | gradual warm-up at $2.5 \cdot \text{epoch} \cdot 10^{-4}$ for the first 4 epochs, $5 \cdot 10^{-4}$ for epochs 5 to 10, then decay by a factor of 0.25 every 2 epochs, with early stopping |
| mini-batch size | 64 |
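The spatial-grid encoding from the table can be sketched as follows; normalizing the coordinates by the image size is an assumption, since the note only specifies a 16x16 grid of (x, y) coordinates flattened to 512 dimensions:

```python
import numpy as np

def spatial_feature(box, img_w, img_h, grid=16):
    """Encode a proposal's bounding box as a flattened grid of (x, y) coordinates.

    box: (x1, y1, x2, y2) in pixels; returns a (grid * grid * 2,) vector, i.e. 512-dim for grid=16.
    """
    x1, y1, x2, y2 = box
    xs = np.linspace(x1, x2, grid) / img_w           # x coordinates sampled inside the box
    ys = np.linspace(y1, y2, grid) / img_h           # y coordinates sampled inside the box
    gx, gy = np.meshgrid(xs, ys)                     # two (16, 16) grids
    return np.stack([gx, gy], axis=-1).reshape(-1)   # flatten to (512,)
```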
## Other VQA Models:
1. Bottom-Up and Top-Down Attention (UpDn):
- Combines bottom-up and top-down attention mechanisms.
- The bottom-up mechanism generates object proposals with Faster R-CNN.
- The top-down mechanism is task-driven and uses the question to predict attention weights over image regions.
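A minimal sketch of the top-down (question-guided) attention step, using a simple concatenate-and-score function rather than UpDn's exact gated layers:

```python
import torch
import torch.nn as nn

class TopDownAttention(nn.Module):
    """Score each region proposal against the question, softmax over regions, return the attended feature."""
    def __init__(self, v_dim=2048, q_dim=1024, hidden=512):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(v_dim + q_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, v, q):
        # v: (B, N, v_dim) bottom-up region features, q: (B, q_dim) question embedding
        q_rep = q.unsqueeze(1).expand(-1, v.size(1), -1)
        logits = self.score(torch.cat([v, q_rep], dim=-1)).squeeze(-1)  # (B, N) attention logits
        attn = torch.softmax(logits, dim=1)
        return (attn.unsqueeze(-1) * v).sum(dim=1)                      # (B, v_dim) attended visual feature
```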
2. Question-Conditioned Graph (QCG):
- Represents images as graphs.
- Object-level features obtained from bottom-up region proposals act as graph nodes.
- Edges encode interactions between regions, conditioned on the question.
- For each node, QCG chooses a neighborhood of nodes with the strongest edge connections, forming a question-specific graph structure.
- This structure is processed by a patch operator that performs spatial graph convolution.
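A rough sketch of the question-conditioned neighborhood selection (the bilinear edge-scoring function here is a placeholder assumption; the paper's exact parameterization differs):

```python
import torch

def question_conditioned_neighbors(v, q, W, k=8):
    """For each region, pick the k regions with the strongest question-conditioned edges.

    v: (N, d) region features, q: (dq,) question embedding,
    W: (d, dq, d) placeholder tensor defining a bilinear edge score (an assumption for illustration).
    Returns neighbor indices of shape (N, k).
    """
    Wq = torch.einsum('dqe,q->de', W, q)   # question-conditioned (d, d) scoring matrix
    edges = v @ Wq @ v.t()                 # (N, N) pairwise edge scores
    edges.fill_diagonal_(float('-inf'))    # exclude self-edges
    return edges.topk(k, dim=1).indices    # question-specific neighborhood per node
```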
3. Bilinear Attention Network (BAN):
- Fuses the visual and textual modalities by considering interactions between all question words and all region proposals.
- Handles interactions between all channels.
- It can be considered a generalization of low-rank bilinear pooling, jointly representing each channel pair.
- Supports multiple glimpses of attention via residual connections.
- Achieves 70.35% on the test-std split of VQAv2.
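A compact sketch of a single-glimpse low-rank bilinear attention map between question words and region proposals (shapes and the single-glimpse simplification are assumptions):

```python
import torch
import torch.nn as nn

class BilinearAttentionMap(nn.Module):
    """Low-rank bilinear scores for every (question word, region proposal) pair, one glimpse."""
    def __init__(self, q_dim=1024, v_dim=2048, rank=512):
        super().__init__()
        self.Uq = nn.Linear(q_dim, rank)   # low-rank projection of the question words
        self.Uv = nn.Linear(v_dim, rank)   # low-rank projection of the region proposals

    def forward(self, q_words, v_regions):
        # q_words: (B, T, q_dim), v_regions: (B, N, v_dim)
        Q = torch.relu(self.Uq(q_words))                      # (B, T, rank)
        V = torch.relu(self.Uv(v_regions))                    # (B, N, rank)
        logits = torch.einsum('btr,bnr->btn', Q, V)           # bilinear score per (word, region) pair
        return torch.softmax(logits.flatten(1), dim=1).view_as(logits)  # attention over all pairs
```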
4. Relation Network (RN):
- Takes every pair of region proposals, embeds each pair, and sums all $N^2$ pair embeddings, producing a vector that encodes the relationships between objects.
- The pairwise feature aggregation mechanism enables compositional reasoning (demonstrated by its performance on the CLEVR dataset).
- Computational complexity grows quadratically with the number of objects.
- There have been recent attempts to reduce the number of pairwise comparisons by reducing the number of objects fed to the RN.
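A minimal sketch of the RN's pairwise aggregation (layer sizes are assumptions; as in the original RN, each pair is conditioned on the question):

```python
import torch
import torch.nn as nn

class RelationNetworkSketch(nn.Module):
    """Sum g(o_i, o_j, q) over all N^2 object pairs, then map the pooled vector to an answer with f."""
    def __init__(self, obj_dim=2048, q_dim=1024, hidden=512, num_answers=3000):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * obj_dim + q_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_answers))

    def forward(self, objs, q):
        # objs: (B, N, obj_dim) region proposals, q: (B, q_dim) question embedding
        B, N, d = objs.shape
        oi = objs.unsqueeze(2).expand(B, N, N, d)        # object i broadcast over j
        oj = objs.unsqueeze(1).expand(B, N, N, d)        # object j broadcast over i
        qq = q.view(B, 1, 1, -1).expand(B, N, N, q.size(-1))
        pairs = torch.cat([oi, oj, qq], dim=-1)          # all N^2 (o_i, o_j, q) triples
        pooled = self.g(pairs).sum(dim=(1, 2))           # order-invariant sum -> quadratic in N
        return self.f(pooled)
```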
5. Memory, Attention and Composition (MAC) network:
- Uses computational cells that automatically learn to perform attention-based reasoning.
- Learns reasoning mechanisms directly from data.
- Each cell maintains a control state representing the current reasoning operation and a memory state holding the result of that operation.
- Reports significant improvements on the challenging counting and numerical comparison tasks of the CLEVR dataset.
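A heavily simplified sketch of one MAC-style cell, just to show the separate control and memory updates (the real cell has additional projections and gating):

```python
import torch
import torch.nn as nn

class MacCellSketch(nn.Module):
    """One reasoning step: attend over question words to update control, then over the knowledge base to update memory."""
    def __init__(self, dim=512):
        super().__init__()
        self.ctrl_score = nn.Linear(dim, 1)
        self.mem_score = nn.Linear(dim, 1)
        self.mem_update = nn.Linear(2 * dim, dim)

    def forward(self, control, memory, q_words, kb):
        # control, memory: (B, dim); q_words: (B, T, dim) word states; kb: (B, N, dim) image regions
        ca = torch.softmax(self.ctrl_score(q_words * control.unsqueeze(1)).squeeze(-1), dim=1)
        control = (ca.unsqueeze(-1) * q_words).sum(1)                # control: the current reasoning operation
        ma = torch.softmax(self.mem_score(kb * control.unsqueeze(1)).squeeze(-1), dim=1)
        read = (ma.unsqueeze(-1) * kb).sum(1)                        # retrieve relevant information
        memory = self.mem_update(torch.cat([memory, read], dim=-1))  # memory: result of this reasoning step
        return control, memory
```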
### Standardizing the models:
- Different models use different visual features.
- This makes it difficult to tell whether a performance difference is due to the model or to the feature representation.
- Thus, to compare the models fairly, the authors use the same visual features for all algorithms across all datasets.
- They use the 2048-dimensional 'bottom-up' CNN features produced by the region proposal generator of a trained Faster R-CNN model with a ResNet-101 backbone.
- The number of proposals is fixed at 36 for natural images.
- This Faster R-CNN model is trained for object localization, attribute recognition, and bounding box regression on Visual Genome.
- SOTA methods for CLEVR are also shifting toward region proposals.
- They trained a separate Faster R-CNN for these models, for multi-class classification and bounding box regression, because the Faster R-CNN trained on Visual Genome did not transfer well to CLEVR.
- To do this, they estimated the bounding boxes using the 3D coordinates/rotations specified in the scene annotations.
- The number of CLEVR regions is kept fixed at 15.
- They also augmented the features.
## Results :
### Datasets:


- The results table in the paper (not reproduced here) shows the test scores of the various VQA algorithms on the different datasets.
#### Generalization across Question Types:
- TDIUC is used to study generalization across question types. TDIUC has multiple accuracy metrics, with mean-per-type (MPT) and normalized mean-per-type (NMPT) compensating for biases.
- Low MPT scores indicate that all algorithms struggle to generalize across multiple tasks.
- 'Object recognition', 'object presence' and 'scene recognition' were among the easiest tasks. However, it is difficult to conclude whether that is merely due to the large amount of training data available for these tasks.
- RAMEN has the smallest gap between the normalized and un-normalized scores, indicating resistance to answer biases. This gap is fairly large for the other models and largest for BAN.
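A small sketch of how mean-per-type accuracy could be computed (the normalized variant additionally re-weights answers within each question type; only the unnormalized MPT is shown here):

```python
from collections import defaultdict

def mean_per_type_accuracy(records):
    """records: iterable of (question_type, is_correct) pairs.
    Computes accuracy per question type, then averages across types (MPT)."""
    correct, total = defaultdict(int), defaultdict(int)
    for q_type, is_correct in records:
        total[q_type] += 1
        correct[q_type] += int(is_correct)
    per_type = {t: correct[t] / total[t] for t in total}
    return sum(per_type.values()) / len(per_type), per_type

# Overall accuracy can look high while MPT exposes weak question types:
mpt, per_type = mean_per_type_accuracy(
    [("counting", False), ("counting", False), ("object presence", True), ("object presence", True)])
```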
#### Generalization to Novel Concept Composition:
- Uses CVQA and CLEVR-CoGenT-B.
- Scores on CVQA are lower than on VQAv1, which means all algorithms struggle to combine concepts in new ways.
- MAC has the largest performance drop.
- To test generalization on the synthetic side, the models were trained on CLEVR-CoGenT-A's train split and evaluated on CLEVR-CoGenT-B's validation set without fine-tuning.
#### Performance on VQACPv2's Changing Priors:
- All algorithms showed a large drop in performance under changing priors.
#### Counting and Numerical Comparisons:
- For CLEVR, these were the most difficult tasks for the algorithms to answer.
- MAC performs best on these tasks, followed by RAMEN.
- Except for MAC and QCG, all algorithms show a large discrepancy between 'less than' and 'greater than' question types.
- This is most pronounced for RN, indicating a difficulty in linguistic understanding.
- All algorithms struggle with counting questions, even for natural images.
- All algorithms achieve under 62% on counting questions in TDIUC.
## Ablation studies:

#### Key Findings:
1. Early fusion is critical to RAMEN's performance.
2. Late fusion has little impact on CLEVR and VQAv2.
3. A bi-GRU aggregation is superior to mean pooling.
4. It is hypothesized that recurrent aggregation helps capture interactions between the bimodal embeddings, which is critical for reasoning tasks; it may also help remove duplicate proposals by effectively performing non-maximal suppression.