# Connected Works: (Method 2) Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
>[source](https://arxiv.org/abs/1606.01847)
______
## Paper 1 : MUTAN: Multimodal Tucker Fusion for Visual Question Answering
>[source](https://www.semanticscholar.org/paper/MUTAN%3A-Multimodal-Tucker-Fusion-for-Visual-Question-Ben-younes-Cad%C3%A8ne/fe466e84fa2e838adc3c37ee327cd68004ae08fe)
Notes : https://hackmd.io/YNdi2oP7SnO_CcCPolixnQ?view
### Comparison with MCB :
- MCB uses Count Sketch projection to reduce dimensionality, while MUTAN uses a Tucker decomposition, which reduces complexity and enables better interaction between the modalities.
- In Count Sketch projection, the combinations of dimensions from $q$ and $v$ that are supposed to interact with each other are randomly sampled beforehand (through the hash function $h$) and are not learned.
- To compensate for this, MCB must set a high output dimension $t_o$ (typically 16,000).
- Because the interactions in MUTAN are learned and efficient, it does not need to increase any dimensionality in this way.
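
To make the contrast concrete, here is a minimal pure-Python sketch of the two fusion schemes, under stated assumptions. The function names, toy dimensions and random initialisation are illustrative only and come from neither paper. MCB normally computes the circular convolution through FFT, and the Tucker part omits refinements such as nonlinearities and any structure imposed on the core tensor.

```python
import random

def random_sketch_params(n, d, rng):
    """Fixed (never learned) Count Sketch parameters: a bucket h[i] and a sign s[i] per input dim."""
    h = [rng.randrange(d) for _ in range(n)]
    s = [rng.choice((-1, 1)) for _ in range(n)]
    return h, s

def count_sketch(x, h, s, d):
    """Project x (length n) to length d via y[h[i]] += s[i] * x[i]."""
    y = [0.0] * d
    for i, xi in enumerate(x):
        y[h[i]] += s[i] * xi
    return y

def circular_convolution(a, b):
    """Direct O(d^2) circular convolution (an FFT would be used in practice)."""
    d = len(a)
    out = [0.0] * d
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[(i + j) % d] += ai * bj
    return out

def mcb_fusion(q, v, params_q, params_v, d):
    """Compact bilinear pooling: convolve the sketches of q and v."""
    return circular_convolution(count_sketch(q, *params_q, d),
                                count_sketch(v, *params_v, d))

def project(x, W):
    """Compute x^T W, with W stored as len(x) rows."""
    return [sum(x[i] * W[i][k] for i in range(len(x))) for k in range(len(W[0]))]

def tucker_fusion(q, v, W_q, W_v, core, W_o):
    """Tucker-style fusion: project q and v, mix them through the core, project out."""
    q_t = project(q, W_q)  # length t_q
    v_t = project(v, W_v)  # length t_v
    t_o = len(core[0][0])
    z = [sum(core[a][b][c] * q_t[a] * v_t[b]
             for a in range(len(q_t)) for b in range(len(v_t)))
         for c in range(t_o)]  # length t_o
    return project(z, W_o)

# Toy usage with hypothetical sizes.
rng = random.Random(0)
q, v = [1.0, -2.0, 0.5], [0.3, 0.0, 2.0, -1.0]

d = 8  # sketch size; fixed random pairings are why MCB needs this to be large
params_q = random_sketch_params(len(q), d, rng)
params_v = random_sketch_params(len(v), d, rng)
mcb_out = mcb_fusion(q, v, params_q, params_v, d)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

t_q, t_v, t_o, n_out = 2, 2, 3, 5  # small learned dimensions
W_q, W_v, W_o = rand_matrix(len(q), t_q), rand_matrix(len(v), t_v), rand_matrix(t_o, n_out)
core = [[[rng.uniform(-1, 1) for _ in range(t_o)] for _ in range(t_v)] for _ in range(t_q)]
mutan_out = tucker_fusion(q, v, W_q, W_v, core, W_o)
```

In `mcb_fusion`, the pair `(i, j)` always lands in bucket `(h_q[i] + h_v[j]) % d`, so which products collide is decided by the random draw. In `tucker_fusion`, every factor (`W_q`, `W_v`, `core`, `W_o`) could be trained, so the pairing is learned instead.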
_____
## Paper 2 : High-Order Attention Models for Visual Question Answering
> [source](https://www.semanticscholar.org/paper/High-Order-Attention-Models-for-Visual-Question-Schwartz-Schwing/799537fa855caf53a6a3a7cf20301a81e90da127)
Notes : https://hackmd.io/LyNGxdyvQtW8s_jzonLnWQ?view
### Comparison with MCB :
- Uses separate unary, pairwise and ternary relations to compute attention over all three modalities: image, question and answer.
- In the decision-making layer, they use a 2-layer MCB, with the idea of getting better interactions between those 3 modalities.
- They also tried an MCT (Multimodal Compact Trilinear pooling) unit in place of the 2-layer MCB, which gave better results in some settings.
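
The two decision-layer variants above can be sketched as follows. This is an illustrative reading and not the authors' implementation: it assumes MCT is the direct three-way extension of MCB (the three sketches are combined by circular convolution), and that the 2-layer MCB fuses image and question first and then fuses the result with the answer. All names, the fusion order and the toy sizes are hypothetical.

```python
import random

def sketch_params(n, d, rng):
    """Random bucket indices and signs for one Count Sketch projection."""
    return ([rng.randrange(d) for _ in range(n)],
            [rng.choice((-1, 1)) for _ in range(n)])

def count_sketch(x, h, s, d):
    y = [0.0] * d
    for i, xi in enumerate(x):
        y[h[i]] += s[i] * xi
    return y

def circ_conv(a, b):
    """Direct circular convolution; an FFT would be used at realistic sizes."""
    d = len(a)
    out = [0.0] * d
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[(i + j) % d] += ai * bj
    return out

def mcb(x, y, p_x, p_y, d):
    return circ_conv(count_sketch(x, *p_x, d), count_sketch(y, *p_y, d))

def two_layer_mcb(img, qst, ans, p, d):
    """Fuse image and question first, then fuse that result with the answer."""
    fused = mcb(img, qst, p["img"], p["qst"], d)
    return mcb(fused, ans, p["fused"], p["ans"], d)

def mct(img, qst, ans, p, d):
    """Trilinear pooling in one step: convolve all three sketches together."""
    s_img = count_sketch(img, *p["img"], d)
    s_qst = count_sketch(qst, *p["qst"], d)
    s_ans = count_sketch(ans, *p["ans"], d)
    return circ_conv(circ_conv(s_img, s_qst), s_ans)

# Toy usage with hypothetical feature vectors.
rng = random.Random(1)
d = 8
img, qst, ans = [1.0, 2.0, 3.0], [0.5, -1.0], [2.0, 1.0, -1.0, 0.0]
p = {
    "img": sketch_params(len(img), d, rng),
    "qst": sketch_params(len(qst), d, rng),
    "ans": sketch_params(len(ans), d, rng),
    "fused": sketch_params(d, d, rng),  # second-layer sketch of the fused vector
}
two_layer_out = two_layer_mcb(img, qst, ans, p, d)
mct_out = mct(img, qst, ans, p, d)
```

The structural difference is that `two_layer_mcb` re-sketches an already compressed intermediate vector, while `mct` approximates the full three-way outer product with a single set of sketches.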
_____