
# Connected Works: (Method 2) Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
>[source](https://arxiv.org/abs/1606.01847)

______

## Paper 1 : MUTAN: Multimodal Tucker Fusion for Visual Question Answering
>[source](https://www.semanticscholar.org/paper/MUTAN%3A-Multimodal-Tucker-Fusion-for-Visual-Question-Ben-younes-Cad%C3%A8ne/fe466e84fa2e838adc3c37ee327cd68004ae08fe)

Notes : https://hackmd.io/YNdi2oP7SnO_CcCPolixnQ?view

### Comparison with MCB :
- MCB uses Count Sketch projection to reduce dimensionality, while MUTAN uses Tucker decomposition, which reduces complexity and enables better interaction between the modalities.
- In Count Sketch projection, the combinations of dimensions from $q$ and from $v$ that are supposed to interact with each other are randomly sampled beforehand (through the hash function $h$).
- To compensate for this random sampling, MCB must use a high output dimension $t_o$ (typically 16,000).
- The efficient interactions in MUTAN do not require increasing any dimensionality.

_____

## Paper 2 : High-Order Attention Models for Visual Question Answering
> [source](https://www.semanticscholar.org/paper/High-Order-Attention-Models-for-Visual-Question-Schwartz-Schwing/799537fa855caf53a6a3a7cf20301a81e90da127)

Notes : https://hackmd.io/LyNGxdyvQtW8s_jzonLnWQ?view

### Comparison with MCB :
- Uses separate unary, pairwise, and ternary relations to compute attention over all three modalities: image, question, and answer.
- In the decision-making layer, they use a 2-layer MCB, with the idea of getting better interactions between those 3 modalities.
- They also tried an MCT unit in place of the 2-layer MCB, which gave better results in some cases.

_____
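To make the Paper 1 comparison concrete, here is a minimal NumPy sketch of the two fusion styles, under stated assumptions. The function names (`make_sketch_params`, `count_sketch`, `mcb_pool`, `tucker_fusion`) and all dimensions are hypothetical and chosen for illustration; they are not taken from either paper's code. The Tucker part is deliberately simplified: it omits any nonlinearities, dropout, or structure constraints on the core tensor that MUTAN may apply.

```python
import numpy as np


def make_sketch_params(n, d, rng):
    """Draw the random hash h: [n] -> [d] and signs s: [n] -> {-1, +1}.

    These are sampled once, beforehand, and then kept fixed.
    """
    h = rng.integers(0, d, size=n)
    s = rng.choice([-1.0, 1.0], size=n)
    return h, s


def count_sketch(x, h, s, d):
    """Project x (length n) to length d: y[h[i]] += s[i] * x[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # unbuffered add, so colliding indices accumulate
    return y


def mcb_pool(q, v, hq, sq, hv, sv, d):
    """Compact bilinear pooling of q and v.

    The Count Sketch of the outer product q v^T equals the circular
    convolution of the two individual sketches, computed here via FFT.
    """
    fq = np.fft.rfft(count_sketch(q, hq, sq, d))
    fv = np.fft.rfft(count_sketch(v, hv, sv, d))
    return np.fft.irfft(fq * fv, n=d)


def tucker_fusion(q, v, Wq, Wv, core, Wo):
    """Simplified Tucker-style fusion (assumption: no nonlinearities).

    q and v are projected to small dimensions t_q and t_v, every pair of
    projected dimensions interacts through the learned core tensor of
    shape (t_q, t_v, t_o), and the result is mapped to the output space.
    """
    q_t = q @ Wq                                  # shape (t_q,)
    v_t = v @ Wv                                  # shape (t_v,)
    z = np.einsum("i,j,ijk->k", q_t, v_t, core)   # shape (t_o,)
    return z @ Wo                                 # shape (n_out,)
```

In `mcb_pool`, which pairs of input dimensions land in the same output bin is decided by the fixed random hashes, which is why a large output dimension is needed to limit collisions. In `tucker_fusion`, every pair of projected dimensions interacts through learned weights, so the intermediate dimensions can stay small.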