# Connected Works: (Method 3) STACKED ATTENTION NETWORKS
_____
## Paper 1 : Visual Question Answering with Question Representation Update (QRU)
> [source](https://www.semanticscholar.org/paper/Visual-Question-Answering-with-Question-Update-Li-Jia/269546925f0fd457b31c13c2870343b0aed761dc)
Notes : https://hackmd.io/SqMsPhYPR_2bSQfjTQQb3Q?view
### Comparison with SAN :
- SAN puts its attention on coarse regions obtained from the activations of the last convolutional layer, which may include a cluttered and noisy background.
- This model instead attends only over selected region proposals, so much of the background clutter is filtered out (a minimal sketch contrasting the two follows).
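A minimal PyTorch sketch of the contrast, with assumed shapes and plain dot-product scoring (both papers actually use learned projections for the attention scores):

```python
import torch
import torch.nn.functional as F

d = 512                                  # feature dimension (assumed)
q = torch.randn(d)                       # question vector

# SAN-style: attend over all 14x14 = 196 coarse cells of the last conv layer,
# so cluttered background cells also compete for attention.
grid_feats = torch.randn(14 * 14, d)
grid_attn = F.softmax(grid_feats @ q, dim=0)    # one weight per grid cell
grid_ctx = grid_attn @ grid_feats               # (d,) weighted average

# QRU-style: attend only over k selected region proposals,
# so most background clutter never enters the attention pool.
k = 20
prop_feats = torch.randn(k, d)
prop_attn = F.softmax(prop_feats @ q, dim=0)
prop_ctx = prop_attn @ prop_feats
```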
______
## Paper 2 : Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
>[source](https://www.semanticscholar.org/paper/Don't-Just-Assume%3B-Look-and-Answer%3A-Overcoming-for-Agrawal-Batra/90873a97aa9a43775e5aeea01b03aea54b28bfbd)
Notes : https://hackmd.io/iVTm9pH7SoaBzlN1r8wY9Q?view
### Comparison with SAN :
- Built on top of SAN.
- Classifies questions into yes/no and non-yes/no types and handles the two separately.
- Locates the image patch relevant to the question and then returns the set of visual concepts found there.
_____
## Paper 3 : Ask, Attend and Answer
> [source](https://arxiv.org/abs/1511.05234)
Notes : https://hackmd.io/@3hbXjf7KRbakHNzaj-2jqw/H11B9ogsO
### Comparison with SAN :
* In SAN, a single question vector is used as the query over image regions; in this paper, each word in the question is used as a query, so the model learns how each word relates to each visual feature.
* In SAN, the question feature is refined repeatedly; here, it is the overall visual features that are updated repeatedly.
* SAN uses a single image representation both for computing attention and for the subsequent weighted sum; this paper uses two different representations, one for computing the attention weights and the other for the weighted average (see the sketch below).
* SAN uses a CNN (or an LSTM) to encode the question, but this paper uses a simple embedding matrix to embed each word.
* SAN takes its visual features from VGGNet; this paper uses GoogLeNet.
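A minimal sketch of the word-as-query idea with the two-representation split (all names, shapes and the linear projections are assumptions, not the paper's exact parameterisation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, T, R = 500, 10, 196                   # dim, question words, image regions (assumed)
word_emb = torch.randn(T, d)             # one query per word, not one per question
img_feats = torch.randn(R, d)

# Two learned views of the same regions: one scores attention, one is averaged.
to_key = nn.Linear(d, d)                 # representation for attention weights
to_value = nn.Linear(d, d)               # representation for the weighted average
keys, values = to_key(img_feats), to_value(img_feats)

scores = word_emb @ keys.t()             # (T, R): how each word relates to each region
attn = F.softmax(scores, dim=1)          # per-word attention over regions
evidence = attn @ values                 # (T, d): per-word visual evidence
```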
-----
## Paper 4 : Dynamic Memory Networks for Visual and Textual Question Answering
> [source](https://arxiv.org/abs/1603.01417)
Notes : https://hackmd.io/@3hbXjf7KRbakHNzaj-2jqw/Hykjhnejd
### Comparison with SAN :
* In SAN, attention is calculated with a dot product; here, attention is calculated through linear transforms.
* SAN uses a soft-attention mechanism in which the visual features are combined via a weighted average under softmaxed attention weights; here, the attention weights act as update gates in a GRU that is fed the image features (see the sketch below).
* The question embedding in SAN comes from a convolution; here it is the last hidden state of a GRU.
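A sketch of the attention-gated GRU, assuming a small feed-forward scoring net over interaction features; the paper builds the gate into the GRU's internals, while here the standard `GRUCell` output is gated as an approximation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, R = 512, 196                          # feature dim, number of regions (assumed)
img_feats = torch.randn(R, d)            # regions fed to the GRU as a sequence
q = torch.randn(d)                       # question vector (a GRU state in the paper)

# Attention via linear transforms over interaction features, not a raw dot product.
score_net = nn.Sequential(nn.Linear(2 * d, d), nn.Tanh(), nn.Linear(d, 1))
inter = torch.cat([img_feats * q, (img_feats - q).abs()], dim=1)
g = F.softmax(score_net(inter).squeeze(1), dim=0)     # (R,) attention weights

# The attention weight acts as the update gate: attended regions overwrite
# the episode memory, ignored regions leave it unchanged.
cell = nn.GRUCell(d, d)
h = torch.zeros(1, d)
for t in range(R):
    h_tilde = cell(img_feats[t:t + 1], h)
    h = g[t] * h_tilde + (1 - g[t]) * h
```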
-----
## Paper 5 : Visual7W: Grounded Question Answering in Images
> [source](https://arxiv.org/abs/1511.03416)
Notes : https://hackmd.io/@KshitijAmbilduke/SkRLLHWoO
### Comparison with SAN :
* The attention mechanism here is completely different from SAN's: attention is incorporated directly into the LSTM used for question embedding.
* SAN's question embedding is convolutional; here it is an LSTM with attention built in (see the sketch below).
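A rough sketch of attention built into the question LSTM (the shapes, the dot-product scoring and the concatenation scheme are assumptions): at every step the current hidden state attends over the image, and the attended vector is fed in alongside the next word:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, R, T = 512, 196, 10                   # dim, regions, question length (assumed)
img_feats = torch.randn(R, d)
word_emb = torch.randn(T, d)

# The LSTM input is [word embedding ; attended visual vector] at each step.
cell = nn.LSTMCell(2 * d, d)
h, c = torch.zeros(1, d), torch.zeros(1, d)
for t in range(T):
    attn = F.softmax(img_feats @ h.squeeze(0), dim=0)  # attend with current state
    v_t = (attn @ img_feats).unsqueeze(0)              # (1, d) attended visual vector
    h, c = cell(torch.cat([word_emb[t:t + 1], v_t], dim=1), (h, c))
```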
-----
## Paper 6 : Hierarchical Question-Image Co-Attention for Visual Question Answering
>[source](https://arxiv.org/pdf/1606.00061.pdf)
Notes : https://hackmd.io/@KshitijAmbilduke/BJEU9zUFd
### Comparison with SAN :
* SAN repeatedly refines only a question vector; here, a separate question vector and a separate image vector are each refined repeatedly.
* In SAN, only the visual features are combined according to the attention weights; here, both visual and textual features are combined, yielding two context vectors (v' and q').
* The question is encoded at three levels, and the visual and textual context vectors (v' and q') are computed at each level, whereas SAN works from a single question representation (see the sketch below).
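A sketch of one round of alternating co-attention producing v' and q' (the dot-product scoring and the mean-based initial summary are assumptions; the paper repeats this at the word, phrase and question levels):

```python
import torch
import torch.nn.functional as F

d, T, R = 512, 10, 196                   # dim, question words, regions (assumed)
q_feats = torch.randn(T, d)              # word-level question features
v_feats = torch.randn(R, d)              # region-level visual features

def attend(feats, query):
    """Soft attention over `feats` guided by `query` (dot-product scoring)."""
    w = F.softmax(feats @ query, dim=0)
    return w @ feats

q_summary = q_feats.mean(dim=0)          # stand-in for the initial question summary
v_ctx = attend(v_feats, q_summary)       # v': question-guided visual context
q_ctx = attend(q_feats, v_ctx)           # q': image-guided question context
```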
-----
## Paper 7 : Dual Attention Network for Multimodal Reasoning and Matching
Notes : https://hackmd.io/@KshitijAmbilduke/BkqRzEVtd
### Comparison with SAN :
* Here, attention is calculated between a memory vector and the visual features, and between the memory vector and the question features; in SAN, attention is calculated directly between the visual and question features (see the sketch below).
* Text features here come from an LSTM; SAN uses convolutions.
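A sketch of the memory-mediated dual attention (the initialisation and the elementwise-product update are simplifications of the paper's update rule):

```python
import torch
import torch.nn.functional as F

d, T, R = 512, 10, 196                   # dim, words, regions (assumed)
text_feats = torch.randn(T, d)           # per-word LSTM states
img_feats = torch.randn(R, d)

m = torch.randn(d)                       # shared memory vector (initialisation assumed)
for _ in range(2):                       # two reasoning steps, as a stand-in
    v = F.softmax(img_feats @ m, dim=0) @ img_feats     # memory attends the image
    u = F.softmax(text_feats @ m, dim=0) @ text_feats   # memory attends the text
    m = m + v * u                        # joint memory update (simplified)
```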
-----
## Paper 8 : High-Order Attention Models for Visual Question Answering
> [Source](https://arxiv.org/abs/1711.04323)
Notes : https://hackmd.io/LyNGxdyvQtW8s_jzonLnWQ?view
### Comparison with SAN :
- Uses separate unary, pairwise and ternary relations to compute attention over all three modalities: image, question and answer (see the sketch below).
- Uses a 2-layer LSTM for the question vector: one layer takes word embeddings, the other takes 1-D convolutional features.
- Uses MCT and MCB to combine the vectors from the three modalities and produce the final answer.
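A loose sketch of combining unary, pairwise and ternary potentials into one attention distribution over image regions (all projections and the mean-pooled summaries are assumptions; the paper's potentials and its MCB/MCT fusion are considerably richer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, R, T, A = 512, 196, 10, 5             # dim, regions, words, answer candidates
V, Q, Ans = torch.randn(R, d), torch.randn(T, d), torch.randn(A, d)

unary = nn.Linear(d, 1)                  # image-only potential
W_q, W_a = nn.Linear(d, d), nn.Linear(d, d)
q_bar, a_bar = Q.mean(0), Ans.mean(0)    # mean-pooled summaries (assumed)

theta1 = unary(V).squeeze(1)                         # unary: image alone
theta2 = W_q(V) @ q_bar + W_a(V) @ a_bar             # pairwise: image-question, image-answer
theta3 = (W_q(V) * W_a(V)) @ (q_bar * a_bar)         # ternary: all three together
attn_v = F.softmax(theta1 + theta2 + theta3, dim=0)  # attention over image regions
```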
-----
## Paper 9: An Attention Based Convolutional Neural Network for Visual Question Answering
> [Source](https://arxiv.org/pdf/1511.05960.pdf)
>Notes : https://hackmd.io/rIwLt62kTkSeD0ZcV278aw?both
### Comparison with SAN:
- Attention is applied only once.
- Uses a hand-crafted convolutional kernel.
- Projects the question embedding into visual space to compute attention (see the sketch below).
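A minimal sketch of question-configured attention: the question embedding is projected into visual space and used as a 1x1 convolutional kernel over the image feature map (the kernel size and all shapes are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_q, c, H, W = 256, 512, 14, 14          # question dim, channels, map size (assumed)
q = torch.randn(d_q)                     # question embedding
feat_map = torch.randn(1, c, H, W)       # image conv feature map

to_kernel = nn.Linear(d_q, c)            # project question into visual space
kernel = to_kernel(q).view(1, c, 1, 1)   # use it as a 1x1 conv kernel
logits = F.conv2d(feat_map, kernel)      # question-guided response map (1, 1, H, W)
attn = F.softmax(logits.view(-1), dim=0).view(1, 1, H, W)
attended = (feat_map * attn).sum(dim=(2, 3))   # (1, c) attended visual vector
```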
-----
## Paper 10: Focused Dynamic Attention
> [Source](https://arxiv.org/pdf/1604.01485.pdf)
> [Notes](https://hackmd.io/UWmL5TyUTxmzPQNe4X5IFg)
### Comparison with SAN:
- SAN encodes the question with a CNN over unigrams, bigrams and trigrams and uses the resulting vector to attend over image regions (see the sketch below).
- SAN's attention mechanism does not use object bounding boxes, whereas this paper does.
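For reference, a sketch of SAN's CNN question encoder: conv filters of width 1, 2 and 3 capture unigram, bigram and trigram features, followed by max-pooling over time and concatenation (the filter counts and dims are assumptions):

```python
import torch
import torch.nn as nn

d, T = 300, 10                           # embedding size, question length (assumed)
words = torch.randn(1, d, T)             # word embeddings as (batch, channels, time)

# One conv per n-gram width, max-pooled over time, then concatenated.
convs = nn.ModuleList(nn.Conv1d(d, 128, k) for k in (1, 2, 3))
feats = [torch.relu(conv(words)).max(dim=2).values for conv in convs]
q_vec = torch.cat(feats, dim=1)          # (1, 384) question vector
```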
-----
## Paper 11: Learning to Answer Questions From Image Using Convolutional Neural Network
> [Source](https://arxiv.org/pdf/1506.00333.pdf)
> [Notes](https://hackmd.io/VReJUQKJRiKypKpGBCDLqw)
### Comparison with SAN:
- Like SAN, it uses a CNN to encode the sentence; unlike SAN, attention is applied just once, and the interaction between the two modalities is learned by a CNN (see the sketch below).
- This paper is from 2015, predating SAN.
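A very rough sketch of fusing the two modalities with a convolution instead of attention (where to splice in the image vector, and all shapes, are assumptions):

```python
import torch
import torch.nn as nn

d, T = 256, 8                            # shared dim, sentence length (assumed)
sent = torch.randn(1, d, T)              # sentence fragment features
img = torch.randn(1, d, 1)               # image vector projected to the same dim

# Splice the image vector into the word sequence and let a 1-D conv
# learn the joint multimodal interaction directly.
joint = torch.cat([sent[:, :, :T // 2], img, sent[:, :, T // 2:]], dim=2)
conv = nn.Conv1d(d, d, kernel_size=3)
fused = torch.relu(conv(joint)).max(dim=2).values    # (1, d) fused representation
```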
-----
## Paper 12: Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge from External Sources
> [Source](https://arxiv.org/abs/1511.06973)
> [Notes](https://hackmd.io/@KshitijAmbilduke/B1bThMMi_)
### Comparison with SAN :
* This paper does not use attention at all; it uses a very straightforward method that is somewhat similar to human intuition.
-----