# Method 1 : Bilinear Attention Networks

> [paper link](https://arxiv.org/abs/1805.07932)

## Introduction

- Learning an attention distribution for every pair of multimodal channels is computationally expensive.
- Grounding: connecting natural language to the references of the physical world.
- Bilinear attention considers every pair of multimodal channels, i.e. every question word with every image region.
- If the given question involves multiple visual concepts represented by multiple words, inference using a visual attention distribution for each word can exploit relevant information better than inference using a single compressed attention distribution.
- They propose bilinear attention networks (BAN) to use a bilinear attention distribution, built on top of low-rank bilinear pooling.
- They also propose a variant of multimodal residual networks (MRN) to utilize the multiple bilinear attention maps of BAN.

![](https://i.imgur.com/p5rFL68.png)

## Low-rank Bilinear Model

- In a bilinear model, a scalar output $f_i$ is computed as $f_i = x^T W_i y$ for inputs $x \in \mathbb{R}^N$ and $y \in \mathbb{R}^M$.
- The weight matrix $W_i$ is replaced by the product of two smaller matrices, $U_i V_i^T$, where $U_i \in \mathbb{R}^{N \times d}$ and $V_i \in \mathbb{R}^{M \times d}$, to impose regularity (a rank constraint).
- For the scalar output $f_i$:

$$f_i = x^T W_i y \approx x^T U_i V_i^T y = \mathbb{1}^T (U_i^T x \circ V_i^T y)$$

where $\mathbb{1} \in \mathbb{R}^d$ is a vector of ones and $\circ$ denotes the Hadamard product (element-wise multiplication).

## Low-rank Bilinear Pooling

- For a vector output $f$, a pooling matrix $P$ is introduced:

$$f = P^T (U^T x \circ V^T y)$$

where $P \in \mathbb{R}^{d \times c}$, $U \in \mathbb{R}^{N \times d}$, and $V \in \mathbb{R}^{M \times d}$.
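Below is a minimal sketch of low-rank bilinear pooling in PyTorch, assuming batched input vectors $x$ and $y$; the module name `LowRankBilinearPooling` and the chosen dimensions are illustrative, not taken from the authors' implementation.

```python
# A minimal sketch of low-rank bilinear pooling (f = P^T (U^T x ∘ V^T y)),
# assuming PyTorch. Dimension names N, M, d, c follow the notes above.
import torch
import torch.nn as nn


class LowRankBilinearPooling(nn.Module):
    """Projects x and y into a shared d-dim space, takes their Hadamard
    product, and pools the result to a c-dim joint feature."""

    def __init__(self, N: int, M: int, d: int, c: int):
        super().__init__()
        self.U = nn.Linear(N, d, bias=False)  # computes U^T x, U ∈ R^{N×d}
        self.V = nn.Linear(M, d, bias=False)  # computes V^T y, V ∈ R^{M×d}
        self.P = nn.Linear(d, c, bias=False)  # computes P^T(·), P ∈ R^{d×c}

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Hadamard product of the two d-dimensional projections,
        # then a linear pooling down to c dimensions.
        return self.P(self.U(x) * self.V(y))


# Usage (hypothetical sizes): a 1024-d image feature and a 300-d word
# embedding pooled through a rank-256 joint space into a 128-d output.
pool = LowRankBilinearPooling(N=1024, M=300, d=256, c=128)
x = torch.randn(8, 1024)  # batch of image features
y = torch.randn(8, 300)   # batch of question-word features
f = pool(x, y)            # shape: (8, 128)
```

The scalar low-rank bilinear model above corresponds to replacing the learned $P$ with the fixed vector of ones, i.e. summing the Hadamard product over its $d$ dimensions.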