# MAttNet: Modular Attention Network for Referring Expression Comprehension

Original paper: [MAttNet: Modular Attention Network for Referring Expression Comprehension](http://openaccess.thecvf.com/content_cvpr_2018/html/Yu_MAttNet_Modular_Attention_CVPR_2018_paper.html)

Journal/Conference: CVPR 2018

Authors: Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, Tamara L. Berg

## Objective

This work proposes a modular network for referring expression comprehension - the Modular Attention Network (MAttNet) - that takes a natural language expression as input and softly decomposes it into three phrase embeddings (for subject, location, and relationship comprehension). These embeddings trigger three separate visual modules whose scores are combined into an overall region score using the module weights.

There are three main novelties in MAttNet:

- It is designed for **general referring expressions**.
- It learns to parse expressions through a **soft attention based** mechanism, without an external parser.
- It applies different **visual attention** techniques in the subject and relationship modules to attend to the image regions described by the expression.

## Related works

- Referring Expression Comprehension
  - CNN-LSTM: [11, 18, 19, 20, 32]
  - Joint embedding models: [4, 16, 22, 26]
- Modular Networks
  - VQA: [3]
  - Visual reasoning: [8, 12]
  - QA: [2]
  - Relationship modeling: [10]
  - Multitask reinforcement learning: [1]
  - Need an external parser: [2, 3, 12]
  - **End-to-end**: [8, 10]
  - **Most related work**: [10]

## Dataset

- RefCOCO, RefCOCO+: [13]
- RefCOCOg: [19]
- MSCOCO: [14]

RefCOCO and RefCOCO+ have two test sets:

- testA: images containing multiple people
- testB: images containing multiple objects of other categories

RefCOCOg also has two kinds of splits:

- val*: data is split by objects, so the same image can appear in both training and validation.
- val: data is split by images.

> Most experiments are run on RefCOCOg-val.

## Methodology

### Model

![](https://i.imgur.com/0ddHqVd.png)

#### Language Attention Network

![](https://i.imgur.com/jRUP70Z.png)

We first embed each word $u_t$ into a vector $e_t$ using a one-hot word embedding, then apply a bidirectional LSTM to encode the whole expression:

$$
\begin{aligned}
e_{t} &=\operatorname{embedding}\left(u_{t}\right) \\
\overrightarrow{h}_{t} &=\operatorname{LSTM}\left(e_{t}, \overrightarrow{h}_{t-1}\right) \\
\overleftarrow{h}_{t} &=\operatorname{LSTM}\left(e_{t}, \overleftarrow{h}_{t+1}\right) \\
h_{t} &=\left[\overrightarrow{h}_{t}, \overleftarrow{h}_{t}\right]
\end{aligned}
$$

Given $H=\left\{h_{t}\right\}_{t=1}^{T}$, we apply three trainable vectors $f_m$, $m \in\{\operatorname{subj}, \operatorname{loc}, \operatorname{rel}\}$, to compute each module's attention over the words:

$$
a_{m, t}=\frac{\exp \left(f_{m}^{T} h_{t}\right)}{\sum_{k=1}^{T} \exp \left(f_{m}^{T} h_{k}\right)}
$$

The weighted sum of word embeddings is used as the modular phrase embedding:

$$
q^{m}=\sum_{t=1}^{T} a_{m, t} e_{t}
$$
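To make this concrete, here is a minimal PyTorch sketch of the per-module word attention and phrase embeddings. It is an illustrative re-implementation rather than the authors' code; the class name, layer sizes, and tensor layout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAttention(nn.Module):
    """Bi-LSTM encoder with one word-attention head per module (subj, loc, rel)."""
    def __init__(self, vocab_size, word_dim=512, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, word_dim)
        self.bilstm = nn.LSTM(word_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # three trainable vectors f_m, one per module
        self.f = nn.Parameter(torch.randn(3, 2 * hidden_dim))

    def forward(self, words):            # words: (B, T) token ids
        e = self.embedding(words)        # (B, T, word_dim)
        h, _ = self.bilstm(e)            # (B, T, 2*hidden_dim), h_t = [h_fwd; h_bwd]
        # a_{m,t} = softmax_t(f_m^T h_t)
        logits = torch.einsum('md,btd->bmt', self.f, h)
        a = F.softmax(logits, dim=2)     # (B, 3, T)
        # q^m = sum_t a_{m,t} e_t  ->  modular phrase embeddings
        q = torch.bmm(a, e)              # (B, 3, word_dim)
        q_subj, q_loc, q_rel = q.unbind(dim=1)
        return q_subj, q_loc, q_rel, h
```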
Finally, we compute 3 module weights for the expression, weighting how much each module contributes to the expression-object score. We concatenate the first and last hidden vectors of $H$, which memorize both the structure and the semantics of the whole expression, and use another fully-connected (FC) layer to transform them into the 3 module weights:

$$
\left[w_{subj}, w_{loc}, w_{rel}\right]=\operatorname{softmax}\left(W_{m}^{T}\left[h_{0}, h_{T}\right]+b_{m}\right)
$$

#### Visual Modules

![](https://i.imgur.com/5VXzmIu.png)

- Region proposals: Faster R-CNN
- Backbone: ResNet / VGG
- Features: C3 and C4 features
- (optional) Segmentation: Mask R-CNN

We compute a matching score for each candidate object $o_i$ given each modular phrase embedding, i.e., $S(o_i | q^{subj})$, $S(o_i | q^{loc})$, and $S(o_i | q^{rel})$.

##### Subject Module

Two tasks:

- Attribute prediction
- Phrase-guided attentional pooling

###### Attribute Prediction

Attributes are frequently used in referring expressions to differentiate between objects of the same category. To prepare attribute labels for the training set, we first run a template parser [13] to obtain color and generic attribute words. A binary cross-entropy loss is used for multi-attribute classification:

$$
L_{subj}^{attr}=\lambda_{attr} \sum_{i} \sum_{j} w_{j}^{attr}\left[y_{ij} \log \left(p_{ij}\right)+\left(1-y_{ij}\right) \log \left(1-p_{ij}\right)\right]
$$

where $w_{j}^{attr}=1 / \sqrt{\text{freq}_{attr}}$.

> Note: $w_j^{attr}$ is used to balance the attribute labels, up-weighting rare attributes.

###### Phrase-guided Attentional Pooling

We allow the subject module to localize relevant regions within a bounding box through "in-box" attention. A 1×1 convolution fuses the attribute blob and the C4 feature into a subject blob $V \in \mathbb{R}^{d \times G}$, where $G = 14 \times 14$. Given the subject phrase embedding $q^{subj}$, we compute its attention on each grid location:

$$
\begin{aligned}
H_{a} &=\tanh \left(W_{v} V+W_{q} q^{subj}\right) \\
a^{v} &=\operatorname{softmax}\left(w_{h, a}^{T} H_{a}\right)
\end{aligned}
$$

The final subject visual representation for the candidate region $o_i$ is the attention-weighted sum over the $G$ grid cells:

$$
\widetilde{v}_{i}^{subj}=\sum_{g=1}^{G} a_{g}^{v} v_{g}
$$

###### Matching Function

We measure the similarity between the subject representation $\widetilde{v}_{i}^{subj}$ and the phrase embedding $q^{subj}$ using a matching function $F$, as shown in Fig. 3: $S\left(o_{i} | q^{subj}\right) = F\left(\widetilde{v}_{i}^{subj}, q^{subj}\right)$. The same matching function is used to compute the location score $S\left(o_{i} | q^{loc}\right)$ and the relationship score $S\left(o_{i} | q^{rel}\right)$.

##### Location Module

![](https://i.imgur.com/XEipCM9.png)

- Absolute location representation:

$$
l_{i}=\left[\frac{x_{tl}}{W}, \frac{y_{tl}}{H}, \frac{x_{br}}{W}, \frac{y_{br}}{H}, \frac{w \cdot h}{W \cdot H}\right]
$$

- Relative location representation (offsets to up-to-five surrounding objects of the same category):

$$
\delta l_{ij}=\left[\frac{\left[\Delta x_{tl}\right]_{ij}}{w_{i}}, \frac{\left[\Delta y_{tl}\right]_{ij}}{h_{i}}, \frac{\left[\Delta x_{br}\right]_{ij}}{w_{i}}, \frac{\left[\Delta y_{br}\right]_{ij}}{h_{i}}, \frac{w_{j} h_{j}}{w_{i} h_{i}}\right]
$$

- Combined location feature:

$$
\widetilde{l}_{i}^{loc}=W_{l}\left[l_{i} ; \delta l_{i}\right]+b_{l}
$$

- Location matching score:

$$
S\left(o_{i} | q^{loc}\right)=F\left(\widetilde{l}_{i}^{loc}, q^{loc}\right)
$$
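Below is a minimal sketch of the location feature and its score. The 5-d geometry encoding follows the equations above; the concrete form of the matching function $F$ (linear projection of both inputs into a joint space followed by a normalized inner product) is an assumption based on Fig. 3, and all dimensions and toy inputs are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def location_feature(box, img_w, img_h):
    """Absolute location l_i = [x_tl/W, y_tl/H, x_br/W, y_br/H, w*h/(W*H)]."""
    x_tl, y_tl, x_br, y_br = box
    w, h = x_br - x_tl, y_br - y_tl
    return torch.tensor([x_tl / img_w, y_tl / img_h,
                         x_br / img_w, y_br / img_h,
                         (w * h) / (img_w * img_h)])

class MatchingFunction(nn.Module):
    """Assumed form of F: project both inputs to a joint space, then inner product."""
    def __init__(self, vis_dim, lang_dim, joint_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, joint_dim)
        self.lang_proj = nn.Linear(lang_dim, joint_dim)

    def forward(self, vis_feat, lang_feat):
        v = F.normalize(self.vis_proj(vis_feat), dim=-1)
        q = F.normalize(self.lang_proj(lang_feat), dim=-1)
        return (v * q).sum(dim=-1)       # similarity score S(o_i | q^m)

# Location module: fuse absolute + relative offsets, then score against q^loc.
loc_fuse = nn.Linear(5 + 25, 512)        # W_l [l_i; dl_i] + b_l (25 = 5 neighbours x 5 dims)
match_loc = MatchingFunction(vis_dim=512, lang_dim=512)

l_i = location_feature([50, 40, 200, 300], img_w=640, img_h=480)  # toy box
dl_i = torch.zeros(25)                   # offsets to 5 same-category neighbours (toy values)
l_tilde = loc_fuse(torch.cat([l_i, dl_i]))
q_loc = torch.randn(512)                 # location phrase embedding from the language network
score = match_loc(l_tilde, q_loc)
```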
##### Relationship Module

![](https://i.imgur.com/GqkXiC9.png)

While the subject module deals with "in-box" details about the target object, some expressions involve its relationship with other "out-of-box" objects, e.g., "cat on chaise lounge". The relationship module addresses these cases. As in Fig. 5, given a candidate object $o_i$, we first look for its surrounding (up-to-five) objects $o_{ij}$ regardless of their categories. The average-pooled C4 feature is used as the appearance feature $v_{ij}$ of each supporting object.

> Question: how is $v_{ij}$ computed? Reading the code shows that it is simply the visual embedding of each of the (up to) five surrounding objects; see line [93](https://github.com/lichengunc/MAttNet/blob/13fec3e47ba78577b4b43c8f0b5e8c53b5415472/lib/layers/joint_match.py#L93) of joint_match.py. No pair-wise relationship is modeled; that is, $v_{ij} = v_{kj}$ for all candidates $i, k$ and every surrounding object $j \in \{1, 2, \ldots, n\}$, where $n$ is the number of surrounding objects.

- Relative location difference:

$$
\delta m_{ij}=\left[\frac{\left[\Delta x_{tl}\right]_{ij}}{w_{i}}, \frac{\left[\Delta y_{tl}\right]_{ij}}{h_{i}}, \frac{\left[\Delta x_{br}\right]_{ij}}{w_{i}}, \frac{\left[\Delta y_{br}\right]_{ij}}{h_{i}}, \frac{w_{j} h_{j}}{w_{i} h_{i}}\right]
$$

- Combined with the appearance feature:

$$
\widetilde{v}_{ij}^{rel}=W_{r}\left[v_{ij} ; \delta m_{ij}\right]+b_{r}
$$

- The highest matching score over the surrounding objects is taken as the relationship score:

$$
S\left(o_{i} | q^{rel}\right)=\max _{j \neq i} F\left(\widetilde{v}_{ij}^{rel}, q^{rel}\right)
$$

### Loss Function

The overall weighted matching score for candidate object $o_i$ and expression $r$ is:

$$
S\left(o_{i} | r\right)=w_{subj} S\left(o_{i} | q^{subj}\right)+w_{loc} S\left(o_{i} | q^{loc}\right)+w_{rel} S\left(o_{i} | q^{rel}\right)
$$

For each positive pair $(o_i, r_i)$ we randomly sample two negative pairs $(o_i, r_j)$ and $(o_k, r_i)$, where $r_j$ is an expression describing some other object and $o_k$ is some other object in the *same image*, and apply a hinge ranking loss:

$$
\begin{aligned}
L_{rank}=\sum_{i} &\left[\lambda_{1} \max \left(0, \Delta+S\left(o_{i} | r_{j}\right)-S\left(o_{i} | r_{i}\right)\right)\right.\\
&\left.+\lambda_{2} \max \left(0, \Delta+S\left(o_{k} | r_{i}\right)-S\left(o_{i} | r_{i}\right)\right)\right]
\end{aligned}
$$

The overall loss combines the attribute and ranking terms:

$$
L=L_{subj}^{attr}+L_{rank}
$$
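A minimal sketch of the overall score and the hinge ranking loss, assuming the three module scores, the module weights, and the two sampled negatives are already available; the margin and λ values below are placeholder hyperparameters.

```python
import torch

def overall_score(s_subj, s_loc, s_rel, w_subj, w_loc, w_rel):
    """S(o_i | r): module-weighted sum of the three matching scores."""
    return w_subj * s_subj + w_loc * s_loc + w_rel * s_rel

def ranking_loss(s_pos, s_neg_expr, s_neg_obj, margin=0.1, lam1=1.0, lam2=1.0):
    """Hinge ranking loss with two negatives per positive pair:
    s_pos      = S(o_i | r_i)  -- matched object / expression pair
    s_neg_expr = S(o_i | r_j)  -- same object, expression of another object
    s_neg_obj  = S(o_k | r_i)  -- another object in the same image, same expression
    """
    zero = torch.zeros_like(s_pos)
    l1 = torch.max(zero, margin + s_neg_expr - s_pos)
    l2 = torch.max(zero, margin + s_neg_obj - s_pos)
    return (lam1 * l1 + lam2 * l2).sum()
```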
## Result

### Referring Expression Comprehension

![](https://i.imgur.com/1LmZONd.png)
![](https://i.imgur.com/F9paVw4.png)
![](https://i.imgur.com/eYh0fSe.png)

### Segmentation from Referring Expression

![](https://i.imgur.com/X9SBa7F.png)

## Conclusion

Our modular attention network addresses the variance in referring expressions by attending to both relevant words and visual regions in a modular framework and dynamically computing an overall matching score. We demonstrate the model's effectiveness on bounding-box-level and pixel-level comprehension, significantly outperforming the state of the art.

## Thoughts

- The method only models binary relationships between two objects, so it may not handle relations involving three or more objects well, e.g., "the person to the left of the girl in red who is behind the man in blue."
  ![](https://i.imgur.com/wxUXBnB.png)
- (?) The method may ignore specific interactions between objects, e.g., "the man holding the remote control." If the remote is actually on the table and the man is merely near it, the model should not return any object; with this design, however, it probably cannot distinguish "next to" from "holding" and would still return the man, which is wrong.
  ![](https://i.imgur.com/bFYDBG1.png)
  ![](https://i.imgur.com/6MGjs4r.jpg)
- Although the mask branch of Mask R-CNN gives us instance segmentation, it does not understand which object the referring expression refers to, so occlusion can lead to segmenting the wrong instance.
  ![](https://i.imgur.com/AoTF6Mk.jpg)

---

## Link for code/model/dataset

Demo site: [http://vision2.cs.unc.edu/refer/comprehension](http://vision2.cs.unc.edu/refer/comprehension)

Source code for training/evaluation: [https://github.com/lichengunc/MAttNet](https://github.com/lichengunc/MAttNet)

Datasets (RefCOCO, RefCOCO+, RefCOCOg): [https://github.com/lichengunc/refer](https://github.com/lichengunc/refer)

---

## References

[1] J. Andreas, D. Klein, and S. Levine. Modular multitask reinforcement learning with policy sketches. In ICML, 2017.
[2] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. In NAACL, 2016.
[3] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In CVPR, 2016.
[4] K. Chen, R. Kovvuri, and R. Nevatia. Query-guided regression network with context policy for phrase grounding. In ICCV, 2017.
[8] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko. Learning to reason: End-to-end module networks for visual question answering. In ICCV, 2017.
[10] R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko. Modeling relationships in referential expressions with compositional modular networks. In CVPR, 2017.
[11] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In CVPR, 2016.
[12] J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Inferring and executing programs for visual reasoning. In ICCV, 2017.
[13] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
[14] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[16] J. Liu, L. Wang, and M.-H. Yang. Referring expression generation and comprehension via attributes. In ICCV, 2017.
[18] R. Luo and G. Shakhnarovich. Comprehension-guided referring expressions. In CVPR, 2017.
[19] J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
[20] V. K. Nagaraja, V. I. Morariu, and L. S. Davis. Modeling context between objects for referring expression understanding. In ECCV, 2016.
[22] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In ECCV, 2016.
[26] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016.
[32] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In ECCV, 2016.