<style> img { display: block; margin-left: auto; margin-right: auto; } </style>

> [Paper link](https://arxiv.org/abs/2309.17133) | [Note link](https://blog.csdn.net/m0_54237635/article/details/137411847) | [Code link](https://github.com/LinWeizheDragon/FLMR) | NeurIPS 2023

:::success
**Thoughts**

This study addresses two key limitations in knowledge-based visual question answering:
1. Image representations derived from image-to-text transforms may be incomplete and inaccurate.
2. Relying solely on one-dimensional embeddings to calculate relevance scores lacks sensitivity.
:::

## Abstract

The task of knowledge-based visual question answering (KB-VQA) leverages external knowledge to answer questions grounded in visual content. A framework known as retrieval-augmented visual question answering (RA-VQA) can address this task effectively. This study aims to resolve two limitations of RA-VQA's retriever:
1. The image representations obtained via image-to-text transforms can be incomplete and inaccurate.
2. Using only one-dimensional embeddings to calculate relevance scores lacks sensitivity.

## Background

Knowledge-based visual question answering (KB-VQA) aims to read an image and answer a question related to the image content. Answering correctly depends on the ability to retrieve relevant information and to generate the answer based on the retrieved knowledge. A framework called retrieval-augmented visual question answering (RA-VQA) is designed for this task.

RA-VQA process:
1. It retrieves $K$ documents relevant to the image and the question from an external knowledge base.
2. It uses a large language model (LLM) to generate the answer based on these relevant passages.

### Previous Work: [RA-VQA](https://aclanthology.org/2022.emnlp-main.772.pdf)

![image](https://hackmd.io/_uploads/H1aggy6F0.png)

A jointly trained framework for knowledge retrieval and answer generation:
1. A visual algorithm transforms visual information into language.
2. The retriever retrieves documents from a knowledge database.
3. The retriever $p_\theta$ and the generator $p_\phi$ are trained with the RA-VQA loss.
4. The model selects the answer with the highest joint probability $p_\theta(z_i \mid x) p_\phi(y_i \mid x, z_i)$.

## Method

Two major limitations of RA-VQA's retriever are discussed in this study:
1. Image representations are only incompletely captured via image-to-text transforms.
2. Using only single embeddings to compute relevance scores between queries and documents can be lossy.

This study proposes **Fine-grained Late-interaction Multi-modal Retrieval (FLMR)** to address these two limitations:
1. To obtain more complete image representations, this work uses a large vision model whose features are aligned with the text-based retriever, complementing the representations from image-to-text transforms.
2. Instead of a single embedding, this work employs multi-dimensional, token-level representations to capture relevance in a fine-grained manner.

![image](https://hackmd.io/_uploads/BySj2p2FC.png)

### Knowledge Retrieval

The framework consists of two encoders: a vision model $\mathcal{F}_V$ and a language model $\mathcal{F}_L$.

#### Visual Features

This study uses two kinds of visual features:
1. Text-based visual captions.
2. Feature-based representations from a large vision model.

For the second feature source, this study uses [VinVL](https://arxiv.org/abs/2101.00529) to locate $N_{ROI}$ Region-of-Interest (ROI) bounding boxes. With the vision model $\mathcal{F}_V$, they obtain one global image latent $g = \mathcal{F}_V(I) \in \mathbb{R}^{d_V}$ from the image $I$. Additionally, they obtain ROI-based latents $\{ r_i = \mathcal{F}_V(I_i^p) \in \mathbb{R}^{d_V} \}_{i=1, \dots, N_{ROI}}$ from the cropped regions.
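A minimal PyTorch sketch of this feature-extraction step is shown below. It is an illustration rather than the released FLMR code: it assumes the Hugging Face CLIP ViT-base checkpoint as $\mathcal{F}_V$, hypothetical ROI boxes in place of VinVL detections, and the projected CLIP embedding as the $d_V$-dimensional latent.

```python
# Sketch (not the official FLMR code): global + ROI visual latents from CLIP.
# Assumptions: ROI boxes are (x1, y1, x2, y2) pixel coordinates from an external
# detector such as VinVL; the projected CLIP embedding (d_V = 512 for this
# checkpoint) is used as the per-image / per-region latent.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-base-patch32"
)

def encode_images(images):
    """Encode a list of PIL images into d_V-dimensional latents."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        return vision_encoder(**inputs).image_embeds  # shape: (len(images), d_V)

image = Image.open("example.jpg").convert("RGB")        # placeholder image path
roi_boxes = [(10, 20, 200, 180), (150, 40, 320, 300)]   # hypothetical VinVL boxes

g = encode_images([image])                                 # global latent, (1, d_V)
rois = encode_images([image.crop(b) for b in roi_boxes])   # ROI latents, (N_ROI, d_V)
visual_features = torch.cat([g, rois], dim=0)              # (N_ROI + 1, d_V)
```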
#### Token-Level Embeddings

Multi-dimensional, token-level embeddings are used to improve retrieval: token-level embeddings from both the text and the visual inputs are concatenated. To align the two modalities, they train a mapping network $\mathcal{F}_M$ that projects visual features from $\mathcal{F}_V$ into the latent space of the language model $\mathcal{F}_L$, producing $N_{vt}$ visual tokens per visual feature.

The final query embeddings $\mathbf{Q}$ are
$$
\mathbf{Q} = [\mathcal{F}_L(q), \mathcal{F}_M([g, r_1, r_2, \dots, r_{N_{ROI}}])] \in \mathbb{R}^{l_Q \times d_L},
$$
where $l_Q = l_q + (N_{ROI} + 1) \times N_{vt}$ and $l_q$ is the length of the question $q$.

#### Multi-Modal Late Interaction

This study computes the relevance score between a question–image pair $\bar{\mathbf{q}} = (q, I)$ and a document $d$ as
$$
r(\bar{\mathbf{q}}, d) = r((q, I), d) = \sum_{i=1}^{l_Q} \max_{j=1}^{l_D} \mathbf{Q}_i \mathbf{D}_j^\top,
$$
where $\mathbf{D} \in \mathbb{R}^{l_D \times d_L}$ contains the token-level embeddings of document $d$ from the knowledge base and $l_D$ is the length of $d$.

### Answer Generation

The answer generator $\mathcal{F}_A$ with parameters $\phi$ generates an answer from the best candidate according to the joint probability of retrieval and answer generation. The top-$K$ documents are first retrieved:
$$
\{ d_k \}_{k=1}^K = \mathrm{topK}_d(p_\theta(d \mid \bar{\mathbf{q}})),
$$
where $\theta$ denotes the parameters of $\mathcal{F}_V$, $\mathcal{F}_L$, and $\mathcal{F}_M$. The final answer is then selected by
$$
\hat{y}, \hat{d} = \underset{y, d_k}{\arg \max} \ p(y, d_k \mid \bar{\mathbf{q}}) = \underset{y, d_k}{\arg \max} \ p_\phi(y \mid \bar{\mathbf{q}}, d_k) \, p_\theta(d_k \mid \bar{\mathbf{q}})
$$

### Training Loss

For the **retriever**, this study uses the contrastive loss $\mathcal{L}_{CL}$:
$$
\mathcal{L}_{CL} = - \sum_{q, d^\ast \in \mathcal{D}} \log \frac{\exp (r(q, d^\ast))}{\exp(r(q, d^\ast)) + \sum_{z \in \mathcal{N}(q)} \exp (r(q, z))},
$$
where $d^\ast$ is a gold (positive) document for $q$ and $\mathcal{N}(q)$ is the set of negative documents for $q$.

For the **answer generator**, they use the cross-entropy loss over the generated sequences:
$$
\mathcal{L} = - \sum_{\bar{\mathbf{q}}, \mathcal{S} \in \mathcal{T}} \sum_{k=1}^K \log p_\phi (s_k^\ast \mid \bar{\mathbf{q}}, d_k),
$$
where $\mathcal{T}$ is the whole dataset, $\mathcal{S}$ is the set of human responses, and $s_k^\ast \in \mathcal{S}$ is the most popular human answer that appears in the document $d_k$.

## Experiment

Backbone models:
- $\mathcal{F}_V$: [CLIP ViT-base](https://huggingface.co/openai/clip-vit-base-patch32)
- $\mathcal{F}_L$: [ColBERTv2](https://huggingface.co/colbert-ir/colbertv2.0)
- $\mathcal{F}_M$: 2-layer multi-layer perceptron
- $\mathcal{F}_A$: [T5-large](https://huggingface.co/google-t5/t5-large) / [BLIP2-Flan-T5-XL](https://huggingface.co/Salesforce/blip2-flan-t5-xl)

Below is the model performance on the visual QA dataset [OK-VQA](https://okvqa.allenai.org/).

![image](https://hackmd.io/_uploads/ByLBap2KA.png)

Retrieval performance on the [Google Search (GS)](https://aclanthology.org/2021.emnlp-main.517/) and Wikipedia corpora:

![image](https://hackmd.io/_uploads/SkwUpa3KC.png)

Case study comparing several model variants:

![image](https://hackmd.io/_uploads/rJdMpT2KC.png)
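To make the retriever's scoring step concrete, the sketch below implements the late-interaction (MaxSim) relevance score from the Method section and a brute-force top-$K$ selection over candidate documents. The shapes and random placeholder embeddings are assumptions for illustration, not the released FLMR implementation; in FLMR, $\mathbf{Q}$ concatenates question-token embeddings with mapped visual tokens, and $\mathbf{D}$ comes from the document encoder.

```python
# Sketch of fine-grained late interaction: every query token keeps only its
# best-matching document token, and the per-token maxima are summed.
import torch

def late_interaction_score(Q: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """r(q, d) = sum_i max_j Q_i · D_j, with Q: (l_Q, d_L) and D: (l_D, d_L)."""
    sim = Q @ D.T                        # (l_Q, l_D) token-to-token similarities
    return sim.max(dim=1).values.sum()   # MaxSim per query token, then sum

def retrieve_top_k(Q: torch.Tensor, docs: list, k: int = 5) -> list:
    """Score every candidate document and return the indices of the top-K."""
    scores = torch.stack([late_interaction_score(Q, D) for D in docs])
    return scores.topk(min(k, len(docs))).indices.tolist()

# Toy example: a 12-token query against 100 random 80-token documents.
d_L = 128
Q = torch.randn(12, d_L)
docs = [torch.randn(80, d_L) for _ in range(100)]
print(retrieve_top_k(Q, docs, k=5))
```

Because each query token interacts with documents only through its best-matching document token, document token embeddings can be pre-computed and indexed offline, which is what keeps this fine-grained interaction practical at retrieval scale.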