<style>
img { display: block; margin-left: auto; margin-right: auto; }
</style>

> [Paper link](https://arxiv.org/abs/2211.12561) | [Note link](https://blog.csdn.net/m0_55982600/article/details/138010394) | ICML 2023

:::success
**Thoughts**

This study proposes the first retrieval-augmented multimodal model capable of **retrieving and generating both text and images**. It can be applied to various downstream tasks such as caption-to-image generation, image-to-caption generation, image infilling and editing, and more.
:::

## Abstract

Unlike multimodal models such as DALL-E and CM3, this study does not store all knowledge within the model parameters. Instead, it proposes a retrieval-augmented multimodal model: a base multimodal model (the generator) accesses relevant text and images fetched by a retriever from an external memory.

## Background

Multimodal models can perform tasks such as generating images from text, generating text from images, and even generating both text and images. Most of these models require a large number of parameters to store their knowledge.

Recently, retrieval-augmented generation (RAG) has shown promise in natural language processing (NLP). Given an input text, a RAG model uses a **retriever** to fetch relevant documents from an external memory and a **generator** to produce predictions conditioned on the retrieved documents.

Can we apply the RAG framework to multimodal models?

## Method

This study proposes the first retrieval-augmented multimodal model capable of **retrieving and generating both text and images**.

![image](https://hackmd.io/_uploads/HJVDYXjF0.png)

- **Retriever**: retrieves relevant multimodal documents from an external memory.
- **Generator**: uses the retrieved documents to make predictions for the input document.

This study builds the model on the CM3 Transformer architecture.

:::info
**Causal masked multimodal model (CM3)**

CM3 is a Transformer decoder designed for multimodal documents. In particular, it formats each multimodal document as an HTML sequence such as `<img alt=[text] src=[image]>`, where `[text]` is a sequence of text tokens and `[image]` is a sequence of image tokens produced by an image tokenizer.

![image](https://hackmd.io/_uploads/rywk2QjFC.png)
:::

### Dense retriever

A retriever $r$ takes a query $q$ and a candidate document $m$ from the memory $\mathcal{M}$ and returns a relevance score $r(q, m)$:

$$
r(q, m) = E_Q(q)^\top E_M(m)
$$

where $E_Q$ and $E_M$ are the encoders for the query and the memory document, respectively. In this study, $E_Q$ and $E_M$ must encode both text and image content, so both are implemented with CLIP. Candidates in the memory are ranked with maximum inner product search, and the final $K$ retrieved documents are sampled from this ranked list according to the strategy below (a code sketch of the retrieval step appears at the end of this section).

### Retrieval strategy

1. **Relevance**: the retrieved documents need to be relevant to the input sequence.
2. **Modality**: retrieving multimodal documents that contain both images and text leads to better generator performance.
3. **Diversity**: ensuring diversity among the retrieved documents (e.g., avoiding near-duplicates) is important.

### Multimodal generator

This study uses CM3 as the backbone model. CM3 is fed both the retrieved documents $(m_1, \dots, m_K)$ and the input sequence $x$. To train the generator, they optimize the following loss (a token-level sketch appears at the end of this section):

$$
\begin{align}
\mathcal{L} & = \mathcal{L}_{\mathrm{main}} + \alpha \mathcal{L}_{\mathrm{retr}} \\
& = - \log p(x \mid m_1, \dots, m_K) - \alpha \log p(m_1, \dots, m_K)
\end{align}
$$
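To make the CM3 document format concrete, below is a minimal sketch of serializing a caption and its image tokens into the HTML-like sequence described above. The helper name `to_cm3_document` and the space-separated rendering of image token ids are illustrative assumptions, not the paper's exact tokenizer output.

```python
def to_cm3_document(caption: str, image_tokens: list[int]) -> str:
    """Serialize one multimodal document as a CM3-style HTML sequence,
    <img alt=[text] src=[image]>, where [image] is a sequence of discrete
    token ids produced by an image tokenizer (e.g., a VQ model).
    The space-separated rendering of ids is illustrative, not the paper's."""
    image_part = " ".join(str(t) for t in image_tokens)
    return f"<img alt={caption} src={image_part}>"

# Example with made-up image token ids
print(to_cm3_document("a dog on the beach", [512, 7, 88, 2049]))
# <img alt=a dog on the beach src=512 7 88 2049>
```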
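Next, a minimal sketch of the CLIP-based dense retriever. It assumes the Hugging Face `transformers` checkpoint `openai/clip-vit-base-patch32`, and it assumes a multimodal memory document is encoded as the average of its L2-normalized CLIP text and image embeddings; the paper's exact encoding and sampling details may differ.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def encode_document(text, image):
    """Encode a multimodal (text, image) document with CLIP.
    Assumption: E_M averages the L2-normalized text and image embeddings."""
    inputs = processor(text=[text], images=[image], return_tensors="pt",
                       padding=True, truncation=True)
    t = model.get_text_features(input_ids=inputs["input_ids"],
                                attention_mask=inputs["attention_mask"])
    v = model.get_image_features(pixel_values=inputs["pixel_values"])
    t = t / t.norm(dim=-1, keepdim=True)
    v = v / v.norm(dim=-1, keepdim=True)
    return ((t + v) / 2).squeeze(0)

def retrieve_top_k(query_emb, memory_embs, k=2):
    """Maximum inner product search: score every memory document with
    r(q, m) = E_Q(q)^T E_M(m) and return the indices of the top K."""
    scores = memory_embs @ query_emb  # (|M|,)
    return torch.topk(scores, k).indices.tolist()
```

In practice the top-of-list candidates would then be filtered for modality and diversity before the final $K$ documents are chosen, per the retrieval strategy above.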
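Finally, a token-level sketch of the training loss. The sequence layout $(m_1, \dots, m_K, x)$ with a `main_start` boundary index and the default `alpha=0.1` are illustrative assumptions; $\alpha$ is a hyperparameter in the paper.

```python
import torch
import torch.nn.functional as F

def ra_cm3_loss(logits, tokens, main_start, alpha=0.1):
    """L = L_main + alpha * L_retr for one training sequence laid out as
    [m_1, ..., m_K, x]: tokens[:main_start] belong to the retrieved
    documents, tokens[main_start:] to the input document x (assumed layout).

    logits: (T, V) next-token logits; tokens: (T,) token ids.
    """
    # Position i predicts tokens[i + 1], so drop the last logit / first token.
    logp = F.log_softmax(logits[:-1], dim=-1)
    token_logp = logp.gather(-1, tokens[1:].unsqueeze(-1)).squeeze(-1)  # (T-1,)
    l_retr = -token_logp[: main_start - 1].sum()   # -log p(m_1, ..., m_K)
    l_main = -token_logp[main_start - 1 :].sum()   # -log p(x | m_1, ..., m_K)
    return l_main + alpha * l_retr
```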
## Experiment

This study uses LAION as the external memory.

They test several tasks:

- Caption-to-image generation
- Image-to-caption generation
- Knowledge-intensive multimodal generation
- Image infilling and editing
- Controlled image generation
- One-shot and few-shot image classification

Caption-to-image generation performance on MS-COCO:

![image](https://hackmd.io/_uploads/ryLJxNjKC.png)

Image-to-caption generation performance on MS-COCO:

![image](https://hackmd.io/_uploads/HJ9ggEjKA.png)

A sample case of controllable image generation:

![image](https://hackmd.io/_uploads/Sy1ub4iYC.png)

A sample case of image editing:

![image](https://hackmd.io/_uploads/BJk2W4iFR.png)