<style> img { display: block; margin-left: auto; margin-right: auto; } </style>

> [Paper link](https://arxiv.org/abs/2202.01110) | [Note link](https://zhuanlan.zhihu.com/p/616875558) | arXiv 2022

:::success
**Thoughts**

This paper provides some basic domain knowledge on **Retrieval-Augmented Text Generation**. I think two points are especially important: designing **customized metrics** for retrieval, and how to do the **integration** between the original input and the retrieved data.
:::

## Abstract

This paper is a survey of retrieval-augmented text generation. It:

- Highlights the generic paradigm of retrieval-augmented generation
- Reviews notable approaches organized by task
    - Dialogue response generation
    - Machine translation
    - Other tasks
- Points out promising directions on top of recent methods to facilitate future research

## Introduction

Retrieval-augmented generation has some remarkable advantages:

1. Knowledge does not have to be implicitly stored in model parameters; it can be explicitly acquired in a plug-and-play manner, leading to great scalability.
2. Instead of generating from scratch, the paradigm generates text from retrieved human-written references, which potentially alleviates the difficulty of text generation.

## Retrieval-Augmented Paradigm

The retrieval-augmented generation paradigm has three major components:

1. Retrieval source
2. Retrieval metric
3. Integration method

![](https://hackmd.io/_uploads/HkZFK9wo3.png)

### Formulation

- $\boldsymbol{x}$: input sequence
- $\boldsymbol{y} = f(\boldsymbol{x})$: output sequence

Retrieval-augmented generation can then be formulated as:

$$
\tag{1} \boldsymbol{y} = f(\boldsymbol{x}, \boldsymbol{z})
$$

where $\boldsymbol{z} = \{ \langle \boldsymbol{x}^r, \boldsymbol{y}^r \rangle \}$ is a set of relevant instances retrieved from the original training set or from external datasets.
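As a minimal sketch of Eq. (1), the paradigm splits into a retrieval step producing $\boldsymbol{z}$ and a generation step consuming both $\boldsymbol{x}$ and $\boldsymbol{z}$. The `retrieve` and `generate` functions below are hypothetical toy stand-ins (word-overlap ranking and echoing the best $\boldsymbol{y}^r$), not the survey's actual models:

```python
def retrieve(x, corpus, k=2):
    """Toy retrieval: rank (x_r, y_r) pairs by word overlap with the input x."""
    def overlap(a, b):
        return len(set(a.split()) & set(b.split()))
    return sorted(corpus, key=lambda pair: overlap(x, pair[0]), reverse=True)[:k]

def generate(x, z):
    """Placeholder generator f(x, z): here it simply echoes the best y_r."""
    return z[0][1] if z else ""

corpus = [("how are you", "i am fine"), ("what is your name", "i am a bot")]
z = retrieve("how are you today", corpus, k=1)   # z = {<x_r, y_r>}
y = generate("how are you today", z)             # y = f(x, z)
```

In a real system, `generate` would be a neural model conditioned on both the input and the retrieved pairs.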
The main idea of this paradigm is that $\boldsymbol{y}^r$ may benefit the response generation, provided that $\boldsymbol{x}^r$ (or $\boldsymbol{y}^r$) is similar (or relevant) to the input $\boldsymbol{x}$.

### Retrieval Sources

The retrieval memory can come from three kinds of sources.

**Training Corpus**
Most previous studies search for the external memory in the *training corpus* itself.

**External Data**
Relevant samples are retrieved from *external datasets*.

**Unsupervised Data**
One limitation of the previous two sources is that they must be supervised datasets consisting of aligned input-output pairs. The main idea here is to align source-side sentences and the corresponding target-side translations in a dense vector space, i.e., to align $\boldsymbol{x}$ and $\boldsymbol{y}^r$ directly when $\boldsymbol{x}^r$ is absent.

### Retrieval Metrics

The metrics that evaluate relevance between texts vary as well.

**Sparse-vector Retrieval**
Given an input sequence $\boldsymbol{x}$ and a retrieval corpus, the retrieval model aims to retrieve a set of relevant examples $\boldsymbol{z} = \{ \langle \boldsymbol{x}^r, \boldsymbol{y}^r \rangle \}$ from the corpus. When a supervised corpus is used, $\{ \langle \boldsymbol{x}^r, \boldsymbol{y}^r \rangle \}$ is retrieved by measuring the similarity between $\boldsymbol{x}$ and $\boldsymbol{x}^r$.

**Dense-vector Retrieval**
Sparse-vector methods may fail to retrieve examples that are only semantically relevant. To alleviate this problem, retrieval is performed in a *dense-vector space* instead of by lexical overlap.

**Task-specific Retrieval**
Sometimes the example that is most similar under a universal textual similarity does not serve the downstream model best. Ideally, the retrieval metric would be learned from the data in a task-dependent way: a memory should be considered only if it can indeed boost the quality of the final generation.
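A toy illustration of sparse-vector retrieval, using a tiny hand-rolled TF-IDF with cosine similarity (real systems would use BM25 or, for the dense case, a learned encoder; this is only a sketch of the lexical-overlap idea):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors for whitespace-tokenized documents."""
    tokenized = [d.split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))  # document frequency
    n = len(docs)
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = ["the cat sat on the mat", "dogs chase cats", "stock prices rose sharply"]
query = "a cat on a mat"
vecs = tfidf_vectors(corpus + [query])
qvec, dvecs = vecs[-1], vecs[:-1]
best = max(range(len(corpus)), key=lambda i: cosine(qvec, dvecs[i]))
```

Note how `"dogs chase cats"` scores zero here despite being semantically related to the query: exact-token sparse retrieval misses it, which is precisely the weakness dense-vector retrieval addresses.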
### Integration

**Data Augmentation**
This method constructs augmented inputs by concatenating spans from $\{ \langle \boldsymbol{x}^r, \boldsymbol{y}^r \rangle \}$ with the original input $\boldsymbol{x}$. The model then has to learn on its own how to integrate the retrieved information.

**Attention Mechanisms**
The main idea of this fashion is to adopt additional encoders for the retrieved target sentences and integrate their outputs through attention.

**Skeleton Extraction**
In the previous two methods, the downstream generation model learns implicitly how to filter out irrelevant or even harmful information from the retrieved examples. Some works instead try to **extract the useful information explicitly**.

## Dialogue Response Generation

Dialogue systems can be grouped into two categories:

- Chit-chat systems
- Task-oriented systems

In a chit-chat dialogue system, both the dialogue history and external knowledge are important. Most modern chit-chat dialogue systems fall into two classes:

- **Retrieval-based models**: give informative but sometimes inappropriate responses
- **Generation-based models**: have better generalization capacity when handling unseen dialogue contexts, but the generated utterances tend to be dull and non-informative

**Shallow Integration**
This approach extends the standard SEQ2SEQ encoder-decoder model with an extra encoder for the retrieval result. The output of the extra encoder, along with the output of the original encoder for the dialogue history, is fed to the decoder.

> Challenge: how to feed both outputs into the decoder?

**Deep Integration**
This is a general framework that first extracts a skeleton from the retrieved response and then generates the response based on the extracted skeleton.

> Challenge: the generation model easily learns to ignore the retrieved response entirely and collapses to a vanilla seq2seq model.
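The data-augmentation style of integration described above can be sketched as plain string concatenation before the (unmodified) seq2seq model sees the input. The `[SEP]` separator and the `augment_input` helper are hypothetical; actual systems use model-specific special tokens:

```python
def augment_input(x, retrieved, sep="[SEP]", max_pairs=2):
    """Concatenate up to max_pairs retrieved (x_r, y_r) pairs onto the input x."""
    parts = [x]
    for x_r, y_r in retrieved[:max_pairs]:
        parts.append(x_r)
        parts.append(y_r)
    return f" {sep} ".join(parts)

retrieved = [("how are you", "i am fine")]
aug = augment_input("how are you today", retrieved)
# The augmented string is then fed to a standard seq2seq model unchanged.
```

The appeal of this design is that the generation model needs no architectural change; the cost is that filtering irrelevant retrieved content is left entirely to the model, which motivates the attention-based and skeleton-extraction alternatives.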
**Knowledge-Enhanced Generation**
By conditioning the generation on retrieved responses, retrieval-based dialogue systems can be used to build better generation-based models.

**Limitations**

1. Multiple retrieved responses should be used, not only one
2. More customized retrieval metrics are needed, especially for controlled dialogue response generation
3. The retrieval pool should be enlarged

## Other Tasks

**Language Modelling**
Leveraging information from a retrieval memory can improve the performance of large pre-trained language models.

## Future Directions

**Retrieval Sensitivity**
Current retrieval-augmented text generation models perform well when the retrieved examples are very similar to the query, but they can be even worse than generation models without retrieval when the retrieved examples are less similar.

**Retrieval Efficiency**
Overall inference for retrieval-augmented generation models is less efficient due to the considerable retrieval overhead, so there is a trade-off between retrieval memory size and retrieval efficiency.

**Local vs. Global Optimization**
In practice, there is an essential gap in the retrieval metric between the training and inference phases: in the **training phase**, the loss is locally back-propagated to only a few retrieved examples, while in the **inference phase** the metric is applied globally over all examples in the memory.

**Diverse & Controllable Retrieval**
Future work should explore customized metrics for retrieval, which can benefit more controlled text generation. Such metrics are more desirable in personalized dialogue generation, and parallel data that contains specific terminologies is more helpful in machine translation.

## Conclusion

This paper surveys recent approaches to retrieval-augmented text generation.