<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/abs/2005.11401) | [Note link](https://baoyu.io/translations/ai-paper/2005.11401-retrieval-augmented-generation-for-knowledge-intensive-nlp-tasks) | [Code link](https://github.com/huggingface/transformers/tree/main/examples/research_projects/rag) | NeurIPS 2020
:::success
**Thoughts**
This study develops a Retrieval-Augmented Generation (RAG) model to address limitations in language models.
It combines a pre-trained transformer with a dense vector index of Wikipedia, using a neural retriever to improve task-specific performance and reduce false information.
:::
## Abstract
Although extensively pre-trained large language models achieve impressive performance, they still struggle on knowledge-intensive tasks, where precise access to and manipulation of knowledge is required. This study presents a general fine-tuning recipe that augments such models with retrieval-augmented generation (RAG).
## Background
Today's pre-trained language models may produce **hallucinations**: output that contains false or misleading information presented as if it were factual.
Recently, hybrid models that combine parametric memory with non-parametric (i.e., retrieval-based) memories have shown promise in addressing some of these issues. Two studies, REALM and ORQA, use masked language models with a differentiable retriever, demonstrating improved results in open-domain extractive question answering.
## Method
This study builds RAG models, where the parametric memory is a pre-trained seq2seq transformer, and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. These components are combined in a probabilistic model and trained end-to-end.

### Retriever
Purpose: Returns (top-K truncated) distributions over text passages given a query $x$.
It uses Maximum Inner Product Search (MIPS) to find the top-K documents with the highest prior probability.
$$
p_\eta (z \mid x) \propto \exp\left( \mathbf{d}(z)^\top \mathbf{q}(x) \right)
$$
where $\mathbf{d}(z) = \mathrm{BERT}_d(z)$ is the document encoder and $\mathbf{q}(x) = \mathrm{BERT}_q(x)$ is the query encoder.
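A minimal NumPy sketch of this retrieval step, assuming the document and query embeddings have already been produced by $\mathrm{BERT}_d$ and $\mathrm{BERT}_q$ (the helper name is illustrative; a real system would use an approximate MIPS index such as FAISS instead of the brute-force dot product below):
```python
import numpy as np

def retrieve_top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 5):
    """Return indices and normalized prior probabilities of the top-k
    documents under p_eta(z | x) ∝ exp(d(z)^T q(x))."""
    scores = doc_embs @ query_emb                  # inner products d(z)^T q(x)
    top_idx = np.argpartition(-scores, k)[:k]      # brute-force stand-in for MIPS
    top_idx = top_idx[np.argsort(-scores[top_idx])]
    top_scores = scores[top_idx]
    probs = np.exp(top_scores - top_scores.max())  # softmax over the top-k only
    probs /= probs.sum()
    return top_idx, probs

# toy usage with random vectors standing in for BERT_d / BERT_q outputs
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(1000, 768))
query_emb = rng.normal(size=768)
idx, p = retrieve_top_k(query_emb, doc_embs, k=5)
```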
#### RAG-Sequence
It uses the same retrieved document to generate the complete sequence.
#### RAG-Token
It can draw a different latent document for each target token and marginalize accordingly.
This allows the generator to choose content from several documents when producing an answer.
Backbone model: BERT-base (the DPR bi-encoder)
### Generator
Purpose: Generates each token $y_i$ conditioned on the input $x$, the retrieved passage $z$, and the previously generated tokens $y_{1:i-1}$.
$$
p_\theta(y_i \mid x, z, y_{1:i-1})
$$
Backbone model: BART
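The generator sees the retrieved passage simply by concatenating it with the input. A hedged sketch of that conditioning pattern using an off-the-shelf BART checkpoint (not the fine-tuned RAG weights, so the output quality is not the point; the separator string is arbitrary):
```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

passage = "The Eiffel Tower is located in Paris and was completed in 1889."
query = "When was the Eiffel Tower completed?"

# p_theta(y_i | x, z, y_{1:i-1}): the retrieved passage z is prepended to the query x
inputs = tokenizer(passage + " // " + query, return_tensors="pt")
output_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=32)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```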
#### RAG-Token
It can still be treated as a standard autoregressive sequence-to-sequence generator, whose per-token transition probability marginalizes over the top-K retrieved documents:
$$
p^\prime_\theta (y_i \mid x, y_{1:i-1}) = \sum_{z \in \mathrm{top-}k (p(\cdot \mid x))} p_\eta(z_i \mid x) p_\theta(y_i \mid x, z_i, y_{1:i-1})
$$
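A minimal NumPy sketch of that per-token marginalization (the function name and the toy numbers are illustrative; in practice the per-document next-token distributions come from the BART generator):
```python
import numpy as np

def rag_token_next_token_dist(token_dists: np.ndarray, doc_probs: np.ndarray) -> np.ndarray:
    """Marginalize next-token distributions over the top-k retrieved documents.

    token_dists: (k, vocab_size), row j = p_theta(. | x, z_j, y_{1:i-1})
    doc_probs:   (k,), p_eta(z_j | x) for the same documents
    returns:     (vocab_size,) mixture p'_theta(. | x, y_{1:i-1})
    """
    return doc_probs @ token_dists  # sum_j p_eta(z_j|x) * p_theta(.|x, z_j, ...)

# toy example: 3 documents, vocabulary of 5 tokens
token_dists = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.15, 0.10],
])
doc_probs = np.array([0.5, 0.3, 0.2])
print(rag_token_next_token_dist(token_dists, doc_probs))
```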
#### RAG-Sequence
Its sequence-level likelihood does not reduce to a single per-token transition distribution, so it cannot be decoded with a single beam search.
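For reference, the sequence-level marginal from the paper, which decoding has to approximately maximize, is
$$
p_{\text{RAG-Sequence}}(y \mid x) \approx \sum_{z \in \mathrm{top-}k(p(\cdot \mid x))} p_\eta(z \mid x) \prod_{i} p_\theta(y_i \mid x, z, y_{1:i-1})
$$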
In Thorough Decoding, beam search is run separately for each document $z$, scoring hypotheses with $p_\theta(y_i \mid x, z, y_{1:i-1})$. If a hypothesis $y$ does not appear in the beam of some document $z$, an additional forward pass is run for that document to estimate $p_\theta(y \mid x, z)$; each score is multiplied by $p_\eta(z \mid x)$ and summed across beams (one beam per document) to obtain the marginal.
In Fast Decoding, to avoid the many extra forward passes needed for long output sequences, the approximation $p_\theta(y \mid x, z_i) \approx 0$ is used for any hypothesis $y$ that was not generated during beam search from $x, z_i$, so no further passes are needed once the candidate set $Y$ has been produced.
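A rough Python sketch of the two decoding strategies, under stated assumptions: `beam_search(x, z)` and `seq_logprob(y, x, z)` are hypothetical helpers standing in for per-document beam search and an extra generator forward pass, respectively.
```python
import math

def rag_sequence_decode(x, docs, doc_probs, beam_search, seq_logprob, fast=False):
    """Score candidate outputs for RAG-Sequence decoding.

    docs / doc_probs: top-k retrieved documents and their p_eta(z | x).
    beam_search(x, z) -> dict {candidate y: log p_theta(y | x, z)} from one beam.
    seq_logprob(y, x, z) -> log p_theta(y | x, z) via an additional forward pass.
    """
    beams = [beam_search(x, z) for z in docs]
    candidates = set().union(*[set(b) for b in beams])  # candidate set Y

    scores = {}
    for y in candidates:
        total = 0.0
        for z, p_z, beam in zip(docs, doc_probs, beams):
            if y in beam:
                logp = beam[y]
            elif fast:
                continue  # Fast Decoding: p_theta(y | x, z) ≈ 0
            else:
                logp = seq_logprob(y, x, z)  # Thorough Decoding: extra forward pass
            total += p_z * math.exp(logp)    # p_eta(z|x) * p_theta(y|x, z)
        scores[y] = total                    # marginal probability of y
    return max(scores, key=scores.get)
```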
## Experiment
The study evaluates four tasks:
1. **Open-Domain Question Answering**: Answering questions based on a broad range of topics from diverse sources.
2. **Abstractive Question Answering**: Generating answers that paraphrase or summarize information from a given context.
3. **Jeopardy Question Generation**: Creating questions in the style of the game show "Jeopardy!" based on given answers.
4. **Fact Verification**: Assessing the accuracy of factual claims by comparing them with reliable sources.
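To try the released models end to end, the linked Hugging Face `transformers` code exposes RAG directly. A hedged sketch, assuming a `transformers` version that still ships the RAG classes, and using the dummy index so the full Wikipedia index is not downloaded:
```python
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# RAG-Token model fine-tuned on Natural Questions, with its DPR retriever
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

inputs = tokenizer("who wrote the origin of species", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```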

