<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/abs/2005.11401) | [Note link](https://baoyu.io/translations/ai-paper/2005.11401-retrieval-augmented-generation-for-knowledge-intensive-nlp-tasks) | [Code link](https://github.com/huggingface/transformers/tree/main/examples/research_projects/rag) | NeurIPS 2020
:::success
**Thoughts**
This study develops a Retrieval-Augmented Generation (RAG) model to address limitations in language models.
It combines a pre-trained transformer with a dense vector index of Wikipedia, using a neural retriever to improve task-specific performance and reduce false information.
:::
## Abstract
Although extensively pre-trained large language models achieve impressive performance, they still struggle on knowledge-intensive tasks, where precise access to and manipulation of knowledge is required. This study presents a general fine-tuning recipe that augments such models with retrieval-augmented generation (RAG).
## Background
Today's pre-trained language models may produce **hallucinations**: output that contains false or misleading information presented as if it were factual.
Recently, hybrid models that combine parametric memory with non-parametric (i.e., retrieval-based) memories have shown promise in addressing some of these issues. Two studies, REALM and ORQA, use masked language models with a differentiable retriever, demonstrating improved results in open-domain extractive question answering.
## Method
This study builds RAG models, where the parametric memory is a pre-trained seq2seq transformer, and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. These components are combined in a probabilistic model and trained end-to-end.

### Retriever
Purpose: Returns (top-K truncated) distributions over text passages given a query $x$.
It uses Maximum Inner Product Search (MIPS) to find the top-K documents with the highest prior probability.
$$
p_\eta (z \mid x) \propto \exp\left( \mathbf{d}(z)^\top \mathbf{q}(x) \right)
$$
where $\mathbf{d}(z) = \mathrm{BERT}_d(z)$ is the document encoder and $\mathbf{q}(x) = \mathrm{BERT}_q(x)$ is the query encoder.
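A minimal NumPy sketch of this retrieval step, assuming the document and query embeddings have already been produced by $\mathrm{BERT}_d$ and $\mathrm{BERT}_q$ (the helper name is illustrative; a real system would use an approximate MIPS index such as FAISS instead of the brute-force dot product below):
```python
import numpy as np

def retrieve_top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 5):
    """Return indices and normalized prior probabilities of the top-k
    documents under p_eta(z | x) ∝ exp(d(z)^T q(x))."""
    scores = doc_embs @ query_emb                  # inner products d(z)^T q(x)
    top_idx = np.argpartition(-scores, k)[:k]      # brute-force stand-in for MIPS
    top_idx = top_idx[np.argsort(-scores[top_idx])]
    top_scores = scores[top_idx]
    probs = np.exp(top_scores - top_scores.max())  # softmax over the top-k only
    probs /= probs.sum()
    return top_idx, probs

# toy usage with random vectors standing in for BERT_d / BERT_q outputs
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(1000, 768))
query_emb = rng.normal(size=768)
idx, p = retrieve_top_k(query_emb, doc_embs, k=5)
```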
#### RAG-Sequence
It uses the same retrieved document to generate the complete sequence.
#### RAG-Token
It can draw a different latent document for each target token and marginalize accordingly.
This allows the generator to choose content from several documents when producing an answer.
Backbone model: BERT-base (the DPR bi-encoder)
### Generator
Purpose: Generates each token $y_i$ conditioned on the input $x$, the retrieved passage $z$, and the previously generated tokens $y_{1:i-1}$.
$$
p_\theta(y_i \mid x, z, y_{1:i-1})
$$
Backbone model: BART
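The generator sees the retrieved passage simply by concatenating it with the input. A hedged sketch of that conditioning pattern using an off-the-shelf BART checkpoint (not the fine-tuned RAG weights, so the output quality is not the point; the separator string is arbitrary):
```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

passage = "The Eiffel Tower is located in Paris and was completed in 1889."
query = "When was the Eiffel Tower completed?"

# p_theta(y_i | x, z, y_{1:i-1}): the retrieved passage z is prepended to the query x
inputs = tokenizer(passage + " // " + query, return_tensors="pt")
output_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=32)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```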
#### RAG-Token
It can still be treated as a standard autoregressive sequence-to-sequence generator, whose per-token transition probability marginalizes over the top-K retrieved documents:
$$
p^\prime_\theta (y_i \mid x, y_{1:i-1}) = \sum_{z \in \mathrm{top-}k (p(\cdot \mid x))} p_\eta(z_i \mid x) p_\theta(y_i \mid x, z_i, y_{1:i-1})
$$
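A minimal NumPy sketch of that per-token marginalization (the function name and the toy numbers are illustrative; in practice the per-document next-token distributions come from the BART generator):
```python
import numpy as np

def rag_token_next_token_dist(token_dists: np.ndarray, doc_probs: np.ndarray) -> np.ndarray:
    """Marginalize next-token distributions over the top-k retrieved documents.

    token_dists: (k, vocab_size), row j = p_theta(. | x, z_j, y_{1:i-1})
    doc_probs:   (k,), p_eta(z_j | x) for the same documents
    returns:     (vocab_size,) mixture p'_theta(. | x, y_{1:i-1})
    """
    return doc_probs @ token_dists  # sum_j p_eta(z_j|x) * p_theta(.|x, z_j, ...)

# toy example: 3 documents, vocabulary of 5 tokens
token_dists = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.15, 0.10],
])
doc_probs = np.array([0.5, 0.3, 0.2])
print(rag_token_next_token_dist(token_dists, doc_probs))
```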
#### RAG-Sequence
Its sequence-level likelihood does not reduce to a single per-token transition distribution, so it cannot be decoded with a single beam search.
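For reference, the sequence-level marginal from the paper, which decoding has to approximately maximize, is
$$
p_{\text{RAG-Sequence}}(y \mid x) \approx \sum_{z \in \mathrm{top-}k(p(\cdot \mid x))} p_\eta(z \mid x) \prod_{i} p_\theta(y_i \mid x, z, y_{1:i-1})
$$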
In Thorough Decoding, beam search is run separately for each document $z$, scoring hypotheses with $p_\theta(y_i \mid x, z, y_{1:i-1})$. If a hypothesis $y$ does not appear in the beam of some document $z$, an additional forward pass is run for that document to estimate $p_\theta(y \mid x, z)$; each score is multiplied by $p_\eta(z \mid x)$ and summed across beams (one beam per document) to obtain the marginal.
In Fast Decoding, to avoid the many extra forward passes needed for long output sequences, the approximation $p_\theta(y \mid x, z_i) \approx 0$ is used for any hypothesis $y$ that was not generated during beam search from $x, z_i$, so no further passes are needed once the candidate set $Y$ has been produced.
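A rough Python sketch of the two decoding strategies, under stated assumptions: `beam_search(x, z)` and `seq_logprob(y, x, z)` are hypothetical helpers standing in for per-document beam search and an extra generator forward pass, respectively.
```python
import math

def rag_sequence_decode(x, docs, doc_probs, beam_search, seq_logprob, fast=False):
    """Score candidate outputs for RAG-Sequence decoding.

    docs / doc_probs: top-k retrieved documents and their p_eta(z | x).
    beam_search(x, z) -> dict {candidate y: log p_theta(y | x, z)} from one beam.
    seq_logprob(y, x, z) -> log p_theta(y | x, z) via an additional forward pass.
    """
    beams = [beam_search(x, z) for z in docs]
    candidates = set().union(*[set(b) for b in beams])  # candidate set Y

    scores = {}
    for y in candidates:
        total = 0.0
        for z, p_z, beam in zip(docs, doc_probs, beams):
            if y in beam:
                logp = beam[y]
            elif fast:
                continue  # Fast Decoding: p_theta(y | x, z) ≈ 0
            else:
                logp = seq_logprob(y, x, z)  # Thorough Decoding: extra forward pass
            total += p_z * math.exp(logp)    # p_eta(z|x) * p_theta(y|x, z)
        scores[y] = total                    # marginal probability of y
    return max(scores, key=scores.get)
```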
## Experiment
The study evaluates four tasks:
1. **Open-Domain Question Answering**: Answering questions based on a broad range of topics from diverse sources.
2. **Abstractive Question Answering**: Generating answers that paraphrase or summarize information from a given context.
3. **Jeopardy Question Generation**: Creating questions in the style of the game show "Jeopardy!" based on given answers.
4. **Fact Verification**: Assessing the accuracy of factual claims by comparing them with reliable sources.
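To try the released models end to end, the linked Hugging Face `transformers` code exposes RAG directly. A hedged sketch, assuming a `transformers` version that still ships the RAG classes, and using the dummy index so the full Wikipedia index is not downloaded:
```python
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# RAG-Token model fine-tuned on Natural Questions, with its DPR retriever
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

inputs = tokenizer("who wrote the origin of species", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```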

