<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/abs/2210.01296) | [Note link](https://blog.csdn.net/qq_45668004/article/details/138465170) | [Code link](https://github.com/Edward-Sun/RECITE) | ICLR 2024
:::success
**Thoughts**
This study uses reciting relevant passages to address knowledge-intensive tasks.
However, I think that in the Passage Hint-Based Diversified Recitation section, they still rely on external corpora such as Wikipedia to obtain knowledge (the passage hints), which helps improve performance by providing better context and accuracy.
:::
## Abstract
Unlike most retrieval methods that try to retrieve relevant documents before generating the output, this study, RECITation-augmented gEneration (RECITE), samples one or several relevant passages from the LLM's own memory and recites them.

## Background
Like previous work we've discussed, recent large language models rely on external corpora and use retrieval-augmentation to solve knowledge-intensive tasks.
This study explores another approach: **few-shot prompting**.
On task-specific NLP tasks, few-shot prompting can help LLMs perform better.
## Method
The goal of this paper is to mimic a human’s ability to recite relevant factoid knowledge before answering knowledge-intensive questions, enabling more accurate answers.
This method has two components:
1. An evidence-recitation module for reciting relevant passages.
2. A question-answering module for generating answers given the recited evidence.
How do they implement this?
### Prompt-based Recite-and-Answer for Question Answering
They prompt the LLM with paired exemplars of questions and recited evidence, allowing the LLM to learn in an in-context manner to generate a recitation for an arbitrary question.

They prepend the recited passages to the original question-answer exemplars as a single prompt and then generate the final answer.
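
A minimal sketch of how the two prompts could be assembled; the exemplar contents, helper names, and templates below are illustrative assumptions rather than the paper's exact format:

```python
# Hypothetical few-shot exemplars; the real prompts use several
# question/recitation pairs drawn from the training split.
RECITE_EXEMPLARS = [
    ("Who wrote 'Pride and Prejudice'?",
     "Pride and Prejudice is an 1813 novel by the English author Jane Austen."),
]

QA_EXEMPLARS = [
    ("Pride and Prejudice is an 1813 novel by the English author Jane Austen.",
     "Who wrote 'Pride and Prejudice'?",
     "Jane Austen"),
]

def build_recitation_prompt(question: str) -> str:
    """Few-shot prompt asking the LLM to recite relevant evidence from memory."""
    shots = "\n\n".join(f"Question: {q}\nRecitation: {r}"
                        for q, r in RECITE_EXEMPLARS)
    return f"{shots}\n\nQuestion: {question}\nRecitation:"

def build_answer_prompt(recitation: str, question: str) -> str:
    """Few-shot prompt that places the recited passage before the question."""
    shots = "\n\n".join(f"Recitation: {r}\nQuestion: {q}\nAnswer: {a}"
                        for r, q, a in QA_EXEMPLARS)
    return f"{shots}\n\nRecitation: {recitation}\nQuestion: {question}\nAnswer:"
```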
Since factual knowledge can appear in several places, they use a multiple-path decoding technique.
Given an arbitrary question, they use top-$k$ sampling to independently generate a few recitations and then greedily decode the answer to the question based on the sampled recitations.
The final answer is selected by taking a plurality/majority vote among the generated answers.
This study also applies the method to **multi-hop questions** by using top-$k$ sampling to generate multiple recitations and then performing majority voting to determine the final answer.
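
A sketch of the multiple-path decoding with plurality voting, reusing the prompt helpers from the sketch above; `llm_generate` stands in for whatever sampling interface the backbone model exposes, and the sampling parameters are illustrative rather than the paper's settings:

```python
from collections import Counter

def recite_and_answer(question: str, llm_generate, k: int = 40, num_paths: int = 5) -> str:
    """Sample several recitation paths, answer each greedily, then majority-vote."""
    answers = []
    for _ in range(num_paths):
        # 1) Sample one recitation with top-k sampling (independent paths).
        recitation = llm_generate(build_recitation_prompt(question),
                                  top_k=k, temperature=0.7)
        # 2) Greedily decode the answer conditioned on that recitation.
        answer = llm_generate(build_answer_prompt(recitation, question),
                              temperature=0.0)
        answers.append(answer.strip())
    # 3) Plurality vote over the independently decoded answers.
    return Counter(answers).most_common(1)[0][0]
```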
### Passage Hint-Based Diversified Recitation with Fine-Tuning
In this section, they aim for the evidence-recitation module to:
1. Avoid generating recitations with incorrect facts.
2. Ensure that the sampled recitations have sufficient diversity.
They construct a unique passage hint for each passage by concatenating its section titles with its in-section order.
The source of these passages is well-formed text knowledge bases, such as Wikipedia.
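
A toy illustration of how a unique passage hint might be formed from the section titles and the in-section order; the separator format is an assumption for illustration:

```python
def make_passage_hint(page_title: str, section_titles: list[str], order_in_section: int) -> str:
    """Concatenate the (sub)section titles with the passage's position in the
    section, e.g. 'Barack Obama > Early life and career, 2'."""
    return " > ".join([page_title, *section_titles]) + f", {order_in_section}"
```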
Inspired by [question-answering with multiple retrieved passages](https://arxiv.org/abs/2007.01282), this study uses aggregated diverse recitations as a single context and generates the answer with a few additional question-answer pair demonstrations.
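
A minimal sketch of aggregating the diverse sampled recitations into a single context before answering, in the spirit of the linked fusion-in-decoder work; the prompt layout (and the omitted question-answer demonstrations) are assumptions:

```python
def build_aggregated_answer_prompt(recitations: list[str], question: str) -> str:
    """Concatenate all sampled recitations into one context, then ask the question.
    In the paper this context is accompanied by a few question-answer
    demonstrations, which are omitted here for brevity."""
    context = "\n".join(f"Passage {i + 1}: {r}" for i, r in enumerate(recitations))
    return f"{context}\n\nQuestion: {question}\nAnswer:"
```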

In the training procedure, they add an additional fine-tuning stage to adapt the LLM to learn the mapping from the question to the passage hint, and then to the full passage, rather than relying merely on few-shot prompting.

Training Details:
1. They use ground-truth evidence and question pairs as the prompt.
2. They generate new questions through in-context learning for randomly sampled passages from Wikipedia pages.
3. Based on the few-shot generated questions, they train the LLM to predict both the original passage hint and the passage content.
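
As a rough sketch, one fine-tuning record under this scheme might look like the following; the field names and templates are assumptions, not the paper's exact data format:

```python
def build_finetune_example(generated_question: str, hint: str, passage: str) -> dict:
    """Given a few-shot generated question for a sampled Wikipedia passage,
    the model is trained to output the passage hint followed by the passage."""
    return {
        "input": f"Question: {generated_question}\nRecite the supporting passage.",
        "target": f"Hint: {hint}\nPassage: {passage}",
    }
```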
## Experiment
### Datasets
This study conducts experiments on three different question-answering datasets:
1. [TriviaQA](https://nlp.cs.washington.edu/triviaqa/)
2. [HotpotQA](https://hotpotqa.github.io/)
3. [Natural Questions](https://github.com/google-research-datasets/natural-questions)
### Evaluation Metrics
1. **Exact Match (EM)**: Measures the percentage of answers that match the ground truth exactly.
2. **F1 Scores**: Measures the harmonic mean of precision and recall to evaluate the correctness of answers.
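
For reference, a standard SQuAD-style implementation of these two metrics (the normalization details may differ slightly from the evaluation script actually used in the paper):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> float:
    return float(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```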
### Backbone Model
1. [PaLM](https://huggingface.co/papers/2204.02311)
2. [UL2](https://huggingface.co/google/ul2)
3. [OPT](https://huggingface.co/facebook/opt-350m)
4. [Codex](https://arxiv.org/abs/2107.03374)
Below is the performance comparison on the different datasets:

The study also includes a performance comparison of PaLM-62B on the Natural Questions (NQ) dataset with different prompt strategies:
