<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/abs/2210.01296) | [Note link](https://blog.csdn.net/qq_45668004/article/details/138465170) | [Code link](https://github.com/Edward-Sun/RECITE) | ICLR 2024
:::success
**Thoughts**
This study uses reciting relevant passages to address knowledge-intensive tasks.
However, I think that in the Passage Hint-Based Diversified Recitation section, they still rely on external corpora such as Wikipedia to obtain knowledge (the passage hints), which helps improve performance by providing better context and accuracy.
:::
## Abstract
Unlike most retrieval methods that try to retrieve relevant documents before generating the output, this study, RECITation-augmented gEneration (RECITE), samples one or several relevant passages from the LLM's own memory and recites them.

## Background
Like previous work we've discussed, recent large language models rely on external corpora and use retrieval-augmentation to solve knowledge-intensive tasks.
This study explores another approach: **few-shot prompting**.
On task-specific NLP tasks, few-shot prompting can help LLMs perform better.
## Method
The goal of this paper is to mimic a human’s ability to recite relevant factoid knowledge before answering knowledge-intensive questions, enabling more accurate answers.
This method has two components:
1. An evidence-recitation module for reciting relevant passages.
2. A question-answering module for generating answers given the recited evidence.
How do they implement this?
### Prompt-based Recite-and-Answer for Question Answering
They prompt the LLM with paired exemplars of questions and recited evidence, allowing the LLM to learn in an in-context manner to generate a recitation for an arbitrary question.

They prepend the recited passages to the original question-answer exemplars as a single prompt and then generate the final answer.
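
A minimal sketch of how the two prompts could be assembled; the exemplar contents, helper names, and templates below are illustrative assumptions rather than the paper's exact format:

```python
# Hypothetical few-shot exemplars; the real prompts use several
# question/recitation pairs drawn from the training split.
RECITE_EXEMPLARS = [
    ("Who wrote 'Pride and Prejudice'?",
     "Pride and Prejudice is an 1813 novel by the English author Jane Austen."),
]

QA_EXEMPLARS = [
    ("Pride and Prejudice is an 1813 novel by the English author Jane Austen.",
     "Who wrote 'Pride and Prejudice'?",
     "Jane Austen"),
]

def build_recitation_prompt(question: str) -> str:
    """Few-shot prompt asking the LLM to recite relevant evidence from memory."""
    shots = "\n\n".join(f"Question: {q}\nRecitation: {r}"
                        for q, r in RECITE_EXEMPLARS)
    return f"{shots}\n\nQuestion: {question}\nRecitation:"

def build_answer_prompt(recitation: str, question: str) -> str:
    """Few-shot prompt that places the recited passage before the question."""
    shots = "\n\n".join(f"Recitation: {r}\nQuestion: {q}\nAnswer: {a}"
                        for r, q, a in QA_EXEMPLARS)
    return f"{shots}\n\nRecitation: {recitation}\nQuestion: {question}\nAnswer:"
```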
Since factual knowledge can appear in several places, they use a multiple-path decoding technique.
Given an arbitrary question, they use top-$k$ sampling to independently generate a few recitations and then greedily decode the answer to the question based on the sampled recitations.
The final answer is selected by taking a plurality/majority vote among the generated answers.
This study also applies the method to **multi-hop questions** by using top-$k$ sampling to generate multiple recitations and then performing majority voting to determine the final answer.
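
A sketch of the multiple-path decoding with plurality voting, reusing the prompt helpers from the sketch above; `llm_generate` stands in for whatever sampling interface the backbone model exposes, and the sampling parameters are illustrative rather than the paper's settings:

```python
from collections import Counter

def recite_and_answer(question: str, llm_generate, k: int = 40, num_paths: int = 5) -> str:
    """Sample several recitation paths, answer each greedily, then majority-vote."""
    answers = []
    for _ in range(num_paths):
        # 1) Sample one recitation with top-k sampling (independent paths).
        recitation = llm_generate(build_recitation_prompt(question),
                                  top_k=k, temperature=0.7)
        # 2) Greedily decode the answer conditioned on that recitation.
        answer = llm_generate(build_answer_prompt(recitation, question),
                              temperature=0.0)
        answers.append(answer.strip())
    # 3) Plurality vote over the independently decoded answers.
    return Counter(answers).most_common(1)[0][0]
```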
### Passage Hint-Based Diversified Recitation with Fine-Tuning
In this section, they aim for the evidence-recitation module to:
1. Avoid generating recitations with incorrect facts.
2. Ensure that the sampled recitations have sufficient diversity.
They construct a unique passage hint for each passage by concatenating its section titles with its in-section order.
The source of these passages is well-formed text knowledge bases, such as Wikipedia.
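
A toy illustration of how a unique passage hint might be formed from the section titles and the in-section order; the separator format is an assumption for illustration:

```python
def make_passage_hint(page_title: str, section_titles: list[str], order_in_section: int) -> str:
    """Concatenate the (sub)section titles with the passage's position in the
    section, e.g. 'Barack Obama > Early life and career, 2'."""
    return " > ".join([page_title, *section_titles]) + f", {order_in_section}"
```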
Inspired by [question-answering with multiple retrieved passages](https://arxiv.org/abs/2007.01282), this study uses aggregated diverse recitations as a single context and generates the answer with a few additional question-answer pair demonstrations.
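
A minimal sketch of aggregating the diverse sampled recitations into a single context before answering, in the spirit of the linked fusion-in-decoder work; the prompt layout (and the omitted question-answer demonstrations) are assumptions:

```python
def build_aggregated_answer_prompt(recitations: list[str], question: str) -> str:
    """Concatenate all sampled recitations into one context, then ask the question.
    In the paper this context is accompanied by a few question-answer
    demonstrations, which are omitted here for brevity."""
    context = "\n".join(f"Passage {i + 1}: {r}" for i, r in enumerate(recitations))
    return f"{context}\n\nQuestion: {question}\nAnswer:"
```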

In the training procedure, they add an additional fine-tuning stage to adapt the LLM to learn the mapping from the question to the passage hint, and then to the full passage, rather than relying merely on few-shot prompting.

Training Details:
1. They use ground-truth evidence and question pairs as the prompt.
2. They generate new questions through in-context learning for randomly sampled passages from Wikipedia pages.
3. Based on the few-shot generated questions, they train the LLM to predict both the original passage hint and the passage content.
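
As a rough sketch, one fine-tuning record under this scheme might look like the following; the field names and templates are assumptions, not the paper's exact data format:

```python
def build_finetune_example(generated_question: str, hint: str, passage: str) -> dict:
    """Given a few-shot generated question for a sampled Wikipedia passage,
    the model is trained to output the passage hint followed by the passage."""
    return {
        "input": f"Question: {generated_question}\nRecite the supporting passage.",
        "target": f"Hint: {hint}\nPassage: {passage}",
    }
```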
## Experiment
### Datasets
This study conducts experiments on three different question-answering datasets:
1. [TriviaQA](https://nlp.cs.washington.edu/triviaqa/)
2. [HotpotQA](https://hotpotqa.github.io/)
3. [Natural Questions](https://github.com/google-research-datasets/natural-questions)
### Evaluation Metrics
1. **Exact Match (EM)**: Measures the percentage of answers that match the ground truth exactly.
2. **F1 Scores**: Measures the harmonic mean of precision and recall to evaluate the correctness of answers.
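
For reference, a standard SQuAD-style implementation of these two metrics (the normalization details may differ slightly from the evaluation script actually used in the paper):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> float:
    return float(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```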
### Backbone Model
1. [PaLM](https://huggingface.co/papers/2204.02311)
2. [UL2](https://huggingface.co/google/ul2)
3. [OPT](https://huggingface.co/facebook/opt-350m)
4. [Codex](https://arxiv.org/abs/2107.03374)
Below is the performance comparison on the different datasets:

The study also includes a performance comparison of PaLM-62B on the Natural Questions (NQ) dataset with different prompt strategies:
