# Autoregressive Search Engines: Generating Substrings as Document Identifiers
###### tags: ```Notes```, ```NLP```
> https://arxiv.org/pdf/2204.10628.pdf
## Abstract
- Motivation: Addressing knowledge-intensive language tasks requires NLP systems to not only provide correct answers but also retrieve supporting evidence from a corpus. The paper proposes using autoregressive language models for generating distinctive ngrams as identifiers for **passage-level retrieval**, offering a new approach to improve search accuracy **without imposing a predefined structure** on the search space.
	- That is, unlike the DSI approach, there is no need to assign semantically structured identifiers to documents.
- Results: The proposed method, leveraging all ngrams in a passage as identifiers, outperforms prior autoregressive approaches and established retrieval solutions on the KILT benchmark by at least 10 points, setting new state-of-the-art performances on some datasets with a significantly lighter memory footprint.
## Introduction
- Knowledge-intensive tasks in NLP, such as open-domain question answering and fact-checking, often combine a search engine with a machine reader to retrieve relevant information and generate answers. The surge in autoregressive language models has boosted machine reader performance but not equally improved retrieval methods.
- The paper introduces an autoregressive model that uses all ngrams in a passage as possible identifiers, eliminating the need for a structured search space and leveraging an efficient data structure to map generated ngrams to full passages.
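To make the "map generated ngrams to full passages" step concrete, here is a toy sketch. SEAL uses a compressed FM-index for this; the naive linear scan below is only a stand-in that works for a tiny in-memory corpus, and all names are illustrative:

```python
# Toy substring-to-passage mapping. SEAL uses a compressed FM-index;
# a naive linear scan stands in for it here, viable only for a tiny
# in-memory corpus.
def passages_containing(ngram: str, corpus: list[str]) -> list[int]:
    """Return indices of passages whose text contains the ngram."""
    return [i for i, passage in enumerate(corpus) if ngram in passage]

corpus = [
    "autoregressive models generate text token by token",
    "the FM-index supports fast substring search",
    "substring search over a compressed full-text index",
]
hits = passages_containing("substring search", corpus)  # -> [1, 2]
```

The key property the FM-index provides is exactly this substring lookup (plus occurrence counts), but in space close to the compressed text and in time independent of corpus size per ngram character.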
## Methodology
- Method Name: Search Engines with Autoregressive LMs (SEAL).
- SEAL combines an autoregressive model (BART) with a compressed full-text substring index (FM-Index) for generating and scoring distinctive ngrams that are mapped to passages. This approach allows for the generation of any span from the corpus **without explicitly encoding all substrings**, using a novel scoring function that combines LM probabilities with FM-index frequencies to improve retrieval accuracy.
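The intuition behind the scoring function can be sketched as follows. The paper defines its own exact formula; the version below is a simplified illustration (function name, arguments, and the concrete form are my stand-ins, not the paper's definition):

```python
import math

def ngram_score(lm_logprob: float, ngram_count: int, corpus_tokens: int) -> float:
    """Contrast an ngram's query-conditioned LM log-probability with an
    unconditional log-probability estimated from its corpus frequency
    (a count the FM-index returns efficiently). Distinctive ngrams,
    i.e. likely under the LM but rare in the corpus, score highest.
    Simplified illustration; not the paper's exact formula.
    """
    corpus_logprob = math.log(max(ngram_count, 1) / corpus_tokens)
    return max(0.0, lm_logprob - corpus_logprob)

# A query-specific ngram (rare in the corpus) outscores a generic one
# with the same LM probability.
specific = ngram_score(lm_logprob=-2.0, ngram_count=3, corpus_tokens=10**6)
generic = ngram_score(lm_logprob=-2.0, ngram_count=50_000, corpus_tokens=10**6)
```

The `max(0, ...)` clamp means ngrams that are no more likely under the query-conditioned LM than in the corpus at large contribute nothing, which is the "distinctiveness" filter in spirit.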
## Experiments
- Datasets: Evaluation conducted on **Natural Questions** (NQ) and the **[KILT benchmark](https://paperswithcode.com/dataset/kilt)**, involving various knowledge-intensive tasks across multiple datasets.
- Metrics: Accuracy@k for NQ, passage-level R-precision for the KILT benchmark, and memory-footprint comparisons.
- Results: SEAL outperforms established retrieval solutions and previous autoregressive approaches, and achieves new state-of-the-art downstream results on multiple datasets when combined with existing reader models.
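For reference, the two retrieval metrics named above can be sketched with their standard definitions (helper names are mine):

```python
def accuracy_at_k(retrieved: list[str], gold_answers: list[str], k: int) -> float:
    """Accuracy@k: 1.0 if any of the top-k retrieved passages contains a
    gold answer string, else 0.0 (averaged over queries in practice)."""
    return 1.0 if any(ans in p for p in retrieved[:k] for ans in gold_answers) else 0.0

def r_precision(retrieved: list[str], relevant: set[str]) -> float:
    """R-precision: precision of the top-R results, where R is the number
    of relevant items for the query."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for p in retrieved[:r] if p in relevant) / r

# Example: the gold answer appears in the top-1 passage.
hit = accuracy_at_k(["the capital is Paris", "unrelated"], ["Paris"], k=1)  # 1.0
```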
> The contents shared herein are quoted verbatim from the original author and are intended solely for personal note-taking and reference purposes following a thorough reading. Any interpretation or annotation provided is strictly personal and does not claim to reflect the author's intended meaning or context.