# Autoregressive Search Engines: Generating Substrings as Document Identifiers

###### tags: ```筆記```, ```NLP```

> https://arxiv.org/pdf/2204.10628.pdf

## Abstract

- Motivation: Knowledge-intensive language tasks require NLP systems not only to produce correct answers but also to retrieve supporting evidence from a corpus. The paper proposes using an autoregressive language model to generate distinctive ngrams that serve as identifiers for **passage-level retrieval**, improving search accuracy **without imposing a predefined structure** on the search space.
    - In other words, unlike the DSI approach, there is no need to assign semantically structured identifiers to documents.
- Results: The proposed method, which treats every ngram in a passage as a possible identifier, outperforms prior autoregressive approaches and established retrieval solutions on the KILT benchmark by at least 10 points, setting new state-of-the-art results on several datasets with a significantly lighter memory footprint.

## Introduction

- Knowledge-intensive NLP tasks, such as open-domain question answering and fact checking, typically pair a search engine with a machine reader: the former retrieves relevant evidence and the latter generates answers. The surge in autoregressive language models has boosted machine-reader performance, but retrieval methods have not improved at the same pace.
- The paper introduces an autoregressive model that uses all ngrams in a passage as possible identifiers, eliminating the need for a structured search space and relying on an efficient data structure to map generated ngrams back to full passages.

## Methodology

- Method name: Search Engines with Autoregressive LMs (SEAL).
- ![image](https://hackmd.io/_uploads/r1TzFaFRT.png)
- SEAL combines an autoregressive model (BART) with a compressed full-text substring index (FM-index) to generate and score distinctive ngrams, which are then mapped back to passages. The index allows the model to generate any span that occurs in the corpus **without explicitly encoding all substrings**, and a novel scoring function combines LM probabilities with FM-index frequencies to improve retrieval accuracy (a toy sketch of this idea appears at the end of this note).

## Experiments

- Datasets: Evaluation is conducted on **Natural Questions** (NQ) and the **[KILT benchmark](https://paperswithcode.com/dataset/kilt)**, covering a variety of knowledge-intensive tasks across multiple datasets.
- Metrics: Accuracy@k for NQ, passage-level R-precision for KILT, and a comparison of memory footprints. SEAL outperforms established retrieval solutions and previous autoregressive approaches, and achieves new state-of-the-art downstream results on multiple datasets when combined with existing reader models.

> The contents shared herein are quoted verbatim from the original author and are intended solely for personal note-taking and reference purposes following a thorough reading. Any interpretation or annotation provided is strictly personal and does not claim to reflect the author's intended meaning or context.
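
## Appendix: Toy Sketch of the FM-Index Idea

A personal, minimal sketch of the mechanics described in Methodology, under loud assumptions: a plain Python suffix array stands in for SEAL's compressed FM-index, hard-coded log-probabilities stand in for BART's scores, and the weighting in `rank_passages` is a crude stand-in rather than SEAL's actual scoring function. All function names here (`suffix_range`, `allowed_next_chars`, `passage_of`, `rank_passages`) are hypothetical and not from the paper.

```python
# Personal sketch only: a suffix array stands in for SEAL's compressed
# FM-index, and hard-coded log-probabilities stand in for BART scores.
# The scoring heuristic below is my own assumption, not the paper's formula.
import math

passages = [
    "the eiffel tower is located in paris",
    "paris is the capital city of france",
]

SEP = "\x00"  # separator char that no generated ngram can contain, so
              # matches effectively never span two passages
text = SEP.join(passages)

# Suffix array: every suffix start offset, sorted lexicographically.
suffix_array = sorted(range(len(text)), key=lambda i: text[i:])

def suffix_range(ngram):
    """Binary-search the half-open range of suffixes starting with `ngram`.
    This is the counting query an FM-index answers without storing text."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose prefix >= ngram
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + len(ngram)] < ngram:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(suffix_array)
    while lo < hi:  # first suffix whose prefix > ngram
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + len(ngram)] <= ngram:
            lo = mid + 1
        else:
            hi = mid
    return start, lo

def allowed_next_chars(prefix):
    """Characters that extend `prefix` to another corpus substring. SEAL
    applies the analogous constraint over subword tokens at each decoding
    step, so the model can only generate spans that occur in the corpus."""
    start, end = suffix_range(prefix)
    k = len(prefix)
    return {text[suffix_array[i] + k]
            for i in range(start, end) if suffix_array[i] + k < len(text)}

def passage_of(offset):
    """Map a corpus offset back to the passage that contains it."""
    bound = 0
    for pid, p in enumerate(passages):
        if offset < bound + len(p):
            return pid
        bound += len(p) + len(SEP)
    return None

def rank_passages(ngrams, lm_logprobs):
    """Aggregate per-ngram scores into per-passage scores. The weight (LM
    log-probability plus an inverse-frequency bonus for rare ngrams) is a
    crude stand-in for SEAL's LM-probability x FM-index-frequency scoring."""
    scores = {}
    for ngram, lp in zip(ngrams, lm_logprobs):
        start, end = suffix_range(ngram)
        if start == end:
            continue  # ngram does not occur anywhere in the corpus
        weight = lp + math.log(len(text) / (end - start))
        for i in range(start, end):
            pid = passage_of(suffix_array[i])
            if pid is not None:
                scores[pid] = scores.get(pid, 0.0) + weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical generated ngrams with made-up BART log-probabilities.
print(allowed_next_chars("paris"))  # {' ', '\x00'}: space, or passage boundary
print(rank_passages(["eiffel tower", "in paris", "capital city"],
                    [-0.5, -1.2, -0.9]))  # passage 0 ranks above passage 1
```

The point of `allowed_next_chars` is that constrained beam search can only ever emit spans that literally occur in the corpus, with no need to enumerate all substrings up front; the same index query also yields the occurrence counts that the scoring function combines with LM probabilities.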