# How Does Generative Retrieval Scale to Millions of Passages?

###### tags: ```Notes```, ```NLP```

> https://arxiv.org/pdf/2305.11841.pdf

## Abstract

- Motivation: Generative retrieval, exemplified by the Differentiable Search Index, reframes information retrieval as a sequence-to-sequence task, encoding a document corpus within a Transformer model **without external indices**. Prior evaluations of generative retrieval's effectiveness were limited to corpora of around **100k** documents.
- Findings: This study is the first to empirically explore generative retrieval **across varying corpus sizes**, scaling up to 8.8M passages with models of up to 11B parameters. Key findings include the central role of synthetic queries as document representations during indexing, the limited utility of the proposed architectural modifications relative to their compute costs, and a ceiling on retrieval gains from simply increasing model size.

## Introduction

- Background: Traditional dual encoders have dominated first-stage information retrieval by mapping queries and documents into a shared embedding space. Generative retrieval instead proposes a single unified model, showing promise against dual encoders on smaller corpora.
- Challenge: There is a gap in understanding how generative retrieval performs on larger corpora and which model design choices matter as it scales.

## Methodology

- Techniques evaluated: The study assesses various generative retrieval techniques, including document identifier designs (atomic, naive semantic), document representations (document tokens, synthetic queries), and model designs (prefix-aware, weight-adaptive decoding).
- Approach: Begins with small-scale experiments on known datasets, then scales up to the entire MS MARCO passage ranking task, evaluating effectiveness across corpus sizes and model parameter counts.
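To make the document-identifier idea concrete, here is a minimal, hypothetical sketch (not code from the paper) of how generative retrieval typically constrains the decoder so it can only emit valid docids: a prefix trie over identifier token sequences masks the vocabulary at each decoding step. The `DocidTrie` class and the toy identifiers below are illustrative assumptions.

```python
# Minimal sketch: a prefix trie over document identifiers, used to mask
# the decoder's vocabulary so only valid docids can be generated.
# Docids and tokens here are toy assumptions, not the paper's.

class DocidTrie:
    def __init__(self, docids):
        self.root = {}
        for docid in docids:
            node = self.root
            for token in docid:            # a docid is a sequence of tokens
                node = node.setdefault(token, {})
            node[None] = True              # end-of-identifier marker

    def allowed_next_tokens(self, prefix):
        """Tokens that can legally follow `prefix` in some docid."""
        node = self.root
        for token in prefix:
            if token not in node:
                return set()
            node = node[token]
        return {t for t in node if t is not None}

# "Naive semantic" identifiers share prefixes, so similar documents
# share a decoding path; "atomic" identifiers would be single tokens.
trie = DocidTrie([("science", "physics", "001"),
                  ("science", "biology", "002"),
                  ("sports", "soccer", "003")])

print(trie.allowed_next_tokens(()))            # {'science', 'sports'}
print(trie.allowed_next_tokens(("science",)))  # {'physics', 'biology'}
```

At each step of beam search, logits for tokens outside `allowed_next_tokens(prefix)` would be set to negative infinity, guaranteeing every generated string corresponds to a real document.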
## Experiments

- Datasets: Ranges from small subsets of the **MS MARCO** passage ranking task (100k passages) to the full set (8.8M passages), with additional comparisons on the **Natural Questions** and **TriviaQA** datasets.
    - Note on the MS MARCO (Microsoft Machine Reading Comprehension) dataset: each individual document or passage is usually a short text. Passage lengths are not fixed, but a passage typically contains roughly 50 to 250 words.
- Metrics: Evaluates model performance using Mean Reciprocal Rank at 10 (**MRR@10**), focusing on how scaling affects retrieval effectiveness and on the impact of the various generative retrieval techniques.
    - Mean Reciprocal Rank (MRR) is a metric for evaluating information retrieval, recommendation, or question answering systems. It is especially suited to measuring, on average, how early the first correct answer appears in a system's returned ranking.
    - MRR is the mean of the reciprocal ranks over a set of queries, where the reciprocal rank is the inverse of the rank of the first correct answer returned. If the first correct answer is ranked first, its reciprocal rank is 1; if it is ranked second, its reciprocal rank is 1/2; and so on.
- ![image](https://hackmd.io/_uploads/S1dPHTK06.png)
- ![image](https://hackmd.io/_uploads/SJYFrTY06.png)

## Takeaways

- Core findings include the central importance of synthetic queries as document representations during indexing, the ineffectiveness of previously proposed architectural modifications once compute costs are taken into account, and the limits of blindly scaling up model parameters for retrieval performance.
- While generative retrieval is found to be competitive with SOTA dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge. These findings help the community clarify the current state of generative retrieval, highlight its unique challenges, and inspire new research directions.

> STATEMENT: The contents shared herein are quoted verbatim from the original author and are intended solely for personal note-taking and reference purposes following a thorough reading. Any interpretation or annotation provided is strictly personal and does not claim to reflect the author's intended meaning or context.
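As a supplementary note, the MRR@k definition described in the Experiments section can be sketched in a few lines. The function name and the toy ranking data below are made up for illustration; the official MS MARCO evaluation uses its own script.

```python
def mrr_at_k(ranked_lists, relevant_ids, k=10):
    """Mean Reciprocal Rank@k: average of 1/rank of the first relevant
    result within the top k, counting 0 if none appears there."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_ids):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break                     # only the first hit counts
    return total / len(ranked_lists)

# Toy example: query 1 hits at rank 1, query 2 at rank 2, query 3 misses.
rankings = [["d3", "d7"], ["d9", "d4"], ["d1", "d2"]]
relevant = [{"d3"}, {"d4"}, {"d8"}]
print(mrr_at_k(rankings, relevant))  # (1 + 1/2 + 0) / 3 = 0.5
```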