# Transformer Memory as a Differentiable Search Index (DSI)

###### tags: `Notes`, `NLP`, `NeurIPS 2022`

> https://arxiv.org/pdf/2202.06991.pdf
> [YouTube Video: Transformer Memory as a Differentiable Search Index / Yi Tay (Google Research)](https://www.youtube.com/watch?v=27rNqGrTdSI)
> Google Research

## Abstract + Introduction

- Traditional IR is retrieve-then-rank. This paper shows how information retrieval can be done with a single Transformer model, with all information about the corpus encoded in the model's parameters.
- DSI (generative retrieval) is built on T5 models.
    - Dense retrieval also has T5 (or ST5) based variants.
- The paper introduces the Differentiable Search Index, a new paradigm that uses a sequence-to-sequence (seq2seq) setup to map a query string directly to the identifiers (docids) of relevant documents, greatly simplifying the retrieval pipeline.
- ![image](https://hackmd.io/_uploads/SJu8rIhCa.png)
    - Source: https://speakerdeck.com/wingnus/yi-tay-google-research

## Differentiable Search Index (DSI)

- DSI uses a large pre-trained Transformer and encodes all information about the corpus in the language model's parameters.
- Information retrieval comparison
    - ![image](https://hackmd.io/_uploads/HJAOf_DW0.png)
    - BM25 or TF-IDF
        - If the vocabulary is ["apple", "banana", "cherry"] and document d_j contains only "apple" and "cherry", its sparse vector might be [1.5, 0, 0.8].
    - Dual Encoder (DE)
        - The document is fed into the encoder, yielding a dense vector such as [0.85, -0.23, ..., 0.91]; the dimensionality depends on the chosen encoder.
    - Differentiable Search Index (DSI)
        - The paper proposes three docid representations; see "[Section 3.2] Representing Docids for Retrieval" below.
        - With the atomic strategy, the output might be "233".

## Methodology

![image](https://hackmd.io/_uploads/SkJbe5gAa.png)

- [Section 3.1.1] Indexing Method (How to index)
    - Inputs2Target
        - seq2seq task of doc_tokens -> docid
    - Targets2Inputs (docid -> doc_tokens)
    - Bidirectional (both directions co-trained, with a prefix indicating the direction)
    - Span Corruption (docid concatenated with document tokens under the span-corruption objective)
- [Section 3.1.2] Document Representation (What to index)
    - Direct Indexing
        - Take the first L tokens of a document.
    - Set Indexing
        - Use the set of unique tokens in the document.
    - Inverted Index
        - Randomly sample a contiguous chunk of k tokens, so unlike Direct Indexing the model can see more than just the first k tokens.
- [Section 3.2] Representing Docids for Retrieval
    - Unstructured Atomic Identifiers
        - Assign each document an arbitrary unique integer identifier, treated as a single (new) token.
    - Naively Structured String Identifiers
        - Treat the same arbitrary unique integers as tokenizable strings, decoded token by token (e.g., "233" is generated as "2", "3", "3"), so no new tokens need to be added to the vocabulary.
    - Semantically Structured Identifiers
        - Docids capture the semantic relationships between documents and are arranged in a structured way: document embeddings are clustered hierarchically so that semantically similar documents share docid prefixes (see the first sketch after this section).
        - ![image](https://hackmd.io/_uploads/HJrLhOwW0.png)
- **Retrieval**: performed with autoregressive generation, decoding docids conditioned on the input query.
- Train
    - Method 1: first train Indexing (memorization), then train Retrieval.
    - Method 2 (the one chosen in the paper): a multi-task setup, using task prompts to tell the model which task to perform, as in T5-style co-training (see the second sketch below).
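A minimal sketch of how the semantically structured docids described above could be built, assuming scikit-learn's `KMeans` and some pre-computed document embeddings; the function name, cluster count, and leaf size are illustrative choices, not the paper's exact settings. Documents are recursively split into at most `c` clusters, and a document's docid is the concatenation of its cluster index at every level, so semantically similar documents share docid prefixes.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_semantic_docids(embeddings, doc_ids, c=10, leaf_size=100, prefix=""):
    """Recursively cluster embeddings; a docid is the concatenation of the
    cluster index chosen at each level of the hierarchy."""
    docid_map = {}
    if len(doc_ids) <= leaf_size:
        # Small enough cluster: assign an arbitrary position within the leaf.
        for pos, d in enumerate(doc_ids):
            docid_map[d] = prefix + str(pos)
        return docid_map
    k = min(c, len(doc_ids))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
    for cluster in range(k):
        idx = np.where(labels == cluster)[0]
        docid_map.update(build_semantic_docids(
            embeddings[idx],
            [doc_ids[i] for i in idx],
            c=c, leaf_size=leaf_size,
            prefix=prefix + str(cluster),
        ))
    return docid_map

# Toy usage with random vectors; real embeddings would come from an encoder.
rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 32))
docids = build_semantic_docids(emb, [f"doc_{i}" for i in range(500)])
print(docids["doc_0"])  # e.g. "34": cluster 3 at the top level, position 4 in that leaf
```

During decoding, each generated digit narrows the search to one subtree of the cluster hierarchy, which is what gives these identifiers their prefix-sharing, coarse-to-fine structure.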
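A sketch of the multi-task setup chosen in the paper (Method 2), assuming Hugging Face `transformers` with a T5 checkpoint and the direct-indexing / string-docid choices above; the prompt prefixes and example texts are hypothetical, not taken from the paper. Indexing examples map the first L document tokens to the docid string, retrieval examples map a query to the same docid, both are mixed into one fine-tuning set, and retrieval is beam-search generation of docid strings.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

L = 64  # direct indexing: keep only the first L tokens of each document

def indexing_example(doc_text, docid):
    # Indexing task (Inputs2Target): document tokens -> docid string.
    truncated = " ".join(doc_text.split()[:L])
    return {"source": f"index document: {truncated}", "target": docid}

def retrieval_example(query, docid):
    # Retrieval task: natural-language query -> docid string.
    return {"source": f"retrieve query: {query}", "target": docid}

# Both tasks share one fine-tuning set; the paper reports that the ratio of
# indexing to retrieval examples is an important hyperparameter.
train_set = [
    indexing_example("Transformers are a neural network architecture ...", "233"),
    retrieval_example("what model does DSI build on", "233"),
]

# Retrieval at inference time: beam search over docid strings; the top-k beams
# form the ranked list (in practice decoding is constrained to valid docids).
inputs = tokenizer("retrieve query: what model does DSI build on",
                   return_tensors="pt")
beams = model.generate(**inputs, max_length=8, num_beams=10,
                       num_return_sequences=10)
ranked_docids = tokenizer.batch_decode(beams, skip_special_tokens=True)
print(ranked_docids)  # ranked docid predictions, scored against the gold docid for Hits@N
```

With unstructured atomic docids, the target would instead be a single dedicated vocabulary token per document rather than a character-by-character string.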
## Experiments

- Experiments show that with appropriate design choices, DSI significantly outperforms strong baselines such as Dual Encoder models across different document collections.
- Datasets: Utilized the **Natural Questions** (NQ) dataset to test the efficacy of DSI in a challenging retrieval task.
    - NQ consists of 307K query-document training pairs and 8K validation pairs, where the queries are natural language questions and the documents are Wikipedia articles.
- Metrics: Evaluated performance using **Hits@N**, demonstrating significant improvements over baselines such as BM25 and showcasing strong generalization in a zero-shot setup.
- Tables
    - ![image](https://hackmd.io/_uploads/HkoFs_D-R.png)
    - ![image](https://hackmd.io/_uploads/ByfjsuP-A.png)
    - For zero-shot retrieval, the model is trained only on the indexing task and not on the retrieval task, so it sees no labeled query -> docid pairs.

## Conclusion

- DSI offers a new paradigm for learning an end-to-end search system, paving the way for the next generation of search technology.
- The paper proposes several representations for documents and docids, explores different model architectures and training strategies, and experiments on the Natural Questions dataset, demonstrating that DSI outperforms common baselines.

## Pros

- **Simplification of the Retrieval Process**: DSI significantly simplifies the information retrieval process by encoding all necessary information directly into the model's parameters, eliminating the need for traditional search indexes.
- **End-to-End Learning**: Offers an end-to-end trainable framework, where both the indexing and retrieval processes are integrated into the model, facilitating easier optimization and potentially better performance.
- **Strong Generalization**: Demonstrates strong generalization capabilities, particularly in zero-shot settings where the model outperforms traditional IR baselines like BM25.
- **Scalability with Model Size**: Benefits from the scalability of Transformer models, where performance can significantly improve with larger model sizes.
- **Flexibility**: Can potentially handle a wide range of queries and document types due to the flexible nature of sequence-to-sequence models.

## Cons

- **Resource Intensive**: Requires substantial computational resources for training and inference, making it less accessible for smaller organizations or individuals.
- **Complexity in Index Updating**: Updating the index for new or removed documents involves retraining or fine-tuning the model, which can be computationally expensive and less straightforward compared to traditional indexing methods.
- **Limited to Moderate-Sized Corpora**: While promising, the scalability of DSI to very large corpora remains an open question, with potential challenges in efficiency and performance.
- **Dependence on Pretraining**: The effectiveness of DSI heavily relies on the pretraining of the underlying Transformer model, which itself is a resource-intensive process.
- **Integration Challenges**: Integrating DSI into existing information retrieval systems may pose challenges, requiring significant modifications to leverage the model's capabilities.

> STATEMENT: The contents shared herein are quoted verbatim from the original author and are intended solely for personal note-taking and reference purposes following a thorough reading. Any interpretation or annotation provided is strictly personal and does not claim to reflect the author's intended meaning or context.