# Transformer Memory as a Differentiable Search Index (DSI)
###### tags: `Notes`, `NLP`, `NeurIPS 2022`
> https://arxiv.org/pdf/2202.06991.pdf
> [YouTube Video: Transformer Memory as a Differentiable Search Index / Yi Tay (Google Research)](https://www.youtube.com/watch?v=27rNqGrTdSI)
> Google Research
## Abstract + Introduction
- Traditional IR is retrieve-then-rank; this paper shows how to do information retrieval with a single Transformer model, with all information about the corpus encoded in the model's parameters.
- The DSI in this paper (a form of generative retrieval) is built on T5 models
- Dense retrieval also has variants built on T5 (or ST5) models
- Introduces the Differentiable Search Index, a new paradigm that uses a sequence-to-sequence (seq2seq) setup to map string queries directly to relevant document identifiers (docids), greatly simplifying the retrieval pipeline.
- (overview figure from the talk slides, not reproduced here) Source: https://speakerdeck.com/wingnus/yi-tay-google-research
## Differentiable Search Index (DSI)
- DSI uses a single large pre-trained Transformer and encodes all information about the corpus in the parameters of that language model.
- Information retrieval comparison
- BM25 or TFIDF
- If the vocabulary is ["apple", "banana", "cherry"] and document d_j contains only "apple" and "cherry", the vector representation of d_j might be [1.5, 0, 0.8]
- Dual Encoder (DE)
- The document is fed into the encoder, which might output something like [0.85, -0.23, ..., 0.91]; the dimensionality depends on the chosen encoder
- Differentiable Search Index (DSI)
- The paper proposes three strategies; see "[Section 3.1.2] Document Representation"
- With the atomic identifier strategy, the output might simply be "233" (a toy sketch contrasting the three paradigms follows this list)
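To make the comparison concrete, here is a minimal toy sketch. All numbers and the `dsi_retrieve` stand-in are made up for illustration (they are not from the paper); it only shows what each paradigm stores for a document and how a query is matched against it.

```python
import math

# Illustrative toy numbers only; nothing here comes from the paper's data.

# 1) BM25 / TF-IDF: documents and queries live in a sparse term-weight space.
vocab = ["apple", "banana", "cherry"]
doc_sparse = [1.5, 0.0, 0.8]          # d_j contains "apple" and "cherry" only
query_sparse = [1.0, 0.0, 0.0]        # query mentions "apple"

def sparse_score(q, d):
    # Lexical matching: overlap of weighted terms.
    return sum(qi * di for qi, di in zip(q, d))

# 2) Dual Encoder: dense embeddings from a learned encoder; retrieval is
#    (approximate) nearest-neighbour search under a similarity function.
doc_dense = [0.85, -0.23, 0.91]       # dimensionality depends on the encoder
query_dense = [0.80, -0.10, 0.70]

def cosine(q, d):
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm

# 3) DSI: the corpus is stored in the model parameters; retrieval is
#    autoregressive generation of a docid string (stand-in function here).
def dsi_retrieve(query: str) -> str:
    return "233"                      # e.g. an atomic/string identifier

print(sparse_score(query_sparse, doc_sparse))
print(round(cosine(query_dense, doc_dense), 3))
print(dsi_retrieve("apple cherry pie"))
```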
## Methodology

- Step0: Indexing Method (How to index)
- Inputs2Target
- seq2seq task of doc_tokens -> docid
- Targets2Inputs: the reverse task, generating doc_tokens from the docid
- Bidirectional: both Inputs2Targets and Targets2Inputs trained within the same co-training setup
- Span Corruption: span corruption applied to the docid concatenated with the document tokens
- [Section 3.1.2] Document Representation (What to index)
- Direct Indexing
- Take the first L tokens of a document
- Set Indexing
- Take the set of unique tokens in the document (duplicates removed)
- Inverted Index
- Randomly sample a contiguous chunk of k tokens, so unlike Direct Indexing the model can see more than just the first k tokens
- [Section 3.2] Representing Docids for Retrieval
- Unstructured Atomic Identifiers
- Assign each document an arbitrary unique integer identifier
- Naively Structured String Identifiers
- The same arbitrary unique integers, but treated as tokenizable strings, so the docid is decoded token by token rather than predicted as a single new token
- Semantically Structured Identifiers
- Docids capture semantic relationships between documents and are arranged in a structured way (built via hierarchical clustering over document embeddings, so similar documents share docid prefixes)
- **Retrieval**: performed with autoregressive generation, decoding docids conditioned on the input query.
- Training
- Method 1: first train indexing (memorization), then fine-tune for retrieval
- Method 2 (the method chosen in the paper): a multi-task setup that mixes indexing and retrieval examples, using task prompts to distinguish them, similar to T5-style co-training (see the sketch after this list)
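A minimal sketch of how the multi-task training stream could be assembled, under toy assumptions: a tiny hand-written corpus, whitespace-split tokens as a stand-in for a real tokenizer, and an arbitrary mixing ratio. The resulting (input, target) pairs are what a T5-style seq2seq model would then be fine-tuned on.

```python
import random

# Toy corpus: docid -> document text. Docids here are arbitrary unique
# integers rendered as strings (unstructured / naive string identifiers).
corpus = {
    "233": "apple cherry pie recipe with a crumb topping",
    "512": "transformer models for end to end information retrieval",
    "777": "hierarchical clustering of document embeddings for search",
}

# Labeled retrieval data: natural-language query -> relevant docid.
train_queries = [
    ("how to bake an apple cherry pie", "233"),
    ("neural retrieval with transformers", "512"),
]

def indexing_examples(corpus, first_l_tokens=16):
    """Inputs2Targets indexing task: first L document tokens -> docid
    (direct indexing; whitespace split stands in for a real tokenizer)."""
    for docid, text in corpus.items():
        doc_repr = " ".join(text.split()[:first_l_tokens])
        yield (doc_repr, docid)

def retrieval_examples(pairs):
    """Retrieval task: query -> docid."""
    for query, docid in pairs:
        yield (query, docid)

def multitask_mixture(corpus, pairs, index_to_retrieval_ratio=4, seed=0):
    """Mix both tasks into one training stream (T5-style co-training).
    The indexing-to-retrieval ratio is a tuned hyperparameter; 4 is an
    arbitrary choice here. Task prompts, if used, would be prepended
    to the input strings."""
    rng = random.Random(seed)
    examples = list(indexing_examples(corpus)) * index_to_retrieval_ratio
    examples += list(retrieval_examples(pairs))
    rng.shuffle(examples)
    return examples

for src, tgt in multitask_mixture(corpus, train_queries):
    print(f"{src!r}  ->  docid {tgt}")
```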
## Experiments
- Experiments show that, with appropriate design choices, DSI significantly outperforms strong baselines such as Dual Encoder models across document collections of different sizes.
- Datasets: Utilized the **Natural Questions** (NQ) dataset to test the efficacy of DSI in a challenging retrieval task.
- NQ consists of 307K query-document training pairs and 8K validation pairs, where the queries are natural language questions and the documents are Wikipedia articles.
- Metrics: Evaluated performance with **Hits@N**, showing significant improvements over baseline models (BM25) and strong generalization in a zero-shot setup (a small sketch of the metric follows this list)
- Tables: (result tables from the paper, not reproduced here)
- For zero-shot retrieval, the model is trained only on the indexing task and not on the retrieval task, so it sees no labeled query -> docid pairs
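Hits@N simply checks whether the gold docid appears among the model's top-N predictions (for DSI, the ranked list would come from beam search over docid strings). A minimal sketch with made-up predictions:

```python
def hits_at_n(ranked_docids, gold_docid, n):
    """1 if the gold docid is among the top-n ranked predictions, else 0."""
    return int(gold_docid in ranked_docids[:n])

def mean_hits_at_n(predictions, golds, n):
    """predictions: one ranked docid list per query (e.g. from beam search);
    golds: the gold docid for each query, aligned with predictions."""
    scores = [hits_at_n(pred, gold, n) for pred, gold in zip(predictions, golds)]
    return sum(scores) / len(scores)

# Made-up example: two queries, beams truncated to 3 docids for display.
preds = [["233", "512", "777"],
         ["512", "777", "233"]]
golds = ["233", "777"]
print(mean_hits_at_n(preds, golds, 1))    # 0.5: only the first query hits at rank 1
print(mean_hits_at_n(preds, golds, 10))   # 1.0: both gold docids are in the top 10
```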
## Conclusion
- DSI offers a new paradigm for learning end-to-end search systems, paving the way for the next generation of search technology.
- The paper proposes multiple representations for documents and docids, explores different model variants and training strategies, and runs experiments on the Natural Questions dataset, demonstrating DSI's advantage over common baselines.
## Pros
- **Simplification of the Retrieval Process**: DSI significantly simplifies the information retrieval process by encoding all necessary information directly into the model's parameters, eliminating the need for traditional search indexes.
- **End-to-End Learning**: Offers an end-to-end trainable framework, where both the indexing and retrieval processes are integrated into the model, facilitating easier optimization and potentially better performance.
- **Strong Generalization**: Demonstrates strong generalization capabilities, particularly in zero-shot settings where the model outperforms traditional IR baselines like BM25.
- **Scalability with Model Size**: Benefits from the scalability of Transformer models, where performance can significantly improve with larger model sizes.
- **Flexibility**: Can potentially handle a wide range of queries and document types due to the flexible nature of sequence-to-sequence models.
## Cons
- **Resource Intensive**: Requires substantial computational resources for training and inference, making it less accessible for smaller organizations or individuals.
- **Complexity in Index Updating**: Updating the index for new or removed documents involves retraining or fine-tuning the model, which can be computationally expensive and less straightforward compared to traditional indexing methods.
- **Limited to Moderate-Sized Corpora**: While promising, the scalability of DSI to very large corpora remains an open question, with potential challenges in efficiency and performance.
- **Dependence on Pretraining**: The effectiveness of DSI heavily relies on the pretraining of the underlying Transformer model, which itself is a resource-intensive process.
- **Integration Challenges**: Integrating DSI into existing information retrieval systems may pose challenges, requiring significant modifications to leverage the model's capabilities.
> STATEMENT: The contents shared herein are quoted verbatim from the original author and are intended solely for personal note-taking and reference purposes following a thorough reading. Any interpretation or annotation provided is strictly personal and does not claim to reflect the author's intended meaning or context.