# DSI++: Updating Transformer Memory with New Documents

###### tags: `notes`, `study notes`, `NLP`

> https://arxiv.org/abs/2212.09744

## Abstract

- Motivation: Deploying Differentiable Search Index (DSI) models in dynamic scenarios, where the corpus changes over time, is computationally expensive because every change requires re-indexing via re-training. This work introduces DSI++, **addressing the challenge of incrementally indexing new documents into a DSI without forgetting previously indexed documents.**
- Approach: Two complementary strategies are explored to combat forgetting: optimizing the training dynamics toward flatter loss minima, and introducing a generative memory that samples pseudo-queries for documents to aid continual indexing.

## Introduction

- Background: A DSI uses Transformer memory to encode documents and answer queries directly by generating document identifiers. However, updating the index as the corpus changes is computationally expensive.
- Goal: Develop methods for effective incremental indexing with Transformer memory, so the model can answer queries over both new and previously indexed documents without re-training from scratch.
- Naively continuing to index new documents leads to catastrophic forgetting of the previously memorized documents.
- ![Catastrophic forgetting during continual indexing](https://hackmd.io/_uploads/Hy9jyU2CT.png)

## Methodology

- Sharpness-Aware Minimization (SAM): Alleviates forgetting by explicitly optimizing for flatter loss basins, which leads to more stable memorization of documents (a minimal sketch of the SAM update appears at the end of these notes).
- Generative Memory: A parametric model that generates pseudo-queries for both old and new documents; these supplement training during continual indexing to prevent forgetting and to support the retrieval task (a replay sketch also appears at the end of these notes).

## Experiments

- Datasets: Novel continual-indexing benchmarks built from **Natural Questions** (NQ) and **MS MARCO** were used to simulate the continual arrival of documents.
- Metrics: Models were evaluated with indexing accuracy and **Hits@10**, showing substantial mitigation of forgetting and improvements on the retrieval task.

## Limitations

- **Unpredictable model behavior for conflicting documents**: When a newly added document **contradicts** or revises information in a previously indexed one, the model's behavior becomes unpredictable. Handling such conflicts effectively is left as an open question for further investigation.
- **Significant forgetting on larger corpora**: Although the proposed generative memory reduces forgetting and improves forward transfer to new documents, on larger datasets such as the full MS MARCO corpus the method still exhibits significant forgetting, indicating a need for further improvement at larger scales of data.

## Takeaways

- SAM and generative memory significantly reduce forgetting in the DSI model during continual indexing, with improved retrieval performance.
- This work demonstrates that, without re-training the model from scratch, optimizing toward flat minima and introducing a generative memory can effectively address forgetting during continual document indexing, which is significant for keeping indices current over dynamic corpora.

> STATEMENT: The contents shared herein are quoted verbatim from the original author and are intended solely for personal note-taking and reference purposes following a thorough reading. Any interpretation or annotation provided is strictly personal and does not claim to reflect the author's intended meaning or context.
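
## Appendix: Illustrative sketches (personal annotation)

The following is a minimal sketch of a single SAM update step (Foret et al., 2021), the optimizer DSI++ uses to reach flatter loss basins. It is **not** the authors' implementation: the PyTorch framing, the `loss_fn(model, batch)` closure, and the `rho=0.05` radius are my own assumptions for illustration.

```python
import torch

def sam_step(model, loss_fn, batch, base_optimizer, rho=0.05):
    """One Sharpness-Aware Minimization (SAM) step (sketch, not the paper's code).

    SAM first ascends to the (approximate) worst-case point within an L2 ball
    of radius `rho` around the current weights, then applies the base optimizer
    using the gradient taken at that perturbed point, which biases training
    toward flat loss basins.
    """
    # --- First pass: gradient at the current weights w ---
    loss = loss_fn(model, batch)  # hypothetical closure returning a scalar loss
    loss.backward()

    # Scale each parameter's gradient so the joint perturbation has norm rho.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
    scale = rho / (grad_norm + 1e-12)

    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = p.grad * scale
            p.add_(e)          # ascent step: w -> w + eps
            eps.append(e)

    # --- Second pass: gradient at the perturbed weights w + eps ---
    model.zero_grad()
    loss_fn(model, batch).backward()

    # Undo the perturbation, then update with the sharpness-aware gradient.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)      # back to w
    base_optimizer.step()
    model.zero_grad()
    return loss.item()
```

In practice `base_optimizer` can be any standard optimizer (e.g., Adafactor for T5-based DSI models); SAM roughly doubles the per-step cost because of the two forward/backward passes.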
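The second sketch illustrates the generative-memory idea: a seq2seq model produces pseudo-queries for already-indexed documents, and these are replayed alongside new-document examples during continual indexing. The checkpoint name `t5-base`, the helper names, and the `replay_ratio` are placeholders of mine; the paper trains its own query generator, and an off-the-shelf T5 would need query-generation fine-tuning to be useful.

```python
import random
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Placeholder checkpoint: stand-in for a query generator fine-tuned on
# (document, query) pairs, as the generative memory in the paper is.
GEN_NAME = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(GEN_NAME)
generator = T5ForConditionalGeneration.from_pretrained(GEN_NAME)

def pseudo_queries(doc_text, n=5):
    """Sample n pseudo-queries for a document from the generative memory."""
    inputs = tokenizer(doc_text, return_tensors="pt", truncation=True)
    outputs = generator.generate(
        **inputs,
        do_sample=True,        # sampling yields diverse pseudo-queries
        top_k=50,
        num_return_sequences=n,
        max_new_tokens=32,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

def replay_batch(old_corpus, new_examples, replay_ratio=0.5):
    """Mix pseudo-query replay for old documents into a batch of new examples.

    `old_corpus` is a list of (docid, document_text) pairs that were indexed
    earlier; `new_examples` is a list of (input_text, docid) training pairs
    for the incoming documents.
    """
    n_replay = int(len(new_examples) * replay_ratio)
    replay = []
    for _ in range(n_replay):
        docid, text = random.choice(old_corpus)
        query = pseudo_queries(text, n=1)[0]
        replay.append((query, docid))   # DSI training pair: query -> docid
    return new_examples + replay
```

The key property is that no original queries for old documents need to be stored: the generator acts as a compressed, sampleable memory of the earlier corpus, which is what lets continual indexing proceed without full re-training.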