# Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation (DSI-QG)
###### tags: `筆記`, `study notes`, `NLP`, `SIGIR 2023`, `Document representation`
## Abstract
- Motivation: The Differentiable Search Index (DSI) integrates indexing and retrieval into a single transformer model, aiming to simplify the architecture of Information Retrieval (IR) systems. However, there is a significant data distribution mismatch between the two phases: the model is indexed on long document text but queried with short queries, and in cross-lingual retrieval the documents and queries are also in different languages. This mismatch undermines the effectiveness of DSI models in both mono-lingual and cross-lingual retrieval settings.
- This paper introduces DSI-QG, an enhanced indexing framework for DSI that **mitigates the data distribution mismatch by using generated queries to represent documents during indexing**. This approach aligns the data observed by the DSI model during both indexing and retrieval phases, significantly improving its retrieval effectiveness.
## Introduction
- Information retrieval systems traditionally rely on a decoupled **index-then-retrieve pipeline**, leading to complexities in system architecture. The Differentiable Search Index (DSI) proposes an integrated approach but faces challenges related to data distribution mismatch due to the variance in document and query lengths and language discrepancies in cross-lingual retrieval.
- This research identifies and addresses these challenges with DSI-QG, which leverages query generation and ranking to create a more consistent and effective representation of documents for the indexing phase, thereby improving the performance of DSI in both mono-lingual and cross-lingual retrieval tasks.
## Methodology
- **DSI-QG Framework**: Introduces a query generation component to produce relevant queries for each document. These queries, once generated, are ranked and filtered by a cross-encoder ranker to ensure only high-quality queries are used to represent the document during the indexing phase.
- This method aims to align the data type and distribution between the indexing and retrieval phases, addressing the mismatch issue and enhancing the retrieval performance of DSI models.
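
The generate, rank, and filter steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `build_indexing_pairs`, `toy_generate`, and `overlap_score` are hypothetical names, and the two toy functions are simple stand-ins for a trained query generation model and a cross-encoder ranker. The parameters `num_generated` and `keep_top_k` are assumed values, not settings from the paper.

```python
from typing import Callable, Dict, List, Tuple


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the document."""
    q = set(query.lower().split())
    d = set(text.lower().split())
    return len(q & d) / len(q) if q else 0.0


def toy_generate(text: str, n: int) -> List[str]:
    """Toy stand-in for a query generator: one pseudo-query per longer word, plus a noisy one."""
    words = [w.strip(".,").lower() for w in text.split() if len(w) > 4]
    queries = [f"what is {w}" for w in words] + ["what is the weather today"]
    return queries[:n]


def build_indexing_pairs(
    docs: Dict[str, str],
    generate_queries: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    num_generated: int = 10,
    keep_top_k: int = 3,
) -> List[Tuple[str, str]]:
    """Represent each document by its top-ranked generated queries.

    Returns (query, docid) pairs: the indexing-time training data, which now
    has the same form as the (query, docid) data seen at retrieval time.
    """
    pairs: List[Tuple[str, str]] = []
    for docid, text in docs.items():
        candidates = generate_queries(text, num_generated)
        # Drop duplicate generations while preserving their original order.
        candidates = list(dict.fromkeys(candidates))
        # Rank candidates by relevance to the source document, best first.
        ranked = sorted(candidates, key=lambda q: score(q, text), reverse=True)
        # Keep only the highest-scoring queries to represent the document.
        pairs.extend((q, docid) for q in ranked[:keep_top_k])
    return pairs


docs = {
    "doc1": "The Eiffel Tower is a lattice tower in Paris.",
    "doc2": "Python is a programming language.",
}
for query, docid in build_indexing_pairs(docs, toy_generate, overlap_score):
    print(f"{query!r} -> {docid}")
```

In a real setup the two callables would wrap actual models; the point of the sketch is only that the DSI model is trained on short, query-like inputs instead of full document text.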
## Experiments
- Datasets: Empirical evaluations were conducted on mono-lingual (**NQ** 320k) and cross-lingual (**XOR QA** 100k) document retrieval datasets to assess the effectiveness of DSI-QG compared to the original DSI model and other baselines.
- Metrics: Performance was measured with Hits@1, Hits@10, and nDCG@10; DSI-QG shows significant improvements over the original DSI and other retrieval methods.
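
For reference, the metrics named above can be computed as in the sketch below. It assumes binary relevance (a document is either relevant to a query or not) and rankings without duplicate docids; the function names and the toy `runs` and `qrels` data are illustrative and not taken from the paper's evaluation code.

```python
import math
from typing import Callable, Dict, List, Set


def hits_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant docid appears in the top-k ranking, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """nDCG@k with binary gains; assumes `ranked` contains no duplicate docids."""
    # Discounted gain of each relevant document at 0-based rank i: 1 / log2(i + 2).
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in relevant)
    # Ideal DCG: all relevant documents placed at the top of the ranking.
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0


def mean_metric(
    metric: Callable[[List[str], Set[str], int], float],
    runs: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average a per-query metric over all queries that have relevance labels."""
    return sum(metric(runs.get(q, []), rel, k) for q, rel in qrels.items()) / len(qrels)


# Toy data: ranked docids per query, and the set of relevant docids per query.
runs = {"q1": ["d1", "d2"], "q2": ["d3", "d4"]}
qrels = {"q1": {"d1"}, "q2": {"d4"}}

print("Hits@1 :", mean_metric(hits_at_k, runs, qrels, 1))
print("Hits@10:", mean_metric(hits_at_k, runs, qrels, 10))
print("nDCG@10:", mean_metric(ndcg_at_k, runs, qrels, 10))
```

Hits@k only checks whether a relevant document is retrieved within the top k, while nDCG@10 also rewards placing it closer to the top.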
## Takeaways
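- DSI-QG significantly improves performance on both mono-lingual and cross-lingual document retrieval by representing documents with generated queries during indexing, which resolves the data distribution mismatch.
- This work not only improves the retrieval effectiveness of DSI models but also points to a new direction for simplifying the architecture of information retrieval systems.
- Future work could further optimize the query generation and ranking mechanisms, and extend the approach to more languages and larger datasets.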
> STATEMENT: The contents shared herein are quoted verbatim from the original author and are intended solely for personal note-taking and reference purposes following a thorough reading. Any interpretation or annotation provided is strictly personal and does not claim to reflect the author's intended meaning or context.