# Unsupervised Dense Information Retrieval with Contrastive Learning (Contriever)
###### tags: `notes`, `study notes`, `NLP`
## Abstract
- Motivation: Information retrieval systems traditionally rely on lexical similarities, suffering from a lexical gap that hinders generalization. Recently, **dense retrievers** using neural networks have shown superior performance **but require large annotated datasets**, limiting their applicability in unsupervised settings or languages other than English. This work explores unsupervised dense retrievers using contrastive learning, aiming to match or surpass the performance of BM25 in unsupervised scenarios across various languages and settings.
- Approach: The study introduces an unsupervised method to train dense retrievers through contrastive learning, leveraging techniques like random cropping and MoCo to generate and utilize positive and negative pairs from unaligned text documents effectively.
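As a concrete illustration of the positive-pair construction mentioned above, here is a minimal sketch (not the authors' released code): two spans are independently cropped from the same document and treated as a positive pair, while crops from other documents serve as negatives. The span-length bounds and the helper names are illustrative assumptions.

```python
# Sketch of positive-pair generation via independent random cropping.
# Span-length bounds are placeholders, not values from the paper.
import random

def random_crop(tokens, min_len=5, max_len=50):
    """Sample a contiguous span of tokens from a document."""
    span_len = min(len(tokens), random.randint(min_len, max_len))
    start = random.randint(0, len(tokens) - span_len)
    return tokens[start:start + span_len]

def positive_pair(tokens):
    """Two independent crops of the same document form a positive pair;
    crops drawn from other documents act as negatives."""
    return random_crop(tokens), random_crop(tokens)

doc = "dense retrievers map queries and documents into a shared vector space".split()
query_view, key_view = positive_pair(doc)
```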
## Introduction
- Information retrieval is crucial for various NLP tasks but faces challenges due to the lexical gap and the requirement of large annotated datasets for training neural models. The emergence of dense retrievers offers a promising direction, yet their dependency on supervised learning limits their broader application.
- This paper proposes an unsupervised training framework for dense retrievers, leveraging contrastive learning to enhance retrieval performance without reliance on annotated data. It demonstrates competitive results against BM25 and effective pre-training for subsequent fine-tuning.
## Methodology
- Model name: Contriever.
- Method: unsupervised contrastive learning for dense information retrieval.
- The method trains a dense retriever with contrastive learning, using random cropping to generate positive pairs from a single document and MoCo to maintain a large pool of negative pairs, on a mixture of Wikipedia and CCNet data. It positions unsupervised learning as a scalable alternative to supervised training, with the key challenge being the construction of effective positive and negative pairs from large unlabeled corpora (see the sketch below).
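A minimal sketch of one MoCo-style training step with an InfoNCE loss, assuming `encoder_q` and `encoder_k` are BERT-like text encoders that return one embedding per input and `queue` is a tensor of previously computed key embeddings; the momentum, temperature, and queue handling below are illustrative, not the exact Contriever implementation.

```python
import torch
import torch.nn.functional as F

def moco_step(encoder_q, encoder_k, queue, queries, keys,
              momentum=0.999, temperature=0.05):
    # 1) Encode the two views: queries with the trainable encoder,
    #    keys with the slowly updated momentum encoder.
    q = F.normalize(encoder_q(queries), dim=-1)        # (B, d)
    with torch.no_grad():
        k = F.normalize(encoder_k(keys), dim=-1)       # (B, d)

    # 2) InfoNCE: the matching key is the positive, the queue supplies negatives.
    pos = (q * k).sum(dim=-1, keepdim=True)            # (B, 1)
    neg = q @ queue.t()                                 # (B, K)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    loss = F.cross_entropy(logits, labels)

    # 3) Momentum update of the key encoder, then refresh the queue with new keys.
    with torch.no_grad():
        for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
            p_k.mul_(momentum).add_(p_q, alpha=1.0 - momentum)
        queue = torch.cat([k, queue], dim=0)[: queue.size(0)]
    return loss, queue
```

The queue lets the number of negatives far exceed the batch size, which is the main motivation for using MoCo instead of in-batch negatives alone.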
## Experiments
- Datasets: a diverse set spanning retrieval tasks and languages, including the **BEIR** benchmark and multilingual datasets such as **Mr. TyDi** for evaluation.
    - **MS MARCO** for passage retrieval.
    - **NaturalQuestions** (NQ) and **TriviaQA** for evaluating open-domain question answering retrieval.
    - **Wikipedia** and **CCNet** for unsupervised training, mixing the two sources to cover a wide range of textual data.
- Metrics: Recall@100 and nDCG@10 across the evaluation datasets, compared against BM25 and other state-of-the-art retrievers to show the effectiveness of the proposed unsupervised training approach (a small computation sketch follows).
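For reference, a minimal sketch of how the two metrics are computed for a single query; the helper names and signatures are my own, not taken from the paper or any evaluation library.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of relevant documents that appear in the top-k ranking."""
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k with graded relevance: DCG of the produced ranking divided by
    the DCG of the ideal (relevance-sorted) ranking."""
    dcg = sum(relevance.get(doc, 0) / math.log2(rank + 2)
              for rank, doc in enumerate(ranked_ids[:k]))
    ideal = sum(rel / math.log2(rank + 2)
                for rank, rel in enumerate(sorted(relevance.values(), reverse=True)[:k]))
    return dcg / ideal if ideal > 0 else 0.0
```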
## Takeaways
- This work demonstrates the potential of training unsupervised dense information retrieval models with contrastive learning, showing that performance comparable to or better than BM25 can be reached even without annotated data.
- Through contrastive learning, the model can effectively exploit large-scale unlabeled text, showing promise for multilingual retrieval, especially for lower-resource languages.
- The study also finds that contrastive learning works well as a pre-training step, providing a strong foundation for subsequent fine-tuning and further improving performance on specific retrieval tasks.
> STATEMENT: The contents shared herein are quoted verbatim from the original author and are intended solely for personal note-taking and reference purposes following a thorough reading. Any interpretation or annotation provided is strictly personal and does not claim to reflect the author's intended meaning or context.