ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

---
###### tags: `information retrieval`
---

# ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

Omar Khattab, 2020, SIGIR

# Contribution

1. We propose **late interaction** (§3.1) as a paradigm for efficient and effective neural ranking.
2. We present ColBERT (§3.2 & 3.3), a highly-effective model that employs novel BERT-based query and document encoders within the late interaction paradigm.
3. We show how to leverage ColBERT both for re-ranking on top of a term-based retrieval model (§3.5) and for searching a full collection using vector similarity indexes (§3.6).
4. We evaluate ColBERT on MS MARCO and TREC CAR, two recent passage search collections.

# Motivation

+ BERT-based rankers use roughly 100-1000x more compute than rankers that are not built on a language model.

![](https://i.imgur.com/5oY8nF6.png)

+ The goal is an architecture that is faster than a pointwise BERT ranker while keeping competitive accuracy.

# Method

+ Approach (c) in the figure below is very effective but very slow. The authors combine (c) with the word-to-word matching idea of (b) to obtain ColBERT, which can produce the embedding of every document and query token first and compute similarity only afterwards.

![](https://i.imgur.com/itgBgkR.png)

+ The paper proposes **late interaction**: the query and the document are each encoded into their own set of embeddings, and training pushes relevant <q, doc> pairs to score highly against each other.
+ The query encoder and the document encoder **share one encoder**; they differ only in the input format.
+ Relevance score: for each $v \in Embedding(Query)$, compute the cosine similarity with every $w \in Embedding(Document)$, keep the maximum, and sum these maxima over all query tokens.
+ **Query Encoder**
    + `[CLS] [Q] query [SEP] [MASK] [MASK]`
    + The input has a fixed length: longer queries are truncated, shorter ones are padded with `[MASK]`.
+ **Document Encoder**
    + `[CLS] [D] document [SEP]`
    + Embeddings of punctuation tokens are filtered out.

![](https://i.imgur.com/BXw5NkD.png)

![](https://i.imgur.com/E9qSWCh.png)

The final scoring formula is:

![](https://i.imgur.com/CHjorFq.png)

+ A single linear layer on top of the model output reduces the embedding dimension, which cuts computation time.

# Result

Scores are close to BERT-base, while efficiency improves substantially.

![](https://i.imgur.com/mWTh4eN.png)

![](https://i.imgur.com/Wn7xdR4.png)
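
The query input format and the late-interaction score described in the method section can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `format_query` and `maxsim_score`, the toy 2-dimensional embeddings, and the choice to truncate the query text before appending `[SEP]` are all assumptions made for the example. Real embeddings would come from the shared BERT encoder followed by the linear layer.

```python
import math


def format_query(tokens, max_len):
    """Build a fixed-length query input: [CLS] [Q] tokens [SEP] [MASK]...

    Assumption: query tokens are truncated so the three special tokens
    always fit; the remaining positions are padded with [MASK].
    """
    if max_len < 3:
        raise ValueError("max_len must leave room for [CLS], [Q] and [SEP]")
    budget = max_len - 3
    seq = ["[CLS]", "[Q]"] + tokens[:budget] + ["[SEP]"]
    return seq + ["[MASK]"] * (max_len - len(seq))


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_embs, doc_embs):
    """Sum, over query tokens, of the best cosine match among document tokens.

    Assumes doc_embs is non-empty (punctuation already filtered out).
    """
    q = [l2_normalize(v) for v in query_embs]
    d = [l2_normalize(w) for w in doc_embs]
    total = 0.0
    for v in q:
        total += max(sum(a * b for a, b in zip(v, w)) for w in d)
    return total


if __name__ == "__main__":
    print(format_query(["what", "is", "colbert"], 8))
    # Toy embeddings, hypothetical values chosen only for illustration.
    query = [[1.0, 0.0], [0.0, 2.0]]
    doc = [[3.0, 0.0], [1.0, 1.0]]
    print(round(maxsim_score(query, doc), 4))  # 1.7071
```

Because the document embeddings do not depend on the query, they can be computed once offline and stored; only the cheap max-and-sum step has to run at query time, which is where the efficiency gain over a pointwise BERT ranker comes from.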