Rank Retrieval

# Rank Retrieval ## Boolean Model - doxument : a set of keywords - Queries : connect keyword AND,OR,NOT - Output : Document is **relevant or not** ### boolean model's problem - 太嚴格，AND means all and OR means any - 難以處理複雜的requets - hard to rank, control the document retrieved ## TFIDF $$w_{t,d} = (1+logtf_{t,d})\times log_{10}\frac{N}{df_t}$$ - $w_{t,d}$ : t 在 d 重要度 ### Term Frequency - tf : = frequency of term t in document d ![](https://i.imgur.com/OVniweP.png) #### $log_{10}{tf}_{t,d}$ : 不要讓tf上升那麼快 ex. D1 = {(USA,1),(UK,1)} D2 = {(UK,2)} q = {(UK,1),(USA,1)} q 應該要和D1比較像 - no log sim(q,D1) = 2 = 1*1 + 1*1 sim(q,D2) = 2 = 2*1 + 0*1 - log sim(q,D1) = 2 = 1*1 + 1*1 sim(q,D2) = **1.3** = **log2\*1** + 0*1 ### Document frequency - df (N) is the number of documents that contain the term ## BM25 ### TF problem - 偏好長文 ### TF in BM25 $$0 < \frac{tf_{(t,d)}}{tf_{(t,d)}+k\times(1-b +b\times \frac{dl}{adl})} < 1$$ - 正規化 : 範圍變成 (0,1) - K : 控制$tf_{(t,d)}$的重要性 - K $\uparrow$ , $tf_{(t,d)}\downarrow$ - 長文用大K, 短文用小K - $\frac{tf_{(t,d)}}{tf_{(t,d)}+k\times \frac{dl}{adl}}$ - b : 控制是否懲罰長文 - 某些文章(ex. 法律文件)是越長越重要，短文不重要 - b = 0 : 不懲罰 ## Vector Space Retreival - document = (wieght1, weight2,...) - wieght1: 字T在document的重要度 ### 文章的相似度 - Euclidean Distance - 算d1, d2的在空間的實際距離 - 成分差不多但長短的距離不同->會被當作不像 ![](https://i.imgur.com/SNpDAJj.png) - Inner Product - 作內積 $sim(d_j,q) = d_j \cdot q$ - 會偏好有一堆垃圾字的文章 - **Cosine Similarity** - 改算空間中的夾角 - 垃圾字 : 會有字向量差很遠 $$sim(d_{j},q) = d_j\cdot q = \sum_{i=1}^tw_{ij}w_{wq}$$ ## Rank Retreival ### Speeding up vector space ranking 為每一個字詞記錄一個數值，這個數值稱為maximum contribution。一個字詞的maximum contribution記下了這個字詞在所有文件中可能獲得的「最高」分數 skip pointer - effecient ranking - 有效率去計算cosine - 選出前K大的 #### TAAT = "Term At A Time" - 每次計算query term的posting list 中的文檔的相似度 1. 查詢和query term 對應的 posting 2. 便利每個query term的posting list 3. Store partial query-document scores in accumulators(儲存器) 4. Select top k results to return - Evaluate documents one query term at a time - 一般來說從最罕見的詞開始 - 缺點 : - 占用記憶體空間 - Less disk access without reading all post lists - 每次都需要遍歷posting list，但是把所有有可能文章的都存起來了 ##### impact-ordered postigs for TAAT - sort posting list by $w_{t,d}$ - 從高idf看到低idf - Now: not all postings in a common order! - Early termination: - $w_{t,d}$drops below some running threshold #### DAAT = Document At A Time - DAAT算法會遍歷每個文檔，同時對每個文檔計算它與查詢的相似度得分，計算過程中考慮了查詢中的所有查詢詞，直到處理完所有文檔 - 一次計算和一個document的相似度 - Small memory footprint 只要存前K各(good) - Must read through all postings (bad) - 美次計算文檔的值是否大於前K名 ##### WAND (Weighted AND) scoring for DAAT Processing - keep kth hightest score - 去掉 doc that below the threshold - 照dovID排 - Upper bound: 好像是要看右邊的是不是比這個大，如果比較大現在這個就可以先drop調 ![](https://i.imgur.com/Z7zVLOJ.png) - 如果UB小於第二大的UB，就跳到第二大UB的位置 ## non-safe ranking safe ranking : - 一定要回傳使用者要求的前K筆的資料 non-safe ranking : - 可能排名會錯，Ex.使用者要求的前K筆回傳的包含K+20名的答案 - 允許一定的失誤去增加效率 - idea1 1 - 只算高的idf的query term - ex. 放棄in, the - ideal 2 - 至少要有n個query term - doc id 至少要出現在m個list - 缺點: 無法作前處理 ### Boom Filer