Boolean Retreival

# Boolean Retreival ###### tags: `school meterial`, `IR`,`public` ## **Term-Ducunent Incudence Matric** and **Iverted Indec** - boolean queries: Exact Match - AND OR NOT : join terms - ducment $\rightarrow$ set word >- term-document incidence(出現頻率) matices > - 1 if document contains word, 0 otherwise >- incidence vector > - each term : 0/1 vector >![](https://i.imgur.com/lZe5IAj.jpg) --- - bigger colections 6 bytes(1 word) <small>$一百萬個document, each with about 1000 word\\ 6 bytes(1 word)*1000 * 10000000 = 6$ GB(10^9)</small> - vocab : 500K - build the matric - onl6 record 1 posotions ### Iverted index ![](https://i.imgur.com/E9yCXIg.jpg) - Incverd index constriction $Document \rightarrow_{(Tonkenizer)} \rightarrow Toeken steam \\ \rightarrow_{(Linguistic modules )}\rightarrow modified token\\ \rightarrow_{(Indexer)} \rightarrow Inverted index \rightarrow \rightarrow$ ![](https://i.imgur.com/Lg0Bm53.jpg) #### <big>**steps**</big> - Tonkenizer - string to word token - Normalization - map text and question term the same form - make the format same - Stemming - different forms of root can match - Stop word - 去掉(或不)太常見的word, ex. the ### Indexer steps 1. token sequence sequence of (modified token, dosument ID) pairs 2. Sort - by tenm - by docID 3. 4. Dictionary & Postings - 把同個句子的term連在一起 - 切成**dictionary(term and count)** and **posting(List of docID)** - Doc.frequency ![](https://i.imgur.com/GfftXtC.png) ## Query Processing with Inverted Index ### AND - 之前方式，取出兩者的dic, 一個一個比 - O(x+y)(list length) - **skip pointer - 兩個查詢要移動時，一次移動多格 ### Tradeoff - more skip -> shoter skip spans(時間) -> 很多比較 - ferer skip -> long skip spans ->很少能成功的跳過要跳過的數字 - maybe $sprt{L}$ ### merge - though the 2 posting(sort by docID) -> O(x,y) ### Boolean queries: More general merges ex. USA AND NOT Taiwan USA OR NOR Taiwan - O(x+y) 使用posting - USA AND NOT Taiwan - 找USA list 中沒有提到Taiwan的 - O(N) (N是doc的總數) - USA OR NOR Taiwan - search range is bound N ## Query Optimization - 如果有多個a AND b AND ... - 從最短的開始 - 如果有多個(a OR b) AND (c OR d) ... - get freqence for all term - 估計每個OR的size - increasing order ## phrase queries 1. Biword indexes - 把連續的pair of terms也算進dic - ISSEUS - loger phases - storage - more complicated form 2. Positional indexs a. 反向索引+term在位子資訊(document第幾個字出現) b. merge(同doc) 3. Proximity queries