# Boolean Retreival
###### tags: `school meterial`, `IR`,`public`
## **Term-Ducunent Incudence Matric** and **Iverted Indec**
- boolean queries: Exact Match
- AND OR NOT : join terms
- ducment $\rightarrow$ set word
>- term-document incidence(出現頻率) matices
> - 1 if document contains word, 0 otherwise
>- incidence vector
> - each term : 0/1 vector
>
---
- bigger colections
6 bytes(1 word)
<small>$一百萬個document, each with about 1000 word\\
6 bytes(1 word)*1000 * 10000000 = 6$ GB(10^9)</small>
- vocab : 500K
- build the matric
- onl6 record 1 posotions
### Iverted index

- Incverd index constriction
$Document \rightarrow_{(Tonkenizer)} \rightarrow Toeken steam \\
\rightarrow_{(Linguistic modules
)}\rightarrow modified token\\
\rightarrow_{(Indexer)} \rightarrow Inverted index \rightarrow \rightarrow$

#### <big>**steps**</big>
- Tonkenizer
- string to word token
- Normalization
- map text and question term the same form
- make the format same
- Stemming
- different forms of root can match
- Stop word
- 去掉(或不)太常見的word, ex. the
### Indexer steps
1. token sequence
sequence of (modified token, dosument ID) pairs
2. Sort
- by tenm
- by docID
3.
4. Dictionary & Postings
- 把同個句子的term連在一起
- 切成**dictionary(term and count)** and **posting(List of docID)**
- Doc.frequency

## Query Processing with Inverted Index
### AND
- 之前方式,取出兩者的dic, 一個一個比
- O(x+y)(list length)
- **skip pointer
- 兩個查詢要移動時,一次移動多格
### Tradeoff
- more skip -> shoter skip spans(時間) -> 很多比較
- ferer skip -> long skip spans ->很少能成功的跳過要跳過的數字
- maybe $sprt{L}$
### merge
- though the 2 posting(sort by docID) -> O(x,y)
### Boolean queries: More general merges
ex.
USA AND NOT Taiwan
USA OR NOR Taiwan
- O(x+y) 使用posting
- USA AND NOT Taiwan
- 找USA list 中沒有提到Taiwan的
- O(N) (N是doc的總數)
- USA OR NOR Taiwan
- search range is bound N
## Query Optimization
- 如果有多個a AND b AND ...
- 從最短的開始
- 如果有多個(a OR b) AND (c OR d) ...
- get freqence for all term
- 估計每個OR的size
- increasing order
## phrase queries
1. Biword indexes
- 把連續的pair of terms也算進dic
- ISSEUS
- loger phases
- storage
- more complicated form
2. Positional indexs
a. 反向索引+term在位子資訊(document第幾個字出現)
b. merge(同doc)
3. Proximity queries