# Information Retrieval
## Lecture 1
Introduction Only
## Lecture 2
### Precision
$\text{Precision} = \frac{\text{returned relevant documents}}{\text{returned documents}}$
We call the returned relevant documents **true positives**
$\text{Recall} = \frac{\text{returned relevant documents}}{\text{relevant documents}}$
Getting high precision is _easy_ at the cost of low recall, and vice versa. This is a central trade-off in information retrieval.
### Inverted Index
Storing a matrix (we call this an _adjacency matrix_) with all the documents and terms leads to the _99% problem_: the matrix is sparse. We solve this by only listing, for each term, the documents containing it. We call this an inverted index.
#### Intersection Algorithm
- if match
  - add to intersection set
  - move all pointers
- else
  - move lower pointer
#### Union Algorithm
- if match
  - add to union set
  - move all pointers
- else
  - add lower pointer to union set
  - move lower pointer
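Both merge procedures as a minimal Python sketch, assuming postings lists are sorted lists of document IDs (the function and variable names are illustrative, not from the lecture):

```python
def intersect(p1, p2):
    """Two-pointer intersection of two sorted postings lists."""
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:            # match: keep the docID, move all pointers
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:           # no match: move the pointer with the lower docID
            i += 1
        else:
            j += 1
    return result

def union(p1, p2):
    """Two-pointer union of two sorted postings lists."""
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:            # match: keep the docID once, move all pointers
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:           # no match: keep the lower docID, move its pointer
            result.append(p1[i])
            i += 1
        else:
            result.append(p2[j])
            j += 1
    return result + p1[i:] + p2[j:]   # append whatever is left over in either list
```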
## Lecture 2
### Index Construction
1. collect documents
2. tokenize
   - more difficult in some languages (e.g. if the meaning of a word changes depending on the context)
3. linguistic preprocessing
4. build the index (postings lists)
### Equivalence Classes
Group equivalent terms; however, we then also group identical-looking terms with different meanings (e.g. windows, Windows).
### Index Expansion
Index expansion takes more space, but queries are faster (the work is done during preprocessing).
### Normalization
U.S.A -> USA
### Expansion
Not fully symmetric...

### Stemming
differentli -> different
### Skip Pointers
To traverse long postings lists, we can add skip pointers that jump over a few entries.
A typical skip distance is the _square root of the length of the current postings list_.
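A minimal sketch of intersection with skip pointers, assuming a skip pointer sits at every `skip`-th position of the first list and points `skip` entries ahead (purely illustrative):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists; p1 carries skip pointers every `skip` entries."""
    skip = max(1, int(math.sqrt(len(p1))))     # typical skip distance: sqrt of the list length
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # if a skip pointer exists here and its target is still <= p2[j], take the jump
            if i % skip == 0 and i + skip < len(p1) and p1[i + skip] <= p2[j]:
                i += skip
            else:
                i += 1
        else:
            j += 1
    return result
```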
### Bi-word indices
Always store pairs of successive terms. This, however, increases the number of false positives. We can reduce them by using bigger tuples.
### Definitions
- Type: An equivalence class of terms. Types are associated with postings lists in the standard inverted index (e.g. _computer_: words like _computing_ would then be converted into that type).
## Lecture 3
### Search Structures
#### Hash Table
- bad for range queries
- not perfect (collisions)
- large space requirements (to avoid collisions)
#### Binary Tree
1. Terms can be stored on both inner nodes and leaves.
2. Terms only on the leaves.
   - variant: use intervals on the inner nodes

The problems with the binary tree:
- has to be rebalanced
- can get too large for memory
- a lot of seeks (one at every node)
Solution -> B+-tree
#### B+-Tree
- each node has between $d$ and $2d + 1$ children (!!!)
- all leaves are at the same level
We can also build a reverse index for _wildcard queries_. A reverse index is built by reversing all the terms.
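A small sketch of how a reversed dictionary helps with wildcard queries, assuming the vocabulary fits in sorted Python lists (the terms, helper names, and the `bisect` trick are illustrative assumptions):

```python
import bisect

terms = sorted(["information", "retrieval", "revision", "precision", "recall"])
reversed_terms = sorted(t[::-1] for t in terms)

def prefix_match(sorted_list, prefix):
    """All entries of a sorted list starting with `prefix`, via binary search."""
    lo = bisect.bisect_left(sorted_list, prefix)
    hi = bisect.bisect_right(sorted_list, prefix + "\uffff")
    return sorted_list[lo:hi]

# trailing wildcard re*: a prefix search on the normal dictionary
print(prefix_match(terms, "re"))                                # ['recall', 'retrieval', 'revision']
# leading wildcard *sion: a prefix search on the reversed dictionary, then un-reverse
print([t[::-1] for t in prefix_match(reversed_terms, "nois")])  # ['precision', 'revision']
```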

#### Permuterm index

#### k-gram
Ideally 2 to 4-grams
### Input Correction
#### Levenshtein Distance
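The concrete recurrence is missing from these notes; as a sketch, the standard dynamic-programming formulation with unit costs for insertion, deletion, and substitution:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b with unit costs."""
    prev = list(range(len(b) + 1))              # distances from the empty prefix of a to every prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                              # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```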

#### K-grams
K-grams account for misspellings to some extent.

This can, however, lead to false positives:
### Jaccard Distance
When resolving spelling mistakes using a k-gram index, we get false positives:

Using the Jaccard distance we can solve this problem. The Jaccard coefficient is the size of the intersection divided by the size of the union (in the example from the slides: 3/12 = 0.25); the Jaccard distance is one minus the coefficient.
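A short sketch of this check, comparing the k-gram sets of the query term and a candidate term (the helper names are my own):

```python
def kgrams(term: str, k: int = 2) -> set:
    """k-grams of a term, padded with '$' at both ends as in the k-gram index."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def jaccard_coefficient(a: str, b: str, k: int = 2) -> float:
    """|intersection| / |union| of the two k-gram sets; 1.0 means identical sets."""
    ka, kb = kgrams(a, k), kgrams(b, k)
    return len(ka & kb) / len(ka | kb)

# keep only candidate terms whose coefficient is above some threshold
print(jaccard_coefficient("bord", "board"))
```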
### Summary
1. get the k-grams from the query
2. look them up in the k-gram index
then
- compute the Jaccard distances
- keep terms with a small Jaccard distance (i.e. a high Jaccard coefficient)
or
- compute edit distances
- keep terms with small edit distances
### Soundex Representation
1. Keep the first letter
2. Map all other letters like this

3. Remove all repetitions
4. Remove all zeros
5. Trim and pad to 4 characters
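A hedged sketch of the whole procedure; the digit mapping is the standard Soundex table (the mapping itself is not in these notes, so treat it as my addition), and fine points such as the h/w rule are skipped:

```python
# standard Soundex digit classes; vowels and h, w, y map to '0'
SOUNDEX_MAP = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
    **dict.fromkeys("aeiouhwy", "0"),
}

def soundex(name: str) -> str:
    name = name.lower()
    digits = [SOUNDEX_MAP.get(c, "0") for c in name]                             # 2. map the letters
    deduped = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]  # 3. remove repetitions
    code = name[0].upper() + "".join(d for d in deduped[1:] if d != "0")         # 1. keep first letter, 4. drop zeros
    return (code + "000")[:4]                                                    # 5. trim and pad to 4 characters

print(soundex("Herman"))  # H655
```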

## Lecture 5
### Hardware
The bottleneck is usually the hard drive. We use _batch processing_ to reduce the number of seeks:
#### BSBI (block sort)
We split the collection into smaller chunks, sort each chunk, and then merge the results.
$O(T \log T) = O(T \log T) + O(T)$ because of the sorting; in practice the second (merge) step feels slower, because it is dominated by disk access.
#### SPIMI (single pass)
Improves on the complexity of BSBI, achieving $O(T)$. This is done by building the inverted index directly, so terms no longer have to be looked up.
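A tiny sketch of the single-pass idea for one in-memory block: postings are appended directly to a per-term list in a dictionary, with no separate sort of (term, docID) pairs (this leaves out SPIMI's block writing and merging; names are illustrative):

```python
from collections import defaultdict

def spimi_invert(doc_stream):
    """doc_stream yields (doc_id, tokens); returns term -> sorted postings list."""
    index = defaultdict(list)
    for doc_id, tokens in doc_stream:
        for term in tokens:
            postings = index[term]              # new terms are added on first sight
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)         # docIDs arrive in order, so the list stays sorted
    return dict(index)

docs = [(1, ["new", "home", "sales"]), (2, ["home", "sales", "rise"])]
print(spimi_invert(docs))  # {'new': [1], 'home': [1, 2], 'sales': [1, 2], 'rise': [2]}
```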
#### MapReduce
This is what actual search engines use.

### Auxiliary Index
If the index has to be updated frequently, we can build a second index that holds the new changes and periodically (when the auxiliary index gets too big) integrate it into the main index.
For deleted documents we can use an invalidation bit vector.
Merging the auxiliary index gets more expensive each time: the first round takes $O(n)$, then the cost grows by $n$ per round until we reach $O(T)$, resulting in an overall complexity of $O(n + 2n + \dots + \frac{T}{n} \cdot n) = O(n\frac{\frac{T}{n}(\frac{T}{n} + 1)}{2}) = O(\frac{T^2}{n})$.
So can we do better?
#### Logarithmic Merging
Instead of always merging the auxiliary index with the main index, we build up indexes on disk and only merge two indexes once they have the same size (i.e. an index doubles in size on every merge).
Complexity:
$O(T \log (\frac{T}{n}))$
## Lecture 6
### Blocking
A way of index compression.


If we sort the terms, we can further optimize the index size using _front coding_.
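A rough sketch of front coding for a sorted block of terms, storing for each term the length of the prefix shared with its predecessor plus the remaining suffix (the exact on-disk format is not from the lecture; this is just the idea):

```python
def front_code(sorted_terms):
    """Encode a sorted block of terms as (shared_prefix_length, suffix) pairs."""
    encoded, prev = [], ""
    for term in sorted_terms:
        common = 0
        while common < min(len(prev), len(term)) and prev[common] == term[common]:
            common += 1
        encoded.append((common, term[common:]))
        prev = term
    return encoded

def front_decode(encoded):
    """Rebuild the original terms from the (shared_prefix_length, suffix) pairs."""
    terms, prev = [], ""
    for common, suffix in encoded:
        prev = prev[:common] + suffix
        terms.append(prev)
    return terms

block = ["automata", "automate", "automatic", "automation"]
print(front_code(block))  # [(0, 'automata'), (7, 'e'), (7, 'ic'), (8, 'on')]
```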

## Intro
- Create indexes to avoid scanning through the text
- If stored in a matrix, we have at least 99.8% zeros
- Only store the data points that are present -> Inverted Index (allowing fast full-text search)
### Biword Index
Add each word-pair (that is two words in consecutive order in the text) to the index.
Input: Hello Index World
Index: Hello Index, Index World
Query: 'Hello Index' AND 'Index World'
Very simple, but queries can result in false positives.
### Positional Index

- we store each word and its occurrence count
- for each word we store an array containing the document IDs and the occurrence count in that document, followed by an array containing the positions
To execute a multi-word query we then
- fetch all the words in our query
- intersect the document lists (i.e. only keep documents containing all the keywords)
- check if the keywords are in the right order
\*the occurrence counts might be omitted
### k-gram indexes
- good for spell checking
- use $ as a separator (word-boundary marker)
- dictionary: token to (its) k-grams
- postings list: k-gram to tokens
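A minimal sketch of building such a k-gram-to-token postings list (illustrative vocabulary and names):

```python
from collections import defaultdict

def build_kgram_index(vocabulary, k=3):
    """Map each k-gram (with '$' marking word boundaries) to the set of tokens containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        padded = "$" + term + "$"
        for i in range(len(padded) - k + 1):
            index[padded[i:i + k]].add(term)
    return index

idx = build_kgram_index(["castle", "casting", "cast"])
print(sorted(idx["$ca"]))  # ['cast', 'casting', 'castle']
```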
## Index Compression
We compress indexes to
- save disk space: 1:4 compression ratios are easy to achieve
- transfer faster from disk to memory: today, copying compressed data and decompressing it is usually faster than copying uncompressed data.
Index Compression significantly improves the efficiency of SPIMI and Logarithmic Merging.
We can reduce the index size by
- case-folding
- stemming (17%)
- remove 30 most common words (30%) [stop words: it, is, the...]
however, compression helps so that we no longer have to remove stop words.
We can not compress the postings, only the dictionary. Therefore index compression only helps with SPIMI and Logarithmic Merging, but not with BSBI or MapReduce.
We would like to compress the postings lists. Instead of storing all positions of a term, we only store the gaps (distances) between them. Frequent terms will therefore need less space to be stored. The trick is to use parameterized encoding:

This requires blocks of a fixed size $n$ for every gap; we would like to improve on that.
### Unary Code
Encode $n$ as $n$ repeated $1$s followed by a $0$. Very easy to encode and decode, but space-inefficient.
### ɣ-codes

1. Convert the number to binary (10011)
2. Chop off the leading 1 (0011)
3. We then encode the length using the unary code (11110)
4. We then merge the two (111100011)
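Both the unary and the ɣ-code as a small bit-string sketch, matching the worked example above (19 → binary 10011 → offset 0011 → length 4 → unary 11110 → ɣ-code 111100011); this is only an illustration:

```python
def unary(n: int) -> str:
    """n encoded as n ones followed by a zero."""
    return "1" * n + "0"

def gamma_encode(n: int) -> str:
    """ɣ-code: unary code of the offset length, followed by the offset (binary without the leading 1)."""
    offset = bin(n)[3:]                 # bin(19) == '0b10011'; drop '0b' and the leading 1 -> '0011'
    return unary(len(offset)) + offset

def gamma_decode(bits: str) -> int:
    """Read the unary length prefix, then the offset, and put the leading 1 back."""
    length = bits.index("0")            # number of ones before the first zero
    offset = bits[length + 1:length + 1 + length]
    return int("1" + offset, 2)

print(gamma_encode(19))           # 111100011
print(gamma_decode("111100011"))  # 19
```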
### VB Codes
Variable-byte encoding, like _ɣ-encoding_, is better than fixed-length encoding, because the distribution of gap sizes is skewed.
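A sketch of variable-byte encoding for a list of gaps, using 7 payload bits per byte and the high bit of the last byte as a terminator (this byte layout follows the common textbook convention; the code itself is illustrative):

```python
def vb_encode_number(n: int) -> bytes:
    """Split n into 7-bit chunks; the last byte of a number has its high bit set."""
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128                   # mark the final byte of this number
    return bytes(chunks)

def vb_decode(stream: bytes):
    """Rebuild the numbers: accumulate 7-bit chunks until a byte with the high bit set."""
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = n * 128 + byte
        else:
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]
encoded = b"".join(vb_encode_number(g) for g in gaps)
print(vb_decode(encoded))  # [824, 5, 214577]
```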
## Zipf's law


If we sort the terms by their number of occurrences, the postings lists further down contain fewer document IDs; the frequencies are distributed according to Zipf's law: $\text{Frequency} = \frac{k}{\text{Rank}}$.
We can normalize this:

We can create Blocks:


## Lecture 7
### Ranked Retrieval
Term frequency is a bad idea, as terms might be concentrated in just a few documents.


We can weight _title_, _abstract_ and _body_ separately.
## Lecture 8
#### Term frequency
We count the occurrences of the term in each document of the collection. This can lead to suboptimal results, because a document that contains one query term very often can outrank documents that contain all query terms.

In this example we query for `foo bar foobar`; using only the term frequency we would return document A, which never contains `bar`.
#### Collection frequency
We only count how often the term occurs in the collection.

Using this approach we might think `foo` is rare, although `bar` is the rarest term.
#### Document frequency
We count the number of documents containing the term.

#### Inverse document frequency
$\text{idf} = \log{\frac{\text{total number of documents}}{\text{document frequency}}}$
#### tf-idf (Term Frequency - Inverse Document Frequency)


### Vector space model
We regard the documents as bags of words.
Using tf-idf we can build a vector representing each document. Using the bag-of-words model and vectors to represent documents, we can use the _inner product_ of the document vectors to measure the similarity of documents.

We can create similar vectors for queries and then compare them in the vector space.
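A rough end-to-end sketch of this idea on a toy collection: build sparse tf-idf vectors for the documents and the query, then rank documents by cosine similarity (the toy documents, the base-10 logarithm, and the helper names are my own choices):

```python
import math
from collections import Counter

docs = {
    "d1": "new home sales top forecasts".split(),
    "d2": "home sales rise in july".split(),
    "d3": "increase in home sales in july".split(),
}

N = len(docs)
df = Counter(term for tokens in docs.values() for term in set(tokens))   # document frequency
idf = {term: math.log10(N / df_t) for term, df_t in df.items()}

def tfidf_vector(tokens):
    """Sparse tf-idf vector as a dict term -> weight (terms unknown to the collection are ignored)."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query = tfidf_vector("home sales july".split())
ranking = sorted(docs, key=lambda d: cosine(query, tfidf_vector(docs[d])), reverse=True)
print(ranking)  # documents ordered by similarity to the query
```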

### Metadata
The best way to perform queries on metadata is to hash the values or put them into a B-tree. We would use the B-tree if we're interested in range queries.
### Machine Learning
We can use ML to optimize the weights used to score documents in zone queries.
### Standard Inverted Index


#### Evaluating a Query
Theory: We create a vector for each document and a vector for the query. If the cosine between the query vector and a document vector is close to one (the angle is very small), they are very similar.

Our challenges are:
1. lots of documents
2. many dimensions (one per term)
The queries are, however, generally very sparse, thus we only have to consider a very small subset of dimensions.
It would be practical to have the tf-idf stored in the inverted index for each document. The tf-idf is a float, which is problematic in an index, as floats are not as easy to compress (ɣ-codes etc.). The trick is to store the idf per term (one float per term) and only the tf (an integer) per document.
We can also use the inner product between vectors...


...to calculate the tf-idf scores.
Putting it all together: VSM

We then throw the partial scores into buckets (accumulators), one per document.
We can do this process term-at-a-time or document-at-a-time.
We can then sort the buckets using a priority queue, in $O(2n)$.
Improvements:
- dropping the query length, since it is constant anyway
- dividing by the document length only after score accumulation has completed; that way we only need to divide once per document
- Replacing all term frequencies with one on the query side. (words are rarely repeated on the query side)
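A hedged sketch of term-at-a-time scoring with one accumulator (bucket) per document and a heap for the top-k results; the index stores the idf per term and the tf per posting as described above, and the other details (toy numbers, length normalization, names) are illustrative:

```python
import heapq
from collections import defaultdict

# toy index: term -> (idf, [(doc_id, tf), ...]); doc_len holds document lengths for normalization
index = {
    "foo":    (0.3, [(1, 4), (2, 1)]),
    "bar":    (1.2, [(2, 2), (3, 1)]),
    "foobar": (2.0, [(3, 5)]),
}
doc_len = {1: 100, 2: 50, 3: 80}

def top_k(query_terms, k=2):
    scores = defaultdict(float)                # one accumulator ("bucket") per document
    for term in query_terms:                   # term-at-a-time
        if term not in index:
            continue
        idf, postings = index[term]
        for doc_id, tf in postings:
            scores[doc_id] += tf * idf         # query-side term frequency treated as 1
    for doc_id in scores:
        scores[doc_id] /= doc_len[doc_id]      # divide by document length once, at the end
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

print(top_k(["foo", "bar", "foobar"]))
```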
### Term Frequency
Depending on the use case different algorithms are used.
#### Natural Term Frequency scaling

// TODO: page 118 from the book
#### Sublinear Term Frequency Scaling

Reasoning: the more often a term already occurs, the less each additional occurrence adds to its importance.
#### Augmented Term Frequency Scaling

#### Boolean term frequency scaling

#### Log-average term frequency scaling

### Document Frequency
#### Natural document frequency scaling

#### Inverted document frequency scaling

#### Prob idf document frequency scaling
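The tables for these term-frequency and document-frequency variants are missing from the notes above; for reference, the standard SMART formulas (as in the Introduction to Information Retrieval book) are roughly:
- natural tf: $\text{tf}_{t,d}$
- sublinear (logarithm): $1 + \log \text{tf}_{t,d}$ (0 if $\text{tf}_{t,d} = 0$)
- augmented: $0.5 + \frac{0.5 \cdot \text{tf}_{t,d}}{\max_t \text{tf}_{t,d}}$
- boolean: $1$ if $\text{tf}_{t,d} > 0$, else $0$
- log average: $\frac{1 + \log \text{tf}_{t,d}}{1 + \log(\text{ave}_{t \in d}\, \text{tf}_{t,d})}$
- natural df: $1$ (no df weighting)
- idf: $\log \frac{N}{\text{df}_t}$
- prob idf: $\max\{0, \log \frac{N - \text{df}_t}{\text{df}_t}\}$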

### Normalization
- not normalize at all
- cosine normalization (divide by the Euclidean norm)
- byte-size normalization ($1/\text{CharLength}^{\alpha}$)
- pivoted normalization
### Optimization
Posting lists can get extremely large. The idea is to limit the search to a smaller set of documents.

- reduce terms (e.g. stop words, words with little information)
- remove documents which only contain a few of the terms
If we can't find enough matches after this optimization, we can still fall back to the larger set of documents.
### Clustering

We can randomly pick documents ($\sqrt{n}$) - we call them leaders - and for all the other documents we compute the nearest leader. When we run a query, we simply find the nearest leader and then select that cluster.
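A minimal sketch of this leader/follower clustering and of cluster-restricted search, using dense toy vectors and cosine similarity (the representation and names are my own, just to show the idea):

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_clusters(doc_vectors):
    """Pick sqrt(n) random documents as leaders and assign every document to its nearest leader."""
    leaders = random.sample(list(doc_vectors), max(1, int(math.sqrt(len(doc_vectors)))))
    clusters = {leader: [] for leader in leaders}
    for doc_id, vec in doc_vectors.items():
        nearest = max(leaders, key=lambda l: cosine(vec, doc_vectors[l]))
        clusters[nearest].append(doc_id)
    return clusters

def search(query_vec, doc_vectors, clusters):
    """At query time: find the nearest leader, then rank only that leader's cluster."""
    nearest = max(clusters, key=lambda l: cosine(query_vec, doc_vectors[l]))
    return sorted(clusters[nearest], key=lambda d: cosine(query_vec, doc_vectors[d]), reverse=True)

docs = {i: [random.random() for _ in range(5)] for i in range(16)}
clusters = build_clusters(docs)
print(search([1, 0, 0, 0, 0], docs, clusters))
```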
### Tiered Indices
We sort and shard to build (per-term) tier lists; in the figure, each color is one tier.

### Query Interface
- boolean queries -> std. inverted index
- phrase queries -> bi-word index or positional index
- set of words -> vector space model
- zone queries
- wildcard queries

### Overall Architecture

## Lecture 9
### Probability theory
We have elementary events $\gamma \in \Omega$, with the sum of the probabilities of all events being one: $\sum_{\gamma \in \Omega} p(\gamma) = 1$.
Odds are the probability of an event divided by the probability of its complement.
#### Conditional Probability

We use Bayes' rule for information retrieval.

Bayes' rule relates the posterior and the prior probability.
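Written out (with $R$ for relevance and $d$ for a document as one typical reading in the IR setting; the slide's exact version is not in these notes):

$$P(R \mid d) = \frac{P(d \mid R)\, P(R)}{P(d)}$$

where $P(R)$ is the prior and $P(R \mid d)$ the posterior probability.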

// TODO: Lots of formulas
## Lecture 10
- Type: An equivalence class of terms. Types are associated with postings lists in the standard inverted index.
- Dewey Decimal Classification: A number-based system used by many libraries to sort and retrieve their books.
- False positives can be decreased by using phrase indices with bigger tuples of words.
- We do not use fixed-length encoding on gaps, because the distribution of gap sizes is skewed.
- Inexact top-k retrieval makes sense because using cosine computations on vector representations as a similarity measure is already an approximation of reality, so going inexact does not hurt the results so much.
- K-gram indices are indexing on sequences of k characters and are used for tolerant retrieval.
- K-word indices are indexing sequences of k terms and are used for phrase search.
- Precision and recall on ranked retrieval can be measured by going down the result list and drawing a curve plotting precision against increasing recall.
- Capacity: e.g. X words in a book
- Throughput: e.g. X words per minute
- Latency: e.g. X seconds to look up something in a book
