# Information Retrieval

## Lecture 1

Introduction only.

> Hello, I'm a frogeater
> blablabla
> blablabla

## Lecture 2

### Precision

$\text{Precision} = \frac{\text{returned relevant documents}}{\text{returned documents}}$

We call the returned relevant documents **true positives**.

$\text{Recall} = \frac{\text{returned relevant documents}}{\text{relevant documents}}$

Getting high precision is _easy_ at the cost of low recall, and vice versa. This is a main trade-off in information retrieval.

### Inverted Index

Storing a matrix (we call this an _adjacency matrix_) with all the documents and terms leads to the _99% problem_: the matrix is sparse. We solve this by only listing, for each term, the documents that contain it. We call this an inverted index.

#### Intersection Algorithm

- if match
  - add to intersection set
  - move all pointers
- else
  - move lower pointer

#### Union Algorithm

- if match
  - add to union set
  - move all pointers
- else
  - add lower pointer to union set
  - move lower pointer

## Lecture 2

### Index Construction

1. Collect documents.
2. Tokenize - more difficult in some languages (e.g. if the meaning of a word changes depending on the context).
3. Linguistic preprocessing.
4. Build the index (posting lists).

### Equivalence Classes

Group equivalent terms; however, this also groups identical terms with different meanings (e.g. windows, Windows).

### Index Expansion

Index expansion takes more space, but queries are faster. (Preprocessing)

### Normalization

U.S.A -> USA

### Expansion

Not fully symmetric...

![](https://i.imgur.com/R2llWW8.png)

### Stemming

differentli -> different

### Skip Pointers

To traverse long posting lists, we can add skip pointers that jump over a few entries. A typical skip length is the _square root of the length of the posting list_.

### Bi-word Indices

Always store two successive terms together. This, however, increases the number of false positives. We reduce them by using bigger tuples.

### Definitions

- Type: an equivalence class of terms. Types are associated with posting lists in the standard inverted index (e.g. _computer_; words like _computing_ would then be converted into that type).

## Lecture 3

### Search Structures

#### Hash Table

- bad for range queries
- not perfect (collisions)
- large space requirements (to avoid collisions)

#### Binary Tree

1. Terms can be stored on leaves and inner nodes.
   ![](https://i.imgur.com/Ohir8zh.png)
2. Terms only on the leaves.
   ![](https://i.imgur.com/Z31dWgm.png)
   Variant: use intervals.
   ![](https://i.imgur.com/NWTBUQe.png)
   ![](https://i.imgur.com/rJJwhrq.png)

Problems with the binary tree:

- has to be rebalanced
- can get too large for memory
- a lot of seeks (one at every node)

Solution -> B+-tree

#### B+-Tree

- each node has between $d$ and $2d + 1$ children (!!!)
- all leaves are at the same level

We can also build a reversed index for _wildcard queries_. A reversed index works by reversing all the terms.

![](https://i.imgur.com/bzAcZGr.png)

#### Permuterm Index

![](https://i.imgur.com/4peqBpz.png)

#### k-gram

Ideally 2- to 4-grams.

### Input Correction

#### Levenshtein Distance

![](https://i.imgur.com/033V9EC.png)

#### k-grams

k-grams account for misspellings to some extent.

![](https://i.imgur.com/dyP4fDT.png)

This can lead to false positives, however:

### Jaccard Distance

When resolving spelling mistakes using a k-gram index, we get false positives:

![](https://i.imgur.com/58D8Had.png)

Using the Jaccard distance we can solve this problem. The Jaccard coefficient is the size of the intersection divided by the size of the union (for the image above it would be 3/12 = 0.25); the Jaccard distance is one minus that.

### Summary

1. Get the k-grams from the query.
2. Look them up in the k-gram index.

Then either

- compute Jaccard distances
- keep terms with small Jaccard distances (i.e. a high Jaccard coefficient)

or

- compute edit distances
- keep terms with small edit distances
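A minimal sketch of this k-gram plus Jaccard candidate filtering in Python; the toy vocabulary, the query term and the 0.3 threshold are made up for illustration and not from the lecture:

```python
# Spell-correction candidates via 2-grams filtered by the Jaccard coefficient.
# Vocabulary, query term and threshold are made-up examples.

def kgrams(term, k=2):
    """Split a term into k-grams, using $ as the boundary marker."""
    padded = f"${term}$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def jaccard(a, b):
    """Jaccard coefficient: |intersection| / |union|."""
    return len(a & b) / len(a | b)

vocabulary = ["border", "boarder", "lord", "morbid", "retrieval"]
query = "bord"  # misspelled query term

query_grams = kgrams(query)
candidates = []
for term in vocabulary:
    coeff = jaccard(query_grams, kgrams(term))
    if coeff >= 0.3:  # high coefficient = small Jaccard distance
        candidates.append((term, coeff))

for term, coeff in sorted(candidates, key=lambda pair: -pair[1]):
    print(term, round(coeff, 2))
```

In a real system the candidates would come from the k-gram index rather than a linear scan over the whole vocabulary.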
### Soundex Representation

1. Keep the first letter.
2. Map all other letters like this:
   ![](https://i.imgur.com/ZzVobJO.png)
3. Remove all repetitions.
4. Remove all zeros.
5. Trim and pad to 4 characters.

![](https://i.imgur.com/KlBqzha.png)

## Lecture 5

### Hardware

The bottleneck is usually the hard drive. We use _batch processing_ to reduce the number of seeks:

#### BSBI (block sort)

We shard the collection into smaller chunks, sort each chunk and then merge the results. $O(T \log T) = O(T \log T) + O(T)$ because of the sorting, but it feels slower because the second step (the merge) takes longer on the hardware.

#### SPIMI (single pass)

Improves on the complexity of BSBI, with $O(T)$. This is done by directly building the inverted index, so terms no longer have to be looked up.

#### MapReduce

What actual search engines use.

![](https://i.imgur.com/NTLq6cQ.png)

### Auxiliary Index

If the index has to be updated frequently, we can build a second index that holds the new changes and periodically (when the auxiliary index gets too big) integrate them into the main index. For deleted documents we can use an invalidation bit vector.

Merging the auxiliary index gets more expensive each time. The first round takes $O(n)$, then it doubles until we reach $O(T)$, resulting in an overall complexity of $O(n + 2n + ... + \frac{T}{n} \cdot n) = O(n\frac{\frac{T}{n}(\frac{T}{n} + 1)}{2}) = O(\frac{T^2}{n})$. So can we do better?

#### Logarithmic Merging

Instead of always merging the auxiliary index with the main index, we build up two indexes on disk and only merge them once they have the same size (i.e. the size of the main index doubles on every merge). Complexity: $O(T \log \frac{T}{n})$

## Lecture 6

### Blocking

A way of index compression.

![](https://i.imgur.com/ThZfAg7.png)
![](https://i.imgur.com/NfEEApJ.png)

If we sort the terms, we can further reduce the index size using _front coding_.

![](https://i.imgur.com/fKVuXKu.png)

## Intro

- Create indexes to avoid scanning through the text.
- If stored in a matrix, at least 99.8% of the entries are 0.
- Only store the data points -> inverted index (allowing fast full-text search).

### Biword Index

Add each word pair (that is, two consecutive words in the text) to the index.

- Input: Hello Index World
- Index: Hello Index, Index World
- Query: 'Hello Index' AND 'Index World'

Very simple, but queries can return false positives.

### Positional Index

![](https://i.imgur.com/8xhn2Eb.png)

- We store each word and its occurrence count.
- For each word we store an array containing the document IDs and the occurrence count in that document, followed by an array containing the positions.

To execute a multi-word query we then

- fetch all the words in our query,
- intersect the document lists (i.e. only keep documents containing all the keywords),
- check whether the keywords are in order.

*The occurrence counts might be omitted.
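A minimal sketch of a phrase query over a positional index, assuming a simplified `term -> {doc_id: [positions]}` layout rather than the exact on-disk format from the notes:

```python
# Phrase query: intersect the document lists, then check that the
# positions of the query terms are consecutive. The tiny index is made up.

index = {
    "hello": {1: [0, 7], 2: [3]},
    "index": {1: [1], 2: [0, 4]},
}

def phrase_query(index, terms):
    """Return the doc_ids where the terms occur consecutively, in order."""
    # Intersect the document lists: keep docs containing all terms.
    docs = set(index.get(terms[0], {}))
    for term in terms[1:]:
        docs &= set(index.get(term, {}))
    result = []
    for doc in docs:
        # Does some occurrence of the first term have the second term at the
        # next position, the third term after that, and so on?
        starts = index[terms[0]][doc]
        if any(all(pos + i in index[terms[i]][doc] for i in range(len(terms)))
               for pos in starts):
            result.append(doc)
    return sorted(result)

print(phrase_query(index, ["hello", "index"]))  # -> [1, 2]
```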
### k-gram Indexes

- good for spell checking
- use $ as a separator
- dictionary: token to (its) k-grams
- postings list: k-gram to tokens

## Index Compression

We compress indexes to

- save disk space: 1:4 compression ratios are easy to achieve,
- transfer faster from disk to memory: today, copying compressed data and decompressing it is usually faster than copying uncompressed data.

Index compression significantly improves the efficiency of SPIMI and logarithmic merging.

We can reduce the index size by

- case-folding,
- stemming (17%),
- removing the 30 most common words (30%) [stop words: it, is, the, ...].

However, compression can help so that we no longer have to remove stop words.

We cannot compress the postings, only the dictionary. Therefore index compression only helps with SPIMI and logarithmic merging, but not with BSBI or MapReduce.

We would like to compress the posting lists. Instead of storing all positions of a term, we only store the distances (gaps). Frequent terms will therefore need less space to be stored. The trick is to use parametrized encoding:

![](https://i.imgur.com/iTH4jOd.png)

This requires blocks of size $n$ for every block; we would like to improve on that.

### Unary Code

Encode $n$ as $n$ repeated $1$s followed by a $0$. Very easy to encode and decode, but space inefficient.

### ɣ-codes

![](https://i.imgur.com/V1tB6B8.png)

1. Convert the number to binary (10011).
2. Chop off the leading 1 (0011).
3. Encode the length using the unary code (11110).
4. Concatenate the two (111100011).

### VB Codes

Variable byte encoding, like _ɣ-encoding_, is better than fixed-length encoding because the distribution of gap sizes is skewed.

## Zipf's Law

![](https://i.imgur.com/RIfyB60.png)
![](https://i.imgur.com/yx9tGqh.jpg)

If we sort terms by occurrence count, there are fewer document IDs further down... distributed according to Zipf's law: $\text{Frequency} = \frac{k}{\text{Rank}}$.

We can normalize this:

![](https://i.imgur.com/ZwIGv7g.png)

We can create blocks:

![](https://i.imgur.com/UwjUgiw.png)
![](https://i.imgur.com/muO0SIL.png)

## Lecture 7

### Ranked Retrieval

Term frequency alone is a bad idea, as terms might be concentrated in just a few documents.

![](https://i.imgur.com/ZsWxpq0.png)
![](https://i.imgur.com/js1y1Cc.png)

We can weight _title_, _abstract_ and _body_ separately.

## Lecture 8

#### Term frequency

We count the occurrences of the term in each document of the collection. This can lead to suboptimal results, since a high count of one query term can outweigh the other query terms.

![](https://i.imgur.com/O9Bzkrm.png)

In this example we query for `foo bar foobar`; using the term frequency we would return document A, which never contains `bar`.

#### Collection frequency

We only count how often the term occurs in the whole collection.

![](https://i.imgur.com/WMcHz0N.png)

Using this approach we might think `foo` is rare, although `bar` is the rarest term.

#### Document frequency

We count the number of documents containing the term.

![](https://i.imgur.com/3diL3YA.png)

#### Inverse document frequency

$\text{idf} = \log{\frac{\text{total number of documents}}{\text{document frequency}}}$

#### tf-idf (Term Frequency - Inverse Document Frequency)

![](https://i.imgur.com/43LLjad.png)
![](https://i.imgur.com/P5aRdJF.png)

### Vector space model

We regard the documents as bags of words. Using tf-idf weights we can build a vector for each document. Using the bag-of-words model and vectors to represent documents, we can use the _inner product_ of the document vectors to measure the similarity of documents.

![](https://i.imgur.com/cEpap5I.png)

We can create similar vectors for queries and then compare them to the documents in the vector space.

![](https://i.imgur.com/waxLoq5.png)
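A minimal sketch of tf-idf weighting and cosine similarity over a made-up three-document collection (natural tf and idf = log(N/df) as above; the collection and query are illustrative assumptions):

```python
# Vector space model: tf-idf vectors per document, cosine similarity to a
# query vector. The tiny collection is made up for illustration.

import math
from collections import Counter

docs = {
    "A": "foo foo foo foobar".split(),
    "B": "foo bar foobar".split(),
    "C": "bar bar baz".split(),
}
N = len(docs)

# Document frequency: number of documents containing each term.
df = Counter()
for words in docs.values():
    df.update(set(words))

def tfidf_vector(words):
    """tf-idf weight per term: tf * log(N / df); unseen terms are dropped."""
    tf = Counter(words)
    return {t: tf[t] * math.log(N / df[t]) for t in tf if df.get(t)}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

query_vec = tfidf_vector("foo bar foobar".split())
for name, words in docs.items():
    print(name, round(cosine(query_vec, tfidf_vector(words)), 3))
```

Document B scores 1.0 here because it points in exactly the same direction as the query vector.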
### Metadata

The best way to perform queries on metadata is to hash the indices or put them into a B-tree. We would use the B-tree if we are interested in range queries.

### Machine Learning

We can use ML to optimize the weights used to score documents in zone queries.

### Standard Inverted Index

![](https://i.imgur.com/qjUZHuS.png)
![](https://i.imgur.com/siOOr9N.png)

#### Evaluating a Query

Theory: we create a vector for each document and a vector for the query. If the cosine between a query vector and a document vector is close to one (the angle is very small), they are very similar.

![](https://i.imgur.com/WNdq3DM.png)

Our challenges are:

1. lots of documents
2. many dimensions (one per term)

The queries are, however, generally very sparse, so we only have to consider a very small subset of the dimensions.

It would be practical to have the tf-idf stored in the inverted index for each document. But the tf-idf is a float, which is problematic in an index, as floats are not as easy to compress (ɣ-codes etc.). The trick is to store the idf per term (one float per term) and only the tf (an integer) per document.

We can also use the inner product between vectors...

![](https://i.imgur.com/klRMXgA.png)
![](https://i.imgur.com/xPDAkLy.png)

...to calculate the tf-idf scores.

Putting it all together: VSM

![](https://i.imgur.com/V5zC9LM.jpg)

We then throw the results into buckets, one per document. We can do this term-at-a-time or document-at-a-time. We can then sort the buckets using a priority queue, $O(2n)$.

Improvements:

- Drop the query length; it is constant anyway.
- Divide by the document length only after score accumulation has completed; that way we only need to divide once per document.
- Replace all term frequencies with one on the query side (words are rarely repeated in queries).

### Term Frequency

Depending on the use case, different algorithms are used.

#### Natural term frequency scaling

![](https://i.imgur.com/jIjYmlK.png)

// TODO: page 118 from the book

#### Sublinear term frequency scaling

![](https://i.imgur.com/6UOZMMd.png)

Reasoning: the more often a term already occurs, the less each additional occurrence matters.

#### Augmented term frequency scaling

![](https://i.imgur.com/7SciYXN.png)

#### Boolean term frequency scaling

![](https://i.imgur.com/KPdVwCT.png)

#### Log-average term frequency scaling

![](https://i.imgur.com/qUkMkdQ.png)

### Document Frequency

#### Natural document frequency scaling

![](https://i.imgur.com/kQ69471.png)

#### Inverted document frequency scaling

![](https://i.imgur.com/kEfjJE8.png)

#### Prob idf document frequency scaling

![](https://i.imgur.com/g2aKvDy.png)

### Normalization

- no normalization at all
- cosine normalization (divide by the Euclidean norm)
- byte-size normalization (1 / character length^a)
- pivoted normalization

### Optimization

Posting lists can get extremely large. The idea is to limit the search to a smaller set of documents.

![](https://i.imgur.com/0qoCdsv.png)

- reduce the terms (e.g. stop words, words with little information)
- remove documents which only contain a few of the terms

We can still fall back to the larger set of documents if we can't find enough matches after this optimization.

### Clustering

![](https://i.imgur.com/TcvdaVY.png)

We randomly pick $\sqrt{n}$ documents (we call them leaders) and for all the other documents we compute the nearest leader. When we run a query, we simply find the nearest leader and then search that cluster.
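A minimal sketch of this leader/follower pruning; the cosine measure and the sqrt(n) leader count follow the notes, while the toy vectors and collection size are made up:

```python
# Leader clustering: pick sqrt(n) random documents as leaders, assign every
# document to its nearest leader, and at query time rank only the documents
# in the nearest leader's cluster instead of the whole collection.

import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0

random.seed(0)
docs = {i: [random.random() for _ in range(4)] for i in range(16)}  # toy vectors

leaders = random.sample(list(docs), int(math.sqrt(len(docs))))      # sqrt(n) leaders
clusters = {leader: [] for leader in leaders}
for doc_id, vec in docs.items():
    nearest = max(leaders, key=lambda l: cosine(vec, docs[l]))      # nearest leader
    clusters[nearest].append(doc_id)

query = [random.random() for _ in range(4)]
best_leader = max(leaders, key=lambda l: cosine(query, docs[l]))
ranking = sorted(clusters[best_leader],
                 key=lambda d: cosine(query, docs[d]), reverse=True)
print(best_leader, ranking)
```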
### Tiered Indices

We sort and shard to build (per-term) tier lists. Each color in the figure is one tier.

![](https://i.imgur.com/ruyqZHA.png)

### Query Interface

- boolean queries -> standard inverted index
- phrase queries -> bi-word index or positional index
- set of words -> vector space model
- zone queries
- wildcard queries

![](https://i.imgur.com/Ocq9noo.png)

### Overall Architecture

![](https://i.imgur.com/wbwRT3l.jpg)

## Lecture 9

### Probability theory

We have elementary events $\gamma \in \omega$, with the sum of the probabilities of all events being one: $\sum_{\gamma \in \omega} p(\gamma) = 1$. Odds are the probability of an event divided by the probability of its complement.

#### Conditional Probability

![](https://i.imgur.com/TVp8qS3.png)

We use Bayes' rule for information retrieval.

![](https://i.imgur.com/OUIYZuO.png)

Posterior and prior probabilities.

![](https://i.imgur.com/lIbPrvk.png)

// TODO: Lots of formulas

## Lecture 10

- Type: an equivalence class of terms. Types are associated with posting lists in the standard inverted index.
- Dewey Decimal Classification: a number-based system used by many libraries to sort and retrieve their books.
- False positives can be decreased by using phrase indices over bigger tuples of words.
- We do not use fixed-length encoding on gaps, because the distribution of gap sizes is skewed.
- Inexact top-k retrieval makes sense because using cosine computations on vector representations as a similarity measure is already an approximation of reality, so going inexact does not hurt the results much.
- k-gram indices index sequences of k characters and are used for tolerant retrieval.
- k-word indices index sequences of k terms and are used for phrase search.
- Precision and recall on ranked retrieval can be measured by going down the ranked results and drawing a curve plotting precision against increasing recall (see the sketch below).

- Capacity: e.g. X words in a book
- Throughput: e.g. X words per minute
- Latency: e.g. X seconds to look something up in a book

![](https://i.imgur.com/8KkKkhA.png)
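A minimal sketch of that precision/recall measurement down a ranking; the ranking and the relevance judgments are made up for illustration:

```python
# Walk down a ranked result list and record a (recall, precision) point at
# every rank; plotting these points gives the precision-recall curve.

ranking = ["d3", "d7", "d1", "d9", "d4", "d2"]   # system output, best first
relevant = {"d3", "d1", "d2", "d8"}              # ground-truth relevant docs

points = []
hits = 0
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
    precision = hits / rank
    recall = hits / len(relevant)
    points.append((recall, precision))

for recall, precision in points:
    print(f"recall={recall:.2f}  precision={precision:.2f}")
```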
