# IR

> [toc]

### Term Dependencies

1. Capturing term dependencies is a way beyond traditional IR models. Please list two methods to ==model term dependencies==. Any two methods are welcome.
   Ans: Same as Q4: a term correlation matrix, or minterms.
2. Term independence is a strong assumption in classic IR modeling. Please explain how the generalized vector model relaxes the assumption.
   Ans: Same as Q4: represent each term with minterms, so that terms become correlated with one another.
3. ![](https://i.imgur.com/q1Y14o1.png)
   Ans: The basic vector model assumes all index terms are mutually independent; the generalized vector model addresses this with minterms. The formula and explanation are the same as Q4.
4. ![](https://i.imgur.com/8QJk41Z.png)
   Ans:
   ![](https://i.imgur.com/M2Ps1gR.png)
   ![](https://i.imgur.com/XHHX0m5.png)

### document representation, query representation, and a ranking function

1. ![](https://i.imgur.com/25gubXf.png)
   Ans: Not entirely sure about the query and document representations in the probabilistic model, since the rank can be computed from n_i and N alone.
   ![](https://i.imgur.com/BCkG9LG.jpg)
2. ![](https://i.imgur.com/gbEVnV4.png)
   Ans: Same as Q1: BM25 is a probabilistic model; not sure about the query and document representations.
   ![](https://i.imgur.com/p5RgEv7.jpg)
3. An IR model is specified by document representation, query representation, and a ranking function. Please describe these three parts for the BM25 model.
   Ans: Same as Q2.
4. Representations of queries and documents and computation of their relationship degree are key components in an IR framework. Please compare the counting-based IR model and the prediction-based IR model from the aspects of representations and similarity computation. You can select any one model to describe your answers.

### latent semantic indexing model

1. Please explain how the latent semantic indexing model maps terms and documents into the same vector space.
   Ans: By using Singular Value Decomposition (SVD): M = K·S·Dᵗ, where
   * K is a term matrix,
   * S is a diagonal matrix of singular values,
   * D is a document matrix.
2. In latent semantic indexing, we try to map both terms and documents into a lower-dimensional space and perform the similarity computation in that space. Please show how to compute term vectors, document vectors, and vectors of input queries in the reduced space (see the sketch below).
3. ![](https://i.imgur.com/BjoPAb7.png)
4. ![](https://i.imgur.com/QzHL7ef.png)
   Ans:
   ![](https://i.imgur.com/ko3Gyes.png)
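As a rough sketch for Q2 above: decompose the term-document matrix with SVD, keep the top s singular values, and fold a query in as a pseudo-document. A minimal NumPy example, where the toy matrix, the query counts, and s = 2 are all made-up assumptions:

```python
import numpy as np

# Toy term-document matrix M: rows are index terms, columns are documents.
M = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
], dtype=float)

# SVD: M = K @ diag(S) @ D, with K the term-side and D the document-side factors.
K, S, D = np.linalg.svd(M, full_matrices=False)

s = 2                                    # keep the top-s latent concepts
K_s, S_s, D_s = K[:, :s], np.diag(S[:s]), D[:s, :]

# Document j's vector in the reduced concept space is column j of D_s;
# a query is folded in as a pseudo-document: q_hat = S_s^{-1} K_s^T q.
q = np.array([1, 0, 1, 0], dtype=float)  # raw term counts of the query
q_hat = np.linalg.inv(S_s) @ K_s.T @ q

# Rank documents by cosine similarity in the concept space.
sims = D_s.T @ q_hat / (np.linalg.norm(D_s, axis=0) * np.linalg.norm(q_hat) + 1e-12)
print(sims)
```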
### counting-based IR model and prediction-based IR model

1. Representations of queries and documents and computation of their relationship degree are key components in an IR framework. Please compare the counting-based IR model and the prediction-based IR model from the aspects of representations and similarity computation. You can select any one model to describe your answers.
2. The counting-based language model is based on the Markov assumption with a restriction of history length k. Besides, the history is read from left to right; that is, the context is also restricted. Please propose a neural language model to relax these two restrictions.
3. ![](https://i.imgur.com/z0ew0fP.png)

### two sides?

> [Looks like these are the same, since "filtering = routing" according to Q3]

1. Information retrieval and information filtering are two sides of the same coin. Please describe how information retrieval concepts can be applied to information filtering.
2. Searching and routing are two sides of the same coin. In the class, you have learned lessons to evaluate the performance of a searching system. Please describe the concept of routing first, and discuss how to evaluate the performance of a routing system.
3. ![](https://i.imgur.com/2OXylTR.png)

### tf-idf

1. ![](https://i.imgur.com/eBB4obI.png)
   Ans:
   ![](https://i.imgur.com/B45ugNi.jpg)

### query likelihood, document likelihood and model comparison

1. ![](https://i.imgur.com/JB2aAz7.png)
   ![](https://i.imgur.com/FwyjMd5.png)
   Ans:
   ![](https://i.imgur.com/63M8Gx6.jpg)

### Probabilistic Model

#### BM25

1. ![](https://i.imgur.com/q5GBB3t.png)
   Ans:
   ![](https://i.imgur.com/lsAu6Un.png)

#### language model

1. ![](https://i.imgur.com/dLopS9G.png)
   Ans:
   ![](https://i.imgur.com/u7kF9mn.png)

---

* Filtering is often meant to imply the removal of data from an incoming stream, rather than finding data in that stream. In the first case, the users of the system see what is left after the data is removed; in the latter case, they see the data that is extracted.

---

# Notes

* **Information retrieval** is about returning the information that is relevant to a specific query or field of interest of the user, while **information extraction** is more about extracting general knowledge (or relations) from a set of documents or information.
* Index Term
  ![](https://i.imgur.com/07YInWo.png)
* IR model
  * ![](https://i.imgur.com/oNmsrPa.png)
  * ![](https://i.imgur.com/nSgNjgN.png)
  * ![](https://i.imgur.com/pAdTXVH.png)
  * ![](https://i.imgur.com/mmkmoOq.png)

#### Classic IR Model

##### Boolean Model

![](https://i.imgur.com/n0OSELL.png)
![](https://i.imgur.com/V0gyHX8.png)

##### Term Weight

* Term weighting was introduced to measure the importance of an index term in a document or a query, respectively.
* For classic IR, the index term weights are assumed to be mutually independent (e.g., the classic vector model). To take term-term correlations into account, we can compute a correlation matrix to get the correlation among terms.
  ![](https://i.imgur.com/5vlsAhl.png)
* How to measure the importance of an index term? -> TF-IDF
  * tf: occurrences of term k in the document
  * idf: number of documents / number of documents that include term k -> a term with higher idf is more important.
  * tf * idf is a classic term weighting strategy (see the sketch below)
  * ![](https://i.imgur.com/Y6uCDVs.png)
  * ![](https://i.imgur.com/q8QZ0te.png)
  * ![](https://i.imgur.com/0AQGsMl.png)
  * ![](https://i.imgur.com/67mLV9T.png)
* Document length normalization
  * ![](https://i.imgur.com/ghTdwQP.png)
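A minimal sketch of the tf-idf weighting described above, assuming length-normalized tf and a logarithmic idf (the toy corpus is made up):

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (the contents are made up).
docs = [
    "information retrieval ranks documents".split(),
    "information extraction finds relations".split(),
    "retrieval models rank documents by relevance".split(),
]
N = len(docs)

# idf_k = log(N / n_k), where n_k = number of documents containing term k.
df = Counter(term for d in docs for term in set(d))
idf = {term: math.log(N / n) for term, n in df.items()}

def tfidf(doc):
    """Weight w_k = tf_k * idf_k, with tf normalized by document length."""
    tf = Counter(doc)
    return {term: (count / len(doc)) * idf[term] for term, count in tf.items()}

for d in docs:
    print(tfidf(d))
```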
##### Vector Model

![](https://i.imgur.com/z300s8S.png)
![](https://i.imgur.com/GgQY40C.png)
![](https://i.imgur.com/0Swnfur.png)
![](https://i.imgur.com/VUV2q7W.png)
![](https://i.imgur.com/bj9bRrz.png)

##### Classic Probabilistic Model

![](https://i.imgur.com/hNR5vft.png)
![](https://i.imgur.com/ytqzODL.png)
![](https://i.imgur.com/hDLbp5r.png)
![](https://i.imgur.com/gZZKAse.png)
After a pile of derivations I can't follow, we get
![](https://i.imgur.com/j11pI4r.png)
![](https://i.imgur.com/NtBhjjH.png)
![](https://i.imgur.com/sQ8LSLs.png)
![](https://i.imgur.com/BS0ZTBx.png)
![](https://i.imgur.com/2J6k3Su.png)
To do: initialization?

---

### Advanced Model: consider term dependencies

#### Set Theoretic IR Models

##### Set-based Model

Replace index terms with termsets: when a document includes all index terms in a termset, the document includes this termset.

* A termset is used to capture term dependencies.
* ![](https://i.imgur.com/Tgs9yeF.png)
* To reduce the number of termsets, only frequent termsets are used.
  * ![](https://i.imgur.com/H5zU00Y.png) frequent: the number of documents containing the termset >= threshold
  * ![](https://i.imgur.com/1amasdD.png)
  * $$S_{abd} = \{d_1\},\ N_i = 1,\ \text{not frequent}$$
  * ![](https://i.imgur.com/TsZIeec.png)
  * ![](https://i.imgur.com/8PGzVLg.png)
* Closed termsets: there are still too many frequent termsets, so we reduce them further -> keep the largest.
  ![](https://i.imgur.com/QL92fAV.png)

##### Extended Boolean Model

Combines properties of the vector model with Boolean algebra to fix the problems of the traditional Boolean model: no ranking, and no partial matching or term weighting.

![](https://i.imgur.com/dx1Vf20.png)
![](https://i.imgur.com/K4dRNHg.png)
![](https://i.imgur.com/HPM8qgF.png)
![](https://i.imgur.com/t8AJw5E.png)

Define the vector of d_j from the document's weights for index terms x and y; the x, y weights are computed as: ![](https://i.imgur.com/qiE0hSw.png)
![](https://i.imgur.com/Gg4NSxK.png)

How to compute document-query similarity? For OR, the farther from (0, 0) the better; for AND, the closer to (1, 1) the better.
![](https://i.imgur.com/zUjkN7U.png)

* For higher dimensions -> p-norm (see the sketch below)
  ![](https://i.imgur.com/MBLo7bG.png)
  When p = 1, it reduces to taking the average of the weights -> vector-like -> recall
  ![](https://i.imgur.com/guUj3nI.png)
  p = infinity: fuzzy-like
* As a result, the order of operators matters.
  ![](https://i.imgur.com/lZZFFZq.png)
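A small sketch of the p-norm similarities of the extended Boolean model for a two-term query, with made-up document weights x and y; it shows how p = 1 averages the weights (vector-like) while large p approaches min/max behavior (fuzzy-like):

```python
# p-norm similarity for a two-term query with document weights x, y in [0, 1].
def sim_or(x, y, p):
    # OR: distance from (0, 0); the farther, the better.
    return ((x**p + y**p) / 2) ** (1 / p)

def sim_and(x, y, p):
    # AND: distance from (1, 1); the closer, the better.
    return 1 - (((1 - x)**p + (1 - y)**p) / 2) ** (1 / p)

x, y = 0.8, 0.3
for p in (1, 2, 10):
    print(p, sim_or(x, y, p), sim_and(x, y, p))
# p = 1 gives the plain average for both operators;
# as p grows, OR approaches max(x, y) and AND approaches min(x, y).
```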
##### Fuzzy Set Model

A document without a query term could also be highly relevant to a query (e.g., it has a term related to a query term)! -> To solve this issue -> fuzzy sets. The degree of membership is between 0 and 1: not a sharp boundary, but a smooth one.

![](https://i.imgur.com/qL8noKb.png)

* How to get the fuzzy set? -> via a term-term correlation matrix -> find the intersection
* ![](https://i.imgur.com/LP6H0u5.png)
* d_j has a degree of membership with respect to k_i. c_il is the degree of correlation between index terms i and l, and the degree of d_j with respect to k_i is one minus the product, over the terms k_l in d_j, of the "uncorrelatedness" of k_i and k_l. Imagine some k_l is strongly related to k_i with c_il = 1: then the degree of d_j with respect to k_i becomes 1. In this way, the fuzzy notion lets a document d_j that doesn't even contain k_i still be related to the query.
* ![](https://i.imgur.com/mnjko2o.png)

---

#### Algebraic IR Models

##### Generalized Vector Model

Use minterms to capture term dependencies. Overall it is similar to the vector space model, but the restriction that term vectors are mutually independent is removed: each term vector is expanded over minterms, and it is the minterm vectors that are independent.

Represent a document with minterms, compute c_ir from the documents that share the same minterm, and finally represent each index term vector with c_ir and the minterms. In the end, k_i and k_j are not orthogonal: when we calculate similarity, k_i · k_j != 0 anymore -> term dependency!

![](https://i.imgur.com/a2a1nek.png)
![](https://i.imgur.com/LPk8fp3.png)

If k_i and k_j co-occur in many documents, their degree of correlation should be high.

![](https://i.imgur.com/jQLJ9Lm.png)
![](https://i.imgur.com/n5WoHNP.png)
![](https://i.imgur.com/LWwCsqs.png)
![](https://i.imgur.com/hJZ8cJE.png)

##### Latent Semantic Indexing

The generalized vector model is high-dimensional and sparse, whereas latent semantic indexing is low-dimensional and dense. We shouldn't operate on index terms directly; we need to map into some other space and do the computation there: we need concept retrieval, not index term retrieval. A query is represented by a pseudo-document.

![](https://i.imgur.com/tk2Neod.png)

Through SVD, we can get a term matrix and a document matrix; selecting the top singular values in S -> reduced concept space.

![](https://i.imgur.com/08NJKWk.png)
![](https://i.imgur.com/nzxQC0P.png)

---

![](https://i.imgur.com/iT7DeTE.png)

#### Probabilistic Model

##### BM25

An improvement in term weighting: the classic probabilistic model covers only "idf", not tf or document length normalization.

* BM15: term frequency
* BM11: document length normalization
* BM25: BM11 + BM15 (see the sketch below)

![](https://i.imgur.com/xrQ5Hf5.png)
![](https://i.imgur.com/DH05Vdj.png)
![](https://i.imgur.com/idypq4U.png)
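A minimal sketch of the standard Okapi BM25 scoring function, where k1 controls term-frequency saturation (the BM15 side) and b controls document length normalization (the BM11 side); the corpus and parameter values are made-up assumptions:

```python
import math
from collections import Counter

# Toy corpus (the documents are made up).
docs = [
    "information retrieval ranks documents".split(),
    "probabilistic models estimate relevance".split(),
    "bm25 adds tf and length normalization to idf".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(t for d in docs for t in set(d))

def bm25(query, doc, k1=1.2, b=0.75):
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if t not in tf:
            continue
        # Smoothed RSJ-style idf component.
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        # b scales the length penalty; k1 caps the contribution of repeated terms.
        norm = 1 - b + b * len(doc) / avgdl
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * norm)
    return score

for d in docs:
    print(bm25("retrieval documents".split(), d))
```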
##### Language Model

![](https://i.imgur.com/Yu1mLMX.png)

* The language modeling approach to IR directly models that idea: a document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often.
* A distribution used to estimate the probability of a string and of what the next token will be.
* Each document has its own model -> the probability of that model generating the query is used as the degree of relevance between the query and the document.
* ![](https://i.imgur.com/SLpxc1b.png) unigram -> independent
* ![](https://i.imgur.com/tYkJdPa.png) Take logs, and split into terms that appear in d and terms that don't.
* ![](https://i.imgur.com/PNFAlNC.png) For the second half (terms not in d), there are many ways to estimate the probability of k_i in the collection.
* ![](https://i.imgur.com/EX1bJyR.png)
* ![](https://i.imgur.com/0tVEvxI.png) The first half (terms in d / not in d).
* ![](https://i.imgur.com/IQyqr3A.png) We also need smoothing to avoid zero probabilities.
* ![](https://i.imgur.com/liUv0WK.png) There are many smoothing methods (see the sketch below).
* ![](https://i.imgur.com/4AJXdHn.png) After a pile of blood-vomiting derivations, we finally get the ranking function:
* ![](https://i.imgur.com/LyiXqrn.png)
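As one concrete choice among the many smoothing methods mentioned above, here is a minimal query-likelihood sketch with Jelinek-Mercer smoothing; the corpus and lambda are made up, and it assumes every query term occurs somewhere in the collection:

```python
import math
from collections import Counter

# Toy corpus (the documents are made up).
docs = [
    "information retrieval ranks documents".split(),
    "language models generate queries from documents".split(),
]
collection = [t for d in docs for t in d]
coll_tf = Counter(collection)
coll_len = len(collection)

def query_log_likelihood(query, doc, lam=0.5):
    """log P(q|d) = sum over query terms of
    log( lam * P(k_i|d) + (1 - lam) * P(k_i|collection) )."""
    tf = Counter(doc)
    score = 0.0
    for t in query:
        p_doc = tf[t] / len(doc)        # ML estimate from the document
        p_coll = coll_tf[t] / coll_len  # collection model keeps this nonzero
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

for d in docs:
    print(query_log_likelihood("retrieval documents".split(), d))
```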

##### Another Language Model

Based on a Bernoulli process.

* ![](https://i.imgur.com/tSAgfZK.png) Again there are many ways to estimate the probability of k_i in the collection; we pick the one below.
* ![](https://i.imgur.com/YVscmsu.png) Based on the above, the probability that term k_i will be produced by a random draw taken from d_j.
* ![](https://i.imgur.com/0rq7xDE.png) Additionally taking into account the frequency of k_i in the document.
* ![](https://i.imgur.com/Cs2Z9Ji.png)
* ![](https://i.imgur.com/Xzg6PnX.png) Finally, the ranking function:
* ![](https://i.imgur.com/FC6MCEN.png)

##### Query Model vs Document Model

* Above we used the document model to generate the query (1, query likelihood). We can also do the reverse and use a query model to generate the document (2, document likelihood), or compare the two models (3).
* ![](https://i.imgur.com/kj4NTKv.png)
* ![](https://i.imgur.com/rnl5kBD.png)
* ![](https://i.imgur.com/UYyTEzm.png)
* ![](https://i.imgur.com/74oY9dD.png)
* The KL divergence between mQ and mD generating term t.
* How can one do relevance feedback using the language modeling approach?
  * ![](https://i.imgur.com/jNERmxe.png)
  * Expansion-based: first compute the query likelihood; take the top-ranked documents from the result as feedback docs, then use them to expand the query. -> Weights may change and query terms may change, but the document model stays fixed.
  * Model-based: first compute the KL divergence between mQ and mD; take the top-ranked documents as feedback docs, then use them to update the query model.
  * Expansion-based updates the query; model-based updates the query model.
* How to update the query model?
  * ![](https://i.imgur.com/4T7nSZO.png)
* Translation model: d does not generate q directly; it first goes through a translation process, so relevance can be handled through non-query terms.

---

I'm tired; the rest of this is loosely organized, orz.

#### Neural IR

* Translation model: ![](https://i.imgur.com/gBHZOLI.png)
* BM25, the LM, and the translation model all ignore term adjacency and term order -> 1. Dependency model 2. Relevance model
  ![](https://i.imgur.com/VrlMYqu.png)

#### Neural Approach to IR

* Neural approaches can be used at different points:
  ![](https://i.imgur.com/irHG0ZF.png)
* ![](https://i.imgur.com/aeNV0r5.png)
* ![](https://i.imgur.com/FtfwNBT.png)
* ![](https://i.imgur.com/dYmJHCY.png)

##### Term Representations

* ![](https://i.imgur.com/s4kGvUL.png)
* ![](https://i.imgur.com/1IezUrR.png)
* ![](https://i.imgur.com/FzN1GxU.png)
* Observed
  * Typical (Seattle and Sydney are cities) vs. topical (Seattle and the Seahawks are related to football)
  * ![](https://i.imgur.com/K7BuGQs.png)
  * ![](https://i.imgur.com/8DTG2qe.png)
* Latent
  * LSA (SVD)
  * PLSA
  * LDA
  * word2vec: encodes the context around a word rather than the word itself; a word's meaning is defined by the neighboring words it co-occurs with. word2vec has IN and OUT embeddings (see the sketch below).
    * ![](https://i.imgur.com/MLPZjYt.png)
  * Paragraph2vec
    * ![](https://i.imgur.com/cMG1f8M.png)
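A small usage sketch of training skip-gram word2vec on a toy corpus with the gensim library; this assumes gensim 4.x is installed, and the sentences and hyperparameters are made up:

```python
from gensim.models import Word2Vec

# Toy corpus: tokenized sentences (the contents are made up).
sentences = [
    "seattle is a city".split(),
    "sydney is a city".split(),
    "the seahawks play in seattle".split(),
]

# sg=1 selects skip-gram: each word is trained to predict its context,
# so a word's meaning is defined by the words that co-occur around it.
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["seattle"])               # the IN (input) embedding of "seattle"
print(model.wv.most_similar("seattle"))  # nearest neighbors by cosine similarity
```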

##### Term Embeddings for IR

IR needs to do query-document matching, and term embeddings are well suited to inexact matching. IR also needs to do feedback updates, and term embeddings can be used for query expansion.

* ![](https://i.imgur.com/l30iKfq.png)
* ![](https://i.imgur.com/ya9j4KJ.png) This one suggests using IN embeddings for the query and OUT embeddings for the document.
* ![](https://i.imgur.com/l50x2u6.png) There are also all kinds of IN-IN, OUT-OUT, and IN-OUT mixing schemes.

Word embeddings are a technique where individual words are represented as real-valued vectors in a lower-dimensional space, capturing inter-word semantics.

### Relevance Feedback AND Query Expansion

1. What in the feedback can be used to expand the query, and how?
2. Explicit: directly from the user, via relevance judgments or clicks.
3. Implicit: from the system.

* ![](https://i.imgur.com/bFVetph.png)

#### Explicit Relevance Feedback

* The user directly reports which docs are relevant and which are not.
* The hope is that the new query retrieves more relevant docs.
* The Rocchio method: move toward relevant docs, away from non-relevant ones (see the sketch after this list).
  * ![](https://i.imgur.com/nB3jc6k.png)
  * ![](https://i.imgur.com/yj2UqMt.png)
* A probabilistic method
* Evaluation: residual collection
  * ![](https://i.imgur.com/KnKB94d.png)
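A minimal sketch of the Rocchio update; alpha, beta, gamma and the tf-idf-style vectors below are made-up values, not the course's settings:

```python
import numpy as np

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward the centroid of relevant documents
    and away from the centroid of non-relevant ones."""
    q_new = alpha * q
    if relevant:
        q_new += beta * np.mean(relevant, axis=0)
    if nonrelevant:
        q_new -= gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q_new, 0)  # negative weights are usually clipped to 0

# Made-up tf-idf vectors for a 3-term vocabulary.
q = np.array([1.0, 0.0, 0.5])
rel = [np.array([0.9, 0.1, 0.4]), np.array([0.8, 0.0, 0.6])]
nonrel = [np.array([0.0, 1.0, 0.2])]
print(rocchio(q, rel, nonrel))
```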

#### Explicit Feedback Through Clicks

* eye tracking
* relevance judgments
* clicks as user preferences
* ![](https://i.imgur.com/75cWK8n.png)
* ![](https://i.imgur.com/Kn9Mds8.png)
* ![](https://i.imgur.com/fGQuuLc.png)
* ![](https://i.imgur.com/mSLQzVz.png)

#### Implicit Feedback Through Local Analysis

1. Local clustering
   * ![](https://i.imgur.com/c3Z17th.png)
   * ![](https://i.imgur.com/J2yt3fh.png)
2. Local
