# IR

> [toc]

### Term Dependencies

1. Capturing term dependencies is a way beyond traditional IR models. Please list two methods to ==model term dependencies==. Any two methods are welcome.
   Ans: Same as Q4: a term correlation matrix, or minterms.
2. Term independence is a strong assumption in classic IR modeling. Please explain how the generalized vector model relaxes the assumption.
   Ans: Same as Q4: represent each term with minterms, so that terms become correlated with one another.
3. ![](https://i.imgur.com/q1Y14o1.png)
   Ans: The basic vector model assumes all index terms are mutually independent; the generalized vector model addresses this with minterms. The formula and explanation are the same as Q4.
4. ![](https://i.imgur.com/8QJk41Z.png)
   Ans:
   ![](https://i.imgur.com/M2Ps1gR.png)
   ![](https://i.imgur.com/XHHX0m5.png)

### document representation, query representation, and a ranking function

1. ![](https://i.imgur.com/25gubXf.png)
   Ans: Not entirely sure about the query and document representations in the probabilistic model, since the rank can be computed from n_i and N alone.
   ![](https://i.imgur.com/BCkG9LG.jpg)
2. ![](https://i.imgur.com/gbEVnV4.png)
   Ans: Same as Q1: BM25 is a probabilistic model; not sure about the query and document representations.
   ![](https://i.imgur.com/p5RgEv7.jpg)
3. An IR model is specified by document representation, query representation, and a ranking function. Please describe these three parts for the BM25 model.
   Ans: Same as Q2.
4. Representations of queries and documents and computation of their relationship degree are key components in an IR framework. Please compare the counting-based IR model and the prediction-based IR model from the aspects of representations and similarity computation. You can select any one model to describe your answers.

### latent semantic indexing model

1. Please explain how the latent semantic indexing model maps terms and documents into the same vector space.
   Ans: By using Singular Value Decomposition (SVD): M = K·S·Dᵗ, where
   * K is a term matrix,
   * S is a diagonal matrix of singular values,
   * D is a document matrix.
2. In latent semantic indexing, we try to map both terms and documents into a lower-dimensional space and perform the similarity computation in that space. Please show how to compute term vectors, document vectors, and vectors of input queries in the reduced space (see the sketch below).
3. ![](https://i.imgur.com/BjoPAb7.png)
4. ![](https://i.imgur.com/QzHL7ef.png)
   Ans:
   ![](https://i.imgur.com/ko3Gyes.png)
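As a rough sketch for Q2 above: decompose the term-document matrix with SVD, keep the top s singular values, and fold a query in as a pseudo-document. A minimal NumPy example, where the toy matrix, the query counts, and s = 2 are all made-up assumptions:

```python
import numpy as np

# Toy term-document matrix M: rows are index terms, columns are documents.
M = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
], dtype=float)

# SVD: M = K @ diag(S) @ D, with K the term-side and D the document-side factors.
K, S, D = np.linalg.svd(M, full_matrices=False)

s = 2                                    # keep the top-s latent concepts
K_s, S_s, D_s = K[:, :s], np.diag(S[:s]), D[:s, :]

# Document j's vector in the reduced concept space is column j of D_s;
# a query is folded in as a pseudo-document: q_hat = S_s^{-1} K_s^T q.
q = np.array([1, 0, 1, 0], dtype=float)  # raw term counts of the query
q_hat = np.linalg.inv(S_s) @ K_s.T @ q

# Rank documents by cosine similarity in the concept space.
sims = D_s.T @ q_hat / (np.linalg.norm(D_s, axis=0) * np.linalg.norm(q_hat) + 1e-12)
print(sims)
```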
### counting-based IR model and prediction-based IR model

1. Representations of queries and documents and computation of their relationship degree are key components in an IR framework. Please compare the counting-based IR model and the prediction-based IR model from the aspects of representations and similarity computation. You can select any one model to describe your answers.
2. The counting-based language model is based on the Markov assumption with a restriction of history length k. Besides, the history is read from left to right; that is, the context is also restricted. Please propose a neural language model to relax these two restrictions.
3. ![](https://i.imgur.com/z0ew0fP.png)

### two sides?

> [Looks like these are the same, since "filtering = routing" according to Q3]

1. Information retrieval and information filtering are two sides of the same coin. Please describe how information retrieval concepts can be applied to information filtering.
2. Searching and routing are two sides of the same coin. In the class, you have learned lessons to evaluate the performance of a searching system. Please describe the concept of routing first, and discuss how to evaluate the performance of a routing system.
3. ![](https://i.imgur.com/2OXylTR.png)

### tf-idf

1. ![](https://i.imgur.com/eBB4obI.png)
   Ans:
   ![](https://i.imgur.com/B45ugNi.jpg)

### query likelihood, document likelihood and model comparison

1. ![](https://i.imgur.com/JB2aAz7.png)
   ![](https://i.imgur.com/FwyjMd5.png)
   Ans:
   ![](https://i.imgur.com/63M8Gx6.jpg)

### Probabilistic Model

#### BM25

1. ![](https://i.imgur.com/q5GBB3t.png)
   Ans:
   ![](https://i.imgur.com/lsAu6Un.png)

#### language model

1. ![](https://i.imgur.com/dLopS9G.png)
   Ans:
   ![](https://i.imgur.com/u7kF9mn.png)

---

* Filtering is often meant to imply the removal of data from an incoming stream, rather than finding data in that stream. In the first case, the users of the system see what is left after the data is removed; in the latter case, they see the data that is extracted.

---

# Notes

* **Information retrieval** is about returning the information that is relevant to a specific query or field of interest of the user, while **information extraction** is more about extracting general knowledge (or relations) from a set of documents or information.
* Index Term
  ![](https://i.imgur.com/07YInWo.png)
* IR model
  * ![](https://i.imgur.com/oNmsrPa.png)
  * ![](https://i.imgur.com/nSgNjgN.png)
  * ![](https://i.imgur.com/pAdTXVH.png)
  * ![](https://i.imgur.com/mmkmoOq.png)

#### Classic IR Model

##### Boolean Model

![](https://i.imgur.com/n0OSELL.png)
![](https://i.imgur.com/V0gyHX8.png)

##### Term Weight

* Term weighting was introduced to measure the importance of an index term in a document or a query, respectively.
* For classic IR, the index term weights are assumed to be mutually independent (e.g., the classic vector model). To take term-term correlations into account, we can compute a correlation matrix to get the correlation among terms.
  ![](https://i.imgur.com/5vlsAhl.png)
* How to measure the importance of an index term? -> TF-IDF
  * tf: occurrences of term k in the document
  * idf: number of documents / number of documents that include term k -> a term with higher idf is more important.
  * tf * idf is a classic term weighting strategy (see the sketch below)
  * ![](https://i.imgur.com/Y6uCDVs.png)
  * ![](https://i.imgur.com/q8QZ0te.png)
  * ![](https://i.imgur.com/0AQGsMl.png)
  * ![](https://i.imgur.com/67mLV9T.png)
* Document length normalization
  * ![](https://i.imgur.com/ghTdwQP.png)
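A minimal sketch of the tf-idf weighting described above, assuming length-normalized tf and a logarithmic idf (the toy corpus is made up):

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (the contents are made up).
docs = [
    "information retrieval ranks documents".split(),
    "information extraction finds relations".split(),
    "retrieval models rank documents by relevance".split(),
]
N = len(docs)

# idf_k = log(N / n_k), where n_k = number of documents containing term k.
df = Counter(term for d in docs for term in set(d))
idf = {term: math.log(N / n) for term, n in df.items()}

def tfidf(doc):
    """Weight w_k = tf_k * idf_k, with tf normalized by document length."""
    tf = Counter(doc)
    return {term: (count / len(doc)) * idf[term] for term, count in tf.items()}

for d in docs:
    print(tfidf(d))
```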
##### Vector Model

![](https://i.imgur.com/z300s8S.png)
![](https://i.imgur.com/GgQY40C.png)
![](https://i.imgur.com/0Swnfur.png)
![](https://i.imgur.com/VUV2q7W.png)
![](https://i.imgur.com/bj9bRrz.png)

##### Classic Probabilistic Model

![](https://i.imgur.com/hNR5vft.png)
![](https://i.imgur.com/ytqzODL.png)
![](https://i.imgur.com/hDLbp5r.png)
![](https://i.imgur.com/gZZKAse.png)
After a pile of derivations I can't follow, we get
![](https://i.imgur.com/j11pI4r.png)
![](https://i.imgur.com/NtBhjjH.png)
![](https://i.imgur.com/sQ8LSLs.png)
![](https://i.imgur.com/BS0ZTBx.png)
![](https://i.imgur.com/2J6k3Su.png)
To do: initialization?

---

### Advanced Model: consider term dependencies

#### Set Theoretic IR Models

##### Set-based Model

Replace index terms with termsets: when a document includes all index terms in a termset, the document includes this termset.

* A termset is used to capture term dependencies.
* ![](https://i.imgur.com/Tgs9yeF.png)
* To reduce the number of termsets, only frequent termsets are used.
  * ![](https://i.imgur.com/H5zU00Y.png) frequent: the number of documents containing the termset >= threshold
  * ![](https://i.imgur.com/1amasdD.png)
  * $$S_{abd} = \{d_1\},\ N_i = 1,\ \text{not frequent}$$
  * ![](https://i.imgur.com/TsZIeec.png)
  * ![](https://i.imgur.com/8PGzVLg.png)
* Closed termsets: there are still too many frequent termsets, so we reduce them further -> keep the largest.
  ![](https://i.imgur.com/QL92fAV.png)

##### Extended Boolean Model

Combines properties of the vector model with Boolean algebra to fix the problems of the traditional Boolean model: no ranking, and no partial matching or term weighting.

![](https://i.imgur.com/dx1Vf20.png)
![](https://i.imgur.com/K4dRNHg.png)
![](https://i.imgur.com/HPM8qgF.png)
![](https://i.imgur.com/t8AJw5E.png)

Define the vector of d_j from the document's weights for index terms x and y; the x, y weights are computed as: ![](https://i.imgur.com/qiE0hSw.png)
![](https://i.imgur.com/Gg4NSxK.png)

How to compute document-query similarity? For OR, the farther from (0, 0) the better; for AND, the closer to (1, 1) the better.
![](https://i.imgur.com/zUjkN7U.png)

* For higher dimensions -> p-norm (see the sketch below)
  ![](https://i.imgur.com/MBLo7bG.png)
  When p = 1, it reduces to taking the average of the weights -> vector-like -> recall
  ![](https://i.imgur.com/guUj3nI.png)
  p = infinity: fuzzy-like
* As a result, the order of operators matters.
  ![](https://i.imgur.com/lZZFFZq.png)
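A small sketch of the p-norm similarities of the extended Boolean model for a two-term query, with made-up document weights x and y; it shows how p = 1 averages the weights (vector-like) while large p approaches min/max behavior (fuzzy-like):

```python
# p-norm similarity for a two-term query with document weights x, y in [0, 1].
def sim_or(x, y, p):
    # OR: distance from (0, 0); the farther, the better.
    return ((x**p + y**p) / 2) ** (1 / p)

def sim_and(x, y, p):
    # AND: distance from (1, 1); the closer, the better.
    return 1 - (((1 - x)**p + (1 - y)**p) / 2) ** (1 / p)

x, y = 0.8, 0.3
for p in (1, 2, 10):
    print(p, sim_or(x, y, p), sim_and(x, y, p))
# p = 1 gives the plain average for both operators;
# as p grows, OR approaches max(x, y) and AND approaches min(x, y).
```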
##### Fuzzy Set Model

A document without a query term could also be highly relevant to a query (e.g., it has a term related to a query term)! -> To solve this issue -> fuzzy sets. The degree of membership is between 0 and 1: not a sharp boundary, but a smooth one.

![](https://i.imgur.com/qL8noKb.png)

* How to get the fuzzy set? -> via a term-term correlation matrix -> find the intersection
* ![](https://i.imgur.com/LP6H0u5.png)
* d_j has a degree of membership with respect to k_i. c_il is the degree of correlation between index terms i and l, and the degree of d_j with respect to k_i is one minus the product, over the terms k_l in d_j, of the "uncorrelatedness" of k_i and k_l. Imagine some k_l is strongly related to k_i with c_il = 1: then the degree of d_j with respect to k_i becomes 1. In this way, the fuzzy notion lets a document d_j that doesn't even contain k_i still be related to the query.
* ![](https://i.imgur.com/mnjko2o.png)

---

#### Algebraic IR Models

##### Generalized Vector Model

Use minterms to capture term dependencies. Overall it is similar to the vector space model, but the restriction that term vectors are mutually independent is removed: each term vector is expanded over minterms, and it is the minterm vectors that are independent.

Represent a document with minterms, compute c_ir from the documents that share the same minterm, and finally represent each index term vector with c_ir and the minterms. In the end, k_i and k_j are not orthogonal: when we calculate similarity, k_i · k_j != 0 anymore -> term dependency!

![](https://i.imgur.com/a2a1nek.png)
![](https://i.imgur.com/LPk8fp3.png)

If k_i and k_j co-occur in many documents, their degree of correlation should be high.

![](https://i.imgur.com/jQLJ9Lm.png)
![](https://i.imgur.com/n5WoHNP.png)
![](https://i.imgur.com/LWwCsqs.png)
![](https://i.imgur.com/hJZ8cJE.png)

##### Latent Semantic Indexing

The generalized vector model is high-dimensional and sparse, whereas latent semantic indexing is low-dimensional and dense. We shouldn't operate on index terms directly; we need to map into some other space and do the computation there: we need concept retrieval, not index term retrieval. A query is represented by a pseudo-document.

![](https://i.imgur.com/tk2Neod.png)

Through SVD, we can get a term matrix and a document matrix; selecting the top singular values in S -> reduced concept space.

![](https://i.imgur.com/08NJKWk.png)
![](https://i.imgur.com/nzxQC0P.png)

---

![](https://i.imgur.com/iT7DeTE.png)

#### Probabilistic Model

##### BM25

An improvement in term weighting: the classic probabilistic model covers only "idf", not tf or document length normalization.

* BM15: term frequency
* BM11: document length normalization
* BM25: BM11 + BM15 (see the sketch below)

![](https://i.imgur.com/xrQ5Hf5.png)
![](https://i.imgur.com/DH05Vdj.png)
![](https://i.imgur.com/idypq4U.png)
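A minimal sketch of the standard Okapi BM25 scoring function, where k1 controls term-frequency saturation (the BM15 side) and b controls document length normalization (the BM11 side); the corpus and parameter values are made-up assumptions:

```python
import math
from collections import Counter

# Toy corpus (the documents are made up).
docs = [
    "information retrieval ranks documents".split(),
    "probabilistic models estimate relevance".split(),
    "bm25 adds tf and length normalization to idf".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(t for d in docs for t in set(d))

def bm25(query, doc, k1=1.2, b=0.75):
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if t not in tf:
            continue
        # Smoothed RSJ-style idf component.
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        # b scales the length penalty; k1 caps the contribution of repeated terms.
        norm = 1 - b + b * len(doc) / avgdl
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * norm)
    return score

for d in docs:
    print(bm25("retrieval documents".split(), d))
```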
##### Language Model

![](https://i.imgur.com/Yu1mLMX.png)

* The language modeling approach to IR directly models that idea: a document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often.
* A distribution used to estimate the probability of a string and of what the next token will be.
* Each document has its own model -> the probability of that model generating the query is used as the degree of relevance between the query and the document.
* ![](https://i.imgur.com/SLpxc1b.png) unigram -> independent
* ![](https://i.imgur.com/tYkJdPa.png) Take logs, and split into terms that appear in d and terms that don't.
* ![](https://i.imgur.com/PNFAlNC.png) For the second half (terms not in d), there are many ways to estimate the probability of k_i in the collection.
* ![](https://i.imgur.com/EX1bJyR.png)
* ![](https://i.imgur.com/0tVEvxI.png) The first half (terms in d / not in d).
* ![](https://i.imgur.com/IQyqr3A.png) We also need smoothing to avoid zero probabilities.
* ![](https://i.imgur.com/liUv0WK.png) There are many smoothing methods (see the sketch below).
* ![](https://i.imgur.com/4AJXdHn.png) After a pile of blood-vomiting derivations, we finally get the ranking function:
* ![](https://i.imgur.com/LyiXqrn.png)
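As one concrete choice among the many smoothing methods mentioned above, here is a minimal query-likelihood sketch with Jelinek-Mercer smoothing; the corpus and lambda are made up, and it assumes every query term occurs somewhere in the collection:

```python
import math
from collections import Counter

# Toy corpus (the documents are made up).
docs = [
    "information retrieval ranks documents".split(),
    "language models generate queries from documents".split(),
]
collection = [t for d in docs for t in d]
coll_tf = Counter(collection)
coll_len = len(collection)

def query_log_likelihood(query, doc, lam=0.5):
    """log P(q|d) = sum over query terms of
    log( lam * P(k_i|d) + (1 - lam) * P(k_i|collection) )."""
    tf = Counter(doc)
    score = 0.0
    for t in query:
        p_doc = tf[t] / len(doc)        # ML estimate from the document
        p_coll = coll_tf[t] / coll_len  # collection model keeps this nonzero
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

for d in docs:
    print(query_log_likelihood("retrieval documents".split(), d))
```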

##### Another Language Model

Based on a Bernoulli process.

* ![](https://i.imgur.com/tSAgfZK.png) Again there are many ways to estimate the probability of k_i in the collection; we pick the one below.
* ![](https://i.imgur.com/YVscmsu.png) Based on the above, the probability that term k_i will be produced by a random draw taken from d_j.
* ![](https://i.imgur.com/0rq7xDE.png) Additionally taking into account the frequency of k_i in the document.
* ![](https://i.imgur.com/Cs2Z9Ji.png)
* ![](https://i.imgur.com/Xzg6PnX.png) Finally, the ranking function:
* ![](https://i.imgur.com/FC6MCEN.png)

##### Query Model vs Document Model

* Above we used the document model to generate the query (1, query likelihood). We can also do the reverse and use a query model to generate the document (2, document likelihood), or compare the two models (3).
* ![](https://i.imgur.com/kj4NTKv.png)
* ![](https://i.imgur.com/rnl5kBD.png)
* ![](https://i.imgur.com/UYyTEzm.png)
* ![](https://i.imgur.com/74oY9dD.png)
* The KL divergence between mQ and mD generating term t.
* How can one do relevance feedback using the language modeling approach?
  * ![](https://i.imgur.com/jNERmxe.png)
  * Expansion-based: first compute the query likelihood; take the top-ranked documents from the result as feedback docs, then use them to expand the query. -> Weights may change and query terms may change, but the document model stays fixed.
  * Model-based: first compute the KL divergence between mQ and mD; take the top-ranked documents as feedback docs, then use them to update the query model.
  * Expansion-based updates the query; model-based updates the query model.
* How to update the query model?
  * ![](https://i.imgur.com/4T7nSZO.png)
* Translation model: d does not generate q directly; it first goes through a translation process, so relevance can be handled through non-query terms.

---

I'm tired; the rest of this is loosely organized, orz.

#### Neural IR

* Translation model: ![](https://i.imgur.com/gBHZOLI.png)
* BM25, the LM, and the translation model all ignore term adjacency and term order -> 1. Dependency model 2. Relevance model
  ![](https://i.imgur.com/VrlMYqu.png)

#### Neural Approach to IR

* Neural approaches can be used at different points:
  ![](https://i.imgur.com/irHG0ZF.png)
* ![](https://i.imgur.com/aeNV0r5.png)
* ![](https://i.imgur.com/FtfwNBT.png)
* ![](https://i.imgur.com/dYmJHCY.png)

##### Term Representations

* ![](https://i.imgur.com/s4kGvUL.png)
* ![](https://i.imgur.com/1IezUrR.png)
* ![](https://i.imgur.com/FzN1GxU.png)
* Observed
  * Typical (Seattle and Sydney are cities) vs. topical (Seattle and the Seahawks are related to football)
  * ![](https://i.imgur.com/K7BuGQs.png)
  * ![](https://i.imgur.com/8DTG2qe.png)
* Latent
  * LSA (SVD)
  * PLSA
  * LDA
  * word2vec: encodes the context around a word rather than the word itself; a word's meaning is defined by the neighboring words it co-occurs with. word2vec has IN and OUT embeddings (see the sketch below).
    * ![](https://i.imgur.com/MLPZjYt.png)
  * Paragraph2vec
    * ![](https://i.imgur.com/cMG1f8M.png)
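A small usage sketch of training skip-gram word2vec on a toy corpus with the gensim library; this assumes gensim 4.x is installed, and the sentences and hyperparameters are made up:

```python
from gensim.models import Word2Vec

# Toy corpus: tokenized sentences (the contents are made up).
sentences = [
    "seattle is a city".split(),
    "sydney is a city".split(),
    "the seahawks play in seattle".split(),
]

# sg=1 selects skip-gram: each word is trained to predict its context,
# so a word's meaning is defined by the words that co-occur around it.
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["seattle"])               # the IN (input) embedding of "seattle"
print(model.wv.most_similar("seattle"))  # nearest neighbors by cosine similarity
```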

##### Term Embeddings for IR

IR needs to do query-document matching, and term embeddings are well suited to inexact matching. IR also needs to do feedback updates, and term embeddings can be used for query expansion.

* ![](https://i.imgur.com/l30iKfq.png)
* ![](https://i.imgur.com/ya9j4KJ.png) This one suggests using IN embeddings for the query and OUT embeddings for the document.
* ![](https://i.imgur.com/l50x2u6.png) There are also all kinds of IN-IN, OUT-OUT, and IN-OUT mixing schemes.

Word embeddings are a technique where individual words are represented as real-valued vectors in a lower-dimensional space, capturing inter-word semantics.

### Relevance Feedback AND Query Expansion

1. What in the feedback can be used to expand the query, and how?
2. Explicit: directly from the user, via relevance judgments or clicks.
3. Implicit: from the system.

* ![](https://i.imgur.com/bFVetph.png)

#### Explicit Relevance Feedback

* The user directly reports which docs are relevant and which are not.
* The hope is that the new query retrieves more relevant docs.
* The Rocchio method: move toward relevant docs, away from non-relevant ones (see the sketch after this list).
  * ![](https://i.imgur.com/nB3jc6k.png)
  * ![](https://i.imgur.com/yj2UqMt.png)
* A probabilistic method
* Evaluation: residual collection
  * ![](https://i.imgur.com/KnKB94d.png)
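A minimal sketch of the Rocchio update; alpha, beta, gamma and the tf-idf-style vectors below are made-up values, not the course's settings:

```python
import numpy as np

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward the centroid of relevant documents
    and away from the centroid of non-relevant ones."""
    q_new = alpha * q
    if relevant:
        q_new += beta * np.mean(relevant, axis=0)
    if nonrelevant:
        q_new -= gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q_new, 0)  # negative weights are usually clipped to 0

# Made-up tf-idf vectors for a 3-term vocabulary.
q = np.array([1.0, 0.0, 0.5])
rel = [np.array([0.9, 0.1, 0.4]), np.array([0.8, 0.0, 0.6])]
nonrel = [np.array([0.0, 1.0, 0.2])]
print(rocchio(q, rel, nonrel))
```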

#### Explicit Feedback Through Clicks

* eye tracking
* relevance judgments
* clicks as user preferences
* ![](https://i.imgur.com/75cWK8n.png)
* ![](https://i.imgur.com/Kn9Mds8.png)
* ![](https://i.imgur.com/fGQuuLc.png)
* ![](https://i.imgur.com/mSLQzVz.png)

#### Implicit Feedback Through Local Analysis

1. Local clustering
   * ![](https://i.imgur.com/c3Z17th.png)
   * ![](https://i.imgur.com/J2yt3fh.png)
2. Local
