# Classic IR Model
###### tags: `Information Retrieval and Extraction`
## Basic Concepts
Let
* $t$ be the number of index terms in the document collection
* $k_i$ be a generic index term

The vocabulary $V = \{k_1, . . . , k_t\}$ is the set of all distinct index terms in the collection.
<img src="https://i.imgur.com/s73I9gI.png" width="500"/>
* Bag of words : **word ordering is ignored**
### Term-Document Matrix
A term-document relation between $k_i$ and $d_j$ can be quantified by the frequency of the term in the document.
<center>
<img width="250"
src="https://i.imgur.com/Bmqk02Q.png">
</center>
* where each $f_{i,j}$ element stands for the frequency of term $k_i$ in document $d_j$
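The matrix above can be built directly from a collection. A minimal sketch in Python, using a toy collection of hypothetical documents (not the one from the slides):

```python
from collections import Counter

# Toy collection of hypothetical documents.
docs = ["to do is to be", "to be is to do", "do be do"]

# Vocabulary V: all distinct index terms in the collection.
vocab = sorted({term for d in docs for term in d.split()})

# Term-document matrix: f[i][j] = frequency of term vocab[i] in docs[j].
counts = [Counter(d.split()) for d in docs]
f = [[c[term] for c in counts] for term in vocab]

for term, row in zip(vocab, f):
    print(term, row)
```

Note that splitting on whitespace already discards word ordering, which is exactly the bag-of-words assumption above.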
### Logical view of a document

## Boolean model
Simple model based on set theory and boolean algebra.
* example of a query : $q = k_a ∧ (k_b ∨ ¬k_c)$

Consider the query $q = k_a ∧ (k_b ∨ ¬k_c)$ over the vocabulary $V =\{k_a,k_b,k_c\}$. Rewriting $q$ as a disjunction of conjunctive components yields its **disjunctive normal form**
$q_{DNF} = (1,1,1)∨(1,1,0)∨(1,0,0)$
<center>
<img src="https://i.imgur.com/VvndGCA.png" width="350">
</center>
### Similarity
The similarity of a document $d_j$ to the query $q$ is defined as :
\begin{equation}
sim(d_j,q) =
\begin{cases}
1, & \text{if}\ ∃c(q) \mid c(q) = c(d_j) \\
0, & \text{otherwise}
\end{cases}
\end{equation}
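The Boolean matching rule above can be sketched in a few lines of Python. The documents and term names below are hypothetical; since the model only uses presence or absence of a term, each document is reduced to a set of index terms:

```python
# Hypothetical documents as sets of index terms (incidence only).
docs = {
    "d1": {"ka", "kb"},
    "d2": {"ka", "kc"},
    "d3": {"kb", "kc"},
}

def matches(terms):
    """Evaluate q = ka AND (kb OR NOT kc) on a document's term set."""
    return "ka" in terms and ("kb" in terms or "kc" not in terms)

# sim(dj, q) is 1 when the document satisfies one of the DNF components
# (1,1,1), (1,1,0), (1,0,0); 0 otherwise.
print({name: int(matches(t)) for name, t in docs.items()})
```

Note how the output is strictly binary: there is no notion of one matching document being a better answer than another, which is the drawback discussed below.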
* The Boolean model predicts that each document is either relevant or non-relevant

### Drawbacks of the Boolean Model
* Retrieval based on binary decision criteria with no notion of partial matching
* No ranking of the documents is provided (absence of a grading scale)
* The Boolean queries formulated by the users are most often too simplistic
## Term Weighting
The weights $w_{i,j}$ can be computed using the **frequencies of occurrence** of the terms within documents.
There are properties of an index term which are useful for evaluating the importance of the term in a document.
Define $\overrightarrow{d_j} = (w_{1,j}, w_{2,j},...,w_{t,j} )$ as a weighted vector that contains the weight $w_{i,j}$ of each term $k_i ∈ V$ in the document $d_j$
<center>
<img src="https://i.imgur.com/Ye7Lw4g.png" width="300">
</center>
---
Let $f_{i,j}$ be the frequency of occurrence of index term $k_i$ in the document $d_j$.
The **total frequency** of occurrence of $k_i$ is
\begin{equation}F_i = \sum_{j=1}^N f_{i,j}\end{equation}
* where $N$ is the number of documents in the collection

The **document frequency** $n_i$ of a term $k_i$ is the number of documents in which it occurs.
For instance, the frequencies collected for the term $do$ are shown below
<center>
<img src="https://i.imgur.com/IXaBo2c.png" width="550">
</center>
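Both statistics follow directly from the definitions above. A minimal sketch over a toy collection (the documents are hypothetical, not the ones in the figure):

```python
from collections import Counter

# Toy collection of hypothetical documents.
docs = ["to do is to be", "to be is to do", "do be do"]
counts = [Counter(d.split()) for d in docs]

# Total frequency F_i: sum of f_ij over all N documents.
F = {t: sum(c[t] for c in counts) for c0 in counts for t in c0}

# Document frequency n_i: number of documents that contain the term.
n = {t: sum(1 for c in counts if c[t] > 0) for t in F}

print(F["do"], n["do"])
```

Note the difference: "to" and "do" have the same total frequency here, yet different document frequencies, which is exactly what $idf$ exploits later.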
### Term-term correlation matrix
* For classic information retrieval models, the index term weights are assumed to be **mutually independent**
Let $\overrightarrow{M}$ be a $t × N$ term-document matrix where $m_{i,j} = w_{i,j}$. The matrix $\overrightarrow{C} = \overrightarrow{M} \cdot \overrightarrow{M}^t$ is a term-term correlation matrix.
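The product is a single matrix multiplication. A minimal sketch with a hypothetical $2 × 3$ weight matrix:

```python
import numpy as np

# Hypothetical t x N term-document matrix M with m_ij = w_ij
# (t = 2 terms, N = 3 documents).
M = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])

# Term-term correlation matrix C = M . M^T (t x t); entry c_uv sums
# w_uj * w_vj over all documents, so terms that co-occur with large
# weights get large entries.
C = M @ M.T
print(C)
```

The result is symmetric by construction, and its diagonal holds each term's squared weight mass across the collection.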

## TF-IDF Weights
The TF-IDF algorithm weighs a keyword in any content and assigns importance to that keyword based on the number of times it appears in the document. More importantly, it checks how relevant the keyword is across the whole collection of documents, which is referred to as the corpus. [1]
### Term Frequency (TF)
A variant of $tf$ weight used in the literature is
\begin{equation}
tf_{i,j} =
\begin{cases}
1+log_2f_{i,j}, & \text{if}\ f_{i,j} > 0 \\
0 & \text{otherwise}
\end{cases}
\end{equation}

### Inverse Document Frequency (IDF)
Select important terms from the document to represent the entire document
* The more index terms are assigned to a document, the higher is the probability of retrieval for that document
* Terms are distributed in a text according to Zipf’s Law
* $IDF$ provides a foundation for modern term weighting schemes and is used for ranking in almost all IR systems
Let $k_i$ be the term with the $r^{th}$ largest document frequency, i.e., $n(r) = n_i$.
\begin{equation}idf_i = log\frac{N}{n_i}\end{equation}
where $idf_i$ is called the inverse document frequency of term $k_i$

### TF-IDF weighting scheme
\begin{equation}
w_{i,j} =
\begin{cases}
(1+logf_{i,j}) \times log\frac{N}{n_i}, & \text{if}\ f_{i,j} > 0 \\
0 & \text{otherwise}
\end{cases}
\end{equation}
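The scheme above combines the two factors multiplicatively. A minimal sketch, assuming base-2 logarithms throughout (consistent with the $tf$ variant above; the base only rescales all weights by a constant):

```python
import math

def tf_idf(f_ij, n_i, N):
    """w_ij = (1 + log2 f_ij) * log2(N / n_i); 0 when the term is absent."""
    if f_ij == 0:
        return 0.0
    return (1 + math.log2(f_ij)) * math.log2(N / n_i)

# A term occurring 4 times in a doc, appearing in 1 of 16 documents:
print(tf_idf(4, 1, 16))  # (1 + 2) * 4 = 12.0
```

Note the two ways a weight can vanish: the term does not occur in the document ($f_{i,j} = 0$), or it occurs in every document ($n_i = N$, so the $idf$ factor is zero).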

### Variants of TF

### Variants of IDF

### Recommended TF-IDF weighting schemes

### TF-IDF Properties
Consider the tf, idf, and tf-idf weights for the Wall Street Journal reference collection.
\begin{equation}tf_i = 1 + log\sum_{j=1}^Nf_{i, j}\qquad idf_i = log\frac{N}{n_i}\end{equation}
Plotting $tf$ and $idf$ in logarithmic scale yields

### Document Length Normalization
Document sizes might vary widely. This is a problem because longer documents are more likely to be retrieved by a given query. To compensate for this undesired effect, we can divide the rank of each document by its length.
Methods of document length normalization depend on the representation adopted for the documents :
* **Size in bytes** : consider that each document is represented simply as a stream of bytes
* **Number of words** : each document is represented as a single string, and the document length is the number of words in it
* **Vector norms** : documents are represented as vectors of weighted terms
<br>
Documents represented as **vectors of weighted terms**
* For each term $k_i$ of a document $d_j$ is associated the term vector component $w_{i,j} × \overrightarrow{k_i}$
<center>
<img src="https://i.imgur.com/ZR0eSKu.jpg" width="500">
</center>
<br><br>
document length : $|\overrightarrow{d_j}|=\sqrt{\sum_{i=1}^tw_{i,j}^2}$

where the vector norm is calculated from the $TF$-$IDF$ weights
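The document length is just the Euclidean (L2) norm of the weight vector. A minimal sketch with a hypothetical weight vector:

```python
import math

# Hypothetical tf-idf weight vector for one document d_j.
w = [12.0, 3.0, 4.0]

# Document length |d_j| = sqrt of the sum of squared weights (L2 norm).
length = math.sqrt(sum(x * x for x in w))
print(length)  # 13.0
```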
## The Vector Model
The vector model ranks documents by assigning **non-binary weights** to index terms in queries and in documents. These term weights are then used to compute a **degree of similarity** between a query and each document.
* The weight $w_{i,j}$ associated with a pair $(k_i,d_j)$ is positive and non-binary
* The index terms are assumed to be all **mutually independent**
Similarity between a document $d_j$ and a query $q$

\begin{equation}sim(d_j,q) = \frac{\sum_{i=1}^tw_{i,j}\times w_{i,q}}{\sqrt{\sum_{i=1}^tw_{i,j}^2}\times \sqrt{\sum_{i=1}^tw_{i,q}^2}}\end{equation}
Since $w_{i,j} \geq 0$ and $w_{i,q} \geq 0$, we have $0\leq sim(d_j,q)\leq 1$
### Weights in the Vector model
Weights in the Vector model are basically $tf-idf$ weights
\begin{equation}w_{i,q} = (1+log f_{i,q})\times log\frac{N}{n_i}\end{equation}
\begin{equation}w_{i,j} = (1+log f_{i,j})\times log\frac{N}{n_i}\end{equation}
Document ranks computed by the Vector model for the query **“to do”** (see tf-idf weight values in Slide 43)
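The cosine formula can be sketched directly from its definition. The weight vectors below are hypothetical, not the Slide 43 values:

```python
import math

def cosine_sim(d, q):
    """sim(d_j, q): cosine of the angle between weighted vectors d and q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

# Hypothetical tf-idf weight vectors over a 3-term vocabulary.
d1 = [1.0, 2.0, 0.0]
q = [1.0, 0.0, 0.0]
print(round(cosine_sim(d1, q), 3))  # 0.447
```

Because $d_1$ shares only one of its terms with the query, it still gets a nonzero score, illustrating the partial matching listed under the advantages below.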

### Advantages
* partial matching allows retrieval of docs that approximate the query conditions
* cosine ranking formula sorts documents according to a degree of similarity to the query
### Disadvantages
* It assumes independence of index terms
## Probabilistic Model
The probabilistic model captures the IR problem within a probabilistic framework. Given a user query, there is an ideal answer set for this query, and we could use this ideal answer set to retrieve the relevant documents.

### Probabilistic Ranking Principle
* Assumes that this probability depends on the query and document representations only
* The ideal answer set $R$ should maximize the probability of relevance
#### How to Rank
* $R$ be the set of relevant documents to query $q$
* $\bar{R}$ be the set of non-relevant documents to query $q$
* $P (R|\overrightarrow{d_j},q)$ be the probability that $d_j$ is relevant to the query $q$
* $P (\bar{R}|\overrightarrow{d_j},q)$ be the probability that $d_j$ is non-relevant to $q$
\begin{equation}
sim(d_j, q) = \dfrac{P (R|\overrightarrow{d_j},q)}{P (\bar{R}|\overrightarrow{d_j},q)}
\end{equation}
#### Proof

Using Bayes’ rule
\begin{equation}
sim(d_j, q) = \dfrac{P (\overrightarrow{d_{j}}|R,q)\times P(R,q)}{P (\overrightarrow{d_{j}}|\bar{R},q)\times P(\bar{R},q)}\propto \dfrac{P (\overrightarrow{d_{j}}|R,q)}{P (\overrightarrow{d_{j}}|\bar{R},q)}
\end{equation}
<img src="https://i.imgur.com/mLxubde.png" width="400">
<br>
<br>
Assuming that the weights $w_{i,j}$ are all binary and assuming independence among the index terms:
<center>
<img src="https://i.imgur.com/HYLtIN8.png" width="540">
</center>
<img style="margin-top:2%;" src="https://i.imgur.com/eYIKl3m.png" width="400">
<br>
<br>
To simplify our notation,
* $p_{iR} = P(k_i|R,q)$
* $q_{iR} = P(k_i|\bar{R},q)$
since,
* $P(k_i|R,q) + P(\bar{k_i}|R,q) =1$
* $P(k_i|\bar{R},q) + P(\bar{k_i}|\bar{R},q) = 1$
we can write,
<center>
<img src="https://i.imgur.com/Ufp0cku.png" width="540">
</center>
Taking logarithms,
<center>
<img src="https://i.imgur.com/xSJcNXy.png" width="520">
</center>
Summing up terms that cancel each other,
<center>
<img src="https://i.imgur.com/qwh2Zb1.png" width="520">
</center>
Using logarithm operations,
<center>
<img src="https://i.imgur.com/QXeqW7C.png" width="520">
</center>
Further, assuming that $\forall k_i \notin q,\ p_{iR} = q_{iR}$, and converting the log products into sums of logs,
<center>
<img src="https://i.imgur.com/DTLkrl0.png" width="570">
</center>
### Term Incidence Contingency Table
* $N$ be the number of documents in the collection
* $n_i$ be the number of documents that contain term $k_i$
* $R$ be the total number of relevant documents to query $q$
* $r_i$ be the number of relevant documents that contain term $k_i$

* $p_{iR} = \frac{r_i}{R}$
* $q_{iR} = \frac{n_i - r_i}{N-R}$

where $k_i[q,d_j]$ is a short notation for $k_i \in q \land k_i \in d_j$
For handling small values of $r_i$, we add 0.5 to each of the terms in the formula above,
**Robertson-Sparck Jones Equation :**
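A minimal sketch of the smoothed weight in code, writing out the standard Robertson-Sparck Jones form with the 0.5 terms added (all numeric inputs below are hypothetical):

```python
import math

def rsj_weight(N, n_i, R, r_i):
    """Robertson-Sparck Jones term weight with 0.5 smoothing:
    log( ((r_i+0.5)/(R-r_i+0.5)) / ((n_i-r_i+0.5)/(N-n_i-R+r_i+0.5)) ).
    """
    return math.log(((r_i + 0.5) / (R - r_i + 0.5)) /
                    ((n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)))

# With no relevance information (R = r_i = 0) the weight reduces to the
# idf-like value log((N - n_i + 0.5) / (n_i + 0.5)).
print(round(rsj_weight(N=100, n_i=10, R=0, r_i=0), 3))
```

The smoothing keeps the weight finite even when $r_i = 0$ or $r_i = R$, which is exactly the small-$r_i$ problem mentioned above.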

### Initial Ranking
The previous equation cannot be computed without estimates of $r_i$ and $R$, which would require adjusting the ideal answer set by interacting with the user. As an initial approximation, we assume $R = r_i = 0$,

This equation provides an $idf$-like ranking computation. This means we can apply this equation for initial retrieval.
#### Example
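As a sketch of the initial, $idf$-like ranking (document frequencies below are hypothetical), the per-term weight under $R = r_i = 0$ is positive for rare terms and negative for very common ones:

```python
import math

def term_weight(n_i, N):
    """Per-term weight with R = r_i = 0: log((N - n_i + 0.5)/(n_i + 0.5))."""
    return math.log((N - n_i + 0.5) / (n_i + 0.5))

def initial_rank(doc_terms, query_terms, n, N):
    """idf-like initial ranking: sum weights over terms shared by q and d."""
    return sum(term_weight(n[t], N) for t in query_terms & doc_terms)

# Hypothetical document frequencies in a 100-document collection.
n = {"to": 90, "do": 10}
print(term_weight(n["do"], 100) > 0)   # rare term: positive weight
print(term_weight(n["to"], 100) < 0)   # very common term: negative weight
```

A term occurring in more than half the collection contributes a negative weight, which motivates the artifact discussed next.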

One possible artifact to contain the effect of **negative weights** is to change the previous equation to:


#### Improving the Initial Ranking
Consider the equation

Estimates based on assumptions:
* $p_{iR} = 0.5$
* $q_{iR} = \frac{n_i}{N}$ where $n_i$ is the number of docs that contain $k_i$
* Use this initial guess to retrieve an initial ranking
* Improve upon this initial ranking
We can obtain,

#### Pipeline
We can attempt to improve this initial ranking as follows,
* $D$ : set of docs initially retrieved
* $D_i$ : subset of docs retrieved that contain $k_i$
Reevaluate estimates:
* $p_{iR} = \frac{|D_i|}{|D|}$
* $q_{iR} = \frac{n_i-|D_i|}{N-|D|}$
**Step 1** : Apply initial equation to select the top 10-20 ranked documents
**Step 2** : inspect them to gather new estimates for $r_i$ and $R$
**Step 3** : remove the 10-20 documents used from the collection
**Step 4** : rerun the query with the estimates obtained for $r_i$ and $R$
Repeat until the output is what we want
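The feedback loop above can be sketched as follows. Everything here is a toy: the collection, the document frequencies, and the relevance judgment are hypothetical, and the Robertson-Sparck Jones weight is used as the scoring formula:

```python
import math

def rsj(N, n_i, R, r_i):
    """Robertson-Sparck Jones weight with 0.5 smoothing."""
    return math.log(((r_i + 0.5) / (R - r_i + 0.5)) /
                    ((n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)))

def rank(docs, query, n, N, R=0, r=None):
    """Order docs by the sum of term weights over shared query terms."""
    r = r or {t: 0 for t in query}
    return sorted(docs, key=lambda d: -sum(
        rsj(N, n[t], R, r[t]) for t in query & docs[d]))

# Hypothetical 4-document collection.
docs = {"d1": {"do", "be"}, "d2": {"to", "be"}, "d3": {"be"}, "d4": {"to"}}
n = {"do": 1, "be": 3, "to": 2}
N, query = 4, {"do"}

# Step 1: initial ranking with R = r_i = 0.
ranking = rank(docs, query, n, N)
# Steps 2-3: inspect the top docs; suppose d1 is judged relevant.
R, r = 1, {"do": 1}
# Step 4: rerun the query with the new estimates.
print(rank(docs, query, n, N, R, r))
```

In a real system the loop would repeat, refining $R$ and $r_i$ on each pass until the ranking stabilizes.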
### Advantages
* Docs ranked in decreasing order of probability of relevance
### Disadvantages
* need to guess initial estimates for $p_{iR}$
* method does not take into account $tf$ factors
* no document length normalization
## Reference
[1] https://www.onely.com/blog/what-is-tf-idf/