# Set Theoretic IR Models

###### tags: `Information Retrieval and Extraction`

## Set Theoretic IR Models

The classical models assume that index terms are mutually independent. The **Set-Based Model** relaxes this assumption by introducing **term correlations** into the ranking.

## Set-Based Model

* The fundamental idea is to use mutual dependencies among index terms to improve results
* Term dependencies are captured through **termsets**, which are sets of correlated terms

### Termsets

A termset is a concept used in place of individual index terms.

* A termset $S_i = \{k_a, k_b, ..., k_n\}$ is a subset of the terms in the collection
* If all index terms in $S_i$ occur in a document $d_j$, then we say that the termset $S_i$ occurs in $d_j$
* There are $2^t$ termsets that might occur in the documents of a collection, where $t$ is the vocabulary size
* The actual number of termsets in a collection is far smaller than $2^t$

#### For example

The set $V_S = \{S_1, S_2, ..., S_{2^{t}}\}$ is the vocabulary-set of the collection

![](https://i.imgur.com/fkCxFGj.png)

To simplify notation, let us define

![](https://i.imgur.com/c3M3miu.png)

Let the letters $a...n$ refer to the index terms $k_a...k_n$, respectively

![](https://i.imgur.com/Lv5dGCx.png)

Consider the query $q$ = **"to do be it"**, i.e. $q = \{a,b,d,n\}$. For this query, the vocabulary-set is as below

![](https://i.imgur.com/4WVfPAm.png)

Notice that there are 11 termsets that occur in our collection, out of the maximum of 15 termsets that can be formed with the terms in $q$

---

### Frequent termsets

Frequent termsets can be used to reduce the number of termsets to consider with long queries.

* A termset composed of $n$ terms is called an $n$-termset
* Let $N_i$ be the number of documents in which $S_i$ occurs
* An $n$-termset $S_i$ is said to be **frequent** if $N_i \ge$ a given threshold

Let the threshold on the frequency of termsets be 2. To compute all frequent termsets for the query $q = \{a, b, d, n\}$ we proceed as follows:
1. Compute the frequent 1-termsets and their inverted lists:
    * $S_a = \{d_1, d_2\}$
    * $S_b = \{d_1, d_3, d_4\}$
    * $S_d = \{d_1, d_2, d_3, d_4\}$
2. Combine the inverted lists to compute frequent 2-termsets:
    * $S_{ad} = \{d_1, d_2\}$
    * $S_{bd} = \{d_1, d_3, d_4\}$
3. Since there are no frequent 3-termsets, stop

There are only 5 frequent termsets in our collection.

### Ranking Computation

The ranking computation is based on the vector model, but adopts termsets instead of index terms. Given a query $q$, let

* $\{S_1, S_2, ...\}$ be the set of all termsets originated from $q$
* $N_i$ be the number of documents in which termset $S_i$ occurs
* $N$ be the total number of documents in the collection
* $F_{i,j}$ be the frequency of termset $S_i$ in document $d_j$

For each pair $[S_i, d_j]$ we compute a weight $W_{i,j}$ given by

\begin{equation}
W_{i,j} =
\begin{cases}
(1+\log_2 F_{i,j})\log_2(1+\frac{N}{N_i}), & \text{if}\ F_{i,j}>0 \\
0, & \text{if}\ F_{i,j}=0
\end{cases}
\end{equation}

#### Document weight

![](https://i.imgur.com/lgfsxFp.jpg)

A document $d_j$ and a query $q$ are represented as vectors in a $2^t$-dimensional space of termsets

![](https://i.imgur.com/ip68XVp.png)

The rank of $d_j$ with respect to the query $q$ is computed as follows

![](https://i.imgur.com/mlwVOzv.png)

The document norm $|\overrightarrow{d}|$ is hard to compute in the space of termsets. Thus, **its computation is restricted to 1-termsets**

Let $q = \{a,b,d,n\}$; the norm of document $d_1$ in terms of 1-termsets is given by

![](https://i.imgur.com/z6r1WhM.png)

#### Similarity

![](https://i.imgur.com/vvogqfO.jpg)

where $F_{i,q}$ is the frequency of termset $S_i$ in the query $q$, and the logarithms are base 2

### Closed Termsets

Frequent termsets allow simplifying the ranking computation.
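The frequent-termset enumeration and the weight $W_{i,j}$ described in the Set-Based Model can be sketched in Python as follows. This is a minimal illustration, not part of the original notes: the toy collection, the brute-force subset enumeration, and the assumption $F_{i,j} = 1$ whenever a termset occurs are all simplifications for clarity.

```python
import math
from itertools import combinations

# Toy collection: document id -> terms (illustrative data, not from the notes)
docs = {
    1: ["a", "b", "d"],
    2: ["a", "d"],
    3: ["b", "d"],
    4: ["b", "d"],
}
N = len(docs)  # total number of documents

def frequent_termsets(query_terms, min_freq=2):
    """Enumerate all termsets built from the query terms and keep those
    occurring in at least `min_freq` documents (brute force, for clarity;
    the notes build them incrementally from frequent (n-1)-termsets)."""
    frequent = {}
    for size in range(1, len(query_terms) + 1):
        for subset in combinations(sorted(query_terms), size):
            doc_ids = {j for j, terms in docs.items()
                       if set(subset) <= set(terms)}
            if len(doc_ids) >= min_freq:
                frequent[subset] = doc_ids
    return frequent

def weight(F_ij, N_i):
    """W_ij = (1 + log2 F_ij) * log2(1 + N / N_i), or 0 if F_ij == 0."""
    if F_ij == 0:
        return 0.0
    return (1 + math.log2(F_ij)) * math.log2(1 + N / N_i)
```

With this toy collection and query terms `a, b, d`, the enumeration yields the five frequent termsets of the running example; `weight` then scores each (termset, document) pair.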
Yet, there can still be many frequent termsets in a large collection.

The **closure of a termset** $S_i$ is the set of all frequent termsets that co-occur with $S_i$ in the same set of documents

![](https://i.imgur.com/qTWNoVK.jpg)

#### Ranking Computation

![](https://i.imgur.com/evt2ckk.jpg)

If we restrict the ranking computation to closed termsets, we can expect a reduction in query time

## Extended Boolean Model

In the Boolean model, **no ranking** of the answer set is generated. The Extended Boolean Model combines characteristics of the **Vector model** and **Boolean algebra**, providing **partial matching** and **term weighting**.

Consider a conjunctive Boolean query $q = k_x \land k_y$. We can plot queries and docs in a two-dimensional space

![](https://i.imgur.com/ZTcSfeD.jpg)

A document $d_j$ is positioned in this space through the adoption of weights $w_{x,j}$ and $w_{y,j}$. These weights can be computed as normalized tf-idf factors as follows

\begin{equation}
w_{x,j} = \frac{f_{x,j}}{\max_x f_{x,j}} \times \frac{idf_x}{\max_i idf_i}
\end{equation}

where

* $f_{x,j}$ is the frequency of term $k_x$ in document $d_j$
* $idf_x$ is the inverse document frequency of term $k_x$, as before

![](https://i.imgur.com/E1kS5QO.png)

### Ranking Computation

1. **2-dimensional**

\begin{equation}
sim(q_{or}, d) = \sqrt{\frac{w_{x,j}^2 + w_{y,j}^2}{2}}
\end{equation}

\begin{equation}
sim(q_{and}, d) = 1 - \sqrt{\frac{(1-w_{x,j})^2 + (1-w_{y,j})^2}{2}}
\end{equation}

2. **m-dimensional with p-distances**

\begin{equation}
sim(q_{or}, d) = \left(\frac{w_{1,j}^p + w_{2,j}^p + ... + w_{m,j}^p}{m}\right)^{\frac{1}{p}}
\end{equation}

\begin{equation}
sim(q_{and}, d) = 1 - \left(\frac{(1-w_{1,j})^p + (1-w_{2,j})^p + ... + (1-w_{m,j})^p}{m}\right)^{\frac{1}{p}}
\end{equation}

3. **Special cases**

If $p = 1$ then (vector-like)
* $sim(q_{or}, d) = sim(q_{and}, d) = \frac{w_{1,j} + w_{2,j} + ... + w_{m,j}}{m}$

If $p = \infty$ then (fuzzy-like)
* $sim(q_{or}, d) = max(w_{i,j})$
* $sim(q_{and}, d) = min(w_{i,j})$
4. **Practical example**

Consider the query $q = (k_1 \land^p k_2) \vee^p k_3$

\begin{equation}
sim(q, d) = \left(\frac{\left(1 - \left(\frac{(1-w_{1,j})^p + (1-w_{2,j})^p}{2}\right)^{\frac{1}{p}}\right)^p + w_{3,j}^p}{2}\right)^{\frac{1}{p}}
\end{equation}

### Remark

* The computation is somewhat complex
* Distributivity does not hold for the ranking computation:
    * $q_1 = (k_1 \vee k_2) \land k_3$
    * $q_2 = (k_1 \land k_3) \vee (k_2 \land k_3)$
    * $sim(q_1, d_j) \ne sim(q_2, d_j)$

## Fuzzy Set Model

This model is based on **Fuzzy Set Theory**. Each query term defines a fuzzy set; its elements are the documents, each with a degree of membership that acts as a relevance score for that term.

Let

* $U$ be the document collection

Suppose

* two query terms $q_1 = k_a$, $q_2 = k_b$ and three documents $d_1, d_2, d_3$
* fuzzy sets $A$ and $B$ with membership functions $\mu_A$ and $\mu_B$
* $\bar{A}$ is the complement of $A$ relative to $U$

\begin{equation}
\mu_A(d_1, d_2, d_3) = (0, 0.5, 1)\\
\mu_B(d_1, d_2, d_3) = (0.1, 0.7, 1)
\end{equation}

Then,

\begin{equation}
\mu_{\bar{A}}(d_2) = 1-\mu_A(d_2) = 1-0.5 = 0.5\\
\mu_{A\cup B}(d_2) = max(\mu_A(d_2), \mu_B(d_2)) = max(0.5, 0.7) = 0.7\\
\mu_{A\cap B}(d_2) = min(\mu_A(d_2), \mu_B(d_2)) = min(0.5, 0.7) = 0.5
\end{equation}

### Fuzzy Information Retrieval

A thesaurus can be constructed by defining a term-term correlation matrix $C$. Each element of $C$ defines a normalized correlation factor $c_{i,l}$ between two terms $k_i$ and $k_l$

\begin{equation}
c_{i,l} = \frac{n_{i,l}}{n_i+n_l-n_{i,l}}
\end{equation}

where

* $n_i$: number of docs which contain $k_i$
* $n_l$: number of docs which contain $k_l$
* $n_{i,l}$: number of docs which contain both $k_i$ and $k_l$

In the fuzzy set associated with term $k_i$, a document $d_j$ has a degree of membership $\mu_{i,j}$ given by

![](https://i.imgur.com/95WZIud.jpg)

If $k_i$ is in $d_j$, then $c_{i,i} = 1 \Rightarrow \prod_{k_l \in d_j} (1-c_{i,l}) = 0 \Rightarrow \mu_{i,j} = 1$

If $k_i$ is not in $d_j$, then we calculate the term correlations between $k_i$ and the terms $k_l \in d_j$.
When some $k_l \in d_j$ is closely related to $k_i$, $c_{i,l} \simeq 1 \Rightarrow \prod_{k_l \in d_j} (1-c_{i,l}) \simeq 0 \Rightarrow \mu_{i,j} \simeq 1$

### Example

![](https://i.imgur.com/QfZFJSM.png)

Consider the query $q = k_a \land (k_b \lor \lnot k_c)$. The disjunctive normal form of $q$ is composed of 3 conjunctive components ($cc$), as follows:

$\vec{q}_{dnf} = (1,1,1) + (1,1,0) + (1,0,0) = cc_1 + cc_2 + cc_3$

Let

* $D_a$, $D_b$ and $D_c$ be the fuzzy sets associated with the terms $k_a$, $k_b$ and $k_c$
* $\mu_{a,j}$, $\mu_{b,j}$, and $\mu_{c,j}$ be the degrees of membership of document $d_j$ in the fuzzy sets $D_a$, $D_b$, and $D_c$

Then

\begin{equation}
\mu_{cc_1,j} = \mu_{a,j}\mu_{b,j}\mu_{c,j}\\
\mu_{cc_2,j} = \mu_{a,j}\mu_{b,j}(1-\mu_{c,j})\\
\mu_{cc_3,j} = \mu_{a,j}(1-\mu_{b,j})(1-\mu_{c,j})
\end{equation}

$\mu_{q,j} = \mu_{cc_1+cc_2+cc_3,j} = 1-\prod_{i=1}^3 (1-\mu_{cc_i,j})\\
= 1-(1-\mu_{a,j}\mu_{b,j}\mu_{c,j})\times (1-\mu_{a,j}\mu_{b,j}(1-\mu_{c,j}))\times (1-\mu_{a,j}(1-\mu_{b,j})(1-\mu_{c,j}))$

### Remark

* Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
* They provide an interesting framework which naturally embodies the notion of term dependencies
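The p-norm similarity formulas of the Extended Boolean Model translate directly into code. The sketch below is an illustration, not from the original notes; the weight values used in the comments are arbitrary assumptions.

```python
def sim_or(weights, p):
    """Disjunctive p-norm similarity: ((w1^p + ... + wm^p) / m)^(1/p)."""
    m = len(weights)
    return (sum(w ** p for w in weights) / m) ** (1 / p)

def sim_and(weights, p):
    """Conjunctive p-norm similarity: 1 - (((1-w1)^p + ... + (1-wm)^p) / m)^(1/p)."""
    m = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / m) ** (1 / p)

# e.g. a document with weights [0.5, 0.3] for the two query terms:
# with p = 1 both similarities collapse to the average (vector-like),
# and as p grows, sim_or approaches max and sim_and approaches min (fuzzy-like).
```

This makes the special cases of the notes easy to verify numerically for any choice of $p$.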
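The fuzzy evaluation of the example query $q = k_a \land (k_b \lor \lnot k_c)$ can be sketched as below. This is a minimal illustration assuming the disjunctive normal form derived in the notes; the membership values passed in are hypothetical.

```python
def mu_q(mu_a, mu_b, mu_c):
    """Membership of a document in the fuzzy set of the query
    q = ka AND (kb OR NOT kc), evaluated through its disjunctive
    normal form: mu_q = 1 - prod(1 - mu_cc_i) over the components."""
    cc = [
        mu_a * mu_b * mu_c,              # cc1 = (1, 1, 1)
        mu_a * mu_b * (1 - mu_c),        # cc2 = (1, 1, 0)
        mu_a * (1 - mu_b) * (1 - mu_c),  # cc3 = (1, 0, 0)
    ]
    result = 1.0
    for m in cc:
        result *= (1 - m)
    return 1 - result
```

Since every conjunctive component contains $\mu_{a,j}$ as a factor, a document with $\mu_{a,j} = 0$ always gets $\mu_{q,j} = 0$, matching the intuition that $k_a$ is mandatory in $q$.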