# Set Theoretic IR Models
###### tags: `Information Retrieval and Extraction`
## Set Theoretic IR Models
Classical IR models assume that index terms are mutually independent. Set-theoretic models relax this assumption: the **Set-Based Model** introduces term **correlations** into the ranking.
## Set-Based Model
* The fundamental idea is to use mutual dependencies among index terms to improve results
* Term dependencies are captured through **termsets**, which are sets of correlated terms
### Termsets
A termset is a concept used in place of individual index terms.
* termset $S_i = \{k_a, k_b, ..., k_n\}$ : a subset of the terms in the collection
* If all index terms in $S_i$ occur in a document $d_j$ then we say that the termset $S_i$ occurs in $d_j$
* There are $2^t$ termsets that might occur in the documents of a collection, where $t$ is the vocabulary size
* The actual number of termsets in a collection is far smaller than $2^t$
#### For example
The set $V_S = \{S_1, S_2, ..., S_{2^{t}}\}$ is called the vocabulary-set of the collection

To simplify notation, let us define

let the letters $a...n$ refer to the index terms $k_a...k_n$ , respectively

Consider the query $q$ as **“to do be it”**, i.e. $q = \{a,b,d,n\}$. For this query, the vocabulary-set is as below

Notice that there are 11 termsets that occur in our collection, out of the maximum of 15 termsets that can be formed with the terms in $q$
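The candidate termsets of a query are exactly the non-empty subsets of its terms. A minimal sketch of the enumeration, using the single-letter aliases defined above:

```python
from itertools import combinations

def query_termsets(query_terms):
    """Enumerate all non-empty termsets that can be formed from a query."""
    terms = sorted(query_terms)
    for size in range(1, len(terms) + 1):
        for subset in combinations(terms, size):
            yield frozenset(subset)

# For q = {a, b, d, n} there are 2^4 - 1 = 15 candidate termsets.
termsets = list(query_termsets({"a", "b", "d", "n"}))
print(len(termsets))  # 15
```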

---
### Frequent termsets
Frequent termsets can be used to reduce the number of termsets to consider with long queries
* A termset composed of $n$ terms is called an $n$-termset
* Let $N_i$ be the number of documents in which $S_i$ occurs
* An $n$-termset $S_i$ is said to be **frequent** if $N_i \ge$ a given threshold
Let the threshold on the frequency of termsets be 2. To compute all frequent termsets for the query $q = \{a, b, d, n\}$ we proceed as follows :
1. Compute the frequent 1-termsets and their inverted lists:
* $S_a = \{d_1, d_2\}$
* $S_b = \{d_1, d_3, d_4\}$
* $S_d = \{d_1, d_2, d_3, d_4\}$
2. Combine the inverted lists to compute frequent 2-termsets:
* $S_{ad} = \{d_1, d_2\}$
* $S_{bd} = \{d_1, d_3, d_4\}$
3. Since there are no frequent 3-termsets, stop

There are only 5 frequent termsets in our collection: $S_a$, $S_b$, $S_d$, $S_{ad}$, and $S_{bd}$
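The three steps above follow a level-wise (Apriori-style) pattern: frequent $(n+1)$-termsets are built only from frequent $n$-termsets by intersecting their inverted lists. A sketch using the inverted lists of the example (term $n$ is assumed to fall below the threshold, so it is omitted):

```python
from itertools import combinations

# Inverted lists of the frequent 1-termsets for q = {a, b, d, n},
# taken from the worked example above.
inverted = {
    "a": {"d1", "d2"},
    "b": {"d1", "d3", "d4"},
    "d": {"d1", "d2", "d3", "d4"},
}

def frequent_termsets(inverted, min_docs=2):
    """Level-wise (Apriori-style) mining: frequent (n+1)-termsets are built
    by intersecting the inverted lists of frequent n-termsets."""
    frequent = {frozenset([t]): docs for t, docs in inverted.items()
                if len(docs) >= min_docs}
    level = dict(frequent)
    while level:
        next_level = {}
        for s1, s2 in combinations(list(level), 2):
            union = s1 | s2
            if len(union) != len(s1) + 1:
                continue  # s1 and s2 do not differ in exactly one term
            docs = level[s1] & level[s2]
            if len(docs) >= min_docs:
                next_level[union] = docs
        frequent.update(next_level)
        level = next_level
    return frequent

result = frequent_termsets(inverted)
print(len(result))  # 5: S_a, S_b, S_d, S_ad, S_bd
```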
### Ranking Computation
The ranking computation is based on the vector model, but adopts termsets instead of index terms.
Given a query $q$, let
* $\{S_1, S_2, \ldots\}$ be the set of all termsets originated from $q$
* $N_i$ be the number of documents in which termset $S_i$ occurs
* $N$ be the total number of documents in the collection
* $F_{i,j}$ be the frequency of termset $S_i$ in document $d_j$

For each pair $[S_i,d_j]$ we compute a weight $W_{i,j}$ given by
\begin{equation}
W_{i,j} =
\begin{cases}
(1+\log_2 F_{i,j})\log_2(1+\frac{N}{N_i}), & \text{if}\ F_{i,j}>0 \\
0, & \text{if}\ F_{i,j}=0
\end{cases}
\end{equation}
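The weight is a direct tf-idf-style product and can be computed as follows (the example frequencies are hypothetical):

```python
import math

def termset_weight(F_ij, N_i, N):
    """W_{i,j} = (1 + log2 F_{i,j}) * log2(1 + N/N_i) when the termset
    occurs in the document, 0 otherwise."""
    if F_ij == 0:
        return 0.0
    return (1 + math.log2(F_ij)) * math.log2(1 + N / N_i)

# Hypothetical numbers: S_i occurs twice in d_j and in 2 of the 4 documents.
print(round(termset_weight(F_ij=2, N_i=2, N=4), 3))  # 3.17
```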
#### Document weight

A document $d_j$ and a query $q$ are represented as vectors in a $2^t$-dimensional space of termsets

The rank of $d_j$ to the query $q$ is computed as follows

The document norm $|\overrightarrow{d_j}|$ is hard to compute in the space of termsets. Thus, **its computation is restricted to 1-termsets**, i.e. $|\overrightarrow{d_j}| \approx \sqrt{\sum_{|S_i|=1} W_{i,j}^2}$
Let $q = \{a,b,d,n\}$; for $d_1$, the document norm in terms of 1-termsets is given by

#### Similarity

\begin{equation}
sim(d_j, q) = \frac{\sum_{S_i \in q} W_{i,j} \times W_{i,q}}{|\overrightarrow{d_j}| \times |\overrightarrow{q}|}
\end{equation}
where $W_{i,q}$ is computed analogously from $F_{i,q}$, the frequency of termset $S_i$ in the query $q$; all logarithms are base 2
### Closed Termsets
Frequent termsets simplify the ranking computation. Yet, there may still be many frequent termsets in a large collection
The **closure of a termset** $S_i$ is the set of all frequent termsets that co-occur with $S_i$ in the same set of docs

#### Ranking Computation

If we restrict the ranking computation to closed termsets, we can expect a reduction in query processing time
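Closed termsets can be found by grouping frequent termsets by the exact set of documents they occur in and keeping only the maximal termset of each group. A sketch over the five frequent termsets of the running example:

```python
def closed_termsets(frequent):
    """Group frequent termsets by the exact document set they occur in and
    keep only the maximal termset(s) of each group."""
    by_docs = {}
    for termset, docs in frequent.items():
        by_docs.setdefault(frozenset(docs), []).append(termset)
    closed = set()
    for group in by_docs.values():
        for s in group:
            if not any(s < other for other in group):  # s is maximal
                closed.add(s)
    return closed

# The 5 frequent termsets of the running example with their inverted lists.
frequent = {
    frozenset("a"): {"d1", "d2"},
    frozenset("b"): {"d1", "d3", "d4"},
    frozenset("d"): {"d1", "d2", "d3", "d4"},
    frozenset("ad"): {"d1", "d2"},
    frozenset("bd"): {"d1", "d3", "d4"},
}
print(sorted("".join(sorted(s)) for s in closed_termsets(frequent)))  # ['ad', 'bd', 'd']
```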
## Extended Boolean Model
In the Boolean model, **no ranking** of the answer set is produced. The Extended Boolean Model combines characteristics of the **Vector model** and **Boolean algebra**, supporting both **partial matching** and **term weighting**.
Consider a conjunctive Boolean query $q = k_x \land k_y$, we can plot queries and docs in a two-dimensional space

A document $d_j$ is positioned in this space through the adoption of weights $w_{x,j}$ and $w_{y,j}$
These weights can be computed as normalized tf-idf factors as follows
\begin{equation}
w_{x,j} = \frac{f_{x,j}}{\max_x f_{x,j}} \times \frac{idf_x}{\max_i idf_i}
\end{equation}
where $f_{x,j}$ is the frequency of term $k_x$ in document $d_j$
$idf_x$ is the inverse document frequency of term $k_x$, as before
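A sketch of the normalized weight computation, assuming the common definition $idf_x = \log(N/n_x)$ and hypothetical frequency data:

```python
import math

def normalized_weights(tf, df, N):
    """w_{x,j} = (f_{x,j} / max_x f_{x,j}) * (idf_x / max_i idf_i),
    assuming idf_x = log(N / n_x); tf maps term -> frequency in d_j,
    df maps term -> number of documents containing the term."""
    max_tf = max(tf.values())
    idf = {t: math.log(N / df[t]) for t in tf}
    max_idf = max(idf.values())
    return {t: (tf[t] / max_tf) * (idf[t] / max_idf) for t in tf}

# Hypothetical document: k_x appears 3 times (rare term), k_y once (common term).
weights = normalized_weights({"x": 3, "y": 1}, {"x": 2, "y": 8}, N=16)
```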

### Ranking Computation
1. **2-dimensional**
\begin{equation}
sim(q_{or},d_j) = \sqrt{\frac{w_{x,j}^2+w_{y,j}^2}{2}}
\end{equation}
\begin{equation}
sim(q_{and},d_j) = 1 - \sqrt{\frac{(1-w_{x,j})^2+(1-w_{y,j})^2}{2}}
\end{equation}
2. **m-dimensional with p-norms**
\begin{equation}
sim(q_{or},d_j) = \left(\frac{w_{1,j}^p+w_{2,j}^p+\cdots+w_{m,j}^p}{m}\right)^{\frac{1}{p}}
\end{equation}
\begin{equation}
sim(q_{and},d_j) = 1 - \left(\frac{(1-w_{1,j})^p+(1-w_{2,j})^p+\cdots+(1-w_{m,j})^p}{m}\right)^{\frac{1}{p}}
\end{equation}
3. **Special cases**
If $p=1$, then (vector-like)
* $sim(q_{or},d_j) = sim(q_{and},d_j) = \frac{w_{1,j}+w_{2,j}+\cdots+w_{m,j}}{m}$
If $p = \infty$, then (fuzzy-like)
* $sim(q_{or},d_j) = \max(w_{i,j})$
* $sim(q_{and},d_j) = \min(w_{i,j})$
4. **Practical examples**
Consider the query $q=(k_1 \land^p k_2) \vee^p k_3$:
\begin{equation}
sim(q,d_j) = \left(\frac{\left(1 - \left(\frac{(1-w_{1,j})^p+(1-w_{2,j})^p}{2}\right)^{\frac{1}{p}}\right)^p+w_{3,j}^p}{2}\right)^{\frac{1}{p}}
\end{equation}
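The p-norm operators compose recursively, so a query tree is evaluated bottom-up. A sketch covering the disjunctive and conjunctive operators and the example query $q=(k_1 \land^p k_2) \vee^p k_3$:

```python
def sim_or(weights, p):
    """Disjunctive p-norm: ((w_1^p + ... + w_m^p) / m)^(1/p)."""
    m = len(weights)
    return (sum(w ** p for w in weights) / m) ** (1 / p)

def sim_and(weights, p):
    """Conjunctive p-norm: 1 - (((1-w_1)^p + ... + (1-w_m)^p) / m)^(1/p)."""
    m = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / m) ** (1 / p)

def sim_query(w1, w2, w3, p):
    """q = (k_1 AND^p k_2) OR^p k_3: evaluate the inner AND, then the OR."""
    return sim_or([sim_and([w1, w2], p), w3], p)

# p = 1 degenerates to a plain average (vector-like) for both operators.
print(round(sim_or([0.2, 0.4, 0.6], p=1), 6))  # 0.4
```

Large values of `p` approach the fuzzy `max`/`min` behavior noted in the special cases.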
### Remark
* Computation is somewhat complex
* the distributive law does not hold for the ranking computation:
* $q_1 = (k_1 \vee k_2) \land k_3$
* $q_2 = (k_1 \land k_3) \vee (k_2 \land k_3)$
* $sim(q_1, d_j) \ne sim(q_2,d_j)$
## Fuzzy Set Model
This model is based on **Fuzzy Set Theory**. Each query term defines a fuzzy set whose elements are the documents, each with a degree of membership (relevance score) with respect to that term
Let
* $U$ be the document collection
* $\mu_A : U \rightarrow [0,1]$ be the membership function of a fuzzy set $A$

Suppose
* two query terms $q_1 = k_a$, $q_2 = k_b$ and three documents $d_1, d_2, d_3$
* fuzzy sets $A$ and $B$ with membership functions $\mu_A$ and $\mu_B$, respectively
* $\bar{A}$ be the complement of $A$ relative to $U$
\begin{equation}
\mu_A = (d_1, d_2, d_3) = (0, 0.5, 1)\\
\mu_B = (d_1, d_2, d_3) = (0.1, 0.7, 1)
\end{equation}
Then,
\begin{equation}
\mu_\bar{A}(d_2) = 1-\mu_A(d2)=1-0.5=0.5\\
\mu_{A\cup B}(d_2) = max(\mu_A(d_2), \mu_B(d_2)) = max(0.5, 0.7) = 0.7\\
\mu_{A\cap B}(d_2) = min(\mu_A(d_2), \mu_B(d_2)) = min(0.5, 0.7) = 0.5\\
\end{equation}
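The three fuzzy operations map directly onto `1 - x`, `max`, and `min` over membership values. A sketch reproducing the numbers of the example:

```python
def complement(mu):
    """Membership in the complement set: 1 - mu_A(d)."""
    return {d: 1 - v for d, v in mu.items()}

def union(mu_a, mu_b):
    """Membership in A ∪ B: max of the two memberships."""
    return {d: max(mu_a[d], mu_b[d]) for d in mu_a}

def intersection(mu_a, mu_b):
    """Membership in A ∩ B: min of the two memberships."""
    return {d: min(mu_a[d], mu_b[d]) for d in mu_a}

mu_A = {"d1": 0.0, "d2": 0.5, "d3": 1.0}
mu_B = {"d1": 0.1, "d2": 0.7, "d3": 1.0}

print(complement(mu_A)["d2"])          # 0.5
print(union(mu_A, mu_B)["d2"])         # 0.7
print(intersection(mu_A, mu_B)["d2"])  # 0.5
```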
### Fuzzy Information Retrieval
A thesaurus can be constructed by defining a term-term correlation matrix $C$
Each element of $C$ defines a normalized correlation factor $c_{i,l}$ between two terms $k_i$ and $k_l$
\begin{equation}
c_{i,l} = \frac{n_{i,l}}{n_i+n_l-n_{i,l}}
\end{equation}
where
* $n_i$: number of docs which contain $k_i$
* $n_l$: number of docs which contain $k_l$
* $n_{i,l}$: number of docs which contain both $k_i$ and $k_l$
In this fuzzy set, a document $d_j$ has a degree of membership $\mu_{i,j}$ given by
\begin{equation}
\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1-c_{i,l})
\end{equation}
If $k_i$ is in $d_j$, then $c_{i,i} = 1 \Rightarrow \prod_{k_l \in d_j} (1-c_{i,l}) = 0 \Rightarrow \mu_{i,j}=1$.
If $k_i$ is not in $d_j$, then $\mu_{i,j}$ depends on the correlations between $k_i$ and the terms $k_l$ of $d_j$. When some $k_l$ is closely related to $k_i$, $c_{i,l} \simeq 1 \Rightarrow \prod_{k_l \in d_j} (1-c_{i,l}) \simeq 0 \Rightarrow \mu_{i,j} \simeq 1$.
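A sketch of the correlation factor and the resulting membership degree $\mu_{i,j} = 1 - \prod_{k_l \in d_j}(1-c_{i,l})$, using hypothetical postings lists:

```python
def correlation(docs_i, docs_l):
    """c_{i,l} = n_{i,l} / (n_i + n_l - n_{i,l})."""
    n_il = len(docs_i & docs_l)
    return n_il / (len(docs_i) + len(docs_l) - n_il)

def membership(k_i, doc_terms, postings):
    """mu_{i,j} = 1 - prod over k_l in d_j of (1 - c_{i,l})."""
    prod = 1.0
    for k_l in doc_terms:
        prod *= 1 - correlation(postings[k_i], postings[k_l])
    return 1 - prod

# Hypothetical postings lists: k_a and k_b always co-occur; k_c never does.
postings = {"a": {"d1", "d2"}, "b": {"d1", "d2"}, "c": {"d3"}}
print(membership("a", {"a", "b"}, postings))  # 1.0 (k_a itself is in the doc)
print(membership("c", {"a", "b"}, postings))  # 0.0 (no correlated term present)
```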
### Example

Consider the query $q = k_a \land (k_b \lor \lnot k_c)$. The disjunctive normal form of $q$ is composed of 3 conjunctive components
($cc$), as follows:
$\vec{q}_{dnf} = (1,1,1) + (1,1,0) + (1,0,0) = cc_1 + cc_2 + cc_3$
Let
* $D_a$, $D_b$ and $D_c$ be the fuzzy sets associated with the terms $k_a$, $k_b$ and $k_c$.
* $μ_{a,j}$, $μ_{b,j}$, and $μ_{c,j}$ be the degrees of memberships of document $d_j$ in the fuzzy sets $D_a$, $D_b$, and $D_c$.
Then
\begin{equation}
\mu_{cc_1,j} = \mu_{a,j}\mu_{b,j}\mu_{c,j}\\
\mu_{cc_2,j} = \mu_{a,j}\mu_{b,j}(1-\mu_{c,j})\\
\mu_{cc_3,j} = \mu_{a,j}(1-\mu_{b,j})(1-\mu_{c,j})
\end{equation}
\begin{equation}
\mu_{q,j} = \mu_{cc_1+cc_2+cc_3,j} = 1-\prod_{i=1}^3 (1-\mu_{cc_i,j})\\
= 1-(1-\mu_{a,j}\mu_{b,j}\mu_{c,j})\times (1-\mu_{a,j}\mu_{b,j}(1-\mu_{c,j}))\times (1-\mu_{a,j}(1-\mu_{b,j})(1-\mu_{c,j}))
\end{equation}
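The algebraic-sum evaluation over the three conjunctive components can be checked numerically. A sketch:

```python
def mu_query(mu_a, mu_b, mu_c):
    """mu_{q,j} for q = k_a AND (k_b OR NOT k_c), via its disjunctive
    normal form cc_1 + cc_2 + cc_3 and the algebraic sum of components."""
    cc1 = mu_a * mu_b * mu_c              # component (1,1,1)
    cc2 = mu_a * mu_b * (1 - mu_c)        # component (1,1,0)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)  # component (1,0,0)
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)

print(mu_query(1.0, 1.0, 0.0))  # 1.0: the document exactly matches cc_2
print(mu_query(0.0, 0.5, 0.5))  # 0.0: mu_a = 0 zeroes every component
```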
### Remark
* Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
* They provide an interesting framework which naturally embodies the notion of term dependencies