# Algebraic IR Models

###### tags: `Information Retrieval and Extraction`

## Generalized Vector Model

Classic models assume that index terms are mutually independent. For instance, in the Vector model the set of term vectors $\{\vec{k_1},\vec{k_2}, \ldots,\vec{k_t}\}$ is pairwise orthogonal, and therefore linearly independent ($\vec{k_i}\cdot \vec{k_j} = 0$ for all $i \neq j$).

### Key Idea

Let $w_{i,j}$ be the weight associated with the pair $[k_i,d_j]$ and $V = \{k_1, k_2, \ldots, k_t\}$ be the set of all index terms.

If the $w_{i,j}$ weights are binary, all patterns of occurrence of terms within documents can be represented by minterms:

![](https://i.imgur.com/hISLwDA.png)

For instance, $m_2$ indicates documents in which only the term $k_1$ occurs.

### Forming the Term Vectors

Let $on(i, m_r)$ return the weight (0 or 1) of the index term $k_i$ in the minterm $m_r$. The vector associated with the term $k_i$ is computed as:

![](https://i.imgur.com/mt4gDrk.png)

### Dependency between Index Terms

A degree of correlation between the terms $k_i$ and $k_j$ can now be computed as:

![](https://i.imgur.com/dma2CbB.png)

### Example

![](https://i.imgur.com/uG0vZ4L.png)

#### Computation of $c_{i,r}$

![](https://i.imgur.com/OYNnA0H.png)

#### Computation of $\vec{k_i}$

![](https://i.imgur.com/ccMDARa.png)

#### Computation of Document Vectors

![](https://i.imgur.com/nNkVpwd.png)

### Remark

* The model takes correlations among index terms into account
* Its computational costs are higher than those of the classic vector model

## Latent Semantic Indexing

The idea here is to map documents and queries into a lower-dimensional space composed of concepts.

### Motivation

Classic IR might lead to poor retrieval performance because:

* unrelated documents might be included in the answer set
* relevant documents that do not contain at least one index term are not retrieved
* **Reasoning**: retrieval based on index terms is vague and noisy

### Key Idea

Let

* $t$: total number of index terms
* $N$: number of documents
* $\mathbf{M} = [m_{ij}]$: a $t \times N$ term-document matrix

To each element of $\mathbf{M}$ is assigned a weight $w_{i,j}$ associated with the term-document pair $[k_i, d_j]$

* The weight $w_{i,j}$ can be based on a $tf$-$idf$ weighting scheme

The matrix $\mathbf{M} = [m_{ij}]$ can be decomposed into three components using singular value decomposition:
\begin{equation}
\mathbf{M} = \mathbf{K}\cdot \mathbf{S}\cdot \mathbf{D}^{\mathbf{T}}
\end{equation}
where

* $\mathbf{K}$ is the matrix of eigenvectors derived from $\mathbf{C} = \mathbf{M} \cdot \mathbf{M}^{\mathbf{T}}$ (the $t\times t$ term-term matrix)
* $\mathbf{D}^{\mathbf{T}}$ is the matrix of eigenvectors derived from $\mathbf{M}^{\mathbf{T}}\cdot \mathbf{M}$ (the $N\times N$ document-document matrix)
* $\mathbf{S}$ is an $r \times r$ diagonal matrix of singular values, where $r = \min(t,N)$ is the rank of $\mathbf{M}$

Let $\mathbf{M}^{\mathbf{T}} = [m_{ij}]$ be given by

![](https://i.imgur.com/yKZ0ZXY.png)
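The decomposition itself can be computed with a standard SVD routine. Below is a minimal sketch using NumPy; the toy matrix `M` and the names `K`, `S`, `Dt` are illustrative assumptions, not values taken from these notes.

```python
import numpy as np

# Toy t x N term-document matrix (t = 4 terms, N = 3 documents);
# the binary weights here are purely illustrative.
M = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
], dtype=float)

# Thin SVD: M = K . S . D^T
K, s, Dt = np.linalg.svd(M, full_matrices=False)
S = np.diag(s)

# Sanity check: the three factors reconstruct M (up to numerical error)
assert np.allclose(K @ S @ Dt, M)
```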
### Reduced Model

In the matrix $\mathbf{S}$, keep only the $s$ largest singular values, together with the corresponding columns of $\mathbf{K}$ and $\mathbf{D}^{\mathbf{T}}$. The resulting matrix is called $\mathbf{M_s}$ and is given by
\begin{equation}
\mathbf{M_s} = \mathbf{K_s}\cdot \mathbf{S_s}\cdot \mathbf{D_s}^{\mathbf{T}}
\end{equation}
where $s$, $s < r$, is the dimensionality of the reduced concept space.

### Latent Ranking

The parameter $s$ should be

* large enough to allow fitting the characteristics of the data
* small enough to filter out non-relevant representational details

The relationship between any two documents in the reduced space can be obtained from the matrix $\mathbf{M_s}^{\mathbf{T}}\cdot \mathbf{M_s}$, given by

![](https://i.imgur.com/BkpAHKL.png)

In the above matrix, the $(i, j)$ element quantifies the relationship between documents $d_i$ and $d_j$.

![](https://i.imgur.com/h93tVC4.png)

### Example

**Step 1**. A collection of documents has its content terms tabulated to give a term-frequency matrix, which is taken as $\mathbf{X}$.

![](https://i.imgur.com/06OMg7t.jpg)

**Step 2**. A $k$-dimensional SVD decomposition of $\mathbf{X}$ is computed, yielding matrices $\mathbf{T}$, $\mathbf{S}$, and $\mathbf{D}$.

![](https://i.imgur.com/l3hU4Dv.jpg)

The rows of $\mathbf{T}$ and $\mathbf{D}$ are taken as index vectors for the corresponding terms and documents, respectively.

**Step 3**. The diagonal elements of $\mathbf{S}$ (or of $\mathbf{S}^{\frac{1}{2}}$, as needed) are taken as component-wise weights in the ensuing similarity calculations.

![](https://i.imgur.com/JJoGuP2.jpg)

**Step 4**. A query, treated as a vector of term frequencies (albeit a very sparse one), is converted to a pseudo-document $\mathbf{D_q}$ in the factor space by the following equation:
\begin{equation}
\mathbf{D_q} = \mathbf{X_q}^{\mathbf{T}}\cdot \mathbf{T}\cdot \mathbf{S}^{-1}
\end{equation}

![](https://i.imgur.com/DvruYi2.png)

**Step 5**. This query factor-vector is then compared to the factor-vectors of all the documents, and the documents are ranked according to the results (a code sketch of Steps 2 to 5 follows the Remark below).

![](https://i.imgur.com/TMun43i.jpg)

### Remark

* Latent semantic indexing provides an interesting conceptualization of the IR problem
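As a complement to the worked example, the following is a minimal NumPy sketch of Steps 2 to 5 under assumed inputs: the matrix `X`, the query vector `Xq`, and the choice `k = 2` are illustrative and are not the values shown in the figures above.

```python
import numpy as np

# Illustrative term-document frequency matrix X (5 terms x 4 documents)
X = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# Step 2: SVD of X, then keep only the k largest singular values
T, s, Dt = np.linalg.svd(X, full_matrices=False)
k = 2                                                # reduced concept space (s < r)
Tk, Sk, Dk = T[:, :k], np.diag(s[:k]), Dt[:k, :].T   # rows of Dk = document factor-vectors

# Step 4: fold a query into the factor space: D_q = X_q^T . T_k . S_k^{-1}
Xq = np.array([1, 0, 1, 0, 0], dtype=float)          # illustrative query term frequencies
Dq = Xq @ Tk @ np.linalg.inv(Sk)

# Step 5: rank documents by cosine similarity between Dq and each document factor-vector
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = np.array([cosine(Dq, d) for d in Dk])
ranking = np.argsort(-scores)
print("document ranking:", ranking, "scores:", scores[ranking])
```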