# Algebraic IR Models
###### tags: `Information Retrieval and Extraction`
## Generalized Vector Model
Classic models enforce independence of index terms. For instance, in the Vector model the set of term vectors $\{\vec{k_1},\vec{k_2},\dots,\vec{k_t}\}$ is assumed linearly independent, which is frequently interpreted as pairwise orthogonality: $\vec{k_i}\cdot \vec{k_j} = 0$ for all $i \neq j$. The Generalized Vector Model relaxes this orthogonality assumption.
### Key Idea
Let $w_{i,j}$ be the weight associated with the pair $[k_i,d_j]$ and $V = \{k_1, k_2, \dots, k_t\}$ be the set of all terms. If the $w_{i,j}$ weights are binary, all patterns of occurrence of terms within docs can be represented by the $2^t$ minterms:
\begin{align}
m_1 &= (0,0,\dots,0)\\
m_2 &= (1,0,\dots,0)\\
m_3 &= (0,1,\dots,0)\\
m_4 &= (1,1,\dots,0)\\
&\ \ \vdots\\
m_{2^t} &= (1,1,\dots,1)
\end{align}
For instance, $m_2$ indicates documents in which solely the term $k_1$ occurs
### Forming the Term Vectors
Let $on(i, m_r)$ return the weight $\{0, 1\}$ of the index term $k_i$ in the minterm $m_r$. The vector associated with the term $k_i$ is computed as:
\begin{equation}
\vec{k_i} = \frac{\sum_{\forall r,\ on(i,m_r)=1} c_{i,r}\,\vec{m_r}}{\sqrt{\sum_{\forall r,\ on(i,m_r)=1} c_{i,r}^{2}}}
\qquad\text{with}\qquad
c_{i,r} = \sum_{d_j \,\mid\, c(d_j) = m_r} w_{i,j}
\end{equation}
where $\vec{m_r}$ is the unit vector associated with the minterm $m_r$ (the minterm vectors are pairwise orthonormal by construction) and $c(d_j)$ denotes the pattern of term occurrence in document $d_j$.
### Dependency between Index Terms
A degree of correlation between the terms $k_i$ and $k_j$ can now be computed as:
\begin{equation}
\vec{k_i} \cdot \vec{k_j} \;\propto \sum_{\forall r \,\mid\, on(i,m_r)=1 \,\wedge\, on(j,m_r)=1} c_{i,r} \times c_{j,r}
\end{equation}
which is non-zero whenever the two terms co-occur in some document, so the model no longer forces term independence.
### Example
The three computations below ($c_{i,r}$, $\vec{k_i}$, and the document vectors) are illustrated with a code sketch at the end of this section.
#### Computation of $c_{i,r}$

#### Computation of $\vec{k_i}$

#### Computation of Document Vectors
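To make the definitions concrete, here is a minimal Python sketch of the whole pipeline, assuming a hypothetical toy collection (3 terms, 4 documents, binary weights); all data and names are illustrative only.

```python
import numpy as np

# Hypothetical toy collection: 3 index terms x 4 documents,
# binary weights w_{i,j} (all numbers made up for illustration).
W = np.array([
    [1, 0, 1, 1],   # k_1
    [0, 1, 1, 0],   # k_2
    [1, 1, 0, 0],   # k_3
])
t, N = W.shape

# c(d_j): the pattern of term occurrence in document d_j, i.e. its minterm.
patterns = [tuple(int(w > 0) for w in W[:, j]) for j in range(N)]
active = sorted(set(patterns))      # only minterms that actually occur

# c_{i,r}: sum of the weights w_{i,j} over documents whose pattern is m_r.
c = np.zeros((t, len(active)))
for r, m in enumerate(active):
    for j, p in enumerate(patterns):
        if p == m:
            c[:, r] += W[:, j]

# Term vectors: k_i = sum over {r : on(i,m_r)=1} of c_{i,r} m_r, normalized.
# The minterm vectors m_r form an orthonormal basis (unit coordinate
# vectors, one axis per active minterm).
K = np.zeros((t, len(active)))
for i in range(t):
    for r, m in enumerate(active):
        if m[i] == 1:               # on(i, m_r) = 1
            K[i, r] = c[i, r]
    K[i] /= np.linalg.norm(K[i])

# Term-term correlations are no longer forced to zero:
print(np.round(K @ K.T, 3))

# Document vectors expressed in the minterm space: d_j = sum_i w_{i,j} k_i
D = W.T @ K
print(np.round(D, 3))
```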

### Remark
* The model considers correlations among index terms, generalizing the classic Vector model
* Computation costs are considerably higher, since term vectors live in a space with up to $2^t$ minterm dimensions
## Latent Semantic Indexing
The idea here is to map documents and queries into a lower-dimensional space whose dimensions correspond to concepts rather than individual index terms
### Motivation
Classic IR may lead to poor retrieval performance because:
* unrelated documents might be included in the answer set
* relevant documents that do not contain at least one index term are not retrieved
* **Reasoning**: retrieval based on index terms is vague and noisy
### Key idea
Let
* $t$: total number of index terms
* $N$: number of documents
* $\mathbf{M}$ = $[m_{ij}]$: term-document matrix $t \times N$
Each element of $\mathbf{M}$ is assigned a weight $w_{i,j}$ associated with the term-document pair $[k_i, d_j]$, i.e., $m_{ij} = w_{i,j}$
* The weight $w_{i,j}$ can be based on a $tf$-$idf$ weighting scheme
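For instance, a simple tf-idf weighting could be computed as below (one common variant; the exact formula is a modeling choice, not fixed by LSI):

```python
import numpy as np

def tf_idf(freq):
    """freq: t x N matrix of raw term frequencies m_ij."""
    tf = freq / np.maximum(freq.max(axis=0), 1)      # tf normalized per document
    df = np.count_nonzero(freq, axis=1)              # document frequency per term
    idf = np.log(freq.shape[1] / np.maximum(df, 1))  # idf_i = log(N / df_i)
    return tf * idf[:, None]                         # w_ij = tf_ij * idf_i
```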
The matrix $\mathbf{M} = [m_{ij}]$ can be decomposed into three components using singular value decomposition
\begin{equation}
\mathbf{M} = \mathbf{K}\cdot \mathbf{S}\cdot \mathbf{D}^{\mathbf{T}}
\end{equation}
where
* $\mathbf{K}$ is the matrix of eigenvectors derived from $\mathbf{C} = \mathbf{M} \cdot \mathbf{M}^{\mathbf{T}}$ (the $t\times t$ term-term matrix)
* $\mathbf{D}^{\mathbf{T}}$ is the matrix of eigenvectors derived from $\mathbf{M}^{\mathbf{T}}\cdot \mathbf{M}$ (the $N\times N$ document-document matrix)
* $\mathbf{S}$ is an $r \times r$ diagonal matrix of singular values, where $r = \min(t,N)$ is the rank of $\mathbf{M}$
Let $\mathbf{M}^{\mathbf{T}}$ be given by
\begin{equation}
\mathbf{M}^{\mathbf{T}} =
\begin{bmatrix}
m_{11} & m_{21} & \cdots & m_{t1}\\
m_{12} & m_{22} & \cdots & m_{t2}\\
\vdots & \vdots & \ddots & \vdots\\
m_{1N} & m_{2N} & \cdots & m_{tN}
\end{bmatrix}
\end{equation}
i.e., the $N \times t$ matrix whose rows correspond to documents and whose columns correspond to index terms
### Reduce Model
In the matrix $\mathbf{S}$, consider that only the $s$ largest singular values are selected. Keep the corresponding columns of $\mathbf{K}$ and the corresponding rows of $\mathbf{D}^{\mathbf{T}}$
The resultant matrix is called $\mathbf{M_s}$ and is given by
\begin{equation}
\mathbf{M_s} = \mathbf{K_s}\cdot \mathbf{S_s}\cdot \mathbf{D_s}^{\mathbf{T}}
\end{equation}
where $s$, $s < r$, is the dimensionality of a reduced concept space
### Latent Ranking
The parameter $s$ should be
* large enough to allow fitting the characteristics of the data
* small enough to filter out the non-relevant representational details
The relationship between any two documents in the reduced space can be obtained from the matrix $\mathbf{M_s}^{\mathbf{T}}\cdot \mathbf{M_s}$, given by
\begin{align}
\mathbf{M_s}^{\mathbf{T}}\cdot \mathbf{M_s} &= (\mathbf{K_s}\cdot \mathbf{S_s}\cdot \mathbf{D_s}^{\mathbf{T}})^{\mathbf{T}}\cdot \mathbf{K_s}\cdot \mathbf{S_s}\cdot \mathbf{D_s}^{\mathbf{T}}\\
&= \mathbf{D_s}\cdot \mathbf{S_s}\cdot \mathbf{K_s}^{\mathbf{T}}\cdot \mathbf{K_s}\cdot \mathbf{S_s}\cdot \mathbf{D_s}^{\mathbf{T}}\\
&= \mathbf{D_s}\cdot \mathbf{S_s}\cdot \mathbf{S_s}\cdot \mathbf{D_s}^{\mathbf{T}}\\
&= (\mathbf{D_s}\cdot \mathbf{S_s})\cdot (\mathbf{D_s}\cdot \mathbf{S_s})^{\mathbf{T}}
\end{align}
where the simplification uses the orthonormality of the columns of $\mathbf{K_s}$ ($\mathbf{K_s}^{\mathbf{T}}\cdot \mathbf{K_s} = \mathbf{I}$) and the fact that $\mathbf{S_s}$ is diagonal, hence symmetric
In the above matrix, the $(i, j)$ element quantifies the relationship between documents $d_i$ and $d_j$
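A minimal `numpy` sketch of the decomposition, the rank-$s$ truncation, and the resulting document-document relationships (the weight matrix is made up for illustration):

```python
import numpy as np

# Hypothetical 5-term x 4-document weight matrix (made-up numbers).
M = np.array([
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
], dtype=float)

# Thin SVD: M = K @ diag(S) @ Dt, singular values in decreasing order.
K, S, Dt = np.linalg.svd(M, full_matrices=False)

# Keep only the s largest singular values: the reduced concept space.
s = 2
Ks, Ss, Dst = K[:, :s], np.diag(S[:s]), Dt[:s, :]
Ms = Ks @ Ss @ Dst                  # rank-s approximation of M

# Document-document relationships: Ms^T Ms = (Ds Ss)(Ds Ss)^T,
# because the columns of Ks are orthonormal.
DsSs = Dst.T @ Ss
assert np.allclose(Ms.T @ Ms, DsSs @ DsSs.T)
print(np.round(Ms.T @ Ms, 3))       # entry (i, j) relates d_i and d_j
```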

### Example
**Step 1**. A collection of documents has its content terms tabulated to give a frequency matrix, which is taken as $\mathbf{X}$ (in this example's notation, $\mathbf{X}$, $\mathbf{T}$, and $\mathbf{D}$ play the roles of $\mathbf{M}$, $\mathbf{K}$, and $\mathbf{D}$ above).

**Step 2**. A $k$-dimensional SVD of $\mathbf{X}$ is computed, yielding matrices $\mathbf{T}$, $\mathbf{S}$, and $\mathbf{D}$.

The rows of $\mathbf{T}$ and $\mathbf{D}$ are taken as index vectors for corresponding terms and documents, respectively.
**Step 3**. The diagonal elements of $\mathbf{S}$ (or $\mathbf{S}^{\frac{1}{2}}$, as needed) are taken as component-wise weights in ensuing similarity calculations.

**Step 4**. A query, treated as a vector of term frequencies (albeit very sparse), is converted to a pseudo-document, $\mathbf{D_q}$, in the factor space using the following equation.
\begin{equation}
\mathbf{D_q} = \mathbf{X_q}^{\mathbf{T}}\cdot \mathbf{T}\cdot \mathbf{S}^{-1}
\end{equation}

**Step 5**. This query factor-vector is then compared to the factor-vectors of all the documents, and the documents are ranked according to the resulting similarities.
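The five steps can be condensed into a short `numpy` sketch; the collection, the query, and the choice of cosine similarity in step 5 are illustrative assumptions, not prescribed by the procedure above:

```python
import numpy as np

# Step 1: hypothetical term-frequency matrix X (5 terms x 4 documents).
X = np.array([
    [2, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 2, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
], dtype=float)

# Step 2: k-dimensional SVD, X ~ T @ diag(S) @ D^T.
k = 2
T, S, Dt = np.linalg.svd(X, full_matrices=False)
T, S, D = T[:, :k], S[:k], Dt[:k, :].T      # rows of D index documents

# Step 3: weight the factor components by the singular values.
doc_vecs = D * S                            # row j represents document d_j

# Step 4: fold a query into the factor space: Dq = Xq^T . T . S^-1
Xq = np.array([1, 0, 0, 0, 1], dtype=float) # query uses terms k_1 and k_5
Dq = (Xq @ T) / S

# Step 5: rank documents against the pseudo-document (cosine similarity).
q = Dq * S                                  # same component weighting as docs
sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
print(np.argsort(-sims))                    # document indices, best first
```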

### Remark
* Latent semantic indexing provides an interesting conceptualization of the IR problem