# Neural Information Retrieval
###### tags: `Information Retrieval and Extraction`
## Notation


## Neural Approach to IR
### Document Ranking
* Generate a representation of the query that specifies the information need
* Generate a representation of the document that captures the distribution over information contained
* Match the query and the document representations to estimate their mutual relevance

### Neural Approaches

### At the Point of Matching
Learning to rank using manually designed features

DNN models to estimate relevance based on patterns of exact query term matches in the documents

### At the Point of Query/Doc Representation
Learn embeddings and use them within traditional IR models or in conjunction with simple similarity metrics such as cosine similarity

### Query Expansion Using Neural Embedding

# Unsupervised learning of term representations
## Term Representation
* Terms are the smallest unit of representation for indexing and retrieval.
* Many IR models, both non-neural and neural, focus on learning good vector representations of terms.
* Different vector representations exhibit different levels of generalization.
* Different representation schemes derive different notions of similarity between terms.
* Operate over fixed-size vocabularies?
* Properties of compositionality (terms $\Rightarrow$ passages $\Rightarrow$ documents)
### Local Representations

### Distributed Representation

### Feature-based Distributed Representations

#### Example

### Remark


## Observed vs. Latent

### Compositionality
* Distributed representations of items are derived from local or distributed representation of its parts.
* A document can be represented by the sum of the one-hot vectors or embeddings corresponding to the terms in the document.
* Distributed bag-of-terms representation
* The character trigraph representation of terms is an aggregation over the one-hot representations of the constituent trigraphs.
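The bullet points above can be made concrete with a toy vocabulary: a document vector is just the sum of the one-hot vectors of its terms.

```python
import numpy as np

# Toy vocabulary (an illustrative assumption, not from the source).
vocab = {"neural": 0, "information": 1, "retrieval": 2, "model": 3}

def one_hot(term):
    """Local (one-hot) representation of a single term."""
    v = np.zeros(len(vocab))
    v[vocab[term]] = 1.0
    return v

# A document as the sum of the one-hot vectors of its terms:
# a distributed bag-of-terms representation.
doc = "neural retrieval neural model".split()
bag = sum(one_hot(t) for t in doc)
# bag counts each term's occurrences: [2, 0, 1, 1]
```

The same aggregation idea applies one level down, where a term is the sum of the one-hot vectors of its character trigraphs.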
### Notions of Similarity

### Remark

## Observed Feature Spaces

### Example

## Latent Feature Space

### Latent Semantic Analysis (LSA)
Perform singular value decomposition (SVD) on a term-document (or term-passage) matrix to obtain its low-rank approximation.
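A minimal numpy sketch of LSA on a toy term-document matrix (the matrix values are illustrative assumptions):

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
X = np.array([
    [2., 0., 1.],
    [1., 1., 0.],
    [0., 2., 1.],
    [0., 1., 2.],
])

# SVD, then truncate to rank k: this gives the LSA latent space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]    # k-dimensional term representations
doc_vecs = Vt[:k, :].T * s[:k]  # k-dimensional document representations

# The rank-k product approximates the original matrix.
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```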

### Probabilistic Latent Semantic Analysis (PLSA)
Learn low-dimensional representations of terms and documents by modelling their co-occurrence $p(t,d)$ as follows:
$$p(t,d) = p(d)\sum_{z} p(t|z)\,p(z|d)$$

After learning the parameters of the model, a term $t_i$ can be represented as a distribution over the latent topics.

### Latent Dirichlet Allocation (LDA)
Each document's distribution over topics is drawn from a Dirichlet prior instead of being a fixed parameter.

## Neural Term Embedding
* Models are trained by setting up a prediction task.
* Instead of factorizing the term-feature matrix, neural models are trained to predict the term from its features.
* The model learns dense low-dimensional representations in the process of minimizing the prediction error.
## word2vec
Advantages:
1. Because word2vec takes context into account, it performs better than earlier embedding methods (though not as well as post-2018 methods).
2. Its vectors have fewer dimensions than earlier embedding methods, so it is faster.
3. It is highly general-purpose and can be applied to a wide range of NLP tasks.

Disadvantages:
1. Because words and vectors are in a one-to-one relationship, polysemy cannot be handled.
2. word2vec is static: although general-purpose, it cannot be dynamically optimized for a specific task.
### Skip-gram
Use the current word to predict its context: given a word, guess which words are likely to appear before and after it.
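A sketch of how skip-gram training pairs are generated from a sentence with a small context window (the sentence is a toy example):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: the model is trained to
    predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sat on the mat".split(), window=1)
# e.g. ("cat", "the"), ("cat", "sat"), ...
```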

### CBOW (Continuous Bag of Words)
Use the surrounding context words to predict the current word; the reverse of skip-gram.

## GloVe
* Trained to predict how often two words appear together (the co-occurrence count). Also a fairly early model; it trains effectively even on smaller datasets.
* Replace the cross-entropy error with a squared-error and apply a saturation function f(...) over the actual co-occurrence frequencies.
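The saturation function $f(\cdot)$ is typically taken as $f(x) = (x/x_{max})^\alpha$ for $x < x_{max}$ and $1$ otherwise; a sketch using the commonly cited constants $x_{max}=100$, $\alpha=0.75$:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """Saturation function applied to co-occurrence counts: rare pairs
    are down-weighted, frequent pairs are capped at weight 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

# The squared-error loss for one term pair (i, j) with count x_ij:
#   f(x_ij) * (w_i . w_j + b_i + b_j - log(x_ij)) ** 2
```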

## Paragraph2vec
* Train on term-document co-occurrences.
* Predict a term given the ID of a document or a passage that contains the term.
* In some variants, neighboring terms are also provided as input.
* Training on term-document pairs learns an embedding that is more aligned with a topical notion of term-term similarity, which is often more appropriate for IR tasks.
# Term Embeddings for IR
## Inexact Matching
* Term embeddings can be useful for inexact matching
* Deriving latent vector representations of the query and the document text for matching
* As a mechanism for selecting additional terms for query expansion
* Query-document matching
* Compare the query with the document directly in the embedding space
* Query expansion
* Use embeddings to generate suitable query expansion candidates from a global vocabulary and then perform retrieval based on the expanded query
### Query-Document Matching
* Deriving a dense vector representation for the query and the document from the embeddings of the individual terms in the corresponding texts.
* Aggregate the term embeddings
* Average Word (or Term) Embeddings (AWE)
* Non-linear combinations of term vectors
* Compare the query and the document embeddings
* Cosine similarity
* Choice of term embeddings
* LSA, word2vec, GloVe
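A minimal sketch of the AWE-plus-cosine pipeline above, assuming pre-trained term embeddings are available (the toy vectors below are hypothetical stand-ins for word2vec/GloVe):

```python
import numpy as np

def awe(terms, embeddings):
    """Average Word Embeddings: mean of the term vectors in a text."""
    return np.mean([embeddings[t] for t in terms if t in embeddings], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical pre-trained embeddings.
emb = {"albuquerque": np.array([1.0, 0.2]),
       "seahawks":    np.array([0.1, 1.0]),
       "city":        np.array([0.9, 0.3])}

q = awe(["albuquerque"], emb)               # dense query representation
d = awe(["albuquerque", "city"], emb)       # dense document representation
score = cosine(q, d)                        # relevance estimate
```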
#### Choice of Term Embeddings for Retrieval
* Models, such as LSA and Paragraph2vec, that consider term-document pairs generally capture **topical similarities** in the learnt vector space.
* Word2vec and GloVe embeddings may incorporate a mixture of topical and typical notions of relatedness.
* The inter-term relationships modelled in the latent spaces of word2vec and GloVe may be closer to **type-based similarities** when trained with short window sizes or on short text, such as on keyword queries.
### IN-OUT Similarity
When using word2vec embeddings for estimating the relevance of a document to a query, it is more appropriate to compute the IN-OUT similarity between the query and the document terms.
* The query terms should be represented using the IN embeddings and the document terms using the OUT embeddings.
* IN-IN and IN-OUT similarities between terms capture different notions of relatedness
* Related model
* Dual Embedding Space Model (DESM)
* Neural Translation Language Model (NTLM)
* Generalized Language Model (GLM)
* Word Mover’s Distance (WMD)
* Non-linear Word Transportation (NWT)
* Saliency-Weighted Semantic Network (SWSN)
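The IN-OUT idea above can be sketched in the style of DESM: represent the document by the centroid of its terms' OUT vectors and average its cosine similarity with each query term's IN vector (the IN/OUT embedding tables below are toy assumptions):

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def desm_score(query_terms, doc_terms, emb_in, emb_out):
    """DESM-style score: mean cosine between each query term's IN
    vector and the centroid of the document terms' OUT vectors."""
    d_centroid = unit(np.mean([unit(emb_out[t]) for t in doc_terms], axis=0))
    sims = [float(unit(emb_in[t]) @ d_centroid) for t in query_terms]
    return sum(sims) / len(sims)

# Hypothetical toy IN/OUT embeddings.
emb_in  = {"eminem": np.array([1.0, 0.1])}
emb_out = {"rap": np.array([0.9, 0.2]), "album": np.array([0.8, 0.4])}

score = desm_score(["eminem"], ["rap", "album"], emb_in, emb_out)
```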
## Supervised Learning to Rank
Use training data $rel_q(d)$, such as human relevance labels and click data, to train towards an IR objective.
There are three approaches:
* Pointwise Approach
* Pairwise Approach
* Listwise Approach
### Pointwise Approach (Regression model)
1. The relevance information $rel_q(d)$ is a numerical value associated with every query-document pair, each represented by an input feature vector.
2. The numerical relevance label can be derived from binary or graded relevance judgements or from implicit user feedback, such as a **clickthrough rate**.
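A minimal pointwise sketch: fit a linear regression from hand-crafted query-document features to numeric relevance labels via least squares (the feature values and labels below are toy assumptions):

```python
import numpy as np

# Each row is a feature vector for one query-document pair
# (e.g. BM25 score, document length, query length).
X = np.array([[2.1, 0.5, 3.0],
              [0.3, 1.2, 3.0],
              [1.8, 0.7, 2.0],
              [0.1, 0.9, 2.0]])
y = np.array([2.0, 0.0, 1.0, 0.0])  # graded relevance labels rel_q(d)

# Least-squares fit: score(x) = w . x
w, *_ = np.linalg.lstsq(X, y, rcond=None)
scores = X @ w  # predicted relevance, used to rank documents
```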
### Listwise Approach
#### Input Features
* Traditional LTR models
* Query-independent or static features : incoming link count and document length
* Query-dependent or dynamic features : BM25
* Query-level features : query length
* Neural LTR model
#### Loss Function
* Hierarchical Softmax
* Group the large collection of documents first, then compute the loss over the groups; this substantially reduces the computation cost.
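A two-level sketch of the idea (a simplification of the full tree-based hierarchical softmax; the grouping and logits below are toy assumptions): the probability of a document factorizes as p(group) times p(doc | group), so each evaluation needs only two small softmaxes instead of one over the whole collection.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def two_level_prob(group_logits, within_logits, g, d):
    """p(doc) = p(group g) * p(doc d | group g)."""
    return float(softmax(group_logits)[g] * softmax(within_logits[g])[d])

group_logits = np.array([1.0, 0.5])           # scores for 2 groups
within_logits = [np.array([0.2, 0.8]),        # docs in group 0
                 np.array([0.1, 0.4, 0.3])]   # docs in group 1
p = two_level_prob(group_logits, within_logits, g=0, d=1)

# The probabilities over all documents still sum to 1.
total = sum(two_level_prob(group_logits, within_logits, g, d)
            for g in range(2) for d in range(len(within_logits[g])))
```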
### Pairwise Approach
## Deep Neural Network
### Siamese Networks

#### Loss Function
If each training sample consists of a triple $(\vec{v_q}, \vec{v_{d1}}, \vec{v_{d2}})$ such that $sim(\vec{v_q}, \vec{v_{d1}})$ should be greater than $sim(\vec{v_q}, \vec{v_{d2}})$, the loss penalizes violations of this ordering.
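One common instantiation is the pairwise hinge (margin) loss, which is zero only when the relevant document outscores the non-relevant one by at least a margin (a sketch; the margin value and toy vectors are assumptions):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pairwise_hinge_loss(v_q, v_d1, v_d2, margin=0.1):
    """Zero loss when sim(q, d1) exceeds sim(q, d2) by at least the margin."""
    return max(0.0, margin - (cosine(v_q, v_d1) - cosine(v_q, v_d2)))

v_q  = np.array([1.0, 0.0])
v_d1 = np.array([0.9, 0.1])   # more relevant document
v_d2 = np.array([0.1, 0.9])   # less relevant document
loss = pairwise_hinge_loss(v_q, v_d1, v_d2)
```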

### Remark

## Deep Neural Networks for IR
### Siamese Networks for Short-Text Matching
#### Deep Semantic Similarity Model (DSSM)
Consists of two deep models, one for the query and one for the document, built from fully-connected layers, with cosine similarity as the matching function in the middle.
( Note: convolutional layers, recurrent layers, and tree-structured networks have also been explored. Trained by minimizing the cross-entropy loss. )
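A minimal numpy sketch of the DSSM-style twin-tower structure (random weights stand in for trained parameters; the real DSSM also uses character-trigraph hashing at the input layer):

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(x, weights):
    """Fully-connected tower with tanh nonlinearities."""
    for W in weights:
        x = np.tanh(W @ x)
    return x

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

dim_in, hidden, out = 16, 8, 4
q_weights = [rng.normal(size=(hidden, dim_in)), rng.normal(size=(out, hidden))]
d_weights = [rng.normal(size=(hidden, dim_in)), rng.normal(size=(out, hidden))]

q_vec = tower(rng.normal(size=dim_in), q_weights)   # query-side tower
d_vec = tower(rng.normal(size=dim_in), d_weights)   # document-side tower
score = cosine(q_vec, d_vec)                        # relevance estimate
```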

### Notions of Similarity for Typical and Topical
### Interaction-based Networks

A sliding window is moved over both the query and the document text.
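Interaction-based networks start from a term-level interaction matrix; a sketch building one from toy term embeddings (the embeddings here are hypothetical):

```python
import numpy as np

def interaction_matrix(query_terms, doc_terms, emb):
    """Cosine similarity between every query term and every document term;
    deeper layers of an interaction-based network consume this matrix."""
    def unit(t):
        v = emb[t]
        return v / np.linalg.norm(v)
    return np.array([[float(unit(q) @ unit(d)) for d in doc_terms]
                     for q in query_terms])

emb = {"deep": np.array([1.0, 0.0]),
       "learning": np.array([0.7, 0.7]),
       "neural": np.array([0.9, 0.1])}

M = interaction_matrix(["deep", "learning"], ["neural", "deep", "learning"], emb)
```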

### Lexical and Semantic Matching
The representation learning models tend to perform poorly when dealing with rare terms and search intents.
* **Lexical matching** : it is easier to estimate relevance based on patterns of exact matches of the rare term
* **Semantic matching** : a neural model focusing on matching in the latent space is unlikely to have good representation for this rare term


### Matching with Multiple Document Fields
Consider more than just the document content for matching in commercial web search engines.
* Anchor texts corresponding to incoming hyperlinks
* The query text for which the document may have been previously viewed
Learning separate latent spaces for matching the query against the different document fields is more effective than using a shared latent space for all the fields.
