# Neural Information Retrieval

###### tags: `Information Retrieval and Extraction`

## Notation

![](https://i.imgur.com/f4TgxmZ.jpg)
![](https://i.imgur.com/uAaMwtT.png)

## Neural Approach to IR

### Document Ranking

* Generate a representation of the query that specifies the information need
* Generate a representation of the document that captures the distribution of the information it contains
* Match the query and the document representations to estimate their mutual relevance

![](https://i.imgur.com/iPmU20B.png)

### Neural Approaches

![](https://i.imgur.com/22pxdPI.png)

### At the Point of Matching

Learning to rank using manually designed features

![](https://i.imgur.com/yORO7IL.png)

<br>

DNN models that estimate relevance based on patterns of exact query term matches in the documents

![](https://i.imgur.com/V6rmNt9.png)

### At the Point of Query/Doc Representation

Learn embeddings and use them within traditional IR models, or in conjunction with simple similarity metrics such as cosine similarity.

![](https://i.imgur.com/w4pJOYm.png)

### Query Expansion Using Neural Embeddings

![](https://i.imgur.com/lKgvoot.png)

# Unsupervised learning of term representations

## Term Representation

* Terms are the smallest unit of representation for indexing and retrieval.
* Many IR models, both non-neural and neural, focus on learning good vector representations of terms.
* Different vector representations exhibit different levels of generalization.
* Different representation schemes derive different notions of similarity between terms.
* Do they operate over fixed-size vocabularies?
* Properties of compositionality (terms $\Rightarrow$ passages $\Rightarrow$ documents)

### Local Representations

![](https://i.imgur.com/wSy8vtn.png)

### Distributed Representations

![](https://i.imgur.com/ebLelsP.png)

### Feature-based Distributed Representations

![](https://i.imgur.com/rd4czk0.png)

#### Example

![](https://i.imgur.com/fbT5Wtm.png)

### Remark

![](https://i.imgur.com/dkEQOpI.png)
![](https://i.imgur.com/tVCZMp4.png)

## Observed vs. Latent

![](https://i.imgur.com/n9PQJes.png)

### Compositionality

* Distributed representations of items are derived from the local or distributed representations of their parts.
* A document can be represented by the sum of the one-hot vectors or embeddings corresponding to the terms in the document.
    * Distributed bag-of-terms representation
* The character-trigraph representation of a term is an aggregation over the one-hot representations of its constituent trigraphs.

### Notions of Similarity

![](https://i.imgur.com/mm4JFec.png)

### Remark

![](https://i.imgur.com/eaDuGJc.png)

## Observed Feature Spaces

![](https://i.imgur.com/JFVxzGo.png)

### Example

![](https://i.imgur.com/K3Fzbd5.jpg)

## Latent Feature Space

![](https://i.imgur.com/Yr7Eu8u.png)

### Latent Semantic Analysis (LSA)

Perform singular value decomposition (SVD) on a term-document (or term-passage) matrix to obtain its low-rank approximation.

![](https://i.imgur.com/CjdZXsU.png)

### Probabilistic Latent Semantic Analysis (PLSA)

Learn low-dimensional representations of terms and documents by modelling their co-occurrence $p(t, d)$ as follows.

![](https://i.imgur.com/hdyVL3k.png)

After learning the parameters of the model, a term $t_i$ can be represented as a distribution over the latent topics.

![](https://i.imgur.com/I8R6Kcb.png)

### Latent Dirichlet Allocation (LDA)

Each document is represented by a Dirichlet prior instead of a fixed variable.

![](https://i.imgur.com/6aefZbw.png)
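Before moving on to neural embeddings, here is a minimal sketch of the LSA factorization described above: a truncated SVD of a toy term-document count matrix, with the low-rank term vectors compared by cosine similarity. The vocabulary, counts, and choice of $k = 2$ are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# Toy term-document count matrix X (|V| terms x |D| documents).
# The vocabulary and counts below are made up for illustration.
vocab = ["neural", "ranking", "query", "document", "banana"]
X = np.array([
    [2, 1, 0, 0],   # neural
    [1, 2, 1, 0],   # ranking
    [0, 1, 2, 1],   # query
    [0, 0, 1, 2],   # document
    [0, 0, 0, 1],   # banana
], dtype=float)

# LSA: SVD of the term-document matrix, keeping the top-k singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]       # latent representation of each term
doc_vectors = Vt[:k, :].T * s[:k]     # latent representation of each document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Terms that co-occur in similar documents end up close in the latent space.
print(cosine(term_vectors[0], term_vectors[1]))  # "neural" vs. "ranking"
print(cosine(term_vectors[0], term_vectors[4]))  # "neural" vs. "banana"
```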
## Neural Term Embedding

* Models are trained by setting up a prediction task.
* Instead of factorizing the term-feature matrix, neural models are trained to predict the term from its features.
* The model learns dense low-dimensional representations in the process of minimizing the prediction error.

## word2vec

Advantages:
1. Because word2vec takes context into account, it performs better than earlier embedding methods (though not as well as the methods that appeared after 2018).
2. It produces lower-dimensional vectors than earlier embedding methods, so it is faster.
3. It is highly general-purpose and can be used in a wide range of NLP tasks.

Disadvantages:
1. Because each word maps to exactly one vector, polysemy cannot be handled.
2. word2vec is a static representation; although it generalizes well, it cannot be dynamically optimized for a specific task.

### Skip-gram

Use the current word to predict its context: given a word, guess which words are likely to appear before and after it.

![](https://i.imgur.com/CcH8pN5.jpg)

### CBOW (Continuous Bag of Words)

![](https://i.imgur.com/sed729U.png)

## GloVe

* Trained by predicting how often words appear together (the co-occurrence counts). It is also a fairly early model and can be trained effectively even on smaller datasets.
* Replace the cross-entropy error with a squared error and apply a saturation function $f(\dots)$ over the actual co-occurrence frequencies.

![](https://i.imgur.com/q2xphvp.png)

## Paragraph2vec

* Train on term-document co-occurrences.
* Predict a term given the ID of a document or a passage that contains the term.
* In some variants, neighboring terms are also provided as input.
* Training on term-document pairs learns an embedding that is more aligned with a topical notion of term-term similarity, which is often more appropriate for IR tasks.

# Term Embeddings for IR

## Inexact Matching

* Term embeddings can be useful for inexact matching
    * Deriving latent vector representations of the query and the document text for matching
    * As a mechanism for selecting additional terms for query expansion
* Query-document matching
    * Compare the query with the document directly in the embedding space
* Query expansion
    * Use embeddings to generate suitable query expansion candidates from a global vocabulary, and then perform retrieval based on the expanded query

### Query-Document Matching

* Derive a dense vector representation for the query and the document from the embeddings of the individual terms in the corresponding texts.
* Aggregate the term embeddings
    * Average Word (or Term) Embeddings (AWE)
    * Non-linear combinations of term vectors
* Compare the query and the document embeddings
    * Cosine similarity
* Choice of term embeddings
    * LSA, word2vec, GloVe

#### Choice of Term Embeddings for Retrieval

* Models such as LSA and Paragraph2vec, which consider term-document pairs, generally capture **topical similarities** in the learnt vector space.
* word2vec and GloVe embeddings may incorporate a mixture of topical and typical notions of relatedness.
* The inter-term relationships modelled in the latent spaces of word2vec and GloVe may be closer to **type-based similarities** when trained with short window sizes or on short text, such as keyword queries.

### IN-OUT Similarity

When using word2vec embeddings to estimate the relevance of a document to a query, it is more appropriate to compute the IN-OUT similarity between the query and the document terms (see the sketch at the end of this section).

* The query terms should be represented using the IN embeddings and the document terms using the OUT embeddings.
* Note the difference between IN-IN and IN-OUT similarities between terms.
* Related models
    * Dual Embedding Space Model (DESM)
    * Neural Translation Language Model (NTLM)
    * Generalized Language Model (GLM)
    * Word Mover's Distance (WMD)
    * Non-linear Word Transportation (NWT)
    * Saliency-Weighted Semantic Network (SWSN)
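A minimal sketch of the query-document matching ideas above, in the spirit of DESM: each query term uses its IN embedding, the document is summarized by the centroid of its terms' OUT embeddings, and the score is the mean cosine similarity. The random embedding matrices, vocabulary, and dimensionality are illustrative assumptions standing in for trained word2vec weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"neural": 0, "ranking": 1, "query": 2, "document": 3, "retrieval": 4}
dim = 8

# Stand-ins for trained word2vec weights: IN (input) and OUT (output) matrices.
IN = rng.normal(size=(len(vocab), dim))
OUT = rng.normal(size=(len(vocab), dim))

def unit(v):
    return v / (np.linalg.norm(v) + 1e-9)

def desm_in_out_score(query_terms, doc_terms):
    """DESM-style score: mean cosine similarity between the IN vectors of the
    query terms and the centroid of the document terms' OUT vectors."""
    doc_centroid = unit(np.mean([unit(OUT[vocab[t]]) for t in doc_terms], axis=0))
    sims = [unit(IN[vocab[t]]) @ doc_centroid for t in query_terms]
    return float(np.mean(sims))

print(desm_in_out_score(["neural", "ranking"], ["document", "retrieval", "ranking"]))
```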
## Supervised Learning to Rank

Use training data $rel_q(d)$, such as human relevance labels and click data, to train towards an IR objective.

There are three approaches:
* Pointwise approach
* Pairwise approach
* Listwise approach

### Pointwise Approach (Regression Model)

1. The relevance information $rel_q(d)$ is a numerical value associated with every query-document pair, represented by its input feature vector.
2. The numerical relevance label can be derived from binary or graded relevance judgements, or from implicit user feedback such as a **clickthrough rate**.

### Listwise Approach

#### Input Features

* Traditional LTR models
    * Query-independent or static features: incoming link count and document length
    * Query-dependent or dynamic features: BM25
    * Query-level features: query length
* Neural LTR models

#### Loss Function

* Hierarchical softmax
    * Partition the large set of documents into groups before computing the loss, which effectively reduces the computational cost.

### Pairwise Approach

## Deep Neural Network

### Siamese Networks

![](https://i.imgur.com/reDu0nW.png)

#### Loss Function

Each training sample consists of a triple $(\vec{v_q}, \vec{v_{d1}}, \vec{v_{d2}})$, such that $sim(\vec{v_q}, \vec{v_{d1}})$ should be greater than $sim(\vec{v_q}, \vec{v_{d2}})$ (see the sketch at the end of these notes).

![](https://i.imgur.com/jruFz7S.png)

### Remark

![](https://i.imgur.com/lwf664a.jpg)

## Deep Neural Networks for IR

### Siamese Networks for Short-Text Matching

#### Deep Semantic Similarity Model (DSSM)

Consists of two deep models, one for the query and one for the document, built from fully-connected layers, with cosine distance as the similarity function in the middle. (Note: convolutional layers, recurrent layers, and tree-structured networks have also been explored. Trained by minimizing the cross-entropy loss.)

![](https://i.imgur.com/Qza6MF1.png)

### Notions of Similarity for Typical and Topical

### Interaction-based Networks

![](https://i.imgur.com/wEkdeQa.jpg)

A sliding window is moved over both the query and the document text.

![](https://i.imgur.com/xpqh5Ph.png)

### Lexical and Semantic Matching

Representation learning models tend to perform poorly when dealing with rare terms and search intents.

* **Lexical matching**: it is easier to estimate relevance based on patterns of exact matches of the rare term.
* **Semantic matching**: a neural model focusing on matching in the latent space is unlikely to have a good representation for this rare term.

![](https://i.imgur.com/sl5BK9V.png)
![](https://i.imgur.com/Q3RJGTM.png)

### Matching with Multiple Document Fields

Commercial web search engines consider more than just the document content for matching:

* Anchor texts corresponding to incoming hyperlinks
* The query text for which the document may have been previously viewed

Learning separate latent spaces for matching the query against the different document fields is more effective than using a shared latent space for all the fields.

![](https://i.imgur.com/tUUunPS.jpg)
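To make the Siamese pairwise loss and the multi-field matching idea concrete, here is a minimal PyTorch sketch (not from the original notes): a shared query encoder, a separate encoder (latent space) per document field, a learned combination of the per-field cosine similarities, and a pairwise margin loss that pushes $sim(\vec{v_q}, \vec{v_{d1}})$ above $sim(\vec{v_q}, \vec{v_{d2}})$. The field names, layer sizes, margin, and random inputs are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFieldSiameseRanker(nn.Module):
    """Toy two-tower ranker: one query encoder, one encoder per document
    field, and a learned mix of the per-field cosine similarities."""

    def __init__(self, input_dim=300, hidden_dim=128, fields=("body", "anchor")):
        super().__init__()
        self.query_encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Tanh())
        self.field_encoders = nn.ModuleDict({
            f: nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Tanh())
            for f in fields})
        self.combine = nn.Linear(len(fields), 1)  # mix per-field similarities

    def score(self, query_vec, doc_fields):
        q = self.query_encoder(query_vec)
        sims = [F.cosine_similarity(q, enc(doc_fields[f]), dim=-1)
                for f, enc in self.field_encoders.items()]
        return self.combine(torch.stack(sims, dim=-1)).squeeze(-1)

    def pairwise_loss(self, query_vec, pos_doc, neg_doc):
        # The relevant document should score higher than the non-relevant one.
        s_pos = self.score(query_vec, pos_doc)
        s_neg = self.score(query_vec, neg_doc)
        return F.relu(1.0 - (s_pos - s_neg)).mean()  # margin ranking loss

# Usage with random stand-ins for query/field feature vectors.
model = MultiFieldSiameseRanker()
q = torch.randn(4, 300)
d_pos = {"body": torch.randn(4, 300), "anchor": torch.randn(4, 300)}
d_neg = {"body": torch.randn(4, 300), "anchor": torch.randn(4, 300)}
loss = model.pairwise_loss(q, d_pos, d_neg)
loss.backward()
print(float(loss))
```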