<style>
img {
  display: block;
  margin-left: auto;
  margin-right: auto;
}
</style>

> [Paper link](https://arxiv.org/abs/2204.08241) | [Note link](https://zhuanlan.zhihu.com/p/506122416) | EMNLP 2022

:::success
**Thoughts**

This paper has three modules:
* Cross-encoder
* Dual-encoder
* Query-passage Graph
  * GAT
  * Aggregate
  * Filter Gate

They train the dual-encoder and the GNN jointly with the **Masked Graph Training (MGT) Algorithm**, which scores query-passage similarity without leaking information from a training query into its own positive passage.

The ablation study shows that dense passage retrieval can be improved with the GNN. However, the method still needs to cache both passage embeddings and cross-encoder edge embeddings, and future work should try to fuse more interaction information into the GNN structure.
:::

## Abstract

A common practice of dense retrieval models is to exploit a dual-encoder architecture to represent a query and a passage independently. Though efficient, such a structure loses the interaction between the query-passage pair, resulting in inferior accuracy. They propose a GNN-encoder model in which query (passage) information is fused into passage (query) representations via graph neural networks that are constructed from queries and their top retrieved passages. Evaluation results indicate that their method significantly outperforms existing models on the MSMARCO, Natural Questions, and TriviaQA datasets, and achieves a new state-of-the-art on these datasets.

## Introduction

Traditionally, a two-stage retrieval pipeline is adopted:

- Retrieve a subset of candidate passages from the entire corpus with a recall model
- Rerank the retrieved passages

The underlying idea is to represent both queries and passages as embeddings, so that semantic relevance can be measured via embedding similarity.

---

Two paradigms based on fine-tuned language models are typically built for retrieval:

- cross-encoders: need to recompute the representation of each passage in the corpus whenever a new query arrives
- dual-encoders: represent a query and a passage independently through two separate encoders

But independent encoding without any interaction causes a severe drop in retrieval performance due to information loss.

Existing remedies include:

- attention layers
- the sum of maximum similarity computations
- transformer layers

In this paper, they propose a novel approach that explicitly fuses query (passage) information into passage (query) embeddings through a graph neural network (GNN), and name the model **GNN-encoder**. Their model is built upon the dual-encoder and learns query-interactive passage representations and passage-interactive query representations through a graph neural network.

To avoid information leakage, they further design a new training algorithm named **Masked Graph Training (MGT)**, in which the queries used to train the GNN in a given epoch are not used to construct the query-passage graph in that epoch.

## Methodology

### Preliminary

Given a query $q$, a dense retriever is required to retrieve the $k$ most relevant passages $\{ p_i \}_{i=1}^k$ from a large corpus. They use a dual-encoder in this paper, with a query encoder $E_Q(\cdot)$ and a passage encoder $E_P(\cdot)$. The similarity is computed as:

$$
\tag{1} s(q, p) = E_Q(q)^\top \cdot E_P(p)
$$

The training objective of the dual-encoder is to learn embeddings of queries and passages such that positive query-passage pairs have higher similarity than negative query-passage pairs in the training data:

$$
\tag{2} L(q, p^+, \{ p^- \}, s) = - \log \frac{e^{s(q, p^+)}}{e^{s(q, p^+)} + \sum_{p^-}e^{s(q, p^-)}}
$$
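To make Eqs. (1)-(2) concrete, here is a minimal PyTorch sketch (an illustration, not the authors' code): the score is a dot product of the two encoders' [CLS] embeddings, and the loss is a softmax cross-entropy over one positive passage and a set of negatives.

```python
import torch
import torch.nn.functional as F

def similarity(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """Eq. (1): dot-product similarity between query and passage embeddings.

    q_emb: [B, d] query embeddings E_Q(q)
    p_emb: [N, d] passage embeddings E_P(p)
    Returns a [B, N] score matrix.
    """
    return q_emb @ p_emb.t()

def contrastive_loss(q_emb, pos_emb, neg_emb):
    """Eq. (2): negative log-likelihood of the positive passage.

    q_emb:   [B, d]    queries
    pos_emb: [B, d]    one positive passage per query
    neg_emb: [B, M, d] M negative passages per query
    """
    pos_score = (q_emb * pos_emb).sum(-1, keepdim=True)        # [B, 1]
    neg_score = torch.einsum("bd,bmd->bm", q_emb, neg_emb)     # [B, M]
    logits = torch.cat([pos_score, neg_score], dim=-1)         # [B, 1+M]
    labels = torch.zeros(q_emb.size(0), dtype=torch.long,
                         device=q_emb.device)                  # positive sits at index 0
    return F.cross_entropy(logits, labels)
```

In DPR-style training the negatives typically include both mined hard negatives and the other in-batch passages; that detail is omitted here for brevity.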
### Graph Construction

They use all the passages $P = \{ p_i \}_{i=1}^m$ and training queries $Q = \{ q_i \}_{i=1}^n$ as nodes to construct their **query-passage graph**. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote the graph, where $\mathcal{V} = \mathcal{V}_q \cup \mathcal{V}_p$ is the node set and $\mathcal{E} = \mathcal{E}_{pp} \cup \mathcal{E}_{pq} \cup \mathcal{E}_{qq}$ is the edge set. The edge between node $x$ and node $y$ is denoted as $e(x, y)$.

They retrieve the top-$k$ candidate passages $P_i = \{ p_{ij} \}_{j=1}^k$ for each query $q_i$ from the corpus with the dual-encoder. The graph $\mathcal{G}$ has a total of $(n + m)$ nodes and $(n \times k + m + n)$ edges.

**Node Features**

- Query embeddings: $h_{q_i} = E_Q(q_i)$
- Passage embeddings: $h_{p_i} = E_P(p_i)$

**Edge Features**

They utilize the cross-encoder embedding of the text pair $(x, y)$ as the feature $h_{x-y}$ of edge $e(x, y)$.

---

In their experiments, both the dual-encoder and the cross-encoder use [CLS] representations as embeddings.

### GNN on Query-passage Graph

![image](https://hackmd.io/_uploads/HyBnJTgi6.png)

They utilize a graph attention network (GAT) to learn how to exchange information between relevant nodes.

**GAT Layer**

They apply multi-head attention in the GAT layer. Let $\mathcal{N}_i$ denote the neighbors of node $i$ in the graph. The GAT layer computes the importance of node $j \in \mathcal{N}_i$ to node $i$ as:

$$
\tag{3} e_{ij} = a^\top [W_t h_i \| W_s h_j \| W_e h_{i-j}]
$$

where $h_i \in \mathcal{R}^{d_n}$, $h_j \in \mathcal{R}^{d_n}$, $h_{i-j} \in \mathcal{R}^{d_e}$ are the features of node $i$, node $j$ and edge $e(i, j)$, $W_t \in \mathcal{R}^{d_n \times d}$, $W_s \in \mathcal{R}^{d_n \times d}$, $W_e \in \mathcal{R}^{d_e \times d}$, $a \in \mathcal{R}^{3d}$ are learnable model parameters, and $\|$ is the concatenation operation. They set $d$, $d_n$, $d_e$ to the BERT-base hidden size, 768.

The attention weights are computed as:

$$
\tag{4} \alpha_{ij} = \frac{\exp(\text{LeakyReLU}(e_{ij}))}{\sum_{k \in \mathcal{N}_i} \exp (\text{LeakyReLU}(e_{ik}))}
$$

The final output:

$$
\tag{5} \widetilde{h_i} = \sigma (\sum_{j \in \mathcal{N}_i} \alpha_{ij} W_s h_j)
$$

where $\sigma$ is an activation function; the above is abbreviated as $\widetilde{h_i} = \text{GAT}(\{ h_j \}_{j \in \mathcal{N}_i})$.

They use two GAT layers to implement the interaction between nodes, because two hops are exactly enough to relate queries to each other and passages to each other through their shared edges.

**Aggregate**

The first GAT layer aggregates context from each query's passage neighbors:

$$
\tag{6} \widetilde{h_{q_i}} = \text{GAT}(\{ h_{p_{ij}} \}_{p_{ij} \in P_i} \cup \{ h_{q_i} \})
$$

where $P_i$ denotes the set of passages retrieved by $q_i$. The passage-interactive query embeddings are:

$$
\tag{7} h^\prime_{q_i} = W_{pq}[\widetilde{h_{q_i}} \| h_{q_i}] + b_{pq}
$$

where $W_{pq} \in \mathcal{R}^{d \times 2d}$ and $b_{pq} \in \mathcal{R}^d$.

The second GAT layer aggregates context from each passage's query neighbors:

$$
\tag{8} \widetilde{h_{p_i}} = \text{GAT}(\{ h^\prime_{q_{ij}} \}_{q_{ij} \in Q_i} \cup \{ h_{p_i} \})
$$

where $Q_i$ denotes the set of queries that retrieve $p_i$.

**Filter Gate**

To incorporate the aggregated query contexts into the passage embeddings, they utilize a filter gate to clean the aggregated contexts:

$$
\tag{9} f_{p_i} = \sigma (W_{qp} [\widetilde{h_{p_i}} \| h_{p_i}] + b_{qp})
$$

where $W_{qp} \in \mathcal{R}^{d \times 2d}$ and $b_{qp} \in \mathcal{R}^d$.

The final **query-interactive passage embeddings**:

$$
\tag{10} h^\prime_{p_i} = f_{p_i} \cdot \widetilde{h_{p_i}} + h_{p_i}
$$
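Putting Eqs. (3)-(10) together, the interaction module is a small two-step forward pass. Below is a minimal single-head PyTorch sketch (an illustration under simplified assumptions, not the authors' implementation); in particular, ReLU for $\sigma$ in Eq. (5) and a sigmoid for the gate in Eq. (9) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgeGATLayer(nn.Module):
    """Single-head GAT layer with edge features (Eqs. 3-5). Simplified sketch."""

    def __init__(self, d_node: int, d_edge: int, d: int):
        super().__init__()
        self.W_t = nn.Linear(d_node, d, bias=False)  # projects the target node i
        self.W_s = nn.Linear(d_node, d, bias=False)  # projects source nodes j
        self.W_e = nn.Linear(d_edge, d, bias=False)  # projects edge features h_{i-j}
        self.a = nn.Linear(3 * d, 1, bias=False)     # attention vector a

    def forward(self, h_i, h_nbrs, h_edges):
        # h_i: [d_node], h_nbrs: [N, d_node], h_edges: [N, d_edge]
        t = self.W_t(h_i).expand(h_nbrs.size(0), -1)               # [N, d]
        s = self.W_s(h_nbrs)                                       # [N, d]
        e = self.W_e(h_edges)                                      # [N, d]
        scores = self.a(torch.cat([t, s, e], dim=-1)).squeeze(-1)  # Eq. (3)
        alpha = F.softmax(F.leaky_relu(scores), dim=0)             # Eq. (4)
        return torch.relu((alpha.unsqueeze(-1) * s).sum(dim=0))    # Eq. (5); sigma assumed ReLU


class QueryPassageFusion(nn.Module):
    """Two-hop interaction (Eqs. 6-10): passage neighbors -> query, then query neighbors -> passage."""

    def __init__(self, d: int = 768):
        super().__init__()
        # Node, edge, and hidden dimensions are all 768, as in the paper.
        self.gat_q = EdgeGATLayer(d, d, d)  # first layer (query side)
        self.gat_p = EdgeGATLayer(d, d, d)  # second layer (passage side)
        self.W_pq = nn.Linear(2 * d, d)     # Eq. (7)
        self.W_qp = nn.Linear(2 * d, d)     # Eq. (9)

    def query_side(self, h_q, h_passages, h_edges):
        # Eq. (6): the caller may append h_q itself (self-loop) to h_passages.
        tilde_h_q = self.gat_q(h_q, h_passages, h_edges)
        # Eq. (7): passage-interactive query embedding.
        return self.W_pq(torch.cat([tilde_h_q, h_q], dim=-1))

    def passage_side(self, h_p, h_queries_prime, h_edges):
        # Eq. (8): aggregate the passage-interactive embeddings of queries retrieving p.
        tilde_h_p = self.gat_p(h_p, h_queries_prime, h_edges)
        # Eqs. (9)-(10): filter gate (sigmoid assumed), then residual fusion.
        f_p = torch.sigmoid(self.W_qp(torch.cat([tilde_h_p, h_p], dim=-1)))
        return f_p * tilde_h_p + h_p        # query-interactive passage embedding
```

Multi-head attention, the self-loop edges, and sparse batched message passing over the full $(n + m)$-node graph are omitted for clarity.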
### Training Procedure

They first retrieve the top-$k$ candidates of each query from the corpus with DPR and score them with a well-trained cross-encoder $M_{ce}$ to obtain denoised positives and hard negatives, which are used to train their initial dual-encoder $M_{de}$. They then adopt Eq. $(2)$ as the loss function to train the dual-encoder and the GNN jointly.

At each training step, they randomly sample a subset of training queries $Q_b = \{ q_i \}_{i=1}^b$ together with their denoised positives $P_b^+$ and hard negatives $P_b^-$. The similarity and loss are:

$$
\tag{11} s_{\mathcal{G}}(q, p) = E_Q (q)^\top \cdot h_p^\prime
$$

$$
\tag{12} L_{\mathcal{G}} = \sum_{q_i \in Q_b} L (q_i, p_{q_i}^+, P_b - \{ p_{q_i}^+ \}, s_{\mathcal{G}})
$$

where $P_b = P_b^+ \cup P_b^-$ is the set of sampled passages.

---

To prevent $p_{q_i}^+$ from attending too much to $q_i$ when constructing the graph, they propose the **Masked Graph Training (MGT) Algorithm**, which is also applicable to other tasks suffering from the same dilemma. In every epoch, they split the training queries into two parts, $Q_g$ and $Q_t$, which are used for constructing the graph and for training (i.e., masking some query nodes during training), respectively. The proportion of masked queries $\beta$ cannot be too high, otherwise a gap arises between the training graph and the inference graph.

![image](https://hackmd.io/_uploads/rkQ5ulBi6.png)

## Experiments

### Dataset

- MSMARCO
- Natural Questions (NQ)
- TriviaQA (TQA)

### Comparison Methods

- Sparse Retrieval Models
  - Traditional
  - Neural Networks
- Dense Retrieval Models
  - Dense passage retrieval models
  - Pre-training

### Experimental Results

![image](https://hackmd.io/_uploads/SJzajlHjT.png)

![image](https://hackmd.io/_uploads/By_x2lHjT.png)

### Ablation Study

The ablation study explores how much of the performance improvement is introduced by the GNN.

![image](https://hackmd.io/_uploads/SySXhlBop.png)

![image](https://hackmd.io/_uploads/ByjdneBop.png)

### Detailed Analysis

**The number of edges in the graph**

![image](https://hackmd.io/_uploads/SyxJagHjp.png)

**Masked ratio**

![image](https://hackmd.io/_uploads/BJYlpeBop.png)

**Efficiency**

![image](https://hackmd.io/_uploads/S1qGpxBjp.png)

## Conclusion

In this paper, they introduced the GNN-encoder for passage retrieval tasks. In the future, they will explore how to fuse more interaction information into the GNN structure. A limitation of their method is that it needs to cache the passage embeddings of $M_{de}$ ($m$ passage nodes) and the cross-encoder embeddings of all $(n \times k + m + n)$ edges.
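To get a feel for that cache cost, here is a rough back-of-envelope sketch in Python. The corpus size $m$, number of training queries $n$, retrieval depth $k$, and fp16 storage are hypothetical MSMARCO-scale values chosen only for illustration; the paper does not report these cache sizes.

```python
# Back-of-envelope estimate of the two caches named in the limitation above.
# All concrete numbers below are illustrative assumptions, not values from the paper.
d = 768                # embedding dimension (BERT-base, as in the paper)
bytes_per_value = 2    # assuming fp16 storage

m = 8_800_000          # hypothetical corpus size (roughly MSMARCO-scale)
n = 500_000            # hypothetical number of training queries
k = 100                # hypothetical retrieval depth per query

node_cache = m * d * bytes_per_value                  # passage embeddings of M_de
edge_cache = (n * k + m + n) * d * bytes_per_value    # cross-encoder edge embeddings

print(f"passage-embedding cache: {node_cache / 1e9:.1f} GB")
print(f"edge-embedding cache:    {edge_cache / 1e9:.1f} GB")
```

Under these assumptions the edge-embedding cache alone is on the order of tens of gigabytes, which is why reducing it is a natural direction for future work.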