Integrating a GNN with PLM (Pre-trained Language Model)
=
There are different ways to combine GNNs and Transformer models, for example stacking them, placing them side by side, or leveraging attention masks and positional embeddings. When the GNN and the Transformer are trained from scratch, the training signal can align their feature spaces. However, empirical results with PLMs reveal two problems: performance is sensitive to the way of integration (**aligning GNN's and PLM's feature space is difficult**), and adding the GNN causes a performance drop compared with using the PLM alone (**PLM forgets knowledge during the joint training**).
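As a concrete illustration of the stacking option, below is a minimal sketch in PyTorch, assuming graph nodes are aligned with tokens and a plain graph-convolution layer; the module names (`SimpleGCNLayer`, `StackedPLMGNN`) and the row-normalized adjacency input are assumptions for illustration, not the actual architecture used here.

```python
# Minimal sketch of the "stacking" option: run the PLM first, then a GNN over
# its token features. Module names are hypothetical, not from this note.
import torch
import torch.nn as nn
from transformers import AutoModel

class SimpleGCNLayer(nn.Module):
    """One graph convolution step: H' = ReLU(A_norm @ H @ W)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, h, adj_norm):
        # adj_norm: (num_nodes, num_nodes) row-normalized adjacency with self-loops
        return torch.relu(adj_norm @ self.linear(h))

class StackedPLMGNN(nn.Module):
    """PLM encodes the tokens; a GNN refines the token features over a graph."""
    def __init__(self, plm_name="roberta-base", num_gnn_layers=3):
        super().__init__()
        self.plm = AutoModel.from_pretrained(plm_name)
        dim = self.plm.config.hidden_size
        self.gnn_layers = nn.ModuleList(
            SimpleGCNLayer(dim) for _ in range(num_gnn_layers))

    def forward(self, input_ids, attention_mask, adj_norm):
        # contextual token features from the PLM (assume batch size 1 for simplicity)
        h = self.plm(input_ids=input_ids,
                     attention_mask=attention_mask).last_hidden_state.squeeze(0)
        for layer in self.gnn_layers:
            h = layer(h, adj_norm)   # propagate over the token-level graph
        return h
```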
### aligning GNN's and PLM's feature space is difficult
One possible reason is that the model loses degrees of freedom during pre-training. For example, suppose we ask the model to fit a synthetic pattern (e.g., randomly assign each word a number, and the target is the parity of the sum of the numbers in a sequence). In this case, fine-tuning a pre-trained model could be more challenging than training a model with the same architecture from scratch.
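To make the synthetic pattern concrete, here is a small illustrative generator; the vocabulary size, number range, and sequence length are arbitrary assumptions.

```python
# Illustrative generator for the synthetic parity task described above:
# each word is randomly assigned a number, and the label for a sequence
# is the parity of the sum of its words' numbers.
import random

VOCAB_SIZE = 1000
# Fixed random assignment: word id -> number
word_to_number = {w: random.randint(0, 9) for w in range(VOCAB_SIZE)}

def make_example(seq_len=16):
    words = [random.randrange(VOCAB_SIZE) for _ in range(seq_len)]
    label = sum(word_to_number[w] for w in words) % 2   # parity of the sum
    return words, label

words, label = make_example()
print(words[:5], "...", "parity =", label)
```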
### PLM forgets knowledge during the joint training
The PLM may lose its knowledge during joint training with a randomly initialized GNN. A randomly initialized GNN is likely to pollute the PLM's features, so the training signal that flows back into the PLM is also of low quality.
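One common mitigation, not proposed in this note but sketched here as an assumption, is to freeze the PLM while the randomly initialized GNN warms up, so the low-quality gradients never reach the PLM early in training; `model.plm` and `compute_loss` are hypothetical names following the stacking sketch above.

```python
# Sketch of a warm-up schedule: keep the PLM frozen for the first few epochs
# so the randomly initialized GNN cannot pollute its features via bad gradients.
import torch

def set_plm_trainable(model, trainable):
    for p in model.plm.parameters():
        p.requires_grad = trainable

def train(model, loader, num_epochs=10, warmup_epochs=2):
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    for epoch in range(num_epochs):
        # PLM stays frozen until the GNN has had `warmup_epochs` of training
        set_plm_trainable(model, trainable=(epoch >= warmup_epochs))
        for batch in loader:
            loss = model.compute_loss(batch)   # hypothetical loss hook
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```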
## Feedback for Missing Nodes in the Subgraph Extraction Problem

We face the problem of selecting a group of nodes in a graph, but some desired nodes may not appear in the graph at all. Empirical results show that about 10%~20% of target nodes are missing from the input graph, mainly because the graph construction module is not perfect. The question is how to provide feedback/training signals to the graph construction module when we only have supervision over subgraphs.
Formally, the ground-truth input graph should be $G^*$, but the graph construction model predicts $G$. We want the node selection module to find a group of nodes $S \subset V(G)$, for which the ground truth is $S^* \subset V(G^*)$, where $V(G)$ denotes the vertex set of $G$. How can $\mathrm{diff}(S, S^*)$ be used to affect the graph construction model?
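A minimal sketch of the quantity in question, assuming nodes are identified by hashable ids: $\mathrm{diff}(S, S^*)$ can be decomposed into wrongly selected nodes, recoverable misses, and nodes absent from $G$ altogether (the 10%~20% case that only the graph construction module can fix). The function below is illustrative, not the actual implementation.

```python
# Illustrative decomposition of diff(S, S*); node ids are assumed hashable (e.g., strings).
def diff(selected, gold, graph_nodes):
    """selected = S, gold = S*, graph_nodes = V(G)."""
    wrong = selected - gold                        # selected but not in S*
    recoverable = (gold - selected) & graph_nodes  # in S*, present in G, not selected
    missing = gold - graph_nodes                   # in S* but absent from G entirely:
                                                   # only graph construction can fix these
    return wrong, recoverable, missing

S      = {"eat-01", "person"}
S_star = {"eat-01", "person", "food"}
V_G    = {"eat-01", "person", "dog"}
print(diff(S, S_star, V_G))   # -> (set(), set(), {'food'})
```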
Since we know that the missing node should belong to the ground-truth graph, we can train a new model that injects one node into the graph. However, the training objective of this refinement model is unclear and depends on another model's behavior.

A possible solution is to introduce a new pre-training task. We mask out one AMR node (red in the figure) and give the corresponding text span as a query to the model. The model needs to predict this AMR node's "position" and connections. In the following example, the model takes "food" and the whole AMR graph except the red node as input, and it must predict the two edges (blue in the figure) connected to the red node.
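Below is a sketch of how one such training instance could be assembled; the AMR triples, node names, and function are illustrative assumptions rather than the actual pipeline.

```python
# Illustrative construction of one masked-node pre-training instance:
# remove a node from the AMR graph, keep its text span as the query, and keep
# the edges that touched it as the prediction targets.
amr_nodes = ["e / eat-01", "p / person", "f / food"]
amr_edges = [("e / eat-01", ":ARG0", "p / person"),
             ("e / eat-01", ":ARG1", "f / food")]

def make_masked_instance(nodes, edges, masked_node, text_span):
    visible_nodes = [n for n in nodes if n != masked_node]
    visible_edges = [e for e in edges if masked_node not in (e[0], e[2])]
    # the model must recover these edges given the query span and the visible graph
    target_edges = [e for e in edges if masked_node in (e[0], e[2])]
    return {"query": text_span,
            "graph": (visible_nodes, visible_edges),
            "targets": target_edges}

instance = make_masked_instance(amr_nodes, amr_edges,
                                masked_node="f / food", text_span="food")
print(instance["targets"])   # the edges that connected the masked node
```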
## Some results with HALO and RGCN
|model|Head F1|Head Role F1| Coref F1| Coref Role F1|
|-|-|-|-|-|
|2-layer RoBERTa+3-layer RGCN| 66.38|58.61 |61.97|55.04|
|4-layer RoBERTa+3-layer RGCN|73.39|63.40| 68.40|59.50|
|2-layer RoBERTa+3-layer HALO|||||
|4-layer RoBERTa+3-layer HALO|70.01|59.97|65.21|56.48|
|2-layer RoBERTa+10-layer HALO| 66.39|56.73|62.48|53.03|
|4-layer RoBERTa+10-layer HALO| 71.49|62.41|66.73|58.53|
|2-layer RoBERTa+20-layer HALO|64.24|54.74|60.00|51.71|
|4-layer RoBERTa+20-layer HALO|71.54|63.87|66.66|59.46|
- Although the GCN layers in HALO share weights, the computation cost is still a problem when the number of iterations is large.
- From the unfolding perspective, is there a criterion of convergence? In other words, do we know how many layers are enough? (One possible stopping rule is sketched below.)
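A minimal sketch of one such stopping rule, assuming a weight-shared layer applied iteratively (as in the `SimpleGCNLayer` sketch above): stop unfolding once the relative change in node features falls below a tolerance. The threshold, cap, and layer interface are assumptions, not HALO's actual criterion.

```python
# Illustrative fixed-point style stopping rule for an unfolded, weight-shared
# GNN layer: iterate until node features stop changing (or a cap is reached).
import torch

def run_until_converged(shared_layer, h, adj_norm, tol=1e-4, max_iters=50):
    for it in range(max_iters):
        h_next = shared_layer(h, adj_norm)
        # relative change in node features between consecutive iterations
        rel_change = (h_next - h).norm() / (h.norm() + 1e-12)
        h = h_next
        if rel_change < tol:
            break
    return h, it + 1   # features and the number of layers actually "used"
```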