GraphWriter: a Graph-to-Text model
=

Converting structured data to natural language is useful in many areas. The model discussed here, GraphWriter, is designed to transfer the knowledge in a knowledge graph into a text passage. Following the common choice for text generation tasks, GraphWriter adopts an encoder-decoder framework, more specifically a graph encoder and a text decoder. In this example we choose a Graph-Transformer encoder (a kind of GNN) and an LSTM decoder, but you can swap them for other models, such as GCN, GAT, or Transformer.
The main idea of GraphWriter is to encode the entities in the knowledge graph as vertices and to build edges between entities and relations. A graph encoder is then applied to obtain representations of these entities. In the final step, an LSTM decoder generates the text description conditioned on the outputs of the graph encoder, attending to the entities at each step. We illustrate the whole pipeline as follows:

# Graph to text
The GraphWriter task is to convert the information in a graph into text, which is a typical NLG (Natural Language Generation) problem. As with other multimodal problems, the key question is how to map semantic information from one modality to another. A concrete case is taking a knowledge graph as input and turning it into a sentence or a paragraph with a similar meaning. Several datasets fit this setting, such as WebNLG, which extracts graphs from DBpedia and provides short text descriptions, and AGENDA, which consists of research paper abstracts and human-annotated graphs.
[image struct2data]
# Model overview
GraphWriter adopts an encoder-decoder framework and divides the model into several components: a title encoder (optional), an entity encoder, a graph encoder, and a text decoder. As shown in Fig-XX, the model encodes the text of each entity into a vector representation and uses it as the initial feature of the corresponding node in the graph. A GNN (Graph Neural Network) model is then applied to refine the node representations and to produce a global context vector. Given the global context and the node representations (both entity and relation), decoding proceeds as in other attention-based seq2seq models: at each decoding step the model attends to the nodes and combines the result with the LSTM state to make the prediction.
# Pipeline
Suppose we have three inputs: the title of a paper (text), the graph corresponding to the text, and the abstract of the paper, which is the target of the text generation.
``` python
raw_title: ['Resources', 'for', 'Urdu', 'Language', 'Processing', '.']
raw_ent_text: [['corpus', 'and', 'lexical', 'resources'], ['crulp'], ['urdu']]
raw_ent_type: ['<material>', '<material>', '<material>']
raw_rel: [[['crulp'], '--USED-FOR--', ['urdu']]]
raw_text: ['urdu', 'is', 'spoken', 'by', 'more', 'than', \
'100', 'million', 'speakers', '.', 'this', 'paper', 'summarizes', \
'the', '<material_0>', 'being', 'developed', 'for', '<material_2>', \
'by', 'the', '<material_1>', ',', 'in', 'pakistan', '.']
graph: DGLGraph(num_nodes=6, num_edges=16,
ndata_schemes={'type': Scheme(shape=(), dtype=torch.float32)}
edata_schemes={})
```
This data is numericalized with the vocabulary, converted to tensors, and batched. You can inspect the data loading process by running the command below, which prints some examples after processing.
```shell
$: python -u utlis.py
```
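The numericalization and batching step can be sketched in plain Python as follows. This is only an illustration under assumptions: the helper names (`build_vocab`, `numericalize`, `pad_batch`) and the reserved `<pad>`/`<unk>` ids are hypothetical and are not taken from the example code.

```python
# Hypothetical helpers for illustration; not the names used in the example code.
PAD, UNK = "<pad>", "<unk>"

def build_vocab(token_lists):
    """Assign an integer id to every distinct token, reserving 0 and 1."""
    vocab = {PAD: 0, UNK: 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(tokens, vocab):
    """Replace each token by its id, falling back to <unk> for unseen tokens."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]

def pad_batch(id_lists, pad_id=0):
    """Right-pad every sequence to the longest one; return (batch, lengths)."""
    lengths = [len(ids) for ids in id_lists]
    max_len = max(lengths)
    batch = [ids + [pad_id] * (max_len - len(ids)) for ids in id_lists]
    return batch, lengths

ent_text = [['corpus', 'and', 'lexical', 'resources'], ['crulp'], ['urdu']]
vocab = build_vocab(ent_text)
batch, lengths = pad_batch([numericalize(e, vocab) for e in ent_text])
print(batch)    # [[2, 3, 4, 5], [6, 0, 0, 0], [7, 0, 0, 0]]
print(lengths)  # [4, 1, 1]
```

The padded lists and their lengths are what one would typically wrap into tensors and a mask before feeding the encoders.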
We first feed the data to an entity encoder. It takes the segmented text pieces as input and turns them into a fixed-size vector for each entity by applying a bi-directional LSTM and aggregating its outputs with average pooling.
``` python
ent_enc = self.ent_enc(self.ent_emb(batch['ent_text']),
                       ent_text_mask, ent_len=batch['ent_len'])
```
A parallel module is the title encoder. It is an ordinary bi-directional LSTM that keeps the sequence length.
```python
title_enc = self.title_enc(self.title_emb(batch['title']),
                           title_mask)
```
Both the title encoder and the entity encoder are built on bi-LSTMs, implemented as follows.
```python
pad_seq = pack_padded_sequence(inp, lens, batch_first=True,
                               enforce_sorted=False)
y, (_h, _c) = self.bilstm(pad_seq)
if self.enc_type == 'title':
    # keep the full output sequence for the title
    y = pad_packed_sequence(y, batch_first=True)[0]
    return y
if self.enc_type == 'entity':
    _h = _h.transpose(0, 1).contiguous()
    # two directions of the top-layer
    _h = _h[:, -2:].view(_h.size(0), -1)
    ret = pad(_h.split(ent_len), out_type='tensor')
    return ret
```
Next comes the key component, the graph encoder for the input graph. It first converts a graph with edge types into a graph without edge types but with relation nodes. For example, a triplet (v, e, u) represents an edge of type “e” connecting “v” and “u”. This is equivalent to introducing an additional relation node “e” and building new untyped edges v → e and e → u, and often the inverse edges u → e and e → v as well. Entity nodes are initialized with the entity features taken from the entity encoder, and each relation node gets its representation by querying a trainable look-up table according to its edge type. Finally, a GNN model is applied, and the final states of the entity nodes are used as the output.
```python
def forward(self, graph, feat):
    """Graph transformer network layer.

    Parameters
    ----------
    graph : DGLGraph
        The graph.
    feat : torch.Tensor
        The input feature of shape :math:`(N, D_{in})` where :math:`D_{in}`
        is size of input feature, :math:`N` is the number of nodes.

    Returns
    -------
    torch.Tensor
        The output feature of shape :math:`(N, H, D_{out})` where :math:`H`
        is the number of heads, and :math:`D_{out}` is size of output feature.
    """
    graph = graph.local_var()
    feat_c = feat.clone().detach().requires_grad_(False)
    q, k, v = self.q_proj(feat), self.k_proj(feat_c), self.v_proj(feat_c)
    q = q.view(-1, self._num_heads, self._out_feats)
    k = k.view(-1, self._num_heads, self._out_feats)
    v = v.view(-1, self._num_heads, self._out_feats)
    graph.ndata.update({'ft': q, 'el': k, 'er': v})
    # compute edge attention
    graph.apply_edges(fn.u_add_v('el', 'er', 'e'))
    e = math.sqrt(self._out_feats) * graph.edata.pop('e')
    # compute softmax
    graph.edata['a'] = self.attn_drop(edge_softmax(graph, e))
    # message passing
    graph.update_all(fn.u_mul_e('ft', 'a', 'm'),
                     fn.sum('m', 'ft'))
    rst = graph.ndata['ft']
    # residual
    rst = rst.view(feat.shape) + feat
    if self._trans:
        rst = self.ln(rst)
        rst = self.ln(rst + self.FFN(rst))
    return rst
```
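To make the `edge_softmax` and `update_all` steps in the layer above more concrete, here is a minimal plain-Python sketch with scalar features. It assumes the default behaviour of DGL's `edge_softmax`, which normalizes the scores over the incoming edges of each destination node. The function names `edge_softmax_sketch` and `aggregate` and the toy graph are hypothetical stand-ins, not part of the example code.

```python
import math
from collections import defaultdict

def edge_softmax_sketch(edges, scores):
    """Softmax of edge scores, normalized over the incoming edges of each destination."""
    by_dst = defaultdict(list)
    for eid, (_src, dst) in enumerate(edges):
        by_dst[dst].append(eid)
    attn = [0.0] * len(edges)
    for eids in by_dst.values():
        top = max(scores[e] for e in eids)  # subtract the max for numerical stability
        exps = {e: math.exp(scores[e] - top) for e in eids}
        total = sum(exps.values())
        for e in eids:
            attn[e] = exps[e] / total
    return attn

def aggregate(edges, attn, feats, num_nodes):
    """Each node sums the features of its in-neighbours, weighted by edge attention."""
    out = [0.0] * num_nodes
    for (src, dst), a in zip(edges, attn):
        out[dst] += a * feats[src]
    return out

# Toy graph: node 2 has three incoming edges, node 1 has one, node 0 has none.
edges = [(0, 2), (1, 2), (2, 2), (0, 1)]
scores = [0.0, 0.0, math.log(2.0), 5.0]
attn = edge_softmax_sketch(edges, scores)
out = aggregate(edges, attn, feats=[1.0, 2.0, 4.0], num_nodes=3)
print([round(a, 3) for a in attn])
print([round(x, 3) for x in out])
```

A node with a single incoming edge always gets attention weight 1 on that edge, whatever its raw score, which is one reason self-loop edges matter for isolated nodes.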
The last module is an LSTM decoder for text generation. The entity features extracted by the GNN and the title features are the inputs to the decoder, which uses an attention mechanism to access them and generates the abstract sequentially.
```python
tar_inp = self.tar_emb(batch['text'].transpose(0, 1))
for t, xt in enumerate(tar_inp):
    # feed the previous context together with the current input token
    _xt = torch.cat([ctx, xt], 1)
    _h, _c = self.decode_lstm(_xt, (_h, _c))
    # attend to the entity representations from the graph encoder
    ctx = self.ent_attn(_h, g_ent, mask=ent_mask)
    if self.args.title:
        attn = self.title_attn(_h, title_enc, mask=title_mask)
        ctx = torch.cat([ctx, attn], 1)
    outs.append(torch.cat([_h, ctx], 1))
```
The final prediction combines the prediction over the vocabulary with the copy distribution over the entities.
```python
outs = torch.stack(outs, 1)
copy_gate = torch.sigmoid(self.copy_fc(outs))
EPSI = 1e-6
pred_v = torch.log(copy_gate+EPSI) + torch.log_softmax(self.pred_v_fc(outs), -1)
pred_c = torch.log((1. - copy_gate)+EPSI) + torch.log_softmax(self.copy_attn(outs, g_ent, mask=ent_mask), -1)
pred = torch.cat([pred_v, pred_c], -1)
```
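The following plain-Python sketch shows why the concatenated `pred` is a valid log-distribution over vocabulary tokens plus entities. As in the snippet above, the gate weights the vocabulary distribution and one minus the gate weights the copy distribution. The function names and the toy logits are hypothetical, and the small `EPSI` term is left out for clarity, so the gate is assumed to lie strictly between 0 and 1.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_z for x in logits]

def mix_with_copy_gate(gate, vocab_logits, copy_logits):
    """Return log-probabilities over [vocabulary tokens..., entities...]."""
    log_v = [math.log(gate) + lp for lp in log_softmax(vocab_logits)]
    log_c = [math.log(1.0 - gate) + lp for lp in log_softmax(copy_logits)]
    return log_v + log_c

# Toy example: 3 vocabulary tokens, 2 entities, gate value 0.7.
pred = mix_with_copy_gate(gate=0.7, vocab_logits=[2.0, 0.5, -1.0],
                          copy_logits=[0.3, 1.2])
probs = [math.exp(lp) for lp in pred]
print(round(sum(probs), 6))      # total probability mass
print(round(sum(probs[:3]), 6))  # mass assigned to generating from the vocabulary
```

Because each part is a softmax scaled by the gate or its complement, the two parts sum to one, and the gate alone decides how much mass goes to generating versus copying.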
# Result
# Data Preprocessing
Data preprocessing is always an important part of an NLP application. For example, the original format of the AGENDA dataset is a JSON file like:
```python
{
"title": "Hierarchical Semantic Classification : Word Sense Disambiguation with World Knowledge .",
"entities": [
"task-specific and background data",
"lexical semantic classification problems",
"word sense disambiguation task",
"hierarchical learning architecture",
"task-specific training data",
"background data",
"learning architecture"
],
"types": "<material> <task> <task> <method> <material> <material> <method>",
"relations": [
"learning architecture -- USED-FOR -- lexical semantic classification problems"
],
"abstract": "we present a <method_6> for <task_1> that supplements <material_4> with <material_5> encoding general '' world knowledge '' . the <method_6> compiles knowledge contained in a dictionary-ontology into additional training data , and integrates <material_0> through a novel <method_3> . experiments on a <task_2> provide empirical evidence that this '' <method_3> '' outperforms a state-of-the-art standard '' flat '' one .",
"abstract_og": "we present a learning architecture for lexical semantic classification problems that supplements task-specific training data with background data encoding general '' world knowledge '' . the learning architecture compiles knowledge contained in a dictionary-ontology into additional training data , and integrates task-specific and background data through a novel hierarchical learning architecture . experiments on a word sense disambiguation task provide empirical evidence that this '' hierarchical learning architecture '' outperforms a state-of-the-art standard '' flat '' one ."
}
```
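Judging from this record, a placeholder such as `<method_6>` in the `abstract` field refers to the entity at index 6 of `entities`, and substituting the entity text back gives `abstract_og`. A minimal sketch of that substitution is shown below. The helper name `restore_entities` and the tag pattern are assumptions inferred from this one sample, not part of the example code.

```python
import re

# Assumed tag format, inferred from the sample record: <type_index>
TAG = re.compile(r"<([a-z]+)_(\d+)>")

def restore_entities(abstract, entities):
    """Replace every <type_i> placeholder with the text of entity i."""
    return TAG.sub(lambda m: entities[int(m.group(2))], abstract)

entities = [
    "task-specific and background data",
    "lexical semantic classification problems",
    "word sense disambiguation task",
    "hierarchical learning architecture",
    "task-specific training data",
    "background data",
    "learning architecture",
]
templated = "we present a <method_6> for <task_1> that supplements <material_4> with <material_5>"
restored = restore_entities(templated, entities)
print(restored)
```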
The graph is thus represented in raw text as a node list and an edge list. Since we need index numbers rather than raw text, several vocabularies are necessary to map the raw text, entities, and relations to integers. We suggest building the graphs as part of data loading instead of building them dynamically in each batch, provided there is enough memory and disk space. We build the graph following the guidance in the authors' paper, which adds a root node, relation nodes, and self-loop edges.
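The graph construction can be sketched without any graph library as below. Several details are assumptions made for illustration: the function name `build_graph`, the node ordering (entities, then the root, then relation nodes), a separate inverse relation node per relation, and root edges in both directions. They were chosen because they reproduce the `num_nodes=6, num_edges=16` of the sample graph in the Pipeline section. The real data loader may order nodes or orient edges differently.

```python
def build_graph(entities, relations):
    """Return (node_types, edges) for one sample.

    entities:  list of entity strings
    relations: list of "head -- REL -- tail" strings, as in the JSON record
    """
    num_ent = len(entities)
    root = num_ent  # assumed layout: entities, root, relation nodes
    node_types = ["entity"] * num_ent + ["root"]
    edges = []
    # connect the root with every entity in both directions (assumption)
    for i in range(num_ent):
        edges += [(root, i), (i, root)]
    for rel in relations:
        head, rel_type, tail = [part.strip() for part in rel.split(" -- ")]
        h, t = entities.index(head), entities.index(tail)
        fwd = len(node_types)
        node_types.append(rel_type)
        inv = len(node_types)
        node_types.append(rel_type + "_INV")  # hypothetical label for the inverse node
        # head -> relation -> tail, and tail -> inverse relation -> head
        edges += [(h, fwd), (fwd, t), (t, inv), (inv, h)]
    # one self-loop per node
    edges += [(i, i) for i in range(len(node_types))]
    return node_types, edges

entities = ["corpus and lexical resources", "crulp", "urdu"]
relations = ["crulp -- USED-FOR -- urdu"]
node_types, edges = build_graph(entities, relations)
print(len(node_types), len(edges))  # 6 16
```

The resulting edge list can then be handed to a graph library once, at data loading time, so that batching only has to merge prebuilt graphs.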
# Visualization
We further analyze the behaviour of the model by visualizing the attention. The first visualization probes the attention distributions over the entities read from the graph encoder.

Here the x-axis is the generated text sequence and the y-axis is the set of entities. The number of blocks indicates the intensity of the attention weights, averaged over all attention heads.
The second visualization is about the copy mechanism.
[image copy attention]