# Edge-Featured Graph Attention Network

###### tags: `references`

## Introduction

* Most current state-of-the-art graph neural networks do not consider edge features.
* Edge information can be an important factor in computing attention weights.
* Multi-scale merge strategy: concatenate features from different iterations.
* All node and edge features are gathered in the final layer, so the model can learn, at various scales, the characteristics that benefit classification.
* Our models can be applied to graphs with discrete and continuous features for both nodes and edges.

## Related Work

* Traditional GNNs: cannot handle edge features in a reasonable way.
* R-GCN: accepts graphs only when their edges are labeled, which means the edges cannot carry continuous attributes.
* EGNN:
    * A framework that enhances GCNs and GATs.
    * It can accept continuous edge attributes, but it merely regards them as weights between node pairs, which is unreasonable in most cases.

## Proposed Model

* Notation:
    * Node features: $H = \{h_1, ..., h_N\}, h_i \in \mathbb{R}^{F_H}$
    * Edge features: $E = \{e_1, ..., e_M\}, e_i \in \mathbb{R}^{F_E}$
    * $N$: number of nodes
    * $M$: number of edges
    * Output node features: $H' = \{h'_1, ..., h'_N\}, h'_i \in \mathbb{R}^{F'_H}$
    * Output edge features: $E' = \{e'_1, ..., e'_M\}, e'_i \in \mathbb{R}^{F'_E}$
* Node attention block
    * Add virtual featured self-loops: as a compromise, a self-loop's edge features are computed as the dimension-wise average of all the node's adjacent edges' features.
    * Attention weight: ![](https://i.imgur.com/i2rEyNL.png)
    * New set of node features $H'$:
        * $h'_i = \sigma(\sum_{j\in N_i}\alpha_{ij}h_j)$
        * Only the node features are aggregated to generate $H'$: if edge features were merged into nodes in every iteration, all the features might tangle up together and make the network more complicated and confusing.
    * Edge-integrated node features $H_m$:
        * $m_i=\sigma(\sum_{j \in N_i}\alpha_{ij}(h_j||e_{ij}))$
        * Used only in the last-level merge layer to achieve a multi-scale concatenation; they are never passed to the next EGAT layer as inputs.
* Edge attention block
    * Graph transformation:
        * The roles of nodes and edges are inverted.
        * Two edges are considered adjacent only if they share at least one common vertex. ![](https://i.imgur.com/dIXAqYX.png)
    * Attention weight: ![](https://i.imgur.com/TgCX8qV.png)
    * New set of edge features $E'$:
        * $e'_p = \sigma(\sum_{q \in N_p}\beta_{pq}e_q)$
* EGAT architecture
    * Stack several EGAT layers and append a merge layer at the tail. ![](https://i.imgur.com/rJVaPLt.png)
    * All $H_m$ generated from different iterations are aggregated together with a concatenation operation.
    * We adopt multi-head attention in the merge layer to further stabilize the attention mechanism.
    * Our multi-head attention is performed over all EGAT layers jointly rather than within a single layer, resulting in the feature representation $h_i^* = ||_{k=1}^{K}(||^L_{l=1}m_i^{l, k})$
        * $L$: number of EGAT layers
        * $K$: number of attention heads
    * To obtain a more refined representation, we apply a 1D convolution to the result as a linear transformation, followed by a non-linearity.
    * A minimal single-head sketch of one EGAT layer is given below.
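The following is a minimal single-head PyTorch sketch of one EGAT layer, written from the description above. The paper's exact attention formulas are only shown in the figures, so standard GAT-style scoring (a learned linear transform, a single-layer scorer with LeakyReLU, and a masked softmax) is assumed here; the class name `EGATLayer` and the dense inputs `adj`, `e_pair`, and `edge_adj` are illustrative choices, not the paper's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EGATLayer(nn.Module):
    """One EGAT layer: a node attention block followed by an edge attention
    block (single head, dense masks, GAT-style scoring assumed)."""
    def __init__(self, f_h, f_e, f_h_out, f_e_out):
        super().__init__()
        self.W_h = nn.Linear(f_h, f_h_out, bias=False)   # node transform
        self.W_e = nn.Linear(f_e, f_e_out, bias=False)   # edge transform
        self.a_h = nn.Linear(2 * f_h_out + f_e_out, 1, bias=False)  # node scorer
        self.a_e = nn.Linear(2 * f_e_out, 1, bias=False)            # edge scorer

    @staticmethod
    def _masked_softmax(scores, mask):
        return torch.softmax(scores.masked_fill(mask == 0, float('-inf')), -1)

    def forward(self, h, e, adj, e_pair, edge_adj):
        # h: [N, F_H] nodes; e: [M, F_E] edges
        # adj: [N, N] 0/1 node adjacency, virtual self-loops on the diagonal
        # e_pair: [N, N, F_E] per-pair edge features; self-loop entries hold
        #         the average of the node's adjacent edge features
        # edge_adj: [M, M] 0/1 adjacency of the transformed graph (edges are
        #           adjacent iff they share a vertex; diagonal set to 1)
        N = h.size(0)
        Wh, We_pair = self.W_h(h), self.W_e(e_pair)
        pair = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                          Wh.unsqueeze(0).expand(N, N, -1),
                          We_pair], dim=-1)
        alpha = self._masked_softmax(
            F.leaky_relu(self.a_h(pair)).squeeze(-1), adj)        # [N, N]
        h_out = F.elu(alpha @ Wh)                                 # H'
        # edge-integrated features m_i = sigma(sum_j a_ij (h_j || e_ij)),
        # consumed only by the final merge layer
        m = F.elu(torch.einsum('ij,ijf->if', alpha,
                  torch.cat([Wh.unsqueeze(0).expand(N, N, -1), We_pair], -1)))
        # edge attention block on the transformed graph
        We = self.W_e(e)
        M_ = We.size(0)
        epair = torch.cat([We.unsqueeze(1).expand(M_, M_, -1),
                           We.unsqueeze(0).expand(M_, M_, -1)], dim=-1)
        beta = self._masked_softmax(
            F.leaky_relu(self.a_e(epair)).squeeze(-1), edge_adj)  # [M, M]
        e_out = F.elu(beta @ We)                                  # E'
        return h_out, e_out, m
```

Dense $N \times N$ and $M \times M$ masks keep the sketch short; a practical implementation would use sparse message passing, $K$ heads, and the merge layer that concatenates $m_i$ across layers and heads before the 1D convolution.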
## Experiments

* Datasets
    * Node-sensitive graphs
        * Node features highly correlate with node labels.
        * Cora, Citeseer, Pubmed: undirected, with no edge features.
    * Edge-sensitive graphs
        * Financial-collaborative graphs that refer to real-world trading records.
        * Trade-B (3907 nodes, 97 of them labeled, and 4394 edges; binary classification) and Trade-M (4431 nodes, 139 of them labeled, and 4900 edges; ternary classification).
        * Each node represents a customer, with an attribute indicating its risk level.
        * Each edge represents a relation between customers, with features containing the number and total amount of recent transactions.
        * Both datasets are directed initially; to make them suitable for EGATs, we converted them into an undirected form.
* Experimental setup
    * $L=2, K=8$
    * Node-sensitive graphs
        * We generate one weak topological feature for each edge by counting its adjacent edges (see the sketch after this section).
        * $F'_H=8, F'_E=4$
    * Edge-sensitive graphs
        * $F'_H : F'_E$ is set to different ratios: $8:4$, $6:6$, and $4:8$.
* Results
    * Node-sensitive graphs ![](https://i.imgur.com/YzZBtC5.png)
        * We notice a slight decrease on both Cora and Citeseer compared with GAT, which may be caused by the introduction of edge features.
        * Since we generate each edge's feature from the number of its adjacent edges, some interference may occur if these features are largely uninformative. However, those negative effects are quite insignificant.
    * Edge-sensitive graphs ![](https://i.imgur.com/FZraaA0.png)
        * For GAT, we only feed the original node features as inputs.
        * To ensure fairness, we further create three variants of GAT by aggregating edge features into nodes as node features in advance, using different functions: sum, average, and max pooling (also sketched below).
        * EGATs achieve higher accuracy than these baselines on all these datasets, which means that EGATs can learn these characteristics of graphs spontaneously.
        * To the best of our knowledge, no existing approaches can process these kinds of graphs effectively.
        * According to the results, if edge features play a more important role than node features, we recommend a small or balanced $F'_H : F'_E$ ratio so that edges have a higher chance to show themselves.
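As a companion to the setup above, here is a minimal sketch (our own reconstruction; the paper gives no code) of the two preprocessing steps: the weak topological edge feature obtained by counting each edge's adjacent edges, and the sum/average/max folding of edge features into node features used by the GAT baseline variants. The function names `adjacent_edge_counts` and `fold_edges_into_nodes` are illustrative, and parallel edges are ignored for simplicity.

```python
import torch

def adjacent_edge_counts(edge_index: torch.Tensor, num_nodes: int) -> torch.Tensor:
    """Weak topological edge feature: for each undirected edge, the number of
    edges sharing one of its endpoints (the edge itself excluded).
    edge_index: [M, 2] tensor of endpoint indices. Returns [M, 1] floats."""
    deg = torch.zeros(num_nodes, dtype=torch.long)
    deg.scatter_add_(0, edge_index.reshape(-1),
                     torch.ones(edge_index.numel(), dtype=torch.long))
    u, v = edge_index[:, 0], edge_index[:, 1]
    # deg[u] - 1 other edges at u plus deg[v] - 1 at v
    return (deg[u] + deg[v] - 2).float().unsqueeze(-1)

def fold_edges_into_nodes(h, e, edge_index, mode="mean"):
    """GAT-baseline preprocessing: pool each node's incident edge features
    (sum / mean / max) and concatenate the result onto the node features.
    h: [N, F_H] node features, e: [M, F_E] edge features."""
    N, f_e = h.size(0), e.size(1)
    idx = edge_index.reshape(-1)                 # each edge touches two nodes
    vals = e.repeat_interleave(2, dim=0)         # one copy per endpoint
    agg = torch.zeros(N, f_e)
    if mode == "max":
        agg.index_reduce_(0, idx, vals, "amax", include_self=False)
    else:
        agg.index_add_(0, idx, vals)
        if mode == "mean":
            deg = torch.zeros(N).index_add_(0, idx, torch.ones(idx.numel()))
            agg = agg / deg.clamp(min=1).unsqueeze(-1)
    return torch.cat([h, agg], dim=-1)           # [N, F_H + F_E]
```

With `mode` set to `"sum"`, `"mean"`, or `"max"`, this reproduces the three baseline variants; the output is then fed to a plain GAT in place of the raw node features.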