<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/pdf/1710.10903.pdf) | [Note link](https://zhuanlan.zhihu.com/p/296587158) | [Code link](https://github.com/gordicaleksa/pytorch-GAT) | ICLR 2018
:::success
**Thoughts**
* The attention operation is efficient since it is parallelizable across node-neighbor pairs.
* But GAT still has limitations, such as handling sparse matrix operations and reducing redundant computation.
* GAT also performs well on inductive learning tasks, where the model has to generalize to completely unseen graphs.
:::
## Abstract
This paper uses masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations.
They address several key challenges of spectral-based graph neural networks simultaneously and make their model readily applicable to inductive as well as transductive problems.
> **Inductive learning** aims to learn a general model that can make predictions for new, unseen data, whereas **transductive learning** focuses on making predictions only for the data instances provided in the dataset.
## Introduction
Graph Neural Networks (GNNs) were introduced as a generalization of recursive neural networks that can directly deal with a more general class of graphs, e.g. cyclic, directed and undirected graphs.
GNNs consist of an iterative process, which propagates the node states until equilibrium, followed by a neural network that produces an output for each node based on its state.
---
**Prior work: spectral approaches**
**Prior work: non-spectral approaches**
---
The attention architecture has several interesting properties:
1. The operation is efficient, since it is parallelizable across node neighbor pairs.
2. It can be applied to graph nodes having different degrees by specifying arbitrary weights to the neighbors.
3. The model is directly applicable to inductive learning problems, including tasks where the model has to generalize to completely unseen graphs.
## GAT Architecture
In this section, they present the building block layer used to construct arbitrary graph attention networks.
### Graph Attention Layer
They will start by describing a single *graph attentional layer*, as the sole layer utilized throughout all of the GAT architectures used in their experiments.
The input to their layer is a set of node features, $\mathbf{h}=\left\{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\right\}, \vec{h}_i \in \mathbb{R}^F$, where $N$ is the number of nodes, and $F$ is the number of features in each node.
The layer produces a new set of node features $\mathbf{h^\prime}=\left\{\vec{h}_1^\prime, \vec{h}_2^\prime, \ldots, \vec{h}_N^\prime \right\}, \vec{h}_i^\prime \in \mathbb{R}^{F^\prime}$.
Also, the layer learns a shared linear transformation, parametrized by a *weight matrix* $\mathbf{W} \in \mathbb{R}^{F^\prime \times F}$, which is applied to every node.
They then perform *self-attention* on the nodes
$$
\tag{1} e_{ij} = a(\mathbf{W} \vec{h}_i, \mathbf{W} \vec{h}_j)
$$
where $a : \mathbb{R}^{F^\prime} \times \mathbb{R}^{F^\prime} \rightarrow \mathbb{R}$ is a shared attentional mechanism; the coefficient $e_{ij}$ indicates the importance of node $j$’s features to node $i$.
In its most general formulation, the model allows every node to attend on every other node, dropping all structural information.
They inject the graph structure into the mechanism by performing masked attention—they only compute $e_{ij}$ for nodes $j \in \mathcal{N}_i$ , where $\mathcal{N}_i$ is some neighborhood of node $i$ in the graph.
$$
\tag{2} \alpha_{i j}=\operatorname{softmax}_j\left(e_{i j}\right)=\frac{\exp \left(e_{i j}\right)}{\sum_{k \in \mathcal{N}_i} \exp \left(e_{i k}\right)}
$$
In their experiments, the attention mechanism $a$ is a single-layer feedforward neural network parametrized by a weight vector $\overrightarrow{\mathbf{a}} \in \mathbb{R}^{2F^\prime}$, followed by a LeakyReLU nonlinearity (negative slope $0.2$):
$$
\tag{3} \alpha_{i j}=\frac{\exp \left(\operatorname{LeakyReLU}\left(\overrightarrow{\mathbf{a}}^T\left[\mathbf{W} \vec{h}_i \| \mathbf{W} \vec{h}_j\right]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp \left(\operatorname{LeakyReLU}\left(\overrightarrow{\mathbf{a}}^T\left[\mathbf{W} \vec{h}_i \| \mathbf{W} \vec{h}_k\right]\right)\right)}
$$
where $\cdot^T$ represents transposition and $\|$ is the concatenation operation.
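Equations (1)-(3) are easy to sketch with dense tensors. Below is a minimal, illustrative PyTorch sketch (not the authors' implementation); the names `h`, `W`, `a`, `adj` and the toy graph are assumptions for the example, and it uses the fact that $\overrightarrow{\mathbf{a}}^T[\mathbf{W}\vec{h}_i \| \mathbf{W}\vec{h}_j]$ splits into two dot products so all pairs can be computed by broadcasting.
```python
import torch
import torch.nn.functional as F

# Illustrative shapes only: 4 nodes, F = 5 input features, F' = 3 output features.
N, F_in, F_out = 4, 5, 3
h = torch.randn(N, F_in)                      # node features h_i
W = torch.randn(F_out, F_in)                  # shared weight matrix W
a = torch.randn(2 * F_out)                    # attention vector a for [W h_i || W h_j]
adj = torch.tensor([[1, 1, 0, 0],             # toy adjacency (with self-loops) defining N_i
                    [1, 1, 1, 0],
                    [0, 1, 1, 1],
                    [0, 0, 1, 1]], dtype=torch.float)

Wh = h @ W.T                                  # W h_i for every node, shape (N, F')

# Eqs. (1)/(3): e_ij = LeakyReLU(a^T [W h_i || W h_j]); the dot product splits into
# a[:F'] . (W h_i) + a[F':] . (W h_j), so broadcasting gives all pairs at once.
e = F.leaky_relu((Wh @ a[:F_out]).unsqueeze(1) + (Wh @ a[F_out:]).unsqueeze(0),
                 negative_slope=0.2)

# Masked attention + Eq. (2): keep e_ij only for j in N_i, then softmax over each row.
alpha = torch.softmax(e.masked_fill(adj == 0, float("-inf")), dim=1)
```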

The normalized attention coefficients are used to compute a linear combination of the features corresponding to them, to serve as the final output features for every node
$$
\tag{4} \vec{h}_i^{\prime}=\sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{i j} \mathbf{W} \vec{h}_j\right)
$$
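Equation (4) is then just an attention-weighted aggregation. A minimal sketch, with stand-in values for `alpha` and `Wh` (in practice they come from the previous step) and ELU as one choice for $\sigma$:
```python
import torch
import torch.nn.functional as F

# Stand-ins with the same shapes as in the previous sketch: alpha is the (N, N)
# row-normalized attention matrix, Wh the (N, F') transformed node features.
N, F_out = 4, 3
alpha = torch.softmax(torch.randn(N, N), dim=1)
Wh = torch.randn(N, F_out)

# Eq. (4): each node's new feature vector is the attention-weighted sum of its
# neighbours' W h_j, passed through a nonlinearity sigma (ELU here).
h_prime = F.elu(alpha @ Wh)                   # shape (N, F')
```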
Expanding to *multi-head attention*, $K$ independent attention mechanisms execute the transformation of Eq. (4) and their features are concatenated:
$$
\tag{5} \vec{h}_i^{\prime}=\parallel_{k=1}^K \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{i j}^k \mathbf{W}^k \vec{h}_j\right)
$$
Note that, in this setting, the final returned output, $\mathbf{h}^\prime$, will consist of $KF^\prime$ features (rather than $F^\prime$) for each node.
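A hedged sketch of Eq. (5): `gat_head` is an assumed helper that packages the single-head computation of Eqs. (1)-(4); $K$ heads with independent parameters are run and their outputs concatenated.
```python
import torch
import torch.nn.functional as F

def gat_head(h, W, a, adj):
    """One attention head: Eqs. (1)-(4) with ELU as sigma."""
    Wh = h @ W.T
    e = F.leaky_relu((Wh @ a[:W.shape[0]]).unsqueeze(1) +
                     (Wh @ a[W.shape[0]:]).unsqueeze(0), negative_slope=0.2)
    alpha = torch.softmax(e.masked_fill(adj == 0, float("-inf")), dim=1)
    return F.elu(alpha @ Wh)

# Toy setup: K = 8 heads, each producing F' = 3 features per node.
N, F_in, F_out, K = 4, 5, 3, 8
h = torch.randn(N, F_in)
adj = torch.eye(N) + (torch.rand(N, N) > 0.5).float()   # random toy graph with self-loops

# Eq. (5): concatenate the K heads' outputs, giving K * F' features per node.
heads = [gat_head(h, torch.randn(F_out, F_in), torch.randn(2 * F_out), adj) for _ in range(K)]
h_prime = torch.cat(heads, dim=1)             # shape (N, K * F')
```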
For the final (prediction) layer of the network, they employ *averaging* and delay applying the final nonlinearity (usually a softmax or logistic sigmoid for classification problems):
$$
\tag{6} \vec{h}_i^{\prime}=\sigma\left(\frac{1}{K} \sum_{k=1}^K \sum_{j \in \mathcal{N}_i} \alpha_{i j}^k \mathbf{W}^k \vec{h}_j\right)
$$
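A short sketch of Eq. (6), using placeholder per-head outputs in place of $\sum_{j \in \mathcal{N}_i} \alpha_{ij}^k \mathbf{W}^k \vec{h}_j$: the $K$ heads are averaged first, and only then is $\sigma$ (here a softmax for node classification) applied.
```python
import torch

# Placeholder per-head pre-activations for K = 3 output heads over N = 4 nodes and
# 7 classes; in a real layer these would be sum_j alpha_ij^k W^k h_j for each head k.
N, n_classes, K = 4, 7, 3
head_out = torch.randn(K, N, n_classes)

# Eq. (6): average across heads first, then apply the delayed nonlinearity sigma
# (a softmax over classes for node classification).
logits = head_out.mean(dim=0)                 # shape (N, n_classes)
probs = torch.softmax(logits, dim=1)
```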
### Comparisons to Related Work
- Computationally, it is highly efficient: the operation of the self-attentional layer can be parallelized across all edges, and the computation of output features can be parallelized across all nodes.
- As opposed to GCNs, their model allows for (implicitly) assigning *different importances* to nodes of the same neighborhood, enabling a leap in model capacity.
## Evaluation
### Dataset
They evaluate on three transductive citation-network benchmarks (Cora, Citeseer, Pubmed) and one inductive benchmark, the protein-protein interaction (PPI) dataset, whose test graphs are completely unseen during training.
### Results
GAT matches or outperforms strong baselines such as GCN and GraphSAGE across all four datasets, with the largest gains on the inductive PPI task.


## Conclusions
This paper presented graph attention networks (GATs), novel convolution-style neural networks that operate on graph-structured data, leveraging masked self-attentional layers.
There are several potential improvements and extensions to graph attention networks that could be addressed as future work:
1. Handling larger batch sizes
2. Performing a thorough analysis of model interpretability
3. Extending the method to perform graph classification instead of node classification
4. Extending the model to incorporate edge features (possibly indicating relationships among nodes)