# Graphormer

> Report on Graphormer by Archit Jain (2019101053) and Pulkit Gupta (2019101078)

## Motivation

The authors apply the Transformer architecture to graph-structured datasets to improve the performance of graph neural networks on graph-based tasks. Transformers are a very powerful deep learning architecture for sequential data, and this work investigates how well they perform on graph data. It combines the classical Transformer with three types of structural encodings, namely spatial encoding, edge encoding and centrality encoding.

## Architecture

![](https://i.imgur.com/LA9ahgy.png)

### Transformer

Transformers are powerful encoder-decoder architectures that rely on the self-attention mechanism to generate representations of the input data. In the self-attention mechanism, the input is projected into three spaces with trainable weight matrices: a query space, a key space and a value space.

![](https://i.imgur.com/bys5ApW.jpg =300x)

Q, K, and V are the linear embeddings of the input sequence x, such that Q=W<sub>q</sub>x, K=W<sub>k</sub>x, V=W<sub>v</sub>x, where the W are trainable weight matrices. QK<sup>T</sup> is the matrix product of the query and the key embeddings, which produces a weight vector for every element in the value sequence. d<sub>k</sub> is the dimension of the key vectors, and 1/sqrt(d<sub>k</sub>) is used as a scaling factor. The softmax function rescales the query-key weights so that they sum to one. Multiplying by V, we get one output vector per query, stacked into a matrix.

### Centrality encoding

Centrality encoding models the importance of each node using node centrality. Graphormer uses degree centrality, which is based on the idea that the more neighbours a node has, the more central it is. Centrality encoding incorporates the degree of each node into its initial feature representation: a learnable vector determined by the node's degree centrality is added to the initial node features. Each node is assigned two real-valued embedding vectors, one according to its in-degree and one according to its out-degree.

![](https://i.imgur.com/YaDrAfN.jpg =200x)

h<sub>i</sub><sup>(0)</sup> is the initial feature vector of the i-th node. It consists of the initial node representation x<sub>i</sub> summed with learnable embedding vectors z that depend on the degree of the node: z<sup>-</sup> is indexed by the number of incoming edges (in-degree) and z<sup>+</sup> by the number of outgoing edges (out-degree).

### Spatial encoding

Spatial encoding models the topological relation between nodes. Transformers use positional encodings to retain information about the position of each element in a sequence; however, the nodes of a graph are not arranged in a sequence. They lie in a multi-dimensional space and are linked by edges. To encode this structural information of the graph in the model, spatial encoding is used: Graphormer takes the shortest-path distance (SPD) between two nodes if they are connected, and assigns a special value of -1 if they are not. A learnable bias term indexed by this distance is then added to the attention score computed from the queries and keys, and it is shared across all layers.

![](https://i.imgur.com/wh6ceZC.jpg =200x)

b acts as a spatial bias for the attention: if b is a decreasing function of the SPD between i and j, then A, the attention score, will be larger for nodes that are close to each other in the graph.

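
To make the centrality and spatial encodings concrete, below is a minimal, illustrative NumPy sketch of a single attention head. This is not the actual Graphormer implementation: all names (`z_in`, `z_out`, `b_spatial`) and the toy shapes are ours. It adds degree-indexed embeddings to the node features and a learnable bias, indexed by shortest-path distance, to the query-key scores; the edge-encoding term described in the next section would be added to the same score matrix.

```python
# Minimal, illustrative sketch of Graphormer-style attention for one head.
# All tensors are random stand-ins; names are hypothetical, not from the repo.
import numpy as np

rng = np.random.default_rng(0)

n, d = 5, 16                                # number of nodes, hidden dimension
max_degree, max_spd = 8, 4

x = rng.normal(size=(n, d))                 # raw node features
in_deg = rng.integers(0, max_degree, n)     # in-degree of each node
out_deg = rng.integers(0, max_degree, n)    # out-degree of each node
spd = rng.integers(0, max_spd, (n, n))      # shortest-path distances (toy values)

# Centrality encoding: learnable embeddings indexed by in-/out-degree,
# added to the initial node features: h_i^(0) = x_i + z^-_{deg-} + z^+_{deg+}.
z_in = rng.normal(size=(max_degree, d))
z_out = rng.normal(size=(max_degree, d))
h0 = x + z_in[in_deg] + z_out[out_deg]

# Standard scaled dot-product attention projections.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = h0 @ W_q, h0 @ W_k, h0 @ W_v

# Spatial encoding: one learnable scalar bias per shortest-path distance,
# shared across layers and added to the query-key scores.
b_spatial = rng.normal(size=max_spd)
scores = Q @ K.T / np.sqrt(d) + b_spatial[spd]
# (The edge-encoding term c_ij from the next section would be added here too.)

attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)    # row-wise softmax
out = attn @ V                              # one output vector per node
print(out.shape)                            # (5, 16)
```
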
### Edge encoding

Edge encoding models structural edge features, such as the type of bond between two atoms in a molecular graph. For each ordered node pair, one of the shortest paths between the two nodes is found, and the average over the edges on that path of the dot product between the edge's feature vector and a learnable embedding is computed.

![](https://i.imgur.com/6svlBVT.jpg =200x)

The edge encoding incorporates edge features via a bias term in the attention module. Like the spatial bias, the edge score c<sub>ij</sub> is added to the attention score of nodes i and j, yielding the final equation for the attention score between two nodes:

![](https://i.imgur.com/0lcNKGS.jpg =300x)

## Datasets used

+ OGB-LSC quantum chemistry regression challenge (i.e. PCQM4M-LSC)
+ OGBG-MolHIV
+ OGBG-MolPCBA
+ ZINC (from the Benchmarking GNNs leaderboard)

## Result

Graphormer won the quantum chemistry track of the 2021 Open Graph Benchmark Large-Scale Challenge (OGB-LSC). It also outperforms many state-of-the-art graph neural networks on other datasets such as ZINC, MolHIV and MolPCBA.

------------------

## Code Overview

+ [Github Repository](https://github.com/microsoft/Graphormer)
+ Graphormer inherits the extension mechanism of fairseq, which means it can easily support user-defined plug-ins. For example, the Graphormer-base model is defined through `GraphormerModel`, which inherits the `FairseqModel` class.
+ It is also easy to extend the Graphormer-base model: you can define your own model and criterion and then use them in Graphormer, which makes developing new models straightforward.

### File Structure

```
.
└── graphormer
    ├── criterions                          // code for the loss functions used
    │   ├── binary_logloss.py
    │   ├── l1_loss.py
    │   ├── mae_deltapos.py
    │   └── multiclass_cross_entropy.py
    ├── data                                // code for data parsing and lookup tables
    │   ├── algos.pyx
    │   ├── collator.py
    │   ├── dataset.py
    │   ├── dgl_datasets
    │   │   ├── dgl_dataset.py
    │   │   └── dgl_dataset_lookup_table.py
    │   ├── ogb_datasets
    │   │   └── ogb_dataset_lookup_table.py
    │   ├── pyg_datasets
    │   │   ├── pyg_dataset.py
    │   │   └── pyg_dataset_lookup_table.py
    │   ├── smiles
    │   │   └── smiles_dataset.py
    │   └── wrapper.py
    ├── evaluate                            // code for the evaluation metrics
    │   └── evaluate.py
    ├── models                              // code for the model architecture (2D and 3D)
    │   ├── graphormer.py
    │   └── graphormer_3d.py
    └── modules                             // code for the components of the model architecture
        ├── graphormer_graph_encoder.py
        ├── graphormer_graph_encoder_layer.py
        ├── graphormer_layers.py
        └── multihead_attention.py
```

## Command-line Tools

Graphormer reuses the `fairseq-train` command-line tool of [fairseq](https://fairseq.readthedocs.io/en/latest/command_line_tools.html) for training. Here we mainly document the additional parameters introduced by Graphormer and the `fairseq-train` parameters used by Graphormer.

### Installation

This is a guide to installing Graphormer. Currently, Graphormer supports installation on Linux only. On Linux, Graphormer can be easily installed with the `install.sh` script once a Python environment is prepared.

1. Please use Python 3.9 for Graphormer. It is recommended to create a virtual environment with [conda](https://docs.conda.io/en/latest/) or [virtualenv](https://virtualenv.pypa.io/en/latest).
   For example, to create and activate a conda environment with Python 3.9:

   ```
   conda create -n graphormer python=3.9
   conda activate graphormer
   ```

2. Run the following commands:

   ```
   git clone --recursive https://github.com/microsoft/Graphormer.git
   cd Graphormer
   bash install.sh
   ```

### How to use CLI

#### **Fine-tuning Pre-trained Models**

To fine-tune pre-trained models, use `--pretrained-model-name` to set the model name. For example, the script `examples/property_prediction/hiv_pre.sh` fine-tunes the model `pcqm4mv1_graphormer_base` on the `ogbg-molhiv` dataset. The command for fine-tuning is:

```
fairseq-train \
    --user-dir ../../graphormer \
    --num-workers 16 \
    --ddp-backend=legacy_ddp \
    --dataset-name ogbg-molhiv \
    --dataset-source ogb \
    --task graph_prediction_with_flag \
    --criterion binary_logloss_with_flag \
    --arch graphormer_base \
    --num-classes 1 \
    --attention-dropout 0.1 --act-dropout 0.1 --dropout 0.0 \
    --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-8 --weight-decay 0.0 \
    --lr-scheduler polynomial_decay --power 1 --warmup-updates $warmup_updates --total-num-update $tot_updates \
    --lr 2e-4 --end-learning-rate 1e-9 \
    --batch-size $batch_size \
    --fp16 \
    --data-buffer-size 20 \
    --encoder-layers 12 \
    --encoder-embed-dim 768 \
    --encoder-ffn-embed-dim 768 \
    --encoder-attention-heads 32 \
    --max-epoch $max_epoch \
    --save-dir ./ckpts \
    --pretrained-model-name pcqm4mv1_graphormer_base \
    --flag-m 3 \
    --flag-step-size 0.01 \
    --flag-mag 0 \
    --seed 1 \
    --pre-layernorm
```

After fine-tuning, use `graphormer/evaluate/evaluate.py` to evaluate the performance of all checkpoints:

```
python evaluate.py \
    --user-dir ../../graphormer \
    --num-workers 16 \
    --ddp-backend=legacy_ddp \
    --dataset-name ogbg-molhiv \
    --dataset-source ogb \
    --task graph_prediction \
    --arch graphormer_base \
    --num-classes 1 \
    --batch-size 64 \
    --save-dir ../../examples/property_prediction/ckpts/ \
    --split test \
    --metric auc \
    --seed 1 \
    --pre-layernorm
```

#### **Training a New Model**

First, download the IS2RE train, validation, and test data in LMDB format:

```
cd examples/oc20/ && mkdir data && cd data/
wget -c https://dl.fbaipublicfiles.com/opencatalystproject/data/is2res_train_val_test_lmdbs.tar.gz && tar -xzvf is2res_train_val_test_lmdbs.tar.gz
```

Create a `ckpt` folder to save checkpoints during training:

```
cd ../ && mkdir ckpt/
```

Now we train a 48-layer `graphormer-3D` architecture, which has 4 blocks, each containing 12 Graphormer layers. The parameters are shared across blocks. The total number of training steps is 1 million, and the learning rate is warmed up over the first 10 thousand steps:
```
fairseq-train --user-dir ../../graphormer \
    ./data/is2res_train_val_test_lmdbs/data/is2re/all --valid-subset val_id,val_ood_ads,val_ood_cat,val_ood_both --best-checkpoint-metric loss \
    --num-workers 0 --ddp-backend=c10d \
    --task is2re --criterion mae_deltapos --arch graphormer3d_base \
    --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-6 --clip-norm $clip_norm \
    --lr-scheduler polynomial_decay --lr 3e-4 --warmup-updates 10000 --total-num-update 1000000 --batch-size 4 \
    --dropout 0.0 --attention-dropout 0.1 --weight-decay 0.001 --update-freq 1 --seed 1 \
    --fp16 --fp16-init-scale 4 --fp16-scale-window 256 --tensorboard-logdir ./tsbs \
    --embed-dim 768 --ffn-embed-dim 768 --attention-heads 48 \
    --max-update 1000000 --log-interval 100 --log-format simple \
    --save-interval-updates 5000 --validate-interval-updates 2500 --keep-interval-updates 30 --no-epoch-checkpoints \
    --save-dir ./ckpt --layers 12 --blocks 4 --required-batch-size-multiple 1 --node-loss-weight 15
```

### Datasets

#### **Supported Datasets**

Graphormer supports training with datasets from existing graph libraries. Users can easily use datasets from these libraries by specifying the `--dataset-source` and `--dataset-name` parameters. `--dataset-source` specifies the source library of the dataset and can be:

1. `dgl` for [DGL](https://docs.dgl.ai/)
2. `pyg` for [Pytorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/)
3. `ogb` for [OGB](https://ogb.stanford.edu/)

#### **Customized Datasets**

Users may also create their own datasets. To use a customized dataset:

1. Create a folder (for example, named `customized_dataset`) and a Python script with an arbitrary name inside the folder.

2. In the created Python script, define a function which returns the dataset, and register the function with `register_dataset`. Here is a sample Python script, which defines a [QM9](https://docs.dgl.ai/en/0.6.x/api/python/dgl.data.html#qm9-datase) dataset from `dgl` with a customized split.

   ```python
   from graphormer.data import register_dataset
   from dgl.data import QM9
   import numpy as np
   from sklearn.model_selection import train_test_split

   @register_dataset("customized_qm9_dataset")
   def create_customized_dataset():
       dataset = QM9(label_keys=["mu"])
       num_graphs = len(dataset)

       # customized dataset split
       train_valid_idx, test_idx = train_test_split(
           np.arange(num_graphs), test_size=num_graphs // 10, random_state=0
       )
       train_idx, valid_idx = train_test_split(
           train_valid_idx, test_size=num_graphs // 5, random_state=0
       )
       return {
           "dataset": dataset,
           "train_idx": train_idx,
           "valid_idx": valid_idx,
           "test_idx": test_idx,
           "source": "dgl",
       }
   ```

   The function returns a dictionary. In the dictionary, `dataset` is the dataset object and `train_idx` contains the graph indices used for training; `valid_idx` and `test_idx` are defined similarly. Finally, `source` records the underlying graph library used by the dataset.

3. When training, specify `--user-data-dir` as `customized_dataset` and set `--dataset-name` to `customized_qm9_dataset`. Note that `--user-data-dir` should not be used together with `--dataset-source`. All datasets defined in all Python scripts under `customized_dataset` will be registered automatically.
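
For comparison, here is a hypothetical sketch (not from the Graphormer documentation) of how an analogous customized dataset might be registered using PyTorch Geometric instead of DGL. The dataset name, root path and split sizes are illustrative, and setting `source` to `"pyg"` is an assumption based on the supported dataset sources listed above; check `graphormer/data/pyg_datasets` in the repository before relying on it.

```python
from graphormer.data import register_dataset
from torch_geometric.datasets import QM9   # PyG version of QM9
import numpy as np
from sklearn.model_selection import train_test_split

@register_dataset("customized_pyg_qm9_dataset")   # hypothetical dataset name
def create_customized_pyg_dataset():
    # Download/load QM9 with PyTorch Geometric; the root path is illustrative.
    dataset = QM9(root="./dataset/qm9_pyg")
    num_graphs = len(dataset)

    # Roughly 80/10/10 split, analogous to the DGL example above.
    train_valid_idx, test_idx = train_test_split(
        np.arange(num_graphs), test_size=num_graphs // 10, random_state=0
    )
    train_idx, valid_idx = train_test_split(
        train_valid_idx, test_size=num_graphs // 10, random_state=0
    )
    return {
        "dataset": dataset,
        "train_idx": train_idx,
        "valid_idx": valid_idx,
        "test_idx": test_idx,
        "source": "pyg",   # assumption: marks the dataset as a PyTorch Geometric dataset
    }
```

As with the DGL example, the script would live in the `customized_dataset` folder and be selected at training time via `--user-data-dir` and `--dataset-name`.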