---
tags: projects, ERF
---
# Effective receptive field for Graph Networks
> We analyse the effective receptive field for GCN and self-attention layers.
> More specifically, we investigate whether the ratio $\frac{dy_i}{dx_i} / \frac{dy_i}{dx_j}$ affects the learning and the representation power of graph models.
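A minimal sketch of how this ratio could be probed with autograd, assuming a toy single-head dot-product self-attention layer on a fully connected graph (all shapes and names are illustrative, not the models used in the experiments below):
```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 8, 16                         # toy graph: n nodes with d features
W_q, W_k, W_v = (torch.randn(d, d) / d**0.5 for _ in range(3))

def self_attention(x):
    # plain single-head dot-product attention over a fully connected graph
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    att = F.softmax(q @ k.t() / d**0.5, dim=-1)
    return att @ v

x = torch.randn(n, d, requires_grad=True)
y = self_attention(x)

# Jacobian of node i's output w.r.t. all inputs: one backward pass per output channel
i, j = 0, 1
jac = torch.stack([torch.autograd.grad(y[i, c], x, retain_graph=True)[0]
                   for c in range(d)])           # shape (d, n, d)

dyi_dxi = jac[:, i].abs().mean()                 # "self" sensitivity  dy_i/dx_i
dyi_dxj = jac[:, j].abs().mean()                 # "cross" sensitivity dy_i/dx_j
print(f"ratio = {(dyi_dxi / dyi_dxj).item():.2f}")
```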
:::info
:pushpin: Ideas:
* this mainly happens in problems where the neighbourhood is very large
* possible issues:
* could affect the long range capabilities (representation power)
* could affect the feature learning in semi-supervised scenarios (optimisation)
:::
:::warning
:question: Questions:
* When do we encounter it? Always / with many neighbours / with certain types of neighbours?
* Does it hurt learning? Under what conditions? Always / heterophily / self-supervised / node-level / graph-level?
* What order of magnitude should the ratio have? What is it for GAT?
:::
## :memo: Related work
- [ ] [THE LIPSCHITZ CONSTANT OF SELF-ATTENTION](https://openreview.net/pdf?id=DHSNrGhAY7W)
*they apply a normalisation such that self-attention becomes Lipschitz continuous. The goal is to stop the gradients from exploding for GAT- or transformer-type models, especially when many layers are used. They show performance on long-range tasks, since they argue that this is where these problems appear.*
- [ ] [Improving the Long-Range Performance of Gated Graph Neural Networks.](https://arxiv.org/abs/2007.09668)
- [ ] Improving Breadth-Wise Backpropagation in Graph Neural Networks helps Learning Long-Range Dependencies (ICML2021)
- [ ] [STABILIZING TRANSFORMERS FOR REINFORCEMENT LEARNING](https://arxiv.org/pdf/1910.06764.pdf)
*moves the layer norm inside the residual branch and uses different gating mechanisms instead of the residual skip connection. Their motivations include the possibility to model the identity mapping and a more stable optimisation process. I think that the gating mechanism has an impact on the ERF; not sure in which direction (it depends mainly on the value of the gate) - see the sketch after this list.*
- [ ] [PairNorm: Tackling oversmoothing in GNNs]()
- [ ] [DeepGCNs: Can GCNs go as deep as CNNs?]
*not relevant for ERF. Adapts a series of concepts from CV (residual connections, dense connections and dilated convolutions) to graph networks so that the resulting architecture suffers less from a higher number of layers.*
- [ ] [ON THE BOTTLENECK OF GRAPH NEURAL NETWORKS AND ITS PRACTICAL IMPLICATIONS](https://arxiv.org/pdf/2006.05205.pdf)
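A minimal sketch of the two skip variants discussed in the GTrXL entry above: a pre-LN residual vs. a simplified sigmoid-gated skip (the paper uses a GRU-style gate; this is only meant to make explicit how the gate scales the identity path that feeds dy_i/dx_i):
```python
import torch
import torch.nn as nn

class PreLNResidual(nn.Module):
    """y = x + f(LN(x)): the identity path adds a full unit to dy_i/dx_i."""
    def __init__(self, d, f):
        super().__init__()
        self.norm, self.f = nn.LayerNorm(d), f

    def forward(self, x):
        return x + self.f(self.norm(x))

class GatedSkip(nn.Module):
    """y = (1 - g) * x + g * f(LN(x)): the identity path is scaled by (1 - g),
    so the value of the gate directly modulates the self/cross gradient ratio."""
    def __init__(self, d, f):
        super().__init__()
        self.norm, self.f = nn.LayerNorm(d), f
        self.gate = nn.Sequential(nn.Linear(d, d), nn.Sigmoid())

    def forward(self, x):
        g = self.gate(x)
        return (1 - g) * x + g * self.f(self.norm(x))
```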
[TODO] I think I had something related to this from an old review
## :memo: Theoretical findings
[TODO: write up the results + interpretations]
## :memo: Empirical findings
---
## Experiments 1.0
Try to replicate the theoretical findings on different datasets/setups:
### 1 Lipman:
[GLOBAL ATTENTION IMPROVES GRAPH NETWORKS GENERALIZATION](https://github.com/omri1348/LRGA) (Lipman iclr21)
*OGB-DDI:* link prediction, ~4000 nodes, fully connected graph
*OGB-COLLAB:* link prediction, ~220,000 nodes, fully connected graph
### 2 Bresson:
[A Generalization of Transformer Networks to Graphs](https://github.com/graphdeeplearning/graphtransformer)
*ZINC:* graph regression, small graphs, at most ~37 nodes I think
:warning: very small neighbourhoods (1-3 neighbours), ratio below 1. On the fully connected graph (10-30 nodes) it is above 1, but not very large. So where does the large global ratio come from?
:warning: Here the large ratio is actually caused by BN and the residual connection, not by the graph
```
1 complete layer without BN:
Layer 9: dyi/dxi: 0.015926280990242958
Layer 9: dyj/dxi: 0.002470437902957201
1 complete layer with BN
Layer 9: dyi/dxi: 0.004720521159470081
Layer 9: dyj/dxi: 2.767138994386187e-06
1 complete layer with BN, without residual
Layer 9: dyi/dxi: 1.500630514783552e-05
Layer 9: dyj/dxi: 1.4643245776824187e-05
1 complete layer with LN
Layer 9: dyi/dxi: 8.594810788054019e-05
Layer 9: dyj/dxi: 1.5473657185793854e-05
```
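A sketch of the kind of probe that produces numbers like the ones above, assuming a toy block of the form attention → optional residual → optional BatchNorm (`nn.MultiheadAttention` stands in for the graph-transformer layer; this is not the actual ZINC model):
```python
import torch
import torch.nn as nn

n, d, i, j = 20, 64, 0, 1
attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
norm = nn.BatchNorm1d(d)

def block(x, residual=True, batch_norm=True):
    # toy analogue of one graph-transformer block on a fully connected graph
    h = attn(x.unsqueeze(0), x.unsqueeze(0), x.unsqueeze(0))[0].squeeze(0)
    if residual:
        h = h + x
    if batch_norm:
        h = norm(h)                  # BatchNorm couples the nodes through batch statistics
    return h

def probe(residual, batch_norm):
    x = torch.randn(n, d, requires_grad=True)
    y = block(x, residual, batch_norm)
    jac = torch.stack([torch.autograd.grad(y[i, c], x, retain_graph=True)[0]
                       for c in range(d)])       # Jacobian of node i, shape (d, n, d)
    return jac[:, i].abs().mean().item(), jac[:, j].abs().mean().item()

for res, bn in [(True, False), (True, True), (False, True)]:
    dyi_dxi, dyj_dxi = probe(res, bn)
    print(f"residual={res} BN={bn}: dyi/dxi={dyi_dxi:.2e}  dyj/dxi={dyj_dxi:.2e}")
```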
*CLUSTER:* semi-supervised node classification, synthetic graphs with 40-190 nodes and very few labelled nodes: there is only one labelled node, randomly assigned to each community, and most node features are set to 0. [About dataset](https://arxiv.org/pdf/2003.00982.pdf)
:::danger
A large ratio can also appear in small graphs, but globally rather than per layer, caused by other components such as BN / the residual connection
:::
---
## Experiments 1.1 (01.04)
> :question: Is the ratio correlated with the number of nodes?
Find a set of datasets used to test the performance of transformers on long sequences: **Long-Range-Arena**
Code based on [paper](https://arxiv.org/pdf/2102.03902.pdf) and [github](https://github.com/mlpen/Nystromformer/tree/main/LRA)
4 datasets: listOps (operator evaluation), byte-level IMDb reviews text classification, byte-level document retrieval, pathfinder
1 dataset that we initially could not process (sequence-level image classification on CIFAR-10) - now fixed
Performance of the classical transformer architecture:

> Note: the computations are done for num_nodes=100 nodes taken from the first batch of each epoch
> ratio = E[dy_i/dx_i] / E[dy_i/dx_j]
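A sketch of how this expectation could be computed, assuming the per-node sensitivity is summarised by the mean absolute gradient of $\sum_c y_{i,c}$ (the function and model names are illustrative, not the actual code):
```python
import torch

def erf_ratio(model, x, num_nodes=100):
    """ratio = E_i[|dy_i/dx_i|] / E_{i!=j}[|dy_i/dx_j|] over `num_nodes` sampled positions.

    x: (seq_len, d) input of one example; `model` maps it to (seq_len, d) features.
    """
    x = x.detach().clone().requires_grad_(True)
    y = model(x)
    idx = torch.randperm(y.shape[0])[:num_nodes]
    self_terms, cross_terms = [], []
    for i in idx:
        g = torch.autograd.grad(y[i].sum(), x, retain_graph=True)[0].abs()   # (seq_len, d)
        self_terms.append(g[i].mean())
        cross_terms.append((g.sum(0) - g[i]).mean() / (g.shape[0] - 1))      # average over j != i
    return (torch.stack(self_terms).mean() / torch.stack(cross_terms).mean()).item()

# usage sketch: call on the first batch of every epoch, e.g.
# ratio = erf_ratio(encoder, first_batch[0])
```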
### ++1. Transformer with self-attention layer:++
| Dataset | #N| ratio SA0 init|ratio SA0 final|ratio SA1 init|ratio SA1 final| ratio M init | ratio M final |ratio T0 init|ratio T0 final| ratio T1 init|ratio T1 final|accuracy|
| -------- | --------- | -------- | --------- |----| ---|----| ---| --- | ---|--- | ---| ----|
| pathfinder | 390 | 15 |41| 5 | 87 | nan | 27 |37 |49|325 |240| 68 |
| text | 795 | 26 |34| 13 | 55| nan | 40 |60 |77 |470 |533| 63.3 |
| listops | 900 | 34 | 83| 10 | 107| nan | 90 |58 |116|562 |750| 38.4 |
| image | 1023 | 40 | 47 | 24 | 65| nan | 43 |89 |82 |920 |517| 26|
| retrieval | 4000 | 197 | 254| 74 | 168| nan | 131 |346 |357|3860 |1735| 69|
## Experiments 1.2 (14.04)
Replace each self-attention layer with GAT to observe whether the second-order attention impacts the ratio of gradients
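For reference, the two attention score functions being compared (standard single-head forms, no edge features); the dot-product score is bilinear ("second order") in $(x_i, x_j)$, while the GAT score is additive:

$$e^{\text{SA}}_{ij} = \frac{(W_Q x_i)^\top (W_K x_j)}{\sqrt{d}}, \qquad e^{\text{GAT}}_{ij} = \mathrm{LeakyReLU}\big(a^\top [W x_i \,\Vert\, W x_j]\big), \qquad \alpha_{ij} = \mathrm{softmax}_j(e_{ij})$$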
| Dataset | #N| ratio GAT0 init|ratio GAT0 final|ratio GAT1 init|ratio GAT1 final| ratio M init | ratio M final |ratio T0 init|ratio T0 final| ratio T1 init|ratio T1 final| accuracy |
| -------- | --------- | -------- | --------- |----| ---|----| ---| --- | ---|--- | ---| --- |
| pathfinder | 390 | 4 | 2 |3 |1|nan |12| 32 | 25| 201| 224|68 |
| text | 795 | 8 |6 |3 |6 |nan |27| 52 | 55| 433 | 27| 63.4 |
| listops | 900 | 11 | 18 |7 |1|nan |28| 79 | 57| 449 | 705| 37.5 (not finished) |
| image | 1023 | 14 | 10 |1 |6 |nan |28| 80 | 62 | 701| 425 | 34|
| retrieval | 4000 | 19 | 118 |14 |67| nan |79|350 | 277| 2780 | 1728| 73 |
## Experiments 1.2 (20.04)
##### Regarding the ratio = E[dy_i/dx_i] / E[dy_i/dx_j] when stacking multiple layers of transformers
*With residual connection:*
the ratio for each self-attention layer :arrow_down: with the depth
the ratio for each transformer layer :arrow_up: with the depth
the ratio when composing multiple layers :arrow_down:

*Without residual connection:*
the ratio for each self-attention layer :arrow_down: with the depth
the ratio for each transformer layer :arrow_down: with the depth
the ratio when composing multiple layers :arrow_down:

==:bulb: Observation:== Maybe the ratio between the self gradient and the other gradients (taken as an average) is not very relevant, since we expect that among all the tokens in the neighbourhood only a subset is relevant. If those have a big effect on the final prediction but all the irrelevant ones are zero (as expected), the mean of the "other" gradients would be smaller.
1. Maybe we want an ERF proportional to the correlation. This way, the only problematic term during backpropagation would be the additional sum in the gradient of the self node (dyi/dxi = normal_term + sum_term), where the sum comes from the correlation term. We could instead look at the ratio between sum_term and normal_term.
2. The residual connection has a big impact on the final gradient (much bigger than the gradient of the self-attention layer). Is a gated connection better in terms of receptive field? See the sketch below.
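To make points 1-2 concrete, a sketch of the decomposition for a single attention head $y_i = \sum_j \alpha_{ij} V x_j$ (no MLP, and the dependence of any gate on $x$ is ignored):

$$\frac{\partial y_i}{\partial x_i} = \underbrace{\alpha_{ii} V}_{\text{normal\_term}} + \underbrace{\sum_j (V x_j)\Big(\frac{\partial \alpha_{ij}}{\partial x_i}\Big)^{\top}}_{\text{sum\_term}}, \qquad \frac{\partial y_i}{\partial x_j} = \alpha_{ij} V + \sum_k (V x_k)\Big(\frac{\partial \alpha_{ik}}{\partial x_j}\Big)^{\top}$$

With a residual connection $y_i = x_i + f_i(x)$ we get $\frac{\partial y_i}{\partial x_i} = I + \frac{\partial f_i}{\partial x_i}$ but $\frac{\partial y_i}{\partial x_j} = \frac{\partial f_i}{\partial x_j}$, so the identity path inflates only the self term; with a gated skip $y_i = (1-g)\odot x_i + g\odot f_i(x)$, the identity contribution scales with $(1-g)$, so the ratio is controlled by the learned gate.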
### Comparisons MLP-GAT-SA-stop_grads
**LISTOPS**
Legend:
:green_heart: short - GAT
:green_heart: long - self-attention
:orange_heart: no residual
:blue_heart: MLP
:heart: stop gradients
 
**PATHFINDER**
:orange_heart: self-attention
:blue_heart: self-attention-re
:hash: GAT
:grey_exclamation: MLP
:heart: stop_gradients
 
**IMAGE**
:orange_heart: top GAT
:orange_heart: bottom MLP
:hash: stop_gradients
:grey_exclamation: top self-attention re
:grey_exclamation: bottom no residual
:heart: self-attention
 
**TEXT**
:blue_heart: top GAT
:blue_heart: middle stop_gradients
:blue_heart: bottom self-attention re3
:grey_exclamation: self-attention re
:grey_exclamation: self-attention
:green_heart: MLP
 
### Comparisons model complexity
**LISTOPS**
 
--
 
**PATHFINDER**
 
 
**IMAGE**
 
 
### Comparisons normalize gradients
**LISTOPS**

**PATHFINDER**

**TEXT**

**IMAGE**

## Experiments to check the variance
The longer experiments are the ones with the smaller batch size.
**LISTOPS | PATHFINDER | TEXT\* | IMAGE**
   
\* for TEXT it is a single set of experiments (1 value for batch size)
==:bulb: Observation== LISTOPS and TEXT do not have a big variance (max-min = 0.5-1% for LISTOPS and 1% for TEXT), but the variance is bigger for IMAGE and PATHFINDER (max-min for pathfinder: 6% for the big batch size and 2% for the small batch size; for image: 0% for the big batch size and 5% for the small batch size)
==:bulb: Observation 2== As a result, listops obtains similar results regardless of the batch size, but for image and pathfinder the performance changes (pathfinder: 74 vs 70, still running; image: 40.5 vs 41.5, still running)
Dataset sizes:
- listops: 96000
- pathfinder: 160000
- text: 25000
- image: 45000