# [Semi-supervised classification with Graph Convolutional Networks](https://openreview.net/pdf?id=SJU4ayYgl)

###### tags: `paper`

[PPT](https://onedrive.live.com/view.aspx?resid=110098534F5A2124!180&ithint=file%2cpptx&authkey=!AHsDdxUxJwNHlEk) [Word](https://onedrive.live.com/view.aspx?resid=110098534F5A2124!182&ithint=file%2cdocx&authkey=!AL8UOcKAy8-EIpI)

## Abstract
1. Scalable approach for semi-supervised learning that makes use of graph edges
2. Works on graph-structured data
3. Efficient variant of CNNs operating directly on graphs
4. Motivated by a localized first-order approximation of spectral graph convolutions
5. Encodes both local graph structure and node features

## Introduction
Problem: classifying nodes in a graph $\to$ e.g. classifying documents in a citation network, where labels are available only for a small subset of nodes (only a small fraction of documents carry a class label) $\to$ graph-based semi-supervised learning.

Graph-based semi-supervised learning roughly falls into two families: methods based on graph-based regularization and methods based on graph embeddings.

* The former assumes that connected nodes tend to share the same label, and adds a graph Laplacian regularization term to the loss function:
![](https://i.imgur.com/W8X9qvW.png)
From the adjacency matrix in (1) we can see that this loss assumes adjacent nodes have similar properties and therefore the same label, but the assumption also limits the model's flexibility to some extent.
* The latter is embedding-based. Embedding-based models are trained in two stages: first an embedding is learned, then a classifier; each stage is optimized separately, so training cannot be done end-to-end.

The goal of this paper is to apply convolution directly to the graph structure, dropping the explicit graph Laplacian regularization assumption and avoiding a multi-step pipeline.

## Graph Convolutional Network
![](https://i.imgur.com/DhQDGE0.png)

Equation (2) is the forward propagation rule of a GCN. The only trainable parameters in the network are the per-layer weight matrices $W$ (the example in Section 3 of the paper trains them with gradient descent), the size of each layer is fixed, and $H^{(0)}$ is the original feature of every node (the feature matrix $X$). A minimal sketch of this rule is given after the comment thread below.

:::info
<font color=#000000>

$\color{black}{Def}$

$\color{black}{\tilde{A} = A + I_N}$

$L = D - A$

$L^{\text{sym}}:=D^{-{\frac {1}{2}}}LD^{-{\frac {1}{2}}}=I-D^{-{\frac {1}{2}}}AD^{-{\frac {1}{2}}}$

</font>
:::

>[color=#ff00ff]
>
>$\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ is the ~~standardized~~ normalized Laplacian matrix. Seen from the perspective of the ordinary Laplacian matrix ($L = D - A$), what each GCN layer does is essentially compute how much information each node adds to itself and how much it passes out to the other nodes.
>
>Purely a personal interpretation; comments welcome.
>[name=Neko]

>[color=#00A000]
>
>According to the [wiki](https://en.wikipedia.org/wiki/Laplacian_matrix) entry:
>$L^{\text{sym}}:=D^{-{\frac {1}{2}}}LD^{-{\frac {1}{2}}}=I-D^{-{\frac {1}{2}}}AD^{-{\frac {1}{2}}}$
>is the normalized Laplacian matrix.
>
>[A more detailed introduction?](https://zhuanlan.zhihu.com/p/81502804)
>[name=Omnom]
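To make the propagation rule in (2) concrete, here is a minimal NumPy sketch of a single GCN layer using the renormalized $\hat A = \tilde D^{-\frac{1}{2}}\tilde A\tilde D^{-\frac{1}{2}}$ defined above. The function name `gcn_layer`, the ReLU activation and the toy 4-node path graph are illustrative assumptions, not code from the paper.

```python
import numpy as np

def gcn_layer(A, H, W, activation=lambda x: np.maximum(x, 0)):
    """One GCN layer, Eq. (2): H' = sigma(D~^{-1/2} A~ D~^{-1/2} H W).

    A: (N, N) adjacency matrix, H: (N, F_in) node features,
    W: (F_in, F_out) trainable weights. Default activation is ReLU.
    """
    N = A.shape[0]
    A_tilde = A + np.eye(N)                       # A~ = A + I_N (add self-loops)
    d_tilde = A_tilde.sum(axis=1)                 # degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))  # D~^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetrically normalized A~
    return activation(A_hat @ H @ W)

# Toy usage: a 4-node path graph, 3-dimensional input features, 2 output channels.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H0 = rng.normal(size=(4, 3))   # H^(0) = X, the raw node features
W0 = rng.normal(size=(3, 2))   # weights of the first layer
print(gcn_layer(A, H0, W0).shape)  # (4, 2)
```

In practice $\hat A$ would be precomputed once (typically as a sparse matrix) and reused by every layer, since it does not depend on the trainable weights $W$.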
### Origin of (2)
1. **Spectral Graph Convolution**

The spectral view of a graph corresponds to an eigendecomposition of the graph Laplacian.
![](https://i.imgur.com/qGJoaZI.png)
If we think of the graph Laplacian as a kind of normalized version of the adjacency matrix $A$, then the eigendecomposition is like taking the graph apart. For our purposes it is enough to regard "spectral" as decomposing objects such as signals or graphs into small components.

>[color=#00a000]
>
>An introduction to the [Laplacian Matrix](http://www.csie.ntnu.edu.tw/~u91029/Graph3.html#2)
>It has a Kirchhoff's-law feel to it
>[name=Omnom]

2. **Chebyshev graph convolution**

The weakness of spectral graph convolution is speed: computing the eigendecomposition of the matrix is expensive in both time and memory, so Defferrard and Hammond use a Chebyshev graph convolution to mitigate this.
* Chebyshev polynomials (first kind):
<font color=#000000>
$T_0(x) = 1$
$T_1(x) = x$
$T_{n+1}(x) = 2xT_n(x)-T_{n-1}(x)$
</font>
* $$g_{\theta'}\star x \approx \sum_{k=0}^K \theta'_k T_k(\tilde L)x \qquad (5)$$
* $$\tilde L=\frac{2}{\lambda_{max}}L-I_N$$

>[color=#00a000]
>
>About the form of $\tilde L$...
>
>The normalized Laplacian matrix of a connected graph has the property $0 \le \lambda \le 2$.
>
>$\tilde \Lambda$ takes the form above in order to scale every $\lambda$ into $[-1, 1]$, matching the interval on which the Chebyshev polynomial approximation is defined.
>[name=Omnom]

Roughly speaking, taking K = 1 amounts to feeding $X$ itself into the network, K = 2 corresponds to $LX+X$, and K = 3 to $L^2X+LX+X$; powers of the adjacency matrix have exactly this [property](https://en.wikipedia.org/wiki/Adjacency_matrix#Matrix_powers). (A small sketch of the Chebyshev recurrence is given at the end of this section.)

3. **GCN**

Chebyshev graph convolution still has an issue: too many trainable parameters, as a comparison of the K = 1, 2, 3 cases above shows. Graph datasets are usually not large, so an ill-chosen design easily leads to overfitting.

The paper therefore stacks multiple convolutional layers, each of which is the K = 1 Chebyshev graph convolution, keeping every individual layer linear (in practice an activation function such as ReLU is still added so the model retains non-linearity).
* **Renormalization trick**: alleviates numerical instabilities and exploding/vanishing gradients
![](https://i.imgur.com/Woq4fcL.png)
* general formula
![](https://i.imgur.com/xUVDEy1.png)
![](https://i.imgur.com/emzRfy5.png)

>[color=#00a000]
>
>[Excerpt](https://archive.siam.org/books/ot99/OT99SampleChapter.pdf)
>Chebyshev polynomials form a special class of polynomials especially suited for approximating other functions. They are widely used in many areas of numerical analysis: uniform approximation, least-squares approximation, numerical solution of ordinary and partial differential equations (the so-called spectral or pseudospectral methods), and so on.
>[name=Omnom]

>[color=#00a000]
>
>[Excerpt](https://zh.wikipedia.org/wiki/切比雪夫多项式)
>The roots of the Chebyshev polynomials of the first kind (known as Chebyshev nodes) can be used for polynomial interpolation. The resulting interpolating polynomial minimizes the Runge phenomenon and provides the best uniform polynomial approximation of a continuous function.
>
>[Runge's phenomenon](https://blog.csdn.net/qq_39521554/article/details/79835492)
>[name=Omnom]
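Below is a minimal NumPy sketch of the Chebyshev approximation in Eq. (5), computing $\sum_k \theta'_k T_k(\tilde L)x$ via the recurrence $T_{k+1}(x)=2xT_k(x)-T_{k-1}(x)$, so no eigendecomposition is needed. The helper names (`normalized_laplacian`, `chebyshev_filter`), the choice $\lambda_{max}=2$, the toy graph and the coefficient values are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def normalized_laplacian(A):
    """L_sym = I - D^{-1/2} A D^{-1/2}; assumes no isolated nodes."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt

def chebyshev_filter(A, x, theta, lam_max=2.0):
    """Approximate g_theta * x by sum_k theta_k T_k(L~) x, as in Eq. (5).

    Uses T_0(x)=1, T_1(x)=x, T_{k+1}(x)=2x T_k(x)-T_{k-1}(x),
    with L~ = (2/lambda_max) L - I_N so the spectrum lies in [-1, 1].
    """
    L = normalized_laplacian(A)
    L_tilde = (2.0 / lam_max) * L - np.eye(A.shape[0])
    Tx_prev, Tx_curr = x, L_tilde @ x              # T_0(L~)x and T_1(L~)x
    out = theta[0] * Tx_prev
    if len(theta) > 1:
        out = out + theta[1] * Tx_curr
    for k in range(2, len(theta)):
        Tx_next = 2 * L_tilde @ Tx_curr - Tx_prev  # Chebyshev recurrence
        out = out + theta[k] * Tx_next
        Tx_prev, Tx_curr = Tx_curr, Tx_next
    return out

# Toy usage on a 4-node path graph: a filter with three coefficients.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0, 0.0])   # one-hot signal on node 0
print(chebyshev_filter(A, x, theta=[0.5, 0.3, 0.2]))
```

Keeping only the first one or two coefficients recovers the low-order cases discussed above, which is exactly the first-order setting that each GCN layer builds on.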
## Experiments
### Datasets
![](https://i.imgur.com/Lppwbsa.png)
:::danger
empty
:::

### Baselines
* label propagation (LP)
* semi-supervised embedding (SemiEmb)
* manifold regularization (ManiReg)
* skip-gram based graph embeddings (DeepWalk)
* iterative classification algorithm (ICA)
* Planetoid

:::danger
empty
:::

### Result
![](https://i.imgur.com/OCmyTge.png)
* comparison of different propagation models
![](https://i.imgur.com/8714TWi.png)
* influence of model depth
![](https://i.imgur.com/eI4GEv0.png)

## Discussion
### Limitations & Possible Future work
#### Memory requirement
* Limits
    1. In the current setup with full-batch gradient descent, the memory requirement grows linearly in the size of the dataset
    2. For very large and densely connected graph datasets, further approximations might be necessary
* Future
    1. Mini-batch stochastic gradient descent can alleviate this issue

#### Directed edges and edge features
* Limits
    1. Does not naturally support edge features
    2. Limited to undirected graphs
* Future
    1. Possible to handle both directed edges and edge features by representing the original directed graph as an undirected bipartite graph with additional nodes that represent edges in the original graph

#### Assumptions
* Limits
    1. Locality
    2. Equal importance of self-connections vs. edges to neighboring nodes
* Future
    1. Introduce a trade-off parameter $\lambda$ in the definition of $\tilde{A}$, i.e. $\tilde{A} = A + \lambda I_N$, which plays a similar role as the trade-off parameter between supervised and unsupervised loss in the typical semi-supervised setting

## Conclusion
1. Novel approach for semi-supervised classification on graph-structured data
2. Efficient layer-wise propagation rule based on a first-order approximation of spectral convolutions on graphs
3. Capable of encoding both graph structure and node features in a way useful for semi-supervised classification