# **GRAPH CONVOLUTIONAL NETWORKS (GCN/GCNN)**
> [Deepak Yadav](https://hackmd.io/profile)
>
- **Why Graph Convolutional Networks (GCN/GCNN)?**
- **Applications of GCNs**
- **What Are GCNs?**
- **How Do GCNs Work?**
- **A Simple Graph Example**
- **Conclusion**
## **Why Graph Convolutional Networks (GCN/GCNN)?**
So let’s get into the real deal. Looking around us, we can observe that most real-world datasets come in the form of graphs or networks: social networks, protein-interaction networks, the World Wide Web, and so on. This makes learning on graphs an exciting problem that can solve tonnes of domain-specific tasks and yield insightful information.
**But I still haven’t addressed the elephant in the room. WHY?**
To answer why, we first need to understand how a class of models like Convolutional Neural Networks (CNNs) works. CNNs are really powerful, and they have the capacity to learn from very high-dimensional data. Say you have a **512×512 pixel** image: the dimensionality here is approximately **1 million**. With even just 10 possible values per dimension, the space of such inputs is on the order of **10<sup>1,000,000</sup>** points, and CNNs have proven to work really well in such tough task settings!
But there is a catch! The data samples where CNN models are mostly used (images, videos, audio, etc.) all share a specific compositionality, which is one of the strong assumptions we make before using CNNs.
**So CNNs basically extract the compositional features and feed them to the classifier.**
---
**What do I mean by compositionality?**
The key properties of the assumption of compositionality are
- Locality
- Stationarity or Translation Invariance
- Multi-Scale: Learning Hierarchies of representations
---
**2D Convolution vs. Graph Convolution**
If you haven’t figured it out yet, not all types of data lie in Euclidean space. Graphs are one such data type, along with manifolds and 3D objects, which renders the 2D convolution above useless on them. Hence the need for GCNs, which have the ability to capture the inherent structure and topology of a given graph. And hence this blog :P.

---
## **Applications of GCNs**
One possible application of GCNs is Facebook’s friend-prediction algorithm. Consider three people A, B and C. Given that A is a friend of B and B is a friend of C, you may also have some representative information in the form of features about each person; for example, A may like movies starring Liam Neeson, and C is in general a fan of the thriller genre. The task is to predict whether A will become a friend of C.

---
## **What Are GCNs?**
As the name suggests, Graph Convolutional Networks (GCNs) draw on the idea of Convolutional Neural Networks, redefining them for the non-Euclidean data domain. A regular CNN, popularly used for image recognition, captures the surrounding information of each pixel in an image. Just as the convolution framework captures neighbourhood information for Euclidean data like images, here it aims to capture neighbourhood information in non-Euclidean spaces such as graph nodes.
> In other words, GCNs are a very powerful neural network architecture for machine learning on graphs. In fact, they are so powerful that even a randomly initialized 2-layer GCN can produce useful feature representations of nodes in networks. The figure below illustrates a 2-dimensional representation of each node in a network produced by such a GCN. Notice that the relative nearness of nodes in the network is preserved in the 2-dimensional representation even without any training.
A GCN is basically a neural network that operates on a graph. It will take a graph as an input and give some (we’ll see what exactly) meaningful output.
GCNs come in two different styles:
- **Spectral GCNs:** Spectral-based approaches define graph convolutions by introducing filters from the perspective of graph signal processing, based on graph spectral theory.
- **Spatial GCNs:** Spatial-based approaches formulate graph convolutions as aggregating feature information from neighbours.
**Note:** The spectral approach has the limitation that all graph samples must have the same structure, i.e. a homogeneous structure. This is a hard constraint, as most real-world graph datasets have different structures and sizes across samples, i.e. heterogeneous structure. The spatial approach is agnostic of the graph structure.
---
## **How Do GCNs Work?**
First, let’s work this out for the Friend Prediction problem and then we will generalize the approach.
**Problem Statement:** You are given N people and also a graph where there is an edge between two people if they are friends. You need to predict whether two people will become friends in the future or not.
A simple graph corresponding to this problem is:

Here persons (1,2) are friends; similarly, (2,3), (3,4), (4,1), (5,6), (6,8), (8,7) and (7,6) are also friends.
Now we are interested in finding out whether a given pair of people is likely to become friends in the future. Say the pair we are interested in is (1,3): since they have 2 common friends, we can loosely infer that they have a chance of becoming friends, whereas nodes (1,5) have no friends in common, so they are less likely to become friends.
Let’s take another example:

Here (1,11) are much more likely to become friends than say (3,11).
Now the question one can raise is: ‘How do we implement and achieve this result?’. GCNs implement it in a way similar to CNNs. In a CNN, we apply a filter on the original image to get the representation in the next layer. Similarly, in a GCN, we apply a filter that creates the next layer’s representation.
Mathematically, we can define it as follows:
- **H<sup>i</sup> =f(H<sup>i−1</sup>,A)**
A very simple example of **f** may be:
- **f(H<sup>i</sup>,A)=σ(AH<sup>i</sup>W<sup>i</sup>)**
where
- **A** is the **N×N** adjacency matrix
- **X** is the **N×F** input feature matrix, where **N** is the number of nodes and **F** is the number of input features for each node
- **σ** is the ReLU activation function
- **H<sup>0</sup> = X**; each layer **H<sup>i</sup>** corresponds to an **N×F<sup>i</sup>** feature matrix in which each row is a feature representation of a node
- **f** is the propagation rule
At each layer, these features are aggregated to form the next layer’s features using the propagation rule f. In this way, features become increasingly more abstract at each consecutive layer.
Yes, that is it: we already have a function that propagates information across the graph and can be trained in a semi-supervised way. Using the GCN layer, the representation of each node (each row) is now a sum of its neighbours’ features! In other words, the layer represents each node as an aggregate of its neighbourhood.
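As a concrete sketch of this propagation rule in `numpy`: the tiny 3-node path graph, features, and identity weights below are made up purely for illustration, not taken from the post.

```python
import numpy as np

def gcn_layer(A, H, W):
    # One propagation step: H_next = sigma(A @ H @ W), with sigma = ReLU
    return np.maximum(A @ H @ W, 0)

# Made-up 3-node path graph: 0 - 1 - 2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.array([[1., 0.],    # N x F feature matrix, one row per node
              [0., 1.],
              [1., 1.]])
W = np.eye(2)              # identity weights, so the effect of A is easy to see

print(gcn_layer(A, H, W))
# each output row is the sum of that node's neighbours' feature rows
```

With identity weights the result makes the aggregation visible: node 1’s row becomes the sum of the feature rows of nodes 0 and 2.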
**But wait, is it so simple?**
I’ll request you to stop for a moment here and think really hard about the function we just defined.
Is that correct?
**STOP**
.....
...
.
---
## **A Simple Graph Example**
As a simple example, we’ll use the following graph:

---
And below is its `numpy` adjacency matrix representation.

Next, we need features! We generate 2 integer features for every node based on its index. This makes it easy to confirm the matrix calculations manually later.
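The original figures showing the adjacency matrix and features did not survive, so the snippet below reconstructs a plausible version, assuming a small directed 4-node graph and index-based features of the form `[i, -i]` (both are assumptions for illustration):

```python
import numpy as np

# Hypothetical directed 4-node graph (the post's original figure is missing)
A = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 1., 0., 0.],
              [1., 0., 1., 0.]])

# Two integer features per node based on its index (one plausible choice)
X = np.array([[i, -i] for i in range(A.shape[0])], dtype=float)
print(X)
# [[ 0.  0.]
#  [ 1. -1.]
#  [ 2. -2.]
#  [ 3. -3.]]
```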

---
**Applying the Propagation Rule**
Alright! We now have a graph, its adjacency matrix **A** and a set of input features **X**. Let’s see what happens when we apply the propagation rule:

What happened? The representation of each node (each row) is now a sum of its neighbours’ features! In other words, the graph convolutional layer represents each node as an aggregate of its neighbourhood. I encourage you to check the calculation for yourself. Note that in this case a node *n* is a neighbour of node *v* if there exists an edge from *v* to *n*.
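Since the original code screenshot is missing, here is a hedged sketch of that step, assuming a hypothetical directed 4-node graph and index-based features:

```python
import numpy as np

# Hypothetical directed 4-node graph and index-based features (illustrative)
A = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 1., 0., 0.],
              [1., 0., 1., 0.]])
X = np.array([[i, -i] for i in range(4)], dtype=float)

# The bare propagation rule: weights and activation omitted
print(A @ X)
# row v sums the features of every node n with an edge v -> n
```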
---
**Problems on the Horizon!**
Sort of! But it is not exactly what we want. If you were unable to spot the problems, fret not. Let’s see exactly what ‘**problems**’ (yes, more than one) this function might lead to:
- The aggregated representation of a node does not include its **own features**! The representation is an aggregate of the features of neighbouring nodes, so only nodes that have a self-loop will include their own features in the aggregate.
- Nodes with large degrees will have large values in their feature representations, while nodes with small degrees will have small values. This can cause **vanishing or exploding gradients**; it is also problematic for stochastic gradient descent algorithms, which are typically used to train such networks and are sensitive to the scale (or range of values) of each input feature.
---
**Adding Self-Loops**
To address the first problem, one can simply add a self-loop to each node. In practice this is done by adding the identity matrix I to the adjacency matrix A before applying the propagation rule.

Since the node is now a neighbour of itself, the node’s own features are included when summing up the features of its neighbours!
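A sketch of the self-loop fix, again assuming the hypothetical 4-node directed graph used for illustration:

```python
import numpy as np

# Hypothetical directed 4-node graph and index-based features (illustrative)
A = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 1., 0., 0.],
              [1., 0., 1., 0.]])
X = np.array([[i, -i] for i in range(4)], dtype=float)

I = np.eye(A.shape[0])
A_hat = A + I                 # add a self-loop to every node
print(A_hat @ X)
# each node's own feature row is now part of its aggregate
```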
---
**Normalizing the Feature Representations**
The feature representations can be normalized by node degree by multiplying the adjacency matrix **A** with the inverse degree matrix **D⁻¹**. Thus our simplified propagation rule looks like this [1]:
- **f(X, A) = D⁻¹AX**
Let’s see what happens. We first compute the degree matrix.
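In `numpy`, sticking with the hypothetical 4-node directed graph used for illustration, the degree matrix is just the row sums (out-degrees) placed on a diagonal:

```python
import numpy as np

# Hypothetical directed 4-node graph (illustrative)
A = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 1., 0., 0.],
              [1., 0., 1., 0.]])

# Degree matrix: each node's out-degree on the diagonal
D = np.diag(A.sum(axis=1))
print(D)
# [[1. 0. 0. 0.]
#  [0. 2. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 2.]]
```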

**NOTE:** Before applying the rule, let’s see what happens to the adjacency matrix after we transform it
---
**Before**

---
**After**

Observe that the weights (the values) in each row of the adjacency matrix have been divided by the degree of the node corresponding to the row. We apply the propagation rule with the transformed adjacency matrix

and get node representations corresponding to the mean of the features of neighboring nodes. This is because the weights in the (transformed) adjacency matrix correspond to weights in a weighted sum of the neighboring nodes’ features. Again, I encourage you to verify this observation for yourself.
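The whole normalization step can be sketched as follows, again assuming the hypothetical 4-node graph and index-based features used for illustration:

```python
import numpy as np

# Hypothetical directed 4-node graph and index-based features (illustrative)
A = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 1., 0., 0.],
              [1., 0., 1., 0.]])
X = np.array([[i, -i] for i in range(4)], dtype=float)
D = np.diag(A.sum(axis=1))

A_norm = np.linalg.inv(D) @ A   # each row of A divided by the node's degree
print(A_norm @ X)               # mean, not sum, of neighbours' features
```

For instance, node 3 has neighbours 0 and 2, so its representation becomes the average of their feature rows.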
---
**Putting it All Together**
We now combine the self-loop and normalization tips. In addition, we’ll reintroduce the weights and activation function that we previously discarded to simplify the discussion.
---
**Adding back the Weights**
First order of business is applying the weights. Note that here `D_hat` is the degree matrix of `A_hat = A + I`, i.e., the degree matrix of A with forced self-loops.

And if we want to reduce the dimensionality of the output feature representations, we can reduce the size of the weight matrix **W**:
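A sketch with the weights added back, using `A_hat = A + I` and its degree matrix `D_hat`; the graph, features, and weight values here are all illustrative assumptions:

```python
import numpy as np

# Hypothetical directed 4-node graph and index-based features (illustrative)
A = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 1., 0., 0.],
              [1., 0., 1., 0.]])
X = np.array([[i, -i] for i in range(4)], dtype=float)

A_hat = A + np.eye(4)
D_hat = np.diag(A_hat.sum(axis=1))   # degree matrix of A_hat

W = np.array([[1., -1.],             # arbitrary 2x2 weight matrix
              [-1., 1.]])
print(np.linalg.inv(D_hat) @ A_hat @ X @ W)

# To shrink the output dimensionality, use a thinner W, e.g. shape (2, 1)
W_small = np.array([[1.],
                    [-1.]])
print(np.linalg.inv(D_hat) @ A_hat @ X @ W_small)
```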

---
**Adding an Activation Function**
We choose to preserve the dimensionality of the feature representations and apply the ReLU activation function.

Voila! A complete hidden layer with adjacency matrix, input features, weights and activation function!
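The complete hidden layer can be sketched as a single function; as before, the 4-node graph, features, and weights are illustrative assumptions rather than the post’s original values:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def gcn_layer(A_hat, D_hat, X, W):
    # A complete hidden layer: self-loops, normalization, weights, activation
    return relu(np.linalg.inv(D_hat) @ A_hat @ X @ W)

# Hypothetical directed 4-node graph and index-based features (illustrative)
A = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 1., 0., 0.],
              [1., 0., 1., 0.]])
X = np.array([[i, -i] for i in range(4)], dtype=float)
A_hat = A + np.eye(4)
D_hat = np.diag(A_hat.sum(axis=1))
W = np.array([[1., -1.],
              [-1., 1.]])

print(gcn_layer(A_hat, D_hat, X, W))
# ReLU zeroes out the negative second column
```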
---
**Back to Reality**
Now, finally, we can apply a graph convolutional network on a real graph.

Just as in deep convolutional neural networks we often reach for well-known pre-built models such as ResNet50 or VGG, in the GCN setting there is a standard benchmark graph we can feed our data pipeline: `Zachary’s Karate Club`.
Zachary’s karate club is a commonly used social network where nodes represent members of a karate club and edges their mutual relations. While Zachary was studying the karate club, a conflict arose between the administrator and the instructor, which resulted in the club splitting in two. The figure below shows the graph representation of the network, with nodes labeled according to which part of the club they joined. The administrator and instructor are marked with ‘A’ and ‘I’, respectively.

**Building the GCN**
Now let us build the graph convolutional network. We won’t actually train the network, but simply initialize it at random to produce the feature representations we saw at the start of this post. We will use `networkx` which has a graph representation of the club easily available, and compute the `A_hat` and `D_hat` matrices.
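A sketch of that setup with `networkx` (note this uses `to_numpy_array`; the older `to_numpy_matrix` seen in some tutorials is deprecated):

```python
import numpy as np
import networkx as nx

zkc = nx.karate_club_graph()
order = sorted(zkc.nodes())

A = nx.to_numpy_array(zkc, nodelist=order)   # 34x34 adjacency matrix
I = np.eye(zkc.number_of_nodes())

A_hat = A + I                                # self-loops
D_hat = np.diag(A_hat.sum(axis=1))           # degree matrix of A_hat
```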

Next, we’ll initialize weights randomly.

Stack the GCN layers. We here use just the identity matrix as feature representation, that is, each node is represented as a one-hot encoded categorical variable.

We extract the feature representations.
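Putting the club example together (random weights, two stacked layers over identity one-hot features, and extraction of per-node representations), as a sketch in which the hidden sizes 4 and 2 and the seed are arbitrary choices:

```python
import numpy as np
import networkx as nx

# Rebuild the matrices so this snippet stands alone
zkc = nx.karate_club_graph()
n = zkc.number_of_nodes()
A_hat = nx.to_numpy_array(zkc, nodelist=sorted(zkc.nodes())) + np.eye(n)
D_hat = np.diag(A_hat.sum(axis=1))

def gcn_layer(A_hat, D_hat, X, W):
    return np.maximum(np.linalg.inv(D_hat) @ A_hat @ X @ W, 0)

rng = np.random.default_rng(0)
W_1 = rng.normal(size=(n, 4))    # hidden sizes 4 and 2 are arbitrary choices
W_2 = rng.normal(size=(4, 2))

H_1 = gcn_layer(A_hat, D_hat, np.eye(n), W_1)  # identity = one-hot node features
H_2 = gcn_layer(A_hat, D_hat, H_1, W_2)

# 2-D feature representation for each node
feature_representations = {node: H_2[node] for node in zkc.nodes()}
```

Plotting the two columns of `H_2` against each other gives a 2-D scatter of the nodes, which is how figures like the one described at the start of this post are produced.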

And voila! Feature representations that separate the communities in Zachary’s karate club quite well. And we haven’t even begun training yet!

I should note that for this example the randomly initialized weights were very likely to give 0 values on either the x- or the y-axis as a result of the ReLU function, so it took a few random initializations to produce the figure above.
---
## Simple Code of GCN
[Google Colab ](https://colab.research.google.com/drive/1NwcGZLqg0K3KJOfz8TqXvhfioZSoi8K4#scrollTo=y1IkGoMk-gq7)
---
## **Conclusion**
In this post I have given a basic introduction to graph convolutional networks, illustrated how the feature representation of a node at each layer of a GCN is based on an aggregate of its neighbourhood, and shown how powerful they are: even randomly initialized GCNs can separate the communities in Zachary’s Karate Club.