# Inductive Representation Learning on Large Graphs

https://arxiv.org/abs/1706.02216

## Abstract and introduction

This is one of the early foundational works in the domain of graph neural networks; it proposes a technique called GraphSAGE. Previous work on learning representations for objects in a graph has a major limitation: the learned embeddings are shallow and inherently transductive, i.e. they do not generalize to nodes not observed during training. GraphSAGE overcomes these limitations by learning a function that generates representations by sampling and aggregating features from a node's local neighbourhood.

## Visual illustration of GraphSAGE

![](https://i.imgur.com/1almHgQ.png)

First we sample the neighbourhood of the node of interest. For example, if we choose the red node in (1) of the figure above, we first take its 1-hop neighbours, followed by its 2-hop neighbours, shown by the concentric circles. Within the 1-hop neighbours we sample some of the nodes, and then we sample some nodes from the 2-hop neighbourhood. This gives a computational graph over the selected nodes; any other sample of nodes will have its own distinct computational graph.

Instead of training a distinct embedding vector for each node, we train a set of aggregator functions that learn to aggregate feature information from a node's local neighbourhood for message propagation. For example, in (2) of the figure above, each blue node accepts information from the green nodes adjacent to it and then propagates it to the red node, which ends up with the final embedding. The red and blue nodes each have an aggregator function attached to them, which specifies how to merge the incoming arrows of the same colour while propagating the information. The aggregator function should be order invariant. We can also place an intermediate layer on each edge that applies a linear transformation to the neighbour's earlier representation, so that we transform and then aggregate over the neighbourhood. The same process is repeated for all nodes, and the model is trained over minibatches of such sampled computational graphs. At test (inference) time, we use the trained system to generate embeddings for entirely unseen nodes by applying the learned aggregation functions.

The red node's embedding can be obtained by training in a supervised or an unsupervised fashion. In the supervised case, suppose the red node is labelled either spam or not spam; we can then use a cross-entropy loss, backpropagate, and train through each of the computational graphs. In the unsupervised case, we use the property that if two nodes are neighbours in the graph, they should also be close to each other in the embedding space, and we optimize a loss built on that idea.

## Embedding generation (forward propagation) algorithm

Here we assume that training is already done: the K aggregator functions are trained and their parameters are frozen. The aggregator function at depth k is denoted $\text{AGGREGATE}_k$ and the weight matrix at depth k is $W^k$. The pseudocode is as follows:

![](https://i.imgur.com/dwxoBWL.png)

The input is a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ together with a feature representation $x_v$ for every node, such as the node degree or a one-hot encoding of the node. The depth K determines how many hops deep we go into the network. The aggregator parameters at a given depth are shared across all nodes. The output is a dense vector representation for every node.
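As a concrete companion to the pseudocode, here is a minimal NumPy sketch of the forward pass, assuming a mean-style aggregator (one of the variants described later) and already-trained weight matrices. The names `graphsage_forward`, `features`, `neighbors`, and `weights` are illustrative, not from the paper's reference implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def graphsage_forward(nodes, features, neighbors, weights, K=2):
    """Sketch of Algorithm 1 with a mean-style aggregator.

    features[v] is the input vector x_v (the k = 0 representation),
    neighbors(v) returns the sampled neighbourhood N(v) (assumed to be a
    subset of `nodes`, i.e. we run over the whole graph as in Algorithm 1),
    and weights[k] is the trained matrix W^k applied to the concatenation
    [h_v^{k-1} ; h_{N(v)}^k].
    """
    h = {v: features[v] for v in nodes}          # h^0_v = x_v
    for k in range(1, K + 1):
        h_next = {}
        for v in nodes:
            # aggregate the depth k-1 representations of the neighbours
            h_neigh = np.mean([h[u] for u in neighbors(v)], axis=0)
            # concatenate with the node's own representation,
            # then apply W^k and a nonlinearity
            z = relu(weights[k] @ np.concatenate([h[v], h_neigh]))
            # L2-normalise, as in the last step of Algorithm 1
            h_next[v] = z / (np.linalg.norm(z) + 1e-8)
        h = h_next
    return h                                      # z_v = h^K_v for every v
```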
The maximum value of K is usually 2 or 3, since deeper levels can lead to redundancy and near-identical embeddings for every node. Restricting the number of levels keeps the computational graph of each node more distinctive, leading to greater diversity in the learnt embeddings.

Each step in the outer loop of Algorithm 1 proceeds as follows, where k denotes the current step in the outer loop (or the depth of the search) and $h^k$ denotes a node's representation at this step. First, each node $v \in \mathcal{V}$ aggregates the representations of the nodes in its immediate neighbourhood, $\{h_u^{k-1}, \forall u \in \mathcal{N}(v)\}$, into a single vector $h^k_{\mathcal{N}(v)}$. Note that this aggregation step depends on the representations generated at the previous iteration of the outer loop (i.e. k-1), and the k = 0 ("base case") representations are defined as the input node features. After aggregating the neighbouring feature vectors, GraphSAGE concatenates the node's current representation, $h^{k-1}_v$, with the aggregated neighbourhood vector, $h^k_{\mathcal{N}(v)}$, and this concatenated vector is fed through a fully connected layer with a nonlinear activation function, which transforms the representations to be used at the next step of the algorithm (i.e. $\{h^k_v, \forall v \in \mathcal{V}\}$). The final representations output at depth K are $z_v = h^K_v, \forall v \in \mathcal{V}$.

For example, suppose node v has 3 neighbours. The representation of v at depth k is built from its neighbours' representations at depth k-1 together with v's own representation at depth k-1: the aggregator function combines the neighbours' representations, we concatenate the result with the node's own representation, and then apply a linear transformation followed by a non-linearity. This is the forward-propagation algorithm; the learnable parameters are the matrices $W^k$, and the aggregator functions themselves can also be learnt. If we additionally apply a linear transformation to each neighbouring node's representation before aggregating, that introduces one more set of parameters to learn.

To extend Algorithm 1 to the minibatch setting, given a set of input nodes, we first forward-sample the required neighbourhood sets (up to depth K) and then run the inner loop, but instead of iterating over all nodes, we compute only the representations that are necessary to satisfy the recursion at each depth.

## Learning the parameters of GraphSAGE

A graph-based loss function is applied to the output representations $z_u$:

![](https://i.imgur.com/MwAEADC.png)

where v is a node that co-occurs near u on a fixed-length random walk, $\sigma$ is the sigmoid function, $P_n$ is a negative sampling distribution, and Q defines the number of negative samples. The weight matrices $W^k$ and the parameters of the aggregator functions are tuned using stochastic gradient descent. This graph-based loss encourages nearby nodes to have similar representations while enforcing that the representations of disparate nodes are highly distinct. It has two major terms: call the log term on the left of the right-hand side T1, and the expectation term T2.

The authors propose both supervised and unsupervised approaches to learning the parameters. In the unsupervised case, the idea is that if two nodes co-occur within a certain window of a random walk, they should also be close to each other in the embedding space, which gives the objective above.
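To make the two terms concrete, here is a minimal NumPy sketch of this unsupervised objective for a single positive pair $(z_u, z_v)$ and a set of drawn negative samples; the expectation in T2 is approximated by a mean over those samples. The function and argument names are illustrative, not from the paper's reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unsupervised_loss(z_u, z_v, z_negatives, Q):
    """Sketch of the graph-based loss for one positive pair (z_u, z_v).

    z_negatives is a list of embeddings of nodes drawn from the negative
    sampling distribution P_n; Q weights the negative term.
    """
    positive_term = -np.log(sigmoid(np.dot(z_u, z_v)))              # T1
    negative_term = -np.mean([np.log(sigmoid(-np.dot(z_u, z_n)))    # T2, with the
                              for z_n in z_negatives])              # expectation as a mean
    return positive_term + Q * negative_term
```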
In the supervised case, we can simply replace the loss above with a cross-entropy loss, e.g. for two classes $J = -t\log(p) - (1-t)\log(1-p)$, where p is the predicted probability of the positive class and t is the ground-truth label.

For T1, u and v are nodes that co-occur on a fixed-length random walk. If the dot product of their embeddings is high (meaning the nodes are very similar), the sigmoid is close to 1 and its log is close to 0, so this term contributes little loss. For T2, we take an expectation over negative samples, i.e. nodes that do not co-occur with u on such walks, and compute their dot product with $z_u$. Since these are distant nodes, they should not be close in the embedding space, so the dot product should be strongly negative; negating it gives a large positive value, whose sigmoid is again close to 1 and whose log is close to 0. In this way T2 pushes apart the representations of nodes that are far apart in the graph and dissimilar in the embedding space. The Q in the equation is the number of negative samples considered when computing this loss, and multiplying T2 by Q effectively gives the negative term a higher weight.

## Aggregator architectures

### Mean aggregator

![](https://i.imgur.com/KnubJD9.png)

Here $h_u^{k-1}$ is the representation of a neighbouring node at the previous depth and $h_v^{k-1}$ is the representation of the central node at the previous depth. We take the mean of these vectors, apply a linear transformation, and pass the result through a non-linearity to get the new representation. The concatenation step is skipped when this aggregator is used.

### LSTM aggregator

Earlier we mentioned that the aggregator function should be order invariant. LSTMs, however, are inherently sequential models that capture temporal patterns. To use an LSTM on this unordered set, the authors feed it random permutations of the node's neighbours during training, pushing it towards behaving symmetrically.

### Pooling aggregator

In the pooling approach, each neighbour's vector is independently fed through a fully connected neural network; after this transformation, an elementwise max-pooling operation aggregates information across the neighbour set:

![](https://i.imgur.com/N6kmcS9.png)

This aggregator is both symmetric and trainable. It introduces an extra parameter ($W_{pool}$), or a small network, that transforms each neighbour's representation, after which we take an elementwise maximum over the transformed representations. The pooled vector is then concatenated with the node's current representation, followed by a linear transformation and a non-linear mapping to obtain the final representation.
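For completeness, here are minimal NumPy sketches of the mean and pooling aggregators described above, assuming the trained parameters `W_mean`, `W_pool`, and `b_pool` are arrays of compatible shape (the names are illustrative).

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mean_aggregator(h_self, h_neighbors, W_mean):
    # mean of the node's own vector and its neighbours' vectors, followed by
    # a linear transform and a nonlinearity (no concatenation in this variant)
    h_mean = np.mean(np.vstack([h_self] + list(h_neighbors)), axis=0)
    return relu(W_mean @ h_mean)

def pooling_aggregator(h_neighbors, W_pool, b_pool):
    # each neighbour vector is passed through a fully connected layer,
    # then an elementwise max is taken across the transformed vectors
    transformed = [relu(W_pool @ h_u + b_pool) for h_u in h_neighbors]
    return np.max(np.vstack(transformed), axis=0)
```

The pooled vector from `pooling_aggregator` would then be concatenated with $h_v^{k-1}$ and passed through $W^k$, as in the forward-pass sketch earlier.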