---
layout: distill
title: "Shallow Graph Embedding Methods vs. Graph Neural Networks: How and Why Do They Excel Differently?"
description: In this blog post, we systematically compare the advantages of shallow graph embedding methods and Graph Neural Networks, summarizing key published papers on graph representation learning. By highlighting crucial findings and insights, we aim to shed light on the understanding and future development of graph representation learning methods.
date: 2024-05-07
future: true
htmlwidgets: true
# anonymize when submitting
authors:
- name: Anonymous
# must be the exact same name as your blogpost
bibliography: 2024-05-07-seesaw.bib
# Add a table of contents to your post.
# - make sure that TOC names match the actual section names
# for hyperlinks within the post to work correctly.
toc:
  - name: "Graph Representation Learning: Overview"
  - name: Shallow Graph Embedding Methods vs. Graph Neural Networks
    subsections:
    - name: A Comparison Under A Unified View
    - name: Proficiencies of Graph Neural Networks
    - name: Proficiencies of Shallow Embedding Methods
  - name: Experiments
    subsections:
    - name: Performance over Different Attribute Dimensionalities
    - name: Performance over Different Heterophily Levels
    - name: Combining Their Strengths
  - name: "Conclusion: A Practitioners' Guide"
---
## Introduction
Graph Neural Networks (GNNs) have demonstrated significant proficiency in various graph learning tasks over recent years. Owing to their exemplary performance, GNNs have garnered increasing attention from both the research community and industrial practitioners. Consequently, there has been a notable transition away from the conventional and once prevalent shallow graph embedding methods. However, in tandem with this transition, an imperative question arises: do GNNs always outperform shallow embedding methods in node representation learning? Despite the doubts cast by multiple recent studies, the field of graph machine learning still lacks a systematic understanding of this question, which is essential for guiding its advancement. In this blog post, we explore existing graph representation learning papers, discuss their impact, and examine how this line of work contributes to the advancement of graph embedding learning.
## Graph Representation Learning: Overview
Graph-structured data is ubiquitous across a plethora of applications, including recommender systems<d-cite key="ying2018graph"></d-cite> <d-cite key="fan2019graph"></d-cite> <d-cite key="sankar2021graph"></d-cite> <d-cite key="tang2022friend"></d-cite>, predictive user behavior models <d-cite key="pal2020pinnersage"></d-cite> <d-cite key="zhao2021synergistic"></d-cite> <d-cite key="tang2020knowing"></d-cite>, and chemistry analysis <d-cite key="you2018graphrnn"></d-cite> <d-cite key="li2018learning"></d-cite>. To gain a deeper understanding of graph data and exploit the rich relational information, there has been a surge of interest in learning informative representations for graphs <d-cite key="hamilton2017inductive"></d-cite> <d-cite key="XuHLJ19"></d-cite>. We first introduce graph representation learning, where shallow graph embedding methods and Graph Neural Networks (GNNs) are two representative categories of approaches.
**What Is Graph Representation Learning?**
Graph representation learning typically refers to the learning process that encodes nodes, edges, or subgraphs as data points in a low-dimensional hidden space. Here, the primary goal of graph representation learning is to preserve as much task-relevant information of the graph (e.g., the proximity of nodes over the graph topology) as possible <d-cite key="hamilton2017inductive"></d-cite>. Such encoding is usually achieved by optimizing a mapping from the graph data to the low-dimensional hidden space based on available graph data. Once the mapping is optimized, the learned representations can serve as the input features to perform a wide spectrum of downstream tasks on graphs, such as node classification and link prediction.
In general, commonly used graph representation learning methods in practice can be divided into two categories, i.e., shallow graph embedding methods and deep graph learning methods. Below we present a brief introduction of their most representative learning schemes: walk-based shallow graph embedding methods and Graph Neural Networks (GNNs) <d-cite key="hamilton2017representation"></d-cite>.
**Shallow Graph Embedding Methods**
Shallow graph embedding methods are mostly characterized by using an embedding lookup table as the mapping from nodes to their representations <d-cite key="hamilton2017representation"></d-cite>. All entries in the embedding lookup table are considered as free parameters, which are optimized via objective functions based on random walks over the graph topology. For example, DeepWalk <d-cite key="perozzi2014deepwalk"></d-cite> and node2vec <d-cite key="grover2016node2vec"></d-cite> directly consider node representations as free parameters. These representations are optimized with a skip-gram model based on randomly generated walks.
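To make the lookup-table formulation concrete, below is a minimal PyTorch sketch of a walk-based shallow embedding method in the spirit of DeepWalk/node2vec (not the exact implementation of either): the embedding table is the full set of free parameters, and node pairs co-occurring within a window of a random walk are optimized with a skip-gram objective using negative sampling. The adjacency-list representation, hyperparameters, and function names are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

def random_walk(adj, start, length):
    """Unbiased random walk (DeepWalk-style); `adj` maps node id -> neighbor list."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = adj[walk[-1]]
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

def train_shallow_embeddings(adj, num_nodes, dim=128, walks_per_node=10,
                             walk_length=40, window=5, num_neg=5, lr=0.01):
    # The embedding lookup table itself is the full set of free parameters.
    emb = torch.nn.Embedding(num_nodes, dim)
    opt = torch.optim.Adam(emb.parameters(), lr=lr)
    for node in range(num_nodes):
        for _ in range(walks_per_node):
            walk = random_walk(adj, node, walk_length)
            for i, center in enumerate(walk):
                for ctx in walk[max(0, i - window): i + window + 1]:
                    if ctx == center:
                        continue
                    # Skip-gram with negative sampling: pull co-occurring
                    # node pairs together, push random pairs apart.
                    z_c = emb.weight[center]
                    z_pos = emb.weight[ctx]
                    z_neg = emb.weight[torch.randint(0, num_nodes, (num_neg,))]
                    loss = -F.logsigmoid(z_c @ z_pos) \
                           - F.logsigmoid(-(z_neg @ z_c)).sum()
                    opt.zero_grad()
                    loss.backward()
                    opt.step()
    return emb.weight.detach()

# Toy usage on a 4-node path graph (adjacency lists kept as a plain dict).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
embeddings = train_shallow_embeddings(adj, num_nodes=4, dim=16,
                                      walks_per_node=2, walk_length=10)
```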
**Graph Neural Networks (GNNs)**
Deep graph learning methods learn mappings from the node attribute space to the latent space based on the given graph topology. Among these methods, GNNs have emerged as one of the most popular learning schemes in recent years. Specifically, GNNs take node attributes and the graph topology as input and exploit both sources of information concurrently via neighborhood aggregation; representative examples include GCN <d-cite key="KipfW17"></d-cite> and GraphSAGE <d-cite key="hamilton2017inductive"></d-cite>. In practice, GNNs often show superior performance in node representation learning and power various tasks over attributed graphs, such as node classification, link prediction, and graph classification. This success has placed GNNs among the most popular graph representation learning methods, attracting increasing attention from researchers and practitioners in recent years.
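As a minimal illustration of neighborhood aggregation, the sketch below implements a simplified GCN-style layer (a sketch following the GCN formulation, not the exact code of the cited models): the adjacency matrix with self-loops is symmetrically normalized, each node's attributes are mixed with those of its neighbors, and a learned linear transformation follows. The dense adjacency matrix and toy dimensions are assumptions for illustration.

```python
import torch

class SimpleGCNLayer(torch.nn.Module):
    """One GCN-style layer: H' = act(D^{-1/2} (A + I) D^{-1/2} H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = torch.nn.Linear(in_dim, out_dim)

    def forward(self, adj, h):
        # Add self-loops and symmetrically normalize the adjacency matrix.
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        a_norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        # Neighborhood aggregation followed by a learned transformation.
        return torch.relu(self.linear(a_norm @ h))

# Toy usage: 4 nodes, 3-dimensional attributes, dense adjacency matrix.
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 1.],
                    [0., 1., 0., 1.],
                    [0., 1., 1., 0.]])
x = torch.randn(4, 3)
layer = SimpleGCNLayer(3, 8)
print(layer(adj, x).shape)  # torch.Size([4, 8])
```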
## Shallow Graph Embedding Methods vs. Graph Neural Networks
We now briefly introduce the proficiencies of GNNs and shallow graph embedding methods.
### A Comparison Under A Unified View
There has been a large body of existing work on graph representation learning, covering both shallow graph embedding methods (such as DeepWalk <d-cite key="perozzi2014deepwalk"></d-cite> and node2vec <d-cite key="grover2016node2vec"></d-cite>) and GNNs (such as GCN <d-cite key="KipfW17"></d-cite> and GraphSAGE <d-cite key="hamilton2017inductive"></d-cite>). Below we unify the pipelines of these two branches of methods from the perspective of a prior-posterior process, as shown in Figure 1.

<center style="font-size:20px;color:#C0C0C0;text-decoration:underline">Figure 1. A unified view of pipelines for graph representation learning.</center>
**(1) Difference in Learning Priors.** Consider the distribution of node representations in the hidden space after model initialization as the prior distribution. For shallow graph embedding methods, since the embedding matrix is randomly initialized, the distribution adopted for node embedding initialization (e.g., a uniform distribution) serves as the prior for representation learning. For GNNs, e.g., GCN and GraphSAGE, the node attributes transformed by the randomly initialized learnable parameters are regarded as the prior distribution of node representations in the hidden space; a toy sketch contrasting the two priors follows this comparison.
**(2) Difference in Optimization Operations.** For shallow graph embedding methods, the prior distribution is directly optimized with a walk-based objective function. For GNNs, the output node representations can be optimized with the same objective, but only after the layer-wise neighborhood aggregation is performed.
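To make the difference in priors tangible, the toy sketch below (all dimensions are arbitrary assumptions) compares the rank of the two initial representation matrices: a randomly initialized embedding table (the shallow prior) versus node attributes pushed through a randomly initialized linear map (a GNN-style prior). When the attribute dimensionality is smaller than the hidden size, the rank of the GNN-style prior cannot exceed the attribute dimensionality.

```python
import torch

num_nodes, attr_dim, hidden_dim = 1000, 8, 128

# Shallow prior: the embedding table is randomly initialized, so the prior
# representation matrix is full-rank almost surely.
shallow_prior = torch.randn(num_nodes, hidden_dim)

# GNN-style prior: node attributes pushed through a randomly initialized
# linear map; its rank is capped by the attribute dimensionality.
x = torch.randn(num_nodes, attr_dim)
w = torch.randn(attr_dim, hidden_dim)
gnn_prior = x @ w

print(torch.linalg.matrix_rank(shallow_prior))  # rank 128 (full, almost surely)
print(torch.linalg.matrix_rank(gnn_prior))      # rank at most 8
```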
### Proficiencies of Graph Neural Networks
**(1) Inductive Learning.** An optimized GNN model can generalize to nodes unseen during training and directly perform inference on them. Such an advantage naturally fits the needs of a plethora of real-world applications (e.g., the need for recommendation algorithms to accommodate new users), and thus GNNs have been widely adopted in industry.
**(2) Sublinear Parameter Count w.r.t. the Number of Nodes.** GNNs perform neighborhood aggregation to extract localized information for each node on which inference is performed. In this process, (1) the number of learnable parameters associated with each neighborhood aggregation step is fixed (determined by the input and output dimensionalities); and (2) the trajectories of neighborhood aggregation over the graph topology form a tree, so the total number of learnable parameters is linear w.r.t. the depth of such a tree. Therefore, the number of parameters is generally sublinear w.r.t. the number of nodes in most graphs; a back-of-the-envelope comparison is sketched after this list.
**(3) Utilizing Both Node Attributes and Topology.** GNNs are able to take both node attributes and graph topology as input. Correspondingly, they can extract information from both sources to learn the output representations. Such an advantage is especially beneficial for attributed graph data whose attributes encode rich information needed by downstream predictive tasks.
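The parameter-count argument above can be illustrated with a back-of-the-envelope sketch; the graph size and layer widths below are assumptions chosen purely for illustration.

```python
num_nodes = 1_000_000   # assumed graph size
attr_dim = 300          # assumed attribute dimensionality
hidden_dim = 128

# Shallow embedding method: one trainable vector per node.
shallow_params = num_nodes * hidden_dim

# Two-layer GNN (bias terms omitted): parameters depend only on layer widths,
# not on the number of nodes.
gnn_params = attr_dim * hidden_dim + hidden_dim * hidden_dim

print(f"shallow: {shallow_params:,} parameters")  # 128,000,000
print(f"GNN:     {gnn_params:,} parameters")      # 54,784
```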
### Proficiencies of Shallow Embedding Methods
It is worth noting that research attention has been shifting from shallow graph embedding methods to GNNs in recent years. Therefore, most existing works mainly focus on the advantages of GNNs, neglecting the potential proficiencies of shallow graph embedding methods. We present two advantages of shallow graph embedding methods below.
**(1) Handling Limited Attribute Dimensionalities.** Most walk-based shallow graph embedding methods can only take the graph topology as input, failing to take attribute information into consideration. However, compared with GNNs, this may also help shallow embedding methods avoid dimensional collapse in attribute-poor scenarios. Specifically, when the input attribute dimensionality is limited, GNNs may learn representations that collapse into a lower-dimensional subspace (instead of spanning the entire available hidden space), as in Figure 2. This could be a result of GNNs' low-rank prior, which severely hinders the fitting ability of representation learning. In contrast, shallow embedding methods may not suffer from this problem, since their prior is randomly initialized and thus is always full-rank.

<center style="font-size:20px;color:#C0C0C0;text-decoration:underline">Figure 2. An illustration of (a) normal representations in the hidden space vs. (b) representations after dimensional collapse.</center>
**(2) Handling Highly Heterophilic Nodes.** Recent studies have found that the neighborhood aggregation mechanism in GNNs may lead to a performance gap between highly homophilic and highly heterophilic nodes. In contrast, shallow embedding methods do not explicitly perform neighborhood aggregation. Therefore, such a design may help mitigate the performance gap between homophilic and heterophilic nodes.
## Experiments
The advantages of GNNs over shallow graph embedding methods have been well demonstrated by a series of existing works in this area. Here we present a brief empirical verification of the two advantages of shallow embedding methods discussed above.
The experimental verification in the original paper performed node representation learning for two different tasks: node classification and link prediction. Here we take node classification as an example, and we mainly present results on the *Amazon-Computers* dataset for the more detailed experiments. Observations are consistent across different tasks and datasets.
### Performance over Different Attribute Dimensionalities
First, Figures 3 and 4 show that when the input attribute dimensionality is limited, both the performance and the effective dimension ratio drop significantly across a series of real-world datasets. Here the effective dimension ratio measures the ratio of the dimensionality of the learned representations to the dimensionality of the hidden space, where the dimensionality of the learned representations is measured by the rank of the learned representation matrix (a small sketch of this computation is given after Figure 4). This verifies that limited input node attribute dimensionality leads to dimensional collapse.

<center style="font-size:20px;color:#C0C0C0;text-decoration:underline">Figure 3. Representation effective dimension ratios under different levels of attribute dimensionality.</center>

<center style="font-size:20px;color:#C0C0C0;text-decoration:underline">Figure 4. Node classification accuracy under different levels of attribute dimensionality.</center>
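One plausible way to compute the effective dimension ratio described above is to take the numerical rank of the learned representation matrix and divide it by the hidden dimensionality, as sketched below; the exact estimator used in the referenced experiments may differ.

```python
import torch

def effective_dimension_ratio(z: torch.Tensor) -> float:
    """Ratio of the numerical rank of the representation matrix `z`
    (shape: num_nodes x hidden_dim) to the hidden dimensionality."""
    rank = torch.linalg.matrix_rank(z).item()
    return rank / z.shape[1]

# Toy example: representations that collapse into a 4-dimensional subspace
# of a 64-dimensional hidden space.
z = torch.randn(500, 4) @ torch.randn(4, 64)
print(effective_dimension_ratio(z))  # 0.0625
```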
Second, to further examine whether the reduction in effective dimension ratio directly influences downstream task performance, performance is measured on the *Amazon-Computers* dataset under different rank bounds (of the representation matrix) and pre-set embedding dimensions with all node attributes available. It is observed in Figure 5 that the rank bound plays a critical role in determining the performance regardless of the pre-set embedding dimensions. This verifies that the level of dimensional collapse directly determines the performance.

<center style="font-size:20px;color:#C0C0C0;text-decoration:underline">Figure 5. Tendencies of performance under different rank bounds (enforced for embedding matrix).</center>
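How a rank bound can be enforced on an embedding matrix is not spelled out here; one natural sketch, which is an assumption rather than necessarily the procedure used in the referenced experiments, is to project the matrix onto its best rank-k approximation via a truncated SVD.

```python
import torch

def enforce_rank_bound(z: torch.Tensor, k: int) -> torch.Tensor:
    """Project a (num_nodes x dim) embedding matrix onto its best
    rank-k approximation using a truncated SVD."""
    u, s, vh = torch.linalg.svd(z, full_matrices=False)
    return u[:, :k] @ torch.diag(s[:k]) @ vh[:k, :]

z = torch.randn(500, 64)
z_bounded = enforce_rank_bound(z, k=8)
print(torch.linalg.matrix_rank(z_bounded))  # 8
```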
Third, a comparison between GNNs under different levels of available node attributes and a shallow graph embedding method on the *Amazon-Computers* dataset is presented below. Figure 6 shows that, compared with the shallow embedding method, the growth of the effective dimension of GNNs falls behind the growth of the rank bound (of the node representation matrix). This demonstrates that limited attribute dimensionality plays a critical role in causing dimensional collapse, while the shallow embedding method does not suffer from this problem.

<center style="font-size:20px;color:#C0C0C0;text-decoration:underline">Figure 6. Tendencies of effective dimension number under different rank bounds (enforced for embedding matrix).</center>
To summarize, GNNs are corroborated to suffer from dimensional collapse and thus may exhibit unsatisfying performance when the input attribute dimensionality is limited. Shallow embedding methods do not suffer from this problem, since node attributes are not used as the prior for node representation learning.
### Performance over Different Heterophily Levels
We then focus on the performance gap between highly homophilic and highly heterophilic nodes. Specifically, existing studies found that in many cases, the neighborhood aggregation mechanism helps GNNs achieve better performance on highly homophilic nodes, while it may hurt the performance on highly heterophilic nodes at the same time. This leads to a performance gap between highly homophilic and highly heterophilic nodes. To verify the influence of the neighborhood aggregation mechanism in GNNs, we examine the cumulative node classification performance (in accuracy) below for both shallow graph embedding methods and GNNs.
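As a preliminary, the per-node homophily level used to split nodes into highly homophilic and highly heterophilic groups can be computed as the fraction of a node's neighbors sharing its label; this standard definition is an assumption about the exact metric used here. A small sketch:

```python
import torch

def node_homophily(edge_index: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-node homophily: fraction of each node's neighbors sharing its label.
    `edge_index` has shape (2, num_edges); `labels` has shape (num_nodes,).
    For an undirected graph, include both directions in `edge_index`."""
    num_nodes = labels.shape[0]
    same = torch.zeros(num_nodes)
    deg = torch.zeros(num_nodes)
    for src, dst in edge_index.t().tolist():
        deg[src] += 1
        same[src] += float(labels[src] == labels[dst])
    return same / deg.clamp(min=1)

# Toy undirected graph: node 0 is homophilic, node 2 is heterophilic.
edge_index = torch.tensor([[0, 1, 0, 3, 2, 1, 2, 3],
                           [1, 0, 3, 0, 1, 2, 3, 2]])
labels = torch.tensor([0, 0, 1, 0])
print(node_homophily(edge_index, labels))  # tensor([1.0, 0.5, 0.0, 0.5])
```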
First, for shallow embedding methods, the vanilla shallow embedding method with and without neighborhood aggregation is compared on the *Amazon-Computers* dataset in Figure 7 below. It is found that, despite their similar overall performance, the vanilla shallow embedding method significantly outperforms its counterpart with neighborhood aggregation on highly heterophilic nodes. This reveals that when the neighborhood aggregation of GNNs is enforced on a shallow embedding method, its performance on highly heterophilic nodes also drops.

<center style="font-size:20px;color:#C0C0C0;text-decoration:underline">Figure 7. Shallow methods w/ vs. w/o neighborhood aggregation.</center>
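What "shallow embedding with neighborhood aggregation" could look like is sketched below; this is an assumption about the ablation rather than the exact setup of the referenced experiments. After the lookup table is trained, each node's representation is replaced by the mean of its own and its neighbors' embeddings before being passed to the downstream classifier.

```python
import torch

def mean_aggregate(adj: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """One round of mean neighborhood aggregation (with self-loops) applied
    to already-trained shallow embeddings `z` (num_nodes x dim)."""
    a_hat = adj + torch.eye(adj.size(0))
    return (a_hat @ z) / a_hat.sum(dim=1, keepdim=True)

# `z_shallow` would come from a trained lookup-table method, e.g. the
# `train_shallow_embeddings` sketch earlier in this post.
adj = torch.tensor([[0., 1.], [1., 0.]])
z_shallow = torch.randn(2, 16)
z_aggregated = mean_aggregate(adj, z_shallow)
```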
Second, for GNNs, the vanilla GNN with and without neighborhood aggregation is compared on the *Amazon-Computers* dataset in Figure 8 below. A similar observation can be made: despite the comparable overall performance, neighborhood aggregation tends to reduce the performance of GNNs on highly heterophilic nodes.

<center style="font-size:20px;color:#C0C0C0;text-decoration:underline">Figure 8. GNNs w/ vs. w/o neighborhood aggregation.</center>
To summarize, the experiments above reveal that shallow embedding methods may outperform GNNs on highly heterophilic nodes. The reason is that, empirically, the neighborhood aggregation mechanism can hurt the performance on highly heterophilic nodes, while shallow graph embedding methods do not explicitly perform such aggregation.
### Combining Their Strengths
Simply concatenating the representations from both models helps achieve satisfying performance across different attribute dimensionality ratios. We present a performance comparison between a shallow method, a GNN, and the strategy of concatenating the node representations yielded by the two methods on the *DBLPFull* dataset in Figure 9. We observe that the performance and effective dimension ratio of GNNs drop significantly when the attribute dimensionality ratio declines. However, by simply combining the representations yielded by the two methods, we are able to significantly reduce this sensitivity to the attribute dimensionality ratio, achieving more stable performance along both axes. We have similar observations on other datasets. Nevertheless, such a combination results in significantly higher computational complexity as well as the need to maintain two models instead of one, which is often much less preferred in industrial applications given cost and human resource concerns.

<center style="font-size:20px;color:#C0C0C0;text-decoration:underline">Figure 9. Comparison between GNNs, GNNs with concatenated representations from the shallow embedding method (denoted as "Concat"), and shallow embedding methods (denoted as "Shallow").</center>
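The combination strategy itself is straightforward: a minimal sketch of feature-wise concatenation is shown below, with all dimensions assumed for illustration; the concatenated matrix feeds the downstream classifier exactly like either representation alone.

```python
import torch

num_nodes = 500
z_gnn = torch.randn(num_nodes, 64)      # representations from a trained GNN
z_shallow = torch.randn(num_nodes, 64)  # representations from a trained shallow method

# Feature-wise concatenation; the result is passed to the downstream
# classifier (e.g., logistic regression or an MLP) like either input alone.
z_concat = torch.cat([z_gnn, z_shallow], dim=1)
print(z_concat.shape)  # torch.Size([500, 128])
```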
## Conclusion: A Practitioners' Guide
**Data Perspective - Attribute-Rich vs. Attribute-Poor Networks.** GNNs often achieve superior performance compared to shallow embedding methods in attribute-rich scenarios. Correspondingly, adopting GNNs for representation learning on attribute-rich networks is an obvious choice. Nevertheless, in attribute-poor scenarios, e.g., when the attribute dimensionality is limited, GNNs are prone to dimensional collapse, while shallow embedding methods do not bear such a drawback. Therefore, we recommend adopting GNNs on attribute-rich networks and shallow embedding methods on attribute-poor networks.
**Data Perspective - Homophilic vs. Heterophilic Networks.** Shallow embedding methods and GNNs exhibit different performance on nodes with different levels of heterophily. In particular, compared with shallow embedding methods, GNNs exhibit superior performance on homophilic nodes and inferior performance on heterophilic nodes. A preliminary reason is that explicitly performing neighborhood aggregation is helpful for homophilic nodes but harmful for heterophilic nodes. Therefore, GNNs are recommended for representation learning if the network data is homophilic; otherwise, shallow embedding methods could be more suitable.
**Model Perspective - Transductive vs. Inductive Settings.** Since shallow embedding methods rely on training an embedding vector for each node in the graph, they naturally do not support inductive learning. That is, given any newly appearing node, shallow embedding methods cannot produce its representation without retraining or at least fine-tuning the model. On the other hand, as feature-based models, GNNs are naturally inductive and can infer representations for newly appearing nodes <d-cite key="hamilton2017inductive"></d-cite>. Hence, for use cases where the graph is rapidly updating (e.g., social networks, e-commerce networks, etc.), GNNs are recommended given their inductive capability, whereas shallow embedding methods would require frequent and costly retraining; a toy sketch at the end of this guide illustrates the difference.
**Model Perspective - Low-parameter vs. High-parameter Settings.** Modern machine learning usually requires loading all learnable parameters into limited GPU memory to achieve higher training speed. The number of learnable parameters of shallow embedding methods grows linearly with the number of nodes. On the other hand, the parameter count of GNNs is proportional only to the dimensionality of node attributes (and hidden layers), not to the number of nodes. Therefore, for large graphs with many nodes, GPU training might not be feasible for shallow embedding methods without techniques such as model parallelism <d-cite key="paralleltorch"></d-cite>.
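To illustrate the transductive/inductive distinction above, the toy sketch below (all shapes and the mean-aggregation GNN stand-in are assumptions) shows how a feature-based GNN produces a representation for a newly appearing node with its existing weights, whereas a trained lookup table simply has no row for that node.

```python
import torch

in_dim, hidden_dim = 3, 8
w = torch.nn.Linear(in_dim, hidden_dim)  # stands in for trained GNN weights

def gnn_infer(adj, x):
    # Mean aggregation with self-loops followed by the shared transformation;
    # the weights are independent of which (or how many) nodes appear.
    a_hat = adj + torch.eye(adj.size(0))
    return torch.relu(w((a_hat @ x) / a_hat.sum(1, keepdim=True)))

# A new node (index 4) shows up with attributes and edges after training:
adj = torch.zeros(5, 5)
adj[0, 1] = adj[1, 0] = adj[4, 1] = adj[1, 4] = 1.0
x = torch.randn(5, in_dim)
z = gnn_infer(adj, x)   # representation for node 4 comes for free

# A shallow lookup table trained on 4 nodes has no entry for node 4:
emb = torch.nn.Embedding(4, hidden_dim)
# emb(torch.tensor(4)) would fail (index out of range) without retraining.
```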