    <span style="color:red">Updated Version 11/19/2023: </span> ## Reviewer 1 (VyuR) We sincerely appreciate the time and efforts you've dedicated to reviewing and providing invaluable feedback to enhance the quality of this paper. We provide a point-to-point reply below for the mentioned concerns and questions. --- > **Reviewer**: W1. Novelty is not good. The findings that GNNs face some challenges when we do not have enough attribute information (e.g.[1, 2]) and when we have heterophilic data (e.g.[3]) are already identified by other works. **Authors**: We agree with the reviewer that (1) existing works such as [1, 2] have pointed out that GNNs face challenges when we lack attribute information, and (2) GNNs also face challenges on heterophilic graphs [3]. However, we note that the main focus of this paper is to perform a comprehensive comparison between GNNs and shallow embedding methods. Especially, to the best of our knowledge, no existing work has comprehensively explored the strengths and weaknesses of GNNs compared with shallow methods with a unified view. Therefore, a comprehensive analysis is still needed to have an in-depth analysis of GNNs. For example, it is valuable to point out that dimensional collapse solely happens in GNNs instead of traditional shallow methods, and such a conclusion has not been revealed by any other existing works. [1] Wei, J., Goyal, M., Durrett, G., & Dillig, I. (2020). Lambdanet: Probabilistic type inference using graph neural networks. arXiv preprint arXiv:2005.02161. [2] Sun, Z., Zhang, W., Mou, L., Zhu, Q., Xiong, Y., & Zhang, L. (2022, June). Generalized equivariance and preferential labeling for gnn node classification. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 8, pp. 8395-8403). [3] Zhu, J., Yan, Y., Zhao, L., Heimann, M., Akoglu, L., & Koutra, D. (2020). Beyond homophily in graph neural networks: Current limitations and effective designs. Advances in neural information processing systems, 33, 7793-7804. --- > **Reviewer**: W2. Only empirical results are provided, there is no theoretical analysis or deep explanation regarding the empirical results, which makes this work less solid. **Authors**: We agree with the reviewer that the corresponding theoretical analysis regarding the comparative study between GNNs and shallow methods would be an interesting follow-up work. Nevertheless, this paper mainly focuses on an empirical comparison between GNNs and shallow methods. <span style="color:blue">[Tong: we can also emphasize a bit on the findings and practitioner's guide here. Of course theoretical analysis is def good to have, but the consistent observations are already pretty self-explainary and giving us valuable observations. Basically i think we need to argue on the usefulness of the analysis even though there's no theory supporting it </span> <u>[looking for further comments on this reply]</u> --- > **Reviewer**: W3. For the experiments, only Deepwalk is compared among all the shallow methods, and only homophilic datasets are used while some heterophilic datasets are missing (e.g. datasets in [3]). **Authors**: We thank the reviewer for pointing this out. We elaborate on the details corresponding to the two concerns below. First, the reason why DeepWalk is adopted is that DeepWalk is a representative example of walk-based shallow methods **in its design**. 

Second, we also perform experiments to study how the performance changes from highly heterophilic nodes to highly homophilic nodes. We first compare GNN (w/ neighborhood aggregation) against GNN w/o neighborhood aggregation. We present their cumulative node classification accuracy under different values of the homophily score below (the same setting as in Fig. 6 of our paper). We observe a similar tendency as presented in our paper: neighborhood aggregation jeopardizes the performance on highly heterophilic nodes while benefiting highly homophilic nodes.

| | 1e-3 | 5e-3 | 1e-2 | 5e-2 | 1e-1 | 5e-1 | 1e0 |
| -------------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| GNN (w/ Aggregation) | 26.88% | 26.88% | 26.88% | 26.73% | 25.44% | **36.64%** | **38.75%** |
| GNN w/o Aggregation | **30.10%** | **30.10%** | **30.10%** | **30.69%** | **26.48%** | 32.20% | 33.17% |

We also perform experiments to compare the shallow method w/ neighborhood aggregation against the shallow method w/o neighborhood aggregation, and the observations remain consistent.

| | 1e-3 | 5e-3 | 1e-2 | 5e-2 | 1e-1 | 5e-1 | 1e0 |
| ------------------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| Shallow w/ Aggregation | 21.86% | 21.86% | 21.86% | 23.23% | 26.35% | **30.36%** | **31.54%** |
| Shallow (w/o Aggregation) | **25.14%** | **25.14%** | **25.14%** | **26.26%** | **26.71%** | 29.95% | 31.44% |

In conclusion, we argue that, first, DeepWalk is **representative enough** to obtain generalizable experimental results, and adopting more follow-up methods that share a similar design with DeepWalk does not change the observations and conclusions; second, we also have **similar observations** on heterophilic datasets from both studied perspectives, and our analysis does not depend on whether the adopted datasets are homophilic or not. <span style="color:blue">[Tong: thank the reviewer for bringing this up so we can make our evaluation even more comprehensive. And also mention that we are including these experiments and discussions in the paper. we can prob just add a subsection in the appendix]</span>

[1] Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 855-864).

[2] Perozzi, B., Kulkarni, V., & Skiena, S. (2016). Walklets: Multiscale graph embeddings for interpretable network classification. arXiv preprint arXiv:1605.02115.

[3] Huang, J., Chen, C., Ye, F., Wu, J., Zheng, Z., & Ling, G. (2019). Hyper2vec: Biased random walk for hyper-network embedding. In Database Systems for Advanced Applications: DASFAA 2019 International Workshops (pp. 273-277). Springer International Publishing.

---

> **Reviewer**: Q1. In difference 1, I personally feel GNN is very flexible to the learning prior. Though one of the most frequently used learning priors would be the transformed node attributes, it can also take uniform initialization (i.e. treat the input graph as an unattributed graph, then assign uniform initial features on each node). So, it seems to me that, it is unfair to claim GNN is limited to taking the transformed node attributes as prior?

**Authors**: We agree with the reviewer that GNNs do not necessarily take node attributes as the prior. We have improved the expression in our paper accordingly. However, we also note that taking both graph topology and node attributes as the input of GNNs is the **most widely studied scenario**. In fact, being able to utilize node attributes is considered an advantage of GNNs in most cases compared with most traditional methods that only take graph topology as input. If node features are already available, avoiding using them usually leads to suboptimal performance.

---

> **Reviewer**: Q2. Only DeepWalk is examined among all the shallow methods. Is it representative enough? Can it outperform all other shallow methods on all datasets? If yes, then why?

<span style="color:blue">[Tong: combine this with the answers for W3 and put them together?]</span>

**Authors**: We thank the reviewer for pointing this out. The reason why DeepWalk is adopted is that DeepWalk is a representative example of shallow methods **in its design**. Specifically, DeepWalk is among the most commonly used shallow graph embedding methods, and a large number of follow-up works under the umbrella of shallow methods are developed based on DeepWalk, such as [1, 2, 3].
Therefore, **DeepWalk is among the best options we can choose to obtain a generalizable analysis**, and adopting more follow-up methods that share a similar design with DeepWalk does not change the observations and conclusions. In addition, we note that a model with a representative design does not necessarily have SOTA performance. For example, we do not examine SOTA shallow methods with a series of unique designs (e.g., [4]) in this paper, since the analysis performed on such models may not generalize to other shallow methods.

[1] Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 855-864).

[2] Perozzi, B., Kulkarni, V., & Skiena, S. (2016). Walklets: Multiscale graph embeddings for interpretable network classification. arXiv preprint arXiv:1605.02115.

[3] Huang, J., Chen, C., Ye, F., Wu, J., Zheng, Z., & Ling, G. (2019). Hyper2vec: Biased random walk for hyper-network embedding. In Database Systems for Advanced Applications: DASFAA 2019 International Workshops (pp. 273-277). Springer International Publishing.

[4] Postăvaru, Ş., Tsitsulin, A., de Almeida, F. M. G., Tian, Y., Lattanzi, S., & Perozzi, B. (2020). InstantEmbedding: Efficient local node representations. arXiv preprint arXiv:2010.06992.

## Reviewer 2 (4Dj2)

We sincerely appreciate the time and effort you've dedicated to reviewing this paper and providing invaluable feedback to enhance its quality. We provide a point-to-point reply to the mentioned concerns and questions below.

---

> **Reviewer**: W1. The novelty of the work seems limited - most of the observations are straightforward or appeared in previous research.

**Authors**: We agree with the reviewer that a part of our observations has been revealed by existing works. However, we note that the main focus of this paper is to perform a comprehensive comparison between GNNs and shallow embedding methods. In particular, to the best of our knowledge, no existing work has comprehensively explored the strengths and weaknesses of GNNs compared with shallow methods from a unified view. Therefore, a comprehensive analysis is still needed for an in-depth understanding of GNNs. For example, it is valuable to point out that dimensional collapse happens solely in GNNs and not in traditional shallow methods, and such a conclusion has not been revealed by any existing work.

---

> **Reviewer**: W2. While the paper contains a guide for practitioners about which model to choose, it is not specific enough to be directly applied to a given application. For instance, in Section 5, it is written "we recommend adopting GNNs and shallow embedding methods on attribute-rich and attribute-poor networks, respectively." However, it is not clear how to decide whether the attributes are rich. For instance, in both Flickr and PubMed, there are 500 features, but the results on them are completely different. So, it is not the number of features that can be used for this decision.

**Authors**: We thank the reviewer for pointing this out. We would like to note that rich attributes do not necessarily mean a large number of available features. For example, no matter what the dimensionality of the input node features is, if most of the available features are linearly correlated with each other, i.e., the rank of the input feature matrix is small, then dimensional collapse is still likely to happen. This is because, as shown in Figure 5.b, the non-linear operations in GNNs can hardly improve the rank of the representations (compared with the node features) regardless of the dimensionality. Correspondingly, we note that it is the rank of the input features that truly determines whether a scenario is attribute-poor or not. We have improved the corresponding explanation in Section 5 accordingly to avoid further confusion. In addition, we would like to point out that it is not appropriate to compare performances across different datasets, since how much the node attributes help the GNN prediction can vary across datasets. Instead, we present Figure 2 to compare the performances within each dataset under different levels of available input attribute dimensions, which makes the analysis of dimensional collapse more rigorous.
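
To spell out the linear-algebra intuition behind this point (a sketch only, assuming a single graph convolution $\hat{A}XW$ and setting the non-linearity aside):

$$
\operatorname{rank}(\hat{A}XW) \le \min\left(\operatorname{rank}(\hat{A}),\ \operatorname{rank}(X),\ \operatorname{rank}(W)\right) \le \operatorname{rank}(X),
$$

so the propagated representations cannot span more dimensions than the input features do, no matter how many feature columns or hidden units are used. The non-linearity could in principle raise this bound, but as noted above (Figure 5.b), in practice it hardly does.
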

---

> **Reviewer**: Q1. How to decide whether the features are sufficiently rich?

**Authors**: We thank the reviewer for pointing this out. We would like to note that whether the features are sufficiently rich **is usually determined by the specific downstream task**, which is also the reason why there is **no unified metric** to measure whether the features are "rich" enough. For example, in a node regression task (i.e., predicting a specific value for each node), we may not need the embeddings to span the whole hidden space, since more discriminative node embeddings may not be as helpful in improving the performance as in node classification tasks [1]. Accordingly, the requirement on the "richness" of node features could be less strict than for node classification tasks. To avoid further confusion and misunderstanding, we have added a discussion on whether the features of specific graph data are sufficiently rich to our Appendix. However, a full treatment goes beyond the scope of this paper, which is to perform a comprehensive comparison between the two types of models.

[1] Yu, Y., Chan, K. H. R., You, C., Song, C., & Ma, Y. (2020). Learning diverse and discriminative representations via the principle of maximal coding rate reduction. Advances in Neural Information Processing Systems, 33, 9422-9434.

---

> **Reviewer**: Q2. It is written that "It is difficult for shallow embedding methods to properly exploit information encoded in node attributes and make use of the homophily nature of most graphs" - why is the latter true? Classic methods have similar embeddings for nodes located close to each other in the graph. Under the homophily assumption, such nodes often have the same label.

**Authors**: We would like to thank the reviewer for pointing this out. We agree that the latter claim is not appropriate; the main claim here is that shallow embedding methods cannot effectively exploit node attribute information. We have revised the expression accordingly to avoid confusion.

---

> **Reviewer**: Q3. The fact that GNNs strongly rely on node features and removing them leads to decreased performance is very natural. Can this problem be solved by augmenting node features with structural graph-based features? Or maybe even with random features? Both options can also increase the representation effective dimension.

**Authors**: We agree with the reviewer that feature augmentation could be a potential direction to explore, such that the disadvantage of GNNs revealed in this paper can be mitigated. However, the problem of dimensional collapse cannot be properly addressed simply by augmenting the node attributes with either structural graph-based features or random features. We performed experiments by (1) concatenating a random matrix with the same dimensionality as the original node attributes onto the node attribute matrix, and (2) concatenating a matrix encoding structural information, following the state-of-the-art *degree+* strategy [1], onto the node attribute matrix. We present the unsupervised learning performances on the Cora dataset below as an example. Here utility is measured by node classification accuracy, while the effective dimension ratio (EDR) is measured by the ratio of the rank of the representation matrix to the representation dimensionality.

| | 100%Att, Acc | 100%Att, EDR | 1%Att, Acc | 1%Att, EDR | 0.01%Att, Acc | 0.01%Att, EDR |
| -------------------- | ------------ | ------------ | ---------- | ---------- | ------------- | ------------- |
| Walk-GCN | 67.8% | 96.9% | 36.9% | 35.5% | 32.3% | 1.56% |
| Walk-GCN, Random | 68.0% | 97.3% | 29.9% | 67.2% | 10.6% | 5.08% |
| Walk-GCN, Structural | 69.0% | 97.3% | 37.3% | 35.2% | 32.3% | 2.34% |

We observe that: (1) concatenating a matrix with structural information slightly improves the node classification accuracy, but this strategy does not stop the significant drop in the rank of the node representations; (2) concatenating random node attributes successfully improves the rank of the node representations, but the classification accuracy is reduced. Therefore, neither strategy really solves the problem of dimensional collapse, and we believe handling this problem is non-trivial. Correspondingly, this paper is particularly interesting to researchers working in this area, and this problem is worth exploring in future works.

[1] Cui, H., Lu, Z., Li, P., & Yang, C. (2022, October). On positional and structural node features for graph neural networks on non-attributed graphs. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management (pp. 3898-3902).

---

> **Reviewer**: Q4. Can small representation effective dimension be explained just by the dimension of the initial feature set? This would also explain Figure 5(b) since increased rank bound cannot solve this issue.

**Authors**: We thank the reviewer for pointing this out, and we agree with the reviewer. Specifically, if the number of dimensions of the initial feature set is already small, the initial feature set necessarily has low rank, since its rank cannot exceed its feature dimensionality. Accordingly, the learned node representations are then more likely to suffer dimensional collapse, since the GNN cannot effectively improve the rank even if the rank bound increases (i.e., the number of hidden dimensions increases). We have also added the associated explanation to make this paper more enjoyable to read. <span style="color:blue">[Tong: enjoyable?]</span>

---

> **Reviewer**: Q5. The concatenation experiment is conducted only on one dataset (DBLPFull). Are the results on other datasets consistent with this?

**Authors**: We thank the reviewer for pointing this out. We would like to note that we have similar observations on the other adopted datasets. In fact, we have presented the rank of the representations learned by GNNs in Figure 9 (Appendix) and the rank of the representations learned by shallow methods in Figure 11 (Appendix). The rank of the concatenation of two representation matrices is at least the maximum of the ranks of the two matrices. Note that on all datasets, the representations learned by shallow methods are full-rank. Therefore, the tendency curve of the concatenated representation will either reach a minimum effective dimension ratio of 0.5 (similar to the results presented in Figure 8) or an effective dimension ratio greater than 0.5. We have also added the associated explanation to our paper.
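
For completeness, the argument above can be written out as follows (assuming both representation matrices have the same dimensionality $d$, so the concatenation has $2d$ columns):

$$
\operatorname{rank}\left([H_{\text{GNN}} \,\|\, H_{\text{shallow}}]\right) \ge \max\left(\operatorname{rank}(H_{\text{GNN}}),\ \operatorname{rank}(H_{\text{shallow}})\right) = d,
$$

since the shallow representations are full-rank on all adopted datasets; the EDR of the concatenated representation is therefore at least $d/2d = 0.5$.
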

---

> **Reviewer**: Q6. Do I understand correctly that GNN w/o Agg (Figure 6) does not use any graph structure?

**Authors**: Yes. For GNN w/o Agg, we remove all message-passing operations.

---

> **Reviewer**: There are some typos throughout the text.

**Authors**: We thank the reviewer for pointing this out. We have revised them throughout the paper. Thanks again for your efforts to make this paper more enjoyable to read!

## Reviewer 3 (4Gbd)

We sincerely appreciate the time and effort you've dedicated to reviewing this paper and providing invaluable feedback to enhance its quality. We provide a point-to-point reply to the mentioned concerns and questions below.

---

> **Reviewer**: W1. It is already shown that MLPs (which is the same as GNNs without aggregation) outperform commonly used GNNs such as GCN, GAT, and SAGE. Furthermore, several methods are proposed to perform well on both homophilous and heterophilous graphs [1, 2, 3].

<span style="color:blue">[Tong: our work has nothing to do with homophily right? we should prob point out that we actually focus on the comparison of GNNs vs shallow embeddings. Also I guess these methods are all somehow graph-based or at least feature-based? can we maybe argue that they still showcase the same drawbacks as normal GNNs in those scenarios we talked about? I assume that when the features are super bad or meaningless, no MLP or GNN would be as good as DeepWalk.]</span>

**Authors**: We agree with the reviewer that existing works have proposed approaches to achieve better performance on both homophilous and heterophilous graphs. However, we note that the main focus of this paper is to perform a comprehensive comparison between GNNs and shallow embedding methods. Existing works (e.g., [1, 2, 3]) do not explicitly explore the strengths and weaknesses of GNNs compared with shallow methods. We argue that the novelty of our paper mainly lies in the comprehensive comparison of (1) the levels of dimensional collapse and (2) the performances over different levels of homophily between the two types of models.

[1] Zhu, Jiong, et al. "Beyond homophily in graph neural networks: Current limitations and effective designs." Advances in Neural Information Processing Systems 33 (2020): 7793-7804.

[2] Lim, Derek, et al. "Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods." Advances in Neural Information Processing Systems 34 (2021): 20887-20902.

[3] Yang, Liang, et al. "Diverse message passing for attribute with heterophily." Advances in Neural Information Processing Systems 34 (2021): 4751-4763.

<u>[looking for further comments on this reply]</u>

---
> **Reviewer**: W2. It seems that the circumstances where input features are limited are not common in real-world scenarios, and we can augment input features with large language models even in attribute-poor settings [4].

**Authors**: We thank the reviewer for pointing this out. We agree with the reviewer that augmenting input features with large language models in attribute-poor settings could be an interesting future direction to explore. However, we would like to note that circumstances where input features are limited are, in fact, common. For example, online social networks have become prevalent, and anonymized social networks usually lack the identity information of nodes [1]. As another example, the attributes of a node in a social network may also be artifacts that capture no semantic meaning [1]. Therefore, the phenomena revealed by our paper widely exist, and such a weakness would show up across a series of real-world applications. <span style="color:blue">[Tong: also mention that augmenting features with LLMs might not scale to the "real-world scenarios" where the graph has billions of nodes, can cite things like fb user stats to support this. And also mention that it could be an interesting follow-up topic tho.]</span>

[1] Sun, Z., Zhang, W., Mou, L., Zhu, Q., Xiong, Y., & Zhang, L. (2022, June). Generalized equivariance and preferential labeling for GNN node classification. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 8, pp. 8395-8403).

---

> **Reviewer**: W3. It is not surprising that GNNs suffer dimensional collapse when input features are poor since it is well-known for general neural networks.

**Authors**: We agree with the reviewer that dimensional collapse has been found to happen in general neural networks by a series of existing works. However, we would like to point out that such a phenomenon does not necessarily mean that GNNs will suffer a similar problem. In particular, to the best of our knowledge, no existing work has comprehensively explored the strengths and weaknesses of GNNs compared with shallow methods from a unified view. Therefore, a comprehensive analysis is still needed for an in-depth understanding of GNNs. For example, it is valuable to point out that dimensional collapse happens solely in GNNs and not in traditional shallow methods, and such a conclusion has not been revealed by any existing work.

---

> **Reviewer**: There are several models combining walking-based approaches and GNNs such as APPNP [1]. I think that this kind of mechanism might alleviate the problem of GNNs due to the adoption of PageRank. Does APPNP also suffer similar problems as other GNNs?

**Authors**: We thank the reviewer for pointing this out. First, we would like to point out that in the mentioned model APPNP, only the way information is distributed is similar to a random walk. However, as revealed in our paper, **the problem of dimensional collapse is caused by utilizing the node attributes as the prior of learning**. Since APPNP still utilizes the node attributes as the prior of learning, **APPNP does not bear any significant difference in terms of the cause of dimensional collapse** compared with the GNNs adopted in this paper. Therefore, we conclude that APPNP naturally suffers similar problems as other GNNs.
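
As a sketch of the reasoning: APPNP first computes the attribute-based prior $H = f_\theta(X)$ and then applies a fixed (linear) personalized-PageRank propagation, so the final representations are a single matrix applied to $H$:

$$
Z^{(K)} = \left[\alpha \sum_{k=0}^{K-1} (1-\alpha)^k \hat{A}^k + (1-\alpha)^K \hat{A}^K\right] H
\quad\Longrightarrow\quad
\operatorname{rank}\left(Z^{(K)}\right) \le \operatorname{rank}(H),
$$

so the rank of the APPNP representations is capped by the rank of the attribute-derived prior $H$; since, as discussed above, the non-linear layers in $f_\theta$ hardly raise the rank beyond that of $X$, APPNP is exposed to dimensional collapse in the same way as the other GNNs studied in the paper.
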

In addition, we also perform empirical experiments based on APPNP. We present the unsupervised learning performances on the Cora dataset below as an example. Here utility is measured by node classification accuracy, while the effective dimension ratio (EDR) is measured by the ratio of the rank of the representation matrix to the representation dimensionality. We observe that, similar to GCN, APPNP also suffers severe dimensional collapse (exhibited by the significant reduction in the value of EDR).

| | 100%Att, Acc | 100%Att, EDR | 1%Att, Acc | 1%Att, EDR | 0.01%Att, Acc | 0.01%Att, EDR |
| ----- | ------------ | ------------ | ---------- | ---------- | ------------- | ------------- |
| GCN | 67.8% | 96.9% | 36.9% | 35.5% | 32.3% | 1.56% |
| APPNP | 75.5% | 56.3% | 42.4% | 13.7% | 10.6% | 0.78% |

We note that the discussion above reveals that the problem pointed out by our work is non-trivial to handle. Correspondingly, this paper is particularly interesting to researchers working in this area, and a follow-up study of our work remains desired.

---

> **Reviewer**: Several approaches such as LINKX [2] encode node topology and node attributes separately and combine the two representations later. Since these approaches can learn how much to reflect node attributes in node representations, I think that these methods might not suffer dimensional collapse. Does LINKX also suffer similar problems?

**Authors**: We agree with the reviewer that dimensional collapse could be mitigated by encoding node topology and node attributes separately and combining the two representations later. However, it is also worth noting that **the adjacency matrix is naturally low-rank as well** [1, 2], and thus it could be difficult to avoid dimensional collapse with the operations above. Specifically, we perform empirical experiments based on LINKX. We present the unsupervised learning performances on the Cora dataset below as an example. Here utility is measured by node classification accuracy, while the effective dimension ratio (EDR) is measured by the ratio of the rank of the representation matrix to the representation dimensionality. We observe that compared with GCN, LINKX exhibits smaller values of EDR in attribute-rich scenarios (e.g., 100% available node attributes), while it also mitigates dimensional collapse in attribute-poor scenarios (e.g., 0.01% available node attributes). This demonstrates that (1) such an approach may jeopardize the effective dimension ratio in attribute-rich scenarios and (2) such an approach effectively helps to mitigate dimensional collapse in attribute-poor scenarios.

| | 100%Att, Acc | 100%Att, EDR | 1%Att, Acc | 1%Att, EDR | 0.01%Att, Acc | 0.01%Att, EDR |
| ----- | ------------ | ------------ | ---------- | ---------- | ------------- | ------------- |
| GCN | 67.8% | 96.9% | 36.9% | 35.5% | 32.3% | 1.56% |
| LINKX | 65.9% | 50.4% | 64.6% | 49.6% | 68.8% | 24.2% |

However, we would also like to point out that even if such a method successfully mitigates dimensional collapse for GNNs, it **is not ideal**, since it (1) sacrifices the capability of GNNs in inductive learning and (2) increases the computational complexity of inference from $\mathcal{O}(n \cdot k)$ to $\mathcal{O}(n^2)$ ($k$ is the number of node attributes and $n$ is the number of nodes). Therefore, the problem of dimensional collapse is non-trivial to handle, and more analysis can be a great follow-up study of our work.

[1] Entezari, N., Al-Sayouri, S. A., Darvishzadeh, A., & Papalexakis, E. E. (2020, January). All you need is low (rank): Defending against adversarial attacks on graphs. In Proceedings of the 13th International Conference on Web Search and Data Mining (pp. 169-177).

[2] Zhuang, L., Gao, H., Lin, Z., Ma, Y., Zhang, X., & Yu, N. (2012, June). Non-negative low rank and sparse graph for semi-supervised learning. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (pp. 2328-2335). IEEE.

## Reviewer 4 (ySnn)

We sincerely appreciate the time and effort you've dedicated to reviewing this paper and providing invaluable feedback to enhance its quality. We provide a point-to-point reply to the mentioned concerns and questions below.

---

> **Reviewer**: W1. This article lacks novelty to some extent, despite providing detailed analysis and experiments. The conclusions are relatively trivial and are already a consensus in the community.

**Authors**: We agree with the reviewer that a part of the observations revealed in this paper has also been reported in the existing literature, e.g., that GNNs could yield sub-optimal performance on highly heterophilic nodes. However, we would like to note that the main focus of this paper is to take an initial step toward a comprehensive comparison between GNNs and shallow methods. In particular, to the best of our knowledge, no existing work has comprehensively explored the strengths and weaknesses of GNNs compared with shallow methods from a unified view. Therefore, a comprehensive analysis is still needed for an in-depth understanding of GNNs. For example, it is valuable to point out that dimensional collapse happens solely in GNNs and not in traditional shallow methods, and such a conclusion has not been revealed by any existing work.

---

> **Reviewer**: W2. The paper may be improved if it discusses some works that combine the advantage of network embedding and GNN, like [1, 2].

**Authors**: We would like to thank the reviewer for pointing this out. We have improved our discussion by adding [1, 2] as additional literature in our Related Work section (Appendix A). Specifically, we note that [1] successfully optimizes the context hyper-parameters of shallow graph embedding methods. However, the proposed approach cannot take node attributes as input and thus fails to effectively utilize the information encoded in node attributes. [2] explores how to generalize GNNs to adaptively learn high-quality node representations under both homophilic and heterophilic node label patterns. Nevertheless, it does not avoid using the node attributes as the learning prior, and thus still follows a design that has been shown to suffer dimensional collapse in our paper. We thank the reviewer again for providing two additional pieces of related literature to make this paper more enjoyable to read.

[1] Abu-El-Haija, S., Perozzi, B., Al-Rfou, R., et al. Watch your step: Learning node embeddings via graph attention. Advances in Neural Information Processing Systems, 2018, 31.

[2] Chien, E., Peng, J., Li, P., et al. Adaptive Universal Generalized PageRank Graph Neural Network. International Conference on Learning Representations, 2020.

---

> **Reviewer**: (1) The result of E2E-GCN on Cora, CiteSeer, Pubmed is lower than that reported in the paper. Can the authors explain the difference in the experimental setting?

**Authors**: We would like to note that we utilized the standard implementation of the GCN layer from the standard package PyG, and we added a series of scalability operations consistently across all datasets, e.g., layer normalization, which could jeopardize the performance on the mentioned datasets.
<span style="color:blue">[Tong: let's not talk about these. this sounds like saying that we made those numbers worse on purpose. Just say that those are what we reproduced with standard packages, and we changed to their numbers now </span> We note that the performance differences do not influence our observations and conclusions. We agree with the reviewer that such inconsistency over these most commonly used datasets could lead to confusion, and we have changed the corresponding performances under a consistent GNN structure with the original papers to avoid any further confusion. --- > **Reviewer**: (2) Since the GNNs usually contain non-linear activation functions, is it reasonable to measure the Dimensional Collapse by evaluating rank of the embedding matrix? **Authors**: We would like to argue that (1) it is **reasonable to measure** the Dimensional Collapse with the rank of the representation matrix and (2) whether a model contains non-linear activation functions or not **does not influence the choice of metrics** to measure the Dimensional Collapse. We elaborate on the details below. First, Dimensional Collapse refers to the phenomenon where the representations of data points (i.e., the input nodes in our paper) collapse into a lower-dimensional subspace instead of spanning the entire available hidden space. Accordingly, the rank of the representation matrix directly reveals the dimensionality of the subspace these representations span, which is also consistent with a series of recent works (e.g., [1, 2]). Therefore, we argue that it is **reasonable to measure** the Dimensional Collapse with the rank of the representation matrix. Second, we note that the major connection between non-linear activation functions and Dimensional Collapse is that when non-linear activation functions are adopted in a model, they may improve the rank of the output representation matrix (compared with the input feature matrix) and thus may mitigate the level of Dimensional Collapse (we have empirically found that such mitigation is marginal in GNNs). Such a connection does not influence the semantic meaning of the rank of the representation matrix, and thus **does not influence the appropriate choice of metrics** to measure the Dimensional Collapse. [1] Roth, A., & Liebig, T. (2023). Rank Collapse Causes Over-Smoothing and Over-Correlation in Graph Neural Networks. arXiv preprint arXiv:2308.16800. [2] Sun, J., Chen, R., Li, J., Wu, C., Ding, Y., & Yan, J. (2022). On Understanding and Mitigating the Dimensional Collapse of Graph Contrastive Learning: a Non-Maximum Removal Approach. *arXiv preprint arXiv:2203.12821*. --- > **Reviewer**: (3) Is it possible to overcome the weakness of both GNNs and shallow embedding methods, and propose a new graph representation paradigm to combine their strengths? **Authors**: We agree with the reviewer that there could be ways to combine their strengths, and we believe this would be an interesting future direction to work on. However, despite the significance of this problem, we would also like to note that **handling such a problem is difficult**. In fact, most existing works fail to avoid the weaknesses characterized by our paper, and thus they are not able to properly combine the strengths of the two methods. We believe it would be exciting for the graph machine learning community to explore on such new learning paradigm, which would be great future directions. 
In fact, it is great that this paper can lead to questions like this from readers, because this is part of what we want from this work: to have researchers and practitioners look not only at GNNs but also at other learning paradigms for further advancements.

<span style="color:red">Old Version:</span>

<span style="color:blue">[Tong: For ALL reviewers, we should also respond to the weaknesses in the review as well, and also try to address them]</span>

## Reviewer 1 (VyuR)

We sincerely appreciate the time and effort you've dedicated to reviewing this paper and providing invaluable feedback to enhance its quality. We provide a point-to-point reply to the mentioned concerns and questions below.

---

> **Reviewer**: Q1. In difference 1, I personally feel GNN is very flexible to the learning prior. Though one of the most frequently used learning priors would be the transformed node attributes, it can also take uniform initialization (i.e. treat the input graph as an unattributed graph, then assign uniform initial features on each node). So, it seems to me that, it is unfair to claim GNN is limited to taking the transformed node attributes as prior?

**Authors**: We would like to note that in this paper, we perform the comprehensive comparison under **the most widely studied and most realistic context**, i.e., the input includes graph topology and node attributes, where node attributes usually contain beneficial information. <span style="color:blue">[Tong: it's kinda rude to imply that the reviewer is asking sth unrealistic in our response even if he is doing so. We should instead explain and argue with subjective reasons on why we don't do that, for example, not using features when we have features would result in suboptimal performances]</span>

In addition, we agree with the reviewer that GNNs can also take non-attributed graphs as input and utilize randomly initialized node attributes. However, such an approach makes the ground truth (e.g., node classification labels) and the node attributes independent from each other. It has been empirically shown <span style="color:blue">[Tong: where exactly? which table in [1]?]</span> in a series of recent studies (e.g., [1]) that if the node attributes do not encode information about the ground truth, GNNs tend to underperform the cases where the node attributes do encode such information. Therefore, <span style="color:blue">[Tong: I don't really see the logic from the previous argument to this one]</span> we argue that the existence of such an initialization strategy does not undermine the value of our contribution of **identifying the key disadvantages of GNNs under the most widely studied and most realistic context**. <span style="color:blue">[Tong: 1. we need to stop implying the reviewer is stupid and suggesting unrealistic things 2. can we also argue that being able to use features is also an advantage for GNNs, and this only brings issues in certain scenarios]</span>

[1] Duong, C. T., Hoang, T. D., Dang, H. T. H., Nguyen, Q. V. H., & Aberer, K. (2019). On node features for graph neural networks. arXiv preprint arXiv:1911.08795.

---

> **Reviewer**: Q2. Only DeepWalk is examined among all the shallow methods. Is it representative enough? Can it outperform all other shallow methods on all datasets? If yes, then why?

**Authors**: The reason why DeepWalk is adopted is that DeepWalk is a representative example of shallow methods, and a large number of follow-up works under the umbrella of shallow methods are developed based on DeepWalk, such as [1, 2, 3]. Therefore, **DeepWalk is among the best options we can choose to obtain a generalizable empirical analysis**. <span style="color:blue">[Tong: mention it's the most commonly adopted one]</span>

In addition, noticing that the reviewer focused on <span style="color:blue">[Tong: rude; I'd just say sth like a representative method doesn't need to be the SOTA, and explain why. E.g., it should be representative because of the design]</span> whether DeepWalk "outperforms all other shallow methods on all datasets" or not, we would like to note that **a SOTA approach does not mean it is representative enough**. For example, several recent shallow methods, such as [4], are proposed and claimed to achieve better performance than other shallow embedding methods. However, such superiority relies on **a series of unique designs different from all other shallow methods**. <span style="color:blue">[Tong: this sentence sounds like you think we should also compare with [4], but we don't. Explain why we don't need to also evaluate with methods like [4]]</span> In this case, we argue it is reasonable to consider these approaches as non-representative shallow methods, and thus they should not be adopted in this paper.

[1] Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 855-864).

[2] Perozzi, B., Kulkarni, V., & Skiena, S. (2016). Walklets: Multiscale graph embeddings for interpretable network classification. arXiv preprint arXiv:1605.02115.

[3] Huang, J., Chen, C., Ye, F., Wu, J., Zheng, Z., & Ling, G. (2019). Hyper2vec: Biased random walk for hyper-network embedding. In Database Systems for Advanced Applications: DASFAA 2019 International Workshops (pp. 273-277). Springer International Publishing.

[4] Postăvaru, Ş., Tsitsulin, A., de Almeida, F. M. G., Tian, Y., Lattanzi, S., & Perozzi, B. (2020). InstantEmbedding: Efficient local node representations. arXiv preprint arXiv:2010.06992.

## Reviewer 2 (4Dj2)

We sincerely appreciate the time and effort you've dedicated to reviewing this paper and providing invaluable feedback to enhance its quality. We provide a point-to-point reply to the mentioned concerns and questions below.

---

> **Reviewer**: Q1. How to decide whether the features are sufficiently rich?

<span style="color:blue">[Tong: start all responses politely with things like: thanks for pointing this out]</span>

**Authors**: We would like to note that whether the features are sufficiently rich **is usually determined by the specific downstream task**, which is also the reason why there is **no unified metric** to measure whether the features are "rich" enough. For example, in a node regression task (i.e., predicting a specific value for each node), we may not need the embeddings to span the whole hidden space, since more discriminative node embeddings may not be as helpful in improving the performance as in node classification tasks. Accordingly, the requirement on the "richness" of node features could be less strict than for node classification tasks.
<span style="color:blue">[Tong: would be nice to have some citations backing this paragraph up] </span> To avoid further confusion and misunderstanding, we added a discussion on whether the features of specific graph data are sufficiently rich or not in our Appendix. However, this goes beyond the scope of this paper to perform a comprehensive comparison between the two types of models. --- > **Reviewer**: Q2. It is written that "It is difficult for shallow embedding methods to properly exploit information encoded in node attributes and make use of the homophily nature of most graphs" - why the latter is true? Classic methods have similar embeddings for nodes located close to each other in the graph. Under the homophily assumption, such nodes often have the same label. **Authors**: We would like to thank the reviewer for pointing this out. We agree that the latter is not appropriate, and the main claim here is that shallow embedding methods cannot effectively exploit node attribute information. We have revised the expression accordingly to avoid further confusion. <span style="color:blue">[Tong: hmm i'm also not sure why we are talking about homophily here, could we be talking only the feature homophily instead of structural homophily? I think what you said here it's fine and we can add a bit about why we had that vague sentence (like feature homophily or sth else). Then update the sentence in paper and also include the updated sentence here in our response] </span> [Looking for further suggestions for this reply] --- > **Reviewer**: Q3. The fact that GNNs strongly rely on node features and removing them leads to decreased performance is very natural. Can this problem be solved by augmenting node features with structural graph-based features? Or maybe even with random features? Both options can also increase the representation effective dimension. **Authors**: We agree with the reviewer that feature augmentation could be a potential direction to explore, such that the disadvantage of GNNs revealed in this paper can be mitigated. <span style="color:blue">[Tong: can we also run some simple experiments here of this? it seems this is very easy to conduct, also there should be past work did so] </span> However, we would like to note that (1) **augmenting node features with structural graph-based features does not necessarily address the problem of dimensional collapse**. For example, the rank of the node feature matrix cannot be guaranteed to improve by adding multiple collums of handcrafted structural information as additional feature dimensions. (2) **Simply using random features could potentially jeopardize the performance of GNNs**, since the dependency between node features and ground truth labels is wiped out (also see in the 1st reply to Reviewer 1 VyuR<span style="color:blue">[Tong: we have unlimited space, so don't point. If you want to refer sth, copy it to here] </span>). Therefore, we believe such a problem pointed out by our work is critical and non-trivial. Correspondingly, this paper is particularly interesting to researchers working in this area and such a problem is worth to be explored in future works. However, we note that the exploration on such a problem goes beyond the scope of this paper of performing a comprehensive comparison between the two types of models. --- > **Reviewer**: Q4. Can small representation effective dimension be explained just by the dimension of the initial feature set? 
> This would also explain Figure 5(b) since increased rank bound cannot solve this issue.

**Authors**: Yes. Specifically, if the number of dimensions of the initial feature set is already small, the initial feature set necessarily has low rank, since its rank cannot exceed its feature dimensionality. Accordingly, the learned node representations are then more likely to suffer dimensional collapse, since the GNN cannot effectively improve the rank even if the rank bound increases (i.e., the number of hidden dimensions increases). **We believe the reviewer has the correct understanding of this question**. <span style="color:blue">[Tong: don't judge him, are we implying that he has a wrong understanding of the others? just say sth like we agree with your points etc]</span>

---

> **Reviewer**: Q5. The concatenation experiment is conducted only on one dataset (DBLPFull). Are the results on other datasets consistent with this?

**Authors**: We would like to note that we have similar observations on the other adopted datasets. In fact, we have presented the rank of the representations learned by GNNs in Figure 9 (Appendix) and the rank of the representations learned by shallow methods in Figure 11 (Appendix). The rank of the concatenation of two representation matrices is at least the maximum of the ranks of the two matrices. Note that on all datasets, the representations learned by shallow methods are full-rank. Therefore, the tendency curve of the concatenated representation will either reach a minimum effective dimension ratio of 0.5 (similar to the results presented in Figure 8) or an effective dimension ratio greater than 0.5.

---

> **Reviewer**: Q6. Do I understand correctly that GNN w/o Agg (Figure 6) does not use any graph structure?

**Authors**: Yes. For GNN w/o Agg, we remove all message-passing operations. **We believe the reviewer has the correct understanding of this question**. <span style="color:blue">[Tong: same comment as I had for Q4 for this one]</span>

---

> **Reviewer**: There are some typos throughout the text.

**Authors**: We thank the reviewer for pointing this out. We have revised them throughout the paper. Thanks again for your efforts to make this paper more enjoyable to read!

## Reviewer 3 (4Gbd)

We sincerely appreciate the time and effort you've dedicated to reviewing this paper and providing invaluable feedback to enhance its quality. We provide a point-to-point reply to the mentioned concerns and questions below.

---

> **Reviewer**: There are several models combining walking-based approaches and GNNs such as APPNP [1]. I think that this kind of mechanism might alleviate the problem of GNNs due to the adoption of PageRank. Does APPNP also suffer similar problems as other GNNs?

**Authors**: First, we would like to point out that in the mentioned model APPNP, only the way information is distributed is similar to a random walk. However, as revealed in our paper, **the problem of dimensional collapse is caused by utilizing the node attributes as the prior of learning**. Since APPNP still utilizes the node attributes as the prior of learning, **APPNP does not bear any significant difference in terms of the cause of dimensional collapse** compared with the GNNs adopted in this paper. Therefore, we conclude that APPNP naturally suffers similar problems as other GNNs.
<span style="color:blue">[Tong: we should give some simple experiments here to validate our claim] </span> In addition, the discussion above reveals that the problem pointed out by our work is critical and non-trivial. <span style="color:blue">[Tong: these words are very objective and really plain, reviewer won't agree it's critical because we say it's critical. We should give more reasonings and subjective facts. I think here it's also a good place to re-emphysize our contributions, like elaborating the next sentence a bit more] </span> Correspondingly, this paper is particularly interesting to researchers working in this area and these problems are worth to be explored as future works. However, we note that the corresponding exploration goes beyond the scope of this paper of performing a comprehensive comparison between the two types of models. <span style="color:blue">[Tong: this sounds like we don't want to do it, say sth like: more in-depth analysis can be good followup study of our work] </span> --- > **Reviewer**: Several approaches such as LINKX [2] encode node topology and node attribute separately and combine two representations later. Since these approaches can learn how much to reflect node attributes on node representations, I think that these methods might not suffer dimensional collapse. Does LINKX also suffer similar problems? **Authors**: In the mentioned model LINKX, the author proposed to utilize MLP models to encode the graph topology and node attributes separately, and concatenate the two representations later. However, it is worth noting that **the adjacency matrix is naturally low-rank as well** [1, 2]. At the same time, we have empirically proved in this paper that GNN models with MLP cannot effectively improve the rank of the input matrix. This means that representations from both branches will naturally bear the problem of dimensional collapse. Therefore, in terms of dimensional collapse, **LINKX does not bear any significant difference in terms of the cause for dimensional collapse compared with the GNNs adopted in this paper**. In addition, we would like to note that the exploration on how to mitigate dimensional collapse goes beyond the scope of this paper of performing a comprehensive comparison between the two types of models. <span style="color:blue">[Tong: similar to above, we should also include experimental results of LINKX to should that it also bear the same problem, these can also help us claim that our observations can be generalized to more complicated GNNs. I think this reviewer's points can be easily dealed with by adding experiments, only the Weakness2 (LLM) one is really out of scope.] </span> [1] Entezari, N., Al-Sayouri, S. A., Darvishzadeh, A., & Papalexakis, E. E. (2020, January). All you need is low (rank) defending against adversarial attacks on graphs. In *Proceedings of the 13th International Conference on Web Search and Data Mining* (pp. 169-177). [2] Zhuang, L., Gao, H., Lin, Z., Ma, Y., Zhang, X., & Yu, N. (2012, June). Non-negative low rank and sparse graph for semi-supervised learning. In *2012 ieee conference on computer vision and pattern recognition* (pp. 2328-2335). IEEE. ## Reviewer 4 (ySnn) We sincerely appreciate the time and efforts you've dedicated to reviewing and providing invaluable feedback to enhance the quality of this paper. We provide a point-to-point reply below for the mentioned concerns and questions. --- > **Reviewer**: (1) The result of E2E-GCN on GCN, CiteSeer, Pubmed is lower than that reported in the paper. 
<span style="color:blue">[Tong: similar to above, we should also include experimental results of LINKX to show that it also bears the same problem; these can also help us claim that our observations generalize to more complicated GNNs. I think this reviewer's points can easily be dealt with by adding experiments; only the Weakness 2 (LLM) one is really out of scope.] </span>

[1] Entezari, N., Al-Sayouri, S. A., Darvishzadeh, A., & Papalexakis, E. E. (2020). All you need is low (rank): Defending against adversarial attacks on graphs. In *Proceedings of the 13th International Conference on Web Search and Data Mining* (pp. 169-177).

[2] Zhuang, L., Gao, H., Lin, Z., Ma, Y., Zhang, X., & Yu, N. (2012). Non-negative low rank and sparse graph for semi-supervised learning. In *2012 IEEE Conference on Computer Vision and Pattern Recognition* (pp. 2328-2335). IEEE.

## Reviewer 4 (ySnn)

We sincerely appreciate the time and effort you have dedicated to reviewing this paper and providing invaluable feedback to enhance its quality. We provide a point-to-point reply below to the mentioned concerns and questions.

---

> **Reviewer**: (1) The result of E2E-GCN on GCN, CiteSeer, Pubmed is lower than that reported in the paper. Can the authors explain the difference in experimental settings?

**Authors**: We would like to note that the GCN used here adopts a set of scalability-oriented operations consistently across all datasets (e.g., layer normalization), which can hurt performance on the mentioned small datasets. We agree with the reviewer that such inconsistency on these most commonly used datasets could lead to confusion, and we have updated the corresponding results using a GNN structure consistent with the original papers to avoid further confusion. <span style="color:blue">[Tong: mention we are using standard implementations from standard packages (PyG?); also mention it is normal to have slightly different numbers between paper-reported and reproduced results (cite some); also mention it doesn't affect any of our observations and conclusions (whether using the reported numbers or ours)] </span>

---

> **Reviewer**: (2) Since GNNs usually contain non-linear activation functions, is it reasonable to measure Dimensional Collapse by evaluating the rank of the embedding matrix?

**Authors**: We would like to argue that (1) it is **reasonable to measure** Dimensional Collapse with the rank of the representation matrix, and (2) whether a model contains non-linear activation functions **does not influence the choice of metric** for measuring Dimensional Collapse. We elaborate on the details below.

First, Dimensional Collapse refers to the phenomenon where the representations of data points (i.e., the input nodes in our paper) collapse into a lower-dimensional subspace instead of spanning the entire available hidden space. Accordingly, the rank of the representation matrix directly reveals the dimensionality of the subspace these representations span, which is also consistent with a series of recent works (e.g., [1, 2]). Therefore, we argue that it is **reasonable to measure** Dimensional Collapse with the rank of the representation matrix.

Second, we note that the major connection between non-linear activation functions and Dimensional Collapse is that, when non-linear activation functions are adopted in a model, they may improve the rank of the output representation matrix (compared with the input feature matrix) and thus may mitigate the level of Dimensional Collapse (we have empirically found that such mitigation is marginal in GNNs). Such a connection does not change the semantic meaning of the rank of the representation matrix, and thus **does not influence the appropriate choice of metric** for measuring Dimensional Collapse.

[1] Roth, A., & Liebig, T. (2023). Rank collapse causes over-smoothing and over-correlation in graph neural networks. *arXiv preprint arXiv:2308.16800*.

[2] Sun, J., Chen, R., Li, J., Wu, C., Ding, Y., & Yan, J. (2022). On understanding and mitigating the dimensional collapse of graph contrastive learning: A non-maximum removal approach. *arXiv preprint arXiv:2203.12821*.
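For concreteness, the metric itself is computed directly from the final embedding matrix, regardless of how that matrix was produced. Below is a minimal illustrative sketch (the `effective_dimension_ratio` helper name and the synthetic matrices are our own assumptions, not code from the paper):

```python
import numpy as np

def effective_dimension_ratio(H: np.ndarray) -> float:
    """Rank of the embedding matrix divided by its number of hidden dimensions.

    If node representations span only a low-dimensional subspace of the hidden
    space, the ratio is well below 1. (Illustrative helper, not the paper's code.)
    """
    return np.linalg.matrix_rank(H) / H.shape[1]

rng = np.random.default_rng(2)
n, d_hidden = 1000, 64

# A "collapsed" embedding matrix: 1000 nodes whose representations live in a
# 10-dimensional subspace of the 64-dimensional hidden space.
H_collapsed = rng.standard_normal((n, 10)) @ rng.standard_normal((10, d_hidden))

# A "healthy" embedding matrix spanning the full hidden space.
H_full = rng.standard_normal((n, d_hidden))

print("collapsed:", effective_dimension_ratio(H_collapsed))   # ~10/64 ≈ 0.16
print("full-rank:", effective_dimension_ratio(H_full))        # 1.0
# The rank is computed from the singular values (with a numerical tolerance), so
# the same metric applies whether or not the model that produced H used
# non-linear activations: it only measures the subspace the final
# representations actually span.
```

In other words, the metric evaluates the output representations themselves, so the presence of non-linear activations in the model that produced them does not affect its validity.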
---

> **Reviewer**: (3) Is it possible to overcome the weaknesses of both GNNs and shallow embedding methods, and propose a new graph representation paradigm to combine their strengths?

**Authors**: There could be ways to combine their strengths, and we believe this would be an interesting future direction to work on. However, we would also like to note that **handling such a problem is difficult**. In fact, most existing works fail to avoid the weaknesses characterized by our paper, and thus they are not able to properly combine the strengths of the two types of methods (also see the replies to Reviewer 3, 4Gbd). In addition, we note that this question also shows that the problem pointed out by our work is **critical and non-trivial**. <span style="color:blue">[Tong: this reviewer is not even questioning our novelty and importance; why are we calling this out here?] </span> Correspondingly, this paper is particularly interesting to researchers working in this area. However, we note that the corresponding exploration **goes beyond the scope of this paper**, which focuses on a comprehensive comparison between the two types of models. <span style="color:blue">[Tong: I don't think this reviewer is implying any weakness of ours in this question; they are just posing a very open question, so don't be so defensive in our response. Just follow the previous discussions and keep elaborating that it's hard, but it's an important problem, it would be exciting for the GML community to have such a new learning paradigm, and it would be a great future direction. Can also mention that it's great that this paper can lead to such questions from readers, because that's part of what we want with the paper: to have ppl look not only at GNNs but also at other learning paradigms] </span>
