# Overall Response
We sincerely appreciate the valuable feedback and insightful comments from all our reviewers, and are delighted that our effort to address heterophily with a similarity-based GNN is acknowledged. We have thoroughly investigated and addressed all the issues raised. Below is a summary of our major updates:
1. **New experiment on *aggregation settings*:** We have added experiments demonstrating that the SIGMA aggregation is applicable to iterative settings that learn from progressive representations (Reviewer G4Zz A2).
2. **New evaluation on *model architectures*:** We conduct experiments on alternative designs featuring pre- and post-propagation schemes (Reviewer xhwP A3).
3. **New baselines for *homophilous aggregations*:** We further evaluate the performance of GNN aggregation and propagation utilizing PPR and other diffusion schemes on heterophilous datasets (Reviewer CAQb A3.2).
4. **Comparison with other *GNN schemes*:** We present a comprehensive comparison of the aggregation schemes of SIGMA, other iterative heterophilous designs (Reviewer CAQb A1, Reviewer hpmD A2), and similar homophilous models (Reviewer CAQb A3).
5. **Elaboration on *SIGMA aggregation*:** We provide detailed explanations of our aggregation design, including its relation to features and topology (Reviewer G4Zz A1, Reviewer CAQb A3.1), the top-k setting (Reviewer G4Zz A3, Reviewer xhwP A2), and the implication of the theory (Reviewer CAQb A2).
---
# For Each Reviewer
Scores are listed in the format: Novelty, Technical Quality, Confidence.
## Reviewer G4Zz --- 5,5,3
We thank the reviewer for the insightful comments and for acknowledging the novelty of our proposed method. We address all the comments and questions raised as follows.
> **W1.** Ignoring node feature: Based on Eq. (3), SimRank is a structure-based similarity metric that ignores the node features. In the case where node features are significant (e.g., more important than structure information) for the node classification, such similarity could result in a suboptimal solution.
**A1.** We would like to address the concern about the relation between SIGMA and node features from two perspectives.
### (1) SIGMA as GNN Aggregation
Firstly, we would like to highlight that SIGMA is an *aggregation* approach, which differs from *propagation* methods such as diffusions based on node features. As depicted in Eq.(7), the SIGMA aggregation is applied to node representations $h_v$, which can integrate arbitrary information such as node attributes and/or diffused features. Therefore, SIGMA is applied on top of representations learned from node features, and augments this information for better performance.
To adapt to graphs where feature and structure information differ in significance, we introduce the model parameters $\delta$ and $\alpha$. A larger $\delta$ emphasizes the importance of feature representations over structural ones, while $\alpha$ balances the contributions of local and diffused features to the final performance. As evaluated in Table 4 in Section 5.5 Ablation Study, our model successfully applies to graphs where the feature information $X$ is important and achieves superior accuracy.
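For concreteness, we include a minimal sketch of the one-shot computation below. It is illustrative only: the function name `sigma_forward` and the exact placement of $\delta$ and $\alpha$ are our shorthand here, not the precise forms of Eq.(6)-(8).

```python
import torch

def sigma_forward(S, A, X, W_A, W_X, delta, alpha):
    # Illustrative only: the exact placement of delta/alpha is our
    # shorthand here, not the precise Eq.(6)-(8).
    h = delta * (A @ W_A) + (1 - delta) * (X @ W_X)  # mix structural & feature inputs
    z_hat = torch.sparse.mm(S, h)                    # one-shot SIGMA aggregation over S
    return alpha * h + (1 - alpha) * z_hat           # combine ego & aggregated representations
```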
### (2) Pure Structural Information for Heterophily
Originating from simple aggregations such as $ADD()$ in GCN and $MEAN()$ in GraphSAGE, an array of aggregation approaches has been proposed for GNNs under heterophily. Most of them are designed based on pure graph structure, as it has been shown that structural information is pivotal in heterophilous graph learning tasks [1-3]. We thus believe that aggregation built on graph topology, rather than directly interacting with features, is a *common solution* to the heterophily issue in GNN designs. As analyzed in Section 3.3, our method differs from previous heterophilous aggregations by utilizing a well-defined global structural similarity measure with effective and efficient calculation.
[1] Pei, et al. "Geom-GCN: Geometric Graph Convolutional Networks." ICLR 2020.
[2] Suresh, et al. "Breaking the Limit of Graph Neural Networks by Improving the Assortativity of Graphs with Local Mixing Patterns." KDD 2021.
[3] Li, et al. "Finding Global Homophily in Graph Neural Networks When Meeting Heterophily." ICML 2022.
******************************************************************
> **W2.** Static setting/One-time precomputation: due to efficiency considerations, the similarity is computed once. However, if the graph structure is not of a high quality, such a decision could be overconfident. There could be cases that the graph structure is not perfect initially and along with the training, node features could help update the graph structure in an iterative way. Similar to the first point but from another perspective, there are concerns that the setting in this paper has a high requirement for structure quality, which might limit the model usage.
**A2.** Being an aggregation approach, *SIGMA is able to adopt iterative computation*. Although in Eq.(6) we perform a one-time MLP transformation and SIGMA aggregation, the aggregation is also applicable to other model architectures and propagation schemes.
To illustrate, in an iterative GCN model the aggregation steps are represented as $Z = \sigma(...\sigma(A\,\sigma(AXW)W)...)$, where each layer performs an aggregation step based on the adjacency matrix $A$. SIGMA can be employed by replacing the adjacency matrix $A$ with the SimRank matrix $S$ in the aggregation, yielding $Z = \sigma(...\sigma(S\,\sigma(SX_S)\,W)...)$, where $X_S=\delta AW_A+(1-\delta)XW_X$. By this means, the model progressively learns structural information and refines its representations during training.
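As a concrete reference, a minimal sketch of this iterative variant is given below; the hidden sizes, ReLU activation, and sparse handling are illustrative choices rather than our exact configuration.

```python
import torch
import torch.nn as nn

class IterativeSIGMA(nn.Module):
    # Sketch of the L-layer variant above: every layer aggregates with a
    # precomputed (sparse, top-k) SimRank matrix S in place of the
    # adjacency matrix A. Sizes/activations are illustrative.
    def __init__(self, num_nodes, feat_dim, hidden, num_classes, num_layers, delta=0.5):
        super().__init__()
        self.delta = delta
        self.W_A = nn.Linear(num_nodes, hidden, bias=False)  # transforms adjacency rows
        self.W_X = nn.Linear(feat_dim, hidden, bias=False)   # transforms node features
        dims = [hidden] * num_layers + [num_classes]
        self.layers = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(num_layers)]
        )

    def forward(self, S, A, X):
        # X_S = delta * A W_A + (1 - delta) * X W_X, as in the text above.
        h = self.delta * self.W_A(A) + (1 - self.delta) * self.W_X(X)
        for i, layer in enumerate(self.layers):
            h = layer(torch.sparse.mm(S, h))   # S replaces A in each layer
            if i < len(self.layers) - 1:
                h = torch.relu(h)
        return h                               # logits for the softmax classifier
```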
We perform additional experiments on SIGMA with the iterative design, as shown in the table below. **SIGMA-L** denotes the $L$-layer model with SIGMA aggregations. It can be observed that, compared with the corresponding GCN, SIGMA aggregation significantly improves model performance, especially on heterophilous graphs. Meanwhile, stacking more iterative layers may lead to over-smoothing. We conclude that our proposed SIGMA is applicable to iterative models and progressive updates that learn from evolving representations.
| **Model**|`genius` |`arXiv-year` |`Penn94`| `twitch-gamers` |`snap-patents`| `pokec`|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|**GCN-1**|62.84| 42.97| 81.86| 61.33| 40.78| 71.15|
|**SIGMA-1**|**91.41**|**55.41**|85.27|67.29|64.71|**82.24**|
|**GCN-2**|61.84| 42.93| 81.90| 61.28| 41.34| 72.78|
|**SIGMA-2**|87.63|55.32|**85.43**|**67.34**|**64.79**|82.20|
|**GCN-3**|66.80| 43.43| 77.58| 62.81| 44.17| 67.63|
|**SIGMA-3**|86.67|54.91|85.27|67.09 |64.69|82.19|
******************************************************************
> **W3.** Aggregating information from top k neighbors with high similarity improves the efficiency, however, it puts a strong restriction that every node will aggregate from the same number of neighbors. In traditional GNNs, degrees are also an important factor. It would be better to discuss this difference and potential consequences: a unified “degree” vs the original degrees.
**A3.** We would like to validate the effectiveness of our approach from both design and empirical perspectives.
### (1) Top-k Aggregation Design
Retrieving messages from a fixed-size subset of neighboring nodes has long been a common approach in graph processing. In the scope of GNNs, similar top-k schemes have been utilized in aggregations such as PPR [5] and k-NN [6]. Empirically, this design is a practical solution offering satisfactory accuracy with good efficiency. Although a degree-based design would bring more precise control, it also incurs additional overhead due to the node-wise personalized scheme.
### (2) Empirical Effect on the Value of k
We would like to note that *the common value of k in our major experiments (32 and 128 for small and large graphs, respectively) is generally larger than the average degree (1.5~30) of the evaluated graphs*. Therefore, adopting top-k aggregation with a sufficiently large k ensures inclusivity for most nodes in the graph. We further evaluate the effect of the value of k in Figure 7 in the Appendix and show that a larger k>128 does not significantly affect model performance. We hence believe that our top-k scheme is sufficient even for nodes with greater degrees. A minimal sketch of the top-k scheme is given after the references below.
[5] Bojchevski, et al. "Scaling Graph Neural Networks with Approximate PageRank." KDD 2020.
[6] Jin, et al. "Universal Graph Convolutional Networks." NeurIPS 2021.
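As a concrete illustration of the top-k scheme discussed in A3, the sketch below keeps only the k highest-SimRank entries per row. For readability it assumes a dense $S$ and row renormalization, whereas the actual computation operates on sparse per-node approximations.

```python
import torch

def topk_sparsify(S: torch.Tensor, k: int) -> torch.Tensor:
    # Keep, for each node u, only its k highest-SimRank entries and
    # renormalize each row (the renormalization is an assumption of this
    # sketch). k = 32 / 128 corresponds to our small / large graph settings.
    vals, idx = torch.topk(S, k, dim=1)
    vals = vals / vals.sum(dim=1, keepdim=True).clamp(min=1e-12)
    rows = torch.arange(S.size(0)).unsqueeze(1).expand_as(idx)
    indices = torch.stack([rows.reshape(-1), idx.reshape(-1)])
    return torch.sparse_coo_tensor(indices, vals.reshape(-1), S.shape).coalesce()
```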
******************************************************************
> **Q1.** Related to Figure 2: whether the experiment is based on the whole dataset or only the training dataset? Topk is yet to be mentioned when introducing Figure 2. The choice of using top k is mainly for efficiency purposes after reading the latter part, but it is unclear when seeing the figure and readers might be confused. Additionally, what is the k used in this setting?
**A4.** The experiment in Figure 2 is conducted on the entire dataset, including the complete graph adjacency. In this figure, we employ the same top-k setting as in the other experiments, i.e., k=32 for small-scale datasets and k=128 for large-scale ones. We will improve the related sections and figures to avoid confusion in the future version.
> **Q2.** Is there a typo in Figure 3 for the proposed model name: using SIMGNN to refer to the proposed model rather than SIGMA? The legend in subfigure c is different from others. It would be better to update it.
**A5.** Yes, SIMGNN in Figure 3 refers to the proposed method. We will update the legend and unify the figure format in the future version.
******************************************************************
---
## Reviewer CAQb --- 4,4,2
We sincerely thank the reviewer for the in-depth feedback. We would like to provide a more detailed elaboration of our method design in the following points.
> **W1.** The concept of utilizing global relationships to address heterophily datasets appears to have been considered in many prior studies addressing heterophily [1-2]. A more detailed discussion on the differences in capturing long-term relationships mechanisms is needed.
**A1.** In this work, we propose SIGMA, featuring the global similarity metric SimRank with a distinct interpretation and efficient computation. We list the layer-wise aggregation scheme, final model output, and suitability to heterophily for H$_2$-GCN [1], GPR-GNN [2], and other similar models in the table below.
It can be inferred from the comparison that the four methods H$_2$-GCN, GPR-GNN, APPNP, and GDC-HK employ *iterative, hop-by-hop* aggregation in each model layer. In each iteration, H$_2$-GCN integrates features from 1-hop and 2-hop nodes, while the rest gather embeddings from 1-hop neighbors. Such a model with $L$ layers thus only receives information from up to $L$ hops away. In heterophilous scenarios where global information is useful, these approaches may necessitate a large number of iterations. We distinguish our SIGMA from these designs, as they realize long-term relationships only through iterative propagation.
The SIGMA aggregation applied to the model structure described in Section 3.2 is more akin to PPRGo, which leverages a precomputed relationship matrix based on graph topology. In other words, the matrix naturally contains the aggregation weight for any node pair $(u,v)$ in the graph, without the need for iterative propagation. As illustrated in Figure 1 (b) and (c), the PPR matrix utilized in PPRGo is not suitable for heterophily, while the SimRank matrix successfully captures globally similar nodes. Consequently, a *single aggregation* operation suffices for SIGMA to establish connections with distant nodes directly. This comparison underscores the novelty of SIGMA in addressing heterophily with global similarity and efficient computation.
| **Model** | **Layer Aggr** | **Final** | **Heterophily?** |
|:---|:---|:---|:--:|
|**H$_2$-GCN** [1]|$H^{(\ell)}=AH^{(\ell-1)}+A^2H^{(\ell-1)}$|$Z=Concat(H^{(1)},H^{(2)},...,H^{(L)})$|Y|
|**GPR-GNN** [2]|$H^{(\ell)}=\hat{A}H^{(\ell-1)}$|$Z=\sum_{\ell=0}^{L}\gamma_\ell H^{(\ell)}$|Y|
|**APPNP** [3]|$H^{(\ell)}=(1-\alpha)\hat{A}H^{(\ell-1)}+\alpha H^{(0)}$|$Z=(1-\alpha)\hat{A}H^{(L-1)}+\alpha H^{(0)}$|N|
|**GDC-HK** [4]|$H^{(\ell)}=e^{-t}\frac{t^\ell}{\ell!} \hat{A}^\ell H^{(0)}$|$Z=\sum_{\ell=0}^{L}H^{(\ell)}$|N|
|**PPRGO** [5]|$Z=\Pi^{ppr}H^{(0)}$|$Z=\Pi^{ppr}H^{(0)}$|N|
|**SIGMA**|$Z=S\hat{H}^{(0)}$|$Z=S\hat{H}^{(0)}$|Y|
[1] Zhu, Jiong, et al. "Beyond Homophily in Graph Neural Networks: Current Limitations and Effective Designs." NeurIPS 2020.
[2] Chien, Eli, et al. "Adaptive Universal Generalized PageRank Graph Neural Network." ICLR 2021.
[3] Klicpera, et al. "Predict then Propagate: Graph Neural Networks meet Personalized PageRank." ICLR 2019.
[4] Gasteiger, et al. "Diffusion Improves Graph Learning." NeurIPS 2019.
[5] Bojchevski, et al. "Scaling Graph Neural Networks with Approximate PageRank." KDD 2020.
> **W2.** In Lemma 3.2, the term $h_u^{(\ell)}[v]$ is not defined, and obtaining this embedding strictly assumes that the initial embedding matrix is an identity matrix, which may not be practical.
**A2.** We wish to provide further details on the interpretation of Lemma 3.2 and Theorem 3.3. $h_u^{(\ell)}[v]$ denotes the aggregation weight assigned by the ego node $u$ to any node $v \in V$. The weight is associated with a random walk of length $\ell$ and encapsulates structural information, namely the probability of reaching a particular node.
Rather than being directly used as node *features* learned by the GNN, the vector $h_u^{(\ell)}$ in this context is regarded as a structural embedding of node $u$. The pairwise structural similarity $S(u,v)$, calculated from the inner products $\langle h_u^{(\ell)}, h_v^{(\ell)}\rangle$ following Eq.(5), is then utilized as the exact *aggregation weight* from $v$ to $u$. Under this interpretation, $H^{(0)}$ is initialized as the identity matrix following the implication of $h_u^{(\ell)}[v] = p(v \mid u, t_{u:v}^{(\ell)})$, i.e., the $\ell=0$ hop similarity of node $u$ is only related to $u$ itself. As stated in **A1**, this is similar to PPRGo, where the PPR calculation is also initialized with an identity matrix.
Since $S$ only represents the aggregation weight, the exact model input in Eq.(6), i.e., the node attributes $X$ and adjacency matrix $A$, and the aggregation input in Eq.(7), i.e., the feature matrix $H$, can be of arbitrary values.
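To make this interpretation concrete, the small NumPy sketch below follows the reading above, assuming Eq.(5) takes the truncated form $S \approx \sum_{\ell=1}^{L} c^\ell\, H^{(\ell)} {H^{(\ell)}}^\top$ with $H^{(0)}=I$; the decay $c$ and truncation depth $L$ are placeholder values rather than the paper's exact configuration.

```python
import numpy as np

def simrank_from_walks(A: np.ndarray, c: float = 0.6, L: int = 10) -> np.ndarray:
    # Sketch only: decayed inner products of l-step random-walk
    # distributions, with H_0 = I; c and L are placeholders.
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    P = A / deg                        # row-stochastic transition matrix
    H = np.eye(A.shape[0])             # H_0 = I: each walk starts at its ego node
    S = np.zeros_like(H)
    for l in range(1, L + 1):
        H = H @ P                      # h_u^(l)[v] = p(v | u, l-step walk)
        S += (c ** l) * (H @ H.T)      # <h_u^(l), h_v^(l)> for all node pairs
    return S
```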
******************************************************************
> **W3.1.** In Lemma 3.2, the definition of embedding and the SimRank matrix do not seem strongly correlated.
**A3.1.** As elaborated in **A2** and **Reviewer G4Zz A1**, we would like to highlight the difference between GNN *aggregation*, which operates on structural similarity, and *propagation*, which actually integrates node features. SIGMA is an aggregation approach whose weights are calculated from the SimRank similarity, interpreted in Lemma 3.2 and Theorem 3.3. Specifically, Lemma 3.2 links the structural embedding to random walks, while Theorem 3.3 derives the SimRank calculation based on this embedding.
> **W3.2.** A closer connection might be established based on random walk-based metrics such as Personalized PageRank or diffusion matrices. In light of this, are the two advantages of SimRank, global aggregation, and one-time manner specific for SimRank (for instance, iterative training of PageRank also exhibits a one-time manner when the iteration approaches infinity). Relevant discussions need to be provided.
**A3.2.** Here we compare SIGMA with GNNs that incorporate similar walk-based and diffusion approaches, as suggested, including PPRGo [5], AGP-HK [6], and SGC [7]. The aggregation scheme of PPRGo is listed in the table in **A1**. AGP-HK utilizes the Heat Kernel diffusion for precomputed *propagation*, corresponding to the iterative GDC-HK in the same table. SGC adopts the propagation $\hat{H}=\hat{A}^LX$.
### (1) Discovering Global Similarity
Similar to the analysis in **A1**, the PPR matrix in PPRGo retrieves local information and fails to address heterophily globally. AGP-HK and SGC have also been shown to be locality-based band-pass filters in the graph spectral domain, as in Figure 2(a) of [4]. In comparison, we demonstrate both theoretically and experimentally that SimRank/SIGMA retrieves global similarity in heterophilous graphs.
### (2) One-time Computation
To the best of our knowledge, there is no existing method specifically designed for heterophilous graph learning that uses a one-time aggregation mechanism. Homophily-based one-time methods, including PPRGo, AGP-HK, and SGC, are known to underperform in heterophily settings.
Here we conduct additional experiments to directly compare the effectiveness and efficiency of these methods with SIGMA on several heterophilous graphs. It can be observed that, although baselines such as PPRGo are relatively faster in some cases, they consistently achieve suboptimal accuracy across these 5 datasets. Therefore, their applicability to heterophily is limited.
| |`genius` | |`arXiv-year` | |`Penn94`| | `twitch-gamers` | |`snap-patents`| |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **Model**|**Acc (%)**|**Time (s)**|**Acc (%)**|**Time (s)**|**Acc (%)**|**Time (s)**|**Acc (%)**|**Time (s)**|**Acc (%)**|**Time (s)**|
|**SGC** [7]|81.94|**55.25**|32.58|**13.6**|76.03|**9.2**|59.21|**46.8**|29.09|**88.6**|
|**PPRGo** [5]|77.51|*84.49*|30.60|38.3|75.36|24.6|51.03|*63.6*|36.90|758.8|
|**AGP-HK** [6]|79.83|555.04|36.76|232.2|70.73|137.17|59.16|220.6|32.30|*356.9*|
|**SIGMA**|**91.68**|153.6|**55.16**|*36.5*|**86.31**|*17.2*|**67.21**|236.5|**64.63**|408.2|
[6] Wang, et al. "Approximate Graph Propagation." KDD 2021.
[7] Wu, et al. "Simplifying Graph Convolutional Networks." ICML 2019.
## Reviewer xhwP --- 5,6,4
We thank the reviewer for the thoughtful review and evaluation of our contribution. We would like to address the questions below.
> **W1.** Writing Improvement Needed: There is a need for refinement in terms of grammar and formatting, particularly in the appendix section. A thorough double-check is essential to rectify evident errors, such as the instance in the introduction: "on a realistic dataset is shown in Figure 1 (b) and \(c\)."
**A1.** We thank the reviewer for pointing out the mistake. We will conduct a thorough review of the paper to refine the grammar and formatting in the next version.
> **Q1.** Under the top-k scheme, do we only use the k selected nodes for node aggregation and update in Equations 7 and 8?
**A2.** Yes. $\hat{z}_u$ in Eq.(7) and Eq.(8) aggregates only the top-k nodes $v$ for each node $u$, selected by the SimRank score $S(u,v)$. In the update operation of Eq.(8), the aggregated embedding $\hat{z}_u$ is combined with the ego node's feature $h_u$ to produce the output $z_u$.
> **Q2.** In Appendix ‘C: ARCHITECTURE OF SIGMA’, Figure 6, are the embeddings from SimRank Aggregation directly passed to the softmax classifier? Is there a MLP before the softmax classifier?
**A3.** SIGMA in Figure 6 utilizes the post-propagation mechanism, where the embeddings after aggregation based on $S$ are directly passed to the softmax classifier. The $MLP_H$, denoted as Linear Combination in the figure, is applied before aggregation.
Here we compare SIGMA with the alternative **SIGMA-Pre**, which employs pre-propagation: it first conducts aggregation and then performs learning with $MLP_H$ before the softmax classifier. As shown in the table below, the current post-propagation design achieves better performance (a minimal sketch contrasting the two orderings follows the table). We believe it is more effective to first combine representations using the MLPs, and then use the resulting output for aggregation.
|**Model**|`genius` |`arXiv-year` |`Penn94`| `twitch-gamers` |`snap-patents`| `pokec`|
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
|**SIGMA-Pre**|90.73|54.28|85.06|66.07|63.85|81.97|
|**SIGMA**|**91.68**|**55.16**|**86.31**|**67.21**|**64.63**|**82.33**|
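For clarity, below is a minimal sketch contrasting the two orderings; the single-MLP simplification and the function names are our shorthand, not the exact architecture.

```python
import torch

def sigma_post(S, X, mlp, classifier):
    # Post-propagation (Figure 6): learn with MLP_H first, aggregate with S,
    # then pass the result directly to the (softmax) classifier.
    return classifier(torch.sparse.mm(S, mlp(X)))

def sigma_pre(S, X, mlp, classifier):
    # Pre-propagation (SIGMA-Pre): aggregate the inputs with S first,
    # then learn with MLP_H before classification.
    return classifier(mlp(torch.sparse.mm(S, X)))
```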
## Reviewer hpmD --- 2,3,3
We sincerely thank the reviewer for the detailed comments. We respond to the questions below.
> **Q1.** For Figure 2, authors validate the ability of SimRank for aggregating global homophily. The author provided the results of five small scale datasets, but the Squirrel dataset was also added in the subsequent experiments. Is its answer not good or why it not good? And, why is this experiment not done on large scale data sets? According to the authors, the study focused on large-scale graphs.
**A1.** In Figure 2, we mainly intend to illustrate the ability of SimRank to identify globally homophilous node pairs. Due to space limitations, we only show results on representative datasets with different homophily/heterophily ratios and different SimRank score distributions. The results on other datasets are consistent with our observation that homophilous node pairs tend to have higher SimRank scores than heterophilous ones. We will include more results in the appendix of the future version.
> **Q2.** In the last paragraph of part 4, authors highlight that “We highlight the uniqueness of SIGMA to all the above works for its global aggregation incorporating the well-defined topological similarity”. How do you explain that uniqueness?
**A2.** This statement intends to highlight the methodological novelty of our proposed model. As reviewed in that section, we classify GNN models addressing heterophily into two categories. One series of designs utilizes *iterative aggregation* schemes to discover long-term relationships. We present a comparison table and relevant discussion in **Reviewer CAQb A1**. In short, the multi-hop capability of these models is highly dependent on the number of model layers, which limits their adaptability and efficiency.
Another line of works incorporates *global information* to address heterophily. LINKX directly learns from the graph adjacency, while GloGNN solves a global optimization problem. In comparison, our SIGMA design is based on the well-defined global topological similarity measure SimRank, with the advantages of a theoretical interpretation and efficient computation of linear complexity. We hence distinguish our model from the above works for proposing a simple global aggregation incorporating this well-defined topological similarity.
> **Q3.** In line 729, what do you mean “Table 2 in Figure 3”? It might be “Table b in figure 3”?
**A3.** We apologize for the confusion. In this part, we first select the 4 baseline models with the leading overall efficacy based on the evaluation in Table 2. Then, we compare the convergence time of SIGMA against these models on various datasets, as illustrated in the subfigures of Figure 3. We believe this comparison demonstrates the superiority of our model in reaching high accuracy with relatively low training overhead. We will revise the presentation to clarify the experiment and comparison in the future version.
> **Q4.** It is suggest to add the model diagram to facilitate understanding.
**A4.** Yes, the model diagram corresponding to Eq.(6)-(8) is provided in Figure 6 in the Appendix. We elaborate on the design details in Appendix C (Architecture of SIGMA).
> **Q5.** Besides, there are some punctuation mistakes in this paper. For example, in the first paragraph of part 3.1, the end period was missing. Please double check your paper.
**A5.** We thank the reviewer for pointing out these typos. We will conduct a careful review and correct such mistakes in the next version.