# Cooperative Graph Neural Networks - ICML Rebuttal

# Global Response

We thank the reviewers for their comments. We respond to each concern in detail in our individual responses to the Reviewers, and provide here a summary of our rebuttal:

* **Experiments with deep Co-GNNs (Reviewers sMNf and k8kJ):**
  * We repeated our experiments with the Co-GNN($\mu$,$\mu$) architecture using an increasing number of layers, showing that Co-GNNs are effective in alleviating the oversmoothing phenomenon on the homophilic **cora** dataset and the heterophilic **roman-empire** dataset.
  * We present the ratio of directed edges kept for a deep 40-layer architecture, showing the ability of Co-GNNs to sparsify the homophilic **cora** dataset and the heterophilic **roman-empire** dataset, effectively ignoring distant nodes.
* **Experiments on the choice of environment and action networks (Reviewer sMNf):** We added an ablation study over all the different combinations of action and environment networks in Co-GNNs on the __minesweeper__ (heterophilic) and __RootNeighbors__ (synthetic) datasets, further emphasizing that the optimal choice of action and environment networks heavily depends on the underlying task.
* **Experiments comparing with RSGNN (Reviewer zYzs):** To highlight the difference between RSGNN and Co-GNNs, we provide experimental results for the RSGNN model on the homophilic **cora** and **pubmed** datasets and on the heterophilic **roman-empire** and **amazon-ratings** datasets. The comparison shows that RSGNN is a strong baseline on homophilic datasets, but it does not match the performance of Co-GNN models.
* **Experiments using stronger base architectures (Reviewer zYzs):** We provide experimental results using the more powerful GCNII architecture as the base architecture of Co-GNN on the homophilic **pubmed** and **cora** datasets, achieving even better performance.
* **Flexibility of the Co-GNN framework (Reviewers MprR and zYzs):** We would like to emphasize that Co-GNNs can be used with a variety of base architectures, enhancing their capabilities.

We hope that our answers, along with the new experiments, address your concerns. We are looking forward to a fruitful discussion period.

# Reviewer sMNf (6)

We thank the Reviewer for their comments and for acknowledging the promise of the framework in various tasks. We address each of their questions/concerns below.

> L1: The choice of action network and environment network seems to significantly affect performance (Table 1). However, it remains unclear which to select for a specific task. Although the authors provide five options for both, resulting in 25 combinations, only 2 combinations are shown in Table 2. It would be beneficial to include a table showing the performance of all combinations and discuss these further.

**Answer:** Thanks for raising this point! We agree that an ablation study over the choice of environment and action networks would provide additional insights. To address this, we ablated over all the different combinations of action and environment networks in Co-GNNs, resulting in 25 combinations, as suggested by the Reviewer. Specifically, we ran additional experiments using MeanGNN, GAT, SumGNN, GCN and GIN on the heterophilic graph __minesweeper__ and on the synthetic dataset __RootNeighbors__.
In the following table, we report the accuracies over __minesweeper__, where the action networks are shown in *italic* and the environment networks in **bold**:

|           | MeanGNN | GAT   | SumGNN | GCN   | GIN   |
|-----------|---------|-------|--------|-------|-------|
| *MeanGNN* | 97.31   | 97.46 | 98.24  | 95.15 | 93.19 |
| *GAT*     | 97.02   | 96.97 | 97.00  | 93.74 | 92.94 |
| *SumGNN*  | 97.02   | 96.06 | 95.09  | 95.14 | 91.83 |
| *GCN*     | 97.44   | 96.52 | 96.35  | 94.43 | 92.89 |
| *GIN*     | 97.27   | 95.93 | 91.83  | 92.02 | 92.41 |

Co-GNNs achieve multiple state-of-the-art results on __minesweeper__ with different choices of action and environment networks. Specifically, we draw the following general observations from the ablation results:

- *Environment network*: We observe that MeanGNN and GAT yield consistently robust results when they are used as environment networks, regardless of the choice of the action network. This makes sense in the context of the __minesweeper__ game: in order to make a good inference, a $k$-layer environment network can keep track of the average number of mines found at each hop distance. The task is hence well-suited to mean-style aggregation environment networks, which manifests itself as empirical robustness. GCN performs worse as an environment network, since it cannot distinguish a node from its neighbors.
- *Action network*: The role of the action network in the context of the __minesweeper__ game is intricate, as it makes a distinction between 0-labeled nodes and non-0-labeled nodes (please refer to our response to L3). In principle, such an action network can be realized with all of the architectures considered, and hence we do not observe dramatic differences in performance regarding the choice of the action network.

In the following table, we report the MAE over the __RootNeighbors__ dataset, where the action networks are shown in *italic* and the environment networks in **bold**:

|           | MeanGNN | GAT   | GCN   | GIN   | SumGNN |
|-----------|---------|-------|-------|-------|--------|
| *MeanGNN* | 0.339   | 0.352 | 0.356 | 0.354 | 0.353  |
| *GAT*     | 0.334   | 0.323 | 0.354 | 0.370 | 0.355  |
| *GCN*     | 0.336   | 0.353 | 0.351 | 0.370 | 0.354  |
| *GIN*     | 0.088   | 0.104 | 0.154 | 0.171 | 0.176  |
| *SumGNN*  | 0.079   | 0.085 | 0.139 | 0.181 | 0.196  |

The empirical results on __RootNeighbors__ further support our analysis from Section 6.1: a mean-type aggregation environment network and a sum-type aggregation action network solve the task. Unlike __minesweeper__, the choice of the action network is critical for this task: SumGNN and GIN - both using sum aggregation - yield the best results across the board when they are used as action networks.

In both experiments, the empirical findings suggest that the optimal choice of action and environment networks heavily depends on the underlying task. We are happy to include these results and discussions in the revised version of our paper.
> L2: For the homophilic dataset, how does performance change with an increasing number of layers? It appears that Co-GNN can learn to filter out ALL edges to focus on near neighbors. Does this imply that Co-GNN maintains performance even with a very deep model (e.g., more than 10 layers), by automatically ignoring long-distance nodes?

**Answer:** Thanks for raising these important aspects, which we answer in two parts:

**Accuracy as the number of layers grows:** Co-GNNs can alleviate the over-smoothing that manifests itself with the use of more layers in GNNs. This is possible due to the flexible message-passing paradigm of Co-GNNs, enabled by the action network, which can, for instance, decide to isolate the nodes that do not benefit from additional information. Following the Reviewer's request, we repeated our experiments with the Co-GNN($\mu$,$\mu$) architecture over the homophilic **cora** dataset and the heterophilic **roman-empire** dataset using an increasing number of layers and report the accuracies:

| #Layers | Co-GNN($\mu$,$\mu$) (cora) | MeanGNN (cora) | Co-GNN($\mu$,$\mu$) (roman-empire) | MeanGNN (roman-empire) |
|---------|-----------------|------------------|------------------|-------------------|
| 1  | 85.51 $\pm$ 1.10 | 83.62 $\pm$ 1.84 | 81.01 $\pm$ 0.53 | 78.92 $\pm$ 0.01  |
| 5  | 86.40 $\pm$ 1.40 | 28.87 $\pm$ 2.17 | 90.36 $\pm$ 0.42 | 87.15 $\pm$ 0.01  |
| 10 | 86.08 $\pm$ 1.80 | 28.87 $\pm$ 2.17 | 91.30 $\pm$ 0.29 | 86.22 $\pm$ 1.17  |
| 20 | 85.23 $\pm$ 2.00 | 28.87 $\pm$ 2.17 | 91.05 $\pm$ 0.81 | 75.11 $\pm$ 4.12  |
| 30 | 83.02 $\pm$ 1.80 | 28.87 $\pm$ 2.17 | 91.07 $\pm$ 1.09 | 58.46 $\pm$ 7.51  |
| 40 | 82.04 $\pm$ 2.00 | 28.87 $\pm$ 2.17 | 90.44 $\pm$ 1.10 | 38.67 $\pm$ 11.80 |

The results indicate that the performance is generally retained for deep models and that Co-GNNs are effective in alleviating the oversmoothing phenomenon, even though their base GNNs suffer from performance deterioration already with a few layers.

**Graph topology as the number of layers grows:** To better understand the contribution of the action network, we show the ratio of directed edges kept for a deep 40-layer Co-GNN($\mu$,$\mu$) architecture (much like in Section 7.2) on two datasets, **cora** (homophilic) and **roman-empire** (heterophilic):

| Layer | cora  | roman-empire |
|-------|-------|--------------|
| 0     | 0.795 | 0.000        |
| 10    | 0.250 | 0.013        |
| 20    | 0.060 | 0.514        |
| 30    | 0.000 | 0.976        |
| 39    | 0.000 | 0.980        |

The results reaffirm that the performance is retained for deep models also in the homophilic setting, thanks to the ability to sparsify the graph, effectively ignoring distant nodes.

> L2: For the heterophilic dataset, more edges are kept in the later layers. While this may indicate that the model focuses on longer-distance nodes, it seems problematic when almost 100% edges are kept as it loses the capability of information selection, potentially leading to the over-squashing problem that this work aims to solve.

**Answer:** This is related to a very subtle point regarding the behaviour of Co-GNNs on heterophilic datasets: while the action network tends to keep more and more edges as the number of layers increases, this behaviour is *adaptive* to the number of layers being used. For example, if we train a 40-layer network rather than a 10-layer network (as reported in the table above), we observe that the action network *very gradually* increases the connectivity of the graph. In fact, we observe very high connectivity only after ~30 layers. If we plot the ratio of directed edges kept (as in Section 7.2, Fig 4), we essentially observe the same S-shaped curve seen in Fig 4E. This is reassuring, since it precisely answers the concern of the Reviewer: it could have been problematic to use all the edges from the early layers onwards (as it could lead to over-squashing), but the network avoids this issue thanks to its adaptive nature.
> L2: how do the model's actions change with an increasing or decreasing number of layers? For example, it seems that the information from 10-hop neighbors is important on the roman-empire dataset, so what is the performance of 9-layer Co-GNN, given that it cannot reach 10-hop neighbors?

**Answer:** The choices of the action network are adaptive, as explained in response to the earlier question. To see the effect of the number of layers, we ran additional experiments using the Co-GNN($\mu$,$\mu$) architecture with a varying number of layers $L \in \{1,\cdots,20\}$. The 9-layer model achieves $90.87\%$ accuracy, which already surpasses all of the baselines but does not match the highest accuracy of Co-GNNs. The peak accuracy of $91.57\%$ is achieved using the 12-layer model, and deeper models largely retain this accuracy, e.g., we observe $91.05\%$ with the 20-layer model.

> L3: When we play the minesweeper game, we only consider the information from 1-hop neighbors. What is the necessity of considering 10-hop neighbors in this example?

**Answer:** Unlike the classical Minesweeper game, the **minesweeper** dataset has only *partial* information: 50% of nodes have their features hidden. To accurately predict the map of mines from the partial information, the model needs to take the long-range interactions into account.

> L3: The authors claim that the 0-labeled nodes do not help and can be filtered out. However, the 0-labeled nodes should be important as they indicate that their neighbors are NOT mines.

**Answer:** We agree that we need to phrase this better to avoid any ambiguities. In this game, every label is informative, but we think non-0-labeled nodes are more informative in the *earlier layers*, whereas 0-labeled nodes become more informative at *later layers*. This can be explained by considering two cases:

- **Case 1**: The target node has *at least one* 0-labeled neighbor: In this case, the prediction trivializes, since we know that the target node cannot be a mine. To classify this target node correctly, a single-layer network is sufficient.
- **Case 2**: The target node has *no* 0-labeled neighbors: In this case, the model needs to make a sophisticated inference based on the surrounding mines within the $k$-hop distance. In this scenario, a node obtains more information by aggregating from non-0-labeled nodes. The model can still implicitly infer "no mine" information from the lack of a signal/message from a node.

To achieve high accuracy in this task, it is important to develop a strategy that works best for *all* nodes. Hence, the model diffuses the information of non-0-labeled nodes in the early layers to capture nodes that fall under Case 2, while largely isolating the 0-labeled nodes until the last layers (avoiding mixing of signals and potential loss of information), which can later be used to capture the nodes that fall under Case 1.

> L3: In addition, it seems that all the nodes in the left section are connected to their neighbors. How does the Co-GNN focus on the relevant information and how does it address the over-squashing issue in this case?

**Answer:** This is related to our earlier response, as the model tends to diffuse information from non-0-labeled nodes, which aligns well with this game. This leads to a connected subgraph which can itself be subject to over-squashing, but much less so than the original graph: the empirical results (achieving near-perfect accuracy) suggest that the model does not suffer from this problem.
# Reviewer MprR (5)

We thank the Reviewer for their comments, for finding our work innovative, and for acknowledging the potential of our cooperative message-passing to overcome the limitations of traditional GNNs. We address each of the raised questions/concerns below.

> W1: Table 2 exclusively features heterophilous datasets, whereas the hyperparameter analysis and dataset statistics involve some homophilous datasets (Cora, PubMed). Are Table 5 and Table 2 based on the same settings? If so, why not demonstrate the performance on homophilous datasets in the main text to prove the framework’s effectiveness more convincingly? It's crucial to show that a framework performs well on both homophily and heterophily.

**Answer:** The heterophilic and homophilic node classification experiments follow different setups, as detailed in Section 6.2 and Appendix C.4, respectively. They do not use the same hyper-parameters (see Tables 10 and 13, respectively). Please note that we follow the standard experimental protocols for all these experiments. We would like to draw the Reviewer's attention to our extensive experiments on a broad range of tasks and benchmarks, including long-range, graph classification, homophilic node classification and heterophilic node classification tasks (see Section 6.2 and Appendices C.2-4), which all show that our framework performs well across the board. Due to space considerations, we delegated some experimental results to the appendix, but we are open to integrating some homophilic results into the body of the paper if the Reviewer thinks that this will strengthen the message of the paper.

> W1: Co-GNN should be compared with some SOTA GNNs. Currently it’s only compared with older baselines like GCN, GraphSAGE, …

**Answer:** In Table 2, we compare Co-GNNs to the best-performing baselines from the original benchmarking study (Table 4 in Platonov et al. (2023)). These models outperform all other models - including very complex ones specifically designed for heterophilic datasets - according to the original paper (Platonov et al. 2023). We can include these models or any other baselines that the Reviewer recommends; we think the general trends in the experimental findings will remain valid. In all of the experiments, we compare against a variety of models, including graph transformers, to show the virtue of our approach.

> W2: The interpretation of Figure 1 is unclear. In the upper row, it appears that node $u$ listens to all neighbors in the first layer. However, it is not clear why it only listens to node $v$ in the second layer. The presentation is not clear here. Clarification on this would enhance understanding.

**Answer:** Figure 1 describes a synthetic example showing the limitations of the standard message-passing paradigm: the choices for the actions are arbitrary, as the Reviewer points out, but this is merely to showcase a concrete example flow of information that cannot be realized by standard message-passing. We present a task-specific concrete example in the context of the **RootNeighbors** experiment. We will clarify this.

> W3: Is it possible to extend the robust framework, featuring "standard, listen, broadcast, isolate" modes, to encompass directed graphs? Specifically, it would be helpful to know how a node decides which neighbors to listen to in a directed graph setting.

**Answer:** This is an excellent pointer. While we focus on simple, undirected graphs in our paper, there is no fundamental limitation in adapting Co-GNNs to directed, and even multi-relational, graphs. We think this is an important future direction, but one that requires a dedicated, separate study. One possible adaptation is to include actions that also consider directionality: for example, for each node $u$, we can define the actions LISTEN-INC (listen to nodes that have an incoming edge to $u$) and LISTEN-OUT (listen to nodes that have an incoming edge from $u$) and extend the other actions analogously to incorporate directionality (Rossi et al. 2023); a small sketch of this idea follows below. Note that similar ideas could be applied even to multi-relational graphs. However, each of these design choices requires a thorough study, which is beyond the scope of the current work. Another approach is to use action/environment networks that can adequately handle directed (or multi-relational) graphs (directional GNNs): in this case, we can directly use Co-GNNs with these action/environment architectures, and we do not anticipate any changes in the action space. We will include a discussion on these avenues for future work. Thank you for pointing this out!
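For concreteness, here is a minimal Python sketch of the direction-aware action space outlined above. This is our own illustration for this response, not code from the paper: the semantics chosen here (a listening node does not broadcast, STANDARD acts in both directions) are one possible design among several.

```python
# Illustrative sketch only: a possible direction-aware action space for
# Co-GNNs on directed graphs. A directed edge (u, v) means u -> v. Whether
# a message flows over an edge depends on the actions of both endpoints.
from enum import Enum, auto

class Action(Enum):
    STANDARD = auto()    # send and receive along all incident edges
    BROADCAST = auto()   # send but do not receive
    LISTEN_INC = auto()  # receive only from in-neighbors (edges into the node)
    LISTEN_OUT = auto()  # receive only from out-neighbors (edges out of the node)
    ISOLATE = auto()     # neither send nor receive

def sends(a):
    return a in (Action.STANDARD, Action.BROADCAST)

def active_messages(edges, action):
    """Messages (src, dst) that flow in one layer, given per-node actions."""
    msgs = set()
    for u, v in edges:  # directed edge u -> v
        if sends(action[u]) and action[v] in (Action.STANDARD, Action.LISTEN_INC):
            msgs.add((u, v))  # v hears its in-neighbor u
        if sends(action[v]) and action[u] in (Action.STANDARD, Action.LISTEN_OUT):
            msgs.add((v, u))  # u hears its out-neighbor v
    return msgs

edges = [(0, 1), (1, 2)]
action = {0: Action.BROADCAST, 1: Action.LISTEN_INC, 2: Action.STANDARD}
print(active_messages(edges, action))  # {(0, 1)}: node 1 hears only in-neighbor 0
```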
References:
- Platonov et al., A critical look at the evaluation of GNNs under heterophily: are we really making progress?, ICLR 2023.
- Rossi et al., Edge Directionality Improves Learning on Heterophilic Graphs, LoG 2023.

# Reviewer zYzs (4)

We thank the Reviewer for their comments and for acknowledging that Co-GNNs have many good properties, achieve good performance, and that the presentation, illustration, figures and visualization are clear and convincing. We address each of their questions/concerns below.

> Weakness 1: The framework is still limited by some inflexibility. When a node chooses to broadcast or listen, it broadcasts or listens to all its neighbors. It can not flexibly choose the node that it listens to or broadcasts to.

**Answer:** This is a very interesting point. Technically, node-based actions can capture all topological configurations that can be obtained by edge-based actions, using sufficiently many layers. For example, suppose we are interested in a graph with the edges $(u_1, u_2)$, $(u_1,u_3)$, where the node state of $u_1$ is only relevant for $u_3$ but not for $u_2$. This can be achieved if $u_1$ broadcasts, $u_2$ isolates, and $u_3$ listens. It is slightly trickier if $u_2$ has another neighbor whose information needs to be transmitted to $u_2$. In this case, we can 'serialize' the process: after applying one layer as before, we can now isolate $u_1$ and allow $u_2$ to listen. The examples can be made more complex, but the main idea is always the same (the sketch below makes the induced topology explicit). In this sense, node-based actions are powerful. Our framework can trivially be extended to edge-based actions, which could lead to more succinct constructions, but with the downside of a potentially much larger action space (and higher runtime complexity). Therefore, we focused on node-based actions in our study.
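The following tiny self-contained sketch (our illustration, not the paper's code) replays the serialization argument on the example above; the extra neighbor `u4` of $u_2$ is hypothetical, added only to make the second layer non-trivial:

```python
# Illustrative sketch: per-node Co-GNN actions induce a directed message
# topology. A message src -> dst flows iff src broadcasts (STANDARD or
# BROADCAST) and dst listens (STANDARD or LISTEN).
S, L, B, I = "standard", "listen", "broadcast", "isolate"

def induced_messages(undirected_edges, action):
    msgs = set()
    for u, v in undirected_edges:
        for src, dst in ((u, v), (v, u)):
            if action[src] in (S, B) and action[dst] in (S, L):
                msgs.add((src, dst))
    return msgs

edges = [("u1", "u2"), ("u1", "u3"), ("u2", "u4")]

# Layer 1: u1 broadcasts, u3 listens, u2 and u4 isolate.
print(induced_messages(edges, {"u1": B, "u2": I, "u3": L, "u4": I}))
# {('u1', 'u3')} -- u1's state reaches u3 but not u2.

# Layer 2 ('serializing'): u1 isolates; u2 now listens to its other neighbor u4.
print(induced_messages(edges, {"u1": I, "u2": L, "u3": I, "u4": B}))
# {('u4', 'u2')} -- u2 receives from u4 without ever mixing in u1's state.
```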
> Weakness 2: Although this framework is novel, some previous papers can also solve the problem of the fixed graph structure to some extent, such as RSGNN [1]. There is no comparison to those previous related works. It would be better if the authors could include some baselines from previous related papers. This is my major concern.

**Answer:** Thanks for bringing this work to our attention. We will include a discussion related to this paper. While having a very different motivation (related to handling "noisy edges" and "attacks") than our work, RSGNN also operates over a learned topology. However, once this topology is learned, it is fixed and then used as is. This is fundamentally different from Co-GNNs, which can operate on a different topology at every layer - a crucial ingredient in their dynamic information flow. Overall, the approaches, the techniques, and the formal results in these works are different. However, we agree that an empirical comparison is technically possible and report our findings in the following tables:

|                   | pubmed           | cora             |
|-------------------|------------------|------------------|
| RSGNN             | 85.43 $\pm$ 0.41 | 85.53 $\pm$ 1.76 |
| Co-GNN($*$,$*$)   | 89.51 $\pm$ 0.88 | 87.44 $\pm$ 0.85 |

|                   | roman-empire     | amazon-ratings   |
|-------------------|------------------|------------------|
| RSGNN             | 29.13 $\pm$ 0.46 | 44.14 $\pm$ 0.32 |
| Co-GNN($\mu,\mu$) | 91.37 $\pm$ 0.35 | 54.17 $\pm$ 0.37 |

RSGNN is a strong baseline on homophilic datasets, but it does not match the performance of Co-GNN models. On heterophilic datasets, we observe that RSGNN performs worse than most of the reported baselines. This can be explained by (i) the fact that RSGNN learns the topology of the input graph using a "homophily assumption" (i.e., two nodes have a link if they are similar) and (ii) the fact that the topology is not dynamic and adaptive across the layers - a key property enabling the strong results of Co-GNNs. We will include this comparison and discussion in the revised version of our paper.

> Weakness 3: The first experiment on the synthetic dataset is not convincing to me, because I think the task can benefit a lot from the model design, and this task is not common in graph learning.

**Answer:** We would like to emphasize that this is a very simple task on which all GNNs appear to fail, and our objective was to show whether Co-GNN models could learn the desired information propagation mechanism. In this setup, we know the target function, and we can determine which edges must be 'kept or dropped' in the optimal information flow scenario. We verify that Co-GNN models decide to 'keep or drop' the edges accurately in >99% of the cases. This has been realized empirically: the trained Co-GNN model extracted the tree shown in Figure 3 (right) from a test instance shown in Figure 3 (left). We find this more insightful than the MAEs alone.

> Q1: The performance is worse than some baseline models on graph classification (Table 4) and homophilic node classification (Table 5). Some baseline models that are based on more powerful GNN architecture perform better. If we apply the Co-GNN method to more powerful GNN architectures, can it further improve the performance in graph classification and homophilic node classification?

**Answer:** Thanks for this comment. We would like to stress that our aim was to highlight the benefits of our new cooperative message-passing approach, allowing the use of very simple GNN architectures to achieve state-of-the-art results against much more complex models such as Graph Transformers (see Table 2). In fact, we wanted to make sure that the reported improvements can be attributed solely to the novel message-passing mechanism rather than to potential confounding factors, such as sophisticated model components. That being said, we agree that Co-GNNs can be used with more powerful architectures to achieve even better performance. To verify this in the homophilic setting, we additionally experimented with the GCNII architecture as the base architecture on the **pubmed** and **cora** homophilic datasets. Our approach achieves node classification accuracies of 89.64% and X%, respectively, outperforming any other method and achieving new state-of-the-art results. We will include these results in the revised version of the paper.
> Q2: What are the most suitable scenarios/tasks for Co-GNN, compared to traditional GNN models?

**Answer:** The message-passing paradigm of Co-GNNs generalizes classical message-passing; hence, Co-GNNs typically improve over the base GNN that is used, irrespective of the specific context. Co-GNNs are particularly suitable for:

* **Heterophilic graphs**: The dynamic and flexible message-passing mechanism of Co-GNNs allows for a task-specific information flow. This makes Co-GNNs a very strong baseline on heterophilic datasets, which typically require delicate information propagation (beyond the homophilic setup) that standard GNNs cannot realize. Notably, Co-GNNs outperform much more complex models such as Graph Transformers in the heterophilic setting, as we report in Table 2.
* **Long-range tasks**: Long-range tasks necessitate propagating information between distant nodes. Our message-passing paradigm can efficiently filter irrelevant information by learning to focus on a subgraph (or even a path, as shown in Theorem 5.2) between two such nodes, hence maximising the information flow to the target node. Empirical results for these tasks are presented in Tables 2 and 3, highlighting the success of Co-GNNs in these contexts.

# Reviewer k8kJ (7)

We thank the Reviewer for their comments and for acknowledging that the theoretical insights and the experimental campaign justify the proposed scheme. We address each of their questions/concerns below.

> Q2: I would appreciate a more in depth comment on the relation of the proposed scheme with Attention mechanisms. Indeed, while here an action network (a GNN which is shared among layers) is responsible to decide each node role in the information diffusion, also in GAT this process is learned from data (the attention coefficient is computed via a single-layer feedforward neural network which process node features). In GAT, when the attention score is 0, the contribution from that node is still zero, so it can be interpreted as Isolate.

> W1: The paper could benefit from a more in-depth discussion on the relation of the proposed approach (using action/environment networks) with other methods that learn how to propagate messages (such as Attention - GAT [1]) and other approaches that prioritize graph information diffusion [2-5].

**Answer:** This is a very good point. Let us highlight the fundamental differences between Co-GNNs and GAT:

1. GATs can learn appropriate attention coefficients to filter out some information from the neighbors, but learning the $0$ coefficient (i.e., discarding the message altogether) is typically hard in practice (soft attention), whereas the action network in Co-GNNs can easily choose the action “isolate” (as widely observed in our experiments).
2. Suppose, for the sake of the argument, that GATs can learn the best attention coefficients. This does not always yield fine-grained control over the information flow. The argument hinges on the fact that attention is feature-based and is normalized via softmax. If GAT attends to a particular node in the neighborhood, then it must attend in the exact same way to all other neighbors that have identical features. This causes the following effect: fix a GAT model and apply it to a test graph that has higher-degree nodes with more neighbors having identical features. Then the contribution of these identical node features increases with their frequency (and this effect cannot be avoided), eventually belittling the contribution of the features of all other nodes. Intuitively, this is due to softmax normalization, which produces smaller scores for less frequent node features (see the numeric sketch after this answer).
3. GATs have other inherent limitations. For example, they cannot even detect node degrees. This is evident in our RootNeighbors experiment, where the information needs to be propagated only from degree-6 nodes, and the GAT model, unable to detect node degrees, performs poorly. In other words, GATs may not be able to detect which information needs to be filtered (especially structural information).
4. The contribution of Co-GNNs is orthogonal to that of GATs. In fact, GATs can be used as a base architecture in Co-GNNs, and our message-passing paradigm generally increases the performance of GATs. To show this, we experiment in Table 1 with Co-GNN($\Sigma$,$\alpha$) in our synthetic example, which results in an 80% decrease in MAE from the initial GAT performance, which was basically a random guess. In this case, the action network allows GAT to determine the right topology, and GAT only needs to learn to average the features.
5. The action network of Co-GNNs can be viewed as a look-ahead network that makes decisions after 'applying' $k$ GNN layers. It can therefore recognize, before the environment network aggregates any information, the relevant nodes based on the $k$-hop information. This is a unique feature of Co-GNNs, which allows the aggregation mechanism to be conditioned on this look-ahead capability, which is not present in GATs.

In the context of graph information diffusion methods, the key differences are:

1. In "[2]" and "[4]", the edges are scaled down or up together using a learnable scalar in each layer. In contrast, in Co-GNNs different edges may become directed or be removed.
2. In "[3]", a generative model is proposed to learn the different layers in which each node propagates its information. Conversely, the actions in Co-GNNs can be expanded to include 2-hop/3-hop/$k$-hop neighbors and thus are not limited to immediate neighbors.
3. "[5]" uses parametric controllers to decide the propagation depth for each node, but it lacks the generality of Co-GNNs, where each node can control whether to receive or transmit information. Furthermore, Co-GNNs can dynamically control the propagation of information and use the look-ahead property mentioned in Section 5.1.
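To illustrate the softmax-normalization effect described in point 2 above, here is a small numeric sketch (our illustration; the attention scores are hypothetical):

```python
# Numeric sketch of the softmax effect: with feature-based attention,
# identical neighbors receive identical weights, so their aggregate share
# of the weighted sum grows with their frequency.
import math

def attention_weights(scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# One distinct neighbor (hypothetical attention score 2.0) among k neighbors
# sharing identical features (hence an identical score of 1.0 each):
for k in (1, 5, 50):
    w = attention_weights([2.0] + [1.0] * k)
    print(k, round(w[0], 3), round(sum(w[1:]), 3))
# k=1 : distinct neighbor weight 0.731, identical group 0.269
# k=5 : distinct neighbor weight 0.352, identical group 0.648
# k=50: distinct neighbor weight 0.052, identical group 0.948
```

As the identical neighbors grow more frequent, the distinct neighbor's share shrinks, even though its individual score never changes.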
> Q2: Related to the previous comment, the message underlying the Conditional Aggregation paragraph (lines 220/231) is not very clear to me.

**Answer:** Let us elaborate on this in concrete steps:

1. At layer $l$, an action network of depth $k$ computes node- or graph-level representations on the original graph topology, which can be seen as $(k+l)$-layer representations.
2. Based on the representations computed in step (1), the action network determines an action for each node, which induces a new graph topology.
3. The environment network at layer $l$ operates on this new graph topology.

The environment network thus operates on a new topology that is determined with the help of $(k+l)$-layer representations. In this sense, the aggregation of the environment network is conditioned on the $(k+l)$-layer representations of the action network, which can be viewed as a look-ahead capability. It allows the model, for example, to avoid aggregating information from nodes of degree $< c$, for some $c\in\mathbb{N}$. We will explain this better in the revised version of our paper (a simplified sketch of these steps follows below).
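The following deliberately simplified PyTorch sketch (our illustration for this response, not our actual implementation) condenses steps (1)-(3) into a single layer. `action_net` and `env_net` are placeholders for arbitrary user-chosen GNNs operating on node states and a dense adjacency/mask, and straight-through Gumbel-softmax is one standard way to keep the per-node discrete choice differentiable:

```python
# Simplified sketch of one Co-GNN layer implementing steps (1)-(3).
import torch
import torch.nn.functional as F

def co_gnn_layer(h, adj, action_net, env_net, tau=1.0):
    # (1) the action network looks at the ORIGINAL topology ...
    logits = action_net(h, adj)                       # [n, 4] scores for S, L, B, I
    # (2) ... and picks one action per node; straight-through Gumbel-softmax
    # keeps the discrete choice differentiable during training.
    a = F.gumbel_softmax(logits, tau=tau, hard=True)  # [n, 4] one-hot rows
    listens = a[:, 0] + a[:, 1]                       # STANDARD or LISTEN
    broadcasts = a[:, 0] + a[:, 2]                    # STANDARD or BROADCAST
    # message v -> u survives iff u listens and v broadcasts:
    mask = adj * listens.unsqueeze(1) * broadcasts.unsqueeze(0)
    # (3) the environment network aggregates over the induced topology.
    return env_net(h, mask)

# Toy stand-ins, for illustration only: a parameter-free mean-aggregation
# "environment network" and a linear "action network" on node states alone.
env_net = lambda h, m: torch.relu((m @ h) / m.sum(1, keepdim=True).clamp(min=1))
lin = torch.nn.Linear(8, 4)
action_net = lambda h, adj: lin(h)

h, adj = torch.randn(5, 8), (torch.rand(5, 5) < 0.4).float()
print(co_gnn_layer(h, adj, action_net, env_net).shape)  # torch.Size([5, 8])
```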
> Q3: Apart from computational/memory efficiency, what is the rationale behind the choice of sharing the action network among different layers? What happens if this constraint is relaxed and every layer leverage a different network? I believe that the current choice introduce a topology/rewiring-bias into the graph: if at a lower layer a node is isolated, why at higher layers (given that it's feature is not updated) should it behave differently? It has the same features, same action network -> same Isolated behaviour. I would appreciate a comment on this.

**Answer:** Initially, we also found the performance of the shared action network surprisingly good. Indeed, one can imagine scenarios where using a different action network in every layer (or in some layers) is more beneficial, but this is very rare in practice. This can be explained by two factors:

- First, the node features change at every layer even if a node chooses the isolate action, because this action does not "freeze" the node, but rather disconnects it from the rest of the network; the node is still subject to a node-wise update. Since the action network learns a function over the node features and their direct neighborhood, it can decide on different actions at different layers, as we widely observed in the experiments.
- Second, the action and environment networks are trained together on a joint loss function. In essence, this means that the action network decides on actions for all environment network layers in a more holistic way.

Following the Reviewer's comment, we conducted two experiments that make the action network layer-specific. First, we used a different action network at every layer, and the model performance worsened, possibly due to over-parametrization. Second, we concatenated the layer numbers to the node representations that the action network receives, making the action network layer-specific without over-parametrization (a minimal sketch of this variant is given below). This second approach did not impact the model performance in any significant way. We think these empirical observations are in line with the explanations above.
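For reference, a minimal sketch (ours) of the second variant: the shared action network stays as is, and we only append a layer tag to the node states it receives; the one-hot encoding and `num_layers` argument are illustrative choices.

```python
# Sketch of layer-conditioning a shared action network: append a one-hot
# layer tag to the node states before they enter the action network.
import torch

def layer_tagged(h, layer, num_layers):
    """h: [n, d] node states -> [n, d + num_layers] layer-tagged states."""
    tag = torch.zeros(h.size(0), num_layers, device=h.device)
    tag[:, layer] = 1.0                  # same tag for every node at this layer
    return torch.cat([h, tag], dim=-1)   # input to the shared action network
```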
> Q4: What happens if you exploit a more expressive standard MPNN (such as GIN [6]) in the task (Section 6.1)?

**Answer:** The key to solving the RootNeighbors synthetic dataset lies in the ability of Co-GNNs to alter the flow of information. Technically, both SumGNN (Morris et al. 2019) and GIN (Xu et al. 2019) match the expressiveness of 1-WL with sufficiently many layers, and both are upper-bounded by 1-WL. This makes them equally expressive in terms of graph distinguishability, but neither of these models can solve the task:

- SumGNN achieves an MAE of 0.196 (as reported in the original submission).
- GIN achieves an MAE of 0.22, closely resembling the SumGNN result (run during the rebuttal following the Reviewer's request).

Even though these models are expressive, the signals/messages from different nodes mix quickly, as these models cannot selectively aggregate information.

> Q5: I found very interesting the analysis on Figure 4. It seems that Co-GNN is capable to dynamically rewire the graph in an effective way depending the task. I am curious wheter the relation of number of layers vs. performances is different in Co-GNN w.r.t. standard MPNNs.

**Answer:** Co-GNNs can alleviate the over-smoothing that manifests itself with the use of more layers in GNNs. This is possible due to the flexible message-passing paradigm of Co-GNNs, enabled by the action network, which can, for instance, decide to isolate the nodes that do not benefit from additional information. Following the Reviewer's request, we repeated our experiments with the Co-GNN($\mu$,$\mu$) architecture over the homophilic **cora** dataset and the heterophilic **roman-empire** dataset using an increasing number of layers and report the accuracies:

| #Layers | Co-GNN($\mu$,$\mu$) (cora) | MeanGNN (cora) | Co-GNN($\mu$,$\mu$) (roman-empire) | MeanGNN (roman-empire) |
|---------|-----------------|------------------|------------------|-------------------|
| 1  | 85.51 $\pm$ 1.10 | 83.62 $\pm$ 1.84 | 81.01 $\pm$ 0.53 | 78.92 $\pm$ 0.01  |
| 5  | 86.40 $\pm$ 1.40 | 28.87 $\pm$ 2.17 | 90.36 $\pm$ 0.42 | 87.15 $\pm$ 0.01  |
| 10 | 86.08 $\pm$ 1.80 | 28.87 $\pm$ 2.17 | 91.30 $\pm$ 0.29 | 86.22 $\pm$ 1.17  |
| 20 | 85.23 $\pm$ 2.00 | 28.87 $\pm$ 2.17 | 91.05 $\pm$ 0.81 | 75.11 $\pm$ 4.12  |
| 30 | 83.02 $\pm$ 1.80 | 28.87 $\pm$ 2.17 | 91.07 $\pm$ 1.09 | 58.46 $\pm$ 7.51  |
| 40 | 82.04 $\pm$ 2.00 | 28.87 $\pm$ 2.17 | 90.44 $\pm$ 1.10 | 38.67 $\pm$ 11.80 |

The results indicate that the performance is generally retained for deep models and that Co-GNNs are effective in alleviating the oversmoothing phenomenon, even though their base GNNs suffer from performance deterioration already with a few layers.

References:
- Morris et al., Weisfeiler and Leman go neural: Higher-order graph neural networks, AAAI, 2019.
- Xu et al., How powerful are graph neural networks?, ICLR, 2019.

# Replies

# Reviewer zYzs (4)

We thank the Reviewer for going through our rebuttal and raising their score. We would like to address their remaining concerns:

**Concern 1:**

> I think the paper could further benefit from adding more frontier baseline models in Table 4 and Table 5.

**Answer:** We are happy to include additional baselines in the final version of our paper, following the Reviewer's suggestion.

**Concern 2:**

> Besides, I have an additional concern. As Table 1 suggests, using different model architectures for Co-GNN has different results. The paper could further benefit from analyzing the model architecture choice.

**Answer:** We agree that an ablation study over the choice of environment and action networks would provide additional insights. To address this, we ablated over all the different combinations of action and environment networks in Co-GNNs, resulting in 25 combinations, as suggested by another reviewer. Specifically, we ran additional experiments using MeanGNN, GAT, SumGNN, GCN and GIN on the heterophilic graph __minesweeper__ and on the synthetic dataset __RootNeighbors__. We report our detailed findings and their interpretation in the following comment.

In the following table, we report the accuracies over __minesweeper__, where the action networks are shown in *italic* and the environment networks in **bold**:

|           | MeanGNN | GAT   | SumGNN | GCN   | GIN   |
|-----------|---------|-------|--------|-------|-------|
| *MeanGNN* | 97.31   | 97.46 | 98.24  | 95.15 | 93.19 |
| *GAT*     | 97.02   | 96.97 | 97.00  | 93.74 | 92.94 |
| *SumGNN*  | 97.02   | 96.06 | 95.09  | 95.14 | 91.83 |
| *GCN*     | 97.44   | 96.52 | 96.35  | 94.43 | 92.89 |
| *GIN*     | 97.27   | 95.93 | 91.83  | 92.02 | 92.41 |

Co-GNNs achieve multiple state-of-the-art results on __minesweeper__ with different choices of action and environment networks.
Specifically, we draw the following general observation from the ablation results: MeanGNN and GAT yield consistently robust results when they are used as environment networks, regardless of the choice of the action network. This makes sense in the context of the __minesweeper__ game: in order to make a good inference, a $k$-layer environment network can keep track of the average number of mines found at each hop distance. The task is hence well-suited to mean-style aggregation environment networks, which manifests itself as empirical robustness. GCN performs worse as an environment network, since it cannot distinguish a node from its neighbors.

In the following table, we report the MAE over the __RootNeighbors__ dataset, where the action networks are shown in *italic* and the environment networks in **bold**:

|           | MeanGNN | GAT   | GCN   | GIN   | SumGNN |
|-----------|---------|-------|-------|-------|--------|
| *MeanGNN* | 0.339   | 0.352 | 0.356 | 0.354 | 0.353  |
| *GAT*     | 0.334   | 0.323 | 0.354 | 0.370 | 0.355  |
| *GCN*     | 0.336   | 0.353 | 0.351 | 0.370 | 0.354  |
| *GIN*     | 0.088   | 0.104 | 0.154 | 0.171 | 0.176  |
| *SumGNN*  | 0.079   | 0.085 | 0.139 | 0.181 | 0.196  |

The empirical results on __RootNeighbors__ further support our analysis from Section 6.1: a mean-type aggregation environment network and a sum-type aggregation action network solve the task. Unlike __minesweeper__, the choice of the action network is critical for this task: SumGNN and GIN - both using sum aggregation - yield the best results across the board when they are used as action networks.

In both experiments, the empirical findings suggest that the optimal choice of action and environment networks heavily depends on the underlying task. We are happy to include these results and discussions in the revised version of our paper.

We hope these answer your remaining concerns. Based on this, we would greatly appreciate your reconsidering your evaluation and increasing your score.

# Reviewer k8kJ (7)

We thank the Reviewer for their comment highlighting our contributions and the overall quality of the paper. We will surely include all the novel findings from the rebuttal period in the revised version of our paper.

# Ping

# Reviewer sMNf (6)

Thank you for suggesting an ablation study over the choice of environment and action networks. We believe we have addressed all of your concerns. As the discussion period is closing soon, we would highly appreciate feedback at this stage, since this would give us a last chance to address any remaining issues. We look forward to hearing from you. Thank you.

# Reviewer MprR (5)

Thank you for your comment regarding the ability to extend Co-GNNs to encompass directed graphs. We believe we have addressed all of your concerns. As the discussion period is closing soon, we would highly appreciate feedback at this stage, since this would give us a last chance to address any remaining issues. We look forward to hearing from you. Thank you.

# AC

Dear AC,

We would like to thank you for monitoring the discussion so far. We have heard from **Reviewers k8kJ** and **zYzs** during the rebuttal, but not from **Reviewers sMNf** and **MprR**. This may be because they are already positive about the paper, but we wanted to bring this to your attention in case any further clarifications are needed from our side before the closure of the author-reviewer discussion period.

Best,
Authors
