# Graph neural network outputs are almost surely asymptotically constant - ICML Rebuttal

# Global Response

We thank the reviewers for their comments. We respond to each concern in detail in our individual responses, and we provide a summary of our rebuttal:

1. **Key dividing line of our work (Reviewer vMeV and Reviewer Zu9M)**: We highlight the difference between a GNN converging to some limit object – in some topology related to the random graph distribution – and a GNN being constant asymptotically almost surely (a.a.s.). Many existing works fall under the former category (Cordonnier et al. 2023), investigating conditions that guarantee convergence of a GNN on a random graph model to an infinite limit object. With such a convergence result in hand, one has reduced probabilistic analysis of the GNN's behaviour to analysis of the limit object, although the latter may be quite non-trivial. The existence of some limit object does not imply a.a.s. convergence to a constant as established in our work. We are interested in a different dividing line: when does a GNN graph classifier reduce to a *constant* distribution asymptotically almost surely (a.a.s.)? This is an extremely strong notion, which implies that the values cluster closer and closer to a constant distribution on a larger and larger percentage of the random graphs. This dividing line is highly non-trivial: it does not hold in general for GNNs, and does not hold even for restricted GNNs on an arbitrary Erdős–Rényi distribution of graphs. Our result gives broad conditions under which the a.a.s. convergence holds, conditions which can be established in the broad framework of term languages. This allows us to derive results for a large class of architectures without the need for proving convergence separately for each model architecture. We also note that convergence results concerning, e.g., graphons, require strong architecture-specific conditions (see Theorem 6.1 of Cordonnier et al. (2023)). This may not be a shortcoming of the individual studies, but may rather represent a trade-off against the study of such general random graph models: because of their generality, these results seem to require strong conditions at the GNN architecture level. We make a fine-grained analysis considering graph distributions (dense and sparse), and for each specific random graph class we can dispense with architecture-specific conditions -- all of our results apply broadly to the term language.

2. **On the generality of the term language (Reviewer vMeV and Reviewer TLEg)**: The class of architectures that the term language captures is very broad, including more general message passing components, such as global readouts (Battaglia et al. 2018), arbitrary Lipschitz-continuous update functions, and different forms of attention (e.g., GATv2 (Brody et al. 2022)). Our results go beyond message-passing architectures. They apply to graph transformers and cover common architectural choices employed in practice, such as skip and jumping connections. We are not aware of any convergence result with such broad applicability.

3. **Additional experiments (Reviewer TLEg, Reviewer Zu9M)**:
    - *Experiments with other non-linearities (Reviewer TLEg)*: We have run similar experiments to those reported in the paper using tanh activations. The empirical findings manifest convergence, much like the case for ReLU.
    - *Experiments with real-world graphs (Reviewer Zu9M)*: We experimented with a large real-world graph for which subgraphs of different sizes are available. Specifically, we used "Tiger", which is a real-world planar graph based on the geographic faces of Alaska, provided by the U.S. Census Bureau. The original dataset TIGER-Alaska-93K has 93366 nodes. Dimitrov et al. (2023) extracted the smaller datasets TIGER-Alaska-1K, 5K, 10K, 25K and 90K, which are subsets of the original dataset with 1000, 5000, 10000, 25000 and 90000 nodes, respectively. We enriched these graphs with random features and report the 5 class probabilities of a single MeanGNN using the same setup from the paper, and observe that the **standard deviations** decrease with scale, in line with our results:

      | Class | 1k   | 5k   | 10k  | 25k  | 90k  |
      |-------|------|------|------|------|------|
      | 0     | 3.76 | 0.93 | 0.78 | 0.54 | 0.18 |
      | 1     | 4.07 | 1.01 | 0.88 | 0.64 | 0.20 |
      | 2     | 0.15 | 0.04 | 0.04 | 0.02 | 0.01 |
      | 3     | 1.45 | 0.33 | 0.37 | 0.22 | 0.07 |
      | 4     | 0.41 | 0.10 | 0.09 | 0.06 | 0.02 |

4. **Presentation and discussions (Reviewer BKe7, Reviewer Zu9M, Reviewer TLEg)**: We fixed the problem in the definition of a.a.s. as identified by Reviewer Zu9M. We have also included a discussion regarding the independence assumption on features, in line with our response to Reviewer BKe7. We have made some adjustments to the organisation of the paper to give more insights on the proofs and the related discussions in the body of the paper, as requested by Reviewer TLEg.

References:
- Brody et al., How Attentive are Graph Attention Networks? ICLR 2022.
- Cordonnier et al., Convergence of Message Passing Graph Neural Networks with Generic Aggregation On Random Graphs, GRETSI 2023.
- Dimitrov et al., PlanE: representation learning over planar graphs, NeurIPS 2023.
- Battaglia et al., Relational inductive biases, deep learning, and graph networks, 2018.

# Reviewer vMeV (5)

We thank the reviewer for their comments and address each of their questions/concerns below.

> “The reader has to wait for the main result only to be presented on page 5 and it remains somewhat unclear until then what the actual main results are. Maybe a high-level discussion / overview in the early pages would help the reader. E.g., the distinction between "root" and logarithmic growth can be already mentioned before.”

**Answer:** Thank you for your suggestion. We will modify the introduction to provide a more substantial idea of the results earlier on.

> “Missing in-depth discussion of related work. You mostly only mention related work but do not further discuss or compare your results to these previous ones. Please compare your results more directly to previously achieved convergence rates and results and / or discuss why previous results do not apply here. Also some important related work seems to be missing. Keriven [1] also talks about convergence to constant node embeddings in a fairly general random graph model. Please discuss and compare.”

**Answer:** We will include a detailed discussion to further highlight the difference with prior work. It will follow the general line of our global response, which highlights the main distinction between general convergence and our strong *a.a.s. convergence*. Note that the work of Keriven [1] pointed out by the reviewer studies the *over-smoothing* phenomenon of GNNs, which is a quite different problem: how do the node features of GNNs evolve as the number of layers $k$ increases?
There are many papers on this issue, and it is orthogonal to the one addressed in our work, which concerns the behaviour of any fixed GNN – even one with very few layers – as larger and larger random graphs are considered.

> “Please define the star notation for self-containedness.”

**Answer:** We see how this could be a little unclear. When the star notation is first introduced, it is done so only as a formal symbol in the term language. The interpretation is given later in Definition 4.2. While such a presentation is common when defining formal languages, we agree that this could be clarified, and we have added a sentence to Definition 4.1 to address this.

> “I think sub-linear O(n^-beta) growth for is more common than "root growth".”

**Answer:** We note that a convergence law does not hold for this more general condition. For example, consider a trivial feature function that is always 1. We can write a term that is 0 if some node in the graph is isolated, and 1 otherwise. In the non-sparse case, this term will converge to the constant 1, since a.a.s. no nodes will be isolated, while in the sparse case it will converge to something smaller. It follows that if $p(n)$ oscillates between $n^{-1/2}$ (root growth) and $1/n$ (sparsity), the value of this term will also oscillate, so we cannot obtain a convergence law.
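To make the dichotomy behind this example explicit (a standard first-moment computation, recorded here only for convenience), the expected number of isolated nodes in $\mathrm{ER}(n, p(n))$ behaves very differently in the two regimes:

$$
\mathbb{E}\bigl[\#\{\text{isolated nodes in } \mathrm{ER}(n,p)\}\bigr] \;=\; n\,(1-p)^{n-1} \;=\;
\begin{cases}
n\,e^{-(1+o(1))\sqrt{n}} \;\to\; 0, & p(n) = n^{-1/2} \ \text{(root growth)},\\
(1+o(1))\,n/e \;\to\; \infty, & p(n) = 1/n \ \text{(sparse)}.
\end{cases}
$$

Under root growth, Markov's inequality gives that a.a.s. no node is isolated, so the term converges to 1; in the sparse regime, a routine second-moment argument shows that isolated nodes exist a.a.s., so the term converges to 0. An oscillating $p(n)$ therefore admits no single limit.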
> “Are graphons also possible? They would be the next step after ER random graphs and SBMs and have been used by some of the mentioned related work.”

**Answer:** Graphons do generalise dense ER and SBM models, which makes such an analysis intriguing. We do not know of any prior results concerning our strong a.a.s. convergence for GNNs over graphons. This is indeed an interesting question, but one that requires a separate and dedicated analysis to answer, possibly considering the appropriate class of graphons to also cover sparse cases.

> “Please clarify if the 4 cases in Thm 5.1 are all possible cases. If not, what happens in the remaining cases? Why does convergence not hold there.”

**Answer:** For all non-sparse cases, a term will converge to the same thing. However, as the oscillating example above indicates, one cannot get convergence for arbitrary $p(n)$, or even for $p(n)$ converging to 0.

> “How is your term language related to other general paradigms encompassing many GNNs such as [2,3].”

**Answer:** Thanks for pointing out these papers, which we will discuss. These languages have very different goals than ours, and are incomparable in expressiveness and in results:

* Geerts et al. (2021) concerns a query language on matrices with operations for common matrix operations and applies the language to adjacency matrices of graphs, characterizing the expressiveness of the resulting language. Our term language is incomparable to their query language. On the one hand, we allow arbitrary Lipschitz functions and a general weighted mean aggregation, while on the other hand they have aggregation with unbounded sums, which allows one to express terms that diverge to infinity for any of our random graph models (and hence cannot be in our language). Geerts and Reutter (2022) define a family of even richer languages, including not only divergent terms based on unbounded summation, but higher-order GNNs, which also cannot be expressed in our framework. Yet, our weighted average operators are still not available there. Further to these differences in the languages, our focus is on "uniform expressiveness", where we fix one model and investigate its capabilities on arbitrary-size graphs, which is fundamentally different from the expressiveness studies conducted in these papers.

* Jogl et al. (2023) is concerned with different notions of simulation between “non-standard message-passing paradigms” and message passing GNNs. Although several kinds of non-standard message-passing regimes are considered in this work, we assume that the reviewer is referring to Augmented Message Passing (AMP) GNNs, which are presented as a term language. We should reiterate here that our strong convergence does not hold even for all standard message-passing GNNs – since unbounded Sum aggregation clearly violates it. Thus our term language does not subsume the AMPs of Jogl et al. (2023).

Overall, our term language is quite versatile, but we do not claim that it is a universal umbrella language for GNNs (arguably the goal of many of these papers): our term language is geared instead towards capturing the strongly convergent GNNs, hence the difference.

References:
- Geerts et al., On the expressive power of linear algebra on graphs, Theory of Computing Systems 65 (2021): 179-239.
- Geerts and Reutter, Expressiveness and Approximation Properties of Graph Neural Networks, ICLR 2022.
- Jogl et al., Expressivity-preserving GNN simulation, NeurIPS 2023.

# Reviewer TLEg (7)

We thank the reviewer for their comments and address each of their questions/concerns below.

> “W1. However, it still can be tough to understand the proof sketches (especially after Lemma 5.4 until the end of the section). As the paper devotes a lot of space to plots of model convergence it could make sense to reduce some redundant figures and instead use more space for proof sketches / intuition.”

**Answer:** We agree with your suggestion, and to address this we will move the plots for MeanGNN from Figure 1 to the appendix, since MeanGNN can be seen as a special case of GAT and arguably it is not essential for the main body. We will use the space to provide more insights regarding the proofs and constructions.

> “W2. Despite the claims that proposed term language allows to generalize the results to many other models the authors only explicitly investigate three models: message passing graph neural networks, graph attention networks and graph transformers.”

**Answer:** We highlight three concrete model architectures in the paper: MeanGNN, GAT, and GPS+RWPE. This is primarily to establish a connection to some well-known architectures that the term language includes. The language is very general, since it contains many types of message passing neural networks and graph transformers, which are families of model architectures. The class of architectures that the term language captures includes more general message passing components, such as global readouts (Battaglia et al. 2018), arbitrary Lipschitz-continuous update functions, and different forms of attention (e.g., GATv2 (Brody et al. 2022)). Our results go beyond message-passing architectures, applying to graph transformers and covering common architectural choices employed in practice, such as skip and jumping connections. Moreover, the term language allows the use of mixtures of architectures – e.g. a layer of architecture X followed by a layer of architecture Y, provided X and Y are in the term language. We are not aware of any convergence result with such broad applicability.
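As a purely schematic illustration of how a familiar architecture sits inside the language (ad hoc notation for this rebuttal, not the exact syntax of Definition 4.1, and writing $x^{(0)}(v)$ for the input feature atom): a MeanGNN layer followed by a mean readout corresponds to a term built from feature atoms, (weighted) mean aggregation, and Lipschitz combination functions,

$$
x^{(1)}(v) \;=\; f\Bigl(x^{(0)}(v),\; \operatorname{mean}_{u \sim v} x^{(0)}(u)\Bigr),
\qquad
\text{output} \;=\; g\Bigl(\operatorname{mean}_{v}\, x^{(1)}(v)\Bigr),
$$

with $f, g$ Lipschitz; attention-style layers are obtained by replacing the uniform mean with a weighted mean whose weights are themselves terms.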
> “W3. In particular, the experiments are performed only for ReLU activations. Please consider running the experiments with other activation functions.”

**Answer:** Thanks for the suggestion! We have run similar experiments to those reported in the paper using tanh activations. The empirical findings manifest convergence, much like the case for ReLU.

> “The paragraph about graph types of tuples and feature type controllers is difficult to understand and might benefit from some examples. In particular, I do not understand: feature-type controllers and why we need wedge-descriptors.”

**Answer:** We have rewritten this paragraph to improve its clarity. The basic idea is to identify, for each term having free variables $v_1, \ldots, v_n$, the information about an $n$-tuple of nodes that it depends on a.a.s. In the non-sparse case a term simplifies to a Lipschitz function of the tuple's features along with the graph type of the tuple: the conjunction of graph atoms that the tuple satisfies.

> “Is Lemma 5.5 correct? First, you define a wedge-description but then do not use it and instead use Descr(u) which I do not think is defined.”

**Answer:** Thank you for pointing this out; it is indeed an abbreviation which we didn’t define. Descr(u) was intended to denote the maximal description of a tuple u of nodes. However, since this notion coincides with GrTp(u), we have decided to remove it, to avoid redundancy.

> “But by local type convergence, eventually all but finitely many types contribute very little. We will use this when we apply the lemma to prove the sparse case of the theorem” This text concludes the theory section but sounds like there should be more coming. I think, that is mainly due to writing “we will use this”.

**Answer:** We have changed this to the present tense to improve clarity.

References:
- Brody et al., How Attentive are Graph Attention Networks? ICLR 2022.
- Battaglia et al., Relational inductive biases, deep learning, and graph networks, 2018.

# Reviewer BKe7 (3)

We thank the reviewer for their comments and address each of their questions/concerns below.

> “The assumptions. This paper assumes a probabilistic graph as the underlying structure, and independent node features. In the most of the real world setting, the node features are associated with graph, which is the main reason why we use GNNs. If the graph feature is useless, we only see the topology of the graph; in that case, the common practice is to take the identity matrix as node features e.g. [1]. None of which is applied to the convergence analyses in this paper.”

**Answer:** Thank you for this comment. Below we address the identity matrix question and give some ways in which our results (almost) already apply to distributions with dependence between features and graph structure. However, we believe that considering independent node features is a key first step in understanding the asymptotic almost sure convergence behaviour of these models. The proofs are already quite involved, and we believe they provide important insights into the way GNNs propagate and aggregate information. While including dependence is an important direction, we know of no standard general framework for graph distributions with dependent node features which could be investigated.
> “If the node features are somehow associated with graph structures, does the same convergence result hold?”

**Answer:** We have some provisional results in this direction:

- First, observe that in order to prove the SBM case, we introduce new symbols to the first-order language for the block number, and now the feature controllers can depend on atomic formulas including these symbols. This means in fact that the proof still works when we allow the feature distribution to depend on the block number (i.e. the feature distribution can be different for each block).
- Second, we expect our results to extend also to other cases where the feature distributions depend on the graph structure, so long as it is possible to extend the first-order language of graphs in a way which plays nicely with the graph distribution.
- Third, as noted in the paper, the term language is quite powerful. We can thus use it to compute features based on the graph structure, which can then be input into any GNN model, and our results will still hold. A simple example: we compute a one-hot encoding for whether a node has any degree-k neighbours. Furthermore, since we include the random walk positional encoding in our term language, these can be used as *structure-dependent* initial node features.

> “If we use input as identity matrix, does the same convergence result hold?”

**Answer:** This follows immediately from our results via two different routes. First, the constant node feature distribution is already one of those to which the theorem applies. Second, any term without node feature atoms (H($v$)) ignores the node features and can be considered as taking the identity matrix as input. We can take any expression of a GNN model in the language and replace each H($v$) with a constant 1. The resulting term captures the GNN model acting on an identity matrix.

> “Do you have any immediate applications of the proposed term language, other than the optimization realm?”

**Answer:** In the context of GNN analysis, the term language adds robustness to the results, in contrast to proving things one GNN architecture at a time. From a theoretical perspective, the language allows us to forge results that can be situated within the rich body of work on convergence laws for logics. In terms of application domains, it is worth noting that aggregate query languages – extending standard database query languages with aggregate functions – are common in traditional database systems and in graph database systems. We are hopeful that other connections to the broader ML domain as well as to data management can emerge out of our study.

> “It'd be nice to have an example using the simple GNNs, such as SGC [2]. I'm now unsure how this result holds for the SGC.”

**Answer:** SGCs are a special case of GCNs, and we do not have results for GCNs in the submission. GCNs perform a weighted average where the square root of the product of node degrees is used for the weighting, and we are not sure that this can be expressed within our term language. But we believe our techniques can be extended to GCNs in general, and hence apply to SGC. We will comment on this in the revised version.
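For concreteness, the degree-based weighting we have in mind is the standard GCN update (with $\tilde{A} = A + I$, $\tilde{D}$ its degree matrix, and $\tilde{d}_v$ the corresponding degrees), of which SGC is the variant that drops the intermediate non-linearities; we state it here only for reference:

$$
h_v^{(k+1)} \;=\; \sigma\Bigl(\sum_{u \in N(v)\cup\{v\}} \tfrac{1}{\sqrt{\tilde{d}_v\,\tilde{d}_u}}\, W^{(k)} h_u^{(k)}\Bigr),
\qquad
\text{SGC:}\;\; \hat{Y} \;=\; \operatorname{softmax}\bigl(S^{K} X\,\Theta\bigr), \;\; S = \tilde{D}^{-1/2}\tilde{A}\,\tilde{D}^{-1/2}.
$$

Unlike a plain neighbourhood mean, these weights mix the degrees of both endpoints and in general do not sum to one over $u$.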
> “A nice question to ask is that the same result can be applied to degree-corrected SBM. If my criticism to the assumption is vein, the result is a bit shocking (in a good sense). Thus, I'd like to know where the result comes from, the underlying graph is too simple or GNNs for more general graph just outputs the constant?”

**Answer:** Would you mind clarifying the exact graph distribution you are interested in? As far as we are aware, in the degree-corrected SBM one must specify the expected degree for each node. In order to analyse this asymptotically, it is necessary to specify these expected degrees for every graph size $n$.

Concerning the question of the source of the convergence phenomenon, we make two observations.

- First, there are natural extensions of the term language where we no longer observe convergence. For example, if we include comparison operators > and < on feature values as primitives, this introduces thresholding behaviour, where the output can depend discontinuously on the node features, and the results break. If we add supremum (as would be needed to accommodate max aggregation), the strong convergence no longer holds in the same generality. For example, in the case of $p(n) = 1/n^q$ for $q$ a rational between $0$ and $1$, results of Shelah and Spencer (1988) show that even very simple terms that use sup and Lipschitz functions diverge wildly.
- Second, it is possible to come up with Erdős–Rényi distributions in which a convergence law does not hold – even ones in which the edge distribution $p(n)$ converges. For example, if $p(n)$ oscillates between $n^{-1/2}$ (root growth) and $1/n$ (sparsity), we cannot obtain a convergence law because some terms converge to different values on ER($n$, $n^{-1/2}$) and ER($n$, $1/n$); please also see our response to Reviewer vMeV for an example.

We hope this illustrates that a.a.s. convergence for a particular term language and distribution is not obvious. This strong convergence depends in a subtle way on the term language (and thus the GNN architecture) and the graph distribution.

References:
- Shelah and Spencer, Zero-one laws for sparse random graphs, JAMS 1988.

# Reviewer Zu9M (4)

We thank the reviewer for their comments and address each of their questions/concerns below.

> “The notations are poorly defined. For example, a.a.s. convergence w.r.t is defined in p.2 l.89 right column to be the convergence of , which seems problematic to me. (i) This definition alone implies only that the expectation w.r.t. the graph model converges, not that the output will be the same regardless of the input graph, as stated in p.2 l.259 left column. (ii) is a random variable since the expectation is not taken for the node feature distribution . The convergence of random variables should be formally defined. Or the expectation may be implicitly taken w.r.t. as well. I request the authors that the notation be defined more rigorously. This is essential for a correct understanding of the theoretical results.”

**Answer:** Thank you for pointing this out. In fact we show convergence for the following much stronger notion, which is used in the key lemma, and which is what was intended by the phrase a.a.s.: Given any function $F$ from featured graphs to real vectors and a random featured graph model $\mu$, we say $F$ *converges asymptotically almost surely* (converges a.a.s.) to a vector $z$ with respect to $\mu$ if for all $\epsilon, \theta > 0$ there is $N \in \mathbb{N}$ such that for all $n \ge N$, with probability at least $1-\theta$ when drawing a featured graph $G$ from $\mu_n$, we have that $|F(G)-z| < \epsilon$.
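Spelled out in quantifiers (a restatement of the sentence above, writing $\mu_n$ for the distribution over featured graphs of size $n$): $F$ converges a.a.s. to $z$ with respect to $\mu$ if and only if

$$
\forall\, \epsilon, \theta > 0 \;\; \exists\, N \in \mathbb{N} \;\; \forall\, n \ge N: \quad
\Pr_{G \sim \mu_n}\bigl[\, |F(G) - z| < \epsilon \,\bigr] \;\ge\; 1 - \theta .
$$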
All the proofs in the appendix use this stronger notion. Of course, this implies convergence in mean. We are happy to clarify any other notational issues the reviewer has in mind here. We have updated the passage above accordingly.

> “Experiments are weak. It would be desirable to conduct experiments on real world time series graphs as well and discuss the practical implications.”

**Answer:** While synthetic, our experimental setup is carefully designed to empirically verify the presented theoretical results. In this respect, we find the rather quick convergence on even very sparse graph models interesting. Following the reviewer's recommendation, we investigated real-world time series graphs, but noted that these graphs do not strictly grow: while these graphs add nodes/edges, they also remove nodes/edges, which makes them hard to use in our context. With this in mind, we decided to experiment with a large real-world graph for which subgraphs of different sizes are available. Specifically, we used "Tiger", which is a real-world planar graph based on the geographic faces of Alaska, provided by the U.S. Census Bureau. The original dataset TIGER-Alaska-93K has 93366 nodes. Dimitrov et al. (2023) extracted the smaller datasets TIGER-Alaska-1K, 5K, 10K, 25K and 90K, which are subsets of the original dataset with 1000, 5000, 10000, 25000 and 90000 nodes, respectively. We enriched these graphs with random features and report the 5 class probabilities of a single MeanGNN using the same setup from the paper, and observe that the **standard deviations** decrease with scale, in line with our results:

| Class | 1k   | 5k   | 10k  | 25k  | 90k  |
|-------|------|------|------|------|------|
| 0     | 3.76 | 0.93 | 0.78 | 0.54 | 0.18 |
| 1     | 4.07 | 1.01 | 0.88 | 0.64 | 0.20 |
| 2     | 0.15 | 0.04 | 0.04 | 0.02 | 0.01 |
| 3     | 1.45 | 0.33 | 0.37 | 0.22 | 0.07 |
| 4     | 0.41 | 0.10 | 0.09 | 0.06 | 0.02 |
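For transparency, the protocol behind these numbers is roughly the following (a minimal sketch in plain PyTorch with dense adjacency matrices; `MeanGNNLayer`, the layer widths, the feature dimension, and the ER sampler used as a stand-in for the TIGER subgraphs are illustrative placeholders rather than the exact configuration from the paper):

```python
import torch
import torch.nn as nn

class MeanGNNLayer(nn.Module):
    """One mean-aggregation layer: h_v <- act(W1 h_v + W2 mean_{u in N(v)} h_u)."""
    def __init__(self, d_in, d_out, act):
        super().__init__()
        self.self_lin = nn.Linear(d_in, d_out)
        self.nbr_lin = nn.Linear(d_in, d_out)
        self.act = act

    def forward(self, h, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)  # guard against isolated nodes
        nbr_mean = (adj @ h) / deg                       # mean over neighbours
        return self.act(self.self_lin(h) + self.nbr_lin(nbr_mean))

class MeanGNN(nn.Module):
    """Untrained MeanGNN graph classifier: stacked mean layers, mean readout, 5 classes."""
    def __init__(self, d_feat=8, d_hidden=32, n_classes=5, n_layers=2, act=None):
        super().__init__()
        act = act or nn.ReLU()
        dims = [d_feat] + [d_hidden] * n_layers
        self.layers = nn.ModuleList([MeanGNNLayer(a, b, act) for a, b in zip(dims, dims[1:])])
        self.readout = nn.Linear(d_hidden, n_classes)

    def forward(self, x, adj):
        for layer in self.layers:
            x = layer(x, adj)
        return torch.softmax(self.readout(x.mean(dim=0)), dim=-1)  # graph-level class probabilities

def er_adjacency(n, p):
    """Dense symmetric adjacency matrix of an Erdos-Renyi graph ER(n, p), no self-loops."""
    upper = (torch.rand(n, n) < p).float().triu(diagonal=1)
    return upper + upper.T

def class_prob_std(model, sample_adj, n_graphs=50, d_feat=8):
    """Std of each class probability over graphs drawn from `sample_adj` with i.i.d. features."""
    with torch.no_grad():
        outs = []
        for _ in range(n_graphs):
            adj = sample_adj()
            x = torch.rand(adj.shape[0], d_feat)  # random node features, independent of the graph
            outs.append(model(x, adj))
        return torch.stack(outs).std(dim=0)

if __name__ == "__main__":
    torch.manual_seed(0)
    model = MeanGNN(act=nn.Tanh())  # swap in nn.ReLU() for the ReLU variant
    for n in [100, 300, 1000, 3000]:
        stds = class_prob_std(model, lambda: er_adjacency(n, 0.1))
        print(n, [round(v, 4) for v in stds.tolist()])
```

Scaling up $n$ (or, in the real-world setting, moving from the 1k to the 90k subgraph with random features) and watching the per-class standard deviations shrink is the measurement reported in the table above.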
> “Convergence in "continuous" random graphs such as ER and SBM models is not very surprising. See, for example, the literature on graphon neural networks.” ... “Could you provide the connection to Graphon Neural Networks? My understanding is that we can explicitly write the limits using graphons rather than showing the existence of the limits.”

**Answer:** In terms of the second part of the question, for our notion of convergence there is no issue of writing the limits: they are just real numbers, and as long as one knows the rate of convergence one can approximate the value as precisely as one likes. Our proofs are constructive, and convergence rates can be extracted from them; but we focused here on showing a.a.s. convergence, leaving the determination of the convergence rate to future work.

Graphon Neural Networks and other continuous limits of GNNs deal with a different phenomenon than the one we study. The convergence we deal with is extremely strong – the values concentrate around a constant as the random graphs get larger. Exhibiting this convergence involves an interplay between the term language (or the GNN) and the random graph model: as mentioned above, we do not have it for every flavor of GNN, nor do we have it for our term language for every instance of ER. In a context where we do not have our strong convergence – or where it is not meaningful – we can have convergence in a weaker sense. Perhaps the canonical example of this is convergence to a function on graphons, such as a Graphon Neural Network or other continuous counterparts like the cMPGNNs of Cordonnier et al. (2023) or the Continuous GCNs of Keriven et al. (2020). Note that existing results obtaining these alternative forms of convergence still require strong technical assumptions, specific to the particular family of GNN being studied: see the sparsity requirement in Theorem 1 of Keriven et al. (2020), or Assumption 5.6 in Theorem 5.7 of Cordonnier et al. (2023). So although our convergence notion is incomparable to those studied in these works, we believe that our term language could be useful in giving a broader framework for stating results, even for these broader notions of convergence.

References:
- Cordonnier et al., Convergence of Message Passing Graph Neural Networks with Generic Aggregation On Random Graphs, GRETSI 2023.
- Keriven et al., Convergence and stability of graph convolutional networks on large random graphs, NeurIPS 2020.
- Dimitrov et al., PlanE: representation learning over planar graphs, NeurIPS 2023.