We thank the reviewer for the positive feedback on our revised manuscript. We are happy to note the increase in score. As suggested, we have further emphasized the counterfactual component of our study in the abstract and the contributions section (Section 1.1).
Overall Summary
--------------
We thank the reviewers for their insights and constructive suggestions. A comprehensive point-by-point response to the reviewers' comments is presented below. We have updated the main manuscript and the appendix to address these comments. The changes made in the main manuscript are highlighted in *blue*. The *major additional experiments and changes* are listed below.
* **Enhancements in Presentation:** We have undertaken a series of presentation enhancements, which includes the following aspects:
1. A thorough investigation of related works is now incorporated, offering a clear differentiation of the novel contributions of our study in comparison to existing benchmarking studies.
2. A more rigorous mathematical formulation has been included for clarity.
3. We have expanded the graphical representation of empirical results by integrating error bars, spider-plots, and visual assessments of explanations.
* **Additional experiments:** We have conducted additional experiments to further strengthen the analysis. These experiments include (i) the inclusion of a new dataset from OGB to analyze the scalability of explainers, and (ii) an assessment of the stability of explainers against adversarial topological attacks and feature perturbations.
We believe that these revisions will satisfactorily address the concerns raised by the reviewers and elevate the overall quality of our work.
Review 32ww
--------------
We thank the reviewer for the insightful suggestions to further improve our manuscript. Please find below a point-by-point response describing how they have been addressed.
**[Black-box is not whitened] The biggest weakness of this paper: given the motivation, "gnns are NNs and therefore are blackbox" -- the paper, only presents metrics, keeping everything a blackbox. As a reader, I only see numbers. I dont see any actual explanation of any prediction. I would be hoping to see something -- e.g., subgraphs? on chemistry? on citation? While these are domain-specific, it would help to show at least one or two examples on different domains.**
**Response:** We appreciate this constructive feedback. We have added a visualization-based analysis of both factual and counterfactual explanations of a GCN model trained to detect mutagenicity (Sec 4.5). The key insights that emerge from this analysis are:
1. Statistically strong performance may not always correspond to identifying the explanations offered by domain scientists. This highlights a pressing need for real-world datasets endowed with ground-truth explanations, a resource that the current field unfortunately lacks.
2. Visualization-based analysis of counterfactual recourses further substantiates our earlier claim that they often violate the structural and functional integrity of the domain and thereby may not always be meaningful in the eyes of domain experts.
**[Lacking summary of hierarchy] The figure lists multiple styles of instance-level explanability tools. However, only the style name is listed. It is important to have at least one sentence or two to describe the style -- listing the papers is insufficient, IMO.**
**Response:** As suggested, we have added descriptions of the various branches of explainability works on GNNs in Section 1.
**[Lacking formalism] This paper is clear and easy-to-understand for a very general audience. It is worth being a bit more specific and formal, on behalf of generality. Examples include: there is no explicit relation between $Y$ and $\Phi(G)$ (I assume they are the same); It should say that $\Phi(G)$ is categorical; The Definitions can be accompanied by some math expression.**
*Response:* As suggested, we have made the definitions more formal by including mathematical expressions (see Def. 1 and 2). We have clarified that $\Phi(G)$ is assumed to be categorical and that the benchmarking study focuses on classification tasks. We have also replaced $Y$ with $\Phi(G)$ to keep the notation consistent. In addition, we have reviewed the entire draft thoroughly and introduced formal descriptions wherever they elevate the discourse.
**[one terminology inconsistent with literature] Terminology "computation graph" is already reserved in computer science for a different meaning. Specifically, paper terms "computation graph" as the subgraph that is used to make the decision (e.g., centered around the node, in node classification task). However, "computation graph" is strictly the equation being computed. For example, (sparse)matrix-multiply, followed by activation, followed by concatenation, followed by ..., softmax, etc. Please do not use this term. Paper can be creative in choosing another name, e.g., the "inference subgraph"?**
*Response:* We have incorporated this suggestion by replacing all references to "computation graph" with "inference subgraph".
**Figure 2 can be simplified by spiderplot (combining all charts into one, using polar coordinates). A picture is worth a thousand words! The spiderplot can be an amendment (e.g., average across y-axis values into single point per dataset).**
*Response:* This is indeed a wonderful suggestion, which we had not considered earlier. We have now added the spider-plot (Fig. N in the Appendix, referenced from Sec 4.1 while discussing Fig. 2). Consistent with the conclusions drawn earlier, PGExplainer is consistently inferior. GNNExplainer and RCExplainer appear to have the most consistent performance, although none of the explainers emerges as the best under all scenarios.
**Sufficiency: Personally, I was not familiar with this. However, it seems that a "stupid model" e.g., always outputting a fixed class, would always win.**
*Response:* Sufficiency (or Fidelity) is one of the most popular evaluation metrics used across all works on perturbation-based factual and counterfactual reasoning over GNNs. Nonetheless, we agree with the above comment that a model that always predicts the same class will have high sufficiency (since, regardless of the explanation subgraph being chosen, the prediction will be the same), and in such situations sufficiency is not a good metric. It should be noted, however, that the objective is to evaluate explainers and not models. So if the model is "stupid", then all explainers will do equally well. On the other hand, if the model prediction has enough variance as a function of the graph topology, then this situation is unlikely to arise. Tables J and K in the appendix present the accuracies of the various models on *class-balanced* test sets. As we observe, the accuracy is consistently high, from which we can conclude that the models do not always predict the same class.
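For concreteness, the version of sufficiency that underlies the argument above can be written as follows (this is the commonly used fidelity-style formulation; our exact definition appears in the revised manuscript): given a GNN $\Phi$, an input graph $G$, and its explanation subgraph $G_s \subseteq G$, sufficiency over a test set $\mathcal{D}$ is the fraction of instances on which the prediction is preserved,
$$\text{Sufficiency} = \frac{1}{|\mathcal{D}|} \sum_{G \in \mathcal{D}} \mathbb{1}\big[\Phi(G_s) = \Phi(G)\big].$$
Under this formulation, a constant classifier trivially attains a sufficiency of 1 for every explainer, which is precisely why the metric discriminates among explainers only when the model itself is sensitive to the input topology.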
**Table 6: Stability as a function of instances. I can argue that less stability implies diversity. For instance, in a large graph, "explaining" why a node gets classified as class C can have different intepretations. In each interpretation, another subset of nodes (or edges) can be selected. Does it make sense to "score" the explanation? It is not necessary to modify the paper at the moment, this is more of thought-provoking (perhaps you can discuss in updated version)**
*Response:* This perspective is intriguing. Indeed, less stability indicates more diversity in the explanations. Presently, it remains ambiguous whether this phenomenon is advantageous. Nevertheless, we concur that this perspective is distinctive and thought-provoking. To highlight this alternative viewpoint, we have integrated the subsequent statement in Section 4.2 to bring it to our readers' attention:
> Finally, we note that while diminished stability is typically construed as a less favorable result, it also signifies heightened explanation diversity for a given instance. In specific scenarios, this could align with a desirable objective.
Review KD7T (Accept)
----------------------
We thank the reviewer for the positive comments. Please find the point-by-point response below to the outstanding concerns.
**The authors may need to explain more why some methods don't follow the sufficiency pattern of the factual explainers against the explanation size, e.g. the result on Mutag.**
*Response:* We have added the following discussion to Sec 4.1 to explain the non-monotonicity of sufficiency against explanation size.
> In Fig. 2, the sufficiency does not always increase monotonically with the explanation size (such as PGExplainer in Mutag). This behavior arises due to the combinatorial nature of the problem. Specifically, the impact of adding an edge to an existing explanation on the GNN prediction is a function of both the edge being added and the edges already included in the explanation. An explainer seeks to learn a proxy function that mimics the true combinatorial output of a set of edges. When this proxy function fails to predict the marginal impact of adding an edge, it could potentially select an edge that exerts a detrimental influence on the explanation's quality. Consequently, the occurrence of non-monotonic behavior arises in certain instances.
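For intuition, the following toy numerical illustration captures the effect described above; the numbers are purely hypothetical and serve only to show how a proxy that scores edges independently can add a detrimental edge.

```python
# True (combinatorial) sufficiency of candidate explanation edge sets for one
# instance -- purely hypothetical numbers used only to illustrate the argument.
true_sufficiency = {
    frozenset({"e1"}): 0.6,
    frozenset({"e2"}): 0.5,
    frozenset({"e1", "e2"}): 0.4,  # e1 and e2 interact detrimentally
}

# A proxy that scores edges independently ranks e1 first and e2 second, so the
# explanation grows from {e1} to {e1, e2} as the size budget increases ...
growth_order = [frozenset({"e1"}), frozenset({"e1", "e2"})]

# ... and the measured sufficiency drops from 0.6 to 0.4 with increasing size,
# i.e., the curve is non-monotonic even though each individual edge looks useful.
print([true_sufficiency[s] for s in growth_order])  # [0.6, 0.4]
```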
**Also, from the appendix, the results of some explainers look undulating as the scale increases. Will it affect by some special features of the datasets? If so, the authors may need more description in Section 4.1.**
**Response:** The reason is the same as above. Specifically, due to the combinatorial nature of factual and causal explanations, deterioration with increasing size is a feasible possibility.
**[Combining same metric figures for different methods] When presenting the results of comparative analysis on factual explanations, it is preferred to show all the explanations than just left the results of GEM and SubgraphX in the appendix. Otherwise, the authors should make some explanation.**
*Response:* As suggested, we have moved the GEM and SubgraphX results to the main paper in the revised version.
**The random seeds selected in Tables 6 and 8 are different from the expression in the appendix. Should the author double check and unify the expression?**
*Response:* As suggested, we have made the expression consistent for Tables 6 and 8 in the main paper and Table M in the Appendix. Note that the experiments in Tables O and P in the Appendix are different, and thus the headings differ from Tables 6, 8, and M.
Review HBNL (Rating 4 - Ok but not good enough - rejection)
---------------
We thank the reviewer for the positive comments. Please find the point-by-point response below to each of the comments.
**C1: Some of the claims in the paper are not entirely correct. For example, the authors mention that "various facets of explainability pertaining to GNNs, .... have yet to be systematically investigated", but existing benchmarks like GraphFrameX [1] and GraphXAI [2] have proposed systematic evaluation framework for GNN explainability.**
*Response:* We appreciate the reviewer's feedback. We have now removed the referred statement and added a new section (Section 2.2) in the related work to clearly discuss the gaps in the current benchmarking literature on GNN explainability and how the proposed study addresses them. We reproduce the text verbatim below.
> GraphFrameX and GraphXAI represent two notable benchmarking studies within the domain of Graph Neural Network (GNN) explainability. While both investigations have contributed valuable insights into GNN explainers, certain unresolved investigative aspects persist.
> 1. **Inclusion of Counterfactual Explainability**: GraphFrameX and GraphXAI have focused on factual explainers for GNNs. The incorporation of counterfactual explainers remains underexplored.
> 2. **Achieving Comprehensive Coverage**: Existing literature encompasses seven perturbation-based factual explainers. However, GraphFrameX and GraphXAI collectively assess only a subset, including GNNExplainer, PGExplainer, and SubgraphX. Consequently, a comprehensive benchmarking study that encompasses all factual explainers remains a crucial endeavor.
> 3. **Empirical Investigations**: How susceptible are the explanations to topological noise, variations in GNN architectures, or optimization stochasticity? Do the counterfactual explanations provided align with the structural and functional integrity of the underlying domain? To what extent do these explainers elucidate the GNN model as opposed to the underlying data? Are there standout explainers that consistently outperform others in terms of performance? These are critical empirical inquiries that necessitate attention.
> These inquiries represent gaps in the current research landscape, which we endeavor to address in our forthcoming study.
**C2: The paper misses some key related works.**
**Response:** As clarified above, we now discuss the novel contributions of our work when compared to GraphFrameX and GraphXAI in Section 1.
**c3: The authors should provide a definite comparison between their proposed framework and existing benchmarks. It is unclear i) why the GNN explainability community requires a new benchmarking framework?, and ii) what is missing in existing benchmarks that are solved in GNNX-Bench.**
*Response:* As outlined in C1, we have clearly mentioned the gaps in the existing GNN benchmarking literature and how the proposed study addresses them. A more comprehensive elaboration on this subject can be found in our response to point C5, wherein we highlight that out of the 19 investigations undertaken in this study, 17 stand as original contributions. This implies that these 17 investigations have not been previously encompassed within either GraphXAI or GraphFrameX.
**C4: While the existing benchmarks like GraphFrameX and GraphXAI handle different types of explainability techniques, including gradient-, surrogate-, and perturbation-based GNN explainers, the proposed benchmark only focuses on perturbation-based methods.**
*Response:* While we acknowledge the reviewer's observation that the cited references encompass algorithms beyond perturbation-based techniques, as clarified in C1 and C3, it is essential to highlight that our work includes distinct original investigations not covered in GraphFrameX or GraphXAI. These include research into counterfactual methods, the examination of four additional factual explainers, the assessment of topological stability, an exploration of necessity and reproducibility, and an evaluation of feasibility, among others.
Given the constraint of adhering to a limited 9-page primary content format, we have made a conscious decision to prioritize depth over breadth in our study.
**C5: Most of the metrics used in the current work are similar or derivative of metrics in previous works. Further, there are no discussions on the problems of existing metrics or why the metrics used in GNNX-Bench are better.**
*Response:* We use five metrics across 19 empirical investigations: explanation size, sufficiency, accuracy, Jaccard similarity (to measure the topological similarity of explanations), and feasibility. The first three are standard metrics used across all perturbation-based GNN explainability works. The novelty lies not in the design of the metrics, but rather in their usage to conduct new empirical investigations offering fresh insights.
The table below provides a precise description of our original contributions. It may be noted that **out of the 19 empirical investigations 17 are original contributions.**
We also emphasize that our intent is not to assert the superiority of our metrics over those of GraphFrameX or GraphXAI. Rather, the proposed empirical investigations are novel, offering fresh perspectives and augmenting the understanding of the benchmarked methodologies.
| Experiment | Figure/Table (Ours)| GraphFrameX | GraphXAI | Comment |
|------------|-|------------|----------|---------|
| How does the sufficiency increase as we allow the factual explanation size to grow?| Fig. 2 | $\checkmark$ | | |
| Feasibility of counterfactual explanations| Table 9| | | Counterfactual explanations generate a new graph that serves as a recourse. For a practical recourse, the generated graph should respect the structural and functional integrities of the domain. We study this aspect for the first time.|
| Necessity: If the factual explanation is removed from the original graph, can the GNN still predict the groundtruth label? | Fig K | $\checkmark$| | Do the factual explainers explain the model or the data? Through necessity and reproducibility, we show that the explainers explain the model, but not the data. This is an original contribution. |
| Reproducibility: If the factual explanation is removed from the original graph, and the GNN is retrained on these residual graphs, can it predict the groundtruth label on unseen residual graphs? | Fig L | | | Same as above in Necessity |
| Topological stability of factual explanations to noise in terms of *Jaccard similarity*. | Fig. 3| | | GraphFrameX has studied whether sufficiency is stable to noise, while we focus on how the explanations themselves change with noise. This is a deeper analysis since topologically different explanations may provide similar sufficiency |
| Topological stability of factual explanations to stochasticity of explainer model parameters in terms of *Jaccard similarity*. | Table 6 | | | This aspect has not been studied either with respect to sufficiency or topology |
| Topological stability of factual explanations to GNN architecture in terms of *Jaccard similarity*. | Table 7 | | | This aspect has not been studied either with respect to sufficiency or topology |
| Sufficiency of counterfactual explainers | Tables 4 and 5| | | Benchmarking counterfactual explanations is our original contribution |
| Size of counterfactual explainers |Tables 4 and 5 | ||Benchmarking counterfactual explanations is our original contribution|
| Accuracy of counterfactual explanations | Tables 4 and 5| | | Benchmarking counterfactual explanations is our original contribution |
| Sparsity of counterfactual explanations |Tables U and V in Appendix | | | Benchmarking counterfactual explanations is our original contribution |
| Topological stability of counterfactual graphs to noise in terms of *Jaccard similarity*. | Fig. I in the appendix | | | Benchmarking counterfactual explanations is our original contribution |
| Topological stability of counterfactual graphs to stochasticity of explainer model parameters in terms of *Jaccard similarity*. | Table 8 | | | Benchmarking counterfactual explanations is our original contribution |
| Topological stability of counterfactual graphs to GNN architecture in terms of *Jaccard similarity* | Table Q in Appendix | | | Benchmarking counterfactual explanations is our original contribution |
| Impact of noise on counterfactual sufficiency | Fig J(a) in Appendix | | |Benchmarking counterfactual explanations is our original contribution |
| Impact of noise on counterfactual explanation size| Fig. J(b) in Appendix | | | Benchmarking counterfactual explanations is our original contribution|
| Impact of GNN architecture on counterfactual explanation size | Table S | | |Benchmarking counterfactual explanations is our original contribution |
| Stability of counterfactual sufficiency against stochasticity of explainer seed | Table O | | | Benchmarking counterfactual explanations is our original contribution|
| Stability of counterfactual explanation size against stochasticity of explainer seed |Table P | | | Benchmarking counterfactual explanations is our original contribution. |
**C6: It would be great if the authors could justify why we need a new benchmark for evaluating GNN explanations and why the proposed stability metrics cannot be incorporated into existing benchmarking frameworks.**
*Response:* We hope our response to C1 and C5 clarifies the need for a new study.
Regarding stability, while GraphXAI focuses on measuring the stability of explanation performance, our study revolves around assessing the stability of the explanations themselves. To elaborate, explanations, which are subgraphs in our context, enable individuals to grasp the inner workings of a model. If the explanations, i.e., subgraphs, change substantially due to minor noise injection, the GNN architecture, or the inherent stochasticity in the explanation algorithms, then reliability and trustworthiness are compromised.
While prior investigations have assessed stability by examining the consistency of explanatory performance, our inquiry is directed towards ascertaining the stability of the explanations themselves. This entails comparison between the topology of the original explanation and the altered explanation resulting from the introduction of a variability factor (e.g., noise). This distinction is pivotal, as it unveils deeper insights that extend beyond existing assessments of explanatory stability.
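To make this distinction concrete, below is a minimal sketch of how the topological stability of an explanation itself can be scored; `explainer.explain` and `perturb` are hypothetical placeholders for an arbitrary perturbation-based explainer and variability factor, and the full protocol follows the definitions in the revised manuscript.

```python
def edge_jaccard(expl_a, expl_b):
    """Jaccard similarity between two explanations, each represented as a set
    of undirected edges, e.g. {(u, v), ...} with u < v."""
    edges_a, edges_b = set(expl_a), set(expl_b)
    if not edges_a and not edges_b:
        return 1.0  # two empty explanations are trivially identical
    return len(edges_a & edges_b) / len(edges_a | edges_b)

# Stability of the explanation itself (our focus), as opposed to stability of
# its sufficiency score: compare the explanation of the original graph with
# the explanation obtained after introducing a variability factor (topological
# noise, a different explainer seed, or a different GNN architecture).
#
# expl_orig = explainer.explain(gnn, graph)            # hypothetical interface
# expl_pert = explainer.explain(gnn, perturb(graph))   # hypothetical interface
# stability = edge_jaccard(expl_orig, expl_pert)
```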
Review AzgL (Rating 4 - Reject)
----------------------
**Any reasons why only consider the perturbation-based explainability methods? It seems there is no bottleneck of comparison between perturbation-based methods and the other methods.**
*Response:* The applicability of metrics for evaluating perturbation-based explanation methodologies does not extend seamlessly to alternative explanation mechanisms. We elaborate below:
* **Distinction Between Instance- and Model-Level Explainers:** As visualized in Figure 1 of our manuscript, GNN explainers can be broadly grouped into two categories: instance-level and model-level. Instance-level explainers operate by accepting a graph as input and producing a subgraph that explains the prediction on that graph. Conversely, model-level explainers operate on the neural model itself. Consequently, the utilization of identical metrics is not feasible due to the inherent divergence in both the inputs and objectives.
* **Uniqueness of perturbation-based methods:** Pertaining specifically to instance-level explainers, all perturbation-based methods yield a subgraph as the output with one of the objectives being minimizing its size. In contrast, other instance-level explainers yield diverse outputs encompassing structures like Directed Acyclic Graphs (DAGs) in PGMExplainer, model weights in Graph-Lime, node sets in Grad-CAM, among others. As a result, evaluation inquiries surrounding sufficiency-size trade-off, topological stability, and reproducibility do not hold relevance.
Based on the aforementioned considerations, our scope is confined to perturbation-based methods.
We realize that our choice to benchmark only perturbation-based methods requires further emphasis in the manuscript. Towards that, we have incorporated the following revisions.
* In Section 1.1, we clearly describe the objectives of the various categories of GNN explainers. The exact text is reproduced verbatim below.
> To keep the benchmarking study focused, we investigate only the perturbation-based explainability methods (highlighted in green in Fig. 1). While model-level explainers operate on the GNN model, other forms of instance-level explainers yield diverse outputs spanning Directed Acyclic Graphs in PGMExplainer, model weights in Graph-Lime, node sets in Grad-CAM, among others. Consequently, enforcing a standardized set of inquiries across all explainers is not meaningful.
* In Sec. 1, we have elaborated our discussion on the various paradigms of GNN explainers. Particularly, we discuss the input space, optimization objective and output of each category to better elucidate their differences.
**Typo in Definition 2. It should be G' rather than G_s.**
*Response:* Thank you for pointing this out. We have corrected this typo.
**Graph-based adversarial attacks should also be included in the category of "perturbations in topological spaces". Meanwhile, how about the small perturbations in the node feature space?**
*Response:* We have added the impact of feature perturbations and adversarial attacks on stability in Section B of the Appendix (Fig. H and Table T), which is referenced from Sec 4.2 in the main paper. The key observations that emerge are as follows:
* **Adversarial attacks:** The stability of explanations reduces by $\approx 10\%$ when compared to random perturbations (Fig. H, right inset). However, the trends remain the same. RCExplainer exhibits the highest stability, whereas GNNExplainer displays the lowest. This discrepancy could potentially be ascribed to the transductive nature of GNNExplainer, as opposed to the remaining methods, which are all inductive. Transductive methods do not generalize to unseen data, and hence stability suffers when the topology is perturbed. TAGExplainer is task-agnostic and hence shows more instability compared to the other inductive methods.
* **Feature-space perturbations:** While GNNExplainer and TAGExplainer exhibit a decline in the stability of factual explanations, RCExplainer and PGExplainer demonstrate resilience against feature perturbations (refer to Fig. H, left inset). This stability hierarchy aligns consistently with the rankings observed in the case of topological perturbations. <!--However, when it comes to counterfactual explanations (as shown in Table T), RCExplainer's performance falls short in comparison to CF$^2$ and CLEAR. This discrepancy can be attributed to CF$^2$ and CLEAR's dual focus on both the topological and feature spaces, in addition to being inductive. In contrast, RCExplainer's reasoning capability is limited to the topological space.-->
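For concreteness, a minimal sketch of the kind of feature-space perturbation discussed above; the Gaussian noise model, magnitude, and node fraction below are illustrative placeholders, and the exact protocol is reported in Section B of the Appendix.

```python
import torch

def perturb_features(x, noise_scale=0.01, frac_nodes=0.1):
    """Illustrative feature-space perturbation: add small Gaussian noise to the
    features of a random subset of nodes (hypothetical helper; parameters are
    placeholders, not the values used in the paper)."""
    x_pert = x.clone()
    num_nodes = x.size(0)
    idx = torch.randperm(num_nodes)[: max(1, int(frac_nodes * num_nodes))]
    x_pert[idx] += noise_scale * torch.randn_like(x_pert[idx])
    return x_pert

# Stability is then measured exactly as in the topological case: re-run the
# explainer on the graph with perturbed features and compute the edge-level
# Jaccard similarity between the original and the perturbed explanation.
```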
**The adopted graph datasets are relatively small, as shown in Table 3. Large-scale graph datasets from OGB should be considered.**
*Response:* We have incorporated ogbg-molhiv into our benchmark datasets, and the outcomes span a spectrum of experiments (refer to Tables 3, 4, 8, K, O, P, Q, R, S, and V). Notably, within the array of benchmarked algorithms, CLEAR encounters challenges in scaling to this dataset. In contrast to its counterparts, which learn masks corresponding to explanations, CLEAR takes a distinct approach by learning a generative model that produces the subgraph corresponding to the explanation. Our empirical investigations reveal that the generative modeling component of CLEAR demands an exorbitant quantity of GPU memory, proving to be prohibitively resource-intensive.
**[Error Bars] In my opinion, especially for benchmark papers, the error bar and multiple runs need to be presented. For example, the performance in Figure 2 is quite close, and the conclusion might change when we consider the error bars.**
*Response:* We agree with this suggestion, which has now been incorporated. We have added error bars derived from five different runs to the plots and tables.
Even after accounting for the error bars, we do not observe any change in the conclusions drawn for Fig. 2. We find PGExplainer to be consistently inferior, while RCExplainer and GNNExplainer appear to be the best performers on average. No technique dominates throughout.
<!--
We thank the reviewer for pointing us to existing benchmarks on GNN explainers. We summarize an overview of the existing works and GNN-X-Bench in the following table.
| Benchmark | Syn. Datasets | Real-world Datasets | Metrics | Task | Methods | Code? | Published? | Computation Time | Features + Structure based (baselines and metrics)
|-------------|---------------|---------------------|---------|------|---------|--------|------------|------------|------------|
| GraphXAI | ShapeGEN (synthetic graph data generator)| GC: Mutag, Benzene, Fluoride-Carbonyl, Alkane-Carbonyl, NC: German credit graph, Recidivism graph, Credit defaulter graph | Accuracy, Faithfulness, Stability to topological noise, Counterfactual Fairness , Group Fairness | NC + GC | Random, Grad, GradCAM, GuidedBP, IG, GNNExplainer, PGMExplainer, PGExplainer, SubgraphX | Yes (https://github.com/mims-harvard/GraphXAI) | Scientific Data Journal, March 2023 | No | Structure + Features |
| GraphFramEx | BA-House, BA-Grid, Tree-Cycle, Tree-Grid, BA-Bottle | Cora, Citeseer, Pubmed, Chameleon, Squirrel, Actor, Facebook, Cornell, Texas, Wisconsin, eBay (Case study) | Characterization score(CS): weighted combination of Fid+, Fid-, CS for wrong vs right predictions by gnn, Classifying explanations as sufficient or necessary based on CS, Variance in CS with model seeds, CS vs Mask size, CS vs Mask Entropy, CS vs Mask maximum value, CS vs mask connected component ratio, Fid+/Fid- for different GNNs | NC (Discussion on limitations of explainers for GC, no experiments)| Random, Distance-based, PageRank, Saliency, Integrated Gradients(IG), Grad-CAM, Occlusion, GNNExplainer, PGExplainer, PGM-Explainer, SubgraphX | Yes (https://github.com/GraphFramEx/graphframex) | NeurIPS 2022| Yes (Time vs CS score analysis)|Structure + Features based baselines, no feature based metric |
| GNN-X-Bench (ours) | Tree-Grid, Tree-Cycles, BA-House | Mutagenicity, Proteins, IMDB-B, AIDS, Mutag, NCI1, Graph-SST2, DD, Reddit-B | Size, Sufficiency(Fidelity), Accuracy, Feasibility, Sparsity, Stability to: Topological Noise, Model parameters, GNN architectures, | NC + GC | *Factual:* GNNExplainer, PGExplainer, TAGExplainer, GEM, SubgraphX , RCExplainer($\lambda>0$), CF^2($\alpha>0$) *Counterfactual:* RCExplainer($\lambda=0$), CF^2($\alpha=0$), CFGNNExplainer, CLEAR | Working code compatible with recent version of libraries | Under review | No | Structure + Features based baselines, no feature based metric |
*Note: GC and NC stands for Graph and node classification respectively.*
With reference to the table, we point to the following novel contributions in our work:
1. **In-depth analysis of seminal and state-of-the-art perturbation-based explainers:** It is correctly pointed out by the reviewer that the focus of our benchmarking study is on perturbation-based GNN explainers only. As evident from the above table, existing works compare only a few seminal gradient/feature-based, surrogate-based, decomposition-based, and perturbation-based explainers. In this study, we forsake breadth for the sake of depth and evaluate a dense body of literature consisting of seminal and state-of-the-art (SOTA) perturbation-based explainers only. Moreover, owing to the vast amount of research interest in counterfactuals, many novel methods are emerging and pushing the boundaries of GNN interpretability. This opens an avenue for an in-depth analysis of counterfactual and factual explainability methods, which are the two categories of perturbation-based methods. Thus, through our study, we attempt to create the grounds for this vertical of GNN explainability methods.
2. **Evaluation of performance of recent explainers:** We have added various recent SOTA methods as baselines in our study. We evaluate the recent two-phase explainability method TAGExplainer, as well as the hybrid perturbation methods CF^2 and RCExplainer. We further evaluate the hybrid methods in isolated settings, i.e., purely counterfactual or purely factual, through the relevant hyperparameters. Since these methods have their own sets of metrics for comparison with existing work, gauging their relevance becomes difficult. In GNN-X-Bench, we fill these gaps.
3. **Metrics:** We agree with the reviewer. However, the purpose of this benchmarking study is to create a common ground on which we can compare all SOTA perturbation-based explainers. Therefore, we use all the metrics in the existing literature in order to draw an in-depth performance comparison. Further, we would point the reviewer to the novel insights we draw through the reported derived metrics.
(i) **Analysis of feasibility of counterfactual explanations:** The purpose of finding counterfactual explanations is to have recourse, i.e., to know what changes to make to the input in order to change the output. For these counterfactuals to be relevant in domains such as drug discovery, where GNNs are used prominently, there is a need to study their validity. For example, a mutagenic drug is hazardous for humans. In drug discovery, we would want to know what makes the molecule mutagenic and edit those components. However, post editing we should obtain a feasible molecule which can potentially be used for the treatment of diseases. Therefore, we evaluate the counterfactuals generated by existing explainers for feasibility, and show that even quantitatively superior methods do not produce valid molecules. This leads to an open research direction.
(ii) We do a **structural comparison** of explanations in the face of topological noise, changes in GNN architectures, and changes in explainer parameters (seed). We define a stability metric that computes the structural variance (Jaccard) in graph space of the generated explanations under different variance dimensions. Stability is imperative for explainers to be useful in real-world settings. We further evaluate all baselines at different settings quantitatively for a comprehensive performance assessment.
-->