# KDD24-Review-Rebuttal
# Reviewer JtYY:1
## Weakness
> While the paper studies an important problem (evaluation of graph contrastive methods), its premises are unfounded. The paper recognize two limitations
> (1) extensive tuning of hyperparameters in pre-training, often using validation set of downstream tasks;
> (2) they are evaluated on a single downstream task. These limitations are not valid, especially in recent efforts [1,2,3]. During their pre-training, there is no knowledge of any downstream tasks, and multiple types of downstream tasks have been evaluated. (Relatively minor to this point is that these papers are not cited or discussed.)
> Given that the current trends are fundamentally different from what the paper has assumed, I'm afraid this paper has to be significantly reworked and/or repositioned.
> [1] Universal Prompt Tuning for Graph Neural Networks. NeurIPS'24.
>
> [2] GraphPrompt: Unifying Pre-Training and Downstream Tasks for Graph Neural Networks. WWW'23.
>
> [3] All in One: Multi-task Prompting for Graph Neural Networks. KDD'23 (best paper award).
## Reply Skeleton
1. Point out the scope differences
2. Check the referenced papers first: even if, as in All-in-One, they do not use the validation set to select the model, the pre-training stage still requires choosing hyper-parameters, so the issues we discovered for GCL could also arise there.
3. For the methods we evaluate, there are inherent limitations on the evaluation task level (node level vs. graph level).
4. The referenced papers use multiple downstream tasks for a more comprehensive evaluation, which aligns with our argument.
# Reply to Reviewer JtYY:1
> While the paper studies an important problem (evaluation of graph contrastive methods), its premises are unfounded. The paper recognize two limitations.(1) extensive tuning of hyperparameters in pre-training, often using validation set of downstream tasks;(2) they are evaluated on a single downstream task. These limitations are not valid, especially in recent efforts [1,2,3]. During their pre-training, there is no knowledge of any downstream tasks, and multiple types of downstream tasks have been evaluated.
> [1] Universal Prompt Tuning for Graph Neural Networks. NeurIPS'24.
>
> [2] GraphPrompt: Unifying Pre-Training and Downstream Tasks for Graph Neural Networks. WWW'23.
>
> [3] All in One: Multi-task Prompting for Graph Neural Networks. KDD'23 (best paper award).
We thank the reviewer for highlighting these papers on graph prompt learning, which indeed represent an important emerging methodology for utilizing GCL. Our paper mainly focuses on the fine-tuning paradigm because of its widespread adoption and the extensive body of work surrounding it. We nevertheless acknowledge the significance of graph prompt learning techniques in the GCL landscape and will explore incorporating these methodologies into our evaluation framework.
Upon carefully examining the referenced papers [1,2,3], we found that they exhibit similar limitations to those we discussed in our paper, albeit to varying degrees. We detail these limitations as follows:
- Regarding the first limitation concerning hyper-parameter selection for pre-training: It is commonly stated in the GCL literature that "no knowledge of any downstream tasks is required during the pre-training stage." However, as we elaborate in our paper, different hyper-parameters are often chosen for specific datasets and GCL models, a process that may implicitly rely on downstream-task information. This underscores the importance of considering hyper-parameter sensitivity in evaluation, as highlighted in our work. While graph prompt learning techniques offer an alternative way of utilizing pre-trained models, they still rely on pre-training with specifically selected hyper-parameters, and the referenced papers do not clearly explain how these specific hyper-parameter sets were chosen. Therefore, evaluating these methods should also account for hyper-parameter sensitivity.
- Regarding the second limitation concerning multiple downstream tasks: While some of the referenced papers [1,3] do utilize multiple downstream tasks for a single dataset, aligning with our argument for comprehensive evaluation, most existing GCL studies (especially those following the fine-tuning paradigm) only assess performance on a single downstream task, which we contend is inadequate. Our work explicitly advocates incorporating multiple downstream tasks to comprehensively assess a single pre-trained model, and we propose a more comprehensive benchmarking protocol to address this need. In contrast to the task constructions in the referenced papers, our evaluation uses entirely different datasets for the different downstream tasks, which provides greater task diversity; their constructed tasks nevertheless offer useful options when datasets for diverse downstream tasks are unavailable, and we will consider incorporating such tasks and datasets into our evaluation protocol. Additionally, our proposed tasks can also be used to evaluate prompting-based methods, ensuring a thorough comparison across different GCL approaches.
Once again, we thank the reviewer for bringing up these relevant papers, and we are committed to including and discussing them in our revision. We would also like to emphasize that our proposed evaluation protocol aims to provide a comprehensive assessment framework for GCL methods (under both the fine-tuning and the prompting paradigms), considering both hyper-parameter sensitivity and the use of multiple downstream tasks.
# Reviewer JRMc:4
## Weakness 1
> 1.The tasks are still homogeneous (all node classification). Evaluation on more diverse task types (e.g. link prediction, graph classification, etc.) would further strengthen the work.
## Reply Skeleton 1
1. Thank the reviewer for the suggestion. Explain why the methods we evaluate cannot be applied to link prediction and graph classification at the task-design level.
2. Acknowledge that experiments on more tasks could be helpful, so we run experiments on some graph-level methods that can be evaluated with graph classification.
## Weakness 2
> 2.The paper could benefit from an ablation study or more detailed analysis on why certain models excel or struggle under the new protocol. For example, what properties of Barlow Twins contribute to its strong and consistent performance?
## Reply Skeleton 2
1. Thank the reviewer for the suggestion and explain that the scope of this work is not identifying which component of each method is critical. Carefully illustrate the issue we want to address.
2. There are existing works that discuss which components could be critical to a method's performance. **Check: https://arxiv.org/pdf/2109.01116.pdf and update**
## Weakness 3
> 3.There is limited discussion on the practical tradeoffs (e.g. computational cost) of the proposed protocol and how it could be realistically adopted by the community. Tuning and evaluating models across many hyperparameter settings and tasks may be prohibitively expensive.
## Reply Skeleton 3
1. Acknowledge that it would be good to take this information into consideration, and we will adjust accordingly.
2. Carefully describe the execution time of different methods for pre-training.
3. Some methods take more time for training; with more possible hyper-parameter combinations, finding the optimal set can be very expensive.
## Question 1
> (1) How might the protocol be further extended to handle an even greater diversity of tasks (e.g. link prediction, graph classification, etc.)? What would be the challenges?
## Reply Skeleton 1
1. Explain why some tasks, such as link prediction, are not applicable to the SSL methods we study (which is also one of the challenges).
2. Emphasize our motivation that using multiple downstream tasks is beneficial for a more comprehensive evaluation.
3. Consider whether, at the graph level, other tasks such as clustering could be used for evaluation.
## Question 2
> (2) Can you share any insights or hypotheses on why models like Barlow Twins perform well while others like SFA show inconsistent results?
## Reply Skeleton 2
Refer to the reply to Weakness 2.
## Question 3
> (3) What are your thoughts on the feasibility of the community adopting this protocol given the computational expenses? Are there ways to make it more practically achievable?
## Reply Skeleton 3
1. Combine with the reply to Weakness 3.
2. With our proposed protocol, which provides insights for a more comprehensive evaluation, the community can evaluate pre-training methods more comprehensively with a feasible computing overhead.
# Reply to Reviewer JRMc:4
Dear Reviewer,
Thank you for your feedback and for recognizing the strengths of our paper.
We respond to your comments below:
> 1.The tasks are still homogeneous (all node classification). Evaluation on more diverse task types (e.g., link prediction, graph classification, etc.) would further strengthen the work.
> (1). How might the protocol be further extended to handle an even greater diversity of tasks (e.g., link prediction, graph classification, etc.)? What would be the challenges?
Thank you for the suggestion to extend our experiments.
We would like to clarify that incorporating multiple downstream tasks, including node classification, link prediction, and graph classification, simultaneously with a specific pre-trained model on a specific dataset is often not feasible due to limitations in both the model structure and the datasets themselves. From a dataset perspective, node classification datasets such as Cora and Coauthor typically consist of a single graph with labeled nodes, making it impractical to conduct graph classification tasks on such datasets. Additionally, most GCL models designed for node classification are not naturally suited for graph classification tasks, as they are primarily intended for learning node representations. Furthermore, performing link prediction using pre-trained models poses additional challenges, as it requires removing a portion of the edges and using them as a test set. This necessitates pre-training another model on the partial dataset with removed edges. Given these constraints, we opted to focus on multi-label classification as it allows us to simulate scenarios involving multiple downstream tasks. We acknowledge that including a wider range of evaluation tasks could provide a more comprehensive assessment of pre-trained models. We intend to explore datasets that support different categories of tasks to broaden the scope of our evaluation.
Moreover, we have extended our experiments to datasets suitable for graph classification evaluation. Specifically, we conducted experiments with two GCL methods, InfoGraph [1] and MVGRL(G) [2], on the NCI1, PROTEINS, and PTC_MR datasets. The results, measured in accuracy, are provided below. In the table, the "Min," "Max," "Ave," and "Std" columns report the minimum, maximum, average, and standard deviation over 20 sets of hyper-parameters, with the best epoch reported for each set.
| Model     | Dataset  | Min    | Max    | Ave    | Std    |
| --------- | -------- | ------ | ------ | ------ | ------ |
| MVGRL(G)  | NCI1     | 0.7202 | 0.7591 | 0.7457 | 0.0090 |
| MVGRL(G)  | PROTEINS | 0.7768 | 0.7946 | 0.7866 | 0.0049 |
| MVGRL(G)  | PTC_MR   | 0.6000 | 0.7714 | 0.6957 | 0.0407 |
| InfoGraph | NCI1     | 0.7251 | 0.7908 | 0.7596 | 0.0184 |
| InfoGraph | PROTEINS | 0.7411 | 0.8214 | 0.7781 | 0.0230 |
| InfoGraph | PTC_MR   | 0.6000 | 0.7429 | 0.6557 | 0.0376 |
Our preliminary findings suggest that certain GCL models intended for graph classification tasks are also sensitive to hyper-parameters. For example, according to the results above, MVGRL(G) is generally less sensitive than InfoGraph. Therefore, there is a clear need for a similar evaluation framework tailored specifically for GCL methods designed for graph-level tasks. This presents an interesting avenue for future research, and we plan to explore this aspect further in our future work.
[1] InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization, ICLR 2020
[2] Contrastive Multi-View Representation Learning on Graphs, ICML 2020
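For clarity, the Min/Max/Ave/Std values reported in the tables are plain aggregates over the per-configuration results; a minimal sketch of this aggregation is shown below (the `accuracies` values are hypothetical and only illustrate the computation, and the use of a sample standard deviation is an assumption here):

```python
import numpy as np

# Hypothetical best-epoch test accuracies, one entry per hyper-parameter configuration.
accuracies = np.array([0.72, 0.74, 0.75, 0.73, 0.76])

summary = {
    "Min": accuracies.min(),
    "Max": accuracies.max(),
    "Ave": accuracies.mean(),
    "Std": accuracies.std(ddof=1),  # sample standard deviation across configurations (assumed)
}
print(summary)
```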
> 2.The paper could benefit from an ablation study or more detailed analysis on why certain models excel or struggle under the new protocol. For example, what properties of Barlow Twins contribute to its strong and consistent performance?
>(2). Can you share any insights or hypotheses on why models like Barlow Twins perform well while others like SFA show inconsistent results?
Thank you for your detailed feedback. We would like to clarify that the primary focus of our work is on proposing a new evaluation protocol rather than dissecting the critical components of each GCL method.
Regarding your questions about the performance differences among models, we can share some insights. We believe these differences may stem from two main factors: the number of hyper-parameters involved and the contrastive objective employed.
- As discussed in our paper, the Barlow Twins loss (BT) involves fewer hyper-parameters, namely, the drop edge rate and drop node rate for graph augmentations on two views. Conversely, SFA introduces two additional hyper-parameters: k, representing the number of iterations used in its spectral feature augmentation, and $\tau$, utilized in its adopted InfoNCE loss. The increased number of hyper-parameter combinations in SFA may contribute to its sensitivity to hyper-parameters, thereby resulting in inconsistent results.
- Furthermore, compared to the InfoNCE loss, BT does not require negative samples and instead focuses on enhancing positive-pair similarity and feature diversity (see the sketch below). This simpler strategy could be a contributing factor to its robust and consistent performance.
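For reference, below is a minimal sketch of the Barlow Twins objective as it is commonly implemented (the tensor shapes and the default weight `lambd` are illustrative assumptions, not our exact training code):

```python
import torch

def barlow_twins_loss(z1: torch.Tensor, z2: torch.Tensor, lambd: float = 5e-3) -> torch.Tensor:
    """z1, z2: [N, D] embeddings of two augmented views of the same nodes."""
    n = z1.shape[0]
    # Standardize each feature dimension across the batch.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    c = (z1.T @ z2) / n  # [D, D] cross-correlation matrix between the two views
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # align positive pairs
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate features
    return on_diag + lambd * off_diag  # no negative samples required
```

The only loss-specific weight is `lambd`, which is consistent with the smaller hyper-parameter space discussed above.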
In summary, while we acknowledge that further investigation into the performance dynamics of each method would be valuable, our primary emphasis remains on addressing the limitations of existing evaluation protocols. We appreciate your insights and suggestions and will consider them for future research endeavors.
>3.There is limited discussion on the practical tradeoffs (e.g. computational cost) of the proposed protocol and how it could be realistically adopted by the community. Tuning and evaluating models across many hyperparameter settings and tasks may be prohibitively expensive.
>(3).What are your thoughts on the feasibility of the community adopting this protocol given the computational expenses? Are there ways to make it more practically achievable?
Thank you for your informative suggestion. We appreciate your emphasis on the practical considerations, particularly regarding the computational cost associated with our proposed evaluation protocol. In the revised version of our paper, we will include information about the computational cost, specifically the execution time, to provide a more comprehensive understanding of the trade-offs involved.
To illustrate the computational cost of different methods, we conducted experiments to measure the pre-training time per epoch on the DBLP dataset. The results are summarized in the table below:
| Model | Time (s/epoch) |
| --------- | ----- |
| BGRL | 0.116 |
| CCA | 0.113 |
| COSTA | 0.968 |
| DGI | 0.038 |
| GBT | 0.076 |
| GRACE | 0.227 |
| MVGRL | 0.860 |
| SFA | 0.298 |
| SUGRL | 0.223 |
As shown in the table, some methods exhibit significantly higher pre-training times compared to others. For example, the pre-training time of COSTA is approximately 12 times longer than that of GBT. However, such trade-offs in computational cost are not currently considered in the existing evaluation protocol.
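For reference, per-epoch numbers like those above can be collected with straightforward wall-clock timing; a minimal sketch is shown below (the `pretrain_one_epoch` argument is a hypothetical stand-in for one pre-training epoch of a given method, and for GPU training one would additionally synchronize the device before reading the timer):

```python
import time

def average_epoch_time(pretrain_one_epoch, num_epochs: int = 100) -> float:
    """Average wall-clock pre-training time per epoch, in seconds."""
    start = time.perf_counter()
    for _ in range(num_epochs):
        pretrain_one_epoch()  # hypothetical: runs one pre-training epoch of the method
    return (time.perf_counter() - start) / num_epochs
```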
Moving forward, we acknowledge the importance of incorporating computational cost into our evaluation framework. To address this, we plan to explore strategies such as using a varying number of hyper-parameter sets per method rather than a fixed 20. By adapting the evaluation framework to the computational cost of different methods, we aim to make it more practical and feasible for adoption by the community.
# Reviewer 8gAS:4
## Weakness 1
> 1.The comprehensiveness of the dataset leaves room for improvement, as it lacks experiments on some commonly utilized datasets such as DBLP.
## Reply Skeleton 1
1. Thank the reviewer for the suggestions.
2. We will provide the experimental results on the DBLP dataset for the methods evaluated with node classification.
## Weakness 2
> 2.The study primarily focuses on node classification, overlooking graph classification, thus restricting its comprehensiveness.
## Reply Skeleton 2
1. Thanks for the suggestions
2. We will conduct experiments on some methods that can be evaluated with graph classification.
## Weakness 3
> 3.While the paper excels in model analysis, it appears to draw a scanty correlation with the industry, which could potentially limit its applicability.
## Reply Skeleton 3
1. Thanks for the suggestions
2. The proposed protocol is not directly tied to a specific industrial application, but it does provide more insights on evaluation.
3. Moreover, the benefits relate to the analysis of execution time, as we cannot invest unlimited resources to try every possible combination of hyper-parameters.
The questions correspond to the weaknesses.
# Reply to Reviewer 8gAS:4
Thank you for the feedback and for recognizing the strengths of our paper.
We respond to your comments below:
>1.The comprehensiveness of the dataset leaves room for improvement, as it lacks experiments on some commonly utilized datasets such as DBLP.
>It is suggested that the authors enrich their work by incorporating experiments on the NCI1 and DBLP datasets.
Thank you for your detailed feedback regarding the extension of our experiments. We have conducted experiments on the DBLP dataset, and the results are presented below. The "Min," "Max," "Ave," and "Std" columns report the minimum, maximum, average, and standard deviation over 20 sets of hyper-parameters, with the best epoch reported for each set. These results are generally consistent with the findings presented in our paper.
| Model | Min    | Max    | Ave    | Std    |
| ----- | ------ | ------ | ------ | ------ |
| BGRL  | 0.6678 | 0.7857 | 0.7492 | 0.0332 |
| CCA   | 0.7428 | 0.8404 | 0.8251 | 0.0226 |
| COSTA | 0.5482 | 0.8364 | 0.7649 | 0.0961 |
| DGI   | 0.7975 | 0.8370 | 0.8226 | 0.0126 |
| GBT   | 0.8178 | 0.8409 | 0.8279 | 0.0065 |
| GRACE | 0.7896 | 0.8387 | 0.8266 | 0.0133 |
| MVGRL | 0.8167 | 0.8432 | 0.8363 | 0.0066 |
| SFA   | 0.7609 | 0.8415 | 0.8158 | 0.0204 |
| SUGRL | 0.7986 | 0.8331 | 0.8182 | 0.0092 |
**We would like to note that the results on NCI1 and other graph classification datasets are not included here; they are provided in our response to the next question below.** We appreciate your suggestion to enrich our work with experiments on additional datasets, and we believe these additional experiments will further enhance the comprehensiveness of our evaluation.
>2.The study primarily focuses on node classification, overlooking graph classification, thus restricting its comprehensiveness.
>The authors are encouraged to extend their analysis to include graph-level tasks.
Thank you for your detailed suggestion. We have extended our experiments to datasets suitable for graph classification evaluation. Specifically, we conducted experiments with two GCL methods, InfoGraph and MVGRL(G), on the NCI1, PROTEINS, and PTC_MR datasets. The results, measured in accuracy, are provided below. In the table, the "Min," "Max," "Ave," and "Std" columns report the minimum, maximum, average, and standard deviation over 20 sets of hyper-parameters, with the best epoch reported for each set.
| Model     | Dataset  | Min    | Max    | Ave    | Std    |
| --------- | -------- | ------ | ------ | ------ | ------ |
| MVGRL(G)  | NCI1     | 0.7202 | 0.7591 | 0.7457 | 0.0090 |
| MVGRL(G)  | PROTEINS | 0.7768 | 0.7946 | 0.7866 | 0.0049 |
| MVGRL(G)  | PTC_MR   | 0.6000 | 0.7714 | 0.6957 | 0.0407 |
| InfoGraph | NCI1     | 0.7251 | 0.7908 | 0.7596 | 0.0184 |
| InfoGraph | PROTEINS | 0.7411 | 0.8214 | 0.7781 | 0.0230 |
| InfoGraph | PTC_MR   | 0.6000 | 0.7429 | 0.6557 | 0.0376 |
Our preliminary findings suggest that certain GCL models intended for graph classification tasks are also sensitive to hyper-parameters. For example, according to the results above, MVGRL(G) is generally less sensitive than InfoGraph. Therefore, there is a clear need for a similar evaluation framework tailored specifically to GCL methods designed for graph-level tasks. This presents an interesting avenue for future research, and we plan to explore it further in future work.
>3.While the paper excels in model analysis, it appears to draw a scanty correlation with the industry, which could potentially limit its applicability.
>The authors should elucidate the link between the findings of this paper and their implications for industry.
Thank you for your feedback. Here, we would like to explicitly elucidate the link between the findings of our work and their implications for industry.
In industrial settings, selecting a suitable pre-training model is a crucial step when applying GCL methods. In practical scenarios, where GCL methods are often used to address unseen tasks, extensively tuning models for specific scenarios is not feasible. Hence, there is a preference for GCL models that demonstrate less sensitivity to hyperparameters, ensuring robust performance across diverse tasks without extensive tuning. However, existing evaluation protocols often lack fairness and comprehensiveness, as highlighted in our paper.
Our proposed evaluation protocol addresses these challenges by offering a more comprehensive and fair comparison of various GCL methods. By identifying models less sensitive to hyperparameters and consistently performing well across multiple tasks, our protocol facilitates the industrial adoption of GCL methods. This streamlined model selection process empowers industry practitioners to confidently utilize GCL methods for diverse applications.
In summary, our proposed evaluation protocol provides valuable insights for the industrial applications of GCL methods by offering an improved benchmarking framework. By tackling challenges related to model sensitivity and limited task representation, we equip industry practitioners with robust tools for effectively leveraging GCL methods in real-world scenarios.
# Reviewer HV8V:2
## Weakness 1
> 1.The proposed evaluation protocol does not offer a robust method for assessing how GCL methods' performance varies with hyper-parameter changes across diverse tasks.
## Reply Skeleton 1
1. Carefully illustrate that our methods already take both hyper-parameters and diverse downstream tasks into consideration.
2. Also, our goal is not to conduct exhaustive hyper-parameter experiments, but to show that a method's performance can be strongly affected by its hyper-parameters.
## Weakness 2
> 2.The proposed evaluation protocol mainly concentrates on multi-class classification for evaluating GCL methods, neglecting the necessity to test these methods' performance consistency across a wider range of task types beyond classification.
## Reply Skeleton 2
1. Carefully illustrate that these methods have inherent restrictions that make them applicable only to node-level tasks.
2. We agree that more evaluation task categories could be beneficial, which is why we also propose multi-label node classification as a new evaluation task.
## Question 1
> Why does multi-label evaluation suffice? There are other tasks such as link prediction, graph classification, etc. Why do not include them in consideration?
## Reply Skeleton 1
1. We are arguing that only one downstream task may not be enough.
2. The currently evaluated methods are not designed for these tasks; we acknowledge that experiments on additional downstream tasks might be helpful, but that could be future work.
## Question 2
> Have you considered the sensitivity of hyper-parameters when switching between different tasks, which is a critical aspect of evaluating GCL methods?
## Reply Skeleton 2
1. We consider that it should be relatively easy for good methods to find proper hyper-parameters that achieve satisfying results; i.e., if a method is too sensitive, it becomes much harder and requires more time and effort to decide on the settings for training.
2. Explain why our scope is not to conduct hyper-parameter experiments that map the relationship between hyper-parameters and different tasks.
<!-- # Reply to Reviewer HV8V:2
Dear Reviewer,
Thank you for the feedback and recognizing the strengths of our paper.
We respond to your comments below:
>Weakness 1.The proposed evaluation protocol does not offer a robust method for assessing how GCL methods' performance varies with hyper-parameter changes across diverse tasks.
* Thank you for your feedback.
However, we would like to clarify that our goal is to propose a more fair,comprehensive and doable evaluation protocol but not to assess the GCL performance variance with hyper-parameter changes across diverse tasks.
The method's sensitivity to hyper-parameters and different preferences to hyper-parameters for different downstream tasks are our findings to support our motivation to propose a more fair,comprehensive and doable evaluation protocol as we can't invest unlimited computational budget to make the existing evaluation protocol satisfactorily fair or comprehensive.
>Weakness 2.The proposed evaluation protocol mainly concentrates on multi-class classification for evaluating GCL methods, neglecting the necessity to test these methods' performance consistency across a wider range of task types beyond classification.
* Thank you for the suggestion of extending experiments.
We would like to clarify that the datasets we used in this paper are mostly for node classification, which are not naturally applicable for graph classification for evaluation.
For link prediction, it would require us to remove part of the datasets and use them as test set.Therefore, we focused on the node classification task.
We agree that more evaluation task categories could be beneficial and that’s why we propose to also consider multi-label node classification as a new evaluation task.
>Question1.Why does multi-label evaluation suffice? There are other tasks such as link prediction, graph classification, etc. Why do not include them in consideration?
* Thank you for your insightful question.
We would like to clarity that it is not that native in a single dataset to conduct such different multiple downstream tasks, especially most datasets are created for a certain task.
* For example, for the datasets suitable for node classification, they contain many labeled nodes on the same graph to provide enough samples for conducting node classification, like Cora dataset.
* In contrast, to conduct graph classification, the dataset should contain multiple labeled graphs to provide enough samples, like NCI1 dataset.
* Therefore, a single dataset typically can not be applicable to multiple downstream tasks due to the inherent requirements of different tasks.
* We are arguing that only one downstream task is not enough. So we conduct the experiments evaluated by multi-label node classification in addition to multi-class node classification. -->
<!-- * We didn't include evaluation tasks like link prediction and graph classification due to the inherent limitation of datasets we applied for the same reason in our reply to weakness 2.We agree that more evaluation task categories could be beneficial and that’s why we propose to also consider multi-label node classification as a new evaluation task. -->
<!--
>Question2.Have you considered the sensitivity of hyper-parameters when switching between different tasks, which is a critical aspect of evaluating GCL methods?
* Thank you for your question. However, we would like to clarify that our scope is not to conduct hyper-parameter experiments to find this relationship between hyper-parameters and different tasks. Our findings in the paper that the GCL methods have different sensitivities to hyper-parameters and different preferences to hyper-parameters for different tasks motivated us to propose a more fair,comprehensive and doable evaluation protocol to address the issues in the existing evaluation protocol.
* We argue that the hyper-parameter sensiticity and the performance on different tasks should be considered as a whole when evaluating a method.Intuitivly, we consider that it should be relatively easier for relatively better methods to find proper hyper-parameters to achieve satisfying results on different tasks. i.e., if the methods are too sensitive, it will be much harder and require more time and efforts to decide the hyper-parameters for training and this kind of tremendous computational cost should be considered especially for benchmarking different methods to choose the propoer one to be deployed in industry. -->
# Revised Reply to Reviewer HV8V:2
Dear Reviewer,
Thank you for the feedback and for recognizing the strengths of our paper.
We respond to your comments below:
>Weakness 1.The proposed evaluation protocol does not offer a robust method for assessing how GCL methods' performance varies with hyper-parameter changes across diverse tasks.
>Question2.Have you considered the sensitivity of hyper-parameters when switching between different tasks, which is a critical aspect of evaluating GCL methods?
Thank you for your feedback. In our paper, we have indeed included the sensitivity of hyper-parameters across different tasks in our evaluation considerations. In Section 5.2 of our paper, specifically for the multi-label classification tasks, we conducted experiments with 20 sets of randomly selected hyper-parameters for pre-training GCL encoders. Then, we reported the average performance with standard deviation, as depicted in Figures 5, 6, and 7. This approach allows for a fair evaluation of GCL models across multiple tasks, as models that are sensitive to hyper-parameters will exhibit larger variances in the F1 scores.
Regarding the question `Have you considered the sensitivity of hyper-parameters when switching between different tasks, which is a critical aspect of evaluating GCL methods?`, we would like to emphasize that we do not aim to select a specific set of hyper-parameters of the GCL model for each task (this is impractical, as we discussed in our paper). Instead, we pre-trained 20 GCL models with different sets of hyper-parameters and evaluated them across all tasks. After the pre-training stage, the GCL encoder remained fixed, and only the task head was tuned using the corresponding validation dataset. We deliberately designed our evaluation protocol in this manner to provide a comprehensive assessment of the pre-trained models.
We argue that both hyper-parameter sensitivity and performance across different tasks should be considered together when evaluating a GCL method. Essentially, a model with low sensitivity (lower variance) and high performance (large F1 scores) is deemed favorable.
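To make the protocol concrete, the following minimal sketch outlines the evaluation loop described above. It is illustrative only: `pretrain_encoder` is a hypothetical stand-in for label-free GCL pre-training (stubbed with random features so the sketch runs end to end), and the validation-based tuning of the task head is omitted for brevity.

```python
# Minimal sketch of the evaluation protocol (illustrative; `pretrain_encoder` is a
# hypothetical stand-in for GCL pre-training, stubbed with random features here).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_nodes, n_classes, dim = 1000, 5, 64
y = rng.integers(0, n_classes, size=n_nodes)
perm = rng.permutation(n_nodes)
train_idx, test_idx = perm[:800], perm[800:]

def pretrain_encoder(hparams: dict) -> np.ndarray:
    """Stand-in for label-free GCL pre-training with the sampled hyper-parameters."""
    seed = hash(tuple(sorted(hparams.items()))) % 2**32
    return np.random.default_rng(seed).normal(size=(n_nodes, dim))

scores = []
for _ in range(20):                                   # 20 randomly sampled hyper-parameter sets
    hparams = {"drop_edge": rng.uniform(0, 0.5), "drop_feat": rng.uniform(0, 0.5)}
    Z = pretrain_encoder(hparams)                     # encoder is frozen after pre-training
    head = LogisticRegression(max_iter=500).fit(Z[train_idx], y[train_idx])  # only the task head is fitted
    scores.append(f1_score(y[test_idx], head.predict(Z[test_idx]), average="micro"))

print(f"Micro-F1 over 20 hyper-parameter sets: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
```

The key point the sketch illustrates is that the encoder never sees task labels and is frozen before the task head is fitted, so the reported mean and standard deviation directly reflect hyper-parameter sensitivity.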
We appreciate your feedback, and we will revise our paper to clarify our experimental settings accordingly.
<!--
In fact, we did observe that the investigated methods have different preferences for hyper-parameters across different tasks.
Following the settings in our paper to construct 121 tasks, we found that for every method we investigated, different sets of hyper-parameters among the 20 were selected as optimal for different tasks. For instance, for CCA, 9 out of 20 sets were selected as optimal for at least one of the 121 tasks, while for DGI and SFA the numbers were 12 and 11 out of 20. GBT and COSTA had fewer sets selected (5 and 6 out of 20), but we can still conclude that the methods prefer different sets of hyper-parameters across different tasks.
* However, we would like to clarify that our goal is to propose a fairer, more comprehensive, and practical evaluation protocol, not to conduct hyper-parameter experiments to map this relationship between hyper-parameters and different tasks.
Our findings that GCL methods have different sensitivities and preferences to hyper-parameters across tasks motivated us to propose such a protocol to address the issues in the existing evaluation protocol.
Our motivation is practical: we cannot invest an unlimited computational budget in selecting hyper-parameters to make the existing evaluation protocol satisfactorily fair or comprehensive.
* Moreover, we argue that hyper-parameter sensitivity and performance on different tasks should be considered as a whole when evaluating a method (i.e., both are critical in our proposed evaluation protocol). Intuitively, it should be relatively easy for a better method to find proper hyper-parameters that achieve satisfying results on different tasks; if a method is too sensitive, deciding its hyper-parameters requires far more time and effort, and this computational cost should be taken into account, especially when benchmarking methods to choose one for industrial deployment. -->
>Weakness 2.The proposed evaluation protocol mainly concentrates on multi-class classification for evaluating GCL methods, neglecting the necessity to test these methods' performance consistency across a wider range of task types beyond classification.
>Question1.Why does multi-label evaluation suffice? There are other tasks such as link prediction, graph classification, etc. Why do not include them in consideration?
Thank you for your insightful question and the suggestion to extend our experiments.
We would like to clarify that incorporating multiple downstream tasks, including node classification, link prediction, and graph classification, simultaneously with a specific pre-trained model on a specific dataset is often not feasible due to limitations in both the model structure and the datasets themselves. From a dataset perspective, node classification datasets such as Cora and Coauthor typically consist of a single graph with labeled nodes, making it impractical to conduct graph classification tasks on such datasets. Additionally, most GCL models designed for node classification are not naturally suited for graph classification tasks, as they are primarily intended for learning node representations. Furthermore, performing link prediction using pre-trained models poses additional challenges, as it requires removing a portion of the edges and using them as a test set. This necessitates pre-training another model on the partial dataset with removed edges.
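As an illustration of the last point, the short sketch below (a generic link-prediction split, not code from our paper) shows why a separate pre-training run would be needed: the test edges must be hidden from the graph before the encoder is ever pre-trained.

```python
# Generic link-prediction split (illustrative; not our paper's code): a fraction of
# edges is held out as positive test links, so the encoder may only be pre-trained
# on the remaining edges.
import numpy as np

rng = np.random.default_rng(0)
edges = rng.integers(0, 100, size=(500, 2))   # toy (src, dst) edge list

perm = rng.permutation(len(edges))
n_test = int(0.1 * len(edges))                # hold out 10% of edges for evaluation
test_pos_edges = edges[perm[:n_test]]         # positive test links (hidden during pre-training)
train_edges = edges[perm[n_test:]]            # the only edges the encoder is allowed to see

# An encoder pre-trained for node classification used the full edge set, so it has
# already "seen" test_pos_edges; a fair link-prediction evaluation therefore requires
# re-pre-training on train_edges only.
print(len(train_edges), len(test_pos_edges))
```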
Given these constraints, we opted to focus on multi-label classification as it allows us to simulate scenarios involving multiple downstream tasks. We acknowledge that including a wider range of evaluation tasks could provide a more comprehensive assessment of pre-trained models. In future work, we intend to explore datasets that support different categories of tasks to broaden the scope of our evaluation.
Thank you for raising this point, and we appreciate your feedback. We will consider additional evaluation tasks in our future research endeavors to enhance the comprehensiveness of our evaluation protocol.
<!--
The limitations mainly lie in two perspectives: the model structure itself and the datasets constructed for each task.
For example, datasets suitable for node classification, such as Cora, contain many labeled nodes on the same graph so that there are enough samples for node classification; with such samples, the model can learn node embeddings. In contrast, graph classification requires a dataset containing many labeled graphs, such as NCI1, so that the model can learn graph embeddings. Consequently, evaluating a model pre-trained on a node classification dataset (with scarce or no graph labels) on graph classification is not natural: the pre-trained model has no information about graph labels and cannot readily produce graph-level embeddings.
For link prediction, it would require us to remove part of the edges and use them as a test set, but we would not know this 'ground truth' split at the pre-training stage. -->
# Response 1 for Reviewer HV8V
Dear Reviewer,
Thank you for the feedback and for recognizing the strengths of our paper.
We respond to your comments below:
>Weakness 1.The proposed evaluation protocol does not offer a robust method for assessing how GCL methods' performance varies with hyper-parameter changes across diverse tasks.
>Question2.Have you considered the sensitivity of hyper-parameters when switching between different tasks, which is a critical aspect of evaluating GCL methods?
Thank you for your feedback. In our paper, we have indeed included the sensitivity of hyper-parameters across different tasks in our evaluation considerations. In Section 5.2 of our paper, specifically for the multi-label classification tasks, we conducted experiments with 20 sets of randomly selected hyper-parameters for pre-training GCL encoders. Then, we reported the average performance with standard deviation, as depicted in Figures 5, 6, and 7. This approach allows for a fair evaluation of GCL models across multiple tasks, as models that are sensitive to hyper-parameters will exhibit larger variances in the F1 scores.
Regarding the question `Have you considered the sensitivity of hyper-parameters when switching between different tasks, which is a critical aspect of evaluating GCL methods?`, we would like to emphasize that we do not aim to select a specific set of hyper-parameters of the GCL model for each task (this is impractical, as we discussed in our paper). Instead, we pre-trained 20 GCL models with different sets of hyper-parameters and evaluated them across all tasks. After the pre-training stage, the GCL encoder remained fixed, and only the task head was tuned using the corresponding validation dataset. We deliberately designed our evaluation protocol in this manner to provide a comprehensive assessment of the pre-trained models.
We argue that both hyper-parameter sensitivity and performance across different tasks should be considered together when evaluating a GCL method. Essentially, a model with low sensitivity (lower variance) and high performance (large F1 scores) is deemed favorable.
We appreciate your feedback, and we will revise our paper to clarify our experimental settings accordingly.
# Response 2 for Reviewer HV8V
>Weakness 2.The proposed evaluation protocol mainly concentrates on multi-class classification for evaluating GCL methods, neglecting the necessity to test these methods' performance consistency across a wider range of task types beyond classification.
>Question1.Why does multi-label evaluation suffice? There are other tasks such as link prediction, graph classification, etc. Why do not include them in consideration?
Thank you for your insightful question and the suggestion to extend our experiments.
We would like to clarify that incorporating multiple downstream tasks, including node classification, link prediction, and graph classification, simultaneously with a specific pre-trained model on a specific dataset is often not feasible due to limitations in both the model structure and the datasets themselves. From a dataset perspective, node classification datasets such as Cora and Coauthor typically consist of a single graph with labeled nodes, making it impractical to conduct graph classification tasks on such datasets. Additionally, most GCL models designed for node classification are not naturally suited for graph classification tasks, as they are primarily intended for learning node representations. Furthermore, performing link prediction using pre-trained models poses additional challenges, as it requires removing a portion of the edges and using them as a test set. This necessitates pre-training another model on the partial dataset with removed edges.
Given these constraints, we opted to focus on multi-label classification as it allows us to simulate scenarios involving multiple downstream tasks. We acknowledge that including a wider range of evaluation tasks could provide a more comprehensive assessment of pre-trained models. In future work, we intend to explore datasets that support different categories of tasks to broaden the scope of our evaluation.
Thank you for raising this point, and we appreciate your feedback. We will consider additional evaluation tasks in our future research endeavors to enhance the comprehensiveness of our evaluation protocol.
# Response 1 for Reviewer JRMC
Dear Reviewer,
Thank you for your feedback and for recognizing the strengths of our paper.
We respond to your comments below:
> 1.The tasks are still homogeneous (all node classification). Evaluation on more diverse task types (e.g., link prediction, graph classification, etc.) would further strengthen the work.
> (1). How might the protocol be further extended to handle an even greater diversity of tasks (e.g., link prediction, graph classification, etc.)? What would be the challenges?
Thank you for the suggestion to extend our experiments.
We would like to clarify that incorporating multiple downstream tasks, including node classification, link prediction, and graph classification, simultaneously with a specific pre-trained model on a specific dataset is often not feasible due to limitations in both the model structure and the datasets themselves. From a dataset perspective, node classification datasets such as Cora and Coauthor typically consist of a single graph with labeled nodes, making it impractical to conduct graph classification tasks on such datasets. Additionally, most GCL models designed for node classification are not naturally suited for graph classification tasks, as they are primarily intended for learning node representations. Furthermore, performing link prediction using pre-trained models poses additional challenges, as it requires removing a portion of the edges and using them as a test set. This necessitates pre-training another model on the partial dataset with removed edges.
Given these constraints, we opted to focus on multi-label classification as it allows us to simulate scenarios involving multiple downstream tasks. We acknowledge that including a wider range of evaluation tasks could provide a more comprehensive assessment of pre-trained models. We intend to explore datasets that support different categories of tasks to broaden the scope of our evaluation.
# Response 2 for Reviewer JRMC
Moreover, we have extended our experiments to include datasets suitable for graph classification evaluation. Specifically, we conducted experiments using two GCL methods, InfoGraph[1] and MVGRL(G)[2], on datasets such as NCI1, PROTEINS, and PTC_MR. The results, measured in terms of accuracy, are provided below. In the table, the "Min," "Max," "Ave," and "Std" columns represent the minimum, maximum, average, and standard deviation values obtained from 20 sets of hyper-parameters, where the result at the best epoch is reported for each set.
| Model | Dataset | Min | Max | Ave | Std |
| --------- | -------- | ----------- | ----------- | ------------ | -------------- |
| MVGRL(G) | NCI1 | 0.720194647 | 0.759124088 | 0.7457420923 | 0.00898320415 |
| MVGRL(G) | PROTEINS | 0.776785714 | 0.794642857 | 0.786607143 | 0.004933091504 |
| MVGRL(G) | PTC_MR | 0.6 | 0.771428571 | 0.6957142857 | 0.04069746377 |
| InfoGraph | NCI1 | 0.725060827 | 0.790754258 | 0.7596107057 | 0.018405012 |
| InfoGraph | PROTEINS | 0.741071429 | 0.821428571 | 0.7781250001 | 0.02304285311 |
| InfoGraph | PTC_MR | 0.6 | 0.742857143 | 0.6557142859 | 0.03762555068 |
Our preliminary findings suggest that certain GCL models intended for graph classification tasks are also sensitive to hyper-parameters. For example, according to the results above, MVGRL(G) is generally less sensitive than InfoGraph. Therefore, there is a clear need for a similar evaluation framework tailored specifically for GCL methods designed for graph-level tasks. This presents an interesting avenue for future research, and we plan to explore this aspect further in our future work.
[1] InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. ICLR'20.
[2] Contrastive Multi-View Representation Learning on Graphs. ICML'20.
# Response 3 for Reviewer JRMC
> 2.The paper could benefit from an ablation study or more detailed analysis on why certain models excel or struggle under the new protocol. For example, what properties of Barlow Twins contribute to its strong and consistent performance?
>(2). Can you share any insights or hypotheses on why models like Barlow Twins perform well while others like SFA show inconsistent results?
Thank you for your detailed feedback. We would like to clarify that the primary focus of our work is on proposing a new evaluation protocol rather than dissecting the critical components of each GCL method.
Regarding your questions about the performance differences among models, we have some understanding to share. We believe that these differences may stem from two main factors: the number of hyper-parameters involved and the contrastive objective employed.
- As discussed in our paper, the Barlow Twins loss (BT) involves fewer hyper-parameters, namely, the drop edge rate and drop node rate for graph augmentations on two views. Conversely, SFA introduces two additional hyper-parameters: k, representing the number of iterations used in its spectral feature augmentation, and $\tau$, utilized in its adopted InfoNCE loss. The increased number of hyper-parameter combinations in SFA may contribute to its sensitivity to hyper-parameters, thereby resulting in inconsistent results.
- Furthermore, compared to the InfoNCE loss, BT does not require negative samples and instead focuses on enhancing positive-pair similarity and feature diversity (see the illustrative sketch below). This simplicity in strategy could be a contributing factor to its robust and consistent performance.
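For reference, here is a minimal, self-contained sketch of a Barlow-Twins-style objective (a generic formulation, not the exact implementation evaluated in our paper). It makes the two points above visible in code: there are no negative samples and only a single loss weight `lambd`, in contrast to InfoNCE, which needs a temperature $\tau$ and negatives.

```python
# Generic Barlow-Twins-style objective (illustrative, not the paper's exact code).
import torch

def barlow_twins_loss(z1: torch.Tensor, z2: torch.Tensor, lambd: float = 5e-3) -> torch.Tensor:
    n, _ = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)        # standardize each embedding dimension
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    c = (z1.T @ z2) / n                                # cross-correlation matrix between the two views
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()     # pull matched (positive) pairs together
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate the remaining feature pairs
    return on_diag + lambd * off_diag                  # no negative samples, no temperature

# usage: z1 and z2 are node embeddings of the same graph under two augmented views
loss = barlow_twins_loss(torch.randn(256, 128), torch.randn(256, 128))
```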
In summary, while we acknowledge that further investigation into the performance dynamics of each method would be valuable, our primary emphasis remains on addressing the limitations of existing evaluation protocols. We appreciate your insights and suggestions and will consider them for future research endeavors.
# Response 4 for Reviewer JRMC
>3.There is limited discussion on the practical tradeoffs (e.g. computational cost) of the proposed protocol and how it could be realistically adopted by the community. Tuning and evaluating models across many hyperparameter settings and tasks may be prohibitively expensive.
>(3).What are your thoughts on the feasibility of the community adopting this protocol given the computational expenses? Are there ways to make it more practically achievable?
Thank you for your informative suggestion. We appreciate your emphasis on the practical considerations, particularly regarding the computational cost associated with our proposed evaluation protocol. In the revised version of our paper, we will include information about the computational cost, specifically the execution time, to provide a more comprehensive understanding of the trade-offs involved.
To illustrate the computational cost of different methods, we conducted experiments to measure the pre-training time per epoch on the DBLP dataset. The results are summarized in the table below:
| Model | Time (s/epoch) |
| --------- | ----- |
| BGRL | 0.116 |
| CCA | 0.113 |
| COSTA | 0.968 |
| DGI | 0.038 |
| GBT | 0.076 |
| GRACE | 0.227 |
| MVGRL | 0.860 |
| SFA | 0.298 |
| SUGRL | 0.223 |
As shown in the table, some methods exhibit significantly higher pre-training times compared to others. For example, the pre-training time of COSTA is approximately 12 times longer than that of GBT. However, such trade-offs in computational cost are not currently considered in the existing evaluation protocol.
Moving forward, we acknowledge the importance of incorporating computational cost considerations into our evaluation framework. To address this, we plan to explore strategies such as using varying numbers of sets of hyper-parameters rather than a consistent 20. By adapting the evaluation framework to account for the computational cost of different methods, we aim to make it more practical and feasible for adoption by the community.
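Per-epoch pre-training time can be measured with simple wall-clock timing, as in the sketch below (illustrative only; the stand-in epoch function is a placeholder for an actual GCL pre-training step).

```python
# Illustrative wall-clock timing of pre-training epochs (stand-in epoch function;
# replace it with an actual GCL pre-training step).
import time

def seconds_per_epoch(train_one_epoch, num_epochs: int = 10) -> float:
    start = time.perf_counter()
    for _ in range(num_epochs):
        train_one_epoch()
    return (time.perf_counter() - start) / num_epochs

if __name__ == "__main__":
    dummy_epoch = lambda: sum(i * i for i in range(100_000))  # placeholder workload
    print(f"{seconds_per_epoch(dummy_epoch):.3f} s/epoch")
```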
# Response 1 for Reviewer 8gAS
Thank you for the feedback and for recognizing the strengths of our paper.
We respond to your comments below:
>1.The comprehensiveness of the dataset leaves room for improvement, as it lacks experiments on some commonly utilized datasets such as DBLP.
>It is suggested that the authors enrich their work by incorporating experiments on the NCI1 and DBLP datasets.
Thank you for your detailed feedback regarding the extension of our experiments. We have conducted experiments on the DBLP dataset, and the results are presented below. The "Min," "Max," "Ave," and "Std" columns represent the minimum, maximum, average, and standard deviation values obtained from experiments conducted with 20 sets of hyper-parameters, where the result at the best epoch is reported for each set. We observed that these results are generally consistent with the findings presented in our paper.
| Model | Min | Max | Ave | Std |
| ----- | ------ | ------ | ------ | -------------- |
| BGRL | 0.6678 | 0.7857 | 0.7492 | 0.0332093927 |
| CCA | 0.7428 | 0.8404 | 0.8251 | 0.02264025111 |
| COSTA | 0.5482 | 0.8364 | 0.7649 | 0.09611825843 |
| DGI | 0.7975 | 0.8370 | 0.8226 | 0.01258848951 |
| GBT | 0.8178 | 0.8409 | 0.8279 | 0.006522476488 |
| GRACE | 0.7896 | 0.8387 | 0.8266 | 0.01334265861 |
| MVGRL | 0.8167 | 0.8432 | 0.8363 | 0.006628464498 |
| SFA | 0.7609 | 0.8415 | 0.8158 | 0.0203650853 |
| SUGRL | 0.7986 | 0.8331 | 0.8182 | 0.009150276264 |
**We would like to note that while the results on the NCI1 dataset and other graph classification datasets are not included in this response, we will provide them in the response to the subsequent questions.** We appreciate your suggestion to enrich our work by incorporating experiments on additional datasets, and we believe that these additional experiments will further enhance the comprehensiveness of our evaluation.
# Response 2 for Reviewer 8gAS
>2.The study primarily focuses on node classification, overlooking graph classification, thus restricting its comprehensiveness.
>The authors are encouraged to extend their analysis to include graph-level tasks.
Thank you for your detailed suggestion. We have extended our experiments to include datasets suitable for graph classification evaluation. Specifically, we conducted experiments using two GCL methods, InfoGraph and MVGRL(G), on datasets such as NCI1, PROTEINS, and PTC_MR. The results, measured in terms of accuracy, are provided below. In the table, the "Min," "Max," "Ave," and "Std" columns represent the minimum, maximum, average, and standard deviation values obtained from 20 sets of hyper-parameters, where the result at the best epoch is reported for each set.
| Model | Dataset | Min | Max | Ave | Std |
| --------- | -------- | ----------- | ----------- | ------------ | -------------- |
| MVGRL(G) | NCI1 | 0.720194647 | 0.759124088 | 0.7457420923 | 0.00898320415 |
| MVGRL(G) | PROTEINS | 0.776785714 | 0.794642857 | 0.786607143 | 0.004933091504 |
| MVGRL(G) | PTC_MR | 0.6 | 0.771428571 | 0.6957142857 | 0.04069746377 |
| InfoGraph | NCI1 | 0.725060827 | 0.790754258 | 0.7596107057 | 0.018405012 |
| InfoGraph | PROTEINS | 0.741071429 | 0.821428571 | 0.7781250001 | 0.02304285311 |
| InfoGraph | PTC_MR | 0.6 | 0.742857143 | 0.6557142859 | 0.03762555068 |
Our preliminary findings suggest that certain GCL models intended for graph classification tasks are also sensitive to hyper-parameters. For example, according to the results above, MVGRL(G) is generally less sensitive than InfoGraph. Therefore, there is a clear need for a similar evaluation framework tailored specifically for GCL methods designed for graph-level tasks. This presents an interesting avenue for future research, and we plan to explore this aspect further in our future work.
# Response 3 for Reviewer 8gAS
>3.While the paper excels in model analysis, it appears to draw a scanty correlation with the industry, which could potentially limit its applicability.
>The authors should elucidate the link between the findings of this paper and their implications for industry.
Thank you for your feedback. Here, we would like to explicitly elucidate the link between the findings of our work and their implications for industry.
In industrial settings, selecting a suitable pre-training model is a crucial step when applying GCL methods. In practical scenarios, where GCL methods are often used to address unseen tasks, extensively tuning models for specific scenarios is not feasible. Hence, there is a preference for GCL models that demonstrate less sensitivity to hyperparameters, ensuring robust performance across diverse tasks without extensive tuning. However, existing evaluation protocols often lack fairness and comprehensiveness, as highlighted in our paper.
Our proposed evaluation protocol addresses these challenges by offering a more comprehensive and fair comparison of various GCL methods. By identifying models less sensitive to hyperparameters and consistently performing well across multiple tasks, our protocol facilitates the industrial adoption of GCL methods. This streamlined model selection process empowers industry practitioners to confidently utilize GCL methods for diverse applications.
In summary, our proposed evaluation protocol provides valuable insights for the industrial applications of GCL methods by offering an improved benchmarking framework. By tackling challenges related to model sensitivity and limited task representation, we equip industry practitioners with robust tools for effectively leveraging GCL methods in real-world scenarios.