# KDD24-Review-Rebuttal
# Reviewer JtYY:1
## Weakness
> While the paper studies an important problem (evaluation of graph contrastive methods), its premises are unfounded. The paper recognize two limitations
> (1) extensive tuning of hyperparameters in pre-training, often using validation set of downstream tasks;
> (2) they are evaluated on a single downstream task. These limitations are not valid, especially in recent efforts [1,2,3]. During their pre-training, there is no knowledge of any downstream tasks, and multiple types of downstream tasks have been evaluated. (Relatively minor to this point is that these papers are not cited or discussed.)
> Given that the current trends are fundamentally different from what the paper has assumed, I'm afraid this paper has to be significantly reworked and/or repositioned.
> [1] Universal Prompt Tuning for Graph Neural Networks. NeurIPS'24.
>
> [2] GraphPrompt: Unifying Pre-Training and Downstream Tasks for Graph Neural Networks. WWW'23.
>
> [3] All in One: Multi-task Prompting for Graph Neural Networks. KDD'23 (best paper award).
## Reply Skeleton
1. Point out the scope differences
2. Check the referenced papers first: even if, as in All-in-One, they do not use the validation set to select the model, the pre-training stage still requires choosing hyper-parameters, so the issues we discovered for GCL could also arise there.
3. For the methods we evaluate, there are inherent limitations on the evaluation task level (node level vs. graph level).
4. The referenced papers use multiple downstream tasks for a more comprehensive evaluation, which aligns with our argument.
# Reply to Reviewer JtYY:1
> While the paper studies an important problem (evaluation of graph contrastive methods), its premises are unfounded. The paper recognize two limitations.(1) extensive tuning of hyperparameters in pre-training, often using validation set of downstream tasks;(2) they are evaluated on a single downstream task. These limitations are not valid, especially in recent efforts [1,2,3]. During their pre-training, there is no knowledge of any downstream tasks, and multiple types of downstream tasks have been evaluated.
> [1] Universal Prompt Tuning for Graph Neural Networks. NeurIPS'24.
>
> [2] GraphPrompt: Unifying Pre-Training and Downstream Tasks for Graph Neural Networks. WWW'23.
>
> [3] All in One: Multi-task Prompting for Graph Neural Networks. KDD'23 (best paper award).
We thank the reviewer for highlighting these papers on graph prompt learning, which indeed represent an important emerging methodology for utilizing GCL. Our paper mainly focuses on the fine-tuning paradigm because of its widespread adoption and the extensive body of work surrounding it. We nevertheless acknowledge the significance of graph prompt learning techniques in the GCL landscape and will explore incorporating these methodologies into our evaluation framework.
Upon carefully examining the referenced papers [1,2,3], we found that they exhibit similar limitations to those we discussed in our paper, albeit to varying degrees. We detail these limitations as follows:
- Regarding the first limitation concerning hyper-parameter selection for pre-training: It is commonly stated in the GCL literature that "no knowledge of any downstream tasks is required during the pre-training stage." However, as we elaborate in our paper, different hyper-parameters are often chosen for specific datasets and GCL models, a process that may implicitly rely on downstream-task information. This underscores the importance of considering hyper-parameter sensitivity in evaluation, as highlighted in our work. While graph prompt learning techniques offer an alternative way of utilizing pre-trained models, they still rely on pre-training with specifically selected hyper-parameters, and the referenced papers do not clearly explain how these specific hyper-parameter sets were chosen. Therefore, evaluating these methods should also account for hyper-parameter sensitivity.
- Regarding the second limitation concerning multiple downstream tasks: While some of the referenced papers [1,3] do utilize multiple downstream tasks for a single dataset, aligning with our argument for comprehensive evaluation, most existing GCL studies (especially those following the fine-tuning paradigm) only assess performance on a single downstream task, which we contend is inadequate. Our work explicitly advocates incorporating multiple downstream tasks to comprehensively assess a single pre-trained model, and we propose a more comprehensive benchmarking protocol to address this need. In contrast to the task constructions in the referenced papers, our evaluation uses entirely different datasets for the different downstream tasks, which provides greater task diversity; their constructed tasks nevertheless offer useful options when datasets for diverse downstream tasks are unavailable, and we will consider incorporating such tasks and datasets into our evaluation protocol. Additionally, our proposed tasks can also be used to evaluate prompting-based methods, ensuring a thorough comparison across different GCL approaches.
Once again, we thank the reviewer for bringing up these relevant papers, and we are committed to including and discussing them in our revision. We would also like to emphasize that our proposed evaluation protocol aims to provide a comprehensive assessment framework for GCL methods (under both the fine-tuning and the prompting paradigms), considering both hyper-parameter sensitivity and the use of multiple downstream tasks.
# Reviewer JRMc:4
## Weakness 1
> 1.The tasks are still homogeneous (all node classification). Evaluation on more diverse task types (e.g. link prediction, graph classification, etc.) would further strengthen the work.
## Reply Skeleton 1
1. Thank the reviewer for the suggestion. Explain why the methods we evaluate cannot be applied to link prediction and graph classification at the task-design level.
2. Acknowledge that experiments on more tasks could be helpful, so we run experiments on some graph-level methods that can be evaluated with graph classification.
## Weakness 2
> 2.The paper could benefit from an ablation study or more detailed analysis on why certain models excel or struggle under the new protocol. For example, what properties of Barlow Twins contribute to its strong and consistent performance?
## Reply Skeleton 2
1. Thank the reviewer for the suggestion and explain that the scope of this work is not identifying which component of each method is critical. Carefully illustrate the issue we want to address.
2. There are existing works that discuss which components could be critical to a method's performance. **Check: https://arxiv.org/pdf/2109.01116.pdf and update**
## Weakness 3
> 3.There is limited discussion on the practical tradeoffs (e.g. computational cost) of the proposed protocol and how it could be realistically adopted by the community. Tuning and evaluating models across many hyperparameter settings and tasks may be prohibitively expensive.
## Reply Skeleton 3
1. Acknowledge that it would be good to take this information into consideration, and we will adjust accordingly.
2. Carefully describe the execution time of different methods for pre-training.
3. Some methods take more time for training; with more possible hyper-parameter combinations, finding the optimal set can be very expensive.
## Question 1
> (1) How might the protocol be further extended to handle an even greater diversity of tasks (e.g. link prediction, graph classification, etc.)? What would be the challenges?
## Reply Skeleton 1
1. Explain why some tasks, such as link prediction, are not applicable to the SSL methods we study (which is also one of the challenges).
2. Emphasize our motivation that using multiple downstream tasks is beneficial for a more comprehensive evaluation.
3. Consider whether, at the graph level, other tasks such as clustering could be used for evaluation.
## Question 2
> (2) Can you share any insights or hypotheses on why models like Barlow Twins perform well while others like SFA show inconsistent results?
## Reply Skeleton 2
Refer to the reply to Weakness 2.
## Question 3
> (3) What are your thoughts on the feasibility of the community adopting this protocol given the computational expenses? Are there ways to make it more practically achievable?
## Reply Skeleton 3
1. Combine with the reply to Weakness 3.
2. With our proposed protocol, which provides insights for a more comprehensive evaluation, the community can evaluate pre-training methods more comprehensively with a feasible computing overhead.
# Reply to Reviewer JRMc:4
Dear Reviewer,
Thank you for your feedback and for recognizing the strengths of our paper.
We respond to your comments below:
> 1.The tasks are still homogeneous (all node classification). Evaluation on more diverse task types (e.g., link prediction, graph classification, etc.) would further strengthen the work.
> (1). How might the protocol be further extended to handle an even greater diversity of tasks (e.g., link prediction, graph classification, etc.)? What would be the challenges?
Thank you for the suggestion to extend our experiments.
We would like to clarify that incorporating multiple downstream tasks, including node classification, link prediction, and graph classification, simultaneously with a specific pre-trained model on a specific dataset is often not feasible due to limitations in both the model structure and the datasets themselves. From a dataset perspective, node classification datasets such as Cora and Coauthor typically consist of a single graph with labeled nodes, making it impractical to conduct graph classification tasks on such datasets. Additionally, most GCL models designed for node classification are not naturally suited for graph classification tasks, as they are primarily intended for learning node representations. Furthermore, performing link prediction using pre-trained models poses additional challenges, as it requires removing a portion of the edges and using them as a test set. This necessitates pre-training another model on the partial dataset with removed edges. Given these constraints, we opted to focus on multi-label classification as it allows us to simulate scenarios involving multiple downstream tasks. We acknowledge that including a wider range of evaluation tasks could provide a more comprehensive assessment of pre-trained models. We intend to explore datasets that support different categories of tasks to broaden the scope of our evaluation.
Moreover, we have extended our experiments to datasets suitable for graph classification evaluation. Specifically, we conducted experiments with two GCL methods, InfoGraph [1] and MVGRL(G) [2], on the NCI1, PROTEINS, and PTC_MR datasets. The results, measured in accuracy, are provided below. In the table, the "Min," "Max," "Ave," and "Std" columns report the minimum, maximum, average, and standard deviation over 20 sets of hyper-parameters, with the best epoch reported for each set.
| Model     | Dataset  | Min    | Max    | Ave    | Std    |
| --------- | -------- | ------ | ------ | ------ | ------ |
| MVGRL(G)  | NCI1     | 0.7202 | 0.7591 | 0.7457 | 0.0090 |
| MVGRL(G)  | PROTEINS | 0.7768 | 0.7946 | 0.7866 | 0.0049 |
| MVGRL(G)  | PTC_MR   | 0.6000 | 0.7714 | 0.6957 | 0.0407 |
| InfoGraph | NCI1     | 0.7251 | 0.7908 | 0.7596 | 0.0184 |
| InfoGraph | PROTEINS | 0.7411 | 0.8214 | 0.7781 | 0.0230 |
| InfoGraph | PTC_MR   | 0.6000 | 0.7429 | 0.6557 | 0.0376 |
Our preliminary findings suggest that certain GCL models intended for graph classification tasks are also sensitive to hyper-parameters. For example, according to the results above, MVGRL(G) is generally less sensitive than InfoGraph. Therefore, there is a clear need for a similar evaluation framework tailored specifically for GCL methods designed for graph-level tasks. This presents an interesting avenue for future research, and we plan to explore this aspect further in our future work.
[1] InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization, ICLR 2020
[2] Contrastive Multi-View Representation Learning on Graphs, ICML 2020
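For clarity, the Min/Max/Ave/Std values reported in the tables are plain aggregates over the per-configuration results; a minimal sketch of this aggregation is shown below (the `accuracies` values are hypothetical and only illustrate the computation, and the use of a sample standard deviation is an assumption here):

```python
import numpy as np

# Hypothetical best-epoch test accuracies, one entry per hyper-parameter configuration.
accuracies = np.array([0.72, 0.74, 0.75, 0.73, 0.76])

summary = {
    "Min": accuracies.min(),
    "Max": accuracies.max(),
    "Ave": accuracies.mean(),
    "Std": accuracies.std(ddof=1),  # sample standard deviation across configurations (assumed)
}
print(summary)
```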
> 2.The paper could benefit from an ablation study or more detailed analysis on why certain models excel or struggle under the new protocol. For example, what properties of Barlow Twins contribute to its strong and consistent performance?
>(2). Can you share any insights or hypotheses on why models like Barlow Twins perform well while others like SFA show inconsistent results?
Thank you for your detailed feedback. We would like to clarify that the primary focus of our work is on proposing a new evaluation protocol rather than dissecting the critical components of each GCL method.
Regarding your questions about the performance differences among models, we can share some insights. We believe these differences may stem from two main factors: the number of hyper-parameters involved and the contrastive objective employed.
- As discussed in our paper, the Barlow Twins loss (BT) involves fewer hyper-parameters, namely, the drop edge rate and drop node rate for graph augmentations on two views. Conversely, SFA introduces two additional hyper-parameters: k, representing the number of iterations used in its spectral feature augmentation, and $\tau$, utilized in its adopted InfoNCE loss. The increased number of hyper-parameter combinations in SFA may contribute to its sensitivity to hyper-parameters, thereby resulting in inconsistent results.
- Furthermore, compared to the InfoNCE loss, BT does not require negative samples and instead focuses on enhancing positive-pair similarity and feature diversity (see the sketch below). This simpler strategy could be a contributing factor to its robust and consistent performance.
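For reference, below is a minimal sketch of the Barlow Twins objective as it is commonly implemented (the tensor shapes and the default weight `lambd` are illustrative assumptions, not our exact training code):

```python
import torch

def barlow_twins_loss(z1: torch.Tensor, z2: torch.Tensor, lambd: float = 5e-3) -> torch.Tensor:
    """z1, z2: [N, D] embeddings of two augmented views of the same nodes."""
    n = z1.shape[0]
    # Standardize each feature dimension across the batch.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    c = (z1.T @ z2) / n  # [D, D] cross-correlation matrix between the two views
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # align positive pairs
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate features
    return on_diag + lambd * off_diag  # no negative samples required
```

The only loss-specific weight is `lambd`, which is consistent with the smaller hyper-parameter space discussed above.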
In summary, while we acknowledge that further investigation into the performance dynamics of each method would be valuable, our primary emphasis remains on addressing the limitations of existing evaluation protocols. We appreciate your insights and suggestions and will consider them for future research endeavors.
>3.There is limited discussion on the practical tradeoffs (e.g. computational cost) of the proposed protocol and how it could be realistically adopted by the community. Tuning and evaluating models across many hyperparameter settings and tasks may be prohibitively expensive.
>(3).What are your thoughts on the feasibility of the community adopting this protocol given the computational expenses? Are there ways to make it more practically achievable?
Thank you for your informative suggestion. We appreciate your emphasis on the practical considerations, particularly regarding the computational cost associated with our proposed evaluation protocol. In the revised version of our paper, we will include information about the computational cost, specifically the execution time, to provide a more comprehensive understanding of the trade-offs involved.
To illustrate the computational cost of different methods, we conducted experiments to measure the pre-training time per epoch on the DBLP dataset. The results are summarized in the table below:
| Model | Time (s/epoch) |
| --------- | ----- |
| BGRL | 0.116 |
| CCA | 0.113 |
| COSTA | 0.968 |
| DGI | 0.038 |
| GBT | 0.076 |
| GRACE | 0.227 |
| MVGRL | 0.860 |
| SFA | 0.298 |
| SUGRL | 0.223 |
As shown in the table, some methods exhibit significantly higher pre-training times compared to others. For example, the pre-training time of COSTA is approximately 12 times longer than that of GBT. However, such trade-offs in computational cost are not currently considered in the existing evaluation protocol.
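For reference, per-epoch numbers like those above can be collected with straightforward wall-clock timing; a minimal sketch is shown below (the `pretrain_one_epoch` argument is a hypothetical stand-in for one pre-training epoch of a given method, and for GPU training one would additionally synchronize the device before reading the timer):

```python
import time

def average_epoch_time(pretrain_one_epoch, num_epochs: int = 100) -> float:
    """Average wall-clock pre-training time per epoch, in seconds."""
    start = time.perf_counter()
    for _ in range(num_epochs):
        pretrain_one_epoch()  # hypothetical: runs one pre-training epoch of the method
    return (time.perf_counter() - start) / num_epochs
```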
Moving forward, we acknowledge the importance of incorporating computational cost into our evaluation framework. To address this, we plan to explore strategies such as using a varying number of hyper-parameter sets per method rather than a fixed 20. By adapting the evaluation framework to the computational cost of different methods, we aim to make it more practical and feasible for adoption by the community.
# Reviewer 8gAS:4
## Weakness 1
> 1.The comprehensiveness of the dataset leaves room for improvement, as it lacks experiments on some commonly utilized datasets such as DBLP.
## Reply Skeleton 1
1. Thank the reviewer for the suggestions.
2. We will provide the experimental results on the DBLP dataset for the methods evaluated with node classification.
## Weakness 2
> 2.The study primarily focuses on node classification, overlooking graph classification, thus restricting its comprehensiveness.
## Reply Skeleton 2
1. Thanks for the suggestions
2. We will conduct experiments on some methods that can be evaluated with graph classification.
## Weakness 3
> 3.While the paper excels in model analysis, it appears to draw a scanty correlation with the industry, which could potentially limit its applicability.
## Reply Skeleton 3
1. Thanks for the suggestions
2. The proposed protocol is not directly tied to a specific industrial application, but it does provide more insights on evaluation.
3. Moreover, the benefits relate to the analysis of execution time, as we cannot invest unlimited resources to try every possible combination of hyper-parameters.
The questions correspond to the weaknesses.
# Reply to Reviewer 8gAS:4
Thank you for the feedback and for recognizing the strengths of our paper.
We respond to your comments below:
>1.The comprehensiveness of the dataset leaves room for improvement, as it lacks experiments on some commonly utilized datasets such as DBLP.
>It is suggested that the authors enrich their work by incorporating experiments on the NCI1 and DBLP datasets.
Thank you for your detailed feedback regarding the extension of our experiments. We have conducted experiments on the DBLP dataset, and the results are presented below. The "Min," "Max," "Ave," and "Std" columns report the minimum, maximum, average, and standard deviation over 20 sets of hyper-parameters, with the best epoch reported for each set. These results are generally consistent with the findings presented in our paper.
| Model | Min    | Max    | Ave    | Std    |
| ----- | ------ | ------ | ------ | ------ |
| BGRL  | 0.6678 | 0.7857 | 0.7492 | 0.0332 |
| CCA   | 0.7428 | 0.8404 | 0.8251 | 0.0226 |
| COSTA | 0.5482 | 0.8364 | 0.7649 | 0.0961 |
| DGI   | 0.7975 | 0.8370 | 0.8226 | 0.0126 |
| GBT   | 0.8178 | 0.8409 | 0.8279 | 0.0065 |
| GRACE | 0.7896 | 0.8387 | 0.8266 | 0.0133 |
| MVGRL | 0.8167 | 0.8432 | 0.8363 | 0.0066 |
| SFA   | 0.7609 | 0.8415 | 0.8158 | 0.0204 |
| SUGRL | 0.7986 | 0.8331 | 0.8182 | 0.0092 |
**We would like to note that the results on NCI1 and other graph classification datasets are not included here; they are provided in our response to the next question below.** We appreciate your suggestion to enrich our work with experiments on additional datasets, and we believe these additional experiments will further enhance the comprehensiveness of our evaluation.
>2.The study primarily focuses on node classification, overlooking graph classification, thus restricting its comprehensiveness.
>The authors are encouraged to extend their analysis to include graph-level tasks.
Thank you for your detailed suggestion. We have extended our experiments to datasets suitable for graph classification evaluation. Specifically, we conducted experiments with two GCL methods, InfoGraph and MVGRL(G), on the NCI1, PROTEINS, and PTC_MR datasets. The results, measured in accuracy, are provided below. In the table, the "Min," "Max," "Ave," and "Std" columns report the minimum, maximum, average, and standard deviation over 20 sets of hyper-parameters, with the best epoch reported for each set.
| Model     | Dataset  | Min    | Max    | Ave    | Std    |
| --------- | -------- | ------ | ------ | ------ | ------ |
| MVGRL(G)  | NCI1     | 0.7202 | 0.7591 | 0.7457 | 0.0090 |
| MVGRL(G)  | PROTEINS | 0.7768 | 0.7946 | 0.7866 | 0.0049 |
| MVGRL(G)  | PTC_MR   | 0.6000 | 0.7714 | 0.6957 | 0.0407 |
| InfoGraph | NCI1     | 0.7251 | 0.7908 | 0.7596 | 0.0184 |
| InfoGraph | PROTEINS | 0.7411 | 0.8214 | 0.7781 | 0.0230 |
| InfoGraph | PTC_MR   | 0.6000 | 0.7429 | 0.6557 | 0.0376 |
Our preliminary findings suggest that certain GCL models intended for graph classification tasks are also sensitive to hyper-parameters. For example, according to the results above, MVGRL(G) is generally less sensitive than InfoGraph. Therefore, there is a clear need for a similar evaluation framework tailored specifically to GCL methods designed for graph-level tasks. This presents an interesting avenue for future research, and we plan to explore it further in future work.
>3.While the paper excels in model analysis, it appears to draw a scanty correlation with the industry, which could potentially limit its applicability.
>The authors should elucidate the link between the findings of this paper and their implications for industry.
Thank you for your feedback. Here, we would like to explicitly elucidate the link between the findings of our work and their implications for industry.
In industrial settings, selecting a suitable pre-training model is a crucial step when applying GCL methods. In practical scenarios, where GCL methods are often used to address unseen tasks, extensively tuning models for specific scenarios is not feasible. Hence, there is a preference for GCL models that demonstrate less sensitivity to hyperparameters, ensuring robust performance across diverse tasks without extensive tuning. However, existing evaluation protocols often lack fairness and comprehensiveness, as highlighted in our paper.
Our proposed evaluation protocol addresses these challenges by offering a more comprehensive and fair comparison of various GCL methods. By identifying models less sensitive to hyperparameters and consistently performing well across multiple tasks, our protocol facilitates the industrial adoption of GCL methods. This streamlined model selection process empowers industry practitioners to confidently utilize GCL methods for diverse applications.
In summary, our proposed evaluation protocol provides valuable insights for the industrial applications of GCL methods by offering an improved benchmarking framework. By tackling challenges related to model sensitivity and limited task representation, we equip industry practitioners with robust tools for effectively leveraging GCL methods in real-world scenarios.
# Reviewer HV8V:2
## Weakness 1
> 1.The proposed evaluation protocol does not offer a robust method for assessing how GCL methods' performance varies with hyper-parameter changes across diverse tasks.
## Reply Skeleton 1
1. Carefully illustrate that our methods already take both hyper-parameters and diverse downstream tasks into consideration.
2. Also, our goal is not to conduct exhaustive hyper-parameter experiments, but to show that a method's performance can be strongly affected by its hyper-parameters.
## Weakness 2
> 2.The proposed evaluation protocol mainly concentrates on multi-class classification for evaluating GCL methods, neglecting the necessity to test these methods' performance consistency across a wider range of task types beyond classification.
## Reply Skeleton 2
1. Carefully illustrate that these methods have inherent restrictions that make them applicable only to node-level tasks.
2. We agree that more evaluation task categories could be beneficial, which is why we also propose multi-label node classification as a new evaluation task.
## Question 1
> Why does multi-label evaluation suffice? There are other tasks such as link prediction, graph classification, etc. Why do not include them in consideration?
## Reply Skeleton 1
1. We are arguing that only one downstream task may not be enough.
2. The currently evaluated methods are not designed for these tasks; we acknowledge that experiments on additional downstream tasks might be helpful, but that could be future work.
## Question 2
> Have you considered the sensitivity of hyper-parameters when switching between different tasks, which is a critical aspect of evaluating GCL methods?
## Reply Skeleton 2
1. We consider that it should be relatively easy for good methods to find proper hyper-parameters that achieve satisfying results; i.e., if a method is too sensitive, it becomes much harder and requires more time and effort to decide on the settings for training.
2. Explain why our scope is not to conduct hyper-parameter experiments that map the relationship between hyper-parameters and different tasks.
<!-- # Reply to Reviewer HV8V:2
Dear Reviewer,
Thank you for the feedback and recognizing the strengths of our paper.
We respond to your comments below:
>Weakness 1.The proposed evaluation protocol does not offer a robust method for assessing how GCL methods' performance varies with hyper-parameter changes across diverse tasks.
* Thank you for your feedback.
However, we would like to clarify that our goal is to propose a more fair,comprehensive and doable evaluation protocol but not to assess the GCL performance variance with hyper-parameter changes across diverse tasks.
The method's sensitivity to hyper-parameters and different preferences to hyper-parameters for different downstream tasks are our findings to support our motivation to propose a more fair,comprehensive and doable evaluation protocol as we can't invest unlimited computational budget to make the existing evaluation protocol satisfactorily fair or comprehensive.
>Weakness 2.The proposed evaluation protocol mainly concentrates on multi-class classification for evaluating GCL methods, neglecting the necessity to test these methods' performance consistency across a wider range of task types beyond classification.
* Thank you for the suggestion of extending experiments.
We would like to clarify that the datasets we used in this paper are mostly for node classification, which are not naturally applicable for graph classification for evaluation.
For link prediction, it would require us to remove part of the datasets and use them as test set.Therefore, we focused on the node classification task.
We agree that more evaluation task categories could be beneficial and that’s why we propose to also consider multi-label node classification as a new evaluation task.
>Question1.Why does multi-label evaluation suffice? There are other tasks such as link prediction, graph classification, etc. Why do not include them in consideration?
* Thank you for your insightful question.
We would like to clarity that it is not that native in a single dataset to conduct such different multiple downstream tasks, especially most datasets are created for a certain task.
* For example, for the datasets suitable for node classification, they contain many labeled nodes on the same graph to provide enough samples for conducting node classification, like Cora dataset.
* In contrast, to conduct graph classification, the dataset should contain multiple labeled graphs to provide enough samples, like NCI1 dataset.
* Therefore, a single dataset typically can not be applicable to multiple downstream tasks due to the inherent requirements of different tasks.
* We are arguing that only one downstream task is not enough. So we conduct the experiments evaluated by multi-label node classification in addition to multi-class node classification. -->
<!-- * We didn't include evaluation tasks like link prediction and graph classification due to the inherent limitation of datasets we applied for the same reason in our reply to weakness 2.We agree that more evaluation task categories could be beneficial and that’s why we propose to also consider multi-label node classification as a new evaluation task. -->
<!--
>Question2.Have you considered the sensitivity of hyper-parameters when switching between different tasks, which is a critical aspect of evaluating GCL methods?
* Thank you for your question. However, we would like to clarify that our scope is not to conduct hyper-parameter experiments to find this relationship between hyper-parameters and different tasks. Our findings in the paper that the GCL methods have different sensitivities to hyper-parameters and different preferences to hyper-parameters for different tasks motivated us to propose a more fair,comprehensive and doable evaluation protocol to address the issues in the existing evaluation protocol.
* We argue that the hyper-parameter sensiticity and the performance on different tasks should be considered as a whole when evaluating a method.Intuitivly, we consider that it should be relatively easier for relatively better methods to find proper hyper-parameters to achieve satisfying results on different tasks. i.e., if the methods are too sensitive, it will be much harder and require more time and efforts to decide the hyper-parameters for training and this kind of tremendous computational cost should be considered especially for benchmarking different methods to choose the propoer one to be deployed in industry. -->
# Revised Reply to Reviewer HV8V:2
Dear Reviewer,
Thank you for the feedback and for recognizing the strengths of our paper.
We respond to your comments below:
>Weakness 1.The proposed evaluation protocol does not offer a robust method for assessing how GCL methods' performance varies with hyper-parameter changes across diverse tasks.
>Question2.Have you considered the sensitivity of hyper-parameters when switching between different tasks, which is a critical aspect of evaluating GCL methods?
Thank you for your feedback. In our paper, we have indeed included the sensitivity of hyper-parameters across different tasks in our evaluation considerations. In Section 5.2 of our paper, specifically for the multi-label classification tasks, we conducted experiments with 20 sets of randomly selected hyper-parameters for pre-training GCL encoders. Then, we reported the average performance with standard deviation, as depicted in Figures 5, 6, and 7. This approach allows for a fair evaluation of GCL models across multiple tasks, as models that are sensitive to hyper-parameters will exhibit larger variances in the F1 scores.
Regarding the question `Have you considered the sensitivity of hyper-parameters when switching between different tasks, which is a critical aspect of evaluating GCL methods?`, we would like to emphasize that we do not aim to select a specific set of hyper-parameters of the GCL model for each task (this is impractical, as we discussed in our paper). Instead, we pre-trained 20 GCL models with different sets of hyper-parameters and evaluated them across all tasks. After the pre-training stage, the GCL encoder remained fixed, and only the task head was tuned using the corresponding validation dataset. We deliberately designed our evaluation protocol in this manner to provide a comprehensive assessment of the pre-trained models.
We argue that both hyper-parameter sensitivity and performance across different tasks should be considered together when evaluating a GCL method. Essentially, a model with low sensitivity (lower variance) and high performance (large F1 scores) is deemed favorable.
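To make the protocol concrete, the following minimal sketch outlines the evaluation loop described above. It is illustrative only: `pretrain_encoder` is a hypothetical stand-in for label-free GCL pre-training (stubbed with random features so the sketch runs end to end), and the validation-based tuning of the task head is omitted for brevity.

```python
# Minimal sketch of the evaluation protocol (illustrative; `pretrain_encoder` is a
# hypothetical stand-in for GCL pre-training, stubbed with random features here).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_nodes, n_classes, dim = 1000, 5, 64
y = rng.integers(0, n_classes, size=n_nodes)
perm = rng.permutation(n_nodes)
train_idx, test_idx = perm[:800], perm[800:]

def pretrain_encoder(hparams: dict) -> np.ndarray:
    """Stand-in for label-free GCL pre-training with the sampled hyper-parameters."""
    seed = hash(tuple(sorted(hparams.items()))) % 2**32
    return np.random.default_rng(seed).normal(size=(n_nodes, dim))

scores = []
for _ in range(20):                                   # 20 randomly sampled hyper-parameter sets
    hparams = {"drop_edge": rng.uniform(0, 0.5), "drop_feat": rng.uniform(0, 0.5)}
    Z = pretrain_encoder(hparams)                     # encoder is frozen after pre-training
    head = LogisticRegression(max_iter=500).fit(Z[train_idx], y[train_idx])  # only the task head is fitted
    scores.append(f1_score(y[test_idx], head.predict(Z[test_idx]), average="micro"))

print(f"Micro-F1 over 20 hyper-parameter sets: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
```

The key point the sketch illustrates is that the encoder never sees task labels and is frozen before the task head is fitted, so the reported mean and standard deviation directly reflect hyper-parameter sensitivity.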
We appreciate your feedback, and we will revise our paper to clarify our experimental settings accordingly.
<!--
In fact, we did observe that the investigated methods have different preferences for hyper-parameters across different tasks.
Following the settings in our paper to construct 121 tasks, we found that for every method we investigated, different sets of hyper-parameters among the 20 were selected as optimal for different tasks. For instance, for CCA, 9 out of 20 sets were selected as optimal for at least one of the 121 tasks, while for DGI and SFA the numbers were 12 and 11 out of 20. GBT and COSTA had fewer sets selected (5 and 6 out of 20), but we can still conclude that the methods prefer different sets of hyper-parameters across different tasks.
* However, we would like to clarify that our goal is to propose a fairer, more comprehensive, and practical evaluation protocol, not to conduct hyper-parameter experiments to map this relationship between hyper-parameters and different tasks.
Our findings that GCL methods have different sensitivities and preferences to hyper-parameters across tasks motivated us to propose such a protocol to address the issues in the existing evaluation protocol.
Our motivation is practical: we cannot invest an unlimited computational budget in selecting hyper-parameters to make the existing evaluation protocol satisfactorily fair or comprehensive.
* Moreover, we argue that hyper-parameter sensitivity and performance on different tasks should be considered as a whole when evaluating a method (i.e., both are critical in our proposed evaluation protocol). Intuitively, it should be relatively easy for a better method to find proper hyper-parameters that achieve satisfying results on different tasks; if a method is too sensitive, deciding its hyper-parameters requires far more time and effort, and this computational cost should be taken into account, especially when benchmarking methods to choose one for industrial deployment. -->
>Weakness 2.The proposed evaluation protocol mainly concentrates on multi-class classification for evaluating GCL methods, neglecting the necessity to test these methods' performance consistency across a wider range of task types beyond classification.
>Question1.Why does multi-label evaluation suffice? There are other tasks such as link prediction, graph classification, etc. Why do not include them in consideration?
Thank you for your insightful question and the suggestion to extend our experiments.
We would like to clarify that incorporating multiple downstream tasks, including node classification, link prediction, and graph classification, simultaneously with a specific pre-trained model on a specific dataset is often not feasible due to limitations in both the model structure and the datasets themselves. From a dataset perspective, node classification datasets such as Cora and Coauthor typically consist of a single graph with labeled nodes, making it impractical to conduct graph classification tasks on such datasets. Additionally, most GCL models designed for node classification are not naturally suited for graph classification tasks, as they are primarily intended for learning node representations. Furthermore, performing link prediction using pre-trained models poses additional challenges, as it requires removing a portion of the edges and using them as a test set. This necessitates pre-training another model on the partial dataset with removed edges.
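As an illustration of the last point, the short sketch below (a generic link-prediction split, not code from our paper) shows why a separate pre-training run would be needed: the test edges must be hidden from the graph before the encoder is ever pre-trained.

```python
# Generic link-prediction split (illustrative; not our paper's code): a fraction of
# edges is held out as positive test links, so the encoder may only be pre-trained
# on the remaining edges.
import numpy as np

rng = np.random.default_rng(0)
edges = rng.integers(0, 100, size=(500, 2))   # toy (src, dst) edge list

perm = rng.permutation(len(edges))
n_test = int(0.1 * len(edges))                # hold out 10% of edges for evaluation
test_pos_edges = edges[perm[:n_test]]         # positive test links (hidden during pre-training)
train_edges = edges[perm[n_test:]]            # the only edges the encoder is allowed to see

# An encoder pre-trained for node classification used the full edge set, so it has
# already "seen" test_pos_edges; a fair link-prediction evaluation therefore requires
# re-pre-training on train_edges only.
print(len(train_edges), len(test_pos_edges))
```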
Given these constraints, we opted to focus on multi-label classification as it allows us to simulate scenarios involving multiple downstream tasks. We acknowledge that including a wider range of evaluation tasks could provide a more comprehensive assessment of pre-trained models. In future work, we intend to explore datasets that support different categories of tasks to broaden the scope of our evaluation.
Thank you for raising this point, and we appreciate your feedback. We will consider additional evaluation tasks in our future research endeavors to enhance the comprehensiveness of our evaluation protocol.
<!--
The limitations mainly lie in two perspectives: the model structure itself and the datasets constructed for each task.
For example, datasets suitable for node classification, such as Cora, contain many labeled nodes on the same graph so that there are enough samples for node classification; with such samples, the model can learn node embeddings. In contrast, graph classification requires a dataset containing many labeled graphs, such as NCI1, so that the model can learn graph embeddings. Consequently, evaluating a model pre-trained on a node classification dataset (with scarce or no graph labels) on graph classification is not natural: the pre-trained model has no information about graph labels and cannot readily produce graph-level embeddings.
For link prediction, it would require us to remove part of the edges and use them as a test set, but we would not know this 'ground truth' split at the pre-training stage. -->
# Response 1 for Reviewer HV8V
Dear Reviewer,
Thank you for the feedback and for recognizing the strengths of our paper.
We respond to your comments below:
>Weakness 1.The proposed evaluation protocol does not offer a robust method for assessing how GCL methods' performance varies with hyper-parameter changes across diverse tasks.
>Question2.Have you considered the sensitivity of hyper-parameters when switching between different tasks, which is a critical aspect of evaluating GCL methods?
Thank you for your feedback. In our paper, we have indeed included the sensitivity of hyper-parameters across different tasks in our evaluation considerations. In Section 5.2 of our paper, specifically for the multi-label classification tasks, we conducted experiments with 20 sets of randomly selected hyper-parameters for pre-training GCL encoders. Then, we reported the average performance with standard deviation, as depicted in Figures 5, 6, and 7. This approach allows for a fair evaluation of GCL models across multiple tasks, as models that are sensitive to hyper-parameters will exhibit larger variances in the F1 scores.
Regarding the question `Have you considered the sensitivity of hyper-parameters when switching between different tasks, which is a critical aspect of evaluating GCL methods?`, we would like to emphasize that we do not aim to select a specific set of hyper-parameters of the GCL model for each task (this is impractical, as we discussed in our paper). Instead, we pre-trained 20 GCL models with different sets of hyper-parameters and evaluated them across all tasks. After the pre-training stage, the GCL encoder remained fixed, and only the task head was tuned using the corresponding validation dataset. We deliberately designed our evaluation protocol in this manner to provide a comprehensive assessment of the pre-trained models.
We argue that both hyper-parameter sensitivity and performance across different tasks should be considered together when evaluating a GCL method. Essentially, a model with low sensitivity (lower variance) and high performance (large F1 scores) is deemed favorable.
We appreciate your feedback, and we will revise our paper to clarify our experimental settings accordingly.
# Response 2 for Reviewer HV8V
>Weakness 2.The proposed evaluation protocol mainly concentrates on multi-class classification for evaluating GCL methods, neglecting the necessity to test these methods' performance consistency across a wider range of task types beyond classification.
>Question1.Why does multi-label evaluation suffice? There are other tasks such as link prediction, graph classification, etc. Why do not include them in consideration?
Thank you for your insightful question and the suggestion to extend our experiments.
We would like to clarify that incorporating multiple downstream tasks, including node classification, link prediction, and graph classification, simultaneously with a specific pre-trained model on a specific dataset is often not feasible due to limitations in both the model structure and the datasets themselves. From a dataset perspective, node classification datasets such as Cora and Coauthor typically consist of a single graph with labeled nodes, making it impractical to conduct graph classification tasks on such datasets. Additionally, most GCL models designed for node classification are not naturally suited for graph classification tasks, as they are primarily intended for learning node representations. Furthermore, performing link prediction using pre-trained models poses additional challenges, as it requires removing a portion of the edges and using them as a test set. This necessitates pre-training another model on the partial dataset with removed edges.
Given these constraints, we opted to focus on multi-label classification as it allows us to simulate scenarios involving multiple downstream tasks. We acknowledge that including a wider range of evaluation tasks could provide a more comprehensive assessment of pre-trained models. In future work, we intend to explore datasets that support different categories of tasks to broaden the scope of our evaluation.
Thank you for raising this point, and we appreciate your feedback. We will consider additional evaluation tasks in our future research endeavors to enhance the comprehensiveness of our evaluation protocol.
# Response 1 for Reviewer JRMC
Dear Reviewer,
Thank you for your feedback and for recognizing the strengths of our paper.
We respond to your comments below:
> 1.The tasks are still homogeneous (all node classification). Evaluation on more diverse task types (e.g., link prediction, graph classification, etc.) would further strengthen the work.
> (1). How might the protocol be further extended to handle an even greater diversity of tasks (e.g., link prediction, graph classification, etc.)? What would be the challenges?
Thank you for the suggestion to extend our experiments.
We would like to clarify that incorporating multiple downstream tasks, including node classification, link prediction, and graph classification, simultaneously with a specific pre-trained model on a specific dataset is often not feasible due to limitations in both the model structure and the datasets themselves. From a dataset perspective, node classification datasets such as Cora and Coauthor typically consist of a single graph with labeled nodes, making it impractical to conduct graph classification tasks on such datasets. Additionally, most GCL models designed for node classification are not naturally suited for graph classification tasks, as they are primarily intended for learning node representations. Furthermore, performing link prediction using pre-trained models poses additional challenges, as it requires removing a portion of the edges and using them as a test set. This necessitates pre-training another model on the partial dataset with removed edges.
Given these constraints, we opted to focus on multi-label classification as it allows us to simulate scenarios involving multiple downstream tasks. We acknowledge that including a wider range of evaluation tasks could provide a more comprehensive assessment of pre-trained models. We intend to explore datasets that support different categories of tasks to broaden the scope of our evaluation.
# Response 2 for Reviewer JRMC
Moreover, we have extended our experiments to include datasets suitable for graph classification evaluation. Specifically, we conducted experiments using two GCL methods, InfoGraph[1] and MVGRL(G)[2], on datasets such as NCI1, PROTEINS, and PTC_MR. The results, measured in terms of accuracy, are provided below. In the table, the "Min," "Max," "Ave," and "Std" columns represent the minimum, maximum, average, and standard deviation values obtained from 20 sets of hyper-parameters, where the result at the best epoch is reported for each set.
| Model | Dataset | Min | Max | Ave | Std |
| --------- | -------- | ----------- | ----------- | ------------ | -------------- |
| MVGRL(G) | NCI1 | 0.720194647 | 0.759124088 | 0.7457420923 | 0.00898320415 |
| MVGRL(G) | PROTEINS | 0.776785714 | 0.794642857 | 0.786607143 | 0.004933091504 |
| MVGRL(G) | PTC_MR | 0.6 | 0.771428571 | 0.6957142857 | 0.04069746377 |
| InfoGraph | NCI1 | 0.725060827 | 0.790754258 | 0.7596107057 | 0.018405012 |
| InfoGraph | PROTEINS | 0.741071429 | 0.821428571 | 0.7781250001 | 0.02304285311 |
| InfoGraph | PTC_MR | 0.6 | 0.742857143 | 0.6557142859 | 0.03762555068 |
Our preliminary findings suggest that certain GCL models intended for graph classification tasks are also sensitive to hyper-parameters. For example, according to the results above, MVGRL(G) is generally less sensitive than InfoGraph. Therefore, there is a clear need for a similar evaluation framework tailored specifically for GCL methods designed for graph-level tasks. This presents an interesting avenue for future research, and we plan to explore this aspect further in our future work.
[1] InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. ICLR'20.
[2] Contrastive Multi-View Representation Learning on Graphs. ICML'20.
# Response 3 for Reviewer JRMC
> 2.The paper could benefit from an ablation study or more detailed analysis on why certain models excel or struggle under the new protocol. For example, what properties of Barlow Twins contribute to its strong and consistent performance?
>(2). Can you share any insights or hypotheses on why models like Barlow Twins perform well while others like SFA show inconsistent results?
Thank you for your detailed feedback. We would like to clarify that the primary focus of our work is on proposing a new evaluation protocol rather than dissecting the critical components of each GCL method.
Regarding your questions about the performance differences among models, we have some understanding to share. We believe that these differences may stem from two main factors: the number of hyper-parameters involved and the contrastive objective employed.
- As discussed in our paper, the Barlow Twins loss (BT) involves fewer hyper-parameters, namely, the drop edge rate and drop node rate for graph augmentations on two views. Conversely, SFA introduces two additional hyper-parameters: k, representing the number of iterations used in its spectral feature augmentation, and $\tau$, utilized in its adopted InfoNCE loss. The increased number of hyper-parameter combinations in SFA may contribute to its sensitivity to hyper-parameters, thereby resulting in inconsistent results.
- Furthermore, compared to the InfoNCE loss, BT does not require negative samples and instead focuses on enhancing positive-pair similarity and feature diversity (see the illustrative sketch below). This simplicity in strategy could be a contributing factor to its robust and consistent performance.
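For reference, here is a minimal, self-contained sketch of a Barlow-Twins-style objective (a generic formulation, not the exact implementation evaluated in our paper). It makes the two points above visible in code: there are no negative samples and only a single loss weight `lambd`, in contrast to InfoNCE, which needs a temperature $\tau$ and negatives.

```python
# Generic Barlow-Twins-style objective (illustrative, not the paper's exact code).
import torch

def barlow_twins_loss(z1: torch.Tensor, z2: torch.Tensor, lambd: float = 5e-3) -> torch.Tensor:
    n, _ = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)        # standardize each embedding dimension
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    c = (z1.T @ z2) / n                                # cross-correlation matrix between the two views
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()     # pull matched (positive) pairs together
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate the remaining feature pairs
    return on_diag + lambd * off_diag                  # no negative samples, no temperature

# usage: z1 and z2 are node embeddings of the same graph under two augmented views
loss = barlow_twins_loss(torch.randn(256, 128), torch.randn(256, 128))
```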
In summary, while we acknowledge that further investigation into the performance dynamics of each method would be valuable, our primary emphasis remains on addressing the limitations of existing evaluation protocols. We appreciate your insights and suggestions and will consider them for future research endeavors.
# Response 4 for Reviewer JRMC
>3.There is limited discussion on the practical tradeoffs (e.g. computational cost) of the proposed protocol and how it could be realistically adopted by the community. Tuning and evaluating models across many hyperparameter settings and tasks may be prohibitively expensive.
>(3).What are your thoughts on the feasibility of the community adopting this protocol given the computational expenses? Are there ways to make it more practically achievable?
Thank you for your informative suggestion. We appreciate your emphasis on the practical considerations, particularly regarding the computational cost associated with our proposed evaluation protocol. In the revised version of our paper, we will include information about the computational cost, specifically the execution time, to provide a more comprehensive understanding of the trade-offs involved.
To illustrate the computational cost of different methods, we conducted experiments to measure the pre-training time per epoch on the DBLP dataset. The results are summarized in the table below:
| Model | Time (s/epoch) |
| --------- | ----- |
| BGRL | 0.116 |
| CCA | 0.113 |
| COSTA | 0.968 |
| DGI | 0.038 |
| GBT | 0.076 |
| GRACE | 0.227 |
| MVGRL | 0.860 |
| SFA | 0.298 |
| SUGRL | 0.223 |
As shown in the table, some methods exhibit significantly higher pre-training times compared to others. For example, the pre-training time of COSTA is approximately 12 times longer than that of GBT. However, such trade-offs in computational cost are not currently considered in the existing evaluation protocol.
Moving forward, we acknowledge the importance of incorporating computational cost considerations into our evaluation framework. To address this, we plan to explore strategies such as using varying numbers of sets of hyper-parameters rather than a consistent 20. By adapting the evaluation framework to account for the computational cost of different methods, we aim to make it more practical and feasible for adoption by the community.
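Per-epoch pre-training time can be measured with simple wall-clock timing, as in the sketch below (illustrative only; the stand-in epoch function is a placeholder for an actual GCL pre-training step).

```python
# Illustrative wall-clock timing of pre-training epochs (stand-in epoch function;
# replace it with an actual GCL pre-training step).
import time

def seconds_per_epoch(train_one_epoch, num_epochs: int = 10) -> float:
    start = time.perf_counter()
    for _ in range(num_epochs):
        train_one_epoch()
    return (time.perf_counter() - start) / num_epochs

if __name__ == "__main__":
    dummy_epoch = lambda: sum(i * i for i in range(100_000))  # placeholder workload
    print(f"{seconds_per_epoch(dummy_epoch):.3f} s/epoch")
```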
# Response 1 for Reviewer 8gAS
Thank you for the feedback and for recognizing the strengths of our paper.
We respond to your comments below:
>1.The comprehensiveness of the dataset leaves room for improvement, as it lacks experiments on some commonly utilized datasets such as DBLP.
>It is suggested that the authors enrich their work by incorporating experiments on the NCI1 and DBLP datasets.
Thank you for your detailed feedback regarding the extension of our experiments. We have conducted experiments on the DBLP dataset, and the results are presented below. The "Min," "Max," "Ave," and "Std" columns represent the minimum, maximum, average, and standard deviation values obtained from experiments conducted with 20 sets of hyper-parameters, where the result at the best epoch is reported for each set. We observed that these results are generally consistent with the findings presented in our paper.
| Model | Min | Max | Ave | Std |
| ----- | ------ | ------ | ------ | -------------- |
| BGRL | 0.6678 | 0.7857 | 0.7492 | 0.0332093927 |
| CCA | 0.7428 | 0.8404 | 0.8251 | 0.02264025111 |
| COSTA | 0.5482 | 0.8364 | 0.7649 | 0.09611825843 |
| DGI | 0.7975 | 0.8370 | 0.8226 | 0.01258848951 |
| GBT | 0.8178 | 0.8409 | 0.8279 | 0.006522476488 |
| GRACE | 0.7896 | 0.8387 | 0.8266 | 0.01334265861 |
| MVGRL | 0.8167 | 0.8432 | 0.8363 | 0.006628464498 |
| SFA | 0.7609 | 0.8415 | 0.8158 | 0.0203650853 |
| SUGRL | 0.7986 | 0.8331 | 0.8182 | 0.009150276264 |
**We would like to note that while the results on the NCI1 dataset and other graph classification datasets are not included in this response, we will provide them in the response to the subsequent questions.** We appreciate your suggestion to enrich our work by incorporating experiments on additional datasets, and we believe that these additional experiments will further enhance the comprehensiveness of our evaluation.
# Response 2 for Reviewer 8gAS
>2.The study primarily focuses on node classification, overlooking graph classification, thus restricting its comprehensiveness.
>The authors are encouraged to extend their analysis to include graph-level tasks.
Thank you for your detailed suggestion. We have extended our experiments to include datasets suitable for graph classification evaluation. Specifically, we conducted experiments using two GCL methods, InfoGraph and MVGRL(G), on datasets such as NCI1, PROTEINS, and PTC_MR. The results, measured in terms of accuracy, are provided below. In the table, the "Min," "Max," "Ave," and "Std" columns represent the minimum, maximum, average, and standard deviation values obtained from 20 sets of hyper-parameters, where the result at the best epoch is reported for each set.
| Model | Dataset | Min | Max | Ave | Std |
| --------- | -------- | ----------- | ----------- | ------------ | -------------- |
| MVGRL(G) | NCI1 | 0.720194647 | 0.759124088 | 0.7457420923 | 0.00898320415 |
| MVGRL(G) | PROTEINS | 0.776785714 | 0.794642857 | 0.786607143 | 0.004933091504 |
| MVGRL(G) | PTC_MR | 0.6 | 0.771428571 | 0.6957142857 | 0.04069746377 |
| InfoGraph | NCI1 | 0.725060827 | 0.790754258 | 0.7596107057 | 0.018405012 |
| InfoGraph | PROTEINS | 0.741071429 | 0.821428571 | 0.7781250001 | 0.02304285311 |
| InfoGraph | PTC_MR | 0.6 | 0.742857143 | 0.6557142859 | 0.03762555068 |
Our preliminary findings suggest that certain GCL models intended for graph classification tasks are also sensitive to hyper-parameters. For example, according to the results above, MVGRL(G) is generally less sensitive than InfoGraph. Therefore, there is a clear need for a similar evaluation framework tailored specifically for GCL methods designed for graph-level tasks. This presents an interesting avenue for future research, and we plan to explore this aspect further in our future work.
# Response 3 for Reviewer 8gAS
>3.While the paper excels in model analysis, it appears to draw a scanty correlation with the industry, which could potentially limit its applicability.
>The authors should elucidate the link between the findings of this paper and their implications for industry.
Thank you for your feedback. Here, we would like to explicitly elucidate the link between the findings of our work and their implications for industry.
In industrial settings, selecting a suitable pre-training model is a crucial step when applying GCL methods. In practical scenarios, where GCL methods are often used to address unseen tasks, extensively tuning models for specific scenarios is not feasible. Hence, there is a preference for GCL models that demonstrate less sensitivity to hyperparameters, ensuring robust performance across diverse tasks without extensive tuning. However, existing evaluation protocols often lack fairness and comprehensiveness, as highlighted in our paper.
Our proposed evaluation protocol addresses these challenges by offering a more comprehensive and fair comparison of various GCL methods. By identifying models less sensitive to hyperparameters and consistently performing well across multiple tasks, our protocol facilitates the industrial adoption of GCL methods. This streamlined model selection process empowers industry practitioners to confidently utilize GCL methods for diverse applications.
In summary, our proposed evaluation protocol provides valuable insights for the industrial applications of GCL methods by offering an improved benchmarking framework. By tackling challenges related to model sensitivity and limited task representation, we equip industry practitioners with robust tools for effectively leveraging GCL methods in real-world scenarios.