We sincerely appreciate the reviewers' insightful feedback! We believe most of the comments can be addressed via clarification and some additional analysis. Please find our responses below.
**R1Q1: Why Is Ranking Necessary? Can't the Model Rank Candidates?**
1. Firstly, ranking prioritizes higher-quality comment candidates, enabling developers to quickly identify and select the most appropriate candidate.
Secondly, the “OriSimRerank” ranking method helps mitigate the issue of overly detailed comments: while LLMs often generate comments containing additional details such as implementation specifics, our target is a concise description of method functionality, not internal logic or implementation.
Thirdly, as shown in Tables 1 and 2, the “OriSimRerank” method further improves performance (a hypothetical sketch of this similarity-based reranking step is given after this answer).
2. LLMs can rank candidate comments; however, according to our findings ([@Link](https://docs.google.com/spreadsheets/d/e/2PACX-1vRr3Gc3jVmqfUHlV9RcpEsV9rl557Fpabkeaa5FLU7TCmMjPeENUmb_XL6DVVAV2uEBeNjl1dcYhSNM/pubhtml)), employing a prompt-based pairwise ranking method with CodeLlama may lead to a slight degradation in average Accuracy compared with LLMCup without reranking, a decrease of 0.1% (from 25.4% to 25.3%).
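For concreteness, below is a minimal, hypothetical sketch of a similarity-based reranking step in the spirit of “OriSimRerank”; the sub-tokenization and the use of Jaccard similarity are illustrative assumptions, not the exact metric used in the paper.

```python
import re

def sub_tokens(text: str) -> set:
    # Illustrative sub-tokenization: split on camelCase, punctuation, and digits, then lowercase.
    return {t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", text)}

def ori_sim_rerank(original_comment: str, candidates: list) -> list:
    # Rank candidates by Jaccard similarity of sub-tokens with the original comment,
    # so candidates that stay close to the original, concise description surface first.
    orig = sub_tokens(original_comment)
    def score(candidate: str) -> float:
        cand = sub_tokens(candidate)
        return len(orig & cand) / len(orig | cand) if (orig | cand) else 0.0
    return sorted(candidates, key=score, reverse=True)

ranked = ori_sim_rerank(
    "Returns the display name of the user.",
    ["Returns the user's display name.",
     "Returns the display name by reading the cached profile and falling back to the database."],
)
print(ranked[0])  # the more concise candidate closest to the original comment
```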
**R1Q2: How Was the Prompt Designed? Were Simpler Prompts Explored?**
1. The prompt in Fig. 2 is adapted from the approach in [16]. **However, our focus is on comment updating, which inherently involves additional inputs.** These inputs include ②, ④, ⑥, and ⑧, while the instructions include ①, ③, ⑤, ⑦, and ⑨, which are essential for structuring the inputs and helping LLMs comprehend the task.
2. We initially experimented with simpler prompt designs but abandoned them due to poor performance. These simpler prompts suffered from issues such as an unclear definition of the model’s role, insufficient focus on code modifications, and weak enforcement of the desired output format. Consequently, the generated comments often failed to reflect changes to function signatures, variable names, and class names.
**R1C1: Suggestions in Comments**
We will revise the paper accordingly, following the detailed plan outlined in [@DocLink](https://docs.google.com/document/d/e/2PACX-1vT1mrlZ4hOZKqWWoY2rgWcWs-N714XT3nKyfV8RrgxffJm7n1j0_vMqrzG4siz7NY6JSAKIeJ0zJpWy/pub).
**R2Q1: What Are Alternative Reranking Approaches?**
There are three types of alternative reranking approaches (a minimal sketch contrasting their scoring interfaces follows this list):
1. Pointwise Reranking: Methods like \[71](Multiple-Classification-Based) treat the ranking problem as a regression or classification task, independently scoring each candidate’s relevance.
2. Pairwise Reranking: Approaches such as \[72](LLM-Based), RankNet\[73](Neural-Network-Based), RankBoost\[74](Boosting-Based), and RankingSVM\[75](SVM-Based) frame the ranking problem as a binary classification task to determine the relative relevance between candidate pairs.
3. Listwise Reranking: Techniques like \[76](LLM-Based), ListNet\[77](Neural-Network-Based), and \[78](Random-Forest-Based) model the ranking problem as an optimization task over the entire candidate list, aiming to improve overall ranking quality globally.
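For concreteness, the sketch below contrasts the three scoring interfaces; the scoring and preference functions are placeholders (illustrative assumptions), not re-implementations of the cited methods.

```python
import functools
from typing import Callable, List

def pointwise_rank(candidates: List[str], score: Callable[[str], float]) -> List[str]:
    # Each candidate is scored independently (regression/classification style).
    return sorted(candidates, key=score, reverse=True)

def pairwise_rank(candidates: List[str], prefer: Callable[[str, str], bool]) -> List[str]:
    # prefer(a, b) returns True if a should rank above b
    # (binary classification over candidate pairs, as in RankNet/RankingSVM).
    return sorted(candidates, key=functools.cmp_to_key(lambda a, b: -1 if prefer(a, b) else 1))

def listwise_rank(candidates: List[str], score_list: Callable[[List[str]], List[float]]) -> List[str]:
    # The scoring function sees the whole candidate list at once and returns one
    # score per candidate, optimizing the overall ranking globally (ListNet style).
    scores = score_list(candidates)
    return [c for _, c in sorted(zip(scores, candidates), reverse=True)]
```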
**R2C1: Does Minimizing Changes to Original Comments Improve Results? Can Strict Application Lead to Bias and Lower Quality?**
From a software development perspective, developers are more likely to accept updated comments that build on the original human-written comments. We conducted a small-scale experiment in which participants received updated comments that were semantically similar or identical: some were minimally altered from the original, while others underwent substantial modifications. The results revealed that users preferred comments with smaller changes. Additionally, our analysis of the dataset, comprising 1,496 projects and 98,622 data points, shows that on average 84.05% of the original comment’s sub-tokens are reused in the updated comment. However, we acknowledge the potential issues raised and will discuss them in the revised version.
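For reference, a minimal sketch of how such a sub-token reuse ratio can be computed is given below; the exact sub-tokenization behind the 84.05% figure may differ from this illustrative camelCase/punctuation splitting.

```python
import re

def sub_tokens(comment: str) -> set:
    # Illustrative sub-tokenization: split on camelCase, punctuation, and digits, then lowercase.
    return {t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", comment)}

def reuse_ratio(original: str, updated: str) -> float:
    # Fraction of the original comment's distinct sub-tokens that reappear in the updated comment.
    orig, upd = sub_tokens(original), sub_tokens(updated)
    return len(orig & upd) / len(orig) if orig else 0.0

print(reuse_ratio("Gets the user id.", "Returns the user identifier."))  # 0.5
```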
**R2C2: Why Were These LLMs Chosen?**
a) Choice of Model Types:
To ensure comprehensive research findings, we aimed to include a diverse set of models, spanning:
- General-purpose models (e.g., GPT-4o, Llama3, Mistral)
- Code-specific models (e.g., CodeLlama, DeepSeek-Coder-v2)
- Proprietary commercial models (e.g., GPT-4o)
- Open-source models (e.g., Llama3, Mistral, CodeLlama, DeepSeek-Coder-v2).
b) Model Advancement and Representativeness:
- GPT-4o is among the latest and most powerful models in the GPT family.
- Llama3 and CodeLlama represent the newest and most advanced iterations of the Llama series.
- Gemma and DeepSeek-Coder-v2 are prominent representatives of other popular LLM frameworks.
c) Choice of Parameter Scale:
To leverage the improved performance associated with larger LLMs, we prioritized models with greater parameter scales (e.g., 7B, 8B, and 16B) within the constraints of our available technical resources.
**R2C3: Was the Prompt Designed Through Trial-and-Error? How Were the Two Rules Derived?**
The prompt design underwent a trial-and-error process; see R1Q2 for details. Without the first rule, generated comments often miss updates to function signatures, variable names, and class names in the code. Without the second rule, LLMs tend to retain typos from the original comment instead of correcting them in the updated version.
**R2C4: Why Were the Three Human Evaluation Perspectives Chosen?**
We chose them for two reasons: (1) They were inspired by the human evaluation approach used in a similar study [79]; (2) based on our observations, the generated results may have issues such as inconsistency with the updated code (e.g., copying the original or missing key changes), unnatural expression (e.g., incomplete sentences, repetition, and grammar errors), and failure to convey the code’s intent, making it harder for developers to quickly understand the functionality.
**R3Q1: Why Recall@3 and Recall@5? How Were They Determined?**
1. Accuracy is equivalent to Recall@1, as their calculations coincide (see the sketch after this list).
2. Due to budget constraints for GPT-4o, we generated only 5 candidate comments per data point. Therefore, **the maximum value of k is 5**.
3. To balance brevity with comprehensiveness of the results, we reported Recall@k for k = 1, 3, and 5.
4. Recall@2 and Recall@4 results are provided in [@TableLink](https://docs.google.com/spreadsheets/d/e/2PACX-1vRavyz5F5zVoh5RIDSUiIdhN-VFRNxpiFooOXytAnAdrI2ckmT4WHRF4DRhwf5S6JLMTiYC6Wt0Ienn/pubhtml).
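A minimal sketch of the Recall@k computation referred to above, under the assumption that a hit means the ground-truth comment exactly matches one of the top-k ranked candidates (the paper’s matching criterion may apply additional normalization):

```python
from typing import List

def recall_at_k(ranked_candidates: List[List[str]], ground_truths: List[str], k: int) -> float:
    # Fraction of data points whose ground-truth comment appears among the top-k ranked candidates.
    hits = sum(gt in cands[:k] for cands, gt in zip(ranked_candidates, ground_truths))
    return hits / len(ground_truths)

# Accuracy coincides with Recall@1: the top-ranked candidate must match the ground truth.
accuracy = lambda ranked, truths: recall_at_k(ranked, truths, k=1)
```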
**R3Q2: Why Were Specific Temperature Settings Chosen (0.2 for Most Models, 0.1 for Codellama:7b), and How Do They Impact Result Consistency?**
Based on experimental results, the temperature setting shown in Fig. 5(a) is optimal. Relevant factors are discussed in lines 683–704, and we will provide additional details as needed. As shown in Fig. 4, increasing the temperature generally reduces LLM accuracy while improving Recall@5. This indicates that higher temperatures decrease consistency in generated results. Specifically, lower temperatures improve the likelihood of generating the correct comment on the first attempt, while higher temperatures increase the chance of producing a correct comment within k attempts (k = 5).
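As an illustration of the sampling setup discussed above, the sketch below draws k = 5 candidate comments at a fixed temperature with Hugging Face transformers; the model checkpoint and decoding parameters here are assumptions for illustration, not the exact configuration used in our experiments.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "codellama/CodeLlama-7b-Instruct-hf"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "..."  # the comment-updating prompt of Fig. 2 would go here
inputs = tokenizer(prompt, return_tensors="pt")

# Lower temperature -> a more deterministic top candidate (helps Accuracy/Recall@1);
# higher temperature -> more diverse candidates (helps Recall@5).
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,          # e.g., 0.1 for codellama:7b, 0.2 for most other models
    num_return_sequences=5,   # k = 5 candidate comments per data point
    max_new_tokens=64,
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```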
**R3Q3: Could you elaborate on the types of code changes in the dataset and how LLMCup addresses different modification complexities?**
This paper focuses on comment updates rather than categorizing code changes. Code updates at the method level can involve (1) syntactic changes that do not alter semantics and (2) changes in execution logic. We adopt a unified approach that leverages the LLM’s capabilities: for syntactic changes, the LLM, trained on diverse datasets, handles varying complexities effectively; for execution-logic changes, we guide the LLM with a customized prompt, as illustrated in Fig. 2. We will discuss this in the revised version.
**R3Q4: Challenges in Designing and Validating Reranking Strategies (Especially RefSimRerank) for Maintaining Semantic Accuracy?**
1. Challenges
Firstly, the design challenge lies in ensuring that top-ranked comments accurately reflect both the original context and the updated code.
Secondly, the validation challenge involves developing a reference similarity metric that effectively measures semantic accuracy.
2. Response to Challenges
Firstly, LLMCup leverages LLMs to generate candidate comments that already reflect the original context and the changes in the code. RefSimRerank, as a subsequent step, reranks these candidates using a reference similarity metric computed from the updated code. Overall, this ensures that the top-ranked candidate comments are better aligned with both the original context and the updated code (a hypothetical sketch of this reranking step follows this answer).
Secondly, validation integrates manual review with a semantic similarity metric to ensure consistency with ground-truth comments.
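To make the above concrete, here is a hypothetical sketch of reference-similarity reranking; we assume the reference is a comment derived from the updated code and that similarity is computed with a sentence-embedding model, which may differ from the exact metric used in RefSimRerank.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model for illustration

def ref_sim_rerank(candidates: list, reference_comment: str) -> list:
    # Rank candidates by cosine similarity between their embeddings and the
    # embedding of a reference comment derived from the updated code.
    ref_emb = encoder.encode(reference_comment, convert_to_tensor=True)
    cand_embs = encoder.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(cand_embs, ref_emb).squeeze(1).tolist()
    return [c for _, c in sorted(zip(scores, candidates), reverse=True)]
```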
**R3Q5: How Does LLMCup Handle Poor-Quality Original Comments? Does Minimizing Changes Retain Them?**
Prior studies [80-84] have shown that comments in the referenced projects are of high quality. These comments have proven beneficial for various software engineering tasks, including bug detection [85-88], specification inference [89-91], testing [92,93], and code synthesis [94-98]. However, poor-quality comments could impact updates and lead to suboptimal results. We will discuss this issue in the revised paper and explore alternative solutions to mitigate it.
[71] Li et al. McRank: Learning to Rank Using Multiple Classification and Gradient Boosting. NIPS, 2007.
[72] Qin et al. Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting. arXiv, 2023.
[73] Burges et al. Learning to Rank Using Gradient Descent. ICML, 2005.
[74] Freund et al. An Efficient Boosting Algorithm for Combining Preferences. JMLR, 2003.
[75] Herbrich et al. Support Vector Learning for Ordinal Regression. 1999.
[76] Ma et al. Zero-shot Listwise Document Reranking with a Large Language Model. arXiv, 2023.
[77] Cao et al. Learning to Rank: From Pairwise Approach to Listwise Approach. ICML, 2007.
[78] Ibrahim et al. Comparing Pointwise and Listwise Objective Functions for Random-Forest-Based Learning-to-Rank. TOIS, 2016.
[79] Zhai et al. CPC: Automatically Classifying and Propagating Natural Language Comments via Program Analysis. ICSE, 2020.
[80] Tenny et al. Procedures and Comments vs. the Banker’s Algorithm. SIGCSE, 1985.
[81] Tenny et al. Program Readability: Procedures Versus Comments. TSE, 1988.
[82] Woodfield et al. The Effect of Modularization and Comments on Program Comprehension. ICSE, 1981.
[83] Hartzman et al. Maintenance Productivity: Observations Based on an Experience in a Large System Environment. CASCON, 1993.
[84] Jiang et al. Examining the Evolution of Code Comments in PostgreSQL. MSR, 2006.
[85] Rubio-González et al. Expect the Unexpected: Error Code Mismatches Between Documentation and the Real World. PASTE, 2010.
[86] Tan et al. iComment: Bugs or Bad Comments? SOSP, 2007.
[87] Tan et al. aComment: Mining Annotations from Comments and Code to Detect Interrupt Related Concurrency Bugs. ICSE, 2011.
[88] Tan et al. @tComment: Testing Javadoc Comments to Detect Comment-Code Inconsistencies. ICST, 2012.
[89] Blasi et al. Translating Code Comments to Procedure Specifications. ISSTA, 2018.
[90] Pandita et al. Inferring Method Specifications from Natural Language API Descriptions. ICSE, 2012.
[91] Zhong et al. Inferring Resource Specifications from Natural Language API Documentation. ASE, 2009.
[92] Goffi et al. Automatic Generation of Oracles for Exceptional Behaviors. ISSTA, 2016.
[93] Wong et al. DASE: Document-Assisted Symbolic Execution for Improving Automated Software Testing. ICSE, 2015.
[94] Allamanis et al. Bimodal Modelling of Source Code and Natural Language. ICML, 2015.
[95] Gvero et al. Synthesizing Java Expressions from Free-Form Queries. OOPSLA, 2015.
[96] Nguyen et al. Statistical Translation of English Texts to API Code Templates. ICSE-C, 2017.
[97] Phan et al. Statistical Learning for Inference Between Implementations and Documentation. ICSE, 2017.
[98] Zhai et al. Automatic Model Generation from Documentation for Java API Functions. ICSE, 2016.
[99] Nashid et al. Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning. ICSE, 2023.