# Global rebuttal

We express our sincere thanks to all reviewers for their valuable feedback and comments. We have addressed each reviewer's comments in our individual rebuttals. Additionally, we present a collective response here with additional experiments and other clarifications.

***

### Motivating example added (also posted in rebuttals for several reviewers)

We have added a motivating example to illustrate the workflow and need for RRR [here](https://drive.google.com/drive/folders/1dXc0-RNxrwU_n9d17-PDGrULXTG01pKD?usp=sharing).

***

### More models evaluated (also posted in rebuttal for reviewer FWtn)

We've broadened our benchmark's evaluation scope by including tests on GPT-3.5, GPT-4, Phi-3-mini, Phi-3-medium, and Llama-3-70b-instruct. Selected results are provided below due to space limitations.

**Test Rate in % on DETAILED-RepoClassBench:**

**Java results**

| Model | BasicPrompting | Reflexion | NaiveRAG | RepoCoder | RRR |
|----------------------|----------------|-----------|----------|-----------|-------|
| GPT-4 | 2.18 | 20.28 | 14.56 | 58.12 | 84.27 |
| Llama-3-70b-instruct | 1.55 | 9.24 | 7.75 | 32.27 | 79.98 |
| Phi-3-mini | 1.15 | 3.17 | 6.41 | 9.16 | 19.85 |
| Phi-3-medium | 1.53 | 2.29 | 8.4 | 33.78 | 40.03 |

**Python results**

| Model | BasicPrompting | Reflexion | NaiveRAG | RepoCoder | RRR |
|----------------------|----------------|-----------|----------|-----------|-------|
| GPT-4 | 2.4 | 14.36 | 14.08 | 25.59 | 36.92 |
| Llama-3-70b-instruct | 0 | 12.94 | 16.5 | 26.5 | 33.3 |

**Conclusion:** RRR consistently outperforms the baselines across models. GPT-4 excels in reasoning and tool usage, closely followed by Llama-3-70b-instruct, while the smaller Phi models also use the tools reasonably well and surpass the RepoCoder baseline.

***

### Our aims for the dataset (also posted in rebuttal for reviewer f3AN)

Our dataset is designed to test the following aspects of code generation:

1. **Class generation:** Studies [1] demonstrate that LMs underperform on class generation compared to function generation.
2. **Long-form generation:** With an **average of 452 tokens for Java, 1070 tokens for Python, and 842 tokens for C#**, our dataset challenges models to produce lengthy code segments.
3. **Repository-context usage:** It tests a model's ability to understand and use repository interdependencies for accurate code generation.
4. **Conversational interaction:** Real-world prompts can be vague. The 'sketchy' dataset variant provides fewer details and can be used to test models on their ability to clarify requirements through dialogue with developers.

***

### Some drawbacks of other datasets

We would like to highlight the following points about prior benchmarks:

* **ClassEval** [1]: consists of 100 **synthetically** generated Python classes that do not originate from real-world repositories. These classes are relatively short, with an **average length of 123 tokens**. Furthermore, they **lack dependencies on other parts of the repository** (the sketch after this list illustrates the kind of inter-file dependency our tasks involve).
* **RepoEval** [2]: focuses on **line/API completion** and **lacks evaluation via test cases**.
* **RepoBench** [3]: is limited to **line-level** completion and **lacks evaluation via test cases**.
* **CoderEval** [4]: is restricted to **function-level tasks** with an average implementation length of **108 tokens**, insufficient for assessing long-form code generation capabilities.
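To make the contrast with synthetic, dependency-free classes concrete, below is a minimal, purely hypothetical illustration (not drawn from our benchmark) of the kind of inter-file dependency a target class typically has: the model cannot generate the class correctly without discovering the signature of a helper defined elsewhere in the repository. File paths and names are invented for illustration only.

```python
# Purely hypothetical illustration; the two "files" are shown together for brevity.

# --- repo/utils/rate_limiter.py : pre-existing repository code ---
class RateLimiter:
    """Allows at most `max_calls` calls."""
    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.calls = 0

    def allow(self) -> bool:
        self.calls += 1
        return self.calls <= self.max_calls


# --- repo/client/api_client.py : the class the model must generate ---
# (in the actual repository this file would begin with:
#  from repo.utils.rate_limiter import RateLimiter)
class ApiClient:
    """Generating this class correctly requires knowing RateLimiter's
    constructor and allow() signature from another file in the repository."""
    def __init__(self, max_calls: int = 5):
        self.limiter = RateLimiter(max_calls)

    def get(self, url: str) -> str:
        if not self.limiter.allow():
            raise RuntimeError("rate limit exceeded")
        return f"GET {url}"
```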
[1] ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation
[2] RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation
[3] RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
[4] CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models

***

### Regarding size (also posted in rebuttal for reviewer SRud)

* We have improved the coverage of our dataset by including a new language, **C#**, with **60 classes**, showcasing varied test pass rates: **RRR (40%), RepoCoder (2%), and Reflexion (3.11%)**.
* Strict criteria for inter-file references and test coverage led us to exclude 95% of the initially identified classes.
* Our dataset, while comparable in size to benchmarks like HumanEval (164 samples) and others (ClassEval: 100, CoderEval: 230, RepoEval: 383), is unique in that each task includes a complete repository, making direct comparisons with function-level datasets inappropriate.

***

### Clarifications on baselines

A comparison between RRR and the other baselines across different axes can be viewed in the table on **page 12, Section B**. A minimal sketch of RRR's feedback-driven tool-selection loop is given after the references below.

* Reflexion [5]
    * Utilizes feedback from failed test cases and compiler errors
    * **Lacks a retrieval component** and is hence unable to use repository-level context
* RepoCoder [6]
    * An iterative embedding-based retrieval framework
    * **Excels at retrieving “similar context”**, i.e., code snippets that are similar to the target code
    * **Fails at tasks needing “dependency context”**, i.e., information about code structures on which the generated code may depend, such as parent classes
    * RQ-2 shows RepoCoder’s performance can be limited when similar code samples are not available.
* RRR (ours)
    * Has **embedding-based tools** to fetch “similar context”
    * Has **static-analysis tools** (such as the ones often available to a developer in an IDE) to fetch “dependency context”
    * Uses **feedback from failed test cases and compiler errors** to reason and decide which tools to invoke

[5] Reflexion: Language Agents with Verbal Reinforcement Learning
[6] RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation
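For concreteness, the following is a minimal sketch of the feedback-driven loop described above. It assumes hypothetical `llm` and `repo` helper objects; names such as `choose_tool`, `embedding_search`, and `static_analysis_lookup` are placeholders for illustration and are not the actual RRR implementation or API.

```python
def rrr_generate(task_description, repo, llm, max_iters=5):
    """Sketch: generate a class, using compiler/test feedback to pick retrieval tools.

    `repo` and `llm` are hypothetical helpers standing in for the repository
    harness and the language-model agent; all method names are placeholders.
    """
    context = ""
    code = llm.generate(task_description, context)
    for _ in range(max_iters):
        feedback = repo.compile_and_run_tests(code)  # compiler errors + failed test cases
        if feedback.all_tests_pass:
            return code
        # The agent reasons over the feedback and decides which tool to invoke.
        tool, query = llm.choose_tool(feedback, task_description)
        if tool == "embedding_search":
            # Fetch "similar context": snippets resembling the target code.
            context += repo.embedding_search(query)
        elif tool == "static_analysis":
            # Fetch "dependency context": parent classes, signatures, etc.
            context += repo.static_analysis_lookup(query)
        code = llm.generate(task_description, context, feedback)
    return code
```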