# Global rebuttal
We sincerely thank all reviewers for their valuable feedback and comments. We have addressed each reviewer's comments in the individual rebuttals. Here we **present a collective response with additional experiments** and other clarifications.
***
### Motivating Examples added (also posted in rebuttals for several reviewers)
We have added a motivating example to illustrate the workflow and need for RRR [here](https://drive.google.com/drive/folders/1dXc0-RNxrwU_n9d17-PDGrULXTG01pKD?usp=sharing).
***
### More models evaluated (also posted in rebuttal for reviewer FWtn)
We have broadened our benchmark's evaluation scope by adding tests on GPT-3.5, GPT-4, Phi-3-mini, Phi-3-medium, and Llama-3-70b-instruct. Selected results are provided below due to space limitations.
**Test Rate in % on DETAILED-RepoClassBench:**

**Java Results**

| Model | BasicPrompting | Reflexion | NaiveRAG | RepoCoder | RRR |
|----------------------|----------------|-----------|----------|-----------|-------|
| GPT-4                | 2.18 | 20.28 | 14.56 | 58.12 | 84.27 |
| Llama-3-70b-instruct | 1.55 | 9.24 | 7.75 | 32.27 | 79.98 |
| Phi-3-mini           | 1.15 | 3.17 | 6.41 | 9.16 | 19.85 |
| Phi-3-medium         | 1.53 | 2.29 | 8.4 | 33.78 | 40.03 |

**Python Results**

| Model | BasicPrompting | Reflexion | NaiveRAG | RepoCoder | RRR |
|----------------------|----------------|-----------|----------|-----------|-------|
| GPT-4                | 2.4 | 14.36 | 14.08 | 25.59 | 36.92 |
| Llama-3-70b-instruct | 0 | 12.94 | 16.5 | 26.5 | 33.3 |
**Conclusion:** RRR consistently outperforms the baselines across models. GPT-4 excels at reasoning and tool usage, closely followed by Llama-3-70b-instruct, while the smaller Phi models also use the tools reasonably well and surpass the RepoCoder baseline.
***
### Our aims for the dataset (also posted in rebuttal for reviewer f3AN)
Our dataset is designed to test the following aspects of code generation:
1. **Class generation**: Studies [1] demonstrate that LMs underperform on class generation compared to function generation.
2. **Long-form generation**: With an **average of 452 tokens for Java, 1070 for Python, and 842 for C#**, our dataset challenges models to produce lengthy code segments.
3. **Repository-context usage**: It tests a model's ability to understand and use repository interdependencies for accurate code generation.
4. **Conversational interaction**: Real-world prompts can be vague. The 'sketchy' dataset variant provides fewer details and can be used to test models on their ability to clarify requirements through dialogue with developers.
***
### Some drawbacks of other datasets
We would like to highlight the following points about prior benchmarks:
* **ClassEval** [1]: consists of 100 **synthetically** generated Python classes that do not originate from real-world repositories. These classes are relatively short, with an **average length of 123 tokens**, and **lack dependencies on other parts of the repository**.
* **RepoEval** [2]: focuses on **line/API completion** and **lacks evaluation via test cases**.
* **RepoBench** [3]: is limited to **line-level** completion and **lacks evaluation via test cases**.
* **CoderEval** [4]: is restricted to **function-level tasks** with an average implementation length of **108 tokens**, insufficient for assessing long-form code generation capabilities.
[1] ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation
[2] RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation
[3] RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
[4] CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models
***
### Regarding size (also posted in rebuttal for reviewer SRud)
* We have improved the coverage of our dataset by adding a new language, **C#**, with **60 classes**; the test pass rates on it are **RRR: 40%, RepoCoder: 2%, and Reflexion: 3.11%**.
* Strict criteria for inter-file references and test coverage led us to exclude 95% of the initially identified classes.
* Our dataset is comparable in size to benchmarks such as HumanEval (164 samples), ClassEval (100), CoderEval (230), and RepoEval (383); however, each of our tasks includes a complete repository, so direct comparisons with function-level datasets are not appropriate.
***
### Clarifications on baselines
A comparison between RRR and the other baselines along different axes can be found in the table on **page 12, Section B**.
* Reflexion [5]
* Utilizes feedback from failed test cases and compiler errors
* **Lacks a retrieval component** and is hence unable to use repository-level context
* RepoCoder [6]
* An iterative embedding-based retrieval framework
* **Excels at retrieving “similar context”**, i.e., code snippets that are similar to the target code
* **Fails at tasks needing “dependency context”**, i.e., information about code structures on which the generated code may depend, such as parent classes
* RQ-2 shows that RepoCoder’s performance is limited when similar code samples are not available.
* RRR (ours)
* Has **embedding-based tools** to fetch “similar context”
* Has **static-analysis tools** (such as those often available to a developer in an IDE) to fetch “dependency context”
* Uses **feedback from failed test cases and compiler errors** to reason and decide which tools to invoke; a minimal sketch of this loop is given below
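For concreteness, here is a minimal sketch of the generate-test-retrieve loop described above. It assumes hypothetical placeholder interfaces (`generate_class`, `run_tests`, `choose_tool`, `embedding_search`, `static_analysis_lookup`); it illustrates the control flow only and is not RRR's actual implementation.

```python
# Illustrative sketch only: all names and signatures below are hypothetical
# placeholders, not RRR's actual API.
from dataclasses import dataclass, field


@dataclass
class Feedback:
    all_passed: bool
    errors: list[str] = field(default_factory=list)  # compiler errors + failed test cases


# Trivial stand-ins so the sketch runs end to end.
def generate_class(spec, context, previous=None, feedback=None) -> str:
    return "class Placeholder:\n    pass"             # stand-in for an LLM call


def run_tests(repo, code) -> Feedback:
    return Feedback(all_passed=True)                  # stand-in for compile + test


def choose_tool(feedback, spec, code) -> tuple[str, str]:
    return ("dependency_context", "ParentClass")      # stand-in for LLM tool selection


def embedding_search(repo, query) -> list[str]:       # fetches "similar context"
    return []


def static_analysis_lookup(repo, query) -> list[str]:  # fetches "dependency context"
    return []


def rrr_style_loop(spec, repo, max_iters=5) -> str:
    """Generate -> test -> reason over feedback -> retrieve -> regenerate."""
    context: list[str] = []
    code = generate_class(spec, context)
    for _ in range(max_iters):
        fb = run_tests(repo, code)
        if fb.all_passed:
            break
        # The model reasons over the failures and picks which tool to invoke.
        tool, query = choose_tool(fb, spec, code)
        if tool == "similar_context":
            context += embedding_search(repo, query)        # embedding-based tool
        else:
            context += static_analysis_lookup(repo, query)  # static-analysis tool
        code = generate_class(spec, context, previous=code, feedback=fb)
    return code
```

The point of contrast with RepoCoder is that the model itself decides, per iteration and based on the observed failures, whether to fetch similar context or dependency context.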
[5] Reflexion: Language Agents with Verbal Reinforcement Learning
[6] RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation