Thank you for the thoughtful comments.
### 1 ChatGPT baselines on CodeContests
We ran three experiments with ChatGPT as the coder on CodeContests: ChatGPT + sample tests, ChatGPT + ChatGPT-generated tests, and ChatGPT + ALGO-generated tests. The experiments were conducted on a 50-problem subset of CodeContests. The sample budget for each problem was 20, and the generated programs were ranked with each set of tests; a problem counts as solved under 20@k if any of its top-k-ranked programs is correct, and for k = 1 the top-ranked program is taken as the problem's solution. We report the 20@k pass rates below (a sketch of the 20@k computation follows the table).
| Setting | 20@1 | 20@3 | 20@7 |
| --- | --- | --- | --- |
| ChatGPT + sample tests | 4.4% | 7.9% | 12.2% |
| ChatGPT + ChatGPT-generated tests | 6.8% | 8.2% | 11.7% |
| ChatGPT + ALGO-generated tests | 12.0% | 12.0% | 14.0% |
**The results indicate that ALGO significantly improves upon ChatGPT and that ALGO's test cases are better at discriminating correct solutions from incorrect ones.**
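For reference, below is a minimal sketch of how the 20@k numbers above can be computed; the data layout and field names (e.g. `is_correct`) are illustrative placeholders, not our actual evaluation code.

```python
from typing import Dict, List

def pass_at_k_ranked(problems: List[Dict], k: int) -> float:
    """20@k: fraction of problems where at least one of the top-k
    ranked candidates (out of the 20 samples) is correct.

    Each problem dict holds `is_correct`, a list of booleans for its
    candidates, already sorted by the ranking test suite (best first).
    """
    solved = sum(any(p["is_correct"][:k]) for p in problems)
    return solved / len(problems)

# Toy example: 2 problems with shortened candidate lists for illustration.
problems = [
    {"is_correct": [False, True, False]},   # solved for k >= 2
    {"is_correct": [False, False, False]},  # unsolved
]
print(pass_at_k_ranked(problems, 1))  # 0.0
print(pass_at_k_ranked(problems, 3))  # 0.5
```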
### 2 Clarification for sample budget, temperature, and number of tests
**Sample budget.** In the original experiments on LeetCode, ChatGPT and GPT-4 were given a sample budget of 5. Since ALGO uses an instruction enumerator, each instruction was given a sample budget of 1, and the "verifier" was used to pick the best program among those generated from different instructions (a code sketch of this selection procedure follows the table below). We understand your concern that this may lead to an unfair comparison, so we re-ran the experiments with a larger sample budget for the baselines, equal to the size of the instruction set (10). The new results are reported in the following table. **ChatGPT + ALGO is still the best after we increase the baselines' sample budget.**
| Setting | Pass rate |
| --- | --- |
| ChatGPT (budget = 5) pass@1 | 37.1% |
| GPT-4 (budget = 5) pass@1 | 45.7% |
| ChatGPT + ALGO (budget = 1 per instruction) 10@1 | **48.6%** |
| ChatGPT (budget = 10) pass@1 | 39.6% |
| GPT-4 (budget = 10) pass@1 | 46.4% |
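For concreteness, here is a minimal sketch of the selection procedure described above, under our own simplifying assumptions: each test is an (input, expected output) pair produced by running the generated oracle, a candidate program is scored by the fraction of these pairs it reproduces, and the "verifier" keeps the top-scoring candidate. The names and the test representation are illustrative placeholders, not the actual ALGO implementation.

```python
from typing import Callable, List, Tuple

# A test is an (input, expected_output) pair obtained by running the oracle.
Test = Tuple[str, str]

def score_candidate(candidate: Callable[[str], str], tests: List[Test]) -> float:
    """Fraction of oracle-generated tests that the candidate reproduces."""
    passed = sum(candidate(inp) == expected for inp, expected in tests)
    return passed / len(tests) if tests else 0.0

def pick_best(candidates: List[Callable[[str], str]], tests: List[Test]):
    """Verifier: return the candidate that agrees with the oracle on the most tests."""
    return max(candidates, key=lambda c: score_candidate(c, tests))

# Usage sketch: best = pick_best([program_from_instruction_1, ...], oracle_tests)
```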
**Temperature.** There is no temperature mismatch between ALGO + Codex and CodeT. Codex, CodeT, and ALGO + Codex in Table 1 all use the same set of Codex-generated candidate programs provided by CodeT; the only difference is how these programs are reranked. The temperature of 1.0 mentioned on Line 160 was used for generating the "oracles", not the solution programs. We will clarify this in the revision.
**Number of tests.** For the main experiments (in Table 1 and Figure 4), the number of tests generated by ALGO was 20.
### 3 Sample budget for oracle generation
ALGO used a sample budget of 3 to generate the oracle for each problem.
### 4 "Oracle" quality for CodeContests
Due to time constraints, we sampled 50 problems from the CodeContests test set and manually checked the correctness of the generated "oracles": 72% of them were correct.
We thank you again for your time and kindly remind you to take a look at our global rebuttal and supplementary files if needed.