Thank you for your valuable comments!

### 1 Inference cost estimation

We argue that ALGO is NOT significantly more expensive than the baselines. We compare ALGO's cost with the baselines along two axes: generation and validation.

**ALGO has the same generation cost.** ALGO is a code generation framework into which any coder can be plugged; it is combined with existing code generation models such as Codex and ChatGPT. We compare ALGO to the baselines while keeping the sample budget the same. In our main experiments on CodeContests, ALGO filters and reranks the exact same set of programs provided by CodeT, so it has the same generation cost.

**ALGO has similar or lower validation cost.** Validation cost comes from model inference and test execution. For each problem, ALGO uses fewer than 5 model inferences to generate the "oracle" and the "verifier", whereas the number of inferences for CodeT is proportional to the number of test cases it needs (1,000 for CodeContests), so ALGO costs much less in model inference. ALGO also costs less in test execution, because it achieves much better performance with far fewer test cases (20 per problem) than the baseline (1,000 per problem).

### 2 The impact of using a more capable model

We provide extra baselines to show that ChatGPT is not the only reason ALGO works. Instead of using ALGO's oracles to generate outputs, we directly used ChatGPT to generate the outputs for ALGO's test inputs. We use three different test sets to rerank ChatGPT candidates on CodeContests: ChatGPT + sample tests, ChatGPT + ChatGPT-generated tests, and ChatGPT + ALGO-generated tests. The sample budget for each problem was 20. For each problem we picked the top-ranked programs as its solution. We report the pass rates of the top-k-ranked programs (20@k) below.

| | 20@1 | 20@3 | 20@7 |
|-|-|-|-|
| ChatGPT + sample tests | 4.4% | 7.9% | 12.2% |
| ChatGPT + ChatGPT-generated tests | 6.8% | 8.2% | 11.7% |
| ChatGPT + ALGO-generated tests | 12.0% | 12.0% | 14.0% |

As the table shows, even when the test generation module is replaced with ChatGPT, ALGO still significantly improves code generation ability.
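For clarity, the sketch below illustrates the reranking and n@k evaluation procedure described above. It is a minimal illustration, not ALGO's released implementation: the function names (`rerank_by_oracle`, `solved_at_k`) are placeholders, and candidate programs are modeled as plain Python callables for simplicity.

```python
# A minimal, illustrative sketch (not ALGO's released code) of reranking
# candidates by agreement with an oracle and checking solved-at-k.
from typing import Callable, List, Sequence, Tuple


def rerank_by_oracle(candidates: List[Callable],
                     oracle: Callable,
                     test_inputs: Sequence) -> List[Callable]:
    """Sort candidate programs by how many oracle outputs they reproduce."""
    expected = [oracle(x) for x in test_inputs]

    def score(program: Callable) -> int:
        passed = 0
        for x, y in zip(test_inputs, expected):
            try:
                if program(x) == y:
                    passed += 1
            except Exception:
                pass  # a runtime error simply counts as a failed test
        return passed

    return sorted(candidates, key=score, reverse=True)


def solved_at_k(ranked: List[Callable],
                hidden_tests: Sequence[Tuple],
                k: int) -> bool:
    """A problem counts as solved at n@k if any of the top-k ranked
    candidates passes all hidden (system) tests; the n@k pass rate is the
    fraction of problems for which this holds, with n samples per problem."""
    def is_correct(program: Callable) -> bool:
        try:
            return all(program(x) == y for x, y in hidden_tests)
        except Exception:
            return False

    return any(is_correct(p) for p in ranked[:k])
```

In the numbers reported above, n is the sample budget of 20 candidates per problem, and only about 20 oracle-labeled test inputs per problem are used for the scoring step.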
### 3 More details for reproducibility

The sample budget for oracle generation was 3. The sample budget for CodeT was 1,000; we directly took the candidates provided by CodeT without generating new candidates ourselves. The sample budget for PG-TD was 256, following the setting in the original paper. The sample budget for ChatGPT and GPT-4 was 5. For pass rates with different top-k values, we refer you to the table in the second part of this rebuttal, *The impact of using a more capable model*.

### 4 When does ALGO fail?

**Most of ALGO's failures come from the coder's inability to generate correct programs within n samples.**

**This can be deduced from ALGO's performance gain on CodeContests.** For different coders, we use the same set of oracles and verifiers to guide their generation. For weak coders like Codex and GPT-2 (Table 1), ALGO's improvement is significant even when k is as large as 100, indicating that as the sample budget grows and more correct solutions are generated, ALGO is able to discriminate the good from the bad. However, for a strong coder like ChatGPT (the table in *The impact of using a more capable model*), ALGO's improvement converges at a very small k, indicating that ALGO is already able to accurately identify the correct solution (even when only the top-1-ranked program is picked).

**This can also be analyzed from which problems were not solved by ALGO.** For the 7 hard-level problems in our LeetCode dataset, ALGO's oracle was correct for 5 of them. However, only 1 of these 5 was solved correctly, given the sample budget of 5. The remaining 4 of the 5 could have been solved and verified by ALGO if we had a stronger coder.

Thank you again for your time! We kindly refer you to our global rebuttal for common issues and for remarks on ALGO's application beyond algorithmic challenges.