Thank you for the thoughtful comments.
### 0. Terminology
We understand your concerns and will replace the term *verifier* in the revision. To avoid confusion, we put the terms "verifier" and "oracle" in quotation marks throughout this rebuttal.
### 1. Motivation and discussion about existing works
We highlight that ALGO is a step towards verifiability, which makes it fundamentally different from existing works such as LEVER, Coder-Reviewer, and CodeT.
**ALGO's verdict agrees well with the system judge.** The first three rows of Table 1 show that using ALGO instead of CodeT as the reranker on exactly the same set of program candidates yields significantly better top-ranked programs. Table 2 of our paper shows that ALGO's verdict agrees closely with that of the system judge.
**ALGO is more reliable.** LEVER, Coder-Reviewer, and CodeT check programs only with language models. In contrast, ALGO offloads the symbolic reasoning to an executable program. This technique makes the model's reasoning more reliable and is also used by works such as PAL [1] and PoT [2].
**ALGO is interpretable, while existing works are not,** because they depend solely on neural models. ALGO provides test cases and "oracles" that are interpretable by humans. If a user is unsure about ALGO's "verification", they can simply check whether the brute-force "oracles" are right. When we manually checked the "oracles" (Lines 251-252), we found the brute-force programs much easier to check than the actual solutions. In a sense, ALGO converts the task of checking AI-generated programs into checking AI-generated brute-force programs.
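As a purely hypothetical illustration (the problem and function names below are our own, not outputs of ALGO), consider counting the pairs in an array whose sum is divisible by $k$: the brute-force "oracle" restates the problem almost verbatim and is easy to audit, whereas the efficient candidate relies on a modular-arithmetic argument that takes more effort to verify by eye.

```python
# Hypothetical illustration (not taken from ALGO's actual outputs): a brute-force
# "oracle" next to an efficient candidate for "count pairs whose sum is divisible by k".

def oracle_count_pairs(nums, k):
    """Brute-force "oracle": O(n^2), mirrors the problem statement almost verbatim,
    so a human can audit it at a glance."""
    n = len(nums)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if (nums[i] + nums[j]) % k == 0)

def candidate_count_pairs(nums, k):
    """Efficient candidate: O(n + k); its correctness rests on a modular-arithmetic
    argument that is harder to check by eye."""
    counts = [0] * k
    for x in nums:
        counts[x % k] += 1
    pairs = counts[0] * (counts[0] - 1) // 2          # pairs of residue 0
    for r in range(1, k // 2 + 1):
        if r != k - r:
            pairs += counts[r] * counts[k - r]        # residues r and k - r pair up
        else:
            pairs += counts[r] * (counts[r] - 1) // 2  # residue k/2 pairs with itself
    return pairs
```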
### 2. Comparison with existing work
**LEVER has a different task setting, so it cannot be directly compared with ALGO.** To check a program, LEVER needs access to the test-time system inputs, whereas ALGO does not. This makes the task easier for LEVER, since it only needs to establish correctness on those specific inputs. As stated in its limitations, LEVER is "ideal for applications such as text-to-SQL and math reasoning where the users are only looking for answers to their questions", but not as suitable for "general programming tasks as MBPP".
**ALGO performs significantly better than Coder-Reviewer.** We took the officially released candidate programs of Coder-Reviewer, replaced their reranker with ALGO, and kept everything else the same. The results are reported in the following table (same setting as Table 2 of Coder-Reviewer).
||Coder-Reviewer|CodeT|ALGO|
|-|-|-|-|
|pass@100|66.9%|65.8%|82.3%|
### 3. The scope of algorithm synthesis
Here we define algorithm synthesis more formally and demonstrate its difference from functionality synthesis.
**Algorithm synthesis.** Following the notation in Section 2.1, we use $J_S$ and $J_T$ to denote the system judges for semantic correctness and time efficiency, respectively. We denote the set of all semantically correct programs by $P_S$, so that $J_S(P_i)=\mathsf{True},\forall P_i\in P_S$. $P_S$ can further be divided into two subsets $P_S^0$ and $P_S^1$, such that $J_T(P_i^0)=\mathsf{False},\forall P_i^0\in P_S^0$ and $J_T(P_i^1)=\mathsf{True},\forall P_i^1\in P_S^1$. In other words, programs in $P_S^0$ are correct but slow, while programs in $P_S^1$ are correct and efficient. A task belongs to *algorithm synthesis* when finding a program in $P_S^1$ is more challenging than finding one in $P_S^0$.
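As a concrete, hypothetical instance of this definition (the problem, input bound, and time limit below are our own illustration, not taken from the benchmarks): for maximum subarray sum with $n$ up to $10^5$ under a typical 1-second limit, an $O(n^2)$ scan lies in $P_S^0$, while Kadane's $O(n)$ algorithm lies in $P_S^1$.

```python
# Hypothetical illustration of P_S^0 vs. P_S^1 (problem, limits, and names are our own,
# not taken from the paper): maximum subarray sum, n up to 1e5, 1-second time limit.

def max_subarray_quadratic(nums):
    """In P_S^0: semantically correct (J_S = True) but O(n^2),
    so it exceeds the time limit on large inputs (J_T = False)."""
    best = nums[0]
    for i in range(len(nums)):
        running = 0
        for j in range(i, len(nums)):
            running += nums[j]
            best = max(best, running)
    return best

def max_subarray_kadane(nums):
    """In P_S^1: semantically correct and O(n) (J_S = True, J_T = True)."""
    best = running = nums[0]
    for x in nums[1:]:
        running = max(x, running + x)
        best = max(best, running)
    return best
```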
**LeetCode and CodeContests are algorithm synthesis, while HumanEval is not.** In the following table, we report the success rates of ChatGPT at generating a program in $P_S^0$ and in $P_S^1$ when prompted to do so. A brute-force program counts as in $P_S^0$ if it is manually checked to be correct; an efficient solution counts as in $P_S^1$ if it passes all system tests within the time limit. For LeetCode and CodeContests, generating brute-force programs is much easier than generating efficient solutions, whereas for HumanEval the two are about equally easy. Therefore, HumanEval is not algorithm synthesis under our definition.
||LeetCode|CodeContests|HumanEval|
|-|-|-|-|
|$P_S^0$ (brute-force) success rate|88.5%|72.0%|70.12%|
|$P_S^1$ (efficient) success rate|41.2%|7.9%|72.56%|
|relative $\Delta$|114%|811%|-3%|
### 4. Cost analysis
ALGO is NOT significantly more expensive than the baselines. We compare ALGO's cost with the baselines' in two aspects: generation and validation.
**ALGO has the same generation cost.** ALGO is a code generation framework into which any coder can be plugged; it is combined with existing code generation models such as Codex and ChatGPT. We compare ALGO to the baselines while keeping the sample budget the same. In our main experiments on CodeContests, ALGO filters and reranks exactly the same set of programs provided by CodeT, so it incurs the same generation cost.
**ALGO has a similar or lower validation cost.** Validation cost comes from model inference and test execution. For each problem, ALGO uses fewer than 5 model inferences to generate the "oracle" and the "verifier", whereas the number of inferences for CodeT is proportional to the number of test cases it needs (1,000 for CodeContests), so ALGO costs much less in model inference. ALGO also costs less in test execution, because it achieves much better performance with far fewer test cases (20 per problem) than the baseline (1,000 per problem).
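For concreteness, the validation step can be sketched as follows. This is a minimal sketch under our own simplifications: the function names, the input-generator interface, and the default of 20 test inputs are placeholders, not ALGO's actual implementation.

```python
import random

def verify_candidate(candidate, oracle, gen_input, num_tests=20, seed=0):
    """Sketch of oracle-based validation: run the candidate and the brute-force
    "oracle" on a small batch of generated inputs and compare their outputs.
    Once the oracle and the input generator exist (a handful of LLM calls per
    problem), no further model inference is needed."""
    rng = random.Random(seed)
    for _ in range(num_tests):
        test_input = gen_input(rng)  # small input, so the slow oracle still finishes in time
        if candidate(test_input) != oracle(test_input):
            return False  # disagrees with the oracle: reject this candidate
    return True  # agrees on all sampled inputs: keep it / rank it higher
```

Because each candidate is only compared against the "oracle" on a small batch of generated inputs, the validation cost is dominated by about 20 executions per problem rather than by model inference.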
We will include the above cost analysis in the revision.
### 5. Instructions
The instructions we used are included in the supplementary files. We would like to emphasize that, while the instructions can significantly affect performance, ALGO improves performance for different coders with different instruction sets (as demonstrated in our experiments).
**Thank you again for your time. We kindly refer you to the global response for some common issues. Feel free to ask for further clarification.**
### References
[1] PAL: Program-aided Language Models
[2] Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks