We thank the reviewers for their thoughtful reviews and comments. We would like to address some common issues mentioned in multiple reviews and clarify the main points of our paper.
### 0. Reminders
**Terminology.** We understand several reviewers' concerns about the proper use of the terms *oracle* and *verifier*, and we will use more appropriate alternatives in the revision. To avoid confusion during the rebuttal, we will refer to the exhaustive search programs generated by language models as "oracles" and to the program correctness checkers as "verifiers", with quotation marks, throughout the rebuttal period.
**Supplementary files.** We kindly remind the reviewers that the "oracles" and "verifiers" we discuss can be examined in the supplementary files, which contain our prompts, input generators/validators, "oracles", test cases, and candidate programs.
### 1. ALGO's "verifier" is fundamentally different from existing works.
We thank the reviewers for pointing out existing works such as LEVER [1] and Coder-Reviewer [2]. We will cite and discuss them in the revision. Here we argue that ALGO's "verifier" differs fundamentally from these existing works, owing to its neurosymbolic nature.
**ALGO provides more reliable feedback.** Neural "verifiers" [1,2] provide little feedback, giving only a "correctness" score. Previous test-based "verifiers" such as CodeT [3] provide some feedback through LM-generated test cases, whose correctness relies solely on the arithmetic and symbolic reasoning ability of language models. However, language models often make mistakes in arithmetic and symbolic reasoning, which is why they solve math problems better by generating programs than by directly outputting the answer, as demonstrated in PAL [4] and PoT [5]. Consequently, the test cases generated by LMs for a SINGLE problem are likely to contain both correct and incorrect ones, providing harmful feedback to the coder. In contrast, once we have a correct "oracle" for a problem (which we do for 88% of LeetCode problems), ALGO guarantees that ALL generated test cases are correct.
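To make this mechanism concrete, below is a minimal sketch (in Python, with a hypothetical maximum-subarray problem and placeholder function names, not code from our supplementary files) of how a correct brute-force "oracle" plus an input generator yields test cases that are correct by construction:
```python
import random

def generate_input():
    # Hypothetical LM-generated input generator: a small random integer array.
    n = random.randint(1, 8)
    return [random.randint(-10, 10) for _ in range(n)]

def oracle(nums):
    # Brute-force "oracle": enumerate every contiguous subarray to find
    # the maximum subarray sum -- correct but quadratic.
    return max(sum(nums[i:j])
               for i in range(len(nums))
               for j in range(i + 1, len(nums) + 1))

def verify(candidate, num_cases=100):
    # Every reference output comes from the "oracle", so every generated
    # test case is correct as long as the "oracle" itself is correct.
    for _ in range(num_cases):
        x = generate_input()
        if candidate(x) != oracle(x):
            return False, x  # a concrete counterexample as feedback
    return True, None
```
A failing case is returned as a concrete counterexample, which is far more actionable feedback for the coder than a scalar "correctness" score.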
**ALGO is interpretable and easy to validate.** Unlike LEVER and Coder-Reviewer, which use black-box neural models to score the correctness of a program, ALGO uses "oracles" in the form of exhaustive search programs. For a candidate program, there is no way to tell how a neural model arrives at its "correctness" score or whether that score actually reflects correctness, whereas any programmer can tell whether a brute-force "oracle" is correct. According to our hired human annotators, manually checking a brute-force program is much easier than checking an efficient one that uses advanced algorithms, because efficient solutions are diverse and problem-specific, while brute-force programs are similar to one another and mostly enumerate permutations, subsets, and stepwise choices.
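As an illustration of this common shape (a hypothetical example, not one of our actual "oracles"), a subset-enumeration solution to 0/1 knapsack is the kind of brute-force program an annotator can check at a glance:
```python
from itertools import combinations

def knapsack_oracle(weights, values, capacity):
    # Typical brute-force "oracle" structure: enumerate every subset of
    # items and keep the best feasible one. An efficient solution would
    # use dynamic programming, but this version is trivial to validate.
    best = 0
    n = len(weights)
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            if sum(weights[i] for i in subset) <= capacity:
                best = max(best, sum(values[i] for i in subset))
    return best
```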
### 2. ALGO may be applicable to general algorithm synthesis scenarios beyond programming contests.
In this part, we address reviewers' doubts about the scope of *algorithm synthesis* and whether ALGO can be applied beyond programming contests. In general, ALGO can be applied to any scenario where a naive, correct, but less satisfactory oracle is much easier to find than a satisfactory solution. Language models can be used to find this naive solution, which then guides the search for a satisfactory one.
**Now we give a concrete example of how ALGO can assist LLMs in using programming languages and frameworks that are not well represented in the training data.** Here, the naive yet less satisfactory oracle is a program written in a popular language/framework that is well represented in the training set.
Consider a basic function that fetches a city's name from the Google Maps API given GPS coordinates. The program needs to run on an embedded system with limited memory and storage, and therefore must be written in C++ using the `neon` library.
**Directly prompting GPT-4 does not work.** It produced the following code snippet.
[Link to the first code snippet that we sent to the AC]
When we tried to test the above code with LLM-generated test cases, the generated cases themselves were incorrect.
```
User: Generate test cases for a program that fetches city name based on GPS coordinates.
Input: 33.798242689239636, 2.87416689773028
Output:
ChatGPT: Input: 33.798242689239636, 2.87416689773028
Expected Output: City Name: Al Qadarif
```
In the example above, ChatGPT incorrectly identifies the location as Al Qadarif, whereas it is actually Laghouat, a city about 2,000 miles away.
Instead of hinging on the LLM's ability to map inputs to outputs, ALGO can build its "verifier" around a straightforward "oracle" that uses popular tools such as Python's `requests` library, whose correctness is easy to ensure. We present the GPT-4-generated "oracle" in Python below.
[Link to the second code snippet that we sent to the AC]
The above "oracle" can then check and guide the generation of a more efficient solution in C++.
### References
[1] LEVER: Learning to Verify Language-to-Code Generation with Execution
[2] Coder Reviewer Reranking for Code Generation
[3] CodeT: Code Generation with Generated Tests
[4] PAL: Program-aided Language Models
[5] Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks