Thank you for your thoughtful comments and inspiring questions.

### 1 pass@k vs n@k

**We followed CodeT's exact setting.** The "setting" in Line 216 means that we used ALGO to rerank the solution candidates (**n=1000** per problem) provided by CodeT, so our results are comparable to those in CodeT's Table 3.

**CodeT directly compared pass@k and n@k,** because the pass@k of a coder without reranking is equivalent to the n@k of a coder whose reranker returns a uniformly random permutation. This is why CodeT "use[s] the unbiased definition of pass@k as [their] baseline".

**We understand your concern:** pass@k and n@k are not easily compared. But the n@k scores of two coders that both use reranking are directly comparable, which is why we compared them in Table 1. We will clarify the difference between the two metrics and report pass@k only as a reference metric.

### 2 Originality and comparison to other reranking approaches

We do not claim reranking as part of our novelty. We highlight that ALGO is fundamentally different from existing works in verifier quality. We compare ALGO with one more baseline, Coder-Reviewer, here, and will include more extensive comparisons in the revision.

**ALGO is more reliable.** LEVER, Coder-Reviewer, and CodeT check programs only with language models. In contrast, ALGO offloads the symbolic reasoning to a program. This technique makes the model's reasoning more reliable and is also used by other works such as *PAL: Program-aided Language Models*.

**ALGO is interpretable; existing works are not,** because they depend solely on neural models. ALGO provides test cases and "oracles" that are interpretable by humans. If users are unsure about ALGO's "verification", they can simply check whether the brute-force "oracles" are right. When we manually checked the "oracles" (Lines 251-252), we found the brute-force programs much easier to check than the actual solutions. In a sense, ALGO converts the task of checking AI-generated programs into checking AI-generated brute-force programs.

**ALGO performs significantly better than Coder-Reviewer.** We used the officially released candidate programs of Coder-Reviewer, replaced their reranker with ALGO's, and kept everything else the same. We report the results in the following table (same setting as Table 2 in Coder-Reviewer).

| | Coder-Reviewer | CodeT | ALGO |
|-|-|-|-|
| pass@100 | 66.9% | 65.8% | 82.3% |

### 3 CodeContests vs LeetCode

**We report the performance of ChatGPT and ChatGPT+ALGO on CodeContests (sample budget n=20).**

| | k=1 | k=3 | k=10 |
|-|-|-|-|
| ChatGPT, pass@k | 4.4% | 7.9% | 12.2% |
| ChatGPT+ALGO, 20@k | 12% | 12% | 14% |

ALGO achieves 12% even when k=1, indicating ALGO's superior ability to verify programs.

CodeT cannot be run on LeetCode because its backbone, Codex, is no longer accessible. Our experiments with PG-TD on LeetCode are still ongoing because they take a very long time. We will update the results as soon as possible.

### 4 Verifier correctness

**We did not assert that all TLE verdicts were correct.** We did manually verify that *every* TLE solution was functionally correct. We believe the misunderstanding about this "assertion" was caused by our writing. We will rephrase and clarify this issue in the revision.

**ALGO's verifiers are correct for 72% of CodeContests problems.** Due to the time limit, we were not able to verify all 165 problems. To make the sample representative, we sampled 4 contests, with a total of 50 problems, from CodeContests and checked them manually. 64% of all oracles get AC/TLE verdicts, and **72%** of all oracles are functionally correct. The extra 8% comes from oracles that get runtime errors because their deep recursive searches exceed the stack memory limit.
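To make the verification step described above concrete, here is a minimal sketch of the differential testing it performs. The names `passes_oracle_check`, `candidate`, `oracle`, and `gen_input` are hypothetical placeholders for illustration, not ALGO's actual interface:

```
import random

def passes_oracle_check(candidate, oracle, gen_input, n_trials=20, seed=0):
    """Differential testing: accept `candidate` only if its output matches
    the brute-force `oracle` on every randomly generated input.

    `candidate` and `oracle` map an input string to an output string;
    `gen_input` draws one random input string (all names are placeholders).
    """
    rng = random.Random(seed)  # fixed seed so the check is reproducible
    for _ in range(n_trials):
        case = gen_input(rng)
        if candidate(case) != oracle(case):
            return False  # disagreement with the oracle: reject
    return True  # consistent with the oracle on all sampled inputs
```

A user who doubts a verdict only has to read `oracle`, the short brute-force program, rather than the optimized candidate.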
### 5 Generating inputs

**All input generators/validators we used can be found** in the supplementary files, along with the prompts we used to generate them. Examples of input generators and input validators can be found in the Appendix (Figures 11, 12, 13, 15).

**ChatGPT can produce input generators/validators that follow problem-specific constraints.** Some tricky but successful examples for CodeContests (see supplementary files) include generating trees (1575I), primes (1603E), and permutations (1622B). We provide two more examples in which ChatGPT generates undirected connected graphs and directed acyclic graphs.

```
import random

def generate_random_connected_graph(n, m):
    if n < 2 or m < n - 1 or m > n * (n - 1) // 2:
        raise ValueError("Invalid n and m.")
    # Step 1: Ensure the graph is connected by creating a random spanning tree.
    # Edges are stored as normalized (min, max) pairs to avoid duplicates.
    edges = set()
    nodes = list(range(n))
    random.shuffle(nodes)
    for i in range(1, n):
        u, v = nodes[i - 1], nodes[i]
        edges.add((min(u, v), max(u, v)))
    # Step 2: Add random edges until the graph has m edges.
    while len(edges) < m:
        u, v = random.randint(0, n - 1), random.randint(0, n - 1)
        if u > v:
            u, v = v, u
        if u != v:
            edges.add((u, v))
    # Convert to the desired input format.
    result = f"{n} {m}\n"
    for u, v in edges:
        result += f"{u} {v}\n"
    return result
```

```
import random

def generate_random_dag(num_nodes, num_edges):
    if num_edges > num_nodes * (num_nodes - 1) // 2:
        raise ValueError("Too many edges to form a DAG")
    # Sample distinct edges (u, v) with u < v; this orientation is acyclic.
    edges = set()
    while len(edges) < num_edges:
        u = random.randint(1, num_nodes - 1)  # ensure u < v
        v = random.randint(u + 1, num_nodes)
        edges.add((u, v))
    result = f"{num_nodes} {num_edges}\n"
    result += "\n".join(f"{u} {v}" for u, v in edges)
    return result
```

### 6 Applicability beyond algorithmic challenges

We are glad you asked this! The possibilities you mentioned are inspiring and capture the essence of our method: naive yet functionally correct programs that are easy to find. Due to the rebuttal's length limit, we kindly refer you to our global rebuttal, where we introduce an example of ALGO's application, SQL optimization, and discuss other possibilities.

### 7 Minor suggestions

Thank you for the detailed and comprehensive suggestions. We will address these in the revision.