Thank you for your thoughtful review. Before providing an itemized response, we would like to reiterate the purpose of HardTestGen, which is significantly different from many seemingly similar test generation methods and constitutes the basis of our design choices and of this response. Unlike many test generation methods that aim to improve coverage for a single program under test, **HardTestGen is used to provide reward signals for reinforcement learning,** meaning that:

- It must be scalable and generalizable to problems described in natural language, because the >30k problems in the RL training set exist only as natural language descriptions.
- There is not a single *program under test* or *focal method*. RL training generates hundreds of programs for each coding problem during rollout, and among the candidate programs there will be several correct solutions with significantly different approaches. Coverage-based test generation techniques would therefore need to be applied hundreds of times per problem, which is far less tractable.
- The tests need to distinguish fast programs from slow ones. The algorithmic problems we face require advanced algorithms and data structures, and LLMs often generate programs that are correct but inefficient. High-coverage tests are therefore not enough, because they often cannot reach the worst-case complexity of inefficient algorithms.
- The ultimate criteria for the test cases are reward accuracy and training effectiveness. While coverage is an important metric for tests, in reinforcement learning what actually matters is that the model is rewarded only for correct outputs and is trained more effectively.

Now we respond to your comments.

### W1: over-dependence on LLM performance raises concerns about the independence and generalizability of the framework

We respectfully disagree. We would argue that:

- LLMs are the exact reason why HardTestGen generalizes to so many different problems stated in *natural language*. While traditional coverage-guided/fuzzing techniques work well for focal methods under test, they are not applicable to natural language coding problems, where understanding the problem is key.
- LLMs are improving fast and could be even more effective in the future. As the following table shows, HardTestGen achieves even better precision and recall when based on LLMs more recent than GPT-4o, and open-weight models (those with \*) also outperform GPT-4o in many cases. Due to the limited time for the response, we conducted this evaluation on a subset of 500 problems. In the table below, GPT-4o is used for candidate program generation, while GPT-4o, Claude-4-Sonnet, Kimi-K2, and Qwen3-Coder are used for test case generation (a sketch of how these precision/recall numbers are computed is given after this list).

|Test generation LLM|Diff1 Precision|Diff1 Recall|Diff2 Precision|Diff2 Recall|Diff3 Precision|Diff3 Recall|Diff4 Precision|Diff4 Recall|
|-|-|-|-|-|-|-|-|-|
|HT+GPT-4o|99.53|99.18|100.0|97.43|96.04|98.45|84.18|98.03|
|HT+Claude-4-Sonnet|99.48|99.86|100.0|95.70|98.28|99.35|93.21|96.86|
|HT+Kimi-K2\*|99.41|99.87|98.30|97.01|98.06|99.13|87.11|98.04|
|HT+Qwen3-Coder\*|99.47|99.14|99.62|98.88|95.20|99.13|76.83|98.82|

- Stronger LLMs are less dependent on external tools, so relying on LLMs actually improves the independence of the overall framework. For example, in the original SWE-Agent [1], the authors gave the LLM multiple tools and relied on an external syntax checker to prevent incorrect edits, achieving 22% on SWE-Bench Verified. In the recent mini-SWE-Agent [2], a stronger model (Claude Sonnet 4) with only one tool -- the bash command -- achieves 65% without any syntax checker.
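As referenced above, here is a minimal sketch of how the test-suite precision and recall over candidate programs can be computed, assuming a "positive" is a candidate program accepted by the synthesized tests and ground-truth correctness comes from the official verdicts; the `Candidate` layout and function name are illustrative rather than our exact evaluation code.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Candidate:
    passes_synthetic: bool  # accepted by the synthesized (HardTestGen) test suite
    is_correct: bool        # ground truth, e.g., from official tests or judge verdicts

def test_precision_recall(candidates: List[Candidate]) -> Tuple[float, float]:
    """Precision: among programs accepted by the synthesized tests, the fraction
    that are truly correct. Recall: among truly correct programs, the fraction
    that the synthesized tests accept."""
    accepted = [c for c in candidates if c.passes_synthetic]
    correct = [c for c in candidates if c.is_correct]
    true_positives = sum(1 for c in accepted if c.is_correct)
    precision = true_positives / len(accepted) if accepted else 1.0
    recall = true_positives / len(correct) if correct else 1.0
    return precision, recall
```

For example, `test_precision_recall([Candidate(True, True), Candidate(True, False), Candidate(False, True)])` returns `(0.5, 0.5)`.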
### W2: evaluating against more recent and competitive test generation approaches

As we argued above, our setting is not really coverage-guided test generation, because there is no single program under test. Nevertheless, we conducted an approximate comparison between HardTestGen and SymPrompt, one of the coverage-guided methods you mentioned. We use the oracle program as the focal method for SymPrompt and measure precision and recall on 163 GPT-4o-generated candidate programs.

|Metric|HardTestGen|SymPrompt|
|-|-|-|
|Line Coverage|92.83|81.59|
|Branch Coverage|93.36|82.84|
|Precision|95.77|62.28|
|Recall|90.67|94.67|

As the table suggests, HardTestGen achieves slightly lower recall but much better coverage and precision than SymPrompt. This is because SymPrompt focuses on providing approximate execution paths to LLMs to maximize coverage, while HardTestGen exploits the LLM's algorithmic knowledge to generate difficult test cases that make bad algorithms slow. For example, consider the following function that computes the greatest common divisor with the Euclidean algorithm:

```python
def gcd(a, b):
    if a % b == 0:
        return b
    return gcd(b, a % b)
```

SymPrompt easily generates test cases that reach 100% coverage, but the test cases it generates are often solved in very few recursive calls. HardTestGen, in contrast, recognizes the structure of the problem from its description and generates a hacking input generator that assigns `a` and `b` to be consecutive Fibonacci numbers, which drives the algorithm to its worst case of about $2\log_2 b + 1$ recursive steps.
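To make this concrete, below is a minimal sketch of the kind of hacking input generator described above; the input format (`"a b"` on one line) and the bound `max_val` are hypothetical placeholders rather than the actual problem constraints, and real generators produced by our workflows are randomized per problem.

```python
import random

def generate_hacking_input(max_val: int = 10**18, seed: int = 0) -> str:
    """Adversarial input "a b" for the gcd problem above: consecutive Fibonacci
    numbers force the Euclidean algorithm through its maximal number of
    recursive calls (about 2*log2(b) + 1). `max_val` is a hypothetical constraint."""
    rng = random.Random(seed)
    pairs = []
    a, b = 1, 1
    while a + b <= max_val:
        a, b = b, a + b           # advance to the next Fibonacci pair within the limit
        pairs.append((b, a))      # store as (larger, smaller)
    big, small = rng.choice(pairs[-5:])  # pick one of the largest pairs for maximal depth
    return f"{big} {small}"
```

Even a test suite with 100% coverage can miss this worst case, whereas inputs like these exercise it by construction.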
### W3: lack of verification for the input validators ... incorporating formal methods such as SMT solvers

First, generated inputs are further validated beyond the input validators: each input is fed to 8 oracle programs, and only inputs that elicit matching outputs from all of them are considered valid (see Line 168, Page 5). It is unlikely that an invalid input produces identical outputs across 8 different implementations.

Second, to address your concern, we randomly sampled 100 input validators and manually checked their correctness; 96 out of 100 correctly validate test inputs.

Third, using formal methods to validate inputs is infeasible. Formal specifications **do not exist** for most algorithmic problems, especially at the scale of 47k problems. Creating such specifications requires enormous effort and domain expertise, and even if they were created, the validity of the specifications themselves would remain questionable, resulting in a "who verifies the verifier" scenario. We would appreciate it if the reviewer could point us to any previous work that formally validates inputs for algorithmic problems at scale.

Finally, LLMs are much worse at auto-formalization and formal verification than at regular code generation. As reported in AlphaVerus [3], GPT-4o can generate formal specifications and formally verify only **27.1%** of the problems in HumanEval even with **a large budget of 256 samples**. In contrast, it correctly solves **90%** of the problems in Python according to [4], and in our experiment it generates correct input validators for **94%** of problems with **only one sample**.

### W4: mechanism for quantitatively controlling the diversity or distributional coverage ... test suite may include redundant inputs or insufficiently capture rare corner cases

Our mechanism for test diversity and distributional coverage is to use different workflows for different types of tests. As shown in **Appendix A.2.1**, these workflows use very different instructions and produce randomized test generators, and empirically the different test types cover diverse input distributions. As reported in Table 1, having all 3 types of tests gives much better precision than having only one or two, especially for difficult problems. In Appendix A.5 of our supplementary material (Pages 36 to 38), we show examples of how more types of tests catch corner cases and reduce false positives.

To quantitatively measure the redundancy of the tests, we conducted an experiment in which we randomly discard a portion of the test cases and measure the resulting precision:

|Percentage to Keep (%)|100|90|80|70|60|50|40|30|20|10|
|-|-|-|-|-|-|-|-|-|-|-|
|Precision (%)|88.86|88.74|87.29|85.42|85.25|83.96|82.92|78.44|74.41|68.80|

As shown in the table, precision drops immediately as test cases are discarded, suggesting that we have very few redundant tests.

### Q1: more tests or better tests?

To answer your question, we calculated the average number of test cases for the set of problems used in Table 1:

| |Avg. # of tests|Precision|Recall|
|-|-|-|-|
|TACO|109.6|69.97|72.53|
|HardTests|39.9|85.64|94.77|

As the table suggests, HardTestGen achieves better test quality (precision and recall) with far fewer tests.

### Q2: RL for more steps

Given the limited time for the rebuttal, we have run training for 220 more steps, up to 320 steps in total. Here are the validation rewards during training.

|step|16|32|48|64|80|96|112|128|144|160|
|-|-|-|-|-|-|-|-|-|-|-|
|TACO|0.04|0.13|0.10|0.11|0.10|0.19|0.17|0.19|0.20|0.19|
|HardTests|0.05|0.11|0.10|0.20|0.17|0.19|0.23|0.25|0.24|0.25|

|step|176|192|208|224|240|256|272|288|304|320|
|-|-|-|-|-|-|-|-|-|-|-|
|TACO|0.21|0.22|0.24|0.23|0.22|0.24|0.23|0.24|0.23|0.23|
|HardTests|0.25|0.27|0.28|0.29|0.28|0.29|0.30|0.31|0.31|0.32|

As shown, HardTests consistently outperforms TACO in validation reward until training converges after around 300 steps.

### Q3: why no human-written solutions for AtCoder

As mentioned on **Page 30, Line 1000 of the supplementary material**, we collect Codeforces submissions and their official verdicts from MatrixStudio/Codeforces-Python-Submissions, a dataset with 690k human-submitted programs. No comparable dataset is available for AtCoder.

### References

[1] Yang, et al. "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering." NeurIPS 2024.

[2] Mini-SWE-agent. GitHub repository: SWE-agent/mini-swe-agent.

[3] Aggarwal, et al. "AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement." ICML 2025.

[4] Hurst, et al. "GPT-4o System Card."