Thank you for sharing your feedback with us.
…
We respond to your comments below.
### W1&Q1 Missing classical techniques
Traditional fuzzing methods such as AFL++ aim to feed random, unexpected inputs to programs in order to uncover vulnerabilities. In competitive programming, however, inputs are expected to follow the problem specification, and the goal of the test cases is to identify correct and efficient programs, not to check whether a program is crash-free or memory-safe. That said, we still ran AFL++ to obtain fuzzed inputs for comparison with HardTestGen.
HardTestGen achieves higher precision and recall on the 53 programs we evaluated (see the table below; an illustrative sketch of how these metrics can be computed follows the table). We found that AFL++'s low precision and recall stem from it generating a large number of invalid inputs: the problem specification is necessary for producing valid inputs. As such, classical fuzzers would not be suitable in our setting unless paired with custom input mutators; the HARDTESTS pipeline could be slightly modified to generate such mutators, similarly to how it currently generates input generators. Due to the time constraints of the rebuttal period, we did not explore providing custom mutators to classical fuzzers with our pipeline, but this is an interesting direction that we plan to explore in future work.
||AFL++|HardTestGen|
|-|-|-|
|Precision (%)|70.00|100.00|
|Recall (%)|30.43|86.96|
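
To make the comparison concrete, below is a minimal sketch of how such an evaluation can be run: fuzzed or generated inputs are first screened by a problem-specific validator, each candidate program is then judged against the surviving tests, and per-program verdicts are compared with oracle verdicts to obtain precision and recall (a program counts as a positive when the generated tests accept it). The helper names, binary paths, and timeouts are illustrative placeholders, not our exact evaluation scripts.

```python
# Illustrative sketch (not our exact evaluation script). Binary paths,
# helper names, and timeouts are hypothetical placeholders.
import subprocess

def is_valid_input(validator_bin: str, test_input: str, timeout: float = 2.0) -> bool:
    """Return True if the input satisfies the problem specification."""
    try:
        proc = subprocess.run([validator_bin], input=test_input, text=True,
                              capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0

def accepted_by_tests(program_bin: str, tests: list[tuple[str, str]]) -> bool:
    """A program is 'accepted' only if it matches the expected output on every test."""
    for test_input, expected in tests:
        try:
            proc = subprocess.run([program_bin], input=test_input, text=True,
                                  capture_output=True, timeout=2.0)
        except subprocess.TimeoutExpired:
            return False
        if proc.stdout.strip() != expected.strip():
            return False
    return True

def precision_recall(oracle_correct: dict[str, bool],
                     test_accepts: dict[str, bool]) -> tuple[float, float]:
    """Treat 'accepted by the generated tests' as a positive prediction."""
    tp = sum(oracle_correct[p] and test_accepts[p] for p in oracle_correct)
    fp = sum((not oracle_correct[p]) and test_accepts[p] for p in oracle_correct)
    fn = sum(oracle_correct[p] and not test_accepts[p] for p in oracle_correct)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```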
### W2 Downstream results not conclusive
In Section 5, we discuss the importance of better test cases in LLM training across three post-training scenarios: teacher distillation, self-distillation, and reinforcement learning.
Although we found that teacher distillation benefits more from scaling the number of questions than from test-case quality, the results in Section 5 show that test quality matters significantly for both **self-distillation** and **reinforcement learning**.
Arguably, reinforcement learning is the more important post-training scenario, because frontier models have no teacher model to distill from, which is why recent frontier LLMs such as OpenAI o1, Kimi K2, DeepSeek R1, and Grok 4 all rely on large-scale RL.
In particular, we report below the validation reward over the course of RL training when using TACO test cases versus HardTests test cases.
|step|16|32|48|64|80|96|112|128|144|160|
|-|-|-|-|-|-|-|-|-|-|-|
|TACO|0.04|0.13|0.10|0.11|0.10|0.19|0.17|0.19|0.20|0.19|
|HardTests|0.05|0.11|0.10|0.20|0.17|0.19|0.23|0.25|0.24|0.25|

|step|176|192|208|224|240|256|272|288|304|320|
|-|-|-|-|-|-|-|-|-|-|-|
|TACO|0.21|0.22|0.24|0.23|0.22|0.24|0.23|0.24|0.23|0.23|
|HardTests|0.25|0.27|0.28|0.29|0.28|0.29|0.30|0.31|0.31|0.32|
HardTests consistently yields higher validation rewards than TACO, and the gap persists even after training converges beyond 300 steps. This downstream result highlights the importance of high-quality test cases.
### W3 Clarity and flow issues
> Figure 2 does not clearly differentiate test types
In the revision, we will update Figure 2 to more clearly differentiate between test types.
> citation formatting is incorrect throughout (e.g., (Face, 2025));
We will correct all citation formatting throughout the paper to comply with the guidelines.
> phrasing is often awkward
We will carefully revise the manuscript to improve the fluency and readability of the writing.
> the appendix appears incomplete
The full appendix is included in the supplementary materials; we will ensure that the revised version integrates and references all relevant appendix content within the main submission for completeness.
> the paper implies that oracle test cases are often missing, yet it still uses them to compute precision
In Section 4.3, we discussed how we obtained the gold labels for a subset of problems:
- For AtCoder, we run candidate programs on official tests that have been previously made available.
- For Codeforces, we submit candidate programs to the website to obtain ground-truth verdicts.
### Q2: oracle programs are missing for many problems, yet uses them to compute precision...
Did the reviewer mean that oracle *tests* are missing? That is what the weakness section suggests, so we answer that interpretation here.
In Section 4.3, we discussed how we obtained the gold labels for a subset of problems:
- For AtCoder, we run candidate programs on official tests that have been previously made available. Note that AtCoder only constitutes a small portion of the 47k problems we have.
- For Codeforces, we submit candidate programs to the website to obtain ground-truth verdicts, without accessing the oracle, human-written tests.
### Q3: Did the authors try prompting LLMs to produce correct solutions first...
Yes. We acknowledge that in some domains oracles are not available (e.g., synthetically generated problems). We propose an alternative, oracle-free approach, ReALGO, based on ALGO [1]. The details of the ReALGO implementation and its results are in Appendix A.7 of the **supplementary materials**. We summarize the main points here.
- According to [1], for algorithmic problems, LLMs are much better at generating brute-force solutions than at generating efficient algorithms.
- ReALGO leverages LLMs to generate a brute-force reference solution by encouraging exhaustive-search strategies. It then synthesizes a validator and ten edge-case input generators; for each generated input, the corresponding output is produced by the brute-force program (see the sketch after the table below).
- ReALGO additionally adds a maximum-length test case to detect time-complexity issues, ensuring coverage of both correctness and efficiency. These 11 targeted test cases are intentionally designed to trigger failures in seemingly correct but flawed programs.
- Empirical results on 165 AtCoder problems (with 50 sampled solutions each) demonstrate the effectiveness of this oracle-free strategy. As shown in the table below, HardTestGen achieves a significantly lower false positive rate (17.67%) than AceCoder (32.49%) while maintaining a comparably low false negative rate (2.19% vs. 2.59%). These findings confirm that HardTestGen remains effective even in the absence of oracle implementations, highlighting the robustness of our approach.
||False Positive Rate (FPR, %)|False Negative Rate (FNR, %)|
|-|-|-|
|AceCoder|32.49|2.59|
|HardTestGen|17.67|2.19|
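
For concreteness, here is a minimal sketch of the ReALGO-style loop described above, assuming the brute-force reference solution, the validator, and the input generators have already been synthesized by an LLM and compiled to binaries. The binary paths and timeouts below are illustrative placeholders rather than the exact implementation in Appendix A.7.

```python
# Minimal sketch of the ReALGO-style, oracle-free test construction loop.
# All binary paths and timeouts are hypothetical placeholders.
import subprocess

def run(binary: str, stdin_text: str, timeout: float) -> subprocess.CompletedProcess:
    return subprocess.run([binary], input=stdin_text, text=True,
                          capture_output=True, timeout=timeout)

def build_oracle_free_tests(brute_bin: str, validator_bin: str,
                            generator_bins: list[str]) -> list[tuple[str, str]]:
    """generator_bins: the 10 edge-case generators plus 1 maximum-size generator."""
    tests = []
    for gen_bin in generator_bins:
        try:
            test_input = run(gen_bin, "", timeout=10.0).stdout
            # Discard inputs that violate the problem specification.
            if run(validator_bin, test_input, timeout=10.0).returncode != 0:
                continue
            # The brute-force program serves as the output oracle; it only
            # needs to be correct, not fast, so it gets a generous timeout.
            expected = run(brute_bin, test_input, timeout=60.0).stdout
        except subprocess.TimeoutExpired:
            continue  # skip cases the generator or brute-force cannot finish
        tests.append((test_input, expected))
    return tests
```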
### Q4: What do the authors mean by "For math, ..."?
We meant that for math problems with numerical answers, the reinforcement-learning reward can be computed directly by comparing the model-generated answer with the ground-truth answer. For example, this kind of reward is typically used for problems from GSM8K [2] and MATH [3].
By contrasting with math, we meant to emphasize that computing an accurate reward is more challenging for coding problems, where correctness must be established by executing the program against test cases.
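
A minimal sketch of this contrast, with hypothetical helper names and answer formats (not taken from the paper): the math reward is a direct comparison with the ground-truth answer, while the code reward requires executing the candidate program against test cases, so its accuracy is bounded by the quality of those tests.

```python
# Hypothetical illustration: reward computation for math vs. code in RL post-training.
import subprocess
from fractions import Fraction

def math_reward(model_answer: str, ground_truth: str) -> float:
    """Direct comparison with the ground-truth numerical answer (GSM8K/MATH style)."""
    try:
        return 1.0 if Fraction(model_answer) == Fraction(ground_truth) else 0.0
    except (ValueError, ZeroDivisionError):
        # Fall back to string comparison for non-numeric answers.
        return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def code_reward(program_bin: str, tests: list[tuple[str, str]]) -> float:
    """Requires execution; the reward is only as reliable as the test cases."""
    for test_input, expected in tests:
        try:
            out = subprocess.run([program_bin], input=test_input, text=True,
                                 capture_output=True, timeout=2.0).stdout
        except subprocess.TimeoutExpired:
            return 0.0
        if out.strip() != expected.strip():
            return 0.0
    return 1.0
```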
### Q5: Improve Figure 2 ... choose milder colors for the tables
In the revision, we will update Figure 2 to more clearly differentiate between test types and use milder colors for Tables 1 and 2.
### Reference
[1] Zhang, Kexun, et al. "Algo: Synthesizing algorithmic programs with generated oracle verifiers." Advances in Neural Information Processing Systems 36 (2023): 54769-54784.
[2] Cobbe, Karl, et al. "Training verifiers to solve math word problems." arXiv preprint arXiv:2110.14168 (2021).
[3] Hendrycks, Dan, et al. "Measuring Mathematical Problem Solving With the MATH Dataset." Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.