Thanks for your detailed comments. Our response will consist of the following sections:
**0. practical effectiveness**
**1. precision and recall remain comparable with an open-source LLM**
**2. on symprompt and recall gap**
**3. why 4% validator failure does not compromise the dataset**
**4. on test redundancy**
### 0. practical effectiveness
We appreciate and agree with your emphasis on practicality, which is why we want to reiterate that the purpose of our dataset and our test generation methodology is to **improve reward accuracy in the reinforcement learning of LLMs**.
In our opinion, the ultimate criterion for the practicality of our method is **how much it helps train better LLMs.**
HardTests is not perfect, but compared with previous training data such as TACO and CodeContests, it is already the largest in scale (47k problems, up from roughly 25k) and the highest in test quality (measured by precision and recall).
We have shown, in our Section 5.2, how our larger number of problems and higher-quality tests significantly improve three different LLM training scenarios: teacher distillation (from 46% to 53%), self-distillation (from 56% to 60%), and reinforcement learning (from 56% to 64%).
This is **direct evidence for practical effectiveness**.
### 1. "Precision drops markedly when GPT-4o is replaced ... Please demonstrate that precision and recall remain comparable with an open-source model, or clarify the framework’s limits."
We would like to clarify some important points regarding open-source model performance and framework generality:
- We have already reported **results for Kimi-K2**, also **an open-source model**, in our previous response; it is **better than GPT-4o** on 5 of the 8 metrics and comparable on the remaining 3. Qwen3-Coder likewise remains comparable on most settings except one (precision at difficulty 4). We copy the table here for your convenience:
||Diff1||Diff2||Diff3||Diff4||
|-|-|-|-|-|-|-|-|-|
||Precision|Recall|Precision|Recall|Precision|Recall|Precision|Recall|
|HT+GPT-4o|99.53|99.18|100.0|97.43|96.04|98.45|84.18|98.03|
|HT+**Kimi-K2**|99.41|**99.87**|98.30|97.01|**98.06**|**99.13**|**87.11**|**98.04**|
- **HardTestGen actually depends less on LLM strength than prior work.** Prior methods such as TACO and SymPrompt, even when backed by GPT-4, perform much worse than HardTestGen backed by supposedly weaker open-source models, especially on difficult problems. The precision gap between SymPrompt and HardTests is as large as **almost 50 points** at difficulty 4. This suggests that HardTestGen's dependence on strong LLMs is more relaxed, enabling reasonable performance even under constrained model conditions.
||Diff1||Diff2||Diff3||Diff4||
|-|-|-|-|-|-|-|-|-|
||Precision|Recall|Precision|Recall|Precision|Recall|Precision|Recall|
|SymPrompt+GPT-4|98.74|98.95|92.64|90.91|81.72|90.99|28.13|93.18|
|TACO+GPT-4|**100.0** |73.06|**99.75**| 67.29 |92.74 | 74.08 |62.07| 71.05 |
|HT+Qwen3-Coder*|99.47|**99.14**|99.62|**98.88**|**95.20**|**99.13**|**76.83**|**98.82**|
### 2. SymPrompt and recall gap
SymPrompt test cases are generated with GPT-4o, the **same model that we used to generate HardTests**. While SymPrompt achieves higher recall, this is potentially superficial: it arises not from strong test coverage but from weak, non-discriminative test cases, as suggested by the low F1 score.
||HardTestGen|SymPrompt|
|-|-|-|
|Precision| 95.77 | 62.28 |
|Recall| 90.67 | 94.67 |
|F1| 93.15 | 75.13 |
To explain this with an extreme case: a test suite that judges any program as correct will have no false negatives, because it won't have **any** negatives at all. Consequently, its recall will be 100%, which is true but meaningless.
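To make this explicit with the standard definitions (TP and FN counting truly correct programs that the suite accepts and rejects, respectively), an accept-everything suite produces no negatives, so

$$\mathrm{Recall} = \frac{TP}{TP + FN} = \frac{TP}{TP + 0} = 100\%,$$

while its precision gets no such free boost; this asymmetry is precisely what the F1 gap above reflects.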
To concretely illustrate this point and address the reviewer’s request for an example, we provide a detailed case study on Codeforces Problem 191C – Fools and Roads.
The input specifications for this problem are in natural language:
```
The first line contains a single integer n (2 ≤ n ≤ 10^5)
Each of the next n - 1 lines contains two space-separated integers ui, vi (1 ≤ ui, vi ≤ n, ui ≠ vi), that means that there is a road connecting cities ui and vi.
...
The next line contains integer k (0 ≤ k ≤ 10^5)
...
```
In short, this problem involves processing a tree with up to $10^5$ nodes and answering up to $10^5$ pairwise path queries. Efficient solutions require tree traversal preprocessing, and naive or brute-force solutions quickly become infeasible at scale.
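For reference, here is a minimal sketch of what a limit-scale input for this problem could look like (purely illustrative and hypothetical, not HardTestGen's actual generator; the query block that follows `k` is omitted because its exact format is elided in the excerpt above):

```python
import random

def generate_limit_scale_tree(n: int = 10**5, seed: int = 0) -> str:
    """Illustrative generator: a random tree with n nodes in the edge-list
    format quoted above (first line n, then n-1 lines "u v")."""
    rng = random.Random(seed)
    lines = [str(n)]
    for i in range(2, n + 1):
        parent = rng.randint(1, i - 1)  # attach node i to a random earlier node
        lines.append(f"{parent} {i}")
    # The k queries that follow are omitted here, since their exact format
    # is elided ("...") in the specification excerpt above.
    return "\n".join(lines)

print(generate_limit_scale_tree()[:40])  # preview: "100000" followed by the first edges
```

Skewing the parent choice toward recently added nodes would yield near-path trees, which stress naive per-query traversals even harder.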
We observe that SymPrompt-generated test cases for this problem are small across all generations: the maximum observed values are n = 8 and k = 3. These sizes are far below the problem's limits and do not stress the performance characteristics of candidate programs. In contrast, HardTestGen deliberately generates test cases with n and k approaching the specification limit (10⁵), effectively stress-testing the scalability and efficiency of candidate programs. The comparison between the two test suites is as follows:
- Both methods reject a buggy program that produces an incorrect answer.
- SymPrompt correctly accepts the 1 correct solution, but incorrectly accepts all 3 inefficient programs (correct answers, but not within the time limit).
- HardTestGen, on the other hand, correctly rejects all 3 inefficient programs, though it incorrectly rejects the 1 correct solution due to tight time constraints. This could easily be resolved by slightly reducing the input scale or better matching the CPUs used by the online judge.
This explains the observed recall gap: SymPrompt has higher recall because it accepts the correct programs that HardTestGen wrongly rejects (its performance limits being too harsh). However, it does so **at the cost of substantially lower precision**, accepting several clearly incorrect programs.
In fact, none of the SymPrompt-generated test cases induce any timeout error for brute-force programs. HardTests correctly identifies all 36 slow programs among the 163 candidate programs spanning 38 problems, each of which has at least two candidate implementations.
Arguably, HardTests is much more practical than SymPrompt.
### 3. "With a 4% validator failure rate...the dataset could be materially compromised."
While small, highly curated datasets may achieve near-perfect validation with enormous manual effort, they lack the scale, diversity, and domain relevance needed for training and evaluating models on competitive programming. We would like to emphasize that HardTests, despite being imperfect, already sets a new standard in input validation and overall test quality compared to prior work.
As shown in the following table, HardTests outperforms previous training datasets such as TACO and CodeContests across precision, recall, and human validation pass rate, despite using only ~40% as many test cases. This demonstrates that the tests in HardTests are both more effective and more carefully validated.
||Avg. # tests per problem|Precision (%)|Recall (%)|Validation Pass Rate (Human, %)|
|-|-|-|-|-|
|TACO|109.6|86.91|78.69|65|
|CodeContests|101|84.96|91.52|71|
|HardTests|39.9|92.00|94.93|96|
Notably, these two datasets, despite their significantly weaker input validation and test quality, have already been instrumental in training many top-performing language models (e.g., Berkeley's Sky-T1 [1], Google's FLAN-T5 [2]). On Hugging Face alone, 79 models have been trained with CodeContests. It stands to reason that using a higher-quality dataset like HardTests could enable even stronger models.
In fact, **this has already been demonstrated** in our paper: HardTests has proven its utility over prior datasets through empirical results, particularly in Figure 3 and Table 5. In response to the reviewer’s earlier request, we provided additional RL results with more training steps. We would like to reiterate these findings and the substantial utility of HardTests:
|step|16|32|48|64|80|96|112|128|144|160|
|-|-|-|-|-|-|-|-|-|-|-|
|TACO|0.04|0.13|0.10|0.11|0.10|0.19|0.17|0.19|0.20|0.19|
|HardTests|0.05|0.11|0.10|0.20|0.17|0.19|0.23|0.25|0.24|0.25|
|step|176|192|208|224|240|256|272|288|304|320|
|-|-|-|-|-|-|-|-|-|-|-|
|TACO|0.21|0.22|0.24|0.23|0.22|0.24|0.23|0.24|0.23|0.23|
|HardTests|0.25|0.27|0.28|0.29|0.28|0.29|0.30|0.31|0.31|0.32|
We fully agree that dataset quality is important. But we also recognize that improving data quality is a never-ending journey, and each new dataset builds on lessons from prior work. In this light, we argue that HardTests is a big step forward in both quality and utility from prior, already useful datasets.
### 4. "Precision falling by just three percent after deleting forty percent of the suite signals substantial redundancy."
We would like to kindly point out that HardTestGen is not designed for regression testing. Instead, its primary purpose is to provide a reliable reward signal for RL training. That said, since the reviewer has raised recurring concerns about redundancy, we address them directly below.
#### 4a. Why redundancy may not be an issue
It is worth considering why redundant tests might be a problem. From the reviewer's comments, we see two specific concerns:
- That redundant tests might be computationally expensive, and
- That redundant tests might bias our reward calculation.
In terms of **computational cost**, we want to put the numbers in perspective. In practice, the average problem has **40 tests**, only 40% as many as the baseline datasets. Each test is limited to 1-2 seconds of execution time, and all tests are **fully parallelizable**. If any test fails, we declare the solution a failure and skip the remaining tests. In our experiments, the average execution time for a HardTests test suite is just **5.1 seconds on a single CPU**, a small fraction of the **16 minutes** an LLM takes per training step. We could optimize further by running the harder (hacking) tests first to cut a few more unnecessary executions.
As for the concern about **reward calculation**, we follow the same setting as most current code-RL works [1, 3, 4], where the reward signal from test cases is **binary**: if any test fails, the solution is considered a failure. Therefore, having redundant tests (even if the distribution of redundant tests were biased toward easier tests) would not bias reward calculation.
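As a minimal sketch of this reward scheme (illustrative only; sandboxing and parallel dispatch are simplified away, and the command is a placeholder):

```python
import subprocess

def run_test(cmd: list[str], test_input: str, expected: str, timeout: float = 2.0) -> bool:
    """Run one candidate program (e.g. ["python", "candidate.py"]) on a single
    test input under a hard time limit and compare stdout with the expected output."""
    try:
        result = subprocess.run(cmd, input=test_input, capture_output=True,
                                text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # exceeding the 1-2 s limit counts as a failure
    return result.returncode == 0 and result.stdout.strip() == expected.strip()

def binary_reward(cmd: list[str], tests: list[tuple[str, str]]) -> float:
    """Binary reward: 1.0 iff the candidate passes every test.

    The tests are independent and can also be dispatched in parallel; run
    sequentially, we stop at the first failure because a single failing test
    already fixes the reward at 0, so redundant easy tests cannot bias it."""
    for test_input, expected in tests:
        if not run_test(cmd, test_input, expected):
            return 0.0  # early exit: remaining tests cannot change the outcome
    return 1.0
```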
#### 4b. Why we cannot judge input redundancy at face value
Examining redundancy at face value would be **very inaccurate**. Nearly identical inputs ("twins") can mean very different things in terms of runtime. For example, for the following brute-force program for solving Diophantine equations, the input `7 6 7` results in about 10^4 iterations of the inner loop, while `6 6 7` results in about 10^8 iterations. The former easily passes, while the latter can cause the program to time out.
```cpp
#include <bits/stdc++.h>

// Brute-force search for a non-negative integer solution of a*x + b*y = c.
int main() {
    int a, b, c, x, y;
    int p, flag = 0;
    scanf("%d %d %d", &a, &b, &c);
    // Worst case: ~10^8 iterations when no small solution exists (e.g. "6 6 7");
    // terminates after ~10^4 iterations when one is found early (e.g. "7 6 7",
    // where x = 1, y = 0 already satisfies the equation).
    for (x = 0; x <= 10000; x++) {
        for (y = 0; y <= 10000; y++) {
            p = x * a + y * b;
            if (p == c) {
                flag = 1;
                break;
            }
        }
        if (flag == 1) break;
    }
    if (flag == 1)
        printf("Yes");
    else
        printf("No");
    return 0;
}
```
We hope this example illustrates why input-feature distance or constraint-based clustering is **not an adequate way to measure redundancy** for algorithmic problems in competitive programming. In contrast, randomly dropping tests provides a more practical way to approximate redundancy, because it directly reflects the marginal contribution of test cases to evaluation precision.
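To make the random-dropping protocol concrete, here is a minimal sketch (illustrative only, not our exact experimental code); it assumes the per-test pass/fail results of every candidate program have already been computed against its problem's full suite:

```python
import random

def precision_after_drop(candidates, keep_fraction, trials=20, seed=0):
    """Estimate evaluation precision when only a random fraction of tests is kept.

    `candidates` is a list of (per_test_results, is_correct) pairs, one per
    candidate program: `per_test_results` is a list of booleans (did the program
    pass each test of the full suite?) and `is_correct` is the ground-truth label.
    Precision = (# accepted programs that are truly correct) / (# accepted)."""
    rng = random.Random(seed)
    precisions = []
    for _ in range(trials):
        accepted = truly_correct = 0
        for per_test_results, is_correct in candidates:
            k = max(1, int(len(per_test_results) * keep_fraction))
            kept = rng.sample(per_test_results, k)   # randomly drop the rest
            if all(kept):                            # binary accept/reject
                accepted += 1
                truly_correct += is_correct
        if accepted:
            precisions.append(truly_correct / accepted)
    return sum(precisions) / len(precisions) if precisions else 0.0

# e.g. precision_after_drop(candidates, keep_fraction=0.4) approximates the
# "keep 40%" setting, averaged over 20 random subsets.
```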
#### 4c. Why HardTestGen is not as redundant as you may think
- **A 3% drop in precision is already notable**. We ran the same random test-dropping experiment with SymPrompt and present the results below, averaged across 20 runs. The extent of the precision drop is similar to the behavior (4% after 40% removal) we observed for HardTestGen.
|Percentage to Keep(%)|100|90|80|70|60|50|40|30|20|10|
|-|-|-|-|-|-|-|-|-|-|-|
|SymPrompt Precision(%)| 64.74 | 63.12 | 61.45 | 61.25 | 60.84 | 59.40 | 57.36 | 54.45 | 52.00 | 50.73 |
- To further contextualize these results, we also conducted the same experiment on the AtCoder **official test cases** alongside HardTests, as shown below:
|Percentage to Keep(%)| 100 | 90 | 80 | 70 | 60 | 50 | 40 | 30 | 20 | 10 | 5 | 4 | 3 |
| ----------------------------------------- | ------ | ------ | ------ | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| HardTests Precision(%) | 88.86 | 88.74 | 87.29 | 85.42 | 85.25 | 83.96 | 82.92 | 78.44 | 74.41 | 68.80 | 62.17 | 59.79 | 56.55 |
| Official Precision(%) | 100.00 | 100.00 | 100.00 | 99.90 | 99.90 | 99.37 | 99.37 | 99.35 | 98.29 | 96.61 | 94.60 | 94.60 | 94.60 |
As shown in the table, the drop in precision is much more significant in HardTests than in the official test suite. While removing 60% of the official test cases results in a <1% precision drop, the same reduction in HardTests leads to a notable 6.9% decrease in precision.
Therefore, we argue that HardTests' tests are less redundant than the official test cases and comparable in redundancy to SymPrompt's, while achieving significantly higher precision than SymPrompt.
- The high precision of HardTests, despite being only one-third the size of TACO, indicates that our test suite already **contains far less redundancy**. We would like to redirect the reviewer’s attention to the result presented in the rebuttal, which clearly illustrates this point:
||Avg. # tests per problem|Precision (%)|Recall (%)|
|-|-|-|-|
|TACO|109.6|69.97|72.53|
|HardTests|39.9|85.64|94.77|
### Reference
[1] Li, D., et al. (2025). LLMs can easily learn to reason from demonstrations: Structure, not content, is what matters! arXiv preprint arXiv:2502.07374.
[2] Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., ... & Wei, J. (2024). Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70), 1-53.
[3] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., ... & He, Y. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
[4] Team, K., Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., ... & Lin, Z. (2025). Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599.
## Follow-Up 2
We appreciate your quick response and we are glad our new results have addressed some of your concerns. We'll further answer your remaining questions here:
### validator quality ... using open-source models
We further annotated the pass rate of input validators generated from Kimi-K2 for the same 100-problem subset.
98% of the input validators generated by Kimi-K2 are correct, slightly higher than the corresponding rate for those generated by GPT-4o.
### analyzing duplicated test inputs
To directly analyze duplicate inputs, we compare all pairs of test inputs by calculating their **longest common subsequence** (LCS; see Wikipedia: Longest_common_subsequence).
Every test input in our dataset is a string. Two test inputs `a` and `b` are considered **near-duplicates** if and only if:
- Their lengths differ by no more than 10%, i.e., `abs(len(a)-len(b)) / min(len(a),len(b)) <= 0.1`.
- Their longest common subsequence covers at least 80% of the shorter string, i.e., `len(lcs(a, b)) >= 0.8 * min(len(a), len(b))`.
For example, one longest common subsequence of `"8\n1 2 3 4 5 6 7 8"` and `"8\n1 3 2 4 5 6 7 8"` is `['8', '\n', '1', ' ', '2', ' ', '4', ' ', '5', ' ', '6', ' ', '7', ' ', '8']`, which has 15 elements, more than 80% of 17 (the length of both inputs). Therefore these two inputs are considered near-duplicates.
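For reproducibility, here is a minimal sketch of this check (the dynamic program below is the textbook LCS length computation; this is an illustrative implementation rather than necessarily the exact code we ran):

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of strings a and b
    (standard O(len(a) * len(b)) dynamic programming)."""
    dp = [0] * (len(b) + 1)
    for ch in a:
        prev = 0  # dp value of the top-left diagonal cell
        for j, cb in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = prev + 1 if ch == cb else max(dp[j], dp[j - 1])
            prev = cur
    return dp[len(b)]

def near_duplicate(a: str, b: str) -> bool:
    """Apply the two criteria above: lengths within 10% of each other and
    an LCS covering at least 80% of the shorter input."""
    shorter, longer = min(len(a), len(b)), max(len(a), len(b))
    if shorter == 0 or (longer - shorter) / shorter > 0.1:
        return False
    return lcs_length(a, b) >= 0.8 * shorter

# The example above: LCS length 15 >= 0.8 * 17, so these count as near-duplicates.
print(near_duplicate("8\n1 2 3 4 5 6 7 8", "8\n1 3 2 4 5 6 7 8"))  # True
```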
We ran this analysis on 500 problems in our dataset and computed, for each problem, the proportion of test inputs that have at least one near-duplicate.
We found that:
- No two test inputs are exactly identical, because we deduplicate exact matches while generating the tests.
- On average, **0.8%** of a problem's test inputs have at least one near-duplicate.
- The problem with the highest rate has **3.3%**, i.e., roughly 3 near-duplicate inputs out of 100 test inputs.
Therefore, there is no significant test input duplication in HardTests.
### dataset’s precision and real-world usability
As we argued before, we believe that **similar tests do not harm the dataset's precision and real-world usability**, because the model is only rewarded when a program passes all tests. For a given test suite, adding a test that is identical to one already in the suite does not change the set of programs that pass the suite.
Conceptually, we have also shown an example above where two very similar test inputs, `7 6 7` and `6 6 7`, cause very different program behaviors. Therefore, **test inputs that are similar in appearance may actually capture nuanced bugs in programs**. We believe this is also the rationale behind fuzzers such as AFL, which mutate inputs by, among other strategies, flipping random bits.
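As a tiny illustration of that rationale (a simplified single-bit mutation, not AFL's actual mutation engine):

```python
import random

def flip_random_bit(data: bytes, seed: int = 0) -> bytes:
    """Flip one randomly chosen bit of the input, producing a near-identical
    string that may nevertheless drive the program down a very different path
    (note that ASCII '6' and '7' differ by exactly one bit)."""
    rng = random.Random(seed)
    mutated = bytearray(data)
    pos = rng.randrange(len(mutated) * 8)   # pick a bit position
    mutated[pos // 8] ^= 1 << (pos % 8)     # flip it
    return bytes(mutated)

print(flip_random_bit(b"6 6 7"))  # a near-duplicate-looking input
```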
We welcome any additional questions or concerns you may have. If our response has adequately addressed your feedback, we would be grateful if you would consider revising your evaluation.