Thanks for your detailed comments. Our response consists of the following sections:

- **0. practical effectiveness**
- **1. precision and recall remain comparable with an open-source LLM**
- **2. SymPrompt and the recall gap**
- **3. why a 4% validator failure rate does not compromise the dataset**
- **4. on test redundancy**

### 0. practical effectiveness

We appreciate and agree with your emphasis on practicality, which is why we want to reiterate that the purpose of our dataset and our test-generation methodology is **improving reward accuracy in reinforcement learning of LLMs**. In our opinion, the ultimate criterion for the practicality of our method is **how much it helps train better LLMs**. HardTests is not perfect, but compared with previous training data such as TACO and CodeContests, it is already the largest in scale (from 25k to 47k problems) and the highest in test quality (measured by precision and recall). We have shown in Section 5.2 how our larger number of problems and higher-quality tests significantly improve three different LLM training scenarios: teacher distillation (from 46% to 53%), self-distillation (from 56% to 60%), and reinforcement learning (from 56% to 64%). This is **direct evidence of practical effectiveness**.

### 1. "Precision drops markedly when GPT-4o is replaced ... Please demonstrate that precision and recall remain comparable with an open-source model, or clarify the framework’s limits."

We would like to clarify some important points regarding open-source model performance and framework generality:

- We reported **results for Kimi-K2** in our previous response. Kimi-K2 is also **an open-source model**, and it is **better than GPT-4o** on 5 of 8 metrics and comparable on the remaining 3. Qwen3-Coder also remains comparable on most tasks, except for one. We copy the table here for your convenience:

| | Diff1 Precision | Diff1 Recall | Diff2 Precision | Diff2 Recall | Diff3 Precision | Diff3 Recall | Diff4 Precision | Diff4 Recall |
|-|-|-|-|-|-|-|-|-|
| HT+GPT-4o | 99.53 | 99.18 | 100.0 | 97.43 | 96.04 | 98.45 | 84.18 | 98.03 |
| HT+**Kimi-K2** | 99.41 | **99.87** | 98.30 | 97.01 | **98.06** | **99.13** | **87.11** | **98.04** |

- **HardTestGen actually depends less on LLM strength than prior work.** Prior methods such as TACO and SymPrompt, when backed by GPT-4, perform much worse than HardTestGen backed by supposedly weaker open-source models, especially on difficult problems. The precision gap between SymPrompt and HardTests is as large as **almost 50 points** at difficulty 4. This suggests that HardTestGen's dependence on strong LLMs is more relaxed, enabling reasonable performance even under constrained model conditions.

| | Diff1 Precision | Diff1 Recall | Diff2 Precision | Diff2 Recall | Diff3 Precision | Diff3 Recall | Diff4 Precision | Diff4 Recall |
|-|-|-|-|-|-|-|-|-|
| SymPrompt+GPT-4 | 98.74 | 98.95 | 92.64 | 90.91 | 81.72 | 90.99 | 28.13 | 93.18 |
| TACO+GPT-4 | **100.0** | 73.06 | **99.75** | 67.29 | 92.74 | 74.08 | 62.07 | 71.05 |
| HT+Qwen3-Coder* | 99.47 | **99.14** | 99.62 | **98.88** | **95.20** | **99.13** | **76.83** | **98.82** |
### 2. SymPrompt and the recall gap

SymPrompt test cases are generated with GPT-4o, the **same model that we used to generate HardTests**. While SymPrompt achieves a higher recall, this performance is potentially superficial, arising not from strong test coverage but from weak, non-discriminative test cases, as suggested by the low F1 value.

| | HardTestGen | SymPrompt |
|-|-|-|
| Precision | 95.77 | 62.28 |
| Recall | 90.67 | 94.67 |
| F1 | 93.15 | 75.13 |

To explain this with an extreme case: a test suite that judges every program as correct will have no false negatives, because it produces no **negatives** at all. Consequently, its recall (the fraction of correct programs it accepts) will be 100%, which is true but meaningless.

To illustrate this point concretely and address the reviewer's request for an example, we provide a detailed case study on Codeforces Problem 191C – Fools and Roads. The input specification for this problem is in natural language:

```
The first line contains a single integer n (2 ≤ n ≤ 10^5)
Each of the next n - 1 lines contains two space-separated integers ui, vi (1 ≤ ui, vi ≤ n, ui ≠ vi), that means that there is a road connecting cities ui and vi.
...
The next line contains integer k (0 ≤ k ≤ 10^5)
...
```

In short, this problem involves processing a tree with up to $10^5$ nodes and answering up to $10^5$ pairwise path queries. Efficient solutions require tree-traversal preprocessing, and naive or brute-force solutions quickly become infeasible at scale.

We observe that SymPrompt-generated test cases for this problem are small across all generations: the maximum observed values are n = 8 and k = 3. These sizes are far below the problem's limits and do not stress the performance characteristics of candidate programs. In contrast, HardTestGen deliberately generates test cases with n and k approaching the specification limit ($10^5$), effectively stress-testing the scalability and efficiency of candidate programs.

The comparison between the two test suites is as follows:

- Both methods reject a buggy program that produces incorrect answers.
- SymPrompt correctly accepts the 1 correct solution, but incorrectly accepts all 3 inefficient programs (correct answers, but not within the time limit).
- HardTestGen, on the other hand, correctly rejects all 3 inefficient programs, though it incorrectly rejects the 1 correct solution due to tight time constraints. This could easily be resolved by slightly reducing the input scale or better matching the CPUs used by the online judge.

This leads to the observed recall gap: SymPrompt has higher recall because it accepts the correct programs that HardTestGen wrongly rejects (its performance limits being too harsh). However, it does so **at the cost of substantially lower precision**, accepting several clearly incorrect programs. In fact, none of the SymPrompt-generated test cases induce any timeout for brute-force programs, whereas HardTests correctly identifies all 36 slow programs among the 163 candidate programs spanning 38 problems, each of which has at least two candidate implementations. Arguably, HardTests is much more practical than SymPrompt.
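To make the scale difference concrete, below is a minimal, hypothetical sketch of a stress-test generator for the 191C input format quoted above (not the actual HardTestGen generator); it emits a random tree with n near the $10^5$ limit and k large queries, assuming, as for roads, that the two endpoints of a query are distinct.

```python
import random

def generate_stress_input(n=10**5, k=10**5, seed=0):
    """Hypothetical generator for the 191C input format: a random tree
    on n cities (n - 1 roads) followed by k pairwise queries.  Inputs of
    this size make O(n * k) brute-force path walking infeasible."""
    rng = random.Random(seed)
    lines = [str(n)]
    # Attach each city i to a random earlier city, which always yields a valid tree.
    for i in range(2, n + 1):
        lines.append(f"{rng.randint(1, i - 1)} {i}")
    lines.append(str(k))
    for _ in range(k):
        u, v = rng.sample(range(1, n + 1), 2)  # two distinct cities per query (assumption)
        lines.append(f"{u} {v}")
    return "\n".join(lines)
```

Inputs of this size are what allow a test suite to separate the three inefficient programs from the efficient correct solution in the case study above.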
### 3. "With a 4% validator failure rate ... the dataset could be materially compromised."

While small, highly curated datasets may achieve near-perfect validation with enormous manual effort, they lack the scale, diversity, and domain relevance needed for training and evaluating models on competitive programming. We would like to emphasize that HardTests, despite being imperfect, already sets a new standard in input validation and overall test quality compared to prior work. As shown in the following table, HardTests outperforms previous training datasets such as TACO and CodeContests in precision, recall, and human validation pass rate, despite using only ~40% as many test cases. This demonstrates that the tests in HardTests are both more effective and more carefully validated.

| | # of tests | Precision | Recall | Validation Pass Rate (Human) |
|-|-|-|-|-|
| TACO | 109.6 | 86.91 | 78.69 | 65 |
| CodeContests | 101 | 84.96 | 91.52 | 71 |
| HardTests | 39.9 | 92.00 | 94.93 | 96 |

Notably, these two datasets, despite their significantly weaker input validation and test quality, have already been instrumental in training many top-performing language models (e.g., Berkeley's Sky-T1 [1], Google's FLAN-T5 [2]). On Hugging Face alone, 79 models have been trained with CodeContests. It stands to reason that using a higher-quality dataset like HardTests could enable even stronger models. In fact, **this has already been demonstrated** in our paper: HardTests has proven its utility over prior datasets through empirical results, particularly in Figure 3 and Table 5. In response to the reviewer's earlier request, we provided additional RL results with more training steps. We reiterate these findings and the substantial utility of HardTests:

| Step | 16 | 32 | 48 | 64 | 80 | 96 | 112 | 128 | 144 | 160 |
|-|-|-|-|-|-|-|-|-|-|-|
| TACO | 0.04 | 0.13 | 0.10 | 0.11 | 0.10 | 0.19 | 0.17 | 0.19 | 0.20 | 0.19 |
| HardTests | 0.05 | 0.11 | 0.10 | 0.20 | 0.17 | 0.19 | 0.23 | 0.25 | 0.24 | 0.25 |

| Step | 176 | 192 | 208 | 224 | 240 | 256 | 272 | 288 | 304 | 320 |
|-|-|-|-|-|-|-|-|-|-|-|
| TACO | 0.21 | 0.22 | 0.24 | 0.23 | 0.22 | 0.24 | 0.23 | 0.24 | 0.23 | 0.23 |
| HardTests | 0.25 | 0.27 | 0.28 | 0.29 | 0.28 | 0.29 | 0.30 | 0.31 | 0.31 | 0.32 |

We fully agree that dataset quality is important. But we also recognize that improving data quality is a never-ending journey, and each new dataset builds on lessons from prior work. In this light, we argue that HardTests is a substantial step forward in both quality and utility over prior, already useful datasets.

### 4. "Precision falling by just three percent after deleting forty percent of the suite signals substantial redundancy."

We would like to kindly point out that HardTestGen is not designed for regression testing. Instead, its primary purpose is to provide a reliable reward signal for RL training. That said, since the reviewer has raised recurring concerns about redundancy, we address them directly below.

#### 4a. Why redundancy may not be an issue

It is worth considering why redundant tests might be a problem. From the reviewer's comments, we see two possible specific concerns:

- that redundant tests might be computationally expensive, and
- that redundant tests might bias our reward calculation.

In terms of **computational cost**, we want to put the numbers into perspective. In practice, the average problem has **40 tests**, only 40% as many as the baseline datasets. Each test is limited to 1-2 seconds of execution time, and all tests are **fully parallelizable**. If any test fails, we declare the solution a failure and skip the remaining tests. In our experiments, the average execution time for a HardTests test suite is just **5.1 seconds on a single CPU**, a small fraction of the **16 minutes** an LLM takes per training step. We could further optimize by running the harder (Hacking) tests first to cut a few more unnecessary executions.

As for the concern about **reward calculation**, we follow the same setting as most current code-RL works [1, 3, 4], where the reward signal from test cases is **binary**: if any test fails, the solution is considered a failure. Therefore, having redundant tests, even if the distribution of redundant tests were biased towards easier tests, would not bias the reward calculation.
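To make the reward computation explicit, here is a minimal sketch of such a binary reward with early termination (illustrative only; `run_test` stands in for whatever sandboxed execution harness is used):

```python
def binary_reward(program, tests, run_test):
    """All-or-nothing reward signal for RL on code: 1.0 only if the
    candidate program passes every test, 0.0 otherwise.  `run_test` is
    a placeholder for a sandboxed runner that executes the program on a
    single test case (with a 1-2 s time limit) and returns True/False.
    Because the reward is binary, we stop at the first failure; the
    independent test runs could equally well be dispatched in parallel."""
    for test in tests:
        if not run_test(program, test):
            return 0.0  # a single failing test already determines the reward
    return 1.0
```

In particular, adding a duplicate of a test the program already passes leaves this value unchanged, which is the sense in which redundant tests do not bias the reward.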
#### 4b. Why we cannot judge input redundancy at face value

Examining redundancy at face value would be **very inaccurate**. Nearly identical inputs ("twins") can mean very different things in terms of runtime. For example, for the following brute-force program for solving Diophantine equations, the input `7 6 7` results in about 10^4 iterations of the inner loop, while `6 6 7` results in about 10^8 iterations. The former easily passes, while the latter can cause the program to time out.

```cpp
#include <bits/stdc++.h>

int main() {
    int a, b, c, x, y;
    int p, flag = 0;
    scanf("%d %d %d", &a, &b, &c);
    // Brute-force search for non-negative x, y with a*x + b*y == c.
    for (x = 0; x <= 10000; x++) {
        for (y = 0; y <= 10000; y++) {
            p = x * a + y * b;
            if (p == c) {
                flag = 1;
                break;
            }
        }
        if (flag == 1) break;
    }
    if (flag == 1) printf("Yes");
    else printf("No");
    return 0;
}
```

We hope this example illustrates why input-feature distance or constraint-based clustering is **not an adequate way to measure redundancy** for algorithmic problems in competitive programming. In contrast, randomly dropping tests provides a more practical way to approximate redundancy, as it directly reflects the marginal contribution of test cases to evaluation precision. <!--In fact, the reviewer themselves used random test removal results to argue for HardTestGen's substantial redundancy, implicitly validating its use as a proxy.-->

#### 4c. Why HardTestGen is not as redundant as you may think

- **A 3% drop in precision is already notable.** We ran the random test-dropping experiment (sketched after this list) with SymPrompt and present the results, averaged across 20 runs, below. The extent of the precision drop is similar to the behavior (4% after 40% removal) we observed in HardTestGen.

  | Percentage to Keep (%) | 100 | 90 | 80 | 70 | 60 | 50 | 40 | 30 | 20 | 10 |
  |-|-|-|-|-|-|-|-|-|-|-|
  | SymPrompt Precision (%) | 64.74 | 63.12 | 61.45 | 61.25 | 60.84 | 59.40 | 57.36 | 54.45 | 52.00 | 50.73 |

- To further contextualize these results, we also conducted the same experiment on the AtCoder **official test cases** alongside HardTests, as shown below:

  | Percentage to Keep (%) | 100 | 90 | 80 | 70 | 60 | 50 | 40 | 30 | 20 | 10 | 5 | 4 | 3 |
  |-|-|-|-|-|-|-|-|-|-|-|-|-|-|
  | HardTests Precision (%) | 88.86 | 88.74 | 87.29 | 85.42 | 85.25 | 83.96 | 82.92 | 78.44 | 74.41 | 68.80 | 62.17 | 59.79 | 56.55 |
  | Official Precision (%) | 100.00 | 100.00 | 100.00 | 99.90 | 99.90 | 99.37 | 99.37 | 99.35 | 98.29 | 96.61 | 94.60 | 94.60 | 94.60 |

  As shown in the table, the drop in precision is much more significant in HardTests than in the official test suite. While removing 60% of the official test cases results in a <1% precision drop, the same reduction in HardTests leads to a notable 6.9% decrease in precision. Therefore, we argue that HardTests' tests are less redundant than the official test cases and as concise as those from SymPrompt, while achieving significantly higher precision than SymPrompt.

- The high precision of HardTests, despite its being only one-third the size of TACO, indicates that our test suite already **contains far less redundancy**. We would like to redirect the reviewer's attention to the result presented in the rebuttal, which clearly illustrates this point:

  | | # of tests | Precision | Recall |
  |-|-|-|-|
  | TACO | 109.6 | 69.97 | 72.53 |
  | HardTests | 39.9 | 85.64 | 94.77 |
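As referenced above, here is a minimal sketch of the random test-dropping protocol (hypothetical data layout, with `passes_all` standing in for an execution helper that runs a program on a set of tests; precision is taken as the fraction of accepted candidate programs that are actually correct):

```python
import random

def precision_after_dropping(problems, keep_ratio, passes_all, seed=0, runs=20):
    """Estimate suite precision after randomly keeping only `keep_ratio`
    of each problem's tests, averaged over `runs` random subsets.
    `problems` is a list of (tests, candidates) pairs, where each
    candidate is a (program, is_correct) pair."""
    rng = random.Random(seed)
    averages = []
    for _ in range(runs):
        accepted, correct_accepted = 0, 0
        for tests, candidates in problems:
            kept = rng.sample(tests, max(1, round(len(tests) * keep_ratio)))
            for program, is_correct in candidates:
                if passes_all(program, kept):
                    accepted += 1
                    correct_accepted += int(is_correct)
        averages.append(correct_accepted / accepted if accepted else 1.0)
    return sum(averages) / len(averages)
```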
### References

[1] Li, D., et al. (2025). LLMs can easily learn to reason from demonstrations: Structure, not content, is what matters! arXiv preprint arXiv:2502.07374.
[2] Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., ... & Wei, J. (2024). Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70), 1-53.
[3] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., ... & He, Y. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
[4] Team, K., Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., ... & Lin, Z. (2025). Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599.

## Follow-Up 2

We appreciate your quick response, and we are glad our new results have addressed some of your concerns. We answer your remaining questions below.

### validator quality ... using open-source models

We further annotated the pass rate of input validators generated by Kimi-K2 for the same 100-problem subset. 98% of the input validators generated by Kimi-K2 are correct, slightly higher than those generated by GPT-4o.

### analyzing duplicated test inputs

To directly analyze duplicate inputs, we compare all pairs of test inputs by computing their **longest common subsequence** (LCS; see Wikipedia: Longest_common_subsequence). Every test input in our dataset is a string. Two test inputs `a` and `b` are considered **near-duplicates** if and only if:

- their lengths do not differ by more than 10%, i.e., `abs(len(a) - len(b)) / min(len(a), len(b)) <= 0.1`, and
- their longest common subsequence covers at least 80% of the shorter string, i.e., `len(lcs(a, b)) >= 0.8 * min(len(a), len(b))`.

For example, one longest common subsequence of `"8\n1 2 3 4 5 6 7 8"` and `"8\n1 3 2 4 5 6 7 8"` is `['8', '\n', '1', ' ', '2', ' ', '4', ' ', '5', ' ', '6', ' ', '7', ' ', '8']`, which has 15 elements, more than 80% of 17 (the length of both inputs). These two inputs are therefore considered near-duplicates.

We ran this analysis on 500 problems in our dataset and computed the proportion of test inputs that have at least one near-duplicate. We found that:

- No two test inputs are exactly identical, because we deduplicate while generating tests.
- On average, **0.8%** of a problem's test inputs have at least one near-duplicate.
- The problem with the most near-duplicates has **3.3%**, i.e., 3 near-duplicate inputs out of 100 test inputs.

Therefore, there is no significant test-input duplication in HardTests.
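For concreteness, here is a minimal sketch of this near-duplicate check, using a standard dynamic-programming LCS (quadratic per pair, which is affordable for the short inputs compared here):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b
    (classic O(len(a) * len(b)) dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def near_duplicates(a, b):
    """Near-duplicate criterion described above: lengths within 10% of
    each other, and the LCS covers at least 80% of the shorter input."""
    shorter, longer = sorted((len(a), len(b)))
    if (longer - shorter) / shorter > 0.1:
        return False
    return lcs_length(a, b) >= 0.8 * shorter
```

On the worked example, `near_duplicates("8\n1 2 3 4 5 6 7 8", "8\n1 3 2 4 5 6 7 8")` returns `True` (LCS length 15 against a threshold of 13.6).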

### dataset’s precision and real-world usability

As we argued before, we believe that **similar tests do not harm the dataset's precision or real-world usability**, because the model is only rewarded when a program passes all tests. For a given test suite, adding a test that is identical to one already in the suite does not change the set of programs that pass the suite. Conceptually, we have also shown an example where two very similar test inputs, `7 6 7` and `6 6 7`, cause very different program behaviors. Therefore, **test inputs that are similar in appearance may actually capture nuanced bugs in programs**. We believe this is also the rationale behind fuzzers such as AFL, which mutate inputs by flipping random bits.

We welcome any additional questions or concerns you may have. If our response has adequately addressed your feedback, we would be grateful if you would consider revising your evaluation.
