# Scaling RL Environments for SWE Agents
Tianyi Zhang (tianyi.code at gmail dot com)
### TL;DR
> Automatically building SWE Bench-like environments can unlock RL training and could be very impactful. My initial experiments achieved an 85% success rate reconstructing instances from SWE Bench repos with Claude 3.7. I am open-sourcing everything to call for collaboration. Code at https://github.com/Tiiiger/swe-bench-infinite .
# Motivation
SWE Bench [1] is now the de facto benchmark for evaluating SWE agents. Its core insight can be summarized as follows: we take pull requests from public repos, split them into patches and associated tests, and then execute the tests to verify whether applying the patch enables previously failing tests to pass. Once we verify these patch-test pairs, we withhold the patch and challenge models to regenerate it.
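To make this concrete, the verification step boils down to a "fail-to-pass" check. Here is a minimal sketch of the idea (my own simplification, not the actual SWE Bench harness; `run_tests`, the patch-file paths, and the per-test pytest invocations are illustrative assumptions):

```python
import subprocess

def run_tests(repo_dir: str, test_ids: list[str]) -> dict[str, bool]:
    """Run each test ID with pytest and record whether it passed.
    (Simplified: SWE Bench parses full test logs instead.)"""
    results = {}
    for test_id in test_ids:
        proc = subprocess.run(["python", "-m", "pytest", "-q", test_id],
                              cwd=repo_dir, capture_output=True)
        results[test_id] = proc.returncode == 0
    return results

def verify_instance(repo_dir, base_commit, code_patch, test_patch, candidate_tests):
    """Keep tests that fail before the code patch and pass after it."""
    subprocess.run(["git", "checkout", base_commit], cwd=repo_dir, check=True)
    subprocess.run(["git", "apply", test_patch], cwd=repo_dir, check=True)
    before = run_tests(repo_dir, candidate_tests)   # expected to fail without the fix
    subprocess.run(["git", "apply", code_patch], cwd=repo_dir, check=True)
    after = run_tests(repo_dir, candidate_tests)    # expected to pass with the fix
    return [t for t in candidate_tests if not before[t] and after[t]]
```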
This insight is very generalizable. Following SWE Bench, subsequent efforts have applied it to additional repositories [2], multimodal applications [3], programming languages other than Python [4], and more. However, despite this generality, building these benchmarks remains highly manual and time-consuming. Three years ago, I built an executable benchmark called DS-1000, an experience that proved quite challenging. As a result, current executable datasets typically contain only a few thousand examples.
In parallel with data advancements, algorithm improvements have also accelerated. As Shunyu recently stated, we've entered the "second half of AI," where RL finally works. I strongly believe that developing automated methods to scale executable SWE environments is one of the most impactful directions right now.
After weighing a few ideas, I've decided to build on the SWE Bench insight, primarily because it scales well. My initial estimate suggests we could potentially scale up to between 100K and 1M executable instances, especially if we combine multiple programming languages. To put this number in context, although exact figures for cutting-edge RL models are hard to come by, one reference is DeepSeek R1, which utilizes about 600K reasoning-oriented examples in its supervised fine-tuning phase following reinforcement learning.
Over the past three weeks, I developed a simple baseline to assess feasibility and achieved promising results. Specifically, I was able to automatically set up environments from the original 12 SWE Bench repositories with an 85% success rate using Claude 3.7. There are nuances in interpreting this number: the evaluation leverages the manually constructed SWE Bench "testing pipeline", so fully automatic real-world performance might be slightly lower. Nevertheless, these results are encouraging.
Because I don't have publication pressure (graduating soon) and have quite limited time (doing a job search), I have decided to share everything and hope someone can pick up this work. At the end of the post, I outline a few research and engineering directions that are worth exploring. You can check out the code [here](https://github.com/Tiiiger/swe-bench-infinite). Have fun!
# Method
Given the constraints of a blog post, I assume readers have a basic understanding of how SWE Bench is constructed.
## Design Principles
To begin, let's dissect the manual tasks involved in constructing SWE Bench instances. Given a pull request containing both patch and tests, we typically need to:
1. Install the required dependencies
2. Identify relevant tests from the PR and write commands to run them
3. Parse test logs to verify if the patch passes new tests
The SWE Bench authors amortize this effort by grouping PRs into versions, using regex patterns over version files, and then specifying version-specific constants to simplify setup.
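For intuition, the version grouping relies on patterns of roughly this shape (an illustrative example, not the exact constants SWE Bench ships):

```python
import re

# Illustrative: extract a release version from a repo's version file so that
# PRs can be bucketed by version and share one environment specification.
VERSION_PATTERN = re.compile(r"__version__\s*=\s*['\"]([^'\"]+)['\"]")

def extract_version(version_file_text: str) -> str | None:
    match = VERSION_PATTERN.search(version_file_text)
    return match.group(1) if match else None

extract_version('__version__ = "3.2.1"')  # -> "3.2.1"
```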
For my initial experiment, *I focused exclusively on installing the required dependencies*. This choice was motivated by my past experience—dependency installation is usually the most challenging step—and the ease of evaluation: we simply replace the manually constructed environments in the original SWE Bench with automatically installed ones, and check if the tests pass.
Inspired largely by Agentless, I designed a general agent scaffold without any repository-specific hacks. The current pipeline is tailored for Python.
Before detailing the scaffold, I'd like to share a key trick. Since SWE Bench primarily involves well-established open-source repos, environment setup is generally straightforward for their current versions. However, the PRs of interest may be several years old, and the Python ecosystem's backward compatibility can be problematic. To address this, I developed a simple "time travel" scraper, which finds the newest available package versions at the PR's original timestamp.
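Here is a minimal sketch of that lookup against the PyPI JSON API (a simplified rendition of the idea; the actual scraper in the repo may differ):

```python
from datetime import datetime, timezone

import requests
from packaging.version import InvalidVersion, Version

def latest_version_before(package: str, cutoff: datetime) -> str | None:
    """Newest non-prerelease version of `package` on PyPI uploaded before `cutoff`."""
    data = requests.get(f"https://pypi.org/pypi/{package}/json", timeout=30).json()
    candidates = []
    for version, files in data.get("releases", {}).items():
        if not files:
            continue  # release with no uploaded files
        uploaded = min(
            datetime.fromisoformat(f["upload_time_iso_8601"].replace("Z", "+00:00"))
            for f in files
        )
        try:
            parsed = Version(version)
        except InvalidVersion:
            continue
        if not parsed.is_prerelease and uploaded <= cutoff:
            candidates.append(parsed)
    return str(max(candidates)) if candidates else None

# e.g. pin numpy to whatever was current when the PR was opened
latest_version_before("numpy", datetime(2019, 6, 1, tzinfo=timezone.utc))
```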
## Agent Scaffold
We independently set up environments for each PR, without version grouping. For each PR:
1. Base environment: Start from a base Ubuntu 20 image with essential packages (e.g., build-essential, libffi-dev).
2. Localization: Checkout the base commit, list all files using `tree`, and have the model identify up to ten files likely containing setup instructions (e.g., README.md, CI scripts, or documentation).
3. Collection: The model reads the selected files and specifies: Python version, apt-get packages, pip packages, and the repo installation command. We perform the time travel trick for pip packages.
4. Build and retry: Install specified packages using a fixed script. On installation errors, retry the collection step up to three times.
5. Test and retry: Run the SWE Bench-specified test commands. On runtime errors not captured by pytest, retry the collection step up to three times.
6. Report: Run the SWE Bench test log parser after applying the patch, and check if target tests pass.
Note: Steps 5 and 6 rely on human-annotated test commands and log parsers. All repos in the original SWE Bench use `pytest`, and all except two rely on standard `pytest` commands. I am very optimistic that we can automate steps 5 and 6; the current design is motivated by the ease of evaluation. In the strictest sense, you can understand this experiment as measuring only the environment-setup part of the end-to-end data collection pipeline.
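To make the scaffold concrete, here is a rough control-flow sketch of steps 1–6 (function names are hypothetical placeholders, not the actual APIs in swe-bench-infinite, and the two retry loops are collapsed into one for brevity):

```python
def setup_instance(pr, model, max_retries: int = 3) -> bool:
    container = start_base_container("ubuntu:20.04")         # step 1: base image + essential packages
    files = localize_setup_files(model, pr.base_commit)      # step 2: model picks <=10 files from `tree`
    feedback = None
    for _ in range(max_retries):
        spec = collect_spec(model, files, feedback)          # step 3: python/apt/pip + install command
        spec.pip_packages = time_travel_pin(spec.pip_packages, pr.created_at)
        ok, log = build_environment(container, spec)         # step 4: fixed install script
        if not ok:
            feedback = log                                   # feed build errors back to collection
            continue
        ok, log = run_annotated_tests(container, pr)         # step 5: human-annotated test commands
        if not ok:
            feedback = log                                   # feed runtime (non-pytest) errors back
            continue
        return check_fail_to_pass(container, pr)             # step 6: parse logs after applying the patch
    return False
```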
# Experiment
## Setup
As mentioned earlier, we reused instances from SWE Bench for evaluation. Due to the skewed distribution of repositories in the original SWE Bench, we conducted stratified sampling across repositories and timestamps to create a balanced set of 300 test instances. Our primary evaluation metric is the success rate of passing the designated target tests.
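The stratification itself is simple; something along these lines (an illustrative sketch that buckets timestamps by year and assumes a hypothetical `instances` DataFrame with `repo` and `created_at` columns, not the exact script used):

```python
import pandas as pd

def stratified_sample(instances: pd.DataFrame, n_total: int = 300,
                      seed: int = 0) -> pd.DataFrame:
    """Sample roughly evenly across (repo, year) buckets, then trim to n_total."""
    instances = instances.assign(year=instances["created_at"].dt.year)
    groups = instances.groupby(["repo", "year"], group_keys=False)
    per_bucket = max(1, n_total // groups.ngroups)
    sampled = groups.apply(
        lambda g: g.sample(min(len(g), per_bucket), random_state=seed)
    )
    return sampled.sample(min(n_total, len(sampled)), random_state=seed)
```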
This metric essentially measures the true positive rate, indicating our ability to reconstruct previously verified executable instances. Based on my experience, I expect a low false positive rate—meaning that environments able to correctly pass the provided tests for known patches should also reliably reject incorrect patches. However, it should be noted that this experiment does not measure the false positive rate.
While I did not perform hyperparameter tuning, I reviewed logs to resolve infrastructure-related issues. Therefore, the results presented here should be regarded as validation results rather than held-out test results.
## Results

Overall, our simple scaffold achieves an 85.77% success rate, far exceeding my initial expectations. We observed significant variance across repositories, with Flask and Seaborn achieving a perfect success rate (100%), whereas Astropy and Xarray came in lower, at around 60%.

Along the temporal dimension, more recent pull requests are easier to set up environments for.
Regarding costs, each run involved approximately 91k input tokens and 8.8k output tokens (including tokens used for reasoning), totaling around $0.40 per execution.
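As a sanity check on the per-run cost, here is the back-of-the-envelope arithmetic, assuming Claude 3.7 Sonnet list pricing of $3 per million input tokens and $15 per million output tokens (an assumption; actual pricing and prompt-caching discounts may differ):

```python
# Cost estimate under the assumed pricing above.
input_tokens, output_tokens = 91_000, 8_800
cost = input_tokens / 1e6 * 3.00 + output_tokens / 1e6 * 15.00
print(f"${cost:.2f} per run")  # ~$0.27 input + ~$0.13 output, roughly $0.40
```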
# Future Directions
### Short-term
1. Replace manually written test commands and log parsers with automated versions. A principled evaluation approach is to compare acceptance/rejection consistency across many samples.
2. Expand to more Python repositories, starting with the SWE Bench "train" split and then developing a PR scraper. Current benchmarks often focus on high-profile repos; however, in my experience, there are a number of repos that do not have many stars but still contain high-quality PRs.
3. Extend to other programming languages. Rust and TypeScript could be strong starting points because I suspect they have better backward compatibility.
4. Improve infrastructure. Shift to serverless or remote execution, as local Docker setups currently struggle with repos whose tests exercise network connections. I include more discussion in the repo.
### Medium-term
1. Clean up PR problem statements. Automatically constructed instances can be trivial or too ambiguous. Developing pipelines to refine problem statements would yield higher-quality datasets.
2. Synthesize tests for PRs that lack them. Many pull requests don't include tests. My prior is that synthetically generating tests is quite hard, but equipped with many executable demonstrations, I wonder if we can do better.
### Collaboration
Unfortunately, I won't have much time to work on this for a while. I hope someone can carry this on, and I believe it will be quite fruitful. If you are already working on similar directions, I would love to chat and learn. Reach me at tianyi.code at gmail dot com.
#### Acknowledgement
I would like to thank John Yang, Tatsunori Hashimoto, Yu Sun, Caroline Choi, and members of the broader Tatsu Lab for their valuable discussions and insights.
#### Citations
[1] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan. “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” International Conference on Learning Representations (ICLR), 2024.
[2] Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang. “Training Software Engineering Agents and Verifiers with SWE-Gym.” arXiv preprint, 2024.
[3] John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, Ofir Press. “SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?” International Conference on Learning Representations (ICLR), 2025.
[4] Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, Liang Xiang. “Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving.” arXiv preprint, 2025.