Official Leaderboards
There's an all-new, challenging SWE-bench Multimodal, containing software issues described with images.
mini-SWE-agent scores up to 74% on SWE-bench Verified in 100 lines of Python code.
Introducing CodeClash, our new evaluation where LMs compete head to head to write the best codebase!
SWE-bench Verified is a human-filtered subset of 500 instances; use the Agent dropdown to compare LMs with mini-SWE-agent or view all agents [Post].
SWE-bench Multilingual features 300 tasks across 9 programming languages [Post].
SWE-bench Lite is a subset curated for less costly evaluation [Post].
SWE-bench Multimodal features issues with visual elements [Post].
Each entry reports the % Resolved metric: the percentage of instances solved, out of 2294 for Full, 500 for Verified, 300 each for Lite and Multilingual, and 517 for Multimodal.
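As a rough illustration of how % Resolved relates to the instance counts above, here is a minimal Python sketch. The names `SPLIT_SIZES` and `percent_resolved` are hypothetical and not part of any SWE-bench tooling, and the resolved count in the usage line is made up for the example.

```python
# Instance counts per leaderboard split, as stated on this page.
SPLIT_SIZES = {
    "Full": 2294,
    "Verified": 500,
    "Lite": 300,
    "Multilingual": 300,
    "Multimodal": 517,
}


def percent_resolved(num_resolved: int, split: str) -> float:
    """Return the percentage of instances resolved for a given split.

    Hypothetical helper for illustration only.
    """
    total = SPLIT_SIZES[split]
    if not 0 <= num_resolved <= total:
        raise ValueError(f"num_resolved must be between 0 and {total}")
    return 100.0 * num_resolved / total


# Made-up example: 150 of the 300 Lite instances resolved.
print(f"{percent_resolved(150, 'Lite'):.1f}%")  # prints "50.0%"
```

Because the splits differ in size, the same number of resolved instances gives a different percentage on each leaderboard, so scores are only comparable within a split.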
News
- [11/2025] Introducing CodeClash, our new eval of LMs as goal-oriented (not task-oriented) developers! [Link]
- [07/2025] mini-SWE-agent scores 65% on SWE-bench Verified in 100 lines of Python code. [Link]
- [05/2025] SWE-smith is out! Train your own models for software engineering agents. [Link]
- [03/2025] SWE-agent 1.0 is the open source SOTA on SWE-bench Lite! [Link]
- [10/2024] Introducing SWE-bench Multimodal! [Link]
- [08/2024] SWE-bench x OpenAI = SWE-bench Verified [Report]
- [06/2024] Docker-ized SWE-bench for easier evaluation [Report]
- [03/2024] Check out SWE-agent (12.47% on SWE-bench) [Link]
- [03/2024] Released SWE-bench Lite [Report]
Acknowledgements
We thank the following institutions for their generous support: Open Philanthropy, AWS, Modal, Andreessen Horowitz, OpenAI, and Anthropic.
