SWE-bench
Benchmark in which language models resolve real GitHub issues. Ships a Docker-based evaluation harness, with each task identified by an instance ID; a standard for code-agent evals.
evals, sandbox, python
- Stars: 5.2k
- Adoption surface: slightly complex
- Autonomy: headless
- Recovery: resumable
- License: ✅ open-source
Repository ↗
Example: SWE-bench Verified leaderboard ↗
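Because the harness keys everything on instance IDs, a run usually starts from a predictions file with one record per instance. The sketch below shows one way to build such a file in Python. It assumes the three field names described in the SWE-bench README (`instance_id`, `model_name_or_path`, `model_patch`); verify them against the harness version you use. The helper names, the instance ID, the model name and the patch text are all hypothetical and only illustrate the shape of the data.

```python
import json

# Field names as described in the SWE-bench README (an assumption here;
# check them against the harness version you actually run).
REQUIRED_KEYS = ("instance_id", "model_name_or_path", "model_patch")


def make_prediction(instance_id: str, model_name: str, patch: str) -> dict:
    """Build one prediction record for a single benchmark instance."""
    if not instance_id:
        raise ValueError("instance_id must be non-empty")
    return {
        "instance_id": instance_id,
        "model_name_or_path": model_name,
        "model_patch": patch,
    }


def to_jsonl(predictions: list[dict]) -> str:
    """Serialize records as JSON Lines, rejecting incomplete or duplicate entries."""
    seen = set()
    lines = []
    for pred in predictions:
        missing = [key for key in REQUIRED_KEYS if key not in pred]
        if missing:
            raise ValueError(f"missing keys: {missing}")
        if pred["instance_id"] in seen:
            raise ValueError(f"duplicate instance_id: {pred['instance_id']}")
        seen.add(pred["instance_id"])
        lines.append(json.dumps(pred))
    # One JSON object per line; an empty input yields an empty string.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Hypothetical instance ID, model name and patch, for illustration only.
    demo = [
        make_prediction(
            "example__repo-1234",
            "my-agent",
            "diff --git a/f.py b/f.py\n",
        )
    ]
    with open("predictions.jsonl", "w", encoding="utf-8") as fh:
        fh.write(to_jsonl(demo))
```

Rejecting duplicate instance IDs before writing is a design choice of this sketch, not a documented harness requirement. The resulting file is what the evaluation harness is pointed at; see the repository for the exact command.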