SWE-bench

Benchmark in which language models must resolve real GitHub issues. Ships a Docker-based evaluation harness keyed by instance IDs; a widely used standard for code-agent evals.

Tags: evals, sandbox, python
Stars: 5.2k
Adoption surface: slightly complex
Autonomy: headless
Recovery: resumable
License: ✅ open-source
Category: Evaluation and benchmarking harnesses

Repository ↗ Example: SWE-bench Verified leaderboard ↗
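
The description above mentions a harness driven by instance IDs. As a rough illustration only, the sketch below writes a predictions file in the JSONL shape the SWE-bench README describes (`instance_id`, `model_name_or_path`, `model_patch`) and splits an instance ID into repository and issue/PR number. The helper names `write_predictions` and `split_instance_id`, the model name `my-agent`, and the patch text are hypothetical, and the `owner__repo-number` ID layout is an assumption based on IDs such as `sympy__sympy-20590`, not a documented guarantee.

```python
import json
from pathlib import Path

# Keys the harness README lists for each prediction record.
REQUIRED_KEYS = {"instance_id", "model_name_or_path", "model_patch"}


def split_instance_id(instance_id):
    """Split 'owner__repo-123' into ('owner/repo', 123).

    Assumes the observed naming convention; hypothetical helper.
    """
    repo_part, _, number = instance_id.rpartition("-")
    owner, _, name = repo_part.partition("__")
    return f"{owner}/{name}", int(number)


def write_predictions(predictions, path):
    """Write one JSON object per line and return the number written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as fh:
        for pred in predictions:
            missing = REQUIRED_KEYS - set(pred)
            if missing:
                raise ValueError(f"prediction is missing keys: {sorted(missing)}")
            fh.write(json.dumps(pred) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    preds = [
        {
            "instance_id": "sympy__sympy-20590",
            "model_name_or_path": "my-agent",  # hypothetical model name
            "model_patch": "PLACEHOLDER: unified diff produced by the agent",
        }
    ]
    print(write_predictions(preds, "predictions.jsonl"))
    print(split_instance_id("sympy__sympy-20590"))
```

A file like this is what the harness entry point (`python -m swebench.harness.run_evaluation`, with `--predictions_path` and `--run_id` per the project README) is meant to consume; check the repository for the current flags before relying on them.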
