best-of-Agent-Harnesses

SWE-bench

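Benchmark in which language models (LMs) must resolve real GitHub issues. Runs in a Docker-based harness, with each task identified by an instance ID; widely used as a standard for code-agent evals.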

Tags: evals, sandbox, python

Stars: 5.2k
Adoption surface: slightly complex
Autonomy: headless
Recovery: resumable
License: ✅ open-source
Category: Evaluation and benchmarking harnesses

Repository ↗
Example: SWE-bench Verified leaderboard ↗
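
As a rough illustration of how instance IDs tie an agent's output to the harness, the sketch below writes a predictions file in the JSON Lines shape described in the SWE-bench documentation (one object per line with `instance_id`, `model_name_or_path`, and `model_patch`). The helper names `write_predictions` and `read_predictions`, the model name `demo-agent`, and the sample ID and patch text are placeholders for this example, not part of the listing or of the SWE-bench package.

```python
import json
import tempfile
from pathlib import Path


def write_predictions(patches, model_name, out_path):
    """Write one JSON object per line, keyed by SWE-bench instance ID.

    `patches` maps instance_id -> unified diff text produced by the agent.
    (Hypothetical helper; field names follow the documented predictions format.)
    """
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as fh:
        # Sort by instance ID so reruns produce a stable file.
        for instance_id, patch in sorted(patches.items()):
            record = {
                "instance_id": instance_id,
                "model_name_or_path": model_name,
                "model_patch": patch,
            }
            fh.write(json.dumps(record) + "\n")
    return out_path


def read_predictions(path):
    """Load the JSON Lines file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    # Placeholder instance ID and patch, for illustration only.
    sample = {"sympy__sympy-20590": "diff --git a/x.py b/x.py\n"}
    with tempfile.TemporaryDirectory() as tmp:
        path = write_predictions(sample, "demo-agent", Path(tmp) / "preds.jsonl")
        print(len(read_predictions(path)), "prediction(s) written")
```

A file like this is what the Docker harness consumes; at the time of writing the project README invokes it as `python -m swebench.harness.run_evaluation` with `--dataset_name`, `--predictions_path`, and `--run_id`, but check the repository for the current flags.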

Related in Evaluation and benchmarking harnesses