Evaluation and benchmarking harnesses
Agentic eval systems, reasoning benchmarks, and open agent benchmarks.
| # | Project | Stars | Tier | OSS | Description |
|---|---|---|---|---|---|
| 1 | Agent Lightning `evals` `training` `python` | 17.3k | complex | ✅ | Microsoft's training-oriented harness: optimization loops for agent behavior, for when you need to improve policies over rollouts rather than only score a fixed prompt. |
| 2 | SWE-bench `evals` `sandbox` `python` | 5.2k | slightly complex | ✅ | LMs resolve real GitHub issues; Docker harness, instance IDs; standard for code-agent evals. |
| 3 | AgentBench `evals` `sandbox` `rag` `workflow` `python` | 3.5k | complex | ✅ | ICLR'24 benchmark: agents across AlfWorld, DB, knowledge graphs, OS, webshop; Docker Compose, function-calling interface. |
| 4 | inspect_ai `evals` `sandbox` `python` | 2.2k | complex | ✅ | Inspect AI core: composable eval tasks, sandboxes, scorers, and multi-model runs; the framework behind inspect_evals, not just the task bundle. |
| 5 | WebArena `python` | 1.5k | complex | ✅ | Realistic web env (e.g. e-commerce, CMS, dev tools); 812 tasks; measures end-to-end web agent success. |
| 6 | WebVoyager `evals` `vision` | 1.1k | slightly complex | ✅ | End-to-end web agent with LMMs: screenshots + actions on real sites; benchmark on 15 sites, GPT-4V for automatic eval. |
| 7 | ARC-AGI-2 | 715 | super simple | ✅ | ARC Prize task set: grid-based abstraction/reasoning; public and private splits for generalization. |
| 8 | SWE-Gym `evals` `training` `python` | 694 | slightly complex | ✅ | Training and evaluation for SWE agents and verifiers (ICML 2025). |
| 9 | swe-smith `training` `python` | 681 | slightly complex | ✅ | Data generation for SWE agents; 50k+ instances across 128 repos; used for SWE-agent-LM training. |
| 10 | inspect_evals `evals` `sandbox` | 547 | slightly complex | ✅ | UK AISI/Arcadia/Vector: GAIA (levels 1–3) and other evals in Inspect AI; sandboxed, tool-calling solvers. |
| 11 | arc-agi-benchmarking `evals` `provider-agnostic` `python` | 350 | mostly simple | ✅ | Runner for ARC-AGI: multi-provider (OpenAI, Anthropic, Gemini, etc.), rate limits, retries, and scoring. |
| 12 | VitaBench | 145 | complex | ✅ | ICLR'26: 66 tools, real-world apps (delivery, travel, retail); 100 cross-scenario + 300 single-scenario tasks; adopted by Qwen/Seed. |
| 13 | AgencyBench `evals` `sandbox` `python` | 87 | complex | ✅ | Long-horizon agent benchmark: 32 scenarios, 138 tasks, ~1M tokens and ~90 tool calls; Docker sandbox and rubric-based + LLM judges. |
| 14 | letta-evals `memory` `python` | 72 | mostly simple | ✅ | Eval harness for stateful Letta agents; configurable suites and grading (LLM or rule-based) so you can measure what you ship. |
| 15 | SUPER `sandbox` `python` | 53 | slightly complex | ✅ | Agents that set up and run ML/NLP experiments from GitHub repos; 45 expert problems, 152 masked tasks, 602 automatically generated tasks; Docker-based. |
| 16 | TRAIL | 19 | mostly simple | ✅ | Trace reasoning and agentic issue localization; 148 long-context traces, 841 errors, 20+ error types; Hugging Face dataset. |