Evaluation and benchmarking harnesses

Agentic eval systems, reasoning benchmarks, and open agent benchmarks.

| # | Project | Tags | Stars | Tier | Description |
|---|---------|------|-------|------|-------------|
| 1 | Agent Lightning | evals, training, python | 17.3k | complex | Microsoft's training-oriented harness: optimization loops for agent behavior, for when you need to improve policies over rollouts, not only score a fixed prompt. |
| 2 | SWE-bench | evals, sandbox, python | 5.2k | slightly complex | LMs resolve real GitHub issues; Docker harness, instance IDs; standard for code-agent evals. |
| 3 | AgentBench | evals, sandbox, rag, workflow, python | 3.5k | complex | ICLR'24 benchmark: agents across AlfWorld, DB, knowledge graphs, OS, webshop; Docker Compose, function-calling interface. |
| 4 | inspect_ai | evals, sandbox, python | 2.2k | complex | Inspect AI core: composable eval tasks, sandboxes, scorers, and multi-model runs; the framework behind inspect_evals, not just the task bundle. |
| 5 | WebArena | python | 1.5k | complex | Realistic web env (e.g. e-commerce, CMS, dev tools); 812 tasks; measures end-to-end web agent success. |
| 6 | WebVoyager | evals, vision | 1.1k | slightly complex | End-to-end web agent with LMMs: screenshots + actions on real sites; benchmark on 15 sites, GPT-4V for automatic eval. |
| 7 | ARC-AGI-2 | | 715 | super simple | ARC Prize task set: grid-based abstraction/reasoning; public and private splits for generalization. |
| 8 | SWE-Gym | evals, training, python | 694 | slightly complex | Training and evaluation for SWE agents and verifiers (ICML 2025). |
| 9 | swe-smith | training, python | 681 | slightly complex | Data generation for SWE agents; 50k+ instances across 128 repos; used for SWE-agent-LM training. |
| 10 | inspect_evals | evals, sandbox | 547 | slightly complex | UK AISI/Arcadia/Vector: GAIA and other evals in Inspect AI; level 1–3, sandboxed, tool-calling solvers. |
| 11 | arc-agi-benchmarking | evals, provider-agnostic, python | 350 | mostly simple | Runner for ARC-AGI: multi-provider (OpenAI, Anthropic, Gemini, etc.), rate limits, retries, and scoring. |
| 12 | VitaBench | | 145 | complex | ICLR'26: 66 tools, real-world apps (delivery, travel, retail); 100 cross-scenario + 300 single-scenario tasks; adopted by Qwen/Seed. |
| 13 | AgencyBench | evals, sandbox, python | 87 | complex | Long-horizon agent benchmark: 32 scenarios, 138 tasks, ~1M tokens and ~90 tool calls; Docker sandbox and rubric-based + LLM judges. |
| 14 | letta-evals | memory, python | 72 | mostly simple | Eval harness for stateful Letta agents; configurable suites and grading (LLM or rule-based) so you can measure what you ship. |
| 15 | SUPER | sandbox, python | 53 | slightly complex | Agents that set up and run ML/NLP from GitHub repos; 45 expert problems, 152 masked tasks, 602 AutoGen tasks; Docker-based. |
| 16 | TRAIL | | 19 | mostly simple | Trace reasoning and agentic issue localization; 148 long-context traces, 841 errors, 20+ error types; Hugging Face dataset. |
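
For the grid-based ARC-AGI entries above, scoring can be pictured as exact-match comparison of predicted and expected grids. The following is a minimal sketch, not the code of arc-agi-benchmarking or the ARC Prize scorer. It assumes grids are lists of lists of integers and that a task counts as solved if any of a limited number of attempts matches exactly. The names `grid_solved` and `score`, and the default `max_attempts=2`, are hypothetical choices made for illustration.

```python
from typing import List, Sequence

# Assumption: a grid is a rectangular list of rows of integer cell values.
Grid = List[List[int]]


def grid_solved(attempts: Sequence[Grid], expected: Grid, max_attempts: int = 2) -> bool:
    """Return True if any of the first `max_attempts` grids equals `expected` exactly.

    Exact match means identical shape and identical cell values; there is
    no partial credit in this sketch.
    """
    return any(attempt == expected for attempt in attempts[:max_attempts])


def score(predictions: Sequence[Sequence[Grid]], targets: Sequence[Grid]) -> float:
    """Fraction of tasks solved, where predictions[i] holds the attempts for targets[i]."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    solved = sum(grid_solved(a, t) for a, t in zip(predictions, targets))
    return solved / len(targets)


if __name__ == "__main__":
    targets = [[[1, 0], [0, 1]], [[2, 2]]]
    predictions = [
        [[[0, 0], [0, 1]], [[1, 0], [0, 1]]],  # second attempt matches
        [[[2, 3]]],                            # no attempt matches
    ]
    print(score(predictions, targets))  # 0.5
```

A real runner would wrap this in provider calls, retries, and rate limiting, as the arc-agi-benchmarking description notes; only the comparison step is shown here.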