Evaluation and benchmarking harnesses

Agentic eval systems, reasoning benchmarks, and open agent benchmarks.

#	Project	Tags	Stars	Tier	OSS	Description
1	Agent Lightning	evals, training, python	17.3k	complex	✅	Microsoft's training-oriented harness: optimization loops for agent behavior, for when you need to improve policies over rollouts, not only score a fixed prompt.
2	SWE-bench	evals, sandbox, python	5.2k	slightly complex	✅	LMs resolve real GitHub issues; Docker harness, instance IDs; standard for code-agent evals.
3	AgentBench	evals, sandbox, rag, workflow, python	3.5k	complex	✅	ICLR'24 benchmark: agents across AlfWorld, DB, knowledge graphs, OS, webshop; Docker Compose, function-calling interface.
4	inspect_ai	evals, sandbox, python	2.2k	complex	✅	Inspect AI core: composable eval tasks, sandboxes, scorers, and multi-model runs; the framework behind inspect_evals, not just the task bundle.
5	WebArena	python	1.5k	complex	✅	Realistic web env (e.g. e-commerce, CMS, dev tools); 812 tasks; measures end-to-end web agent success.
6	WebVoyager	evals, vision	1.1k	slightly complex	✅	End-to-end web agent with LMMs: screenshots + actions on real sites; benchmark on 15 sites, GPT-4V for automatic eval.
7	ARC-AGI-2	-	715	super simple	✅	ARC Prize task set: grid-based abstraction/reasoning; public and private splits for generalization.
8	SWE-Gym	evals, training, python	694	slightly complex	✅	Training and evaluation for SWE agents and verifiers (ICML 2025).
9	swe-smith	training, python	681	slightly complex	✅	Data generation for SWE agents; 50k+ instances across 128 repos; used for SWE-agent-LM training.
10	inspect_evals	evals, sandbox	547	slightly complex	✅	UK AISI/Arcadia/Vector: GAIA and other evals in Inspect AI; level 1–3, sandboxed, tool-calling solvers.
11	arc-agi-benchmarking	evals, provider-agnostic, python	350	mostly simple	✅	Runner for ARC-AGI: multi-provider (OpenAI, Anthropic, Gemini, etc.), rate limits, retries, and scoring.
12	VitaBench	-	145	complex	✅	ICLR'26: 66 tools, real-world apps (delivery, travel, retail); 100 cross-scenario + 300 single-scenario tasks; adopted by Qwen/Seed.
13	AgencyBench	evals, sandbox, python	87	complex	✅	Long-horizon agent benchmark: 32 scenarios, 138 tasks, ~1M tokens and ~90 tool calls; Docker sandbox and rubric-based + LLM judges.
14	letta-evals	memory, python	72	mostly simple	✅	Eval harness for stateful Letta agents; configurable suites and grading (LLM or rule-based) so you can measure what you ship.
15	SUPER	sandbox, python	53	slightly complex	✅	Agents that set up and run ML/NLP from GitHub repos; 45 expert problems, 152 masked tasks, 602 AutoGen tasks; Docker-based.
16	TRAIL	-	19	mostly simple	✅	Trace reasoning and agentic issue localization; 148 long-context traces, 841 errors, 20+ error types; Hugging Face dataset.
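
Several harnesses in this list share the same basic loop: call a model per task, retry on transient provider errors, and score the output against a reference. The sketch below illustrates that loop for grid-style tasks with exact-match scoring. It is not the API of any project listed here: `TransientError`, `call_with_retries`, `score_task`, `run_benchmark`, the task dictionary keys, and the `attempts_per_task` parameter are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Dict, List

Grid = List[List[int]]


class TransientError(Exception):
    """Hypothetical stand-in for a provider rate-limit or timeout error."""


def call_with_retries(fn: Callable[[], Grid], max_attempts: int = 3,
                      base_delay: float = 0.0) -> Grid:
    """Call fn, retrying on TransientError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
    raise AssertionError("loop always returns or raises")


def score_task(predictions: List[Grid], expected: Grid) -> bool:
    """A task is solved if any attempt matches the expected grid exactly."""
    return any(pred == expected for pred in predictions)


def run_benchmark(tasks: List[Dict[str, Grid]],
                  solver: Callable[[Grid, int], Grid],
                  attempts_per_task: int = 2) -> float:
    """Return the fraction of tasks solved (0.0 for an empty task list)."""
    if not tasks:
        return 0.0
    solved = 0
    for task in tasks:
        predictions = []
        for k in range(attempts_per_task):
            # Bind the current task and attempt index as lambda defaults.
            predictions.append(
                call_with_retries(lambda t=task, i=k: solver(t["input"], i))
            )
        if score_task(predictions, task["expected"]):
            solved += 1
    return solved / len(tasks)


if __name__ == "__main__":
    # Toy solver (an assumption for the demo): flips 0 <-> 1 in every cell.
    def flip_solver(grid: Grid, attempt: int) -> Grid:
        return [[1 - cell for cell in row] for row in grid]

    demo_tasks = [
        {"input": [[0, 1], [1, 0]], "expected": [[1, 0], [0, 1]]},
        {"input": [[2, 2]], "expected": [[3, 3]]},
    ]
    print(f"accuracy = {run_benchmark(demo_tasks, flip_solver):.2f}")
```

Real harnesses add the parts this sketch leaves out, such as sandboxed execution, per-provider rate limiting, and rubric or LLM-based judges for tasks where exact match is too strict.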