best-of-Agent-Harnesses

AgencyBench

Long-horizon agent benchmark: 32 scenarios comprising 138 tasks, averaging ~1M tokens and ~90 tool calls per scenario; runs in a Docker sandbox and scores results with rubric-based and LLM judges.

Tags: evals, sandbox, python

Stars: 87
Adoption surface: complex
Autonomy: headless
Recovery: none
License: ✅ open-source
Category: Evaluation and benchmarking harnesses

Repository ↗
Example: AgencyBench leaderboard ↗

Related in Evaluation and benchmarking harnesses