AgencyBench

Long-horizon agent benchmark with 32 scenarios and 138 tasks, involving roughly 1M tokens and 90 tool calls; evaluation runs in a Docker sandbox with rubric-based and LLM judges.

evals, sandbox, python
Stars: 87
Adoption surface: complex
Autonomy: headless
Recovery: none
License: ✅ open-source
Category: Evaluation and benchmarking harnesses

Repository ↗
Example: AgencyBench leaderboard ↗
