AgencyBench
Long-horizon agent benchmark: 32 scenarios and 138 tasks, averaging ~1M tokens and ~90 tool calls; runs in a Docker sandbox and is scored by rubric-based and LLM judges.
evals, sandbox, python
- Stars: 87
- Adoption surface: complex
- Autonomy: headless
- Recovery: none
- License: ✅ open-source
Repository ↗ Example: AgencyBench leaderboard ↗