best-of-Agent-Harnesses

AgentBench

ICLR'24 benchmark that evaluates LLM agents across ALFWorld, database, knowledge-graph, OS, and WebShop environments; runs via Docker Compose and exposes a function-calling interface.

evals, sandbox, rag, workflow, python

Stars: 3.5k
Adoption surface: complex
Autonomy: headless
Recovery: none
License: ✅ open-source
Category: Evaluation and benchmarking harnesses

Repository
Example: AgentBench ICLR'24 paper
