AgentBench

ICLR'24 benchmark that evaluates LLM agents across ALFWorld, database, knowledge-graph, operating-system, and WebShop environments; runs via Docker Compose and exposes a function-calling interface.

Tags: evals, sandbox, rag, workflow, python
Stars: 3.5k
Adoption surface: complex
Autonomy: headless
Recovery: none
License: ✅ open-source
Category: Evaluation and benchmarking harnesses

Repository ↗ Example: AgentBench ICLR'24 paper ↗
