AgentBench
ICLR'24 benchmark evaluating LLM agents across ALFWorld, database, knowledge-graph, OS, and WebShop environments; runs via Docker Compose with a function-calling interface.
evals, sandbox, rag, workflow, python
- Stars: 3.5k
- Adoption surface: complex
- Autonomy: headless
- Recovery: none
- License: ✅ open-source
Repository ↗
Example: AgentBench ICLR'24 paper ↗