best-of-Agent-Harnesses

WebVoyager

End-to-end web agent built on large multimodal models (LMMs): it reads screenshots and performs actions on real websites. Ships with a benchmark covering 15 sites and uses GPT-4V for automatic evaluation.

evals, vision

Stars: 1.1k
Adoption surface: slightly complex
Autonomy: headless
Recovery: none
License: ✅ open-source
Category: Evaluation and benchmarking harnesses

Repository ↗
Example: 643 web tasks dataset ↗
