best-of-Agent-Harnesses

arc-agi-benchmarking

Benchmark runner for ARC-AGI with multi-provider support (OpenAI, Anthropic, Gemini, etc.), rate limiting, retries, and scoring.

evals, provider-agnostic, python

Stars: 350
Adoption surface: mostly simple
Autonomy: headless
Recovery: retry
License: ✅ open-source
Category: Evaluation and benchmarking harnesses

Repository ↗
Example: o3 prompt example ↗

Related in Evaluation and benchmarking harnesses