arc-agi-benchmarking
Benchmark runner for ARC-AGI with multi-provider support (OpenAI, Anthropic, Gemini, etc.), rate limiting, retries, and scoring.
evals, provider-agnostic, python
- Stars: 350
- Adoption surface: mostly simple
- Autonomy: headless
- Recovery: retry
- License: ✅ open-source
Repository ↗
Example: o3 prompt example ↗