History
ARC-AGI was released by François Chollet in 2019 alongside the paper On the Measure of Intelligence. It comprises 1,000 visual reasoning tasks, each with a small number of input-output examples and a held-out test grid.
LLMs made little headway on ARC-AGI until late 2024, when OpenAI's o3 reached ~88% in a high-compute configuration. That result reframed the benchmark as a test of inference-time reasoning budget.
How ARC-AGI is graded
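For each task, the model sees 2–5 example pairs of grids (up to 30×30 cells) and must produce the output grid for a held-out input. Grading is exact-match: the predicted grid must equal the target in shape and in every cell, with no partial credit.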
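A minimal sketch of exact-match grading, assuming grids are stored as nested lists of integer colour indices. The function names here are illustrative and are not taken from any official harness.

```python
from typing import List

Grid = List[List[int]]  # rows of integer colour indices


def exact_match(predicted: Grid, expected: Grid) -> bool:
    """True only if both grids have the same shape and identical cells."""
    return predicted == expected


def score(predictions: List[Grid], targets: List[Grid]) -> float:
    """Fraction of held-out test grids reproduced exactly."""
    if len(predictions) != len(targets):
        raise ValueError("need exactly one prediction per target")
    if not targets:
        return 0.0
    solved = sum(exact_match(p, t) for p, t in zip(predictions, targets))
    return solved / len(targets)


if __name__ == "__main__":
    target = [[0, 1], [1, 0]]
    near_miss = [[0, 1], [1, 1]]  # one wrong cell still counts as a failure
    print(score([target, near_miss], [target, target]))  # 0.5
```

A single wrong cell, or a grid of the wrong size, scores zero for that task, which is why near-miss outputs do not show up in the headline number.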
The public set is for development; the private set is held by Chollet for the prize competition. Reported numbers should always specify which set.
Common pitfalls when reporting ARC-AGI
The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:
- Compute-budget dependency. ARC-AGI scores rise steeply with inference budget, though with diminishing returns rather than linearly: a model scoring 30% with $1 of compute can hit 80% with $1,000. Cost-normalised scores are essential.
- Public vs private leakage. The public training set is now in many model training mixes. Only the private hidden set is a clean signal.
- Grid representation choice matters. Different prompt formats (JSON, ASCII, image) yield differences of 10+ percentage points for the same model.
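One way to keep these pitfalls visible is to carry the context alongside the number. The sketch below is illustrative only: `ReportedScore`, its fields, and the two serialisers are hypothetical names, not part of any Benchlist or ARC tooling. It shows the same grid in two text formats and a summary line that states the evaluation set, prompt format, and cost per task.

```python
import json
from dataclasses import dataclass
from typing import List

Grid = List[List[int]]


def grid_to_json(grid: Grid) -> str:
    """Compact JSON form, e.g. [[0,1],[1,0]]."""
    return json.dumps(grid, separators=(",", ":"))


def grid_to_ascii(grid: Grid) -> str:
    """One text row per grid row, one digit per cell."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)


@dataclass
class ReportedScore:
    """Hypothetical record: a score plus the context needed to interpret it."""
    solved: int
    attempted: int
    total_cost_usd: float
    eval_set: str       # e.g. "public" or "private"
    prompt_format: str  # e.g. "json", "ascii", "image"

    @property
    def accuracy(self) -> float:
        return self.solved / self.attempted

    @property
    def cost_per_task_usd(self) -> float:
        return self.total_cost_usd / self.attempted

    def summary(self) -> str:
        return (f"{self.accuracy:.1%} on {self.eval_set} set, "
                f"{self.prompt_format} prompts, "
                f"${self.cost_per_task_usd:.2f}/task")


if __name__ == "__main__":
    grid = [[0, 1], [1, 0]]
    print(grid_to_json(grid))   # [[0,1],[1,0]]
    print(grid_to_ascii(grid))  # 01 then 10 on separate lines
    run = ReportedScore(15, 50, 50.0, "public", "ascii")
    print(run.summary())        # 30.0% on public set, ascii prompts, $1.00/task
```

Two runs with the same accuracy but different `eval_set`, `prompt_format`, or cost per task are not comparable, and the summary line makes that difference hard to overlook.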
Live Benchlist leaderboard
Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results backed by a real signed transcript always rank above vendor-disclosed ones.
How to ship an ARC-AGI score that nobody can challenge
Run ARC-AGI on Benchlist
Benchlist runs the canonical ARC-AGI sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
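To illustrate what a Merkle commitment over transcripts means, here is a minimal sketch using only the Python standard library. It is not Benchlist's actual scheme: the choice of SHA-256, the leaf and node prefixes, and the duplication of the last node on odd-sized levels are all assumptions made for the example. The Ed25519 signing step is omitted because it needs a third-party library.

```python
import hashlib
from typing import List


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: List[bytes]) -> str:
    """Hex Merkle root over transcripts (illustrative scheme, see assumptions)."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Assumption: prefix leaves with 0x00 and internal nodes with 0x01 so a
    # leaf hash can never be mistaken for an internal node hash.
    level = [_sha256(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate last node if odd
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


if __name__ == "__main__":
    original = [b"task-1 transcript", b"task-2 transcript", b"task-3 transcript"]
    tampered = [b"task-1 transcript", b"task-2 transcript (edited)", b"task-3 transcript"]
    print(merkle_root(original) == merkle_root(original))  # True
    print(merkle_root(original) == merkle_root(tampered))  # False
```

Because the root depends on every transcript, signing the root alone commits to the whole run: changing any single transcript after the fact produces a different root and invalidates the signature.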
Hosted runner: POST a job, and we email the verify URL when it's done.
curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "arc-agi",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
Self-hosted: install benchlist-runner via pip, point it at your inference key, and get a signed run.json.
pip install benchlist-runner
benchlist run arc-agi --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json
FAQ
What is ARC-AGI?
How is ARC-AGI scored?
What's the biggest pitfall when reporting ARC-AGI?
How do I verify a published ARC-AGI score?
Run benchlist run arc-agi or POST to /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.