ARC-AGI, methodology, history, and how to verify a published score

Q: What is ARC-AGI?

Released by François Chollet in 2019 as part of the On the Measure of Intelligence paper. 1,000 visual reasoning tasks, each with a small number of input-output examples and a held-out test grid.

Q: What's the biggest pitfall when reporting ARC-AGI?

Compute-budget dependency. ARC-AGI scores scale almost linearly with inference budget. A model scoring 30% with $1 of compute can hit 80% with $1000. Cost-normalised scores are essential.

Q: How do I verify a published ARC-AGI score?

Use Benchlist. Run via benchlist run arc-agi or POST /v1/run — the result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.

History

Released by François Chollet in 2019 as part of the On the Measure of Intelligence paper. 1,000 visual reasoning tasks, each with a small number of input-output examples and a held-out test grid.

ARC-AGI was effectively unsolvable for LLMs through 2024. OpenAI's o3 reached ~88% in late 2024 (with massive compute), reframing the benchmark as a test of inference-time reasoning budget.

How ARC-AGI is graded

Each task: model sees 2–5 example pairs of grids (~30×30 cells) and must produce the output grid for a held-out input. Exact-match grading.

Public set is for development; private set is held by Chollet for the prize competition. Reported numbers should always specify which set.

Common pitfalls when reporting ARC-AGI

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Compute-budget dependency. ARC-AGI scores scale almost linearly with inference budget. A model scoring 30% with $1 of compute can hit 80% with $1000. Cost-normalised scores are essential.
Public vs private leakage. The public training set is now in many model training mixes. Only the private hidden set is a clean signal.
Grid representation choice matters. Different prompt formats (JSON, ASCII, image) yield 10pp+ differences for the same model.

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised, attested results from a real signed transcript always rank above vendor-disclosed ones.

Top scores · ARC-AGI

Full leaderboard →

Loading…

How to ship a ARC-AGI score that nobody can challenge

Run ARC-AGI on Benchlist

Benchlist runs the canonical ARC-AGI sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.

Get an API key Read the docs →

Hosted runner, POST a job and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "arc-agi",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'

Self-hosted, install benchlist-runner via pip, point it at your inference key, get a signed run.json:

pip install benchlist-runner
benchlist run arc-agi --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

What is ARC-AGI?

Released by François Chollet in 2019 as part of the On the Measure of Intelligence paper. 1,000 visual reasoning tasks, each with a small number of input-output examples and a held-out test grid.

How is ARC-AGI scored?

Each task: model sees 2–5 example pairs of grids (~30×30 cells) and must produce the output grid for a held-out input. Exact-match grading.

What's the biggest pitfall when reporting ARC-AGI?

Compute-budget dependency. ARC-AGI scores scale almost linearly with inference budget. A model scoring 30% with $1 of compute can hit 80% with $1000. Cost-normalised scores are essential.

How do I verify a published ARC-AGI score?

Use Benchlist. Run via benchlist run arc-agi or POST /v1/run, the result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.

What are the canonical decoding parameters for ARC-AGI?

Per the catalog, ARC-AGI runs at temperature 0.0 with max_tokens 4096. Deviating without disclosure makes scores incomparable.