ARC-Challenge: methodology, history, and how to verify a published score

Q: What is ARC-Challenge?

ARC, the AI2 Reasoning Challenge, was released by AI2 in "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge" (Clark et al., 2018). The Challenge subset (1,172 test questions) contains only questions that both a retrieval-based baseline and a word co-occurrence baseline answered incorrectly.

Q: How is ARC-Challenge scored?

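Four-option multiple choice (a few questions have 3 or 5 options; most evaluations normalise these to 4). Grading is by letter match: the option letter the model picks is compared with the answer key. Both 5-shot and 25-shot results are reported in the literature.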

Q: What's the biggest pitfall when reporting ARC-Challenge?

Saturation. Frontier models score 95% or higher, so the benchmark no longer separates them. Use it to differentiate small models, not frontier ones.

Q: How do I verify a published ARC-Challenge score?

Use Benchlist. Run via benchlist run arc-challenge or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature check in their browser.

History

ARC, the AI2 Reasoning Challenge, was released by AI2 in "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge" (Clark et al., 2018). The Challenge subset (1,172 test questions) contains only questions that both a retrieval-based baseline and a word co-occurrence baseline answered incorrectly.

ARC was the canonical commonsense-reasoning benchmark from 2018 through 2023. Frontier models now reach 95%+ on the Challenge split, so it is useful mainly for comparing smaller models.

How ARC-Challenge is graded

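Four-option multiple choice (a few questions have 3 or 5 options; most evaluations normalise these to 4). Grading is by letter match: the option letter the model picks is compared with the answer key. Both 5-shot and 25-shot results are reported in the literature.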
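
As a rough illustration of letter-match grading, here is a minimal Python sketch. It is not Benchlist's grader: the function names are hypothetical, and it assumes the model is prompted to answer with the option letter first. It also assumes numeric answer keys (1-5) map onto A-E, since some ARC items are keyed by number rather than letter.

```python
from typing import Optional, Sequence

# Assumption: numeric answer keys map positionally onto letters.
_NUMERIC_TO_LETTER = {"1": "A", "2": "B", "3": "C", "4": "D", "5": "E"}
_VALID_LETTERS = {"A", "B", "C", "D", "E"}


def normalise_label(label: str) -> str:
    """Map an answer key such as '3' or 'c' onto an uppercase letter."""
    label = label.strip().upper()
    return _NUMERIC_TO_LETTER.get(label, label)


def extract_letter(completion: str) -> Optional[str]:
    """Return the option letter if the completion starts with one, else None.

    Strict on purpose: only the first whitespace-separated token is inspected,
    and it must be an uppercase letter A-E once surrounding punctuation is
    stripped, so 'B.', '(C)' and 'D' match but 'The answer is A' does not.
    """
    tokens = completion.strip().split()
    if not tokens:
        return None
    first = tokens[0].strip("().:,")
    return first if first in _VALID_LETTERS else None


def letter_match_accuracy(completions: Sequence[str], answer_keys: Sequence[str]) -> float:
    """Fraction of completions whose extracted letter equals the answer key."""
    if len(completions) != len(answer_keys) or not answer_keys:
        raise ValueError("need equal-length, non-empty inputs")
    correct = sum(
        extract_letter(c) == normalise_label(k)
        for c, k in zip(completions, answer_keys)
    )
    return correct / len(answer_keys)
```

How strict the extraction rule is changes the score, which is one reason two harnesses can report different numbers for the same model.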

Common pitfalls when reporting ARC-Challenge

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Saturation. Frontier models score 95% or higher, so the benchmark no longer separates them. Use it to differentiate small models, not frontier ones.
Easy + Challenge confusion. ARC ships with both an Easy split and a Challenge split. Cross-paper comparisons require the same split.
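
To put a number on the saturation problem, the sketch below estimates the sampling noise on an accuracy figure using the normal approximation to the binomial. This is a back-of-the-envelope calculation, not part of any Benchlist tooling, and the approximation gets rougher as accuracy approaches 100%. At 95% accuracy on the full 1,172-question test set, the 95% interval is roughly ±1.2 percentage points; on a 50-question sample it is roughly ±6 points.

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimated on n_questions items."""
    if n_questions <= 0 or not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1] and n_questions positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def ci95_half_width(accuracy: float, n_questions: int) -> float:
    """Half-width of a normal-approximation 95% confidence interval."""
    return 1.96 * accuracy_standard_error(accuracy, n_questions)


if __name__ == "__main__":
    for n in (1172, 50):
        print(n, round(100 * ci95_half_width(0.95, n), 2), "percentage points")
```

Two frontier models at 95.5% and 96.2% therefore sit within each other's noise band, while gaps between smaller models are typically much larger than that band.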

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.

How to ship an ARC-Challenge score that nobody can challenge

Run ARC-Challenge on Benchlist

Benchlist runs the canonical ARC-Challenge sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
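
For readers unfamiliar with Merkle commitments, the following sketch shows the general idea: every transcript is hashed into a leaf, and pairs of hashes are combined until a single root remains. It is a generic illustration, not Benchlist's construction. The hash function (SHA-256), the leaf and node prefixes (in the style of RFC 6962), and the handling of odd levels (duplicating the last node) are all assumptions, because the source does not specify them.

```python
import hashlib
from typing import Sequence


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: Sequence[str]) -> str:
    """Return a hex Merkle root committing to every transcript, in order.

    Assumed scheme: leaves are SHA-256(0x00 || transcript), inner nodes are
    SHA-256(0x01 || left || right), and an odd-sized level repeats its last node.
    """
    if not transcripts:
        raise ValueError("at least one transcript is required")
    level = [_sha256(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # pad the odd level by repeating the last hash
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Because the root depends on every leaf, editing a single transcript after the fact changes the root. Signing the root is then enough to cover the whole run.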

Hosted runner (POST a job and we email the verify URL when it's done):

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "arc-challenge",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'

Self-hosted (install benchlist-runner via pip, point it at your inference key, and get a signed run.json):

pip install benchlist-runner
benchlist run arc-challenge --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

What are the canonical decoding parameters for ARC-Challenge?

Per the catalog, ARC-Challenge runs at temperature 0.0 with max_tokens 128. Deviating without disclosure makes scores incomparable.
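
One way to make such deviations visible is to compare a run's decoding settings against the canonical values before publishing. The sketch below does only that. The config keys (temperature, max_tokens) mirror the parameter names quoted above, but the function and the dictionary layout are hypothetical and not part of any Benchlist schema.

```python
from typing import Any, Dict, Tuple

# Canonical values as quoted from the catalog in the answer above.
CANONICAL_DECODING = {"temperature": 0.0, "max_tokens": 128}


def decoding_deviations(run_config: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {param: (expected, actual)} for every differing or missing param."""
    return {
        name: (expected, run_config.get(name))
        for name, expected in CANONICAL_DECODING.items()
        if run_config.get(name) != expected
    }
```

An empty result means the run matches the canonical settings. Anything else should be disclosed alongside the score.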