SimpleQA: methodology, history, and how to verify a published score

Q: What is SimpleQA?

SimpleQA was released by OpenAI in late 2024 as a hallucination probe: 4,326 short factual questions, each answerable with a single name, date, number, or place. Questions were written so that each had a single unambiguous reference answer at curation time.

Q: How is SimpleQA scored?

Scoring is generative: the model produces free text, and an external judge grades it against the canonical reference. The judge is itself an LLM (GPT-4-class), which introduces some noise.

Q: What's the biggest pitfall when reporting SimpleQA?

Judge-model bias. GPT-4-class judges tend to over-reward GPT-4-style answers. Run with multiple judges if possible.

Q: How do I verify a published SimpleQA score?

Use Benchlist. Run the benchmark with benchlist run simpleqa or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.

History

SimpleQA was released by OpenAI in late 2024 as a hallucination probe: 4,326 short factual questions, each answerable with a single name, date, number, or place. Questions were written so that each had a single unambiguous reference answer at curation time.

SimpleQA's headline finding: even frontier models hallucinate confidently on 25–40% of questions. The benchmark is the field's strongest current test of factual calibration.

How SimpleQA is graded

Scoring is generative: the model produces free text, and an external judge grades it against the canonical reference. The judge is itself an LLM (GPT-4-class), which introduces some noise.

Reported metrics: % correct, % incorrect (attempted but wrong), and % not attempted (declined to answer), plus the derived % correct given attempted. The 'declined' bucket is critical for calibration analysis.
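
The percentages can be derived from per-question judge grades. The following is a minimal sketch, assuming each grade is one of three labels; the label strings and the summarize helper are hypothetical names chosen for illustration, not part of any official grader.

```python
from collections import Counter

# Hypothetical grade labels; a real judge may emit different strings.
CORRECT, INCORRECT, NOT_ATTEMPTED = "correct", "incorrect", "not_attempted"


def summarize(grades):
    """Turn a list of per-question grades into percentage metrics."""
    total = len(grades)
    if total == 0:
        raise ValueError("no grades to summarize")
    counts = Counter(grades)
    attempted = counts[CORRECT] + counts[INCORRECT]
    return {
        "pct_correct": 100.0 * counts[CORRECT] / total,
        "pct_incorrect": 100.0 * counts[INCORRECT] / total,
        "pct_not_attempted": 100.0 * counts[NOT_ATTEMPTED] / total,
        # Accuracy restricted to the questions the model chose to answer.
        "pct_correct_given_attempted": (
            100.0 * counts[CORRECT] / attempted if attempted else 0.0
        ),
    }


# 10 questions: 6 correct, 2 wrong, 2 declined.
example = summarize([CORRECT] * 6 + [INCORRECT] * 2 + [NOT_ATTEMPTED] * 2)
print(example)
```

In this example the model is 60% correct overall but 75% correct on the questions it attempted, which is the gap the declined bucket makes visible.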

Common pitfalls when reporting SimpleQA

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Judge-model bias. GPT-4-class judges tend to over-reward GPT-4-style answers. Run with multiple judges if possible.
Calibration vs raw accuracy. A model that says "I don't know" 50% of the time and is right 80% of the rest is more useful than one that always answers and is right 60%. Look at the calibration curve, not just raw accuracy.
Time-anchored questions. Some answers change over time ("current president of X"). Versioning the question set matters.
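
The calibration comparison above is easier to judge as counts per 100 questions. This is a small worked sketch using only the rates quoted in that item; the helper name outcomes_per_100 is hypothetical.

```python
def outcomes_per_100(answer_rate, accuracy_when_answering):
    """Return (correct, incorrect, declined) counts per 100 questions."""
    answered = 100 * answer_rate
    correct = answered * accuracy_when_answering
    return round(correct), round(answered - correct), round(100 - answered)


# Declines half the time, right on 80% of what it answers.
cautious = outcomes_per_100(0.5, 0.8)   # (40, 10, 50)
# Always answers, right 60% of the time.
eager = outcomes_per_100(1.0, 0.6)      # (60, 40, 0)

print("cautious:", cautious)
print("eager:   ", eager)
```

The always-answering model gets 20 more questions right but produces four times as many confident wrong answers (40 versus 10), which is why raw accuracy alone hides the difference.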

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.

How to ship a SimpleQA score that nobody can challenge

Run SimpleQA on Benchlist

Benchlist runs the canonical SimpleQA sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
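
To make the idea of a Merkle commitment over transcripts concrete, here is a minimal sketch. It is not Benchlist's actual construction, which is not specified here: the SHA-256 hash, the UTF-8 leaf encoding, the leaf/node prefixes, and the rule of duplicating the last node on odd levels are all assumptions made for illustration. The Ed25519 signing step is omitted.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts):
    """Compute a hex SHA-256 Merkle root over a list of transcript strings."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Prefix leaves with 0x00 and interior nodes with 0x01 so a leaf hash
    # can never be mistaken for an interior node hash.
    level = [_h(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate last node on odd levels
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


root = merkle_root(["q1: Paris", "q2: 1969", "q3: 42"])
tampered = merkle_root(["q1: Paris", "q2: 1968", "q3: 42"])
print(root)
print(root != tampered)  # changing any transcript changes the root
```

The point of the commitment is that a single short root value pins down every transcript: a verifier who recomputes the root from the published transcripts will detect any edit to any one of them.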

Hosted runner: POST a job and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "simpleqa",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
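
The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl call above field for field; the helper name build_run_request is hypothetical, and the response format is not documented here, so the sketch builds the request without sending or parsing anything.

```python
import json
import os
import urllib.request


def build_run_request(api_key, model, limit=50):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "service": "anthropic-claude",
        "model": model,
        "benchmark": "simpleqa",
        "runs": 1,
        "limit": limit,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        "https://benchlist.ai/api/v1/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Falls back to a placeholder key so the sketch runs without credentials.
req = build_run_request(os.environ.get("BENCHLIST_KEY", "test-key"), "claude-sonnet-4.5")
print(req.get_method(), req.full_url)
# To submit for real: urllib.request.urlopen(req)
```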

Self-hosted: install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run simpleqa --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

What are the canonical decoding parameters for SimpleQA?

Per the catalog, SimpleQA runs at temperature 0.0 with max_tokens 256. Deviating without disclosure makes scores incomparable.
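
A simple pre-publication check can flag undisclosed deviations from these parameters. This is a minimal sketch, assuming the run's decoding settings are available as a plain dictionary; the dictionary shape and the decoding_deviations helper are hypothetical, not a Benchlist API.

```python
# Canonical values as stated above.
CANONICAL = {"temperature": 0.0, "max_tokens": 256}


def decoding_deviations(run_config):
    """Return {param: (expected, actual)} for each parameter that differs.

    A parameter missing from run_config is reported with actual=None.
    """
    return {
        name: (expected, run_config.get(name))
        for name, expected in CANONICAL.items()
        if run_config.get(name) != expected
    }


print(decoding_deviations({"temperature": 0.0, "max_tokens": 256}))  # {}
print(decoding_deviations({"temperature": 0.7, "max_tokens": 256}))
# {'temperature': (0.0, 0.7)}
```

An empty result means the run used the canonical settings; anything else should be disclosed alongside the score.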