History
Introduced by Rein et al. in "GPQA: A Graduate-Level Google-Proof Q&A Benchmark" (2023). The main set contains 448 PhD-level multiple-choice questions in biology, physics, and chemistry, written by domain experts and validated against multi-hour expert efforts.
GPQA Diamond is the 198-question high-quality subset. The 'Google-proof' framing means the questions are designed so that even with internet access, a non-expert cannot easily find the answer. By 2026, frontier models score 75–82% on Diamond.
How GPQA Diamond is graded
Four-option multiple-choice. Letter-match grading. Most papers use 5-shot prompting; some use zero-shot to avoid in-context contamination.
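A minimal sketch of letter-match grading, to make the scoring rule concrete. The answer pattern and the helper names (`extract_letter`, `grade`) are illustrative assumptions, not the extraction logic of any particular harness; real harnesses differ in how strictly they parse the model's final answer, and that choice alone can move the score.

```python
import re
from typing import Optional, Sequence

# Assumed answer format: "Answer: C" or "answer is (C)". The keyword is
# matched case-insensitively, the letter itself must be uppercase A-D.
_ANSWER_RE = re.compile(r"(?i:answer)\s*(?:(?i:is)|:)\s*\(?([ABCD])\b\)?")
# Fallback for a bare reply such as "B", "(B)" or "B."
_BARE_RE = re.compile(r"\(?([ABCD])\)?\.?")


def extract_letter(response: str) -> Optional[str]:
    """Return the last answer letter stated in a response, or None."""
    matches = _ANSWER_RE.findall(response)
    if matches:
        return matches[-1]
    bare = _BARE_RE.fullmatch(response.strip().upper())
    return bare.group(1) if bare else None


def grade(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    correct = sum(extract_letter(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)
```

An unparseable response counts as wrong here, which is the conservative choice; a harness that silently drops such responses from the denominator will report a higher number for the same transcripts.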
Human baselines from the GPQA paper: expert PhDs in the relevant domain achieved 65% on average; skilled non-experts with internet access achieved 34%. Random guessing on four options gives 25%. So a model scoring 70% is well beyond the non-expert baseline and also above the reported expert average.
Common pitfalls when reporting GPQA Diamond
The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:
- Tiny denominator. With 198 problems, each question is worth ~0.5pp, and the binomial standard error of a single run at 75–80% accuracy is roughly 3pp. Score gaps under 2pp are within sampling noise.
- Subject distribution. Diamond is not evenly split across subjects: of the 198 questions, 93 are chemistry, 86 are physics, and only 19 are biology. Models often have domain skews, so a 5pp gap can come entirely from one subject.
- Reasoning-mode dependency. GPQA is the canonical chain-of-thought benchmark. Without scratchpad, top models drop 20+ pp.
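The first two pitfalls can be checked with a few lines of arithmetic. The sketch below assumes each question is an independent Bernoulli trial and, for the gap test, that the two runs are independent samples; since both models answer the same 198 questions, a paired test would be tighter, so treat this as a rough screen. The function names and the `(subject, is_correct)` record shape are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

N_DIAMOND = 198  # question count of the Diamond subset


def standard_error(accuracy: float, n: int = N_DIAMOND) -> float:
    """Binomial standard error of an accuracy measured on n questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def gap_z_score(acc_a: float, acc_b: float, n: int = N_DIAMOND) -> float:
    """z-score of the gap between two runs, treated as independent samples."""
    se = math.sqrt(standard_error(acc_a, n) ** 2 + standard_error(acc_b, n) ** 2)
    return (acc_a - acc_b) / se


def per_subject_accuracy(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Accuracy per subject from (subject, is_correct) records."""
    totals = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, ok in results:
        totals[subject][0] += int(ok)
        totals[subject][1] += 1
    return {s: c / t for s, (c, t) in totals.items()}
```

For example, `standard_error(0.78)` is about 0.029 (2.9pp), and `gap_z_score(0.80, 0.78)` is below 0.5, far from any conventional significance threshold. Reporting `per_subject_accuracy` alongside the headline number shows whether a gap is broad or concentrated in one subject.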
Live Benchlist leaderboard
Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.
How to ship a GPQA Diamond score that nobody can challenge
Run GPQA Diamond on Benchlist
Benchlist runs the canonical GPQA Diamond sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
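To illustrate what a Merkle commitment over transcripts provides, here is a minimal sketch using SHA-256. The tree layout (0x00/0x01 domain-separation prefixes, unpaired nodes promoted unchanged) is an assumption made for illustration and is not claimed to match Benchlist's actual construction. The Ed25519 signing step is omitted because it needs a third-party library.

```python
import hashlib
from typing import List, Sequence


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: Sequence[bytes]) -> bytes:
    """Return a 32-byte root committing to every transcript, in order."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Leaves get a 0x00 prefix and interior nodes a 0x01 prefix, so a leaf
    # hash can never be reinterpreted as an interior node (assumed scheme).
    level: List[bytes] = [_h(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        nxt = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # promote the unpaired node unchanged
        level = nxt
    return level[0]
```

Because the root depends on every byte of every transcript, signing only the root is enough to bind the attestor to the full set: changing, dropping, or reordering any transcript yields a different root, and the signature no longer verifies.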
Hosted runner: POST a job and we email the verify URL when it's done.
curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "gpqa",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
Self-hosted: install benchlist-runner via pip, point it at your inference key, and get a signed run.json.
pip install benchlist-runner
benchlist run gpqa --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json
FAQ
What is GPQA Diamond?
How is GPQA Diamond scored?
What's the biggest pitfall when reporting GPQA Diamond?
How do I verify a published GPQA Diamond score?
Run benchlist run gpqa or POST to /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can verify the signature in their browser.