History
MMLU-Pro was released by TIGER-Lab in 2024 as an explicit successor to MMLU, designed to address MMLU's saturation and contamination problems. It contains roughly 12,000 problems with up to 10 answer choices (versus MMLU's 4), curated to require more reasoning and less surface recall.
Frontier models score 70–88% on MMLU-Pro versus 90%+ on MMLU, which restores meaningful headroom. This is the canonical knowledge-and-reasoning benchmark for the 2025–2026 era.
How MMLU-Pro is graded
The format is multiple-choice with up to 10 options (A–J). Grading is letter-match: the letter the model selects is compared with the reference letter. Most reports use 5-shot prompting with chain-of-thought disclosed.
The dataset re-uses some MMLU questions but adds harder distractors, so a model that memorised MMLU will not get a free pass on MMLU-Pro.
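Letter-match grading can be sketched in a few lines. This is a minimal illustration, not Benchlist's or TIGER-Lab's actual grader: the "answer is (X)" pattern, the function names, and the rule that unparseable completions count as wrong are all assumptions made for the example.

```python
import re
from typing import List, Optional

# Assumed convention: the completion ends its reasoning with
# "The answer is (X)", where X is one of the ten letters A-J.
# The trailing \b stops words such as "definitely" from matching.
ANSWER_RE = re.compile(r"[Aa]nswer is \(?([A-J])\b")

def extract_letter(completion: str) -> Optional[str]:
    """Return the last answer letter in the completion, or None if absent."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None

def grade(completions: List[str], gold: List[str]) -> float:
    """Letter-match accuracy; completions with no parseable letter are wrong."""
    if not gold or len(completions) != len(gold):
        raise ValueError("need equal-length, non-empty completions and gold")
    correct = sum(extract_letter(c) == g for c, g in zip(completions, gold))
    return correct / len(gold)
```

Taking the last match rather than the first is a deliberate choice for chain-of-thought outputs, where a model may mention candidate letters before committing to a final one.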
Common pitfalls when reporting MMLU-Pro
The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:
- Already entering contamination. MMLU-Pro is now public; expect training-set leakage to grow over the next 12 months. The 'cleanest' signal is its first year.
- 10-option formatting drift. Some models trip on 10-option lists. The score depends partly on whether the prompt format matches what the model saw in training.
- Uneven subject difficulty. As with MMLU, the subject mix matters. Always report the per-subject breakdown so that comparisons are fair.
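To see why the subject mix matters, the sketch below computes a per-subject breakdown alongside two aggregate numbers: micro accuracy (every question weighted equally) and macro accuracy (every subject weighted equally). The function name and the input shape, an iterable of (subject, is_correct) pairs, are assumptions for illustration and not a Benchlist format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def subject_breakdown(
    results: Iterable[Tuple[str, bool]],
) -> Tuple[Dict[str, float], float, float]:
    """Return (per-subject accuracy, micro accuracy, macro accuracy)."""
    totals = defaultdict(lambda: [0, 0])  # subject -> [correct, count]
    for subject, is_correct in results:
        totals[subject][0] += int(is_correct)
        totals[subject][1] += 1
    if not totals:
        raise ValueError("no results to aggregate")
    per_subject = {s: c / n for s, (c, n) in totals.items()}
    n_all = sum(n for _, n in totals.values())
    micro = sum(c for c, _ in totals.values()) / n_all      # per-question mean
    macro = sum(per_subject.values()) / len(per_subject)    # per-subject mean
    return per_subject, micro, macro
```

When subjects differ in size and difficulty, the micro and macro figures can diverge noticeably, so a single headline number is ambiguous unless the report says which one it is.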
Live Benchlist leaderboard
Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.
How to ship an MMLU-Pro score that nobody can challenge
Run MMLU-Pro on Benchlist
Benchlist runs the canonical MMLU-Pro sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
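The idea of a Merkle commitment over transcripts can be sketched as follows. This is an illustration under stated assumptions, not Benchlist's actual construction: SHA-256 as the hash, the 0x00/0x01 leaf and node prefixes, and duplicating the last node on odd levels are all choices made for the example. Signing the resulting root with Ed25519 is omitted because it needs a third-party library.

```python
import hashlib
from typing import List

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(transcripts: List[bytes]) -> bytes:
    """Return a 32-byte root committing to every transcript, in order."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Prefix leaves with 0x00 and interior nodes with 0x01 so a leaf
    # can never be mistaken for an interior node (assumed scheme).
    level = [_h(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumed rule: duplicate the odd node out
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

Because the root depends on every leaf, changing a single transcript changes the root, which is what lets one signature over the root cover the whole run.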
Hosted runner. POST a job and we email the verify URL when it's done:
curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "mmlu-pro",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:
pip install benchlist-runner
benchlist run mmlu-pro --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json
FAQ
What is MMLU-Pro?
How is MMLU-Pro scored?
What's the biggest pitfall when reporting MMLU-Pro?
How do I verify a published MMLU-Pro score?
Run benchlist run mmlu-pro or POST to /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.