MMLU-Pro: methodology, history, and how to verify a published score

Q: What's the biggest pitfall when reporting MMLU-Pro?

Contamination is already setting in. MMLU-Pro is now public, so expect training-set leakage to grow over the next 12 months. The cleanest signal comes from scores reported in its first year.

Q: How do I verify a published MMLU-Pro score?

Use Benchlist. Run the benchmark via benchlist run mmlu-pro or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.

History

MMLU-Pro was released by TIGER-Lab in 2024 as an explicit successor to MMLU, addressing MMLU's saturation and contamination problems. It contains roughly 12,000 problems with up to 10 answer choices each (vs MMLU's 4), curated to require more reasoning and less surface recall.

Frontier models score 70–88% on MMLU-Pro vs 90%+ on MMLU, which restores meaningful headroom. This is the canonical knowledge-and-reasoning benchmark for the 2025–2026 era.

How MMLU-Pro is graded

Multiple-choice with up to 10 options (A–J). Grading is letter-match. Most reports use 5-shot prompting with chain-of-thought disclosed.
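
As a rough illustration of letter-match grading, the sketch below extracts the final answer letter from a chain-of-thought completion and compares it with the gold letter. The "answer is (X)" phrasing, the regex, and the function names are assumptions made for this example, not a Benchlist specification; a completion with no parseable letter is simply counted as wrong.

```python
import re
from typing import Optional

# Assumed answer format: the completion states "answer is (X)" with X in A-J.
# Only the phrase is case-insensitive; the letter itself must be uppercase,
# and the trailing \b rejects words such as "Apple".
ANSWER_RE = re.compile(r"(?i:answer is)\s*\(?([A-J])\b")


def extract_letter(completion: str) -> Optional[str]:
    """Return the last answer letter (A-J) stated in the completion, or None."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


def accuracy(completions: list[str], gold: list[str]) -> float:
    """Fraction of completions whose extracted letter equals the gold letter."""
    if len(completions) != len(gold):
        raise ValueError("completions and gold must have the same length")
    if not gold:
        return 0.0
    # An unparseable completion yields None, which never equals a gold letter.
    correct = sum(extract_letter(c) == g for c, g in zip(completions, gold))
    return correct / len(gold)
```

Taking the last match rather than the first is a design choice: chain-of-thought output often mentions candidate letters before committing to a final one.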

The dataset re-uses some MMLU questions but adds harder distractors, so a model that memorised MMLU will not get a free pass on MMLU-Pro.

Common pitfalls when reporting MMLU-Pro

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Contamination is already setting in. MMLU-Pro is now public, so expect training-set leakage to grow over the next 12 months. The cleanest signal comes from scores reported in its first year.
10-option formatting drift. Some models trip on 10-option lists. The score depends partly on whether the prompt format matches what the model saw in training.
Uneven subject difficulty. As with MMLU, the subject mix matters. Always report the per-subject breakdown so that scores can be compared fairly.
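
To make the subject-mix point concrete, here is a minimal sketch of a per-subject breakdown. The input shape, a list of (subject, is_correct) pairs, and the function names are assumptions for illustration. It also shows why a single headline number can mislead: the micro average (over all questions) and the macro average (over subjects) diverge when subjects differ in size and difficulty.

```python
from collections import defaultdict


def subject_breakdown(results):
    """Map each subject to (correct, total, accuracy).

    `results` is an iterable of (subject, is_correct) pairs (assumed shape).
    """
    counts = defaultdict(lambda: [0, 0])
    for subject, is_correct in results:
        counts[subject][0] += int(is_correct)
        counts[subject][1] += 1
    return {s: (c, n, c / n) for s, (c, n) in sorted(counts.items())}


def micro_and_macro(breakdown):
    """Return (micro, macro) accuracy from a subject_breakdown() result."""
    total_correct = sum(c for c, _, _ in breakdown.values())
    total = sum(n for _, n, _ in breakdown.values())
    micro = total_correct / total                                      # weighted by question count
    macro = sum(a for _, _, a in breakdown.values()) / len(breakdown)  # each subject weighted equally
    return micro, macro


# Made-up sample data, not real MMLU-Pro results.
sample = [("math", True), ("math", False), ("math", False), ("math", False), ("law", True)]
per_subject = subject_breakdown(sample)
micro, macro = micro_and_macro(per_subject)
```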

Live Benchlist leaderboard

The leaderboard shows the top attested scores from the Benchlist registry, loaded client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.

How to ship an MMLU-Pro score that nobody can challenge

Run MMLU-Pro on Benchlist

Benchlist runs the canonical MMLU-Pro sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
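
For readers unfamiliar with Merkle commitments, the following sketch shows the general idea of committing to a set of transcripts with a single hash. It is not Benchlist's actual construction, which the text above does not specify: SHA-256, the 0x00/0x01 domain-separation prefixes, and the rule of promoting an unpaired node are all assumptions chosen for this example. The Ed25519 signing step is omitted because it needs a third-party library.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: list[bytes]) -> bytes:
    """Return a 32-byte commitment over the ordered transcripts (illustrative scheme)."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Leaves and inner nodes use different prefixes so that a leaf
    # can never be reinterpreted as an inner node.
    level = [_h(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        nxt = [_h(b"\x01" + level[i] + level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # promote the unpaired node unchanged
        level = nxt
    return level[0]


# Hypothetical transcripts; changing any byte of any one changes the root.
root = merkle_root([b"transcript-1", b"transcript-2", b"transcript-3"])
tampered = merkle_root([b"transcript-1", b"transcript-2", b"transcript-3!"])
```

The value of such a commitment is that a verifier who holds the signed root can detect a change to any single transcript without trusting whoever served it.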

Hosted runner: POST a job and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "mmlu-pro",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'

Self-hosted: install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run mmlu-pro --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

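What is MMLU-Pro?

A multiple-choice knowledge-and-reasoning benchmark released by TIGER-Lab in 2024 as a successor to MMLU, with roughly 12,000 problems and up to 10 answer choices each.
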
How is MMLU-Pro scored?

Multiple-choice with up to 10 options (A–J). Grading is letter-match. Most reports use 5-shot prompting with chain-of-thought disclosed.

What are the canonical decoding parameters for MMLU-Pro?

Per the catalog, MMLU-Pro runs at temperature 0.0 with max_tokens 512. Deviating without disclosure makes scores incomparable.
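
One simple way to keep this honest is to check a run's decoding settings against the canonical values before publishing. The sketch below assumes the settings are available as a plain dict with temperature and max_tokens keys; that shape and the function name are hypothetical and do not describe the run.json schema.

```python
# Canonical values as stated above; the dict layout is an assumption.
CANONICAL = {"temperature": 0.0, "max_tokens": 512}


def decoding_deviations(run_config: dict) -> dict:
    """Return {param: (expected, actual)} for each canonical parameter
    that is missing from run_config or differs from the canonical value."""
    return {
        name: (expected, run_config.get(name))
        for name, expected in CANONICAL.items()
        if run_config.get(name) != expected
    }


ok = decoding_deviations({"temperature": 0.0, "max_tokens": 512})
drifted = decoding_deviations({"temperature": 0.7, "max_tokens": 512})
```

Any non-empty result is something to disclose alongside the score.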