MMLU: methodology, history, and how to verify a published score

History

MMLU was introduced by Hendrycks et al. in "Measuring Massive Multitask Language Understanding" (2021). Its test set contains 14,042 four-choice questions across 57 subjects, ranging from elementary level through professional exams such as US licensing tests.

MMLU was the standard knowledge benchmark from 2021 through 2024. Top models now score 90%+, putting MMLU near saturation. MMLU-Pro is the current-generation successor.

How MMLU is graded

Four-choice multiple-choice. Grading is letter-match. The published convention is 5-shot prompting (5 in-context examples before each question), though zero-shot is increasingly reported.
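
As a rough illustration of the convention above, the sketch below builds a k-shot prompt and applies letter-match grading. The header wording and the "Answer:" layout follow the commonly used MMLU prompt format as an assumption; the function names are hypothetical and this is not Benchlist's own implementation.

```python
CHOICES = ["A", "B", "C", "D"]


def format_example(question, options, answer=None):
    """Render one question; the answer letter is filled in only for in-context shots."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, options)]
    lines.append("Answer:" + (f" {answer}" if answer else ""))
    return "\n".join(lines)


def build_prompt(subject, shots, question, options):
    """Header, then k solved examples, then the unsolved target question."""
    # Assumed header wording; check it against the harness you actually use.
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [format_example(q, opts, ans) for q, opts, ans in shots]
    blocks.append(format_example(question, options))
    return header + "\n\n" + "\n\n".join(blocks)


def grade(model_output, gold):
    """Letter-match: the first non-space character must equal the gold letter."""
    predicted = model_output.strip()[:1].upper()
    return predicted == gold
```

With five entries in shots this gives the 5-shot setup; passing an empty list gives the zero-shot variant.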

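Subject mix matters. The headline score here is the unweighted mean of the 57 per-subject accuracies, so running only 14 subjects yields a different number. Some harnesses instead pool all questions, which weights larger subjects more heavily. Always disclose which subset and which averaging you used.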
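
To make the averaging point concrete, here is a minimal sketch of the per-subject (macro) mean next to a pooled (micro) mean. The counts are made-up illustrative numbers, not real results, and the function names are hypothetical.

```python
def macro_average(per_subject):
    """Unweighted mean of per-subject accuracies: every subject counts equally."""
    accs = [correct / total for correct, total in per_subject.values()]
    return sum(accs) / len(accs)


def micro_average(per_subject):
    """Accuracy over all questions pooled: larger subjects weigh more."""
    correct = sum(c for c, _ in per_subject.values())
    total = sum(t for _, t in per_subject.values())
    return correct / total


# Hypothetical (correct, total) counts per subject, for illustration only.
results = {
    "abstract_algebra": (40, 100),
    "high_school_us_history": (180, 200),
}

macro = macro_average(results)  # mean of 0.40 and 0.90
micro = micro_average(results)  # 220 correct out of 300
```

The same transcripts give two different headline numbers here, and dropping either subject changes both, which is why the subset and the averaging rule belong next to any reported score.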

Common pitfalls when reporting MMLU

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Severe contamination. MMLU questions and answer keys appear in many training corpora. Score differences between subjects ("abstract algebra" vs "high-school US history") may track how thoroughly each subject was scraped rather than how hard it is. The benchmark is now close to useless as a forward-looking capability signal.
25% baseline floor. Random guessing gets 25%. Always show absolute lift over baseline, not just the raw score.
Letter-only vs reasoning-then-letter. Some labs prompt for the letter directly; others ask for reasoning and then extract the letter. The score gap between the two is 2–5pp.
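
Two of the pitfalls above lend themselves to a small sketch: reporting lift over the 25% floor, and pulling the final letter out of a reasoning-then-letter response. The regular expression assumes the model was told to finish with a phrase such as "Answer: B" or "the answer is B"; that instruction and the function names are assumptions, not a documented Benchlist convention.

```python
import re

CHANCE = 0.25  # four options, uniform random guessing

# Assumes the response states its choice as "Answer: X" or "answer is X".
ANSWER_RE = re.compile(r"answer\s*(?:is|:)\s*\(?([ABCD])\b", re.IGNORECASE)


def lift_over_chance(accuracy):
    """Absolute lift over the random-guess floor, in percentage points."""
    return (accuracy - CHANCE) * 100


def extract_letter(output):
    """Return the last stated answer letter, or None if no answer is found."""
    matches = ANSWER_RE.findall(output)
    return matches[-1].upper() if matches else None
```

A response with no recognisable answer yields None, which should be graded as incorrect rather than silently dropped, since dropping unparseable outputs inflates the score.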

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results backed by a signed transcript always rank above vendor-disclosed ones.

How to ship an MMLU score that nobody can challenge

Run MMLU on Benchlist

Benchlist runs the canonical MMLU sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
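
For readers unfamiliar with Merkle commitments, the sketch below shows the general idea: hash every transcript into a leaf, then hash pairs upward until one root remains. The source does not specify Benchlist's hash function, leaf encoding, or tree shape, so SHA-256, the prefix bytes, and the odd-node handling here are all assumptions chosen for illustration. Signing the root with Ed25519 is a separate step that is not shown.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts):
    """Hex SHA-256 Merkle root over an ordered list of transcript strings."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Assumed domain separation: 0x00 prefixes leaves, 0x01 prefixes interior nodes.
    level = [_h(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        # Hash adjacent pairs; an unpaired last node is carried up unchanged.
        nxt = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0].hex()
```

Because the root depends on every leaf and on their order, changing a single character in any transcript changes the committed value, which is what lets a verifier detect edited or omitted transcripts.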

Hosted runner. POST a job, and we email the verify URL when it finishes:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "mmlu",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'

Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run mmlu --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

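What is MMLU?

MMLU was introduced by Hendrycks et al. in "Measuring Massive Multitask Language Understanding" (2021). Its test set contains 14,042 four-choice questions across 57 subjects, ranging from elementary level through professional exams such as US licensing tests.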

How is MMLU scored?

Four-choice multiple-choice. Grading is letter-match. The published convention is 5-shot prompting (5 in-context examples before each question), though zero-shot is increasingly reported.

What's the biggest pitfall when reporting MMLU?

Severe contamination. MMLU questions and answer keys appear in many training corpora. Score differences between subjects ("abstract algebra" vs "high-school US history") may track how thoroughly each subject was scraped rather than how hard it is. The benchmark is now close to useless as a forward-looking capability signal.

How do I verify a published MMLU score?

Use Benchlist. Run via benchlist run mmlu or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.

What are the canonical decoding parameters for MMLU?

Per the catalog, MMLU runs at temperature 0.0 with max_tokens 32. Deviating without disclosure makes scores incomparable.
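
One way to make such deviations visible is to diff a run's decoding settings against the canonical ones before publishing. This is a minimal sketch: the parameter names mirror the sentence above, while the dictionary layout and function name are hypothetical and do not reflect the run.json schema.

```python
# Canonical values as stated above; the structure itself is an assumption.
CANONICAL = {"temperature": 0.0, "max_tokens": 32}


def decoding_deviations(run_params):
    """Map each differing or missing setting to a (canonical, actual) pair."""
    return {
        name: (expected, run_params.get(name))
        for name, expected in CANONICAL.items()
        if run_params.get(name) != expected
    }
```

An empty result means the run used the canonical settings; anything else is what should be disclosed alongside the score.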