History
Introduced by Hendrycks et al. in "Measuring Massive Multitask Language Understanding" (2021). The test set contains 14,042 multiple-choice questions across 57 subjects, ranging from elementary-level material through expert-level US licensing exams.
MMLU was the standard knowledge benchmark from 2021 through 2024. Top models now score 90%+, putting MMLU near saturation. MMLU-Pro is the current-generation successor.
How MMLU is graded
Four-choice multiple-choice. Grading is letter-match. The published convention is 5-shot prompting (5 in-context examples before each question), though zero-shot is increasingly reported.
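The convention above can be sketched in a few lines of Python. This is a minimal illustration, not the official MMLU harness: the prompt template, the `Question` type, and the function names are assumptions, and the example questions are placeholders.

```python
from dataclasses import dataclass

LETTERS = "ABCD"

@dataclass
class Question:
    text: str
    choices: list[str]  # exactly four options, in A-D order
    answer: str         # gold letter, one of "A".."D"

def format_question(q: Question, include_answer: bool) -> str:
    """Render one question; in-context examples include the gold letter."""
    lines = [q.text]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, q.choices)]
    lines.append(f"Answer: {q.answer}" if include_answer else "Answer:")
    return "\n".join(lines)

def build_prompt(shots: list[Question], target: Question) -> str:
    """k-shot prompt: solved examples first, then the unanswered target."""
    blocks = [format_question(s, include_answer=True) for s in shots]
    blocks.append(format_question(target, include_answer=False))
    return "\n\n".join(blocks)

def grade(prediction: str, gold: str) -> bool:
    """Letter-match: the first non-space character must equal the gold letter."""
    return prediction.strip().upper()[:1] == gold.upper()

# Placeholder data (hypothetical, not real MMLU items).
shots = [Question(f"Example question {i}?", ["w", "x", "y", "z"], "A") for i in range(5)]
target = Question("What is 2 + 2?", ["3", "4", "5", "6"], "B")
prompt = build_prompt(shots, target)
```

Passing an empty `shots` list gives the zero-shot variant with the same grading rule.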
Subject-mix matters: the average score is the simple mean across 57 subjects, but if you only run 14 subjects you'll get a different number. Always disclose which subset.
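To see how much the subject mix can move the headline number, here is a small sketch of the simple per-subject mean described above. The accuracies are made-up values for illustration, not real results, and `macro_average` is a hypothetical helper name.

```python
from statistics import mean
from typing import Optional

# Hypothetical per-subject accuracies (fractions), not real results.
subject_accuracy = {
    "abstract_algebra": 0.60,
    "high_school_us_history": 0.90,
    "professional_law": 0.70,
    "college_physics": 0.80,
}

def macro_average(acc: dict[str, float], subjects: Optional[list[str]] = None) -> float:
    """Simple mean over subjects: every subject counts equally."""
    chosen = list(acc) if subjects is None else subjects
    return mean(acc[s] for s in chosen)

full = macro_average(subject_accuracy)
subset = macro_average(subject_accuracy, ["high_school_us_history", "college_physics"])
print(f"all subjects: {full:.2f}, subset: {subset:.2f}")
```

With these placeholder values the full mean is about 0.75 while the two-subject subset is about 0.85, which is why the subset has to be disclosed alongside the score.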
Common pitfalls when reporting MMLU
The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:
- Severe contamination. MMLU questions and answer keys appear in many training corpora. Differences in per-subject difficulty ("abstract algebra" vs "high-school US history") can track how thoroughly each subject was scraped. The benchmark is now close to useless as a forward-looking capability signal.
- 25% baseline floor. Random guessing gets 25%. Always show absolute lift over baseline, not just the raw score.
- Letter-only vs reasoning-then-letter. Some labs prompt for the letter directly; others ask for reasoning and then extract the letter. The score gap between the two is 2–5 percentage points.
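Two of the pitfalls above lend themselves to a short sketch: reporting lift over the 25% chance floor, and extracting a letter from a reasoning-style completion. The function names and the regular expression are assumptions for illustration; real harnesses use their own extraction rules, and the chance-normalized score is an optional extra beyond the absolute lift.

```python
import re
from typing import Optional

CHANCE = 0.25  # four answer choices, uniform random guessing

def lift_over_chance(accuracy: float, chance: float = CHANCE) -> float:
    """Absolute lift in accuracy above random guessing."""
    return accuracy - chance

def chance_normalized(accuracy: float, chance: float = CHANCE) -> float:
    """Rescale so chance maps to 0.0 and a perfect score maps to 1.0."""
    return (accuracy - chance) / (1.0 - chance)

# Heuristic pattern: "answer is (B)" or "Answer: B". The phrase is matched
# case-insensitively, but the letter itself must be an uppercase A-D.
_ANSWER_RE = re.compile(r"(?i:answer\s*(?:is|:))\s*\(?([A-D])\b")

def extract_letter(completion: str) -> Optional[str]:
    """Return the last stated answer letter, or None if none is found."""
    matches = _ANSWER_RE.findall(completion)
    return matches[-1] if matches else None
```

Taking the last match is a design choice: a reasoning trace may mention several candidate letters before committing to one. Whatever rule is used, it should be disclosed with the score, since a completion that yields `None` is usually counted as wrong.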
Live Benchlist leaderboard
Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.
How to ship an MMLU score that nobody can challenge
Run MMLU on Benchlist
Benchlist runs the canonical MMLU sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
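For readers unfamiliar with Merkle commitments, the idea can be sketched as follows. This is a generic construction under stated assumptions (SHA-256, domain-separated leaves, duplicating the last node on odd levels); the source does not describe Benchlist's actual hash function, leaf encoding, or tree layout, and the signing step is omitted.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(transcripts: list[bytes]) -> str:
    """Hex Merkle root over transcripts (assumed layout, for illustration)."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Prefix leaves with 0x00 and internal nodes with 0x01 so a leaf
    # can never be confused with an internal node.
    level = [_h(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()

# Placeholder transcripts (hypothetical content).
transcripts = [b"q1 -> B", b"q2 -> A", b"q3 -> D"]
root = merkle_root(transcripts)
```

Changing any single transcript changes the root, so a signature over the root alone commits to every transcript in the run.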
Hosted runner. POST a job and we email the verify URL when it's done:
curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "mmlu",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:
pip install benchlist-runner
benchlist run mmlu --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json
FAQ
What is MMLU?
How is MMLU scored?
What's the biggest pitfall when reporting MMLU?
How do I verify a published MMLU score?
Use benchlist run mmlu or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can verify the signature in their browser.