History
MMMU was introduced by Yue et al. (arXiv preprint in late 2023, published at CVPR 2024). It contains 11,500 college-level multimodal questions across 30 subjects, drawn from college exams, quizzes, and textbooks. Each question pairs text with images (charts, diagrams, photographs) and requires both visual perception and domain reasoning.
MMMU is the canonical multimodal-reasoning benchmark for 2024–2026. Frontier vision-capable models score 60–75%; the best-performing human experts reach 88.6%.
How MMMU is graded
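Mostly multiple-choice, typically with four options; a small share of questions are open-ended. Multiple-choice answers are graded by letter match. Both 5-shot and zero-shot results are reported.
MMMU-Pro adds harder questions and filters out text-only shortcuts (questions solvable without looking at the image). Always check whether MMMU or MMMU-Pro is being reported.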
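Letter-match grading can be sketched as follows. This is a minimal illustration, not the official MMMU evaluation script: the function names `extract_letter` and `letter_match_accuracy` and the two regex patterns are assumptions, and real harnesses handle many more answer formats.

```python
import re
from typing import Optional, Sequence

LETTERS = "ABCDEFGHIJ"  # room for option sets larger than four

def extract_letter(response: str, num_options: int = 4) -> Optional[str]:
    """Return the option letter a response commits to, or None if unclear."""
    valid = LETTERS[:num_options]
    text = response.strip()
    # Case 1: the whole response is just a letter, e.g. "B", "(B)" or "B."
    bare = re.fullmatch(rf"\(?([{valid}])\)?[.:]?", text)
    if bare:
        return bare.group(1)
    # Case 2: an explicit statement such as "Answer: B" or "the answer is (B)".
    # Only the words are case-insensitive; the letter itself must be uppercase
    # so that "the answer is a cat" is not read as option A.
    stated = re.findall(
        rf"(?i:answer)\s*(?i:is)?\s*:?\s*\(?([{valid}])\)?(?![A-Za-z])", text
    )
    return stated[-1] if stated else None

def letter_match_accuracy(responses: Sequence[str], gold: Sequence[str],
                          num_options: int = 4) -> float:
    """Fraction of responses whose extracted letter equals the gold letter.

    Responses with no parseable letter count as wrong.
    """
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    hits = sum(extract_letter(r, num_options) == g for r, g in zip(responses, gold))
    return hits / len(gold)
```

How unparseable responses are treated (wrong, retried, or randomly guessed) changes the headline number, so it is worth stating alongside the score.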
Common pitfalls when reporting MMMU
The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:
- Image-resolution drift. Different vision encoders process images at different resolutions. The same model can differ by 5–10 pp between 384×384 and 1024×1024 input.
- Text-only shortcuts. About 30% of MMMU questions are solvable from text alone. MMMU-Pro filtered most of these. The original benchmark overstates true vision capability.
- Subject distribution. Some subjects (humanities) are easier than others (engineering). Aggregate hides domain skews.
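The subject-distribution pitfall is easy to see with a per-subject breakdown. The sketch below is illustrative only: `subject_report` is a hypothetical helper, and the records are made-up toy data, not real MMMU results. It compares the pooled (micro) accuracy with the unweighted mean over subjects (macro).

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def subject_report(records: Iterable[Tuple[str, bool]]) -> Tuple[Dict[str, float], float, float]:
    """Return (per-subject accuracy, micro accuracy, macro accuracy).

    records: (subject, is_correct) pairs, one per graded question.
    """
    totals = defaultdict(lambda: [0, 0])  # subject -> [correct, seen]
    for subject, is_correct in records:
        totals[subject][0] += int(is_correct)
        totals[subject][1] += 1
    if not totals:
        raise ValueError("no records to report on")
    per_subject = {s: c / n for s, (c, n) in totals.items()}
    # Micro: pool every question, so large subjects dominate.
    micro = sum(c for c, _ in totals.values()) / sum(n for _, n in totals.values())
    # Macro: average the per-subject accuracies, so each subject counts equally.
    macro = sum(per_subject.values()) / len(per_subject)
    return per_subject, micro, macro

# Toy data (invented for illustration): an easy subject with many questions
# and a hard subject with few.
toy = ([("Art", True)] * 9 + [("Art", False)]
       + [("Electronics", True)] * 2 + [("Electronics", False)] * 3)
per_subject, micro, macro = subject_report(toy)
```

On this toy data the pooled figure sits well above the hard subject's accuracy, which is why a per-subject table is more informative than a single aggregate.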
Live Benchlist leaderboard
Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results backed by a real signed transcript always rank above vendor-disclosed ones.
How to ship an MMMU score that nobody can challenge
Run MMMU on Benchlist
Benchlist runs the canonical MMMU sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
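To make the idea of a Merkle commitment over transcripts concrete, here is a minimal sketch. It is not Benchlist's actual construction, which this page does not specify: the SHA-256 hash, the 0x00/0x01 domain-separation prefixes, the duplicate-last-node rule for odd levels, and the name `merkle_root` are all assumptions chosen for illustration. Signing the resulting root with Ed25519 would need a third-party library and is omitted.

```python
import hashlib
from typing import Sequence

def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(transcripts: Sequence[str]) -> str:
    """Return a hex Merkle root committing to every transcript, in order."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Leaf hashes get a 0x00 prefix and inner nodes a 0x01 prefix, so a leaf
    # can never be confused with an inner node (an assumed convention).
    level = [_sha256(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # simplification: duplicate the odd node out
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Because the root depends on every transcript and on their order, changing a single answer after the fact yields a different root, which is what lets a signed root stand in for the full set of transcripts.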
Hosted runner. POST a job and we email the verify URL when it's done:
curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "mmmu",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
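The same request can be assembled from Python using only the standard library. This is a sketch that mirrors the curl call above field for field. The helper name `build_run_request` is hypothetical, and the response format is not documented on this page, so the example only builds the request and leaves sending it to the caller.

```python
import json
import urllib.request

RUN_URL = "https://benchlist.ai/api/v1/run"  # endpoint copied from the curl example

def build_run_request(api_key: str, model: str,
                      service: str = "anthropic-claude",
                      limit: int = 50) -> urllib.request.Request:
    """Build (but do not send) the POST request for a hosted MMMU run."""
    payload = {
        "service": service,
        "model": model,
        "benchmark": "mmmu",
        "runs": 1,
        "limit": limit,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        RUN_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the job; keeping construction separate makes the payload easy to inspect or log before anything is sent.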
Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:
pip install benchlist-runner
benchlist run mmmu --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json
FAQ
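What is MMMU?
A benchmark of 11,500 college-level multimodal questions across 30 subjects, introduced by Yue et al. Each question pairs text with images and requires both visual perception and domain reasoning.
How is MMMU scored?
By letter-match accuracy on the multiple-choice questions, reported zero-shot and 5-shot. Check whether a figure refers to MMMU or MMMU-Pro.
What's the biggest pitfall when reporting MMMU?
Quoting a score without its setup. Image resolution, text-only shortcuts, and subject mix can each shift or distort the number, so report them alongside the score.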
How do I verify a published MMMU score?
Run benchlist run mmmu or POST to /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can verify the signature in their browser.