MMMU: methodology, history, and how to verify a published score

History

Released by Yue et al. in 2024, MMMU comprises roughly 11,500 college-level multimodal questions across 30 subjects, drawn from textbooks and lecture materials. Each question pairs text with images (charts, diagrams, photographs) and requires both visual perception and domain reasoning.

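MMMU is a standard multimodal-reasoning benchmark for 2024–2026 model releases. Frontier vision-capable models score 60–75%; the strongest human experts reach 88.6% (the MMMU authors report a human-expert range of 76.2% to 88.6%, not a single average).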

How MMMU is graded

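Mostly multiple-choice, typically with 4 options, plus a small share of open-ended questions. Multiple-choice answers are graded by letter match, and the headline metric is accuracy. Scores are reported both zero-shot and 5-shot (the 150-question dev split supplies 5 examples per subject), so state which setting a number uses.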
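
Letter-match grading can be sketched roughly as follows. This is a minimal illustration, not the official MMMU evaluation script or Benchlist's grader: the function names and the two extraction rules (a bare letter, or an explicit "answer is X" statement) are assumptions chosen for clarity, and real graders usually handle more response formats.

```python
import re
from typing import Optional, Sequence

def extract_letter(response: str, num_options: int = 4) -> Optional[str]:
    """Return the option letter a response commits to, or None if unclear.

    Hypothetical helper: the rules below are illustrative assumptions.
    """
    letters = "ABCDEFGHIJ"[:num_options]
    text = response.strip()
    # Rule 1: the whole reply is a bare letter such as "B", "(B)" or "B."
    bare = re.fullmatch(rf"\(?([{letters}])\)?[.:]?", text)
    if bare:
        return bare.group(1)
    # Rule 2: an explicit "answer is X" / "Answer: X"; keep the last one stated.
    stated = re.findall(
        rf"(?i:answer)(?:\s+is)?\s*:?\s*\(?([{letters}])\)?(?![A-Za-z])", text
    )
    return stated[-1] if stated else None

def letter_match_accuracy(responses: Sequence[str], gold: Sequence[str],
                          num_options: int = 4) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    hits = sum(extract_letter(r, num_options) == g for r, g in zip(responses, gold))
    return hits / len(gold)

# Made-up responses for illustration; the third has no extractable letter
# and therefore counts as wrong.
responses = ["B", "The answer is (C).", "I cannot tell from the image."]
gold = ["B", "C", "A"]
accuracy = letter_match_accuracy(responses, gold)
```

Unparseable responses are scored as incorrect here. Some harnesses fall back to a random guess instead, which is one more reason to disclose the grading rule alongside the score.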

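MMMU-Pro is a harder follow-up benchmark rather than a split of MMMU: it filters out questions that text-only models can answer, expands the answer options to as many as 10, and adds a vision-only setting in which the question is embedded in the image. Always check whether MMMU or MMMU-Pro is being reported.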

Common pitfalls when reporting MMMU

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Image-resolution drift. Vision encoders process images at different resolutions, and the same model can differ by 5–10 percentage points between 384×384 and 1024×1024 input, so report the resolution used.
Text-only shortcuts. About 30% of MMMU questions are solvable from text alone. MMMU-Pro filtered most of these out, so the original benchmark overstates true vision capability.
Subject distribution. Some subjects (humanities) are easier than others (engineering), and the aggregate score hides these domain skews.
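
To make the subject-distribution point concrete, the sketch below computes per-subject accuracy alongside two aggregates. It is an illustration under stated assumptions: the record format, the function name, and the sample data are invented for this example and are not taken from any real run.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def accuracy_by_subject(
    records: Iterable[Tuple[str, bool]]
) -> Tuple[Dict[str, float], float, float]:
    """Return (per-subject accuracy, micro average, macro average).

    records: (subject, is_correct) pairs, a hypothetical format.
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for subject, is_correct in records:
        totals[subject] += 1
        hits[subject] += int(is_correct)
    per_subject = {s: hits[s] / totals[s] for s in totals}
    # Micro average: every question counts equally (the usual headline number).
    micro = sum(hits.values()) / sum(totals.values())
    # Macro average: every subject counts equally, regardless of its size.
    macro = sum(per_subject.values()) / len(per_subject)
    return per_subject, micro, macro

# Invented data: 9 of 10 correct in an easy subject, 2 of 5 in a hard one.
records = (
    [("Art", True)] * 9 + [("Art", False)] * 1
    + [("Mechanical_Engineering", True)] * 2
    + [("Mechanical_Engineering", False)] * 3
)
per_subject, micro, macro = accuracy_by_subject(records)
```

With this made-up sample the micro average is 11/15 while the two subjects sit at 0.9 and 0.4, so a single headline number conceals a large gap. Publishing the per-subject table next to the aggregate avoids that.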

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.

How to ship an MMMU score that nobody can challenge

Run MMMU on Benchlist

Benchlist runs the canonical MMMU sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
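
The idea of a Merkle commitment over transcripts can be sketched as follows. This is a generic illustration, not Benchlist's actual construction: the hash function (SHA-256), the leaf and node prefixes (borrowed from RFC 6962), and the rule of duplicating the last node on odd levels (a simplification that differs from RFC 6962's split rule) are all assumptions, since the page does not specify them. Signing the root with Ed25519 is left out because it needs a third-party library.

```python
import hashlib
from typing import Sequence

def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(transcripts: Sequence[str]) -> str:
    """Hex Merkle root over transcripts (illustrative construction only)."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Leaves get a 0x00 prefix and internal nodes 0x01, so a leaf hash
    # can never be mistaken for an internal node hash.
    level = [_sha256(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the odd node out
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()

# Placeholder transcripts; changing any one of them changes the root.
original = ["Q1 ... A: B", "Q2 ... A: C", "Q3 ... A: A"]
tampered = ["Q1 ... A: B", "Q2 ... A: D", "Q3 ... A: A"]
root = merkle_root(original)
assert root != merkle_root(tampered)
```

Because the root depends on every transcript, a signature over the root alone commits the publisher to the full set of model outputs behind the score.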

Hosted runner: POST a job and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "mmmu",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'

Self-hosted: install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run mmmu --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

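What is MMMU?

A benchmark of roughly 11,500 college-level multimodal questions across 30 subjects, released by Yue et al. in 2024. Each question pairs text with images (charts, diagrams, photographs) and requires both visual perception and domain reasoning.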

How is MMMU scored?

Mostly multiple-choice, typically with 4 options, graded by letter match. Scores are reported both zero-shot and 5-shot, so check which setting a number uses.

What's the biggest pitfall when reporting MMMU?

Image-resolution drift. Vision encoders process images at different resolutions, and the same model can differ by 5–10 percentage points between 384×384 and 1024×1024 input.

How do I verify a published MMMU score?

Use Benchlist. Run via benchlist run mmmu or POST /v1/run; the result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can check the signature in their browser.

What are the canonical decoding parameters for MMMU?

Per the catalog, MMMU runs at temperature 0.0 with max_tokens 512. Deviating without disclosure makes scores incomparable.
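
One way to catch undisclosed deviations is to compare a run's decoding settings against the catalog values before publishing. The sketch below assumes the settings are available as a flat dictionary with keys named temperature and max_tokens. That layout and the function name are hypothetical, because the structure of run.json is not documented here.

```python
from typing import Any, Dict, Tuple

# Catalog values quoted above for MMMU.
CANONICAL_DECODING: Dict[str, Any] = {"temperature": 0.0, "max_tokens": 512}

def decoding_deviations(run_config: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each deviating parameter to (expected, actual).

    A missing parameter is reported with actual=None.
    """
    return {
        name: (expected, run_config.get(name))
        for name, expected in CANONICAL_DECODING.items()
        if run_config.get(name) != expected
    }

# Hypothetical config: temperature deviates, max_tokens matches.
deviations = decoding_deviations({"temperature": 0.7, "max_tokens": 512})
# deviations == {"temperature": (0.0, 0.7)}
```

An empty result means the run used the canonical settings; anything else should be disclosed next to the score.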