History

MBPP was released by Google in "Program Synthesis with Large Language Models" (Austin et al., 2021). It contains 974 short Python problems crowdsourced from internal evaluators, each with a one-sentence description, a reference solution, and three test cases.

MBPP and HumanEval were released within months of each other and have been paired ever since. MBPP is the broader, easier distribution; HumanEval is the more targeted prompt-to-implementation test.

How MBPP is graded

MBPP is scored as pass@1 at temperature 0: each problem gets one greedy completion, which counts as solved only if it passes all three of the problem's reference test cases.
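
The grading rule can be sketched in a few lines of Python. This is a minimal illustration, not Benchlist's or the paper's harness: the function names and the sample problem are hypothetical, and a real harness would run candidates in a sandbox with timeouts rather than calling exec directly.

```python
# Hypothetical grader sketch: names and the sample problem are illustrative only.

def passes_all_tests(candidate_code: str, test_cases: list[str]) -> bool:
    """Return True only if the candidate runs and every assert holds."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function
        for test in test_cases:
            exec(test, namespace)         # each test is a single `assert ...` line
    except Exception:
        return False                      # any error or failed assert fails the problem
    return True


def pass_at_1(results: list[bool]) -> float:
    """With one greedy sample per problem, pass@1 is the fraction solved."""
    return sum(results) / len(results)


# Illustrative problem in the MBPP style (not taken from the dataset).
tests = [
    "assert add_pair(1, 2) == 3",
    "assert add_pair(-1, 1) == 0",
    "assert add_pair(0, 0) == 0",
]
good = "def add_pair(a, b):\n    return a + b\n"
bad = "def add_pair(a, b):\n    return a - b\n"

results = [passes_all_tests(good, tests), passes_all_tests(bad, tests)]
score = pass_at_1(results)
```

A problem is all-or-nothing here: a candidate that passes two of the three asserts still counts as unsolved.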

MBPP comes in two flavours: the full 974-problem set used in research, and a smaller hand-verified sanitized subset (about 427 problems in the public release) that removes ambiguous prompts. Most modern reports use the sanitized version.

Common pitfalls when reporting MBPP

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

  • Three-test grading under-tests solutions. MBPP's three tests per problem miss many real bugs. MBPP+ (EvalPlus) adds roughly 35× more tests, and published scores drop 5–10 pp on MBPP+.
  • Easy-distribution saturation. Top frontier models hit 90%+ on MBPP. Use MBPP+ or BigCodeBench for a signal that still separates current models.
  • Dataset confusion. Sanitized MBPP and full MBPP have different difficulty distributions. Cross-paper comparisons must use the same split.
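
To make the under-testing pitfall concrete, here is a small sketch of a buggy solution that passes three happy-path tests and fails once edge cases are added. The problem, the function names, and both test lists are hypothetical; they are not drawn from MBPP or MBPP+.

```python
# Illustrative only: neither the problem nor the tests come from MBPP or MBPP+.

def buggy_max(values):
    """Intended to return the largest element of a non-empty list."""
    best = 0                      # bug: wrong starting value for all-negative input
    for v in values:
        if v > best:
            best = v
    return best


# Three tests in the style of the original benchmark: all happy-path inputs.
three_tests = [([1, 2, 3], 3), ([5, 4], 5), ([7], 7)]

# Extra edge cases of the kind an extended test suite might add.
extra_tests = [([-3, -1, -2], -1), ([0], 0), ([-5], -5)]


def failures(func, cases):
    """Return the (input, expected) pairs the function gets wrong."""
    return [(args, expected) for args, expected in cases if func(args) != expected]


base_failures = failures(buggy_max, three_tests)
plus_failures = failures(buggy_max, extra_tests)
```

The three original-style tests report no failures, while the extended list catches the all-negative inputs, which is the mechanism behind lower scores on larger test suites.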

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.

How to ship an MBPP score that nobody can challenge

Run MBPP on Benchlist

Benchlist runs the canonical MBPP sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
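
As a rough illustration of what a Merkle commitment over transcripts buys you, the sketch below computes a SHA-256 Merkle root with Python's standard library. Everything about the construction is an assumption: the source does not say which hash, leaf encoding, or tree layout Benchlist uses, and the Ed25519 signing step is left out because it needs a third-party library.

```python
import hashlib

# Assumed construction for illustration; Benchlist's actual tree layout,
# hash function, and leaf encoding are not specified in the text above.

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold a list of transcripts into a single 32-byte commitment."""
    if not leaves:
        raise ValueError("need at least one transcript")
    # Prefix bytes keep leaf hashes and inner-node hashes in separate domains.
    level = [sha256(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])   # duplicate the last node on odd-sized levels
        level = [
            sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Placeholder transcripts standing in for captured model outputs.
transcripts = [b"transcript for task 1", b"transcript for task 2", b"transcript for task 3"]
root = merkle_root(transcripts)

# Changing any single transcript changes the root.
tampered_root = merkle_root([transcripts[0], b"edited", transcripts[2]])
```

Signing only the root is enough to commit to every transcript, because any edit to a transcript produces a different root and invalidates the signature.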

Hosted runner: POST a job and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "mbpp",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
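
The same request can be built from Python with only the standard library. This sketch reuses the endpoint, headers, and JSON fields from the curl command above; the helper name and the placeholder key are hypothetical, and the response format is not documented here, so the sketch stops at constructing the request.

```python
import json
import urllib.request


def build_run_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the hosted-runner job request."""
    payload = {
        "service": "anthropic-claude",
        "model": "claude-sonnet-4.5",
        "benchmark": "mbpp",
        "runs": 1,
        "limit": 50,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        "https://benchlist.ai/api/v1/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder key; in practice read it from the BENCHLIST_KEY environment variable.
request = build_run_request("YOUR_BENCHLIST_KEY")

# To submit the job, pass `request` to urllib.request.urlopen(...).
```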

Self-hosted: install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run mbpp --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

What is MBPP?
MBPP was released by Google in "Program Synthesis with Large Language Models" (Austin et al., 2021). It contains 974 short Python problems crowdsourced from internal evaluators, each with a one-sentence description, a reference solution, and three test cases.
How is MBPP scored?
MBPP is scored as pass@1 at temperature 0: each problem gets one greedy completion, which counts as solved only if it passes all three of the problem's reference test cases.
What's the biggest pitfall when reporting MBPP?
Three-test grading under-tests solutions. MBPP's three tests per problem miss many real bugs. MBPP+ (EvalPlus) adds roughly 35× more tests, and published scores drop 5–10 pp on MBPP+.
How do I verify a published MBPP score?
Use Benchlist. Run via benchlist run mbpp or POST /api/v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can verify the signature in their browser.
What are the canonical decoding parameters for MBPP?
Per the catalog, MBPP runs at temperature 0.0 with max_tokens 1024. Deviating without disclosure makes scores incomparable.
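
One way to catch undisclosed deviations is to compare a run's decoding settings against the catalog values before publishing. The sketch below is a hypothetical helper, not part of benchlist-runner: the class and function names are assumptions, and only the two parameters named above are checked.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecodingConfig:
    temperature: float
    max_tokens: int


# Catalog values quoted in the answer above.
CATALOG_DEFAULTS = DecodingConfig(temperature=0.0, max_tokens=1024)


def undisclosed_deviations(run: DecodingConfig, disclosed: set[str]) -> list[str]:
    """List parameters that differ from the catalog and were not disclosed."""
    reference = asdict(CATALOG_DEFAULTS)
    return [
        name
        for name, value in asdict(run).items()
        if value != reference[name] and name not in disclosed
    ]


# A run sampled at temperature 0.2 without saying so gets flagged.
run = DecodingConfig(temperature=0.2, max_tokens=1024)
flagged = undisclosed_deviations(run, disclosed=set())
cleared = undisclosed_deviations(run, disclosed={"temperature"})
```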