HumanEval+, methodology, history, and how to verify a published score


History

HumanEval+ was introduced by Liu et al. in "Is Your Code Generated by ChatGPT Really Correct?" (NeurIPS 2023). The headline finding: many solutions that pass HumanEval's small test suite fail under more rigorous testing. EvalPlus, the framework behind HumanEval+, extends each problem with roughly 80× more unit tests and stricter input handling.

Reported drops from HumanEval to HumanEval+ are typically 5–15 percentage points for frontier models. The gap indicates how much HumanEval's smaller test suite was overstating real correctness.

How HumanEval+ is graded

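HumanEval+ uses the same pass@1 grading as HumanEval: a sample counts only if it passes every test for its problem. Each problem keeps the original tests and adds generated tests covering edge cases (null, empty, very large, wrong types, off-by-one), for roughly 80× as many tests overall.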
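As a minimal sketch of the scoring arithmetic, the block below uses the standard unbiased pass@k estimator from the original HumanEval paper, 1 - C(n-c, k) / C(n, k), averaged over problems. The function names and the sample data are illustrative assumptions, not part of EvalPlus or Benchlist. With one greedy sample per problem, pass@1 reduces to the fraction of problems solved.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them pass every test."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical outcomes for four problems, one greedy sample each.
greedy = [(1, 1), (1, 0), (1, 1), (1, 1)]
print(benchmark_pass_at_k(greedy, k=1))  # 0.75
```

For HumanEval+, "correct" here means the sample passes both the original and the added tests.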

Some problems were also rewritten because the original docstrings were ambiguous: a model could give a "correct" solution that wasn't what HumanEval intended. EvalPlus fixes those ambiguities.

Common pitfalls when reporting HumanEval+

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Lower numbers are not always due to a worse model. Some HumanEval+ test cases are arguably stricter than the spec, so a model returning a sensible but unconventional answer can fail. Read the diff before declaring a regression.
Most labs still report HumanEval, not HumanEval+. If you're comparing against a vendor's claimed number, check which version they're quoting. The gap is real.
Contamination is reduced but not eliminated. EvalPlus tests are public, so subsequent training runs can include them. The advantage shrinks over time.
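One way to "read the diff" is to list the problems whose solution passes the original tests but fails the extended ones, then inspect those failures by hand. The sketch below assumes you already have per-problem pass/fail results as dictionaries keyed by task ID; the helper names and the data are hypothetical.

```python
def base_only_passes(base: dict[str, bool], plus: dict[str, bool]) -> list[str]:
    """Task IDs that pass the original tests but fail the extended ones."""
    return sorted(task for task, ok in base.items() if ok and not plus.get(task, False))


def pass_rate(outcomes: dict[str, bool]) -> float:
    """Percentage of problems solved."""
    return 100.0 * sum(outcomes.values()) / len(outcomes)


# Hypothetical per-problem outcomes for the same model on both test suites.
base = {"HumanEval/0": True, "HumanEval/1": True, "HumanEval/2": False, "HumanEval/3": True}
plus = {"HumanEval/0": True, "HumanEval/1": False, "HumanEval/2": False, "HumanEval/3": True}

print(base_only_passes(base, plus))       # ['HumanEval/1']
print(pass_rate(base) - pass_rate(plus))  # 25.0 (gap in percentage points)
```

The listed task IDs are the ones to review before attributing a lower score to the model rather than to a strict test.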

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.


How to ship a HumanEval+ score that nobody can challenge

Run HumanEval+ on Benchlist

Benchlist runs the canonical HumanEval+ sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
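To illustrate what a Merkle commitment over transcripts buys you, here is a generic sketch using SHA-256. Benchlist's actual leaf encoding, hash function, and tree layout are not described here, so everything below (the prefix bytes, the handling of an odd node, the function name) is an assumption for illustration only. Signing the resulting root with Ed25519 would need a third-party library and is not shown.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: list[bytes]) -> bytes:
    """Root of a binary Merkle tree over the transcripts (illustrative layout)."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Distinct prefixes keep leaf hashes and inner-node hashes from colliding.
    level = [_h(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        paired = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:
            # An unpaired last node is carried up unchanged.
            paired.append(level[-1])
        level = paired
    return level[0]


# Hypothetical transcripts; changing any one of them changes the root.
original = merkle_root([b"task 0 transcript", b"task 1 transcript", b"task 2 transcript"])
tampered = merkle_root([b"task 0 transcript", b"task 1 EDITED", b"task 2 transcript"])
print(original.hex())
print(original != tampered)  # True
```

Because the root depends on every transcript, a signature over the root commits the attestor to the full set of model outputs, not just the final score.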

Hosted runner. POST a job, and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "humaneval-plus",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
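The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl call above field for field; the helper name build_run_request is made up for illustration, and the response format is not shown in this document, so the example stops at constructing the request rather than parsing a reply.

```python
import json
import os
import urllib.request


def build_run_request(api_key: str, model: str, limit: int = 50) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "service": "anthropic-claude",
        "model": model,
        "benchmark": "humaneval-plus",
        "runs": 1,
        "limit": limit,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        "https://benchlist.ai/api/v1/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_run_request(os.environ.get("BENCHLIST_KEY", "<your-key>"), "claude-sonnet-4.5")
print(req.get_method(), req.full_url)
# To submit the job, pass req to urllib.request.urlopen(req).
```

Keeping construction separate from sending makes it easy to inspect the payload before spending a run.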

Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run humaneval-plus --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

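What is HumanEval+?

HumanEval+ is the extended version of HumanEval built with EvalPlus, introduced by Liu et al. in "Is Your Code Generated by ChatGPT Really Correct?" (NeurIPS 2023). It keeps the same problems but adds far more unit tests and stricter input handling, so solutions that pass HumanEval's small test suite can still fail.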

How is HumanEval+ scored?

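HumanEval+ uses the same pass@1 grading as HumanEval: a sample counts only if it passes every test for its problem. Each problem keeps the original tests and adds generated tests covering edge cases (null, empty, very large, wrong types, off-by-one), for roughly 80× as many tests overall.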

What's the biggest pitfall when reporting HumanEval+?

Lower numbers are not always due to a worse model. Some HumanEval+ test cases are arguably stricter than the spec, so a model returning a sensible but unconventional answer can fail. Read the diff before declaring a regression.

How do I verify a published HumanEval+ score?

Use Benchlist. Run via benchlist run humaneval-plus or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can verify the signature in their browser.

What are the canonical decoding parameters for HumanEval+?

Per the catalog, HumanEval+ runs at temperature 0.0 with max_tokens 1024. Deviating without disclosure makes scores incomparable.
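A small check like the following can flag undisclosed deviations before a score is published. It is a sketch only: the canonical values come from the sentence above, while the config dictionary shape and the function name are assumptions.

```python
# Canonical decoding parameters as stated above.
CANONICAL = {"temperature": 0.0, "max_tokens": 1024}


def decoding_deviations(run_config: dict) -> dict[str, tuple]:
    """Map each canonical parameter that differs to (expected, actual)."""
    return {
        name: (expected, run_config.get(name))
        for name, expected in CANONICAL.items()
        if run_config.get(name) != expected
    }


print(decoding_deviations({"temperature": 0.0, "max_tokens": 1024}))  # {}
print(decoding_deviations({"temperature": 0.2, "max_tokens": 1024}))  # {'temperature': (0.0, 0.2)}
```

An empty result means the run used the canonical settings; anything else should be disclosed alongside the score.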