History
HumanEval+ was introduced by Liu et al. in "Is Your Code Generated by ChatGPT Really Correct?" (NeurIPS 2023). The headline finding: many solutions that pass HumanEval's small test suite fail under more rigorous testing. EvalPlus extends HumanEval with roughly 80× more unit tests and stricter input handling.
Reported drops from HumanEval to HumanEval+ are typically 5–15 percentage points for frontier models. The gap is a direct measure of how much HumanEval was overstating real correctness.
How HumanEval+ is graded
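Grading uses the same pass@1 metric as HumanEval. Each problem keeps its original tests and gains a much larger generated suite (about 80× the original test count across the benchmark) covering edge cases such as null or empty inputs, very large values, wrong types, and off-by-one boundaries.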
Some problems were also rewritten because the original docstrings were ambiguous: a model could give a 'correct' solution that wasn't what HumanEval intended. EvalPlus fixes those ambiguities.
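A minimal sketch of why the extra tests change the score. Everything here is hypothetical: the problem, the buggy `candidate_max` solution, the test cases, and the helper names are made up for illustration and are not taken from HumanEval+ or the EvalPlus harness. The point is only that a solution counts as solved when it passes every test, so adding edge cases can flip a pass into a fail.

```python
from typing import Callable, Iterable, List, Tuple

# A test case is (positional_args, expected_result). Illustrative format only.
TestCase = Tuple[tuple, object]


def passes(candidate: Callable, tests: Iterable[TestCase]) -> bool:
    """Return True only if the candidate is correct on every test."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failed test.
            return False
    return True


def pass_at_1(solved: List[bool]) -> float:
    """pass@1 with one sample per problem: the fraction of problems solved."""
    return sum(solved) / len(solved)


# Hypothetical task: return the largest element of xs, or None if xs is empty.
def candidate_max(xs):
    best = 0  # bug: wrong for empty lists and all-negative lists
    for x in xs:
        if x > best:
            best = x
    return best


base_tests = [(([1, 2, 3],), 3), (([5, 4],), 5)]     # small original-style suite
extra_tests = [(([],), None), (([-2, -7],), -2)]      # added edge cases

base_ok = passes(candidate_max, base_tests)
plus_ok = passes(candidate_max, base_tests + extra_tests)
print(f"base suite: {base_ok}, extended suite: {plus_ok}")
```

The same sample is graded twice against different suites, which is why a HumanEval number and a HumanEval+ number for one model are not interchangeable.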
Common pitfalls when reporting HumanEval+
The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:
- Lower numbers are not always due to a worse model. Some HumanEval+ test cases are arguably stricter than the spec, so a model returning a sensible but unconventional answer can fail. Read the diff before declaring a regression.
- Most labs still report HumanEval, not HumanEval+. If you're comparing against a vendor's claimed number, check which version they're quoting. The gap is real.
- Contamination is reduced but not eliminated. EvalPlus tests are public; subsequent training runs include them. The advantage shrinks every quarter.
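One way to "read the diff" is to list the problems that pass the original tests but fail the extended ones, then inspect those by hand. The sketch below assumes you already have per-problem pass/fail results keyed by task ID; the dictionaries and the `regressed_problems` helper are hypothetical, not part of any Benchlist or EvalPlus API.

```python
from typing import Dict, List


def regressed_problems(base: Dict[str, bool], plus: Dict[str, bool]) -> List[str]:
    """Task IDs that pass the base tests but fail the extended tests."""
    return sorted(
        task for task, ok in base.items()
        if ok and not plus.get(task, False)
    )


# Hypothetical per-problem outcomes for one model.
base_results = {"HumanEval/0": True, "HumanEval/1": True, "HumanEval/2": False}
plus_results = {"HumanEval/0": True, "HumanEval/1": False, "HumanEval/2": False}

to_review = regressed_problems(base_results, plus_results)

# Gap in percentage points between the two pass@1 scores.
n = len(base_results)
gap_points = 100 * (sum(base_results.values()) - sum(plus_results.values())) / n

print("review by hand:", to_review)
print(f"gap: {gap_points:.1f} percentage points")
```

Each task in `to_review` is a candidate for either a genuine bug or a test that is stricter than the spec; only inspection tells the two apart.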
Live Benchlist leaderboard
Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.
How to ship a HumanEval+ score that nobody can challenge
Run HumanEval+ on Benchlist
Benchlist runs the canonical HumanEval+ sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
Hosted runner. POST a job, and we email the verify URL when it's done:
curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "humaneval-plus",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
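The same job can be submitted from Python with the standard library. This is a sketch that mirrors the curl call above field for field: the endpoint, headers, and payload come from that example, while the function names are hypothetical and the response format is not documented here, so the body is returned undecoded.

```python
import json
import os
import urllib.request

RUN_URL = "https://benchlist.ai/api/v1/run"  # endpoint taken from the curl example


def build_run_request(api_key: str, limit: int = 50) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "service": "anthropic-claude",
        "model": "claude-sonnet-4.5",
        "benchmark": "humaneval-plus",
        "runs": 1,
        "limit": limit,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        RUN_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(request: urllib.request.Request) -> bytes:
    """Send the request and return the raw response body (schema assumed unknown)."""
    with urllib.request.urlopen(request) as response:
        return response.read()


request = build_run_request(os.environ.get("BENCHLIST_KEY", "<your-key>"))
# submit(request)  # uncomment to actually queue the job
```

Building the request separately from sending it makes the payload easy to inspect or log before a job is queued.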
Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:
pip install benchlist-runner
benchlist run humaneval-plus --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json
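To make the Merkle commitment idea concrete, here is a minimal sketch using only the standard library. It is an illustration under stated assumptions, not Benchlist's construction: the hash function (SHA-256), the leaf and node prefixes, the duplication of the last node on odd levels, and the sample transcripts are all choices made for this example. Ed25519 signing is left out because it needs a third-party library; in a scheme like this the attestor would sign the root.

```python
import hashlib
from typing import List


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: List[str]) -> str:
    """Hex Merkle root over the transcripts, in the order given."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Distinct prefixes keep a leaf hash from being mistaken for an inner node.
    level = [sha256(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last node on odd levels
        level = [
            sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Hypothetical transcripts; a real run would have one per sampled problem.
transcripts = [
    "task 0: prompt + completion",
    "task 1: prompt + completion",
    "task 2: prompt + completion",
]
root = merkle_root(transcripts)

# Changing any single transcript changes the root.
tampered = merkle_root(["task 0: prompt + EDITED", *transcripts[1:]])
print(root)
print("tamper detected:", root != tampered)
```

Because the root depends on every transcript, a single signature over it commits to the whole run without publishing each transcript inside the signature itself.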
FAQ
What is HumanEval+?
How is HumanEval+ scored?
What's the biggest pitfall when reporting HumanEval+?
How do I verify a published HumanEval+ score?
Run benchlist run humaneval-plus or POST to /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.