HumanEval: methodology, history, and how to verify a published score

History

HumanEval was introduced by OpenAI in "Evaluating Large Language Models Trained on Code" (Chen et al., 2021), alongside the announcement of Codex. The dataset consists of 164 hand-written Python programming problems, each with a function signature, docstring, body, and unit tests.

HumanEval became the de facto code-generation leaderboard for the next 18 months. By late 2023 frontier models were saturating it (90%+ pass@1), which is why HumanEval+ (more tests per problem) and BigCodeBench (harder distribution) were proposed as successors.

How HumanEval is graded

The grading procedure is deterministic: for each problem, sample one completion from the model at temperature 0, append it to the prompt (the function signature and docstring), and run the canonical unit tests under a 3-second timeout. pass@1 is the fraction of problems whose tests all pass.
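As a rough illustration of that loop, the sketch below assembles one test program per problem and runs it in a subprocess with a timeout. It assumes problems use the field names of the public HumanEval JSONL release (task_id, prompt, test, entry_point). The helper names passes and pass_at_1 and the toy problem are invented here, and this is not the Benchlist runner. Model output is untrusted code, so a real harness should execute it inside a sandbox.

```python
import subprocess
import sys


def passes(problem, completion, timeout=3.0):
    """Return True if prompt + completion passes the problem's unit tests."""
    # HumanEval tests define check(candidate); call it on the entry point.
    program = (
        problem["prompt"] + completion + "\n\n"
        + problem["test"] + "\n\n"
        + "check(" + problem["entry_point"] + ")\n"
    )
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,  # the child is killed if it runs too long
        )
    except subprocess.TimeoutExpired:
        return False
    # A failed assertion or any other exception gives a non-zero exit code.
    return result.returncode == 0


def pass_at_1(problems, completions):
    """Fraction of problems whose single completion passes its tests."""
    passed = sum(passes(p, completions[p["task_id"]]) for p in problems)
    return passed / len(problems)


# Hypothetical toy problem in the same shape as a HumanEval record.
TOY = {
    "task_id": "Toy/0",
    "prompt": 'def add(a, b):\n    """Return a + b."""\n',
    "entry_point": "add",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
}
```

Calling pass_at_1([TOY], {"Toy/0": "    return a + b\n"}) scores the toy problem. A completion that loops forever is counted as a failure once the timeout expires.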

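Sampling strategy matters. Some labs report pass@10 or pass@100, which only make sense with sampling at a nonzero temperature: at temperature 0 every sample is identical, so drawing more of them adds nothing. Neither figure is comparable to pass@1. Always check whether a number is single-shot or best-of-k.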
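For reference, pass@k is usually computed with the unbiased estimator from Chen et al. (2021): draw n >= k samples per problem, count the c samples that pass, and average 1 - C(n-c, k)/C(n, k) over all problems. A minimal sketch follows. The function names are ours and are not taken from any particular harness.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass.
        return 1.0
    # 1 - P(all k samples drawn without replacement are failures).
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With greedy decoding every sample for a problem is the same, so c is either 0 or n and the estimate is identical for every k. That is why a pass@10 figure implies sampling at a nonzero temperature.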

Decoding parameters: the canonical config is temperature 0.0, max_tokens 1024. Higher temperatures inflate variance and make scores incomparable.

Common pitfalls when reporting HumanEval

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Training-set contamination is severe. HumanEval problems and their solutions appear verbatim across millions of GitHub repos, tutorials, and Stack Overflow answers. Any model trained after early 2022 has likely seen them. This is the central reason researchers use HumanEval+ and contamination-aware splits.
pass@k inflation. Quoting pass@10 against another lab's pass@1 hides 5–15 percentage points of headroom. Always demand the same k.
Test-suite-only signal. HumanEval tests are well-formed but small (7.7 tests per problem on average). HumanEval+ adds 80× more tests and routinely catches solutions that passed HumanEval but are incorrect on edge cases.
Whitespace + import edge cases. Different runners handle leading whitespace, missing imports, and indentation differently. A 1–2pp score drift across runners is normal.
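One way to guard against the mismatched-k and mismatched-test-suite pitfalls above is to refuse to compare two numbers until their metadata matches. The sketch below assumes each reported score carries its k, decoding parameters, and test suite. The ReportedScore record and its field names are hypothetical, not a Benchlist schema, and the model names and scores in the usage note are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReportedScore:
    """Hypothetical record of a published score plus how it was produced."""
    model: str
    score: float        # fraction in [0, 1]
    k: int              # the k in pass@k
    temperature: float
    max_tokens: int
    test_suite: str     # e.g. "humaneval" or "humaneval+"


def comparability_issues(a: ReportedScore, b: ReportedScore) -> List[str]:
    """List the reasons two scores should not be compared directly."""
    issues = []
    if a.k != b.k:
        issues.append(f"different k: pass@{a.k} vs pass@{b.k}")
    if a.test_suite != b.test_suite:
        issues.append(f"different test suites: {a.test_suite} vs {b.test_suite}")
    if a.temperature != b.temperature:
        issues.append(f"different temperature: {a.temperature} vs {b.temperature}")
    if a.max_tokens != b.max_tokens:
        issues.append(f"different max_tokens: {a.max_tokens} vs {b.max_tokens}")
    return issues
```

For example, comparing ReportedScore("model-a", 0.90, 10, 0.8, 1024, "humaneval") with ReportedScore("model-b", 0.85, 1, 0.0, 1024, "humaneval") returns two issues, one for k and one for temperature. An empty list means the two numbers were produced under the same stated settings, although runner differences can still cause small drift.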

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.

How to ship a HumanEval score that nobody can challenge

Run HumanEval on Benchlist

Benchlist runs the canonical HumanEval sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
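To make the commitment step concrete, here is a minimal sketch of a Merkle root over a list of transcripts using SHA-256. The leaf encoding, the domain-separation prefixes, and the handling of odd-sized levels are assumptions chosen for illustration. Benchlist's actual tree layout and its Ed25519 signing step are not specified here and are not reproduced.

```python
import hashlib


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts):
    """Return a hex Merkle root committing to every transcript, in order."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Assumed convention: prefix 0x00 for leaves and 0x01 for inner nodes,
    # so a leaf hash can never be mistaken for an inner-node hash.
    level = [_sha256(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumed convention: duplicate the last node on odd-sized levels.
            level.append(level[-1])
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Changing, adding, or reordering any transcript changes the root, so a single signature over the root covers the whole run.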

Hosted runner. POST a job, and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "humaneval",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
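The same request can be assembled from Python. The sketch below uses only the standard library and mirrors the endpoint and JSON fields of the curl call above. The function names are ours, and because the response format is not described here, submit returns the raw body without interpreting it.

```python
import json
import urllib.request

RUN_URL = "https://benchlist.ai/api/v1/run"


def build_run_request(api_key, model="claude-sonnet-4.5", limit=50):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "service": "anthropic-claude",
        "model": model,
        "benchmark": "humaneval",
        "runs": 1,
        "limit": limit,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        RUN_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(request):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Typical use would be submit(build_run_request(os.environ["BENCHLIST_KEY"])) after importing os. Keeping the builder separate from the network call makes the payload easy to inspect before anything is sent.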

Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run humaneval --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

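What is HumanEval?

A code-generation benchmark introduced by OpenAI in "Evaluating Large Language Models Trained on Code" (Chen et al., 2021), alongside the announcement of Codex. It consists of 164 hand-written Python programming problems, each with a function signature, docstring, body, and unit tests.

How is HumanEval scored?

By pass@1. For each problem, one completion is sampled at temperature 0 and run against the canonical unit tests under a 3-second timeout. The score is the fraction of problems where the tests pass.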

What's the biggest pitfall when reporting HumanEval?

Training-set contamination is severe. HumanEval problems and their solutions appear verbatim across millions of GitHub repos, tutorials, and Stack Overflow answers. Any model trained after early 2022 has likely seen them. This is the central reason researchers use HumanEval+ and contamination-aware splits.

How do I verify a published HumanEval score?

Use Benchlist. Run via benchlist run humaneval or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.

What are the canonical decoding parameters for HumanEval?

Per the catalog, HumanEval runs at temperature 0.0 with max_tokens 1024. Deviating without disclosure makes scores incomparable.