BigCodeBench: methodology, history, and how to verify a published score

Q: How is BigCodeBench scored?

pass@1 at temperature 0. Each problem ships ~17 unit tests on average, exercising real library behaviour. Tests run in a sandboxed Docker container.

Q: What's the biggest pitfall when reporting BigCodeBench?

Library version drift. Tests pin specific library versions. A model writing correct numpy 2.x code can fail tests pinned to numpy 1.x. Read the requirements file.

Q: How do I verify a published BigCodeBench score?

Use Benchlist. Run via benchlist run bigcodebench or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.

History

Released by Zhuo et al. in 2024 as the successor to MBPP and HumanEval, both of which had saturated. BigCodeBench tasks require multi-library function calls (numpy, pandas, scipy, requests, sklearn, etc.), closer to real engineering work than the toy problems in HumanEval.

Two splits: Full (all 1,140 problems) and Hard (a 148-problem subset of the most challenging tasks). Frontier models score 30–50% on Hard; models that have saturated HumanEval can score 40+ percentage points lower on this benchmark.

How BigCodeBench is graded

pass@1 at temperature 0. Each problem ships ~17 unit tests on average, exercising real library behaviour. Tests run in a sandboxed Docker container.

Two task types: Complete (function-completion from docstring) and Instruct (full instructions, no signature). Complete is easier; Instruct is the harder real-world test.
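As a rough illustration of the scoring rule above, the sketch below computes pass@1 for the single-sample, temperature-0 case: a problem counts as solved only if every one of its unit tests passes. The function name, the input shape, and the example data are assumptions made for illustration, not the official harness.

```python
from typing import Mapping, Sequence


def pass_at_1(results: Mapping[str, Sequence[bool]]) -> float:
    """Fraction of problems whose single greedy sample passed every unit test.

    `results` maps a problem id to the pass/fail outcome of each of its tests.
    A problem with no recorded test outcomes is counted as a failure.
    """
    if not results:
        raise ValueError("no problems to score")
    solved = sum(1 for outcomes in results.values() if outcomes and all(outcomes))
    return solved / len(results)


# Hypothetical outcomes for three problems (ids and results are made up).
example = {
    "task/0": [True, True, True],
    "task/1": [True, False, True],  # one failing test fails the whole problem
    "task/2": [True, True],
}
print(f"pass@1 = {pass_at_1(example):.3f}")  # prints "pass@1 = 0.667"
```

With one deterministic sample per problem, pass@1 reduces to a plain solve rate, which is why a single flaky or version-sensitive test can move the headline number.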

Common pitfalls when reporting BigCodeBench

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Library version drift. Tests pin specific library versions. A model writing correct numpy 2.x code can fail tests pinned to numpy 1.x. Read the requirements file.
Sandbox import overhead. Each test imports up to 10 libraries, and sandbox cold-start adds 2–4 s per test. Parallelisation matters.
Complete vs Instruct confusion. Complete is roughly twice as easy as Instruct. Cross-paper comparisons must use the same task type.
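One way to catch library version drift before a run is to compare the benchmark's pinned versions against what is actually installed. The sketch below assumes the pins use the plain `name==version` form; the helper names and the example versions are hypothetical and are not taken from the real requirements file.

```python
from importlib import metadata
from typing import Callable, Dict, Iterable, Optional, Tuple


def parse_pins(lines: Iterable[str]) -> Dict[str, str]:
    """Extract `name==version` pins, skipping comments, blanks, and unpinned lines."""
    pins: Dict[str, str] = {}
    for raw in lines:
        # Drop trailing comments and environment markers (text after ';').
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_drift(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each drifted package to a (pinned, installed) pair."""
    drift = {}
    for name, pinned in pins.items():
        found = lookup(name)
        if found != pinned:
            drift[name] = (pinned, found)
    return drift


# Hypothetical pins and a fake environment, so the example runs anywhere.
pins = parse_pins(["numpy==1.21.2", "# a comment", "pandas==2.0.3  # pinned", "requests"])
fake_env = {"numpy": "2.0.0", "pandas": "2.0.3"}
print(find_drift(pins, lookup=fake_env.get))  # {'numpy': ('1.21.2', '2.0.0')}
```

Calling `find_drift(pins)` without a `lookup` checks the current interpreter's environment instead of the fake one.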

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.

How to ship a BigCodeBench score that nobody can challenge

Run BigCodeBench on Benchlist

Benchlist runs the canonical BigCodeBench sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
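To make the idea of a Merkle commitment over transcripts concrete, here is a minimal sketch using SHA-256. The leaf and node prefixes, the rule for duplicating the last node on odd levels, and the byte encoding of transcripts are all assumptions chosen for illustration; Benchlist's actual tree construction may differ, and the Ed25519 signing step is not shown.

```python
import hashlib
from typing import Sequence


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: Sequence[bytes]) -> str:
    """Return a hex Merkle root committing to every transcript, in order.

    Assumed scheme: leaves are SHA-256(0x00 || transcript), inner nodes are
    SHA-256(0x01 || left || right), and an odd level duplicates its last node.
    """
    if not transcripts:
        raise ValueError("need at least one transcript")
    level = [_sha256(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pad odd levels by repeating the last node
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Hypothetical transcripts; changing any byte of any one changes the root.
transcripts = [b"prompt-0|completion-0", b"prompt-1|completion-1", b"prompt-2|completion-2"]
root = merkle_root(transcripts)
tampered = merkle_root([transcripts[0], b"prompt-1|edited", transcripts[2]])
print(root != tampered)  # True
```

The point of the commitment is that a single short root is enough to detect any later edit to a transcript, so a signature over the root covers the whole run.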

Hosted runner: POST a job and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "bigcodebench",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'

Self-hosted: install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run bigcodebench --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

What are the canonical decoding parameters for BigCodeBench?

Per the catalog, BigCodeBench runs at temperature 0.0 with max_tokens 2048. Deviating without disclosure makes scores incomparable.
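A small check along these lines can flag undisclosed deviations before a score is reported. The two canonical values come from the text above; the `run_config` dictionary and its key names are a hypothetical shape, not the schema of run.json.

```python
from typing import Any, Dict, Tuple

# Canonical decoding parameters as stated above.
CANONICAL_DECODING = {"temperature": 0.0, "max_tokens": 2048}


def decoding_deviations(run_config: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {parameter: (expected, actual)} for every non-canonical setting.

    A parameter missing from `run_config` is reported with actual=None.
    """
    return {
        key: (expected, run_config.get(key))
        for key, expected in CANONICAL_DECODING.items()
        if run_config.get(key) != expected
    }


# Hypothetical run configuration with a non-canonical temperature.
print(decoding_deviations({"temperature": 0.2, "max_tokens": 2048}))
# {'temperature': (0.0, 0.2)}
```

An empty result means the run used the canonical settings; anything else should be disclosed alongside the score.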