History
Released by Zhuo et al. in 2024 as the successor to MBPP and HumanEval, both of which had saturated. BigCodeBench tasks require multi-library function calls (numpy, pandas, scipy, requests, sklearn, etc.), closer to real engineering work than the toy problems in HumanEval.
Two subsets: Full (all 1,140 tasks) and Hard (a 148-task subset of the most challenging problems). Frontier models score 30–50% on Hard; models that score near ceiling on HumanEval can drop 40+ pp on this benchmark.
How BigCodeBench is graded
pass@1 at temperature 0 (greedy decoding). Each problem ships 5.6 unit tests on average, exercising real library behaviour. Tests run in a sandboxed Docker container.
Two task types: Complete (function-completion from docstring) and Instruct (full instructions, no signature). Complete is easier; Instruct is the harder real-world test.
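With a single greedy sample per task, pass@1 reduces to the fraction of tasks whose generated solution passes every one of its unit tests. A minimal sketch of that arithmetic, assuming a hypothetical results mapping from task ID to per-test outcomes (this is not the official harness's data format):

```python
def pass_at_1(results):
    """Fraction of tasks whose single sample passed all of its unit tests.

    `results` is a hypothetical mapping: task_id -> list of booleans,
    one boolean per unit test. The real harness output format may differ.
    """
    if not results:
        raise ValueError("no task results supplied")
    # A task counts as solved only if it has tests and every one passed.
    solved = sum(1 for outcomes in results.values() if outcomes and all(outcomes))
    return solved / len(results)


# Illustrative data only: the task IDs and outcomes are made up.
example = {
    "task/0": [True, True, True],
    "task/1": [True, False, True],  # a single failing test fails the task
    "task/2": [True, True],
    "task/3": [False],
}
score = pass_at_1(example)
```

Note that a task with four passing tests and one failing test contributes zero, which is why partial-credit intuitions from per-test pass rates do not carry over to pass@1.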
Common pitfalls when reporting BigCodeBench
The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:
- Library version drift. Tests pin specific library versions. A model writing correct numpy 2.x code can fail tests pinned to numpy 1.x. Read the requirements file.
- Sandbox import overhead. Each test imports up to 10 libraries, and sandbox cold-start adds 2–4s per test. Parallelisation matters.
- Complete vs Instruct confusion. Complete is substantially easier than Instruct, so a Complete score is not comparable to an Instruct score. Cross-paper comparisons must use the same task type.
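The version-drift pitfall above can be checked before a run. The following is a rough sketch, assuming the pins are written as exact `name==version` lines in a pip-style requirements file; the package versions in the usage comment are illustrative, not the benchmark's actual pins.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Collect exact `name==version` pins from pip-style requirements text."""
    pins = {}
    for raw in requirements_text.splitlines():
        # Drop comments and environment markers such as `; python_version < "3.11"`.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue  # unpinned or range-pinned lines are ignored in this sketch
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_drift(pins, installed_version=metadata.version):
    """Return {name: (pinned, installed_or_None)} for every mismatch."""
    drift = {}
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            drift[name] = (wanted, found)
    return drift


# Usage (the file name is an assumption):
#   with open("requirements.txt") as fh:
#       print(find_drift(parse_pins(fh.read())))
```

An empty result means the local environment matches the pins; any entry is a reason to distrust a locally reproduced score until the environment is rebuilt.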
Live Benchlist leaderboard
Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.
How to ship a BigCodeBench score that nobody can challenge
Run BigCodeBench on Benchlist
Benchlist runs the canonical BigCodeBench sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
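To make the idea of a Merkle commitment over transcripts concrete, here is a generic sketch using SHA-256. It is not Benchlist's actual construction: the hash function, the leaf and node prefixes, and the rule of duplicating the last node on odd levels are all assumptions chosen for illustration. The Ed25519 signing step is omitted because it needs a third-party library.

```python
import hashlib


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts):
    """Return a hex Merkle root over a list of transcript byte strings.

    Assumed conventions (illustrative only): leaves are prefixed with 0x00
    and internal nodes with 0x01 to keep the two domains separate, and the
    last node is duplicated when a level has an odd number of entries.
    """
    if not transcripts:
        raise ValueError("at least one transcript is required")
    level = [_sha256(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Changing any single transcript changes the root.
root = merkle_root([b"transcript-1", b"transcript-2", b"transcript-3"])
tampered = merkle_root([b"transcript-1", b"transcript-2", b"transcript-3 (edited)"])
```

The point of the commitment is that one short value binds every transcript: a signature over the root covers the whole run, and a verifier can detect a swapped or edited transcript by recomputing the root.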
Hosted runner. POST a job, and we email the verify URL when it's done:
curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "bigcodebench",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
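The same request can be built from Python with only the standard library. This sketch mirrors the curl call above field for field; the function names are hypothetical, and the shape of the response is not documented here, so the body is returned as raw text rather than parsed.

```python
import json
import urllib.request

API_URL = "https://benchlist.ai/api/v1/run"  # endpoint taken from the curl example


def build_run_request(api_key, model="claude-sonnet-4.5", limit=50):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "service": "anthropic-claude",
        "model": model,
        "benchmark": "bigcodebench",
        "runs": 1,
        "limit": limit,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(request):
    """Send the request and return (status, body text). Response schema is unknown."""
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")


# Usage, assuming BENCHLIST_KEY is set in the environment:
#   import os
#   status, body = submit(build_run_request(os.environ["BENCHLIST_KEY"]))
```

Keeping request construction separate from sending makes it easy to inspect or log the exact payload before a job is queued.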
Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:
pip install benchlist-runner
benchlist run bigcodebench --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json
FAQ
What is BigCodeBench?
How is BigCodeBench scored?
pass@1 at temperature 0 (greedy decoding). Each problem ships 5.6 unit tests on average, exercising real library behaviour. Tests run in a sandboxed Docker container.
What's the biggest pitfall when reporting BigCodeBench?
How do I verify a published BigCodeBench score?
Use the benchlist run bigcodebench command or POST to /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.