LiveCodeBench: methodology, history, and how to verify a published score

Q: How is LiveCodeBench scored?

LiveCodeBench reports pass@1 against the canonical hidden test set for each problem. Each problem also carries a contamination tag (its release date relative to the model's training cutoff) so you can filter.

Q: What's the biggest pitfall when reporting LiveCodeBench?

The date cutoff is the most important field. A 95% score on problems released before the model's training cutoff is meaningless, because the model may have seen those problems during training. Always filter to post-cutoff problems.

Q: How do I verify a published LiveCodeBench score?

Use Benchlist. Run via benchlist run livecodebench or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can verify the signature in their browser.

History

LiveCodeBench was introduced by Jain et al. in 2024 to address contamination in static benchmarks. It continuously collects new problems from LeetCode, AtCoder, and Codeforces, so every model snapshot can be evaluated only on problems posted after its training cutoff.

The temporal-cutoff property makes LiveCodeBench one of the cleanest public signals of coding capability currently available. Reported scores typically trail headline-grabbing HumanEval numbers by 20–40 percentage points. Much of that gap is attributed to contamination, although differences in problem difficulty also contribute.

How LiveCodeBench is graded

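Grading uses pass@1 against the canonical hidden test set for each problem: a problem counts as solved only if the single sampled solution passes every hidden test. Each problem also carries a contamination tag (its release date relative to the model's training cutoff) so you can filter.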

Code execution happens in a per-language Docker sandbox (Python, C++, Java).
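
As a rough illustration of the pass@1 rule described above, the sketch below scores a handful of made-up results. The ProblemResult record and its field names are hypothetical, not Benchlist's or LiveCodeBench's actual schema; the only assumption carried over from the text is that a problem is solved when its one sample passes every hidden test.

```python
from dataclasses import dataclass


@dataclass
class ProblemResult:
    problem_id: str    # hypothetical identifier
    tests_passed: int  # hidden tests passed by the single sampled solution
    tests_total: int   # hidden tests defined for the problem


def pass_at_1(results: list[ProblemResult]) -> float:
    """Fraction of problems whose one sampled solution passed every hidden test."""
    if not results:
        raise ValueError("no results to score")
    solved = sum(
        1 for r in results
        if r.tests_total > 0 and r.tests_passed == r.tests_total
    )
    return solved / len(results)


results = [
    ProblemResult("p1", 12, 12),  # solved
    ProblemResult("p2", 11, 12),  # a single failing test means unsolved
    ProblemResult("p3", 8, 8),    # solved
    ProblemResult("p4", 0, 10),   # unsolved
]
print(pass_at_1(results))  # 0.5
```

Partial credit is deliberately absent: p2 passes 11 of 12 tests and still counts as a failure.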

Common pitfalls when reporting LiveCodeBench

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Date cutoff. This is the most important field. A 95% score on problems released before the model's training cutoff is meaningless, because the model may have seen those problems during training. Always filter to post-cutoff problems.
Continuous freshness means a moving target. LiveCodeBench scores depend on which time window you evaluate. A v3 score is not comparable to a v4 score, so pin the version and report it.
Language mix. Different languages (Python, C++) have different solve rates, and an aggregate score hides this.
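
The first and third pitfalls can be made concrete with a small sketch. The record layout, dates, and cutoff below are invented for illustration; the point is only the order of operations: drop pre-cutoff problems first, then report solve rates per language alongside the aggregate.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (problem_id, release_date, language, solved)
runs = [
    ("a", date(2024, 1, 10), "python", True),   # released before the cutoff
    ("b", date(2024, 6, 2), "python", True),
    ("c", date(2024, 7, 15), "python", False),
    ("d", date(2024, 6, 20), "cpp", False),
    ("e", date(2024, 8, 1), "cpp", True),
    ("f", date(2024, 9, 9), "cpp", False),
]


def post_cutoff(records, cutoff):
    """Keep only problems released strictly after the training cutoff."""
    return [r for r in records if r[1] > cutoff]


def solve_rate_by_language(records):
    """Return {language: solved / attempted} for the given records."""
    totals = defaultdict(lambda: [0, 0])  # language -> [solved, attempted]
    for _, _, lang, solved in records:
        totals[lang][0] += int(solved)
        totals[lang][1] += 1
    return {lang: s / n for lang, (s, n) in totals.items()}


cutoff = date(2024, 4, 30)  # assumed training cutoff for the example
clean = post_cutoff(runs, cutoff)
rates = solve_rate_by_language(clean)
aggregate = sum(r[3] for r in clean) / len(clean)

print(len(clean))        # 5
print(rates["python"])   # 0.5
print(aggregate)         # 0.4
```

In this toy data the unfiltered score is 0.5, the post-cutoff aggregate is 0.4, and the C++ rate alone is one in three, so the same run can support three different headline numbers.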

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised; attested results from a real signed transcript always rank above vendor-disclosed ones.

How to ship a LiveCodeBench score that nobody can challenge

Run LiveCodeBench on Benchlist

Benchlist runs the canonical LiveCodeBench sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
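
To show what a Merkle commitment over transcripts buys you, here is a minimal sketch using only the standard library. The hash function (SHA-256), the leaf and node prefixes, and the handling of odd levels are assumptions made for the example; Benchlist's actual tree construction is not specified here. The Ed25519 step is omitted because it needs a third-party library; in a scheme like this the signature would typically cover the root.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: list[bytes]) -> bytes:
    """Compute a Merkle root over raw transcript bytes (illustrative scheme)."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Prefix leaves with 0x00 and interior nodes with 0x01 so that a leaf
    # can never be mistaken for an interior node.
    level = [_h(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        nxt = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # carry an unpaired node up unchanged
        level = nxt
    return level[0]


transcripts = [
    b'{"problem": "p1", "passed": true}',
    b'{"problem": "p2", "passed": false}',
    b'{"problem": "p3", "passed": true}',
]
root = merkle_root(transcripts)

# Flipping one result in one transcript yields a different root.
tampered = list(transcripts)
tampered[1] = b'{"problem": "p2", "passed": true}'
print(root.hex())
print(root != merkle_root(tampered))  # True
```

Because the root depends on every transcript, a signature over that single 32-byte value commits the signer to the whole run.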

Hosted runner. POST a job and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "livecodebench",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
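
For callers who prefer Python over curl, the same request can be assembled with the standard library. This is a sketch that mirrors the curl call above field for field; the helper name build_run_request is made up, and the response format is not documented here, so the example stops short of sending the request.

```python
import json
import os
import urllib.request


def build_run_request(api_key: str, model: str, limit: int = 50) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "service": "anthropic-claude",
        "model": model,
        "benchmark": "livecodebench",
        "runs": 1,
        "limit": limit,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        "https://benchlist.ai/api/v1/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Falls back to a placeholder key so the sketch runs without configuration.
req = build_run_request(os.environ.get("BENCHLIST_KEY", "placeholder-key"), "claude-sonnet-4.5")
print(req.get_method(), req.full_url)
# To submit, pass req to urllib.request.urlopen(); handling of the reply is
# left out because its shape is an unknown here.
```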

Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run livecodebench --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

What is LiveCodeBench?

What are the canonical decoding parameters for LiveCodeBench?

Per the catalog, LiveCodeBench runs at temperature 0.0 with max_tokens 2048. Deviating without disclosure makes scores incomparable.
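
One way to make such deviations visible is to diff a run's decoding settings against the canonical values before publishing. The sketch below assumes the settings are available as a plain dictionary; the function name and the shape of its return value are illustrative only.

```python
# Canonical values as stated in the catalog entry above.
CANONICAL = {"temperature": 0.0, "max_tokens": 2048}


def decoding_deviations(params: dict) -> dict:
    """Return {setting: (canonical, actual)} for every setting that differs."""
    return {
        name: (expected, params.get(name))
        for name, expected in CANONICAL.items()
        if params.get(name) != expected
    }


print(decoding_deviations({"temperature": 0.0, "max_tokens": 2048}))
# {}
print(decoding_deviations({"temperature": 0.2, "max_tokens": 2048}))
# {'temperature': (0.0, 0.2)}
```

A non-empty result is not an error in itself; it is the list of settings to disclose next to the score.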