SWE-bench Lite: methodology, history, and how to verify a published score

Q: How is SWE-bench Lite scored?

SWE-bench Lite uses the same harness as SWE-bench Verified: one Docker environment per repository, the official swebench evaluator, and grading by patch application followed by test pass.

Q: What's the biggest pitfall when reporting SWE-bench Lite?

Conflating Lite, Verified, and Full. These are three distinct difficulty regimes, so cross-subset comparisons are not legitimate.

Q: How do I verify a published SWE-bench Lite score?

Use Benchlist. Run via benchlist run swe-bench-lite or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can check the signature in their browser.

History

SWE-bench Lite was released by the SWE-bench team as a 300-task subset of the full benchmark. Tasks were selected to be solvable by mid-tier models, with shorter context, simpler patches, and fewer file edits per fix.

The Lite split is what most labs report when they want a SWE-bench number that's cheaper and faster to produce than one on the Verified subset. On the same model, frontier Lite scores typically run 5–10pp above Verified scores.

How SWE-bench Lite is graded

SWE-bench Lite uses the same harness as SWE-bench Verified: one Docker environment per repository, the official swebench evaluator, and patch-application + test-pass grading.

Selection bias: Lite tasks were curated to be 'solvable', which removes the hardest ones. The subset is not a uniform random sample of the full benchmark; interpret it as an 'easier subset', not a 'random sample'.
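As a rough sketch of what patch-application + test-pass grading amounts to: in the upstream SWE-bench harness, each task lists FAIL_TO_PASS tests (which the fix must turn green) and PASS_TO_PASS tests (which must stay green). The class and function names below are illustrative assumptions, not the swebench package's API.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task; field names are illustrative, not the swebench API."""
    patch_applied: bool
    fail_to_pass: dict[str, bool]  # test id -> passed after the patch
    pass_to_pass: dict[str, bool]  # test id -> still passing after the patch


def is_resolved(result: TaskResult) -> bool:
    """A task counts only if the patch applies and every listed test passes."""
    if not result.patch_applied:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def score(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved, in [0, 1]."""
    if not results:
        raise ValueError("no results to score")
    return sum(is_resolved(r) for r in results) / len(results)
```

A patch that fails to apply scores the same as one that applies but breaks a previously passing test: both count as unresolved.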

Common pitfalls when reporting SWE-bench Lite

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Lite vs Verified vs Full. These are three distinct difficulty regimes, so cross-subset comparisons are not legitimate.
Apply-format errors masked at smaller n. With 300 problems, each individual failure moves the score by 0.33pp, so confidence intervals are wide.
Selection bias. Lite is by construction easier than a random subset of the full benchmark, so an 'I beat SWE-bench' claim based on Lite may not hold up on Verified or Full.
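To put a number on how wide those intervals are, here is a minimal sketch using the Wilson score interval for a binomial proportion. The 50% score is a hypothetical input chosen because it gives the widest interval, not a real result; n=50 is included because the run examples on this page use a limit of 50.

```python
import math


def wilson_interval(resolved: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= resolved <= n:
        raise ValueError("need n > 0 and 0 <= resolved <= n")
    p = resolved / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


for n in (300, 50):
    resolved = n // 2  # hypothetical 50% score, not a real result
    lo, hi = wilson_interval(resolved, n)
    print(f"n={n}: one task = {100 / n:.2f}pp, 95% CI width = {100 * (hi - lo):.1f}pp")
```

At a 50% score the interval is about 11pp wide on all 300 tasks and about 27pp wide on a 50-task sample, so small gaps between two runs are rarely meaningful on their own.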

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results backed by a real signed transcript always rank above vendor-disclosed ones.

How to ship a SWE-bench Lite score that nobody can challenge

Run SWE-bench Lite on Benchlist

Benchlist runs the canonical SWE-bench Lite sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
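For intuition about what a Merkle commitment over transcripts buys you, here is a minimal sketch. The construction is an assumption, not Benchlist's documented scheme: SHA-256, leaf and node prefixes in the style of RFC 6962, and the last node promoted unchanged when a level has an odd count. Signing the resulting root with Ed25519 needs a third-party library and is left out.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: list[bytes]) -> str:
    """Hex Merkle root over the transcripts, in the order given.

    Assumed construction (not confirmed as Benchlist's): 0x00 prefix for
    leaves, 0x01 prefix for interior nodes, odd node promoted unchanged.
    """
    if not transcripts:
        raise ValueError("need at least one transcript")
    level = [_h(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        nxt = []
        # Hash adjacent pairs; a trailing unpaired node moves up as-is.
        for i in range(0, len(level) - 1, 2):
            nxt.append(_h(b"\x01" + level[i] + level[i + 1]))
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0].hex()
```

Changing a single byte in any transcript changes the root, which is why a signature over the root alone is enough to commit to every transcript in the run.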

Hosted runner: POST a job and we email the verify URL when it's done.

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "swe-bench-lite",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'

Self-hosted: install benchlist-runner via pip, point it at your inference key, and get a signed run.json.

pip install benchlist-runner
benchlist run swe-bench-lite --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

What is SWE-bench Lite?

SWE-bench Lite is a 300-task subset of the full benchmark, released by the SWE-bench team. Tasks were selected to be solvable by mid-tier models, with shorter context, simpler patches, and fewer file edits per fix.

How is SWE-bench Lite scored?

SWE-bench Lite uses the same harness as SWE-bench Verified: one Docker environment per repository, the official swebench evaluator, and patch-application + test-pass grading.

What's the biggest pitfall when reporting SWE-bench Lite?

Conflating Lite, Verified, and Full. These are three distinct difficulty regimes, so cross-subset comparisons are not legitimate.

How do I verify a published SWE-bench Lite score?

Use Benchlist. Run via benchlist run swe-bench-lite or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can check the signature in their browser.

What are the canonical decoding parameters for SWE-bench Lite?

Per the catalog, SWE-bench Lite runs at temperature 0.0 with max_tokens 8192. Deviating without disclosure makes scores incomparable.
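One way to make deviations hard to miss is to diff a run's reported settings against the canonical values before publishing. The sketch below assumes the run's decoding settings are available as a plain dict; the key names and the helper are hypothetical, not part of any Benchlist schema.

```python
CANONICAL = {"temperature": 0.0, "max_tokens": 8192}  # values stated above


def decoding_deviations(run_config: dict) -> dict:
    """Return {param: (expected, actual)} for every canonical param that differs.

    `run_config` is a hypothetical dict of the decoding settings a run reports;
    a missing key is reported with an actual value of None.
    """
    return {
        key: (expected, run_config.get(key))
        for key, expected in CANONICAL.items()
        if run_config.get(key) != expected
    }
```

An empty result means the run used the canonical settings; anything else is what should be disclosed alongside the score.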