History
Released by the SWE-bench team alongside the full benchmark, SWE-bench Lite is a set of 300 tasks selected to be solvable by mid-tier models: shorter context, simpler patches, and fewer file edits per fix.
The Lite split is what most labs report when they want a SWE-bench number that's cheaper and faster than the Verified subset. Frontier scores typically run 5–10pp above Verified scores on the same model.
How SWE-bench Lite is graded
Same harness as SWE-bench Verified: one Docker environment per repository, the official swebench evaluator, and patch-application plus test-pass grading.
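The pass/fail rule behind "patch-application + test-pass grading" can be sketched as follows. In SWE-bench, an instance counts as resolved only if the patch applies and every test in the instance's FAIL_TO_PASS and PASS_TO_PASS lists passes afterwards. The class and function names below are illustrative assumptions, not the swebench package's API.

```python
from dataclasses import dataclass, field


@dataclass
class InstanceResult:
    """Outcome of one task; a hypothetical structure for illustration."""
    patch_applied: bool
    fail_to_pass: list[str]   # tests the fix must turn green
    pass_to_pass: list[str]   # tests that must stay green
    passed_tests: set[str] = field(default_factory=set)


def is_resolved(result: InstanceResult) -> bool:
    """Resolved only if the patch applied and all listed tests passed."""
    if not result.patch_applied:
        return False
    required = set(result.fail_to_pass) | set(result.pass_to_pass)
    return required <= result.passed_tests


def resolved_rate(results: list[InstanceResult]) -> float:
    """Percentage of instances resolved; unapplied patches count as failures."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)
```

Note that a patch that fails to apply scores exactly like a wrong fix, which is why apply-format errors matter for the headline number.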
Selection bias: Lite tasks were curated to be 'solvable', which removes the hardest ones. The benchmark is not a statistically uniform sample, so interpret it as an 'easier subset', not a 'random sample'.
Common pitfalls when reporting SWE-bench Lite
The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:
- Lite vs Verified vs Full. Three distinct difficulty regimes. Cross-subset comparisons are not legitimate.
- Apply-format errors masked at smaller n. With 300 problems, each individual failure moves the score by about 0.33pp, so confidence intervals are wide.
- Selection bias. Lite is by construction easier than a random subset. The 'I beat SWE-bench' narrative falls apart on Verified or Full.
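To make the "confidence intervals are wide" point concrete, here is a minimal sketch that computes the per-task swing and a 95% Wilson score interval for a Lite run. The 150-of-300 result is a hypothetical input, not a reported score.

```python
import math


def wilson_interval(resolved: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate, returned in percent."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = resolved / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return 100 * (centre - half), 100 * (centre + half)


n = 300
per_task_pp = 100 / n  # one task moves the score by about 0.33pp
low, high = wilson_interval(resolved=150, n=n)  # hypothetical 50% run
print(f"one task = {per_task_pp:.2f}pp; 95% interval = {low:.1f}% to {high:.1f}%")
```

For a score near 50% the interval spans roughly 44% to 56%, so two models a few points apart on Lite are not clearly separated by a single run.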
Live Benchlist leaderboard
Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.
How to ship a SWE-bench Lite score that nobody can challenge
Run SWE-bench Lite on Benchlist
Benchlist runs the canonical SWE-bench Lite sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
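As a rough illustration of what a Merkle commitment over transcripts involves, the sketch below hashes each transcript into a leaf and folds pairs of hashes up to a single root. Benchlist's actual tree layout is not described here, so the SHA-256 choice, the 0x00/0x01 leaf and node prefixes, and the odd-level padding rule are all assumptions. Signing the root with Ed25519 needs a third-party library and is not shown.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: list[bytes]) -> str:
    """Hex Merkle root over transcripts, in the order given."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Prefix leaves with 0x00 and inner nodes with 0x01 so a leaf
    # can never be mistaken for an internal node.
    level = [_h(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumed padding: duplicate the last node
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Changing or reordering any transcript changes the root, which is what lets a single signed value commit to every transcript in the run.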
Hosted runner. POST a job and we email the verify URL when it's done:
curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "swe-bench-lite",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
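If you would rather submit the job from a script, the same request can be built with the Python standard library. This is a sketch that mirrors the curl call above field for field. The helper names are hypothetical, and because the response schema is not documented here, the body is returned verbatim.

```python
import json
import urllib.request


def build_run_request(api_key: str, model: str, limit: int = 50) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "service": "anthropic-claude",
        "model": model,
        "benchmark": "swe-bench-lite",
        "runs": 1,
        "limit": limit,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        "https://benchlist.ai/api/v1/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit_run(req: urllib.request.Request) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

A typical call would be submit_run(build_run_request(os.environ["BENCHLIST_KEY"], "claude-sonnet-4.5")), after importing os.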
Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:
pip install benchlist-runner
benchlist run swe-bench-lite --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json
FAQ
What is SWE-bench Lite?
How is SWE-bench Lite scored?
Same harness as SWE-bench Verified: one Docker environment per repository, the official swebench evaluator, and patch-application plus test-pass grading.
What's the biggest pitfall when reporting SWE-bench Lite?
How do I verify a published SWE-bench Lite score?
Use benchlist run swe-bench-lite or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.