History
SWE-bench was introduced by Jimenez et al. (Princeton, 2023). It consists of 2,294 real GitHub issues from 12 popular Python repositories, each paired with the human-written pull request that fixed it. The model's job is to produce a patch that makes the failing tests pass.
SWE-bench Verified is the 500-task subset that OpenAI re-validated in August 2024 to remove ambiguous, malformed, or near-impossible tasks. It is now the canonical version reported in model cards.
Frontier scores rose from ~12% (GPT-4) in late 2023 to 60–70% (Claude Opus 4.x, GPT-5.x) by 2026. SWE-bench Verified is the strongest current public signal of practical coding-agent capability.
How SWE-bench Verified is graded
For each task: the model is given the repo at a specific commit, the issue text, and a small budget of agent turns. It must produce a unified diff that, when applied, makes the failing tests pass without breaking the passing tests.
Grading runs the official swebench harness inside a per-repo Docker image with pinned dependencies. Patch-application failures count as failures.
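The pass criterion described above can be sketched in a few lines. FAIL_TO_PASS and PASS_TO_PASS are the names the SWE-bench dataset uses for the two test lists; the function names and the "PASSED" status string below are illustrative assumptions, not the swebench harness's own API.

```python
from typing import Dict, List


def is_resolved(
    fail_to_pass: List[str],
    pass_to_pass: List[str],
    outcomes: Dict[str, str],
    patch_applied: bool = True,
) -> bool:
    """Return True only if the patch applied and every listed test passed.

    `outcomes` maps a test name to a status string; "PASSED" is an
    assumed label for a passing test (hypothetical, for illustration).
    """
    # A patch that fails to apply counts as a failed task.
    if not patch_applied:
        return False
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test missing from `outcomes` (e.g. it never ran) is not a pass.
    return all(outcomes.get(test) == "PASSED" for test in required)


def resolved_rate(results: List[bool]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty run."""
    return sum(results) / len(results) if results else 0.0
```

Under this sketch a task only counts when both lists pass in full, so a patch that fixes the issue but breaks one previously passing test is scored as unresolved.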
Reported numbers depend heavily on the agent harness. A model running under cline / cursor / claude-code / a custom scaffold can score 5–15 percentage points differently from the same model given a bare prompt. Always disclose the harness.
Common pitfalls when reporting SWE-bench Verified
The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:
- Harness mismatch. Comparing two models is only meaningful if both ran under the same agent loop, the same retry budget, and the same tool set.
- Verified vs Lite vs Full. Three subsets exist (Full, Lite, Verified), and they have different difficulty distributions. Cross-subset comparisons are not legitimate.
- Apply-format errors. A non-trivial fraction of failures are diffs that are correct in substance but malformed as unified diffs. Separate these out before attributing failures to reasoning gaps.
- Test-leakage time horizon. Older model snapshots may have seen the fix PRs in their pre-training data. Models released before mid-2024 should be treated with extra suspicion on this benchmark.
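Two of the checks above are mechanical enough to sketch in code: whether two runs are comparable at all, and how much of the failure mass comes from patches that never applied. The `RunConfig` fields and the outcome labels ("resolved", "apply_error", "test_failure") are hypothetical names chosen for illustration; they do not come from any published run schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RunConfig:
    subset: str              # "full", "lite" or "verified"
    harness: str             # agent loop / scaffold used
    retry_budget: int        # attempts allowed per task
    tools: Tuple[str, ...]   # tools exposed to the agent


def comparability_issues(a: RunConfig, b: RunConfig) -> List[str]:
    """List every setting that differs; an empty list means comparable."""
    issues = []
    if a.subset != b.subset:
        issues.append(f"subset: {a.subset} vs {b.subset}")
    if a.harness != b.harness:
        issues.append(f"harness: {a.harness} vs {b.harness}")
    if a.retry_budget != b.retry_budget:
        issues.append(f"retry_budget: {a.retry_budget} vs {b.retry_budget}")
    if set(a.tools) != set(b.tools):  # tool order is irrelevant
        issues.append("tools differ")
    return issues


def split_failures(outcomes: List[str]) -> Dict[str, float]:
    """Report the headline rate alongside the apply-error share."""
    total = len(outcomes)
    resolved = outcomes.count("resolved")
    apply_errors = outcomes.count("apply_error")
    applied = total - apply_errors
    return {
        # Headline score: apply errors still count as failures.
        "resolved_rate": resolved / total if total else 0.0,
        "apply_error_rate": apply_errors / total if total else 0.0,
        # Diagnostic only: success among patches that applied cleanly.
        "resolved_rate_given_applied": resolved / applied if applied else 0.0,
    }
```

The conditional rate is a diagnostic, not a replacement score: a large gap between it and the headline rate suggests a formatting problem in the scaffold rather than a reasoning gap in the model.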
Live Benchlist leaderboard
Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results backed by a real signed transcript always rank above vendor-disclosed ones.
How to ship a SWE-bench Verified score that nobody can challenge
Run SWE-bench Verified on Benchlist
Benchlist runs the canonical SWE-bench Verified sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
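To make the idea of a Merkle commitment over transcripts concrete, here is a minimal generic sketch. The text above does not specify Benchlist's hash function or tree layout, so SHA-256, the leaf and node prefixes, and the handling of an odd node are all assumptions made for illustration. The Ed25519 signing step is left out because it needs a third-party library.

```python
import hashlib
from typing import List


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: List[bytes]) -> str:
    """Hex Merkle root over transcripts (assumed layout, not Benchlist's).

    Leaves and internal nodes get different prefix bytes so that a leaf
    can never be confused with an internal node.
    """
    if not transcripts:
        raise ValueError("need at least one transcript")
    level = [_sha256(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        # Hash adjacent pairs into the next level up.
        nxt = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        # An unpaired last node is carried up unchanged.
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0].hex()
```

Because the root depends on every byte of every transcript, changing or reordering any transcript changes the root, which is what lets a single signed value stand in for the whole run.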
Hosted runner. POST a job, and we email the verify URL when it is done:
curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "swe-bench-verified",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
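For scripted use, the same request can be built from Python with the standard library. This sketch only mirrors the URL, headers, and JSON fields of the curl call; `build_run_request` is a hypothetical helper name, and the response format is not described here, so the code stops at constructing the request.

```python
import json
import urllib.request

# URL and payload fields are copied from the curl example; nothing else
# about the API is assumed.
API_URL = "https://benchlist.ai/api/v1/run"


def build_run_request(
    api_key: str, model: str, limit: int = 50
) -> urllib.request.Request:
    """Build (but do not send) the POST request for a hosted run."""
    payload = {
        "service": "anthropic-claude",
        "model": model,
        "benchmark": "swe-bench-verified",
        "runs": 1,
        "limit": limit,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the job; keeping construction separate from sending makes the payload easy to inspect or log first.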
Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:
pip install benchlist-runner
benchlist run swe-bench-verified --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json
FAQ
What is SWE-bench Verified?
How is SWE-bench Verified scored?
What's the biggest pitfall when reporting SWE-bench Verified?
How do I verify a published SWE-bench Verified score?
Run benchlist run swe-bench-verified or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature check in their browser.