SWE-bench Verified: methodology, history, and how to verify a published score

History

SWE-bench was introduced by Jimenez et al. (Princeton, 2023): 2,294 real GitHub issues from 12 popular Python repos, each paired with the human-written PR that fixed it. The model's job: produce a patch that makes the failing tests pass.

SWE-bench Verified is the 500-task subset that OpenAI re-validated in August 2024 to remove ambiguous, malformed, or near-impossible tasks. It is now the canonical version reported in model cards.

Frontier scores rose from ~12% (GPT-4) in late 2023 to 60–70% (Claude Opus 4.x, GPT-5.x) by 2026. SWE-bench Verified is widely treated as one of the strongest public signals of practical coding-agent capability.

How SWE-bench Verified is graded

For each task: the model is given the repo at a specific commit, the issue text, and a small budget of agent turns. It must produce a unified diff that, when applied, makes the failing tests pass without breaking the passing tests.

Grading runs the official swebench harness inside a per-repo Docker image with pinned dependencies. Patch-application failures count as failures.
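The pass/fail rule above can be sketched in a few lines. This is an illustration of the criterion, not the swebench harness itself: FAIL_TO_PASS and PASS_TO_PASS are the test-list fields in the SWE-bench dataset, while the function names and the shape of the results mapping are assumptions made for this example.

```python
from typing import Dict, List


def is_resolved(
    fail_to_pass: List[str],
    pass_to_pass: List[str],
    results: Dict[str, bool],
    patch_applied: bool = True,
) -> bool:
    """Return True only if the patch applied and every required test passes.

    `results` maps a test id to True (passed) or False (failed); a test
    missing from the mapping is treated as failed (an assumption here).
    """
    if not patch_applied:
        # Patch-application failures count as failures.
        return False
    fixed = all(results.get(test, False) for test in fail_to_pass)
    kept = all(results.get(test, False) for test in pass_to_pass)
    return fixed and kept


def resolved_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks resolved, the headline score."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

A task where the target test now passes but a previously passing test breaks is still unresolved, which is why both lists are checked.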

Reported numbers depend heavily on the agent harness. The same model's score under cline, cursor, claude-code, or a custom scaffold can differ by 5–15 percentage points from its score with a bare prompt. Always disclose the harness.

Common pitfalls when reporting SWE-bench Verified

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Harness mismatch. Comparing two models is only meaningful if both ran under the same agent loop, the same retry budget, and the same tool set.
Verified vs Lite vs Full. Three subsets exist (Full, Lite, Verified), and they have different difficulty distributions. Cross-subset comparisons are not legitimate.
Apply-format errors. A non-trivial fraction of failures are diffs that make the right change but are malformed as unified diffs. Count these separately before attributing failures to reasoning gaps.
Test-leakage time horizon. Older model snapshots may have seen the fix PRs in their pre-training data. Models released before mid-2024 should be treated with extra suspicion on this benchmark.
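
One way to guard against the first two pitfalls is to check run metadata before putting two scores side by side. The sketch below is a minimal illustration: the RunMeta fields and the comparability_issues helper are hypothetical names invented for this example, not part of any published schema.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class RunMeta:
    """Hypothetical disclosure record for one benchmark run."""
    model: str
    subset: str            # "full", "lite", or "verified"
    harness: str           # agent loop / scaffold name
    retry_budget: int      # attempts allowed per task
    tools: FrozenSet[str]  # tools exposed to the agent


def comparability_issues(a: RunMeta, b: RunMeta) -> List[str]:
    """List the reasons two runs should not be compared directly."""
    issues = []
    if a.subset != b.subset:
        issues.append(f"subset differs: {a.subset} vs {b.subset}")
    if a.harness != b.harness:
        issues.append(f"harness differs: {a.harness} vs {b.harness}")
    if a.retry_budget != b.retry_budget:
        issues.append(f"retry budget differs: {a.retry_budget} vs {b.retry_budget}")
    if a.tools != b.tools:
        issues.append("tool set differs")
    return issues
```

An empty list means the two runs share subset, harness, retry budget, and tool set, so the score gap can reasonably be attributed to the models.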

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.
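
The ranking rule described here can be expressed as a two-part sort key. This is a sketch under assumptions: the schema of /api/runs.json is not documented in this page, so the "attested" and "score" field names below are hypothetical.

```python
from typing import Any, Dict, List


def rank_runs(runs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Order runs so attested results come first, then by score descending.

    Each run is assumed to carry a boolean "attested" flag and a numeric
    "score"; both field names are illustrative.
    """
    # False sorts before True, so `not attested` puts attested runs first.
    return sorted(runs, key=lambda r: (not r["attested"], -r["score"]))
```

With this key, a self-reported 0.70 still ranks below an attested 0.55, matching the policy that attested results always rank above vendor-disclosed ones.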

How to ship a SWE-bench Verified score that nobody can challenge

Run SWE-bench Verified on Benchlist

Benchlist runs the canonical SWE-bench Verified sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
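
To make the idea of a Merkle commitment over transcripts concrete, here is a minimal sketch. It is not Benchlist's actual construction, which this page does not specify: the use of SHA-256, the leaf/node prefix bytes, and duplicating the last node on odd levels are all assumptions chosen for illustration. Ed25519 signing of the root is omitted because it needs a third-party library.

```python
import hashlib
from typing import List


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: List[bytes]) -> str:
    """Return a hex Merkle root committing to every transcript in order."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Prefix leaves with 0x00 and inner nodes with 0x01 so a leaf hash
    # can never be reinterpreted as an inner node (assumed scheme).
    level = [_h(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Odd count: pair the last node with itself (assumed scheme).
            level.append(level[-1])
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Changing a single byte in any transcript changes the root, so a signature over the root covers every transcript without the signature having to include them.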

Hosted runner: POST a job and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "swe-bench-verified",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
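
The same request can be built from Python with only the standard library. This sketch mirrors the curl call above field for field; the build_run_request helper name is hypothetical, and the response format is not documented here, so the example stops at constructing the request.

```python
import json
import urllib.request


def build_run_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "service": "anthropic-claude",
        "model": "claude-sonnet-4.5",
        "benchmark": "swe-bench-verified",
        "runs": 1,
        "limit": 50,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        "https://benchlist.ai/api/v1/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would submit the job; read the key from the BENCHLIST_KEY environment variable rather than hard-coding it.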

Self-hosted: install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run swe-bench-verified --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

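What is SWE-bench Verified?

The 500-task subset of SWE-bench that OpenAI re-validated in August 2024 to remove ambiguous, malformed, or near-impossible tasks.

How is SWE-bench Verified scored?

Each model patch is applied to the repo and run through the official swebench harness in a per-repo Docker image. A task counts as solved only if the failing tests pass and the passing tests still pass; patch-application failures count as failures.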

What's the biggest pitfall when reporting SWE-bench Verified?

Harness mismatch. Comparing two models is only meaningful if both ran under the same agent loop, the same retry budget, and the same tool set.

How do I verify a published SWE-bench Verified score?

Use Benchlist. Run via benchlist run swe-bench-verified or POST /v1/run; the result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.

What are the canonical decoding parameters for SWE-bench Verified?

Per the catalog, attested SWE-bench Verified runs use temperature 0.0 with max_tokens 8192. Deviating without disclosure makes scores incomparable.
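
A quick pre-publication check can flag any deviation from these two values. This is a minimal sketch: the CANONICAL dictionary restates the parameters given above, while the decoding_deviations helper and the config dictionary shape are assumptions for illustration.

```python
from typing import Any, Dict, Tuple

# Canonical decoding parameters as stated in the text above.
CANONICAL: Dict[str, Any] = {"temperature": 0.0, "max_tokens": 8192}


def decoding_deviations(config: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each deviating parameter to (actual, canonical).

    A parameter missing from `config` is reported with actual value None.
    """
    return {
        key: (config.get(key), expected)
        for key, expected in CANONICAL.items()
        if config.get(key) != expected
    }
```

An empty result means the run used the canonical settings; anything else should be disclosed alongside the score.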