History
Introduced by Sakaguchi et al. (AI2, 2019) as a large-scale, adversarial extension of the original Winograd Schema Challenge. Each problem is a sentence with a blank, marked by an underscore, and two candidate referents that could fill it.
It was designed to defeat the 'simple' surface heuristics that had solved the original Winograd schemas. Frontier models now score above 90% on the validation set.
How WinoGrande is graded
Each item is a two-option binary choice: pick the referent that fits the sentence. A response is graded correct when it matches the right option, either by letter or by option text.
Five training-set sizes are released (XS through XL). Most evaluations use the validation set against the XL training prompt format.
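The letter-or-option-text matching described above can be sketched as follows. This is a minimal illustration, not Benchlist's actual grader: the function name `grade`, the accepted letter formats, and the normalisation rules (case-folding, trimming whitespace and a trailing full stop) are all assumptions.

```python
import re


def grade(response: str, options: tuple[str, str], answer_index: int) -> bool:
    """Return True if `response` selects the correct option.

    Hypothetical grader: accepts a letter ("A"/"B", optionally wrapped in
    parentheses or followed by "." or ":") or the option text itself.
    """
    text = response.strip().lower()

    # Letter match: "a", "b", "(a)", "b.", "a:" and similar.
    letter = re.fullmatch(r"\(?([ab])\)?[.:]?", text)
    if letter:
        return "ab".index(letter.group(1)) == answer_index

    # Option-text match: compare after light normalisation.
    normalised = [option.strip().lower() for option in options]
    text = text.rstrip(".")
    if text in normalised:
        return normalised.index(text) == answer_index

    # Anything else (refusals, free-form explanations) counts as wrong.
    return False
```

For an illustrative item with options ("trophy", "suitcase") and the first option correct, `grade("A", ("trophy", "suitcase"), 0)` and `grade("Trophy", ("trophy", "suitcase"), 0)` both count as correct, while an unparseable answer such as "neither" is scored as wrong rather than skipped.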
Common pitfalls when reporting WinoGrande
The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:
- Two options means a 50% chance baseline. Always show the absolute lift over chance rather than raw accuracy alone.
- Sentences are short and stylised. The benchmark is increasingly out-of-distribution for models trained on long-form internet text.
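The lift-over-chance point from the list above comes down to simple arithmetic, sketched here. The function name and the returned fields are illustrative assumptions; the only fact taken from the benchmark is the 50% chance baseline for a two-option task.

```python
CHANCE = 0.5  # two options, so random guessing scores 50%


def lift_over_chance(correct: int, total: int) -> dict[str, float]:
    """Report accuracy together with its lift over the 50% baseline."""
    if total <= 0:
        raise ValueError("total must be positive")
    accuracy = correct / total
    return {
        "accuracy": accuracy,
        # Absolute lift, as a fraction: accuracy minus chance.
        "lift": accuracy - CHANCE,
        # Rescaled so that 0 means chance and 1 means a perfect score.
        "normalised": (accuracy - CHANCE) / (1 - CHANCE),
    }
```

With 45 correct answers out of 50, accuracy is 90% but the lift over chance is 40 points. Reporting both makes it harder to mistake a near-chance score for a respectable one.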
Live Benchlist leaderboard
Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.
How to ship a WinoGrande score that nobody can challenge
Run WinoGrande on Benchlist
Benchlist runs the canonical WinoGrande sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
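To make the idea of a Merkle commitment over transcripts concrete, here is a minimal sketch using only SHA-256 from the Python standard library. It is not Benchlist's implementation: the hash function, the leaf and node prefixes, and the rule of duplicating the last node on odd-sized levels are all assumptions chosen for illustration, and the Ed25519 signing step is left out because it needs a third-party library.

```python
import hashlib


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: list[bytes]) -> str:
    """Return a hex Merkle root committing to every transcript, in order.

    Assumed layout (hypothetical): leaves are hashed with a 0x00 prefix and
    inner nodes with a 0x01 prefix, so a leaf can never be confused with an
    inner node.
    """
    if not transcripts:
        raise ValueError("need at least one transcript")

    level = [_sha256(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2:
            # Odd number of nodes: pair the last node with itself.
            level.append(level[-1])
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

The useful property is that the root changes if any single transcript changes, so signing the root alone is enough to commit to the whole run.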
Hosted runner. POST a job and we email you the verify URL when it's done:
curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "winogrande",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:
pip install benchlist-runner
benchlist run winogrande --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json
FAQ
What is WinoGrande?
How is WinoGrande scored?
What's the biggest pitfall when reporting WinoGrande?
How do I verify a published WinoGrande score?
Use the benchlist run winogrande command or POST to /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.