WinoGrande: methodology, history, and how to verify a published score

History

Introduced by Sakaguchi et al. (AI2, 2019) as a large-scale, adversarial extension of the original Winograd Schema Challenge. Each problem is a sentence with a blank, written as an underscore, and two candidate referents to fill it.

It was designed to defeat the simple surface heuristics that solved the original Winograd Schemas. Frontier models now score 90% or higher on the validation set.

How WinoGrande is graded

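Each item is a two-option binary choice: the model picks the referent that fits the sentence. A response is scored as correct when it matches the gold option, either by its letter or by its option text.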
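The matching rule above can be sketched in a few lines of Python. This is a minimal illustration, not Benchlist's grader: the field names, the A/B letter mapping, the normalisation (trimming whitespace and a trailing period, lowercasing), and the example sentence are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One two-option problem (field names are illustrative)."""
    sentence: str  # contains a single "_" blank
    option1: str
    option2: str
    answer: str    # gold label, "1" or "2"


def grade(item: Item, response: str) -> bool:
    """Return True if the response picks the gold option by label or by text."""
    text = response.strip().strip(".").lower()
    # Assumed label scheme: accept 1/2 as well as A/B.
    labels = {"1": "1", "2": "2", "a": "1", "b": "2"}
    if text in labels:
        return labels[text] == item.answer
    options = {item.option1.lower(): "1", item.option2.lower(): "2"}
    # Anything that matches neither option is graded as wrong.
    return options.get(text) == item.answer


def accuracy(items, responses) -> float:
    """Fraction of responses graded correct."""
    graded = [grade(i, r) for i, r in zip(items, responses)]
    return sum(graded) / len(graded)
```

Grading unparseable responses as wrong, rather than dropping them, keeps the denominator fixed so that scores stay comparable across models.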

Five training-set sizes are released (XS through XL). Most evaluations score the validation set using the XL training prompt format.

Common pitfalls when reporting WinoGrande

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Two options means a 50% chance baseline. Always report the absolute lift over chance, not just raw accuracy.
Sentences are short and stylised. The benchmark is increasingly out-of-distribution for models trained on long-form internet text.
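The first pitfall is simple arithmetic, sketched below. The function names and the wording of the report string are illustrative choices, not a Benchlist convention.

```python
CHANCE = 0.5  # two options, so random guessing scores 50%


def lift_over_chance(accuracy: float) -> float:
    """Absolute lift over the chance baseline, as a fraction (0.4 means 40 points)."""
    return accuracy - CHANCE


def report(correct: int, total: int) -> str:
    """Format raw accuracy together with its lift over chance."""
    acc = correct / total
    lift = lift_over_chance(acc)
    return (
        f"{acc:.1%} accuracy "
        f"({lift * 100:+.1f} points over the {CHANCE:.0%} chance baseline)"
    )
```

For example, 45 correct out of 50 is 90% raw accuracy but only 40 points of lift, since the first 50 points are available to a coin flip.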

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results backed by a real signed transcript always rank above vendor-disclosed ones.

How to ship a WinoGrande score that nobody can challenge

Run WinoGrande on Benchlist

Benchlist runs the canonical WinoGrande sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
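To make the idea of a Merkle commitment over transcripts concrete, here is a minimal sketch using only the Python standard library. The hash function (SHA-256), the leaf and node prefixes, and the rule of duplicating the last node on odd-sized levels are all assumptions; the source does not specify how Benchlist builds its tree. The Ed25519 signature over the root is omitted because it needs a third-party library.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: list[str]) -> str:
    """Hex-encoded Merkle root over the transcripts, in the order given."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Prefix leaves with 0x00 and interior nodes with 0x01 so that a leaf
    # can never be mistaken for an interior node (assumed scheme).
    level = [_h(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Because every transcript feeds into the root, changing a single character in any one of them changes the commitment, which is what lets a third party detect tampering after the fact.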

Hosted runner. POST a job and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "winogrande",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
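
The same request can be sketched in Python with the standard library. The URL, headers, and payload fields are copied from the curl call above; the function names are hypothetical, and the shape of the response is not documented here, so the sketch returns the raw body as text.

```python
import json
import urllib.request

RUN_URL = "https://benchlist.ai/api/v1/run"  # endpoint copied from the curl example


def build_run_request(api_key: str, limit: int = 50) -> urllib.request.Request:
    """Build (but do not send) the POST request for a hosted WinoGrande run."""
    payload = {
        "service": "anthropic-claude",
        "model": "claude-sonnet-4.5",
        "benchmark": "winogrande",
        "runs": 1,
        "limit": limit,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        RUN_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(request: urllib.request.Request) -> str:
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(request) as resp:
        return resp.read().decode("utf-8")
```

Separating request construction from sending makes the payload easy to inspect or log before any job is submitted.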

Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run winogrande --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

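What is WinoGrande?

A large-scale, adversarial extension of the original Winograd Schema Challenge, introduced by Sakaguchi et al. (AI2, 2019). Each problem is a sentence with a blank and two candidate referents.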

How is WinoGrande scored?

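Each item is a two-option binary choice: the model picks the referent that fits the sentence. A response is scored as correct when it matches the gold option, either by its letter or by its option text.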

What's the biggest pitfall when reporting WinoGrande?

Two options means a 50% chance baseline. Always report the absolute lift over chance, not just raw accuracy.

How do I verify a published WinoGrande score?

Use Benchlist. Run via benchlist run winogrande or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can verify the signature in their browser.

What are the canonical decoding parameters for WinoGrande?

Per the catalog, WinoGrande runs at temperature 0.0 with max_tokens 16. Deviating without disclosure makes scores incomparable.
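
One way to make deviations visible is to compare a run's decoding settings against the canonical values before publishing. The sketch below is illustrative: the two values are the ones quoted above, while the function name and the dictionary shape are assumptions rather than part of any Benchlist API.

```python
# Canonical values quoted from the catalog entry above.
CANONICAL = {"temperature": 0.0, "max_tokens": 16}


def decoding_deviations(params: dict) -> dict:
    """Return {name: (expected, actual)} for each canonical setting that differs.

    A setting missing from `params` is reported with an actual value of None.
    """
    return {
        name: (expected, params.get(name))
        for name, expected in CANONICAL.items()
        if params.get(name) != expected
    }
```

An empty result means the run used the canonical settings; anything else is what should be disclosed next to the score.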