History

HellaSwag was introduced by Zellers et al. in "HellaSwag: Can a Machine Really Finish Your Sentence?" (2019). It contains 70k+ sentence-completion problems whose wrong endings were adversarially generated to fool early-2019 LMs.

Once a hard benchmark, HellaSwag is now near-saturated for frontier models (95%+). It remains useful as a smoke test and for comparing tiny models.

How HellaSwag is graded

Each item is a four-option multiple-choice question. The model picks the most likely continuation, and its answer is graded by letter match against the correct option.
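Letter-match grading can be sketched as follows. This is a minimal illustration, not Benchlist's actual grader: it assumes the prompt asks the model to answer with a single letter A to D, that gold labels are stored as indices 0 to 3, and that a completion with no leading letter counts as wrong. The helper names `extract_letter`, `grade`, and `accuracy` are hypothetical.

```python
import re
from typing import List, Optional

LETTERS = "ABCD"

# Assumption: the answer letter must lead the completion, optionally wrapped
# in parentheses, and be followed by whitespace, punctuation, or end of text.
_LEADING_LETTER = re.compile(r"^\(?([ABCD])\)?(?:[\s.:,)]|$)")


def extract_letter(completion: str) -> Optional[str]:
    """Return the leading answer letter, or None if the completion has none."""
    match = _LEADING_LETTER.match(completion.strip())
    return match.group(1) if match else None


def grade(completion: str, gold_index: int) -> bool:
    """True when the extracted letter equals the letter of the gold option."""
    return extract_letter(completion) == LETTERS[gold_index]


def accuracy(completions: List[str], gold_indices: List[int]) -> float:
    """Fraction of completions whose letter matches the gold option."""
    if len(completions) != len(gold_indices):
        raise ValueError("completions and gold_indices must have equal length")
    correct = sum(grade(c, g) for c, g in zip(completions, gold_indices))
    return correct / len(completions)
```

A strict leading-letter rule keeps the grader simple, but it will misread a completion that starts with the English article "A", so how the letter is extracted is itself worth disclosing alongside the score.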

Common pitfalls when reporting HellaSwag

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

  • Adversarial wrong answers can confuse small models. The original wrong endings were generated to fool 2019-era LMs. Modern models often prefer the wrong answer for stylistic reasons even when they 'know' the right one. This is a feature, not a bug, but it makes HellaSwag a bad raw-capability proxy.
  • Validation vs test. The test labels are not public, so most reports use the validation set. Always disclose which split a score comes from.

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.

How to ship a HellaSwag score that nobody can challenge

Run HellaSwag on Benchlist

Benchlist runs the canonical HellaSwag sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
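To make the idea of a Merkle commitment over transcripts concrete, here is a minimal sketch using only the standard library. The source does not specify how Benchlist builds its tree, so everything here is an assumption for illustration: SHA-256 as the hash, canonical JSON as the leaf encoding, one-byte domain-separation prefixes, and duplication of the last node on odd-sized levels. Signing the resulting root with Ed25519 would need a third-party library and is left out.

```python
import hashlib
import json
from typing import List


def leaf_hash(transcript: dict) -> bytes:
    """Hash one transcript. Assumed encoding: canonical JSON with sorted keys."""
    encoded = json.dumps(transcript, sort_keys=True, separators=(",", ":"))
    # The 0x00 prefix marks a leaf so it can never be mistaken for an inner node.
    return hashlib.sha256(b"\x00" + encoded.encode("utf-8")).digest()


def merkle_root(leaves: List[bytes]) -> bytes:
    """Fold leaf hashes pairwise into a single 32-byte root."""
    if not leaves:
        raise ValueError("need at least one transcript")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumed rule for odd levels: duplicate the last node.
            level.append(level[-1])
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def commit(transcripts: List[dict]) -> str:
    """Hex-encoded commitment over every transcript, in order."""
    return merkle_root([leaf_hash(t) for t in transcripts]).hex()
```

Because the root depends on every leaf, changing a single answer in a single transcript changes the commitment, which is what lets a third party detect edits made after the fact.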

Hosted runner. POST a job and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "hellaswag",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
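For scripting, the same request can be assembled in Python with the standard library. This sketch only mirrors the endpoint, headers, and fields of the curl call; the helper name `build_run_request` is hypothetical, and because the response format is not documented here, the code builds the request without parsing any reply.

```python
import json
import urllib.request


def build_run_request(api_key: str, limit: int = 50) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    payload = {
        "service": "anthropic-claude",
        "model": "claude-sonnet-4.5",
        "benchmark": "hellaswag",
        "runs": 1,
        "limit": limit,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        "https://benchlist.ai/api/v1/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the job, with the key typically read from the `BENCHLIST_KEY` environment variable used in the curl call.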

Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run hellaswag --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

What is HellaSwag?
A benchmark of 70k+ sentence-completion problems whose wrong endings were adversarially generated to fool early-2019 LMs. It was introduced by Zellers et al. in "HellaSwag: Can a Machine Really Finish Your Sentence?" (2019).
How is HellaSwag scored?
Each item is a four-option multiple-choice question. The model picks the most likely continuation, and its answer is graded by letter match against the correct option.
What's the biggest pitfall when reporting HellaSwag?
Adversarial wrong answers can confuse small models. The original wrong endings were generated to fool 2019-era LMs. Modern models often prefer the wrong answer for stylistic reasons even when they 'know' the right one. This is a feature, not a bug, but it makes HellaSwag a bad raw-capability proxy.
How do I verify a published HellaSwag score?
Use Benchlist. Run the benchmark via benchlist run hellaswag or POST /api/v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.
What are the canonical decoding parameters for HellaSwag?
Per the catalog, HellaSwag runs at temperature 0.0 with max_tokens 32. Deviating without disclosure makes scores incomparable.
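A small check like the one below can catch undisclosed deviations before a score is published. It is a sketch under stated assumptions: the canonical values are the two quoted above, while the function names and the layout of the disclosure line are hypothetical and not part of any Benchlist tool.

```python
from typing import Dict, Tuple

# Canonical decoding parameters as quoted from the catalog above.
CANONICAL = {"temperature": 0.0, "max_tokens": 32}


def decoding_deviations(used: Dict[str, float]) -> Dict[str, Tuple[float, object]]:
    """Map each deviating parameter to a (canonical, used) pair."""
    return {
        name: (expected, used.get(name))
        for name, expected in CANONICAL.items()
        if used.get(name) != expected
    }


def disclosure_line(accuracy: float, split: str, used: Dict[str, float]) -> str:
    """One-line summary stating the split and the decoding parameters used."""
    deviations = decoding_deviations(used)
    params = ", ".join(f"{name}={used.get(name)}" for name in CANONICAL)
    if deviations:
        note = "non-canonical: " + ", ".join(sorted(deviations))
    else:
        note = "canonical decoding"
    return f"HellaSwag {accuracy:.1%} ({split} split; {params}; {note})"
```

Printing the split and the parameters next to every number makes it obvious when two scores were not produced under the same conditions.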