History
Introduced by Zellers et al. in "HellaSwag: Can a Machine Really Finish Your Sentence?" (2019). The dataset contains 70k+ sentence-completion problems with adversarially generated wrong endings designed to fool early-2019 LMs.
Once a hard benchmark, HellaSwag is now near-saturated for frontier models (95%+). Useful as a smoke test and for tiny-model comparisons.
How HellaSwag is graded
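Each item is a four-option multiple-choice question: the model picks the most likely continuation of a short context. Grading is by letter match, so the letter the model answers with is compared against the gold label.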
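As a rough illustration of letter-match grading, the sketch below renders an item as an A-D prompt, pulls a letter out of the model's reply, and computes accuracy. The prompt wording, the `extract_letter` heuristic (only a leading A-D counts), and the 0-based gold index are assumptions made for this example, not Benchlist's actual grader.

```python
import re
from typing import Optional

LETTERS = "ABCD"

def format_prompt(context: str, endings: list[str]) -> str:
    """Render one item as a four-option multiple-choice prompt (hypothetical wording)."""
    lines = [context, ""]
    for letter, ending in zip(LETTERS, endings):
        lines.append(f"{letter}. {ending}")
    lines += ["", "Answer with a single letter (A, B, C, or D)."]
    return "\n".join(lines)

def extract_letter(response: str) -> Optional[str]:
    """Return a leading A-D (optionally parenthesised), or None.

    Assumption: replies that do not start with the letter are treated as
    unparseable and therefore scored as incorrect.
    """
    match = re.match(r"\s*\(?([ABCD])\)?(?![A-Za-z])", response)
    return match.group(1) if match else None

def accuracy(responses: list[str], gold: list[int]) -> float:
    """Fraction of responses whose letter matches the gold ending index (0-3)."""
    if not gold or len(responses) != len(gold):
        raise ValueError("responses and gold must be non-empty and the same length")
    correct = sum(extract_letter(r) == LETTERS[g] for r, g in zip(responses, gold))
    return correct / len(gold)
```

A stricter or looser extraction rule changes the score, which is one reason the grading method belongs next to any reported number.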
Common pitfalls when reporting HellaSwag
The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:
- Adversarial wrong answers can confuse small models. The original wrong endings were generated to fool 2019-era LMs. Modern models often prefer the wrong answer for stylistic reasons even when they 'know' the right one. This is a feature, not a bug, but it makes HellaSwag a bad raw-capability proxy.
- Validation vs test. The test labels are not public, so most reports use the validation set. Always disclose which split a score was computed on.
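One way to make the disclosure hard to forget is to carry the split and item count in the same record as the score. The sketch below is a hypothetical report structure; the `ScoreReport` name and its fields are illustrative and are not part of any Benchlist schema.

```python
import json
from dataclasses import asdict, dataclass

VALID_SPLITS = {"validation", "test"}

@dataclass(frozen=True)
class ScoreReport:
    """Hypothetical record that keeps the score next to how it was produced."""
    model: str
    split: str      # which split the items came from
    n_items: int    # how many items were actually scored
    n_correct: int
    grading: str    # e.g. "letter-match"

    def __post_init__(self) -> None:
        if self.split not in VALID_SPLITS:
            raise ValueError(f"split must be one of {sorted(VALID_SPLITS)}")
        if self.n_items <= 0 or not 0 <= self.n_correct <= self.n_items:
            raise ValueError("need n_items > 0 and 0 <= n_correct <= n_items")

    @property
    def accuracy(self) -> float:
        return self.n_correct / self.n_items

    def to_json(self) -> str:
        """Serialise the record, including the derived accuracy."""
        record = asdict(self)
        record["accuracy"] = round(self.accuracy, 4)
        return json.dumps(record, sort_keys=True)
```

For example, `ScoreReport("example-model", "validation", 50, 47, "letter-match").to_json()` yields a record in which the split and the 50-item sample size travel with the score.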
Live Benchlist leaderboard
Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.
How to ship a HellaSwag score that nobody can challenge
Run HellaSwag on Benchlist
Benchlist runs the canonical HellaSwag sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
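To give a feel for what a Merkle commitment over transcripts means, here is a minimal sketch that hashes each transcript into a leaf and folds the leaves into a single root. The hash function (SHA-256), the 0x00/0x01 domain-separation prefixes, and the rule of promoting an unpaired node are assumptions for illustration; the page does not specify how Benchlist builds its tree. Ed25519 signing of the root is left out because it needs a third-party library.

```python
import hashlib

def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(transcripts: list[bytes]) -> bytes:
    """Fold transcript bytes into one 32-byte root (illustrative construction)."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Prefix leaves with 0x00 and interior nodes with 0x01 so that a leaf
    # can never be reinterpreted as an interior node.
    level = [_sha256(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(_sha256(b"\x01" + level[i] + level[i + 1]))
        if len(level) % 2 == 1:
            # An unpaired node is promoted unchanged to the next level.
            next_level.append(level[-1])
        level = next_level
    return level[0]
```

Because every transcript feeds into the root, editing any single transcript after the fact changes the root, which is the value that would then be signed.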
Hosted runner. POST a job and we email the verify URL when it's done:
curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "hellaswag",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
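The same job can be submitted from Python with only the standard library. This sketch mirrors the curl call field for field; the helper names `build_run_request` and `submit` are made up for the example, and the shape of the response is not documented here, so it is returned as raw text.

```python
import json
import urllib.request

RUN_URL = "https://benchlist.ai/api/v1/run"

def build_run_request(api_key: str, limit: int = 50) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "service": "anthropic-claude",
        "model": "claude-sonnet-4.5",
        "benchmark": "hellaswag",
        "runs": 1,
        "limit": limit,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        RUN_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def submit(request: urllib.request.Request) -> str:
    """Send the request and return the response body as text.

    Assumption: the body is UTF-8 text; its structure is not specified here.
    """
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

A typical call would read the key from the environment, for example `submit(build_run_request(os.environ["BENCHLIST_KEY"]))` after `import os`.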
Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:
pip install benchlist-runner
benchlist run hellaswag --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json
FAQ
What is HellaSwag?
How is HellaSwag scored?
What's the biggest pitfall when reporting HellaSwag?
How do I verify a published HellaSwag score?
Run benchlist run hellaswag or POST to /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.