History
GSM8K was released by OpenAI in "Training Verifiers to Solve Math Word Problems" (Cobbe et al., 2021). It contains 8.5K linguistically diverse grade-school math problems written by human contractors, each with a step-by-step solution and a final numeric answer marked with ####.
GSM8K became the canonical 'does this model do basic math' test for the next four years. By 2025 frontier models routinely scored 95%+. The benchmark is now considered effectively saturated and is used mainly as a smoke test, with MATH-500 and AIME used to differentiate frontier capability.
How GSM8K is graded
Standard grading: extract the last number from the model's output (after stripping commas) and compare it to the canonical answer to within 1e-3.
Two scoring conventions exist: a strict numeric match (used in the original paper) and a tolerant text match that accepts "the answer is 12" alongside "12". Most modern reports use the strict convention.
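The standard rule described above (take the last number in the output, strip commas, compare within 1e-3) can be sketched in a few lines of Python. This is an illustrative sketch, not the grader used by any particular harness: the function names `gold_answer`, `last_number`, and `is_correct` are hypothetical, and the number pattern is an assumption that ignores fractions, units, and scientific notation.

```python
import re

# Assumed pattern: an optionally signed integer or decimal such as "17" or "-3.5".
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def gold_answer(solution: str) -> float:
    """Return the number that follows the final '####' marker of a reference solution."""
    tail = solution.rsplit("####", 1)[1]  # raises IndexError if the marker is missing
    return float(tail.replace(",", "").strip())


def last_number(text: str):
    """Return the last number in the text after stripping commas, or None if there is none."""
    matches = _NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None


def is_correct(output: str, solution: str, tol: float = 1e-3) -> bool:
    """Grade one model output against one reference solution."""
    predicted = last_number(output)
    return predicted is not None and abs(predicted - gold_answer(solution)) <= tol


if __name__ == "__main__":
    reference = "5 + 7 = 12\n#### 12"
    print(is_correct("The answer is 12.", reference))
    print(is_correct("we have 12 apples plus 5 oranges totalling 17", reference))
```

The second call shows the limitation of this rule: the extractor always takes the final number in the text, whether or not the model meant it as the answer.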
Tool use changes everything. Calculator-augmented models routinely add 8–15 percentage points over the same model without tools. Always disclose whether tool use was enabled.
Common pitfalls when reporting GSM8K
The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:
- Heavy contamination. GSM8K problems and solutions have appeared verbatim in countless online tutorials since 2021, and most modern training mixes contain them. What signal remains is mostly differential: comparing small models on cold prompts.
- Chain-of-thought is doing the work. The 95%+ scores from frontier models are obtained with multi-step reasoning enabled. Without a scratchpad, the same models drop 20+ percentage points. Disclose.
- Last-number regex is brittle. A model that writes 'we have 12 apples plus 5 oranges totalling 17' is graded on 17, even when the question asked only for the apples (12). A last-number regex cannot tell an intermediate total from the intended final answer.
- Saturation hides the real gap. GSM8K differentiates 60% models from 90% models. It does not differentiate 95% models from 97% models; that gap is below the noise floor of the regex grader.
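Grader error is one source of noise; sampling error is another, and it is easy to estimate. The sketch below uses the usual binomial approximation, treating each problem as an independent pass/fail trial. The function names are hypothetical, and the independence assumption is a simplification.

```python
import math


def accuracy_standard_error(accuracy: float, n_problems: int) -> float:
    """Binomial standard error of an accuracy measured on n_problems independent problems."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


def gap_in_standard_errors(acc_a: float, acc_b: float, n_problems: int) -> float:
    """Size of the gap between two independent runs, in units of its combined standard error."""
    combined = math.sqrt(
        accuracy_standard_error(acc_a, n_problems) ** 2
        + accuracy_standard_error(acc_b, n_problems) ** 2
    )
    return abs(acc_a - acc_b) / combined


if __name__ == "__main__":
    # A 95% score measured on a 50-problem subset.
    print(f"{accuracy_standard_error(0.95, 50):.3f}")
    # How large a 95% vs 97% gap is relative to sampling noise on that subset.
    print(f"{gap_in_standard_errors(0.95, 0.97, 50):.2f}")
```

On a 50-problem subset, a 95% score carries a standard error of roughly 3 percentage points, so a 2-point gap sits well inside sampling noise before any grader error is counted.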
Live Benchlist leaderboard
Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.
How to ship a GSM8K score that nobody can challenge
Run GSM8K on Benchlist
Benchlist runs the canonical GSM8K sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
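To make the idea of a Merkle commitment over transcripts concrete, here is a minimal sketch. It is not Benchlist's actual construction, which this page does not specify: the choice of SHA-256, the leaf and node prefix bytes, the handling of an unpaired node, and the function name `merkle_root` are all assumptions for illustration.

```python
import hashlib
from typing import Sequence


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: Sequence[str]) -> str:
    """Return a hex Merkle root committing to every transcript, in order."""
    if not transcripts:
        raise ValueError("at least one transcript is required")
    # Assumed domain separation: 0x00 prefixes leaves, 0x01 prefixes inner nodes.
    level = [_sha256(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        # Hash adjacent pairs; an unpaired last node is carried up unchanged.
        parents = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            parents.append(level[-1])
        level = parents
    return level[0].hex()


if __name__ == "__main__":
    run = ["Q1 ... #### 12", "Q2 ... #### 7", "Q3 ... #### 40"]
    print(merkle_root(run))
```

Changing any single transcript changes the root, which is why signing the root alone is enough to commit to the whole run.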
Hosted runner. POST a job and we email you the verify URL when it's done:
curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "gsm8k",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:
pip install benchlist-runner
benchlist run gsm8k --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json
FAQ
What is GSM8K?
How is GSM8K scored?
What's the biggest pitfall when reporting GSM8K?
How do I verify a published GSM8K score?
Run benchlist run gsm8k or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature in their browser.