History
Released by Shi et al. in "Language Models are Multilingual Chain-of-Thought Reasoners" (2022). MGSM consists of 250 problems from GSM8K, each translated into 10 typologically diverse languages, alongside the English originals.
MGSM measures whether a model's math reasoning transfers across languages. Frontier models maintain ~80–90% of their English GSM8K score across most languages; older or smaller models can drop 30–50pp on lower-resource languages.
How MGSM is graded
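MGSM uses the same numeric exact-match grading as GSM8K: the model's final numeric answer must equal the reference answer. Scores are reported per language and as a multilingual average.
There are two prompting modes: native-language CoT (the model reasons in the question's language) and English-CoT (the model reasons in English regardless of the question language). Native-language CoT is usually harder and is the more faithful test of multilingual reasoning, because the model cannot fall back on English.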
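The grading described above can be sketched in a few lines of Python. This is a minimal illustration, not the official MGSM or Benchlist harness: it assumes the final answer is the last number in the response, and the names extract_final_number, is_correct, and score are hypothetical.

```python
import re
from collections import defaultdict
from typing import Optional

# Optionally signed integers or decimals, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(response: str) -> Optional[str]:
    """Return the last number in the text in normalised form, or None."""
    matches = _NUMBER.findall(response)
    if not matches:
        return None
    value = float(matches[-1].replace(",", ""))
    # Normalise so that "18", "18.0" and "1,234" style answers compare equal.
    return str(int(value)) if value.is_integer() else str(value)


def is_correct(response: str, target) -> bool:
    """Exact match between the extracted answer and the reference answer."""
    predicted = extract_final_number(response)
    return predicted is not None and predicted == extract_final_number(str(target))


def score(records):
    """records: iterable of (language, response, target) tuples.

    Returns (per_language_accuracy, multilingual_average), where the average
    is the unweighted mean of the per-language accuracies.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for language, response, target in records:
        totals[language] += 1
        hits[language] += is_correct(response, target)
    per_language = {lang: hits[lang] / totals[lang] for lang in totals}
    average = sum(per_language.values()) / len(per_language)
    return per_language, average
```

Because every MGSM language has the same number of problems, the unweighted mean over languages equals the pooled accuracy. Real harnesses differ in how they locate the final answer, which is one reason two reports of the "same" score can disagree.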
Common pitfalls when reporting MGSM
The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:
- Uneven language coverage. MGSM has only 250 problems per language versus 1,319 in the GSM8K test set, so per-language confidence intervals are wider.
- Translation quality varies. Some problems were re-checked by native speakers, others not. Per-language scores carry different noise.
- English-CoT vs native-CoT. Many reports use English-CoT and call it 'multilingual'. Read the methodology.
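To make the sample-size point concrete, the sketch below computes a 95% Wilson score interval for the same accuracy at both sample sizes. The 80% accuracy is an illustrative value, not a measured result, and wilson_interval is a helper written for this example.

```python
import math


def wilson_interval(accuracy: float, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion (z = 1.96)."""
    denom = 1 + z * z / n
    centre = (accuracy + z * z / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Same illustrative accuracy, two sample sizes.
low_250, high_250 = wilson_interval(0.80, 250)      # one MGSM language
low_1319, high_1319 = wilson_interval(0.80, 1319)   # GSM8K test set

print(f"n=250:  {low_250:.3f} to {high_250:.3f}")
print(f"n=1319: {low_1319:.3f} to {high_1319:.3f}")
```

At 80% accuracy the interval is roughly ±5 points for 250 problems and roughly ±2 points for 1,319, so a per-language difference of a few points between two models may not be meaningful on its own.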
Live Benchlist leaderboard
Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results backed by a real signed transcript always rank above vendor-disclosed ones.
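The ordering rule can be expressed as a two-part sort key. This is only a sketch: the schema of /api/runs.json is not documented here, so the "attested" and "score" fields and the rank_runs helper are assumptions made for illustration.

```python
def rank_runs(runs):
    """Order runs so attested results come first, then by score descending.

    Each run is assumed (hypothetically) to be a dict with a boolean
    "attested" field and a numeric "score" field.
    """
    return sorted(runs, key=lambda run: (not run["attested"], -run["score"]))


example = [
    {"model": "model-a", "score": 91.0, "attested": False},
    {"model": "model-b", "score": 88.5, "attested": True},
    {"model": "model-c", "score": 90.2, "attested": True},
]
ranked = rank_runs(example)
```

With this key, an attested run with a lower score still ranks above a self-reported run with a higher one.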
How to ship an MGSM score that nobody can challenge
Run MGSM on Benchlist
Benchlist runs the canonical MGSM sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
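For readers unfamiliar with Merkle commitments, the sketch below shows the general idea: hash every transcript, then hash pairs upward until a single root remains. It is not Benchlist's actual construction, which is not specified here. SHA-256, the leaf and node prefixes, and duplicating the last node on odd levels are all assumptions, and merkle_root is a hypothetical helper. Signing the root with Ed25519 needs a third-party library and is left out.

```python
import hashlib
from typing import List


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: List[str]) -> str:
    """Return a hex Merkle root committing to every transcript, in order."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Prefix leaves and interior nodes differently so a leaf can never be
    # confused with an interior node (assumed convention).
    level = [_sha256(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


root = merkle_root(["transcript 1", "transcript 2", "transcript 3"])
```

Changing a single character in any transcript changes the root, which is why one signature over the root is enough to commit to the whole run.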
Hosted runner: POST a job, and we email the verify URL when it's done.
curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "mgsm",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
Self-hosted: install benchlist-runner via pip, point it at your inference key, and get a signed run.json.
pip install benchlist-runner
benchlist run mgsm --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json
FAQ
How do I verify a published MGSM score?
Run benchlist run mgsm or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can replay the signature check in their browser.