MGSM: methodology, history, and how to verify a published score

Q: What is MGSM?

MGSM (Multilingual Grade School Math) was released by Shi et al. in "Language Models are Multilingual Chain-of-Thought Reasoners" (2022). It consists of 250 problems from GSM8K translated into 10 typologically diverse languages, plus the English originals.

Q: What's the biggest pitfall when reporting MGSM?

Uneven language coverage. MGSM has only 250 problems per language vs GSM8K's 1,319, so confidence intervals are wider.

Q: How do I verify a published MGSM score?

Use Benchlist. Run the benchmark with benchlist run mgsm or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can verify the signature in their browser.

History

MGSM was released by Shi et al. in "Language Models are Multilingual Chain-of-Thought Reasoners" (2022). It consists of 250 problems from GSM8K translated into 10 typologically diverse languages, plus the English originals.

MGSM measures whether a model's math reasoning transfers across languages. Frontier models maintain ~80–90% of their English GSM8K score across most languages; older or smaller models can drop 30–50pp on lower-resource languages.

How MGSM is graded

Same numeric exact-match grading as GSM8K. Score is reported per-language and as a multilingual average.
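As a rough illustration of numeric exact-match grading with per-language and averaged scores, here is a minimal sketch. The extraction rule (take the last number in the model output) and the function names are assumptions made for this example; a given harness may extract answers differently.

```python
import re

# Assumption: the final answer is the last number that appears in the output.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number in the text as a float, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    # Drop thousands separators such as the comma in "1,234".
    return float(matches[-1].replace(",", ""))


def is_correct(model_output, gold_answer):
    """Numeric exact match: the extracted number must equal the gold answer."""
    predicted = extract_final_number(model_output)
    return predicted is not None and predicted == float(gold_answer)


def score(results):
    """results maps a language code to a list of (model_output, gold_answer) pairs.

    Returns (per_language_accuracy, unweighted_average_over_languages).
    """
    per_language = {}
    for lang, pairs in results.items():
        hits = sum(is_correct(out, gold) for out, gold in pairs)
        per_language[lang] = hits / len(pairs)
    average = sum(per_language.values()) / len(per_language)
    return per_language, average
```

Because every language has the same number of problems, the unweighted mean over languages equals the pooled accuracy when all languages are run in full.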

Two prompting modes are in common use: native-language CoT (the model reasons in the target language) and English-CoT (the model reasons in English regardless of the question language). Native-language CoT is harder and is the more honest test of multilingual reasoning.

Common pitfalls when reporting MGSM

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Uneven language coverage. MGSM has only 250 problems per language vs GSM8K's 1,319, so confidence intervals are wider.
Translation quality varies. Some problems were re-checked by native speakers, others were not. Per-language scores carry different noise.
English-CoT vs native-CoT. Many reports use English-CoT and call it 'multilingual'. Read the methodology.
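To make the sample-size pitfall concrete, the sketch below estimates the 95% confidence half-width for an accuracy score using the normal approximation to the binomial. The function name and the choice of approximation are assumptions for illustration; a Wilson or bootstrap interval would be a reasonable alternative.

```python
import math


def half_width(accuracy, n, z=1.96):
    """Approximate 95% confidence half-width for an accuracy measured on n items."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# A model scoring 80% on one MGSM language (250 problems):
mgsm = half_width(0.80, 250)     # about 0.050, i.e. roughly +/- 5.0 points
# The same accuracy on the GSM8K test set (1,319 problems):
gsm8k = half_width(0.80, 1319)   # about 0.022, i.e. roughly +/- 2.2 points

print(f"MGSM per-language: +/- {mgsm * 100:.1f} pp")
print(f"GSM8K:             +/- {gsm8k * 100:.1f} pp")
```

Under this approximation, a gap of a few points between two models on a single MGSM language is within the noise.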

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.

How to ship an MGSM score that nobody can challenge

Run MGSM on Benchlist

Benchlist runs the canonical MGSM sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
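The idea of a Merkle commitment over transcripts can be sketched as follows. This is not Benchlist's actual construction, which the text above does not specify: the use of SHA-256, the leaf and node prefixes, and the duplicate-last-node rule for odd levels are all assumptions chosen for illustration.

```python
import hashlib


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts):
    """Return a hex Merkle root committing to an ordered list of transcript strings."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Assumption: leaves and inner nodes get distinct prefixes (0x00 / 0x01)
    # so that a leaf hash can never be confused with an inner-node hash.
    level = [_sha256(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: an odd level is padded by repeating its last node.
            level.append(level[-1])
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


root = merkle_root(["transcript 1", "transcript 2", "transcript 3"])
tampered = merkle_root(["transcript 1", "transcript 2 (edited)", "transcript 3"])
print(root != tampered)  # any edit to any transcript changes the root
```

Signing only the root is then enough to bind the signature to every transcript, since changing any one of them changes the root.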

Hosted runner. POST a job, and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "mgsm",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'

Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run mgsm --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

How is MGSM scored?

Same numeric exact-match grading as GSM8K. Score is reported per-language and as a multilingual average.

What are the canonical decoding parameters for MGSM?

Per the catalog, MGSM runs at temperature 0.0 with max_tokens 512. Deviating without disclosure makes scores incomparable.
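A simple way to catch undisclosed deviations is to compare a run's decoding settings against the canonical values before comparing scores. The sketch below assumes the settings are available as a dictionary with the keys temperature and max_tokens; that layout is hypothetical and is not taken from the run.json format.

```python
# Canonical MGSM decoding parameters as stated in the text above.
CANONICAL = {"temperature": 0.0, "max_tokens": 512}


def decoding_deviations(run_config):
    """Return {param: (actual, canonical)} for every setting that differs.

    run_config is a hypothetical dict of decoding settings; a missing key
    is reported as a deviation with an actual value of None.
    """
    return {
        key: (run_config.get(key), expected)
        for key, expected in CANONICAL.items()
        if run_config.get(key) != expected
    }


print(decoding_deviations({"temperature": 0.0, "max_tokens": 512}))
# {}
print(decoding_deviations({"temperature": 0.7, "max_tokens": 512}))
# {'temperature': (0.7, 0.0)}
```

An empty result means the two runs used the same decoding settings on these two parameters, so their scores are comparable on that axis.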