CommonsenseQA: methodology, history, and how to verify a published score

Q: What is CommonsenseQA?

CommonsenseQA is a five-option multiple-choice benchmark released by Talmor et al. in 2018 in the paper "CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge". It is built on top of ConceptNet: the questions probe relations between common concepts ("where would you find a chimney?"), and all five candidate answers are crafted to be plausible.

Q: What's the biggest pitfall when reporting CommonsenseQA?

ConceptNet contamination. Both questions and answer keys live in public corpora. Models trained after 2019 have likely seen them.

Q: How do I verify a published CommonsenseQA score?

Use Benchlist. Run the benchmark with benchlist run commonsenseqa or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can verify the signature in their browser.

History

CommonsenseQA was released by Talmor et al. in 2018 as "CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge". The dataset is built on top of ConceptNet: its questions probe relations between common concepts ("where would you find a chimney?"), with five candidate answers crafted to all be plausible.

Frontier models score 85% or higher on CommonsenseQA, so the benchmark is now useful mostly for differentiating small open-weight models: a 7B model scoring 75% says something different from a 70B model scoring 88%.

How CommonsenseQA is graded

Five-option multiple-choice (A–E). Letter-match grading on the validation set (the test set has hidden labels).

Each question is built around a single concept from ConceptNet, and its four distractors are all related to that same concept, so surface-level lexical overlap doesn't help much.
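The letter-match rule above can be sketched in a few lines of Python. This is a minimal illustration, not Benchlist's grader: the function names are hypothetical, and the extraction rule (take the first standalone uppercase letter A to E in the completion) is an assumption, since the source does not say how a letter is pulled out of free-form output.

```python
import re
from typing import Optional, Sequence

# Assumption: the answer is the first uppercase A-E that is not part of a word.
_LETTER = re.compile(r"(?<![A-Za-z])([A-E])(?![A-Za-z])")


def extract_letter(completion: str) -> Optional[str]:
    """Return the first standalone A-E letter in the completion, or None."""
    match = _LETTER.search(completion)
    return match.group(1) if match else None


def letter_match_accuracy(completions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of completions whose extracted letter equals the gold letter."""
    if not gold or len(completions) != len(gold):
        raise ValueError("completions and gold must be non-empty and equal length")
    # A completion with no extractable letter yields None and counts as wrong.
    correct = sum(extract_letter(c) == g for c, g in zip(completions, gold))
    return correct / len(gold)
```

A completion such as "Answer: B" or "(C) roof" yields a letter, while a bare word such as "roof" yields nothing and is scored as incorrect. A stricter or looser extraction rule changes the score, which is one reason to publish the rule alongside the number.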

Common pitfalls when reporting CommonsenseQA

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

ConceptNet contamination. Both questions and answer keys live in public corpora. Models trained after 2019 have likely seen them.
20% baseline floor. Random guessing gets 20%. Always show absolute lift over baseline.
Distractors look correct. The wrong answers are deliberately plausible, and small models often pick a plausible-but-wrong choice for syntactic reasons. Read the transcripts before declaring a reasoning gap.
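To make the baseline point concrete, here is a small sketch that reports a score together with its absolute lift over random guessing. The helper names are hypothetical, and the 20% floor follows from five equally likely options.

```python
RANDOM_BASELINE = 1 / 5  # five options, uniform random guessing


def absolute_lift(accuracy: float, baseline: float = RANDOM_BASELINE) -> float:
    """Percentage points of accuracy above the random-guessing floor."""
    return (accuracy - baseline) * 100


def report(model: str, accuracy: float) -> str:
    """Format a score so the lift over chance is always visible."""
    lift = absolute_lift(accuracy)
    return f"{model}: {accuracy:.1%} ({lift:+.1f} pts over the 20% random baseline)"


print(report("7B model", 0.75))
print(report("70B model", 0.88))
```

Reported this way, the 75% and 88% figures read as 55 and 68 points above chance, so the gap between them is measured against what a coin-flipping model would already get.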

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.

How to ship a CommonsenseQA score that nobody can challenge

Run CommonsenseQA on Benchlist

Benchlist runs the canonical CommonsenseQA sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
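For readers unfamiliar with the term, a Merkle commitment over transcripts can be sketched as below. Everything specific here is an assumption for illustration: the source does not say which hash function, leaf encoding, or odd-node rule Benchlist uses, so this sketch picks SHA-256, one-byte prefixes to keep leaf and inner hashes distinct, and duplication of the last node on odd levels. The Ed25519 signing step is left out because it needs a third-party library.

```python
import hashlib
from typing import Sequence


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: Sequence[str]) -> str:
    """Hex Merkle root over the transcripts, in the order given."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Leaves are hashed with a 0x00 prefix, inner nodes with 0x01 (assumed scheme).
    level = [_sha256(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed rule: duplicate the last node
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

The property that matters for verification is that changing any single transcript changes the root, so a signature over the root covers every transcript without having to sign each one.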

Hosted runner: POST a job and we email the verify URL when it's done.

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "commonsenseqa",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
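
The same request can be built from Python using only the standard library. This is a sketch that mirrors the curl call above field for field; the function names are hypothetical, and the shape of the response body is not documented in this page, so the sketch returns it as raw text rather than parsing it.

```python
import json
import urllib.request

RUN_URL = "https://benchlist.ai/api/v1/run"  # endpoint taken from the curl example


def build_run_request(api_key: str, limit: int = 50) -> urllib.request.Request:
    """Build (but do not send) the POST request for a CommonsenseQA run."""
    payload = {
        "service": "anthropic-claude",
        "model": "claude-sonnet-4.5",
        "benchmark": "commonsenseqa",
        "runs": 1,
        "limit": limit,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        RUN_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(request: urllib.request.Request) -> str:
    """Send the request and return the raw response body (format not assumed)."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling submit(build_run_request(os.environ["BENCHLIST_KEY"])) after importing os would send the job. Building and sending are kept separate so the payload can be inspected or logged before any network call is made.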

Self-hosted: install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run commonsenseqa --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

How is CommonsenseQA scored?

Five-option multiple-choice (A–E). Letter-match grading on the validation set (the test set has hidden labels).

What are the canonical decoding parameters for CommonsenseQA?

Per the catalog, CommonsenseQA runs at temperature 0.0 with max_tokens 16. Deviating without disclosure makes scores incomparable.
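
One way to make such deviations visible is to compare a run's decoding settings against the canonical ones before publishing. The sketch below is illustrative only: the function name and the dictionary layout are assumptions, and the two canonical values are the ones stated above.

```python
from typing import Any, Dict, Tuple

# Canonical values as stated in the catalog entry quoted above.
CANONICAL_DECODING = {"temperature": 0.0, "max_tokens": 16}


def decoding_deviations(params: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each deviating parameter to (canonical value, value actually used).

    A parameter missing from params is reported with an actual value of None.
    """
    return {
        name: (expected, params.get(name))
        for name, expected in CANONICAL_DECODING.items()
        if params.get(name) != expected
    }
```

An empty result means the run used the canonical settings; anything else is what should be disclosed next to the score.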