History

TruthfulQA was released by Lin et al. (Oxford) in 2021. Its questions probe whether a model repeats popular human misconceptions (for example, 'do duck quacks echo?') or gives the factually correct answer.

The multiple-choice task has two scoring modes: MC1 (a single correct answer among 4–6 choices) and MC2 (several true answers among the choices). MC1 is the more commonly reported of the two.

How TruthfulQA MC1 is graded

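MC1: the model must pick the single most truthful answer, and the score is the fraction of questions it gets right. MC2: several answers are true, and the score is the share of probability mass the model assigns to the true answers.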
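The two metrics can be sketched from per-choice log-probabilities. This is a minimal illustration, not Benchlist's grader: the function names and the sample log-probabilities are hypothetical, and the example reuses one choice list for both metrics purely for brevity.

```python
import math


def mc1_score(logprobs, correct_index):
    """Return 1.0 if the highest-scoring choice is the correct one, else 0.0."""
    best = max(range(len(logprobs)), key=lambda i: logprobs[i])
    return 1.0 if best == correct_index else 0.0


def mc2_score(logprobs, is_true):
    """Return the share of probability mass assigned to the true answers."""
    # Subtract the max before exponentiating for numerical stability;
    # the shift cancels out in the ratio below.
    shift = max(logprobs)
    probs = [math.exp(lp - shift) for lp in logprobs]
    true_mass = sum(p for p, t in zip(probs, is_true) if t)
    return true_mass / sum(probs)


# Hypothetical log-probabilities for one question with four choices.
logprobs = [-1.2, -0.4, -2.5, -3.0]
print(mc1_score(logprobs, correct_index=1))
print(mc2_score(logprobs, is_true=[True, True, False, False]))
```

The benchmark score is the mean of these per-question values over the whole sample set.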

There's also a generative version where the model writes free text and a judge model grades it for truthfulness; this introduces judge bias.

Common pitfalls when reporting TruthfulQA MC1

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

  • Truthfulness is socially loaded. Some 'correct' answers are politically or culturally contested. The benchmark reflects the labellers' views as much as objective truth.
  • Calibration matters. A model that hedges ("some studies suggest...") gets penalised by both MC1 and MC2 even when its hedging is appropriate.

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.
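The ranking rule can be sketched as a two-part sort key. The Run fields below, including the attested flag, are hypothetical and are not taken from the actual /api/runs.json schema.

```python
from dataclasses import dataclass


@dataclass
class Run:
    model: str
    score: float
    attested: bool  # hypothetical flag: True if backed by a signed transcript


def rank_runs(runs):
    """Sort attested runs first, then by descending score within each group."""
    return sorted(runs, key=lambda r: (not r.attested, -r.score))


runs = [
    Run("model-a", 0.71, attested=False),
    Run("model-b", 0.64, attested=True),
    Run("model-c", 0.68, attested=True),
]
for run in rank_runs(runs):
    print(run.model, run.score, "attested" if run.attested else "self-reported")
```

Under this rule a self-reported run never outranks an attested one, even with a higher score.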


How to ship a TruthfulQA MC1 score that nobody can challenge

Run TruthfulQA MC1 on Benchlist

Benchlist runs the canonical TruthfulQA MC1 sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
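To make the Merkle commitment concrete, here is a minimal sketch of a Merkle root over a list of transcripts. The source does not specify Benchlist's tree construction, so the hash function (SHA-256), the leaf and node prefix bytes, and the handling of odd levels are all assumptions. The Ed25519 signing step is omitted because it needs a third-party library.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts):
    """Compute a SHA-256 Merkle root over a list of transcript byte strings."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Assumption: prefix leaves with 0x00 and internal nodes with 0x01 so a
    # leaf hash can never be confused with an internal node hash.
    level = [_h(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        nxt = [_h(b"\x01" + level[i] + level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # promote the unpaired node unchanged
        level = nxt
    return level[0]


# Hypothetical transcripts; a real run would hash the captured model outputs.
transcripts = [b"Q1 -> B", b"Q2 -> A", b"Q3 -> D"]
print(merkle_root(transcripts).hex())
```

Because every transcript feeds into the root, changing any single answer changes the commitment, which is what makes a signature over the root binding on the full run.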

Hosted runner (POST a job and we email the verify URL when it's done):

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "truthfulqa",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'

Self-hosted (install benchlist-runner via pip, point it at your inference key, and get a signed run.json):

pip install benchlist-runner
benchlist run truthfulqa --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

What is TruthfulQA MC1?
A multiple-choice truthfulness benchmark released by Lin et al. (Oxford) in 2021. Its questions probe whether a model repeats popular human misconceptions (for example, 'do duck quacks echo?') or gives the factually correct answer.
How is TruthfulQA MC1 scored?
MC1: pick the single most truthful answer. MC2: select all true answers (scored as a probability mass).
What's the biggest pitfall when reporting TruthfulQA MC1?
Truthfulness is socially loaded. Some 'correct' answers are politically or culturally contested. The benchmark reflects the labellers' views as much as objective truth.
How do I verify a published TruthfulQA MC1 score?
Use Benchlist. Run via benchlist run truthfulqa or POST /api/v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can verify the signature in their browser.
What are the canonical decoding parameters for TruthfulQA MC1?
Per the catalog, TruthfulQA MC1 runs at temperature 0.0 with max_tokens 32. Deviating without disclosure makes scores incomparable.
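A simple pre-publication check can flag undisclosed deviations from these parameters. This is a sketch only: the parameter names mirror the ones quoted above, but how a run records its decoding settings is an assumption, and decoding_deviations is a hypothetical helper, not part of benchlist-runner.

```python
# Canonical values as stated in the catalog entry quoted above.
CANONICAL = {"temperature": 0.0, "max_tokens": 32}


def decoding_deviations(params):
    """Return {name: (expected, actual)} for every parameter that differs."""
    return {
        name: (expected, params.get(name))
        for name, expected in CANONICAL.items()
        if params.get(name) != expected
    }


# Hypothetical decoding settings for two runs.
print(decoding_deviations({"temperature": 0.0, "max_tokens": 32}))
print(decoding_deviations({"temperature": 0.7, "max_tokens": 32}))
```

An empty result means the run used the canonical settings; anything else should be disclosed alongside the score.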