History

FrontierMath was introduced by Epoch AI in 2024 as a private collection of research-level math problems curated by working mathematicians. The set is hidden; only Epoch evaluators see solutions. Even frontier-scale models typically score in the 1–15% range.

FrontierMath is the field's best hedge against contamination: because the problems are kept private, they can't be memorised. Labs releasing new flagship models consult the benchmark as an 'are we genuinely beyond the prior frontier' check.

How Frontier Math is graded

Each problem has a single numerical or short symbolic answer. Grading is exact-match, with reasonable normalisation applied before comparison.
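
To make "exact match with normalisation" concrete, here is a minimal sketch of what such a comparison could look like. It is not Epoch's grader: the function names and the normalisation rules (strip whitespace, collapse equivalent numeric spellings) are assumptions chosen for illustration.

```python
from fractions import Fraction


def normalise(answer: str) -> str:
    """Canonicalise an answer string (hypothetical rules, not Epoch's)."""
    # Remove all whitespace so "2 * pi" and "2*pi" compare equal.
    text = "".join(answer.split())
    try:
        # Numeric spellings such as "0.50", "1/2" and "5e-1" collapse to one value.
        return str(Fraction(text))
    except (ValueError, ZeroDivisionError):
        # Anything that is not a plain number is compared as literal text.
        return text


def is_correct(submitted: str, reference: str) -> bool:
    """Exact match after normalisation; no partial credit."""
    return normalise(submitted) == normalise(reference)
```

Under these assumed rules, is_correct("0.50", "1/2") is true, while a symbolic answer written in a mathematically equivalent but textually different form would still be marked wrong. A real grader needs a stricter definition of "reasonable" than this.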

Cost: typically $5–50 per problem in inference and prover compute, depending on the model. Full-set runs are rarely public.

Common pitfalls when reporting Frontier Math

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

  • Tiny absolute scores. A 1pp difference sounds dramatic at 5–10% but is usually within sampling noise. Always show n and confidence intervals.
  • Compute-budget dependency. As with ARC-AGI, scores depend on how much inference budget the model is allowed. Cost-normalised scores are essential.
  • Private set means trust the evaluator. You can't replay these scores yourself. The benchmark's value depends entirely on Epoch's evaluation pipeline being trustworthy.
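
To see why small differences sit inside sampling noise, a confidence interval can be computed from the number of problems solved and the number attempted. The sketch below uses the Wilson score interval for a binomial proportion. Choosing Wilson is an assumption made here, not a method prescribed by Epoch or Benchlist, and the function name is illustrative.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)
```

For 5 correct out of 50 problems (a reported 10%), this gives an interval of roughly 4% to 21%, so a competing run at 12% on the same 50 problems is not distinguishable from it.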

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.
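
The ordering rule can be sketched as a two-part sort key. The Run record and its field names are hypothetical and are not the actual schema of /api/runs.json.

```python
from dataclasses import dataclass


@dataclass
class Run:
    model: str       # hypothetical field names, for illustration only
    score: float     # fraction of problems solved, 0.0 to 1.0
    attested: bool   # True if backed by a signed transcript


def rank(runs: list[Run]) -> list[Run]:
    """Attested runs first; within each group, higher score first."""
    return sorted(runs, key=lambda r: (not r.attested, -r.score))
```

With this key, an attested run at 5% is listed above a self-reported run at 15%.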

How to ship a Frontier Math score that nobody can challenge

Run Frontier Math on Benchlist

Benchlist runs the canonical Frontier Math sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
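
As a rough illustration of a Merkle commitment over transcripts, the sketch below hashes each transcript into a leaf and folds pairs of nodes up to a single root. The details are assumptions, not Benchlist's documented scheme: SHA-256, the 0x00/0x01 domain-separation prefixes, and duplication of the last node on odd levels. Signing the resulting root with an Ed25519 key is a separate step and is not shown.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: list[bytes]) -> str:
    """Return a hex Merkle root over the transcripts (illustrative scheme)."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Prefix leaves with 0x00 and internal nodes with 0x01 so that a leaf
    # can never be reinterpreted as an internal node.
    level = [_h(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumed rule: duplicate the last node
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Changing a single byte in any transcript changes the root, which is how one signature over the root can cover every transcript in the run.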

Hosted runner. POST a job and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "frontier-math",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'

Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run frontier-math --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

What is Frontier Math?
FrontierMath was introduced by Epoch AI in 2024 as a private collection of research-level math problems curated by working mathematicians. The set is hidden; only Epoch evaluators see solutions. Even frontier-scale models typically score in the 1–15% range.
How is Frontier Math scored?
Single numerical or short-symbolic answer per problem. Exact-match grading (with reasonable normalisation).
What's the biggest pitfall when reporting Frontier Math?
Tiny absolute scores. A 1pp difference sounds dramatic at 5–10% but is usually within sampling noise. Always show n and confidence intervals.
How do I verify a published Frontier Math score?
Use Benchlist. Run via benchlist run frontier-math or POST /api/v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can verify the signature in their browser.
What are the canonical decoding parameters for Frontier Math?
Per the catalog, Frontier Math runs at temperature 0 with max_tokens 1024. Deviating without disclosure makes scores incomparable.