τ-Bench: methodology, history, and how to verify a published score

History

τ-Bench was released by Sierra in 2024. It has two domains: retail (customer service for an e-commerce platform) and airline (flight ticket modification). The model interacts with a stateful environment and a simulated user, calling tools to make changes.

τ-Bench has become a canonical agent-evaluation benchmark; multiple labs now report it in model cards alongside SWE-Bench.

How τ-Bench is graded

Each task gives the model a customer query, a database state, and a fixed set of tools. The model must reason, call the right tools, and produce a final response. Grading checks both the final state of the database and the response text against canonical answers.
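The two-part check described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual grader: the function name grade_task, the dict-based database snapshot, and the substring match on the response are all assumptions made for the example.

```python
def grade_task(final_db: dict, expected_db: dict,
               response: str, required_outputs: list[str]) -> bool:
    """Hypothetical grader: pass only if BOTH checks succeed."""
    # Check 1: the database must end in exactly the expected state.
    state_ok = final_db == expected_db
    # Check 2: the final reply must mention every required piece of text
    # (case-insensitive substring match is an assumption of this sketch).
    text = response.lower()
    response_ok = all(item.lower() in text for item in required_outputs)
    return state_ok and response_ok


expected = {"order_1001": {"status": "cancelled"}}

# Correct state and the refund amount is mentioned: passes.
ok = grade_task({"order_1001": {"status": "cancelled"}}, expected,
                "Your refund of $54.20 is on its way.", ["54.20"])

# Right words, wrong database state: fails.
bad = grade_task({"order_1001": {"status": "pending"}}, expected,
                 "Your refund of $54.20 is on its way.", ["54.20"])
print(ok, bad)
```

The point of the conjunction is that a model cannot pass by saying the right thing without making the right change, or the reverse.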

Pass@1 is bounded by simulator stochasticity: the simulated user can interpret the same model output differently across runs, so a single trial is a noisy estimate. Reports should include n ≥ 3 trials.
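One way to summarise repeated trials is the pass^k reliability metric described in the τ-Bench paper: the chance that k trials of the same task all succeed, averaged over tasks. The sketch below assumes the usual combinatorial estimator C(c, k) / C(n, k) for a task with c successes in n trials; the function name and the sample data are hypothetical.

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], n_trials: int, k: int) -> float:
    """Average over tasks of C(c, k) / C(n, k), where c = successes in n trials."""
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical results: 4 tasks, 3 trials each, number of successful trials per task.
successes = [3, 2, 0, 3]
p1 = pass_hat_k(successes, n_trials=3, k=1)  # plain mean success rate
p3 = pass_hat_k(successes, n_trials=3, k=3)  # only tasks solved in all 3 trials count
print(round(p1, 3), round(p3, 3))
```

With k = 1 this reduces to the mean success rate; larger k penalises tasks the agent solves only some of the time, which is exactly the variance a single-trial number hides.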

Common pitfalls when reporting τ-Bench

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Harness sensitivity is high. Different agent loops (chain-of-thought, ReAct, plan-and-execute) produce very different scores. Always disclose which agent loop produced the number.
Domain skew. The retail and airline domains have different difficulty profiles. An aggregate score hides this, so report each domain separately.
Cost is non-trivial. A full τ-Bench eval at frontier-model token rates is ~$50–200 per model. Sub-sample if you need a fast smoke test.
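The domain-skew point can be made concrete with a small report that prints per-domain pass rates next to the aggregate. This is a sketch with made-up outcomes; the function name domain_report and the sample sizes are assumptions, not real τ-Bench results.

```python
def domain_report(results: dict[str, list[bool]]) -> dict[str, float]:
    """Per-domain pass rate plus the task-weighted aggregate."""
    report = {domain: sum(outcomes) / len(outcomes)
              for domain, outcomes in results.items()}
    all_outcomes = [x for outcomes in results.values() for x in outcomes]
    report["aggregate"] = sum(all_outcomes) / len(all_outcomes)
    return report


# Hypothetical outcomes: 10 retail tasks (6 passed), 5 airline tasks (2 passed).
results = {
    "retail": [True] * 6 + [False] * 4,
    "airline": [True] * 2 + [False] * 3,
}
report = domain_report(results)
for name, rate in report.items():
    print(f"{name}: {rate:.3f}")
```

Because the aggregate is weighted by task count, the larger domain dominates it; a reader who sees only the aggregate cannot tell how the smaller domain fared.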

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.

How to ship a τ-Bench score that nobody can challenge

Run τ-Bench on Benchlist

Benchlist runs the canonical τ-Bench sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
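To illustrate what a Merkle commitment over transcripts means, here is a generic sketch using SHA-256. It is not Benchlist's actual construction: the leaf and node prefixes, the duplicate-last-node rule for odd levels, and the function name merkle_root are all assumptions chosen for the example.

```python
import hashlib


def merkle_root(transcripts: list[bytes]) -> str:
    """Hex Merkle root over transcripts (generic sketch, not Benchlist's scheme)."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Leaves and inner nodes get different prefixes so one cannot pose as the other.
    level = [hashlib.sha256(b"\x00" + t).digest() for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()


transcripts = [b"task-0 transcript", b"task-1 transcript", b"task-2 transcript"]
root = merkle_root(transcripts)

# Changing a single byte in any transcript changes the root.
tampered = merkle_root([b"task-0 transcript", b"task-1 transcripT", b"task-2 transcript"])
print(root != tampered)
```

An attestor would then sign the root rather than every transcript, so a single Ed25519 signature covers the whole run. Signing is not shown here because it needs a third-party library.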

Hosted runner. POST a job and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "tau-bench",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
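The same request can be built from Python with the standard library. This sketch mirrors the curl call above field for field; the helper name build_run_request and the placeholder key are hypothetical, and the response format is not documented here, so the example stops short of sending the request or parsing a reply.

```python
import json
import urllib.request


def build_run_request(api_key: str,
                      model: str = "claude-sonnet-4.5",
                      limit: int = 50) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "service": "anthropic-claude",
        "model": model,
        "benchmark": "tau-bench",
        "runs": 1,
        "limit": limit,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        "https://benchlist.ai/api/v1/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_run_request("example-key")  # placeholder, not a real key
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would submit the job.
```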

Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run tau-bench --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

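What is τ-Bench?

An agent benchmark released by Sierra in 2024. It has two domains: retail (customer service for an e-commerce platform) and airline (flight ticket modification). The model interacts with a stateful environment and a simulated user, calling tools to make changes.

How is τ-Bench scored?

Grading checks both the final state of the database and the response text against canonical answers.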

What's the biggest pitfall when reporting τ-Bench?

Harness sensitivity is high. Different agent loops (chain-of-thought, ReAct, plan-and-execute) produce very different scores. Always disclose which agent loop produced the number.

How do I verify a published τ-Bench score?

Use Benchlist. Run via benchlist run tau-bench or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can verify the signature in their browser.

What are the canonical decoding parameters for τ-Bench?

Per the catalog, τ-Bench runs at temperature 0.0 with max_tokens 4096. Deviating without disclosure makes scores incomparable.
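A simple guard can flag undisclosed deviations before a score is published. The sketch below hard-codes the two catalog values quoted above; the function name decoding_deviations and the shape of the run config dict are assumptions for illustration.

```python
# Catalog values quoted above; the config layout is an assumption of this sketch.
CANONICAL = {"temperature": 0.0, "max_tokens": 4096}


def decoding_deviations(run_config: dict) -> dict:
    """Return every canonical parameter the run config differs on (or omits)."""
    return {
        key: {"expected": expected, "actual": run_config.get(key)}
        for key, expected in CANONICAL.items()
        if run_config.get(key) != expected
    }


clean = decoding_deviations({"temperature": 0.0, "max_tokens": 4096})
drift = decoding_deviations({"temperature": 0.7, "max_tokens": 4096})
print(clean)
print(drift)
```

An empty result means the run used the canonical settings; anything else should be disclosed alongside the score.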