History

τ-Bench was released by Sierra in 2024. It covers two domains: retail (customer service for an e-commerce platform) and airline (flight-ticket modification). The model interacts with a stateful environment and a simulated user, calling tools to make changes.

τ-Bench has become a canonical agent-evaluation benchmark: multiple labs now report it in model cards alongside SWE-Bench.

How τ-Bench is graded

Each task gives the model a customer query, a database state, and a fixed set of tools. The model must reason, call the right tools, and produce a final response. Grading checks both the final state of the database and the response text against canonical answers.
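The two-part check described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual grader: the `TaskResult` type, the `grade` function, and the substring rule for matching the response text are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    final_db: dict   # database state after the agent's tool calls (hypothetical shape)
    response: str    # the agent's final message to the user


def grade(result: TaskResult, expected_db: dict, required_outputs: list[str]) -> bool:
    """Return True only if both the state check and the response check pass."""
    # State check: the database must end exactly where the reference solution leaves it.
    state_ok = result.final_db == expected_db
    # Response check (assumed rule): every required string appears in the reply.
    reply = result.response.lower()
    text_ok = all(item.lower() in reply for item in required_outputs)
    return state_ok and text_ok


# Made-up task: cancel an order and quote the refund amount.
expected = {"order_42": {"status": "cancelled"}}
good = TaskResult({"order_42": {"status": "cancelled"}}, "Done. $54.20 will be refunded.")
bad = TaskResult({"order_42": {"status": "pending"}}, "Done. $54.20 will be refunded.")
```

Here `grade(good, expected, ["54.20"])` passes, while `bad` fails even though its reply is identical, because the database was left in the wrong state.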

Pass@1 is bounded by simulator stochasticity: the simulated user can interpret the same model output differently across runs, so a single run is a noisy estimate. Reports should include n ≥ 3 trials.
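One way to report several trials is to give the mean pass@1, its spread across trials, and the fraction of tasks solved in every trial. The sketch below assumes each trial is a list of per-task pass/fail booleans in a fixed task order. The function name and the returned field names are illustrative, not part of any τ-Bench tooling.

```python
from statistics import mean, stdev


def summarize_trials(trials: list[list[bool]]) -> dict:
    """trials[t][i] is True if task i passed in trial t."""
    if len(trials) < 3:
        raise ValueError("report at least 3 trials")
    n_tasks = len(trials[0])
    if n_tasks == 0 or any(len(t) != n_tasks for t in trials):
        raise ValueError("every trial must cover the same non-empty task list")
    # Pass@1 of each trial taken on its own.
    per_trial = [sum(t) / n_tasks for t in trials]
    # Fraction of tasks that passed in every single trial (a stricter reliability view).
    all_pass = sum(all(t[i] for t in trials) for i in range(n_tasks)) / n_tasks
    return {
        "n_trials": len(trials),
        "mean_pass_at_1": mean(per_trial),
        "stdev_pass_at_1": stdev(per_trial),
        "all_trials_pass": all_pass,
    }


# Made-up outcomes for 4 tasks over 3 trials.
example = [
    [True, True, False, False],
    [True, False, False, True],
    [True, True, False, False],
]
summary = summarize_trials(example)
```

In this example every trial scores 0.5, yet only one task in four is solved consistently, which is the kind of gap a single pass@1 number hides.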

Common pitfalls when reporting τ-Bench

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

  • Harness sensitivity is high. Different agent loops (chain-of-thought, ReAct, plan-and-execute) produce very different scores. Always disclose the harness used.
  • Domain skew. The retail and airline domains have different difficulty profiles. Aggregate scores hide this.
  • Cost is non-trivial. A full τ-Bench eval at frontier-model token rates is ~$50–200 per model. Sub-sample if you need a fast smoke test.
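The domain-skew point is easy to see with a per-domain breakdown next to the aggregate. The sketch below uses made-up pass/fail outcomes, and the task counts are not the real τ-Bench split. `domain_report` is a hypothetical helper name.

```python
def domain_report(results: dict[str, list[bool]]) -> dict[str, float]:
    """Pass rate per domain plus the task-weighted aggregate."""
    report = {domain: sum(outcomes) / len(outcomes) for domain, outcomes in results.items()}
    total_tasks = sum(len(outcomes) for outcomes in results.values())
    total_passed = sum(sum(outcomes) for outcomes in results.values())
    report["aggregate"] = total_passed / total_tasks
    return report


# Illustrative data only: 8 of 10 retail tasks pass, 2 of 5 airline tasks pass.
results = {
    "retail": [True] * 8 + [False] * 2,
    "airline": [True] * 2 + [False] * 3,
}
report = domain_report(results)
```

With this data the aggregate is about 0.67, which sits between a retail rate of 0.8 and an airline rate of 0.4 and describes neither domain well. Publishing all three numbers avoids the ambiguity.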

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results backed by a real signed transcript always rank above vendor-disclosed ones.

How to ship a τ-Bench score that nobody can challenge

Run τ-Bench on Benchlist

Benchlist runs the canonical τ-Bench sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
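To make the Merkle commitment concrete, here is a minimal sketch of a Merkle root over a list of transcripts using SHA-256. The leaf and node prefixes, the choice of hash, and the rule of duplicating the last node on odd levels are assumptions for illustration. The source does not specify Benchlist's actual tree construction, and the Ed25519 signing step is not shown.

```python
import hashlib


def merkle_root(transcripts: list[bytes]) -> str:
    """Hex-encoded Merkle root over raw transcript bytes (illustrative construction)."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Leaf hashes, prefixed with 0x00 so a leaf can never be confused with an inner node.
    level = [hashlib.sha256(b"\x00" + t).digest() for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed rule: duplicate the last node on odd levels
        # Pair up neighbours and hash each pair with the inner-node prefix 0x01.
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Made-up transcripts; changing any one of them changes the root.
root = merkle_root([b"transcript-1", b"transcript-2", b"transcript-3"])
tampered = merkle_root([b"transcript-1", b"transcript-X", b"transcript-3"])
```

The attestor would sign the root rather than every transcript, so a verifier holding the transcripts can recompute the root and check a single signature.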

Hosted runner: POST a job and we email the verify URL when it's done.

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "tau-bench",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
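The same request can be built from Python with only the standard library. This sketch mirrors the URL, headers, and JSON fields of the curl call above. The function name is hypothetical, and the request is only constructed here, not sent, since the response format is not documented in this section.

```python
import json
import urllib.request


def build_run_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "service": "anthropic-claude",
        "model": "claude-sonnet-4.5",
        "benchmark": "tau-bench",
        "runs": 1,
        "limit": 50,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        "https://benchlist.ai/api/v1/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# To submit the job, pass the request to urllib.request.urlopen(...).
request = build_run_request("YOUR_BENCHLIST_KEY")
```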

Self-hosted: install benchlist-runner via pip, point it at your inference key, and get a signed run.json.

pip install benchlist-runner
benchlist run tau-bench --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

What is τ-Bench?
τ-Bench is an agent benchmark released by Sierra in 2024. It covers two domains: retail (customer service for an e-commerce platform) and airline (flight-ticket modification). The model interacts with a stateful environment and a simulated user, calling tools to make changes.
How is τ-Bench scored?
Each task gives the model a customer query, a database state, and a fixed set of tools. The model must reason, call the right tools, and produce a final response. Grading checks both the final state of the database and the response text against canonical answers.
What's the biggest pitfall when reporting τ-Bench?
Harness sensitivity is high. Different agent loops (chain-of-thought, ReAct, plan-and-execute) produce very different scores. Always disclose the harness used.
How do I verify a published τ-Bench score?
Use Benchlist. Run via benchlist run tau-bench or POST /v1/run; the result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can verify the signature in their browser.
What are the canonical decoding parameters for τ-Bench?
Per the catalog, τ-Bench runs at temperature 0.0 with max_tokens 4096. Deviating without disclosure makes scores incomparable.
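A small check like the following can flag undisclosed deviations before a score is published. The two canonical values come from the answer above. The dictionary layout and the `decoding_deviations` helper are assumptions for illustration.

```python
# Canonical decoding parameters as stated in the catalog.
CANONICAL = {"temperature": 0.0, "max_tokens": 4096}


def decoding_deviations(used: dict) -> dict:
    """Map each deviating parameter to a (used, canonical) pair; empty if none deviate."""
    return {
        name: (used.get(name), canonical)
        for name, canonical in CANONICAL.items()
        if used.get(name) != canonical
    }


matching = decoding_deviations({"temperature": 0.0, "max_tokens": 4096})
deviating = decoding_deviations({"temperature": 0.7, "max_tokens": 4096})
```

An empty result means the run is comparable on decoding settings. Anything else should be disclosed alongside the score.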