Tau-bench
A benchmark for tool-using agents in realistic customer-service-style scenarios.
Definition
Tau-bench is a benchmark for evaluating tool-using agents in realistic, interactive scenarios. It checks whether an agent can follow domain rules, use the right tools, hold a multi-turn conversation with a simulated user, and end in the correct final state. This makes it closer to real work than a benchmark built on single-shot questions.
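The idea of checking the final state can be illustrated with a toy example. The sketch below is not Tau-bench's actual code or API: the database, the `cancel_order` tool, the domain rule, and the scripted agent are all hypothetical stand-ins, chosen only to show how an episode might be judged by its outcome rather than by the wording of the agent's replies.

```python
from copy import deepcopy

# Hypothetical toy database; not Tau-bench's real data or schema.
INITIAL_DB = {"orders": {"A1": {"status": "pending", "item": "lamp"}}}


def cancel_order(db, order_id):
    """Hypothetical tool: cancel an order while enforcing a domain rule."""
    order = db["orders"][order_id]
    # Assumed rule for illustration: only pending orders can be cancelled.
    if order["status"] != "pending":
        return "error: only pending orders can be cancelled"
    order["status"] = "cancelled"
    return "ok"


TOOLS = {"cancel_order": cancel_order}


def scripted_agent(user_message):
    """Stand-in for an LLM agent: maps a user message to a tool call or None."""
    if "cancel" in user_message:
        return ("cancel_order", {"order_id": "A1"})
    return None


def run_episode(agent, user_messages, initial_db):
    """Replay a scripted 'simulated user' and apply the agent's tool calls."""
    db = deepcopy(initial_db)  # keep the initial state untouched
    for message in user_messages:
        action = agent(message)
        if action is not None:
            name, kwargs = action
            TOOLS[name](db, **kwargs)
    return db


def episode_passes(final_db, expected_db):
    """Outcome-based check: the episode passes if the final state matches the goal."""
    return final_db == expected_db


expected = {"orders": {"A1": {"status": "cancelled", "item": "lamp"}}}
final = run_episode(scripted_agent, ["Hi", "Please cancel order A1"], INITIAL_DB)
print(episode_passes(final, expected))
```

Comparing states rather than transcripts means two agents that phrase their replies differently are scored the same as long as they leave the system in the same condition.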