τ-bench#

Overview#

τ-bench (Tau Bench) is a benchmark for evaluating conversational AI agents that interact with users through domain-specific API tools while following policy guidelines. It simulates dynamic, multi-turn conversations in which one language model plays the user and the model under evaluation acts as the agent.

Task Description#

  • Task Type: Conversational Agent Evaluation

  • Input: User scenarios with specific goals and constraints

  • Output: Agent actions via API tool calls to complete tasks

  • Domains: Airline customer service, Retail customer service

Key Features#

  • Dynamic conversation simulation with LLM-simulated users

  • Domain-specific API tools and policy guidelines

  • Realistic customer service scenarios

  • Tests multi-turn dialogue capabilities

  • Evaluates tool use and policy compliance

Evaluation Notes#

  • Installation Required: pip install git+https://github.com/sierra-research/tau-bench

  • User Model Configuration: Requires a user simulation model, configured through the user_model, api_key, and api_base extra parameters

  • Primary metric: Accuracy based on task completion reward

  • Supports airline and retail domains

  • Uses pass^k aggregation (mean_and_pass_hat_k) for robustness evaluation

Properties#

| Property | Value |
| --- | --- |
| Benchmark Name | tau_bench |
| Dataset ID | tau-bench |
| Paper | N/A |
| Tags | Agent, FunctionCalling, Reasoning |
| Metrics | N/A |
| Default Shots | 0-shot |
| Evaluation Split | test |
| Aggregation | mean_and_pass_hat_k |
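
The aggregation name suggests that both a mean reward and a pass^k score are reported. The sketch below illustrates one way such an aggregation could be computed. It assumes the usual pass^k definition (the chance that k trials drawn from a task's n recorded trials all succeeded, i.e. C(c, k) / C(n, k) for c successes), averaged over tasks. The function names and the sample rewards are hypothetical, and this is not taken from the evalscope implementation.

```python
from math import comb


def pass_hat_k(num_trials, num_successes, k):
    """Chance that k trials drawn without replacement from num_trials all succeeded."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    # comb(c, k) is 0 when k > c, so too few successes yields 0.0
    return comb(num_successes, k) / comb(num_trials, k)


def mean_and_pass_hat_k(rewards_per_task, k):
    """Return (mean reward, mean pass^k) over tasks; a reward of 1 counts as a success."""
    all_rewards = [r for trials in rewards_per_task for r in trials]
    per_task = [
        pass_hat_k(len(trials), sum(1 for r in trials if r == 1), k)
        for trials in rewards_per_task
    ]
    return sum(all_rewards) / len(all_rewards), sum(per_task) / len(per_task)


# Hypothetical rewards: three tasks, four trials each
rewards = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0]]
avg_reward, pass_2 = mean_and_pass_hat_k(rewards, k=2)
print(f"mean reward = {avg_reward:.3f}, pass^2 = {pass_2:.3f}")
# prints: mean reward = 0.500, pass^2 = 0.389
```

Unlike pass@k, which rewards at least one success in k attempts, pass^k drops quickly for agents that succeed only some of the time, which is why it is used as a consistency measure.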

Data Statistics#

Statistics not available.

Sample Example#

Sample example not available.

Prompt Template#

No prompt template defined.

Extra Parameters#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| user_model | str | qwen-plus | Model used to simulate the user in the environment. |
| api_key | str | EMPTY | API key for the user model backend. |
| api_base | str | https://dashscope.aliyuncs.com/compatible-mode/v1 | Base URL for the user model API requests. |
| generation_config | dict | {'temperature': 0.0} | Default generation config for user model simulation. |
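
As a rough sketch, the parameters above could be overridden by building a dataset_args dictionary. The example assumes they are nested under an extra_params key, as the commented-out line in the Python usage snippet suggests. The helper name build_tau_bench_args and the environment variable USER_MODEL_API_KEY are hypothetical and used only for illustration.

```python
import os


def build_tau_bench_args(
    user_model="qwen-plus",
    api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
    temperature=0.0,
):
    """Build a dataset_args dict for tau_bench (hypothetical helper)."""
    # USER_MODEL_API_KEY is an assumed variable name; falls back to the documented default
    api_key = os.environ.get("USER_MODEL_API_KEY", "EMPTY")
    return {
        "tau_bench": {
            # Assumption: overrides live under 'extra_params'
            "extra_params": {
                "user_model": user_model,
                "api_key": api_key,
                "api_base": api_base,
                "generation_config": {"temperature": temperature},
            }
        }
    }


dataset_args = build_tau_bench_args(user_model="qwen-plus")
print(sorted(dataset_args["tau_bench"]["extra_params"]))
```

The resulting dictionary would then be passed as dataset_args to TaskConfig. Reading the key from the environment keeps credentials out of the evaluation script.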

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets tau_bench \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['tau_bench'],
    dataset_args={
        'tau_bench': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)