τ-bench#
Overview#
τ-bench (Tau Bench) is a benchmark for evaluating conversational AI agents that interact with users through domain-specific API tools and policy guidelines. It simulates dynamic, multi-turn conversations in which one language model plays the user and the model under evaluation acts as the agent.
Task Description#
Task Type: Conversational Agent Evaluation
Input: User scenarios with specific goals and constraints
Output: Agent actions via API tool calls to complete tasks
Domains: Airline customer service, Retail customer service
Key Features#
Dynamic conversation simulation with LLM-simulated users
Domain-specific API tools and policy guidelines
Realistic customer service scenarios
Tests multi-turn dialogue capabilities
Evaluates tool use and policy compliance
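The features above can be pictured with a toy episode loop. This is only a sketch under stated assumptions: `ToyEnv`, `scripted_user`, `toy_agent`, `run_episode`, and the `STOP` sentinel are hypothetical stand-ins, not part of the tau-bench or EvalScope APIs, and a fixed script replaces the LLM-simulated user.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical domain environment holding state and one API tool."""
    orders: dict = field(default_factory=lambda: {"A1": "pending"})

    def cancel_order(self, order_id: str) -> str:
        # Policy check: only pending orders may be cancelled.
        if self.orders.get(order_id) != "pending":
            return "error: order cannot be cancelled"
        self.orders[order_id] = "cancelled"
        return "ok"


def scripted_user(turn: int) -> str:
    """Stands in for the LLM-simulated user by replaying a fixed script."""
    script = ["I want to cancel order A1.", "Yes, please go ahead.", "STOP"]
    return script[min(turn, len(script) - 1)]


def toy_agent(message: str, env: ToyEnv) -> str:
    """Stands in for the model under test: confirm first, then call the tool."""
    if "cancel" in message.lower():
        return "Please confirm: cancel order A1?"
    if message.lower().startswith("yes"):
        return f"Tool result: {env.cancel_order('A1')}"
    return "How can I help?"


def run_episode(max_turns: int = 5) -> float:
    env = ToyEnv()
    for turn in range(max_turns):
        user_msg = scripted_user(turn)
        if user_msg == "STOP":
            break
        toy_agent(user_msg, env)
    # Reward is 1.0 only if the final state matches the user's goal.
    return 1.0 if env.orders == {"A1": "cancelled"} else 0.0


reward = run_episode()
```

In this sketch the reward depends on the final environment state rather than on the wording of the dialogue, which is one plausible reading of "task completion reward"; the real benchmark may score episodes differently.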
Evaluation Notes#
Installation Required:
pip install git+https://github.com/sierra-research/tau-bench
User Model Configuration: Requires setting up a user simulation model
Primary metric: Accuracy based on task completion reward
Supports airline and retail domains
Uses pass@k aggregation for robustness evaluation
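This page does not say how the pass@k aggregate is computed. As an illustration only, the sketch below uses the commonly cited unbiased estimator, 1 - C(n-c, k) / C(n, k), for a task with `n` trials of which `c` succeeded. The function name `pass_at_k` is hypothetical, and EvalScope's actual aggregation may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n trials with c successes (assumed estimator)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Chance that k trials drawn without replacement include at least one success.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 trials and 2 successes, `pass_at_k(4, 2, 1)` is the plain success rate, and larger `k` can only raise the estimate.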
Properties#
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | N/A |
| Default Shots | 0-shot |
| Evaluation Split | |
| Aggregation | |
Data Statistics#
Statistics not available.
Sample Example#
Sample example not available.
Prompt Template#
No prompt template defined.
Extra Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | | Model used to simulate the user in the environment. |
| | | | API key for the user model backend. |
| | | | Base URL for the user model API requests. |
| | | | Default generation config for user model simulation. |
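The four parameters described above would be passed under `extra_params` for the `tau_bench` dataset. The sketch below shows the shape such a configuration might take. Every key name inside `extra_params`, the `USER_MODEL_API_KEY` environment variable, and the `temperature` setting are assumptions, because the parameter names, types, and defaults are blank on this page; check the EvalScope source before relying on them.

```python
import os

# ASSUMPTION: all key names below are illustrative guesses, not confirmed names.
extra_params = {
    # Model used to simulate the user in the environment.
    "user_model": "YOUR_USER_MODEL",
    # API key for the user model backend (hypothetical environment variable).
    "api_key": os.environ.get("USER_MODEL_API_KEY", "EMPTY_TOKEN"),
    # Base URL for the user model API requests.
    "api_base": "OPENAI_API_COMPAT_URL",
    # Default generation config for user model simulation (hypothetical value).
    "generation_config": {"temperature": 0.0},
}

# Shape expected by the `dataset_args` argument shown in the Usage section.
dataset_args = {"tau_bench": {"extra_params": extra_params}}
```

Reading the key from the environment keeps credentials out of the config file; the resulting `dataset_args` dictionary would then be passed to `TaskConfig`.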
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets tau_bench \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['tau_bench'],
    dataset_args={
        'tau_bench': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)