τ-bench#
Introduction#
τ-bench (TAU Bench) is a specialized benchmark framework designed to evaluate the tool utilization capabilities of large language models in dynamic dialogue agent scenarios. This benchmark simulates dynamic conversations between a user (simulated by a language model) and a language agent, where the agent is equipped with domain-specific API tools and strategic guidance.
Project URL: https://github.com/sierra-research/tau-bench
Key features of this benchmark include:
Dynamic Interaction: Simulates multi-turn dialogues between real users and AI agents.
Tool Integration: Agents must effectively utilize provided API tools.
Strategy Compliance: Agents must adhere to specific business strategies and guidelines.
Domain Expertise: Covers real-world business scenarios such as airline customer service and retail customer service.
Supported evaluation domains:
airline: Airline customer service scenariosretail: Retail customer service scenarios
Dependency Installation#
Before running the evaluation, install the following dependencies:
pip install evalscope # Install evalscope
pip install git+https://github.com/sierra-research/tau-bench # Install tau-bench
Usage#
Run the following code to start the evaluation. The example below uses the qwen-plus model for evaluation.
⚠️ Note: Only API model service evaluation is supported. For local model evaluation, it is recommended to use frameworks like vLLM to pre-launch the service. The official default user model is gpt-4o, and to achieve leaderboard results, please use this model for evaluation.
import os
from evalscope import TaskConfig, run_task
task_cfg = TaskConfig(
model='qwen-plus',
api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
api_key=os.getenv('DASHSCOPE_API_KEY'),
eval_type='openai_api', # Use API model service
datasets=['tau_bench'],
dataset_args={
'tau_bench': {
'subset_list': ['airline', 'retail'], # Select evaluation domains
'extra_params': {
'user_model': 'qwen-plus', # User simulation model
'api_key': os.getenv('DASHSCOPE_API_KEY'),
'api_base': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
'generation_config': {
'temperature': 0.7,
}
}
}
},
eval_batch_size=5,
limit=5, # Limit the number of evaluations for quick testing; remove for full evaluation
generation_config={
'temperature': 0.6,
},
)
run_task(task_cfg=task_cfg)
Evaluation Methodology#
Dialogue Process#
The evaluation process of τ-bench includes the following key steps:
Task Initialization: The system provides the agent with domain-specific API toolsets and strategic guidance.
User Simulation: The user simulator generates natural user requests based on predefined scenarios.
Agent Response: The tested agent model generates responses based on tools and strategies.
Multi-turn Interaction: The user and agent engage in multi-turn dialogues until the task is completed or fails.
Result Evaluation: Scoring is based on task completion and strategy adherence.
Evaluation Dimensions#
Whether the agent successfully completes the core task of the user request.
Whether the necessary API tools are used correctly.
Domain Characteristics#
Airline Scenario#
API Tools: Flight inquiry, booking modification, seat selection, ticket changes, etc.
Strategy Requirements: Fee transparency, customer prioritization, adherence to safety regulations.
Typical Tasks: Flight rescheduling, seat upgrades, baggage issue handling.
Retail Scenario#
API Tools: Product inquiry, order management, inventory check, payment processing, etc.
Strategy Requirements: Price accuracy, real-time inventory, customer satisfaction.
Typical Tasks: Product recommendations, order tracking, return and exchange processing.
Result Analysis#
Output Example#
+-----------+-----------+--------+---------+-------+---------+
| Model | Dataset | Metric | Subset | Num | Score |
+===========+===========+========+=========+=======+=========+
| qwen-plus | tau_bench | Pass^1 | airline | 10 | 0.5 |
+-----------+-----------+--------+---------+-------+---------+
| qwen-plus | tau_bench | Pass^1 | retail | 10 | 0.6 |
+-----------+-----------+--------+---------+-------+---------+
| qwen-plus | tau_bench | Pass^1 | OVERALL | 20 | 0.55 |
+-----------+-----------+--------+---------+-------+---------+
Performance Analysis Metrics#
Pass^1: The proportion of tasks successfully completed on the first attempt
Measures the model’s ability to resolve user issues in a single dialogue.
Considers task completion, strategy adherence, and dialogue quality comprehensively.