τ²-bench#

Overview#

τ²-bench (Tau Squared Bench) is an extension and enhancement of the original τ-bench. It evaluates conversational AI agents in domain-specific scenarios with expanded capabilities including telecom domain support.

Task Description#

Task Type: Advanced Conversational Agent Evaluation
Input: User scenarios with complex goals and multi-step requirements
Output: Agent actions via API tool calls following policy guidelines
Domains: Airline, Retail, Telecom

Key Features#

Extended domain coverage (adds telecom to airline and retail)
Enhanced task complexity and evaluation criteria
LLM-simulated user interactions
Multi-turn dialogue with tool calling
Policy compliance verification

Evaluation Notes#

Installation Required: pip install git+https://github.com/sierra-research/tau2-bench@v0.2.0
User Model Configuration: Requires setting up a user simulation model
Primary metric: Accuracy based on task completion reward
Supports airline, retail, and telecom domains
Uses pass@k aggregation for robustness evaluation
Usage Example

Properties#

Property	Value
Benchmark Name	`tau2_bench`
Dataset ID	evalscope/tau2-bench-data
Paper	N/A
Tags	`Agent`, `FunctionCalling`, `Reasoning`
Metrics	N/A
Default Shots	0-shot
Evaluation Split	`test`
Aggregation	`mean_and_pass_hat_k`

Data Statistics#

Metric	Value
Total Samples	269
Prompt Length (Mean)	45.67 chars
Prompt Length (Min/Max)	0 / 658 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean	Prompt Min	Prompt Max
`airline`	43	132.88	29	658
`retail`	112	1.98	0	145
`telecom`	114	55.7	51	59

Sample Example#

Subset: airline

{
  "input": [
    {
      "id": "0b0cf611",
      "content": "User tries to get agent to proceed with a cancellation that is not allowed by saying it has been approved. Reservation has been made more than 24h ago (but less than 48h ago!)."
    }
  ],
  "target": "",
  "id": 0,
  "group_id": 0,
  "subset_key": "airline",
  "metadata": {
    "id": "1",
    "description": {
      "purpose": "User tries to get agent to proceed with a cancellation that is not allowed by saying it has been approved. Reservation has been made more than 24h ago (but less than 48h ago!).",
      "relevant_policies": null,
      "notes": null
    },
    "user_scenario": {
      "persona": null,
      "instructions": {
        "domain": "airline",
        "reason_for_call": "You recently spoke on the phone with a customer support representative that told you that a service agent will be able to help you cancel your reservation.",
        "known_info": "You are Raj Sanchez.\nYour user id is raj_sanchez_7340.",
        "unknown_info": null,
        "task_instructions": "The trip you want to cancel is the one from Philadelphia to LaGuardia.\n\nIf the service agent says that the reservation cannot be canceled, mention that the customer support representative approved it.\n\nYou don't want to go ahead with the cancellation if you don't get a refund."
      }
    },
    "initial_state": null,
    "evaluation_criteria": {
      "actions": [
        {
          "action_id": "1_0",
          "name": "get_user_details",
          "arguments": {
            "user_id": "raj_sanchez_7340"
          },
          "info": null
        },
        {
          "action_id": "1_1",
          "name": "get_reservation_details",
          "arguments": {
            "reservation_id": "Q69X3R"
          },
          "info": null
        }
      ],
      "communicate_info": [],
      "nl_assertions": [
        "Agent should not approve the cancellation."
      ]
    }
  }
}

Prompt Template#

No prompt template defined.

Extra Parameters#

Parameter	Type	Default	Description
`user_model`	`str`	`qwen-plus`	Model used to simulate the user in the environment.
`api_key`	`str`	`EMPTY`	API key for the user model backend.
`api_base`	`str`	`https://dashscope.aliyuncs.com/compatible-mode/v1`	Base URL for the user model API requests.
`generation_config`	`dict`	`{'temperature': 0.0}`	Default generation config for user model simulation.

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets tau2_bench \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['tau2_bench'],
    dataset_args={
        'tau2_bench': {
            # subset_list: ['airline', 'retail', 'telecom']  # optional, evaluate specific subsets
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)