AGENT Benchmarks

The supported AGENT benchmarks are listed below. Details for each benchmark follow in the Benchmark Details section.

Benchmark Name    Pretty Name                Task Categories
bfcl_v3           BFCL-v3                    Agent, FunctionCalling
bfcl_v4           BFCL-v4                    Agent, FunctionCalling
general_fc        General-FunctionCalling    Agent, Custom, FunctionCalling
tau2_bench        τ²-bench                   Agent, FunctionCalling, Reasoning
tau_bench         τ-bench                    Agent, FunctionCalling, Reasoning
tool_bench        ToolBench-Static           FunctionCalling, Reasoning


Benchmark Details

BFCL-v3

  • Dataset Name: bfcl_v3

  • Dataset ID: AI-ModelScope/bfcl_v3

  • Description:

    The Berkeley Function Calling Leaderboard (BFCL) is the first comprehensive, executable function-calling evaluation dedicated to assessing the ability of Large Language Models (LLMs) to invoke functions. Unlike previous evaluations, BFCL accounts for various forms of function calls, diverse scenarios, and executability. Run pip install bfcl-eval==2025.10.27.1 before evaluating. See the Usage Example.

  • Task Categories: Agent, FunctionCalling

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Evaluation Split: train

  • Subsets: irrelevance, java, javascript, live_irrelevance, live_multiple, live_parallel_multiple, live_parallel, live_relevance, live_simple, multi_turn_base, multi_turn_long_context, multi_turn_miss_func, multi_turn_miss_param, multiple, parallel_multiple, parallel, simple

  • Extra Parameters:

{
    "underscore_to_dot": {
        "type": "bool",
        "description": "Convert underscores to dots in function names for evaluation.",
        "value": true
    },
    "is_fc_model": {
        "type": "bool",
        "description": "Indicates the evaluated model natively supports function calling.",
        "value": true
    }
}
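
As a minimal sketch, the subsets and extra parameters above could be assembled into a per-benchmark argument dictionary as follows. The `subset_list` and `extra_params` key names are assumptions about how the evaluation framework accepts per-benchmark options, and `build_bfcl_v3_args` is a hypothetical helper, so check the framework's configuration reference before relying on either.

```python
# Subsets listed for bfcl_v3 in this section.
BFCL_V3_SUBSETS = {
    "irrelevance", "java", "javascript", "live_irrelevance", "live_multiple",
    "live_parallel_multiple", "live_parallel", "live_relevance", "live_simple",
    "multi_turn_base", "multi_turn_long_context", "multi_turn_miss_func",
    "multi_turn_miss_param", "multiple", "parallel_multiple", "parallel",
    "simple",
}


def build_bfcl_v3_args(subsets=None, underscore_to_dot=True, is_fc_model=True):
    """Build a per-benchmark argument dict for bfcl_v3 (key names are assumed)."""
    chosen = sorted(BFCL_V3_SUBSETS) if subsets is None else list(subsets)
    unknown = [name for name in chosen if name not in BFCL_V3_SUBSETS]
    if unknown:
        raise ValueError(f"Unknown bfcl_v3 subsets: {unknown}")
    return {
        "bfcl_v3": {
            "subset_list": chosen,  # assumed key name
            "extra_params": {       # assumed key name
                "underscore_to_dot": underscore_to_dot,
                "is_fc_model": is_fc_model,
            },
        }
    }


# Example: evaluate two subsets with a model that lacks native function calling.
args = build_bfcl_v3_args(subsets=["simple", "multi_turn_base"], is_fc_model=False)
print(args)
```

Validating subset names up front catches typos before a long evaluation run starts.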

BFCL-v4

  • Dataset Name: bfcl_v4

  • Dataset ID: berkeley-function-call-leaderboard

  • Description:

    With function calling being the building block of agents, the Berkeley Function-Calling Leaderboard (BFCL) V4 presents a holistic agentic evaluation for LLMs. BFCL V4 Agentic covers web search, memory, and format sensitivity. Together, the abilities to search the web, read from and write to memory, and invoke functions in different languages form the building blocks for the challenging applications that power agentic LLMs today, from deep research to agents for coding and law. Run pip install bfcl-eval==2025.10.27.1 before evaluating. See the Usage Example.

  • Task Categories: Agent, FunctionCalling

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Evaluation Split: train

  • Subsets: irrelevance, live_irrelevance, live_multiple, live_parallel_multiple, live_parallel, live_relevance, live_simple, memory_kv, memory_rec_sum, memory_vector, multi_turn_base, multi_turn_long_context, multi_turn_miss_func, multi_turn_miss_param, multiple, parallel_multiple, parallel, simple_java, simple_javascript, simple_python, web_search_base, web_search_no_snippet

  • Extra Parameters:

{
    "underscore_to_dot": {
        "type": "bool",
        "description": "Convert underscores to dots in function names for evaluation.",
        "value": true
    },
    "is_fc_model": {
        "type": "bool",
        "description": "Indicates the evaluated model natively supports function calling.",
        "value": true
    },
    "SERPAPI_API_KEY": {
        "type": "str | null",
        "description": "SerpAPI key enabling web-search capability in BFCL V4. Null disables web search.",
        "value": null
    }
}
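
Because a null `SERPAPI_API_KEY` disables web search, one reasonable approach is to read the key from the environment and skip the web-search subsets when it is absent. The sketch below illustrates that idea. The `subset_list` and `extra_params` key names are assumptions, `build_bfcl_v4_args` is a hypothetical helper, and dropping the `web_search_*` subsets is a choice made here, not documented framework behavior.

```python
import os

# Subsets listed for bfcl_v4 in this section.
BFCL_V4_SUBSETS = [
    "irrelevance", "live_irrelevance", "live_multiple", "live_parallel_multiple",
    "live_parallel", "live_relevance", "live_simple", "memory_kv",
    "memory_rec_sum", "memory_vector", "multi_turn_base",
    "multi_turn_long_context", "multi_turn_miss_func", "multi_turn_miss_param",
    "multiple", "parallel_multiple", "parallel", "simple_java",
    "simple_javascript", "simple_python", "web_search_base",
    "web_search_no_snippet",
]


def build_bfcl_v4_args(env=None):
    """Build bfcl_v4 arguments, skipping web-search subsets without a key."""
    env = os.environ if env is None else env
    key = env.get("SERPAPI_API_KEY") or None  # treat an empty string as missing
    subsets = [
        name for name in BFCL_V4_SUBSETS
        if key is not None or not name.startswith("web_search")
    ]
    return {
        "bfcl_v4": {
            "subset_list": subsets,  # assumed key name
            "extra_params": {        # assumed key name
                "underscore_to_dot": True,
                "is_fc_model": True,
                "SERPAPI_API_KEY": key,
            },
        }
    }


args = build_bfcl_v4_args()
print(len(args["bfcl_v4"]["subset_list"]), "subsets selected")
```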

General-FunctionCalling

  • Dataset Name: general_fc

  • Dataset ID: evalscope/GeneralFunctionCall-Test

  • Description:

    A general function calling dataset for custom evaluation. For detailed instructions on how to use this benchmark, please refer to the User Guide.

  • Task Categories: Agent, Custom, FunctionCalling

  • Evaluation Metrics: count_finish_reason_tool_call, count_successful_tool_call, schema_accuracy, tool_call_f1

  • Aggregation Methods: f1

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Evaluation Split: test

  • Subsets: default
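
The exact metric definitions are not given in this section. As an illustration only, an F1 score over "should the model call a tool?" decisions could be computed as in the sketch below. The function name `binary_f1` and the boolean input format are hypothetical and are not taken from the benchmark's implementation.

```python
def binary_f1(expected, predicted):
    """F1 over per-sample booleans: True means a tool call was expected/made."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if e and p)        # tool call expected and made
    fp = sum(1 for e, p in pairs if not e and p)    # tool call made but not expected
    fn = sum(1 for e, p in pairs if e and not p)    # tool call expected but missing
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Four samples: three should trigger a tool call, and the model misses one.
expected = [True, True, False, True]
predicted = [True, False, False, True]
print(round(binary_f1(expected, predicted), 3))
```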


τ²-bench

  • Dataset Name: tau2_bench

  • Dataset ID: evalscope/tau2-bench-data

  • Description:

    τ²-bench (Tau Squared Bench) extends and enhances the original τ-bench (Tau Bench), a benchmark designed to evaluate conversational AI agents that interact with users through domain-specific API tools and guidelines. Before evaluating, install it with pip install git+https://github.com/sierra-research/tau2-bench@v0.2.0 and set a user model. See the Usage Example.

  • Task Categories: Agent, FunctionCalling, Reasoning

  • Evaluation Metrics:

  • Aggregation Methods: mean_and_pass_hat_k

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Evaluation Split: test

  • Subsets: airline, retail, telecom

  • Extra Parameters:

{
    "user_model": {
        "type": "str",
        "description": "Model used to simulate the user in the environment.",
        "value": "qwen-plus"
    },
    "api_key": {
        "type": "str",
        "description": "API key for the user model backend.",
        "value": "EMPTY"
    },
    "api_base": {
        "type": "str",
        "description": "Base URL for the user model API requests.",
        "value": "https://dashscope.aliyuncs.com/compatible-mode/v1"
    },
    "generation_config": {
        "type": "dict",
        "description": "Default generation config for user model simulation.",
        "value": {
            "temperature": 0.0
        }
    }
}
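
The aggregation name `mean_and_pass_hat_k` suggests that results are reported both as a mean and as pass^k. As introduced with τ-bench, pass^k estimates the probability that all k independent trials of a task succeed: for a task with n trials and c successes it is C(c, k) / C(n, k), averaged over tasks. The sketch below implements that estimator. `pass_hat_k` is a hypothetical helper, and whether the framework computes the metric in exactly this way is an assumption.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Average over tasks of C(c, k) / C(n, k), where c is the success count."""
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    if not successes_per_task:
        raise ValueError("at least one task is required")
    total = comb(n_trials, k)
    # comb(c, k) is 0 when c < k, so tasks with too few successes contribute 0.
    per_task = [comb(c, k) / total for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Three tasks with four trials each: 4, 2, and 0 successful trials.
successes = [4, 2, 0]
for k in (1, 2, 4):
    print(k, round(pass_hat_k(successes, n_trials=4, k=k), 4))
```

With k = 1 the value reduces to the mean success rate, and it can only decrease as k grows, which is why pass^k rewards consistency across repeated trials.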

τ-bench

  • Dataset Name: tau_bench

  • Dataset ID: tau-bench

  • Description:

    τ-bench emulates dynamic conversations between a user (simulated by a language model) and a language agent provided with domain-specific API tools and policy guidelines. Before evaluating, install it with pip install git+https://github.com/sierra-research/tau-bench and set a user model. See the Usage Example.

  • Task Categories: Agent, FunctionCalling, Reasoning

  • Evaluation Metrics:

  • Aggregation Methods: mean_and_pass_hat_k

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Evaluation Split: test

  • Subsets: airline, retail

  • Extra Parameters:

{
    "user_model": {
        "type": "str",
        "description": "Model used to simulate the user in the environment.",
        "value": "qwen-plus"
    },
    "api_key": {
        "type": "str",
        "description": "API key for the user model backend.",
        "value": "EMPTY"
    },
    "api_base": {
        "type": "str",
        "description": "Base URL for the user model API requests.",
        "value": "https://dashscope.aliyuncs.com/compatible-mode/v1"
    },
    "generation_config": {
        "type": "dict",
        "description": "Default generation config for user model simulation.",
        "value": {
            "temperature": 0.0
        }
    }
}
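
Since the default `api_key` is the placeholder `"EMPTY"`, a real key usually has to be supplied for the user model. The following sketch merges overrides into the defaults shown above and picks up a key from the environment. The environment variable name `USER_MODEL_API_KEY` and the helper `build_user_model_params` are hypothetical and used only for illustration.

```python
import copy
import os

# Default values taken from the Extra Parameters listed above.
DEFAULT_USER_MODEL_PARAMS = {
    "user_model": "qwen-plus",
    "api_key": "EMPTY",
    "api_base": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "generation_config": {"temperature": 0.0},
}


def build_user_model_params(overrides=None, env=None):
    """Return user-model parameters with overrides and an env-provided key."""
    env = os.environ if env is None else env
    params = copy.deepcopy(DEFAULT_USER_MODEL_PARAMS)  # keep defaults untouched
    params.update(overrides or {})
    key = env.get("USER_MODEL_API_KEY")  # hypothetical variable name
    if key:
        params["api_key"] = key
    return params


params = build_user_model_params({"generation_config": {"temperature": 0.2}})
print(params["user_model"], params["generation_config"])
```

Reading the key from the environment keeps credentials out of configuration files that may be committed to version control.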

ToolBench-Static

  • Dataset Name: tool_bench

  • Dataset ID: AI-ModelScope/ToolBench-Static

  • Description:

    ToolBench is a benchmark for evaluating AI models on tool-use tasks. It includes subsets such as in-domain and out-of-domain, each with its own set of problems that require step-by-step reasoning to reach the correct answer. See the Usage Example.

  • Task Categories: FunctionCalling, Reasoning

  • Evaluation Metrics: Act.EM, F1, HalluRate, Plan.EM, Rouge-L

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Evaluation Split: test

  • Subsets: in_domain, out_of_domain