BFCL-v3#

Overview#

BFCL (Berkeley Function Calling Leaderboard) v3 is the first comprehensive and executable function call evaluation benchmark for assessing LLMs’ ability to invoke functions. It evaluates various forms of function calls, diverse scenarios, and executability.

Task Description#

Task Type: Function Calling / Tool Use Evaluation
Input: User query with available function definitions
Output: Correct function call with parameters
Categories: AST_NON_LIVE, AST_LIVE, RELEVANCE, MULTI_TURN

Key Features#

Comprehensive function call evaluation
Tests simple, multiple, and parallel function calls
Multi-language support (Python, Java, JavaScript)
Relevance detection (irrelevance handling)
Multi-turn conversation function calling
Live API endpoints for execution testing

Evaluation Notes#

Requires pip install bfcl-eval==2025.10.27.1 before evaluation
Set is_fc_model=True for models with native function calling support
underscore_to_dot parameter controls function name formatting
See the usage documentation for detailed setup
Results broken down by category (AST, Relevance, Multi-turn)

Properties#

Property	Value
Benchmark Name	`bfcl_v3`
Dataset ID	AI-ModelScope/bfcl_v3
Paper	N/A
Tags	`Agent`, `FunctionCalling`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`train`

Data Statistics#

Metric	Value
Total Samples	4,441
Prompt Length (Mean)	291.25 chars
Prompt Length (Min/Max)	36 / 10805 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean	Prompt Min	Prompt Max
`simple`	400	119.66	57	303
`multiple`	200	120.75	65	229
`parallel`	200	298.92	69	781
`parallel_multiple`	200	432.17	88	1249
`java`	100	228.93	138	531
`javascript`	50	222.34	118	637
`live_simple`	258	161.74	47	1247
`live_multiple`	1,053	138.76	42	4828
`live_parallel`	16	150.44	88	365
`live_parallel_multiple`	24	217.67	72	650
`irrelevance`	240	85.96	55	203
`live_relevance`	18	263.83	71	1214
`live_irrelevance`	882	188.26	36	10805
`multi_turn_base`	200	803.33	106	1756
`multi_turn_miss_func`	200	806.75	110	1760
`multi_turn_miss_param`	200	858.2	167	1791
`multi_turn_long_context`	200	803.29	106	1756

Sample Example#

Subset: simple

{
  "input": [
    {
      "id": "15e84989",
      "content": "[[{\"role\": \"user\", \"content\": \"Find the area of a triangle with a base of 10 units and height of 5 units.\"}]]"
    }
  ],
  "target": "[{\"calculate_triangle_area\": {\"base\": [10], \"height\": [5], \"unit\": [\"units\", \"\"]}}]",
  "id": 0,
  "group_id": 0,
  "subset_key": "simple",
  "metadata": {
    "id": "simple_0",
    "multi_turn": false,
    "functions": [
      {
        "name": "calculate_triangle_area",
        "description": "Calculate the area of a triangle given its base and height.",
        "parameters": {
          "type": "dict",
          "properties": {
            "base": {
              "type": "integer",
              "description": "The base of the triangle."
            },
            "height": {
              "type": "integer",
              "description": "The height of the triangle."
            },
            "unit": {
              "type": "string",
              "description": "The unit of measure (defaults to 'units' if not specified)"
            }
          },
          "required": [
            "base",
            "height"
          ]
        }
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "calculate_triangle_area",
          "description": "Calculate the area of a triangle given its base and height. Note that the provided function is in Python 3 syntax.",
          "parameters": {
            "type": "object",
            "properties": {
              "base": {
                "type": "integer",
                "description": "The base of the triangle."
              },
              "height": {
                "type": "integer",
                "description": "The height of the triangle."
              },
              "unit": {
                "type": "string",
                "description": "The unit of measure (defaults to 'units' if not specified)"
              }
            },
            "required": [
              "base",
              "height"
            ]
          }
        }
      }
    ],
    "missed_functions": "{}",
    "initial_config": {},
    "involved_classes": [],
    "turns": [
      [
        {
          "role": "user",
          "content": "Find the area of a triangle with a base of 10 units and height of 5 units."
        }
      ]
    ],
    "language": "Python",
    "test_category": "simple",
    "subset": "simple",
    "ground_truth": [
      {
        "calculate_triangle_area": {
          "base": [
            10
          ],
          "height": [
            5
          ],
          "unit": [
            "units",
            ""
          ]
        }
      }
    ],
    "should_execute_tool_calls": false,
    "missing_functions": {},
    "is_fc_model": true
  }
}

Prompt Template#

No prompt template defined.

Extra Parameters#

Parameter	Type	Default	Description
`underscore_to_dot`	`bool`	`True`	Convert underscores to dots in function names for evaluation.
`is_fc_model`	`bool`	`True`	Indicates the evaluated model natively supports function calling.

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets bfcl_v3 \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['bfcl_v3'],
    dataset_args={
        'bfcl_v3': {
            # subset_list: ['simple', 'multiple', 'parallel']  # optional, evaluate specific subsets
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)