BFCL-v3#
Overview#
BFCL (Berkeley Function Calling Leaderboard) v3 is the first comprehensive and executable function call evaluation benchmark for assessing LLMs’ ability to invoke functions. It evaluates various forms of function calls, diverse scenarios, and executability.
Task Description#
Task Type: Function Calling / Tool Use Evaluation
Input: User query with available function definitions
Output: Correct function call with parameters
Categories: AST_NON_LIVE, AST_LIVE, RELEVANCE, MULTI_TURN
Key Features#
Comprehensive function call evaluation
Tests simple, multiple, and parallel function calls
Multi-language support (Python, Java, JavaScript)
Relevance detection (irrelevance handling)
Multi-turn conversation function calling
Live API endpoints for execution testing
Evaluation Notes#
Requires
pip install bfcl-eval==2025.10.27.1before evaluationSet
is_fc_model=Truefor models with native function calling supportunderscore_to_dotparameter controls function name formattingSee the usage documentation for detailed setup
Results broken down by category (AST, Relevance, Multi-turn)
Properties#
Property |
Value |
|---|---|
Benchmark Name |
|
Dataset ID |
|
Paper |
N/A |
Tags |
|
Metrics |
|
Default Shots |
0-shot |
Evaluation Split |
|
Data Statistics#
Metric |
Value |
|---|---|
Total Samples |
4,441 |
Prompt Length (Mean) |
291.25 chars |
Prompt Length (Min/Max) |
36 / 10805 chars |
Per-Subset Statistics:
Subset |
Samples |
Prompt Mean |
Prompt Min |
Prompt Max |
|---|---|---|---|---|
|
400 |
119.66 |
57 |
303 |
|
200 |
120.75 |
65 |
229 |
|
200 |
298.92 |
69 |
781 |
|
200 |
432.17 |
88 |
1249 |
|
100 |
228.93 |
138 |
531 |
|
50 |
222.34 |
118 |
637 |
|
258 |
161.74 |
47 |
1247 |
|
1,053 |
138.76 |
42 |
4828 |
|
16 |
150.44 |
88 |
365 |
|
24 |
217.67 |
72 |
650 |
|
240 |
85.96 |
55 |
203 |
|
18 |
263.83 |
71 |
1214 |
|
882 |
188.26 |
36 |
10805 |
|
200 |
803.33 |
106 |
1756 |
|
200 |
806.75 |
110 |
1760 |
|
200 |
858.2 |
167 |
1791 |
|
200 |
803.29 |
106 |
1756 |
Sample Example#
Subset: simple
{
"input": [
{
"id": "15e84989",
"content": "[[{\"role\": \"user\", \"content\": \"Find the area of a triangle with a base of 10 units and height of 5 units.\"}]]"
}
],
"target": "[{\"calculate_triangle_area\": {\"base\": [10], \"height\": [5], \"unit\": [\"units\", \"\"]}}]",
"id": 0,
"group_id": 0,
"subset_key": "simple",
"metadata": {
"id": "simple_0",
"multi_turn": false,
"functions": [
{
"name": "calculate_triangle_area",
"description": "Calculate the area of a triangle given its base and height.",
"parameters": {
"type": "dict",
"properties": {
"base": {
"type": "integer",
"description": "The base of the triangle."
},
"height": {
"type": "integer",
"description": "The height of the triangle."
},
"unit": {
"type": "string",
"description": "The unit of measure (defaults to 'units' if not specified)"
}
},
"required": [
"base",
"height"
]
}
}
],
"tools": [
{
"type": "function",
"function": {
"name": "calculate_triangle_area",
"description": "Calculate the area of a triangle given its base and height. Note that the provided function is in Python 3 syntax.",
"parameters": {
"type": "object",
"properties": {
"base": {
"type": "integer",
"description": "The base of the triangle."
},
"height": {
"type": "integer",
"description": "The height of the triangle."
},
"unit": {
"type": "string",
"description": "The unit of measure (defaults to 'units' if not specified)"
}
},
"required": [
"base",
"height"
]
}
}
}
],
"missed_functions": "{}",
"initial_config": {},
"involved_classes": [],
"turns": [
[
{
"role": "user",
"content": "Find the area of a triangle with a base of 10 units and height of 5 units."
}
]
],
"language": "Python",
"test_category": "simple",
"subset": "simple",
"ground_truth": [
{
"calculate_triangle_area": {
"base": [
10
],
"height": [
5
],
"unit": [
"units",
""
]
}
}
],
"should_execute_tool_calls": false,
"missing_functions": {},
"is_fc_model": true
}
}
Prompt Template#
No prompt template defined.
Extra Parameters#
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
|
Convert underscores to dots in function names for evaluation. |
|
|
|
Indicates the evaluated model natively supports function calling. |
Usage#
Using CLI#
evalscope eval \
--model YOUR_MODEL \
--api-url OPENAI_API_COMPAT_URL \
--api-key EMPTY_TOKEN \
--datasets bfcl_v3 \
--limit 10 # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
model='YOUR_MODEL',
api_url='OPENAI_API_COMPAT_URL',
api_key='EMPTY_TOKEN',
datasets=['bfcl_v3'],
dataset_args={
'bfcl_v3': {
# subset_list: ['simple', 'multiple', 'parallel'] # optional, evaluate specific subsets
# extra_params: {} # uses default extra parameters
}
},
limit=10, # Remove this line for formal evaluation
)
run_task(task_cfg=task_cfg)