ToolBench-Static#
Overview#
ToolBench-Static is a benchmark for evaluating AI models’ ability to use tools and execute function calls in a static evaluation setting. It tests models on realistic tool-use scenarios with ground-truth comparisons.
Task Description#
Task Type: Tool Use / Function Calling Evaluation
Input: Conversation with available tools and context
Output: Correct tool/function call sequence
Subsets: In-domain (seen tools) and Out-of-domain (unseen tools)
Key Features#
Static evaluation (no actual tool execution)
Tests both tool selection and parameter generation
In-domain and out-of-domain evaluation splits
Multiple evaluation metrics for comprehensive assessment
Tests planning and action execution capabilities
Evaluation Notes#
Default configuration uses 0-shot evaluation
Five metrics: Act.EM, Plan.EM, F1, HalluRate, Rouge-L
F1 score is the primary metric
HalluRate measures tool hallucination frequency
See the usage documentation for detailed setup
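The metric names above are not defined on this page, so the following is only a minimal sketch of how three of them could be computed under assumed definitions: exact match on the action name for Act.EM, a whitespace-token F1 over the action input, and a hallucination check that flags any predicted tool name absent from the available tool list. The function names are hypothetical and the benchmark's actual scoring may differ.

```python
from collections import Counter
from typing import Iterable


def action_exact_match(pred_action: str, ref_action: str) -> float:
    """Assumed Act.EM: 1.0 if the predicted tool name equals the reference."""
    return float(pred_action.strip() == ref_action.strip())


def is_hallucinated(pred_action: str, available_tools: Iterable[str]) -> bool:
    """Assumed hallucination check: the predicted tool is not in the tool list."""
    return pred_action.strip() not in set(available_tools)


def token_f1(pred: str, ref: str) -> float:
    """Whitespace-token F1 between predicted and reference action input."""
    pred_tokens, ref_tokens = pred.split(), ref.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def hallucination_rate(pred_actions: Iterable[str], available_tools: Iterable[str]) -> float:
    """Fraction of predictions that name a tool outside the available list."""
    preds = list(pred_actions)
    tools = set(available_tools)
    if not preds:
        return 0.0
    return sum(is_hallucinated(p, tools) for p in preds) / len(preds)
```

Under these assumptions, a lower hallucination rate is better, while the exact-match and F1 scores are higher-is-better.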
Properties#
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 2,369 |
| Prompt Length (Mean) | 7628.39 chars |
| Prompt Length (Min/Max) | 2720 / 23468 chars |
Per-Subset Statistics:
| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|---|---|---|---|---|
| | 1,588 | 7943.34 | 2720 | 23468 |
| | 781 | 6988.01 | 2788 | 20967 |
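As a quick consistency check, the per-subset figures can be combined to recover the overall totals. The sketch below uses only the numbers reported in the tables; the variable names are illustrative, and the rows are listed in table order because the subset labels are not shown.

```python
# (samples, mean prompt length in chars) for the two subset rows, in table order
subset_rows = [(1588, 7943.34), (781, 6988.01)]

# The subset sample counts should add up to the reported total.
total_samples = sum(samples for samples, _ in subset_rows)

# The overall mean is the sample-weighted average of the subset means.
weighted_mean = sum(samples * mean for samples, mean in subset_rows) / total_samples

print(total_samples)            # 2369
print(round(weighted_mean, 2))  # 7628.39
```

Both values agree with the Total Samples and Prompt Length (Mean) rows of the overall statistics.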
Sample Example#
Subset: in_domain
{
"input": [
{
"id": "8b5b61b4",
"content": "You are AutoGPT, you can use many tools(functions) to do the following task.\nFirst I will give you the task description, and your task start.\nAt each step, you need to give your thought to analyze the status now and what to do next, with a fu ... [TRUNCATED] ... \"enum\": [\"give_answer\", \"give_up_and_restart\"]}, \"final_answer\": {\"type\": \"string\", \"description\": \"The final answer you want to give the user. You should have this field if \\\"return_type\\\"==\\\"give_answer\\\"\"}}, \"required\": [\"return_type\"]}}]",
"role": "system"
},
{
"id": "97df9194",
"content": "\nI'm planning a surprise birthday party for my sister and I need to find a suitable venue. Can you suggest event spaces in the city center with a capacity for 50 people? Also, recommend nearby catering services and provide me with a list of available dates for booking.\nBegin!\n",
"role": "user"
}
],
"target": "",
"id": 0,
"group_id": 0,
"metadata": {
"target": "Thought: \nAction: spott\nAction Input: {\n \"is_id\": \"city center\"\n}",
"tools": [
{
"name": "get_place_by_my_ip_for_spott",
"description": "This is the subfunction for tool \"spott\", you can use this tool.The description of this function is: \"Returns the place related to the IP where the request was performed. Returns \"Not Found\" error when no place is related to the IP.\"",
"parameters": "{'type': 'object', 'properties': {}, 'required': [], 'optional': []}"
},
{
"name": "autocomplete_places_for_spott",
"description": "This is the subfunction for tool \"spott\", you can use this tool.The description of this function is: \"Returns a list of places matching a prefix and specified filter properties. Useful to create \"search as you type\" inputs.\"",
"parameters": "{'type': 'object', 'properties': {}, 'required': [], 'optional': []}"
},
{
"name": "get_place_by_ip_for_spott",
"description": "This is the subfunction for tool \"spott\", you can use this tool.The description of this function is: \"Returns the Place where a given IP Address is located. Returns \"Not Found\" error when no place is related to the IP. When sending '127.0.0.1' or '0.0.0.0' IP Addresses it will return the Place from the request was performed.\"",
"parameters": "{'type': 'object', 'properties': {'is_id': {'type': 'string', 'description': 'IP Address (v4 and v6 are supported).', 'example_value': '200.194.51.97'}, 'language': {'type': 'string', 'description': 'Specifies a language (ISO 639-1) to get the localized name of the place. If translation is not available, \"localizedName\" property will be null.'}}, 'required': ['is_id'], 'optional': ['language']}"
},
{
"name": "webcams_list_orderby_order_sort_for_webcams_travel",
"description": "This is the subfunction for tool \"webcams_travel\", you can use this tool.The description of this function is: \"This is a modifier. Returns the list of webcams ordered by {order}. The optional sorting direction is given by {sort}. Required {order}.\"",
"parameters": "{'type': 'object', 'properties': {'sort': {'type': 'string', 'description': 'Possible values are: \"asc\" (ascending), or \"desc\" (descending).'}, 'order': {'type': 'string', 'description': 'Possible values are: \"popularity\" (default order: \"des ... [TRUNCATED] ... the response. Possible values are: \"webcams\", \"categories\", \"continents\", \"countries\", \"regions\", \"properties\". Default is \"webcams\".', 'example_value': 'webcams:image,location'}}, 'required': ['sort', 'order'], 'optional': ['lang', 'show']}"
},
{
"name": "webcams_list_region_region_region_for_webcams_travel",
"description": "This is the subfunction for tool \"webcams_travel\", you can use this tool.The description of this function is: \"This is a modifier. Returns a list of webcams according to the listed region. Multiple regions must be separated by comma. Required: at least one {region}. Possible values are ISO 3166-1-alpha-2 codes for the country and a region code separated by a dot.\"",
"parameters": "{'type': 'object', 'properties': {'region': {'type': 'string', 'description': 'Comma separated list of ISO 3166-1-alpha-2 codes for the country and a region code separated by a dot.'}, 'lang': {'type': 'string', 'description': 'Localize the r ... [TRUNCATED] ... sted in the response. Possible values are: \"webcams\", \"categories\", \"continents\", \"countries\", \"regions\", \"properties\". Default is \"webcams\".', 'example_value': 'webcams:image,location'}}, 'required': ['region'], 'optional': ['lang', 'show']}"
},
{
"name": "webcams_list_bbox_ne_lat_ne_lng_sw_lat_sw_lng_for_webcams_travel",
"description": "This is the subfunction for tool \"webcams_travel\", you can use this tool.The description of this function is: \"This is a modifier. Returns a list of the webcams in the bounding box given by north-east ({ne_lat},{ne_lng}) and south-west ({sw_lat},{sw_lng}) coordinates. Required: {ne_lat},{ne_lng},{sw_lat},{sw_lng}.\"",
"parameters": "{'type': 'object', 'properties': {'ne_lat': {'type': 'integer', 'description': 'North-east WGS84 latitude of the bounding box.'}, 'sw_lng': {'type': 'integer', 'description': 'South-west WGS84 longitude of the bounding box.'}, 'sw_lat': {'typ ... [TRUNCATED] ... values are: \"webcams\", \"categories\", \"continents\", \"countries\", \"regions\", \"properties\". Default is \"webcams\".', 'example_value': 'webcams:image,location'}}, 'required': ['ne_lat', 'sw_lng', 'sw_lat', 'ne_lng'], 'optional': ['lang', 'show']}"
},
{
"name": "Finish",
"description": "If you believe that you have obtained a result that can answer the task, please call this function to provide the final answer. Alternatively, if you recognize that you are unable to proceed with the task in the current state, call this function to restart. Remember: you must ALWAYS call this function at the end of your attempt, and the only part that will be shown to the user is the final answer, so it should contain sufficient information.",
"parameters": "{'type': 'object', 'properties': {'return_type': {'type': 'string', 'enum': ['give_answer', 'give_up_and_restart']}, 'final_answer': {'type': 'string', 'description': 'The final answer you want to give the user. You should have this field if \"return_type\"==\"give_answer\"'}}, 'required': ['return_type']}"
}
],
"messages": [
{
"role": "system",
"content": "You are AutoGPT, you can use many tools(functions) to do the following task.\nFirst I will give you the task description, and your task start.\nAt each step, you need to give your thought to analyze the status now and what to do next, with a fu ... [TRUNCATED] ... \"enum\": [\"give_answer\", \"give_up_and_restart\"]}, \"final_answer\": {\"type\": \"string\", \"description\": \"The final answer you want to give the user. You should have this field if \\\"return_type\\\"==\\\"give_answer\\\"\"}}, \"required\": [\"return_type\"]}}]",
"name": null
},
{
"role": "user",
"content": "\nI'm planning a surprise birthday party for my sister and I need to find a suitable venue. Can you suggest event spaces in the city center with a capacity for 50 people? Also, recommend nearby catering services and provide me with a list of available dates for booking.\nBegin!\n",
"name": null
}
]
}
}
Note: Some content was truncated for display.
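Two details of the sample are worth illustrating: the reference in `metadata.target` follows a `Thought / Action / Action Input` layout, and each tool's `parameters` field is a string using Python-style single quotes rather than strict JSON. The sketch below shows one way to parse both. The helper names are hypothetical, the regular expression assumes the three labels always appear in this order, and the parameters string is abridged from the `get_place_by_ip_for_spott` entry above.

```python
import ast
import json
import re

# Reference string copied from metadata.target in the sample above.
TARGET = 'Thought: \nAction: spott\nAction Input: {\n "is_id": "city center"\n}'

# Abridged copy of a tool's "parameters" string (single-quoted, not valid JSON).
PARAMS = (
    "{'type': 'object', 'properties': {'is_id': {'type': 'string', "
    "'description': 'IP Address (v4 and v6 are supported).'}}, "
    "'required': ['is_id'], 'optional': ['language']}"
)


def parse_target(text: str) -> dict:
    """Split a 'Thought / Action / Action Input' string into its three parts."""
    pattern = r"Thought:(.*?)\nAction:(.*?)\nAction Input:(.*)"
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        raise ValueError("text does not follow the assumed Thought/Action layout")
    thought, action, raw_input = (part.strip() for part in match.groups())
    return {
        "thought": thought,
        "action": action,
        "action_input": json.loads(raw_input),  # the action input itself is JSON
    }


def parse_parameters(text: str) -> dict:
    """Parse a single-quoted parameter schema with a safe literal evaluator."""
    return ast.literal_eval(text)


parsed = parse_target(TARGET)
schema = parse_parameters(PARAMS)
print(parsed["action"], parsed["action_input"], schema["required"])
```

`ast.literal_eval` is used instead of `json.loads` for the schema because the string is a Python literal; it evaluates only literals, so it does not execute arbitrary code.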
Prompt Template#
No prompt template defined.
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets tool_bench \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['tool_bench'],
    dataset_args={
        'tool_bench': {
            # subset_list: ['in_domain', 'out_of_domain']  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)