# ToolBench

## Description
We evaluate the effectiveness of tool learning on the ToolBench benchmark (Qin et al., 2023b). The tasks involve integrating API calls to accomplish goals, requiring the agent to accurately select the appropriate API and compose the necessary API requests.

Moreover, we partition the ToolBench test set into in-domain and out-of-domain subsets, based on whether the tools used in a test instance were seen during training. This division allows us to evaluate performance in both in-distribution and out-of-distribution scenarios. We call the resulting dataset ToolBench-Static.

For more details, please refer to the paper: *Small LLMs Are Weak Tool Learners: A Multi-LLM Agent*.
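Conceptually, the split works as sketched below. This is illustrative only; the `tools` field name is an assumption for the sketch, not the actual ToolBench schema.

```python
# Minimal sketch of the in-/out-of-domain split. The "tools" field name is
# assumed for illustration and is not the actual ToolBench schema.
def split_by_domain(train_instances, test_instances):
    seen_tools = {tool for inst in train_instances for tool in inst["tools"]}
    in_domain, out_of_domain = [], []
    for inst in test_instances:
        # A test instance is in-domain only if every tool it uses was seen in training.
        bucket = in_domain if all(t in seen_tools for t in inst["tools"]) else out_of_domain
        bucket.append(inst)
    return in_domain, out_of_domain
```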
## Dataset

Dataset statistics:

- Number of in-domain samples: 1588
- Number of out-of-domain samples: 781
## Usage

### Installation

```shell
pip install evalscope -U
pip install ms-swift -U
pip install rouge -U
```
### Download the dataset

```shell
wget https://modelscope.oss-cn-beijing.aliyuncs.com/open_data/toolbench-static/data.zip
```

### Unzip the dataset

```shell
unzip data.zip
# The dataset will be unzipped to the `/path/to/data/toolbench_static` folder
```
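To sanity-check the download, you can count the samples in each split and compare against the statistics above (a minimal sketch, assuming each file holds a JSON list of samples):

```python
import json
import os

data_dir = "/path/to/data/toolbench_static"  # adjust to your unzip location
for name in ("in_domain.json", "out_of_domain.json"):
    with open(os.path.join(data_dir, name), encoding="utf-8") as f:
        samples = json.load(f)  # assumption: each file is a JSON list
    print(name, len(samples))  # expected: 1588 and 781
```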
### Task configuration

There are two ways to configure the task: with a Python dict or with a YAML file.

Configuration with a dict:
```python
your_task_config = {
    'infer_args': {
        'model_name_or_path': '/path/to/model_dir',
        'model_type': 'qwen2-7b-instruct',
        'data_path': 'data/toolbench_static',
        'output_dir': 'output_res',
        'deploy_type': 'swift',
        'max_new_tokens': 2048,
        'num_infer_samples': None
    },
    'eval_args': {
        'input_path': 'output_res',
        'output_path': 'output_res'
    }
}
```
Arguments:

- `model_name_or_path`: The path to the local model directory.
- `model_type`: The model type; refer to the model type list.
- `data_path`: The path to the dataset directory containing the `in_domain.json` and `out_of_domain.json` files.
- `output_dir`: The path to the output directory. Defaults to `output_res`.
- `deploy_type`: The deploy type. Defaults to `swift`.
- `max_new_tokens`: The maximum number of tokens to generate.
- `num_infer_samples`: The number of samples to run inference on. Defaults to `None`, which means all samples are inferred.
- `input_path`: The path to the input directory for evaluation; it should be the same as `output_dir` in `infer_args`.
- `output_path`: The path to the output directory for evaluation.
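For example, to run a quick smoke test before a full evaluation, you can limit inference with `num_infer_samples`. The sketch below reuses the values from the configuration above, changing only the sample limit and output location:

```python
smoke_test_config = {
    'infer_args': {
        'model_name_or_path': '/path/to/model_dir',
        'model_type': 'qwen2-7b-instruct',
        'data_path': '/path/to/data/toolbench_static',
        'output_dir': 'smoke_test_res',
        'deploy_type': 'swift',
        'max_new_tokens': 2048,
        'num_infer_samples': 10,  # limit inference to 10 samples
    },
    'eval_args': {
        'input_path': 'smoke_test_res',   # must match output_dir above
        'output_path': 'smoke_test_res',
    },
}
```

Keep `input_path` identical to `output_dir`, since the evaluation stage reads the inference results from there.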
Configuration with YAML:

```yaml
infer_args:
  model_name_or_path: /path/to/model_dir  # absolute path is recommended
  model_type: qwen2-7b-instruct
  data_path: /path/to/data/toolbench_static  # absolute path is recommended
  deploy_type: swift
  max_new_tokens: 2048
  num_infer_samples: null
  output_dir: output_res
eval_args:
  input_path: output_res
  output_path: output_res
```

Refer to `config_default.yaml` for more details.
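The two formats are interchangeable; for example, a dict configuration can be written out as a YAML file with PyYAML (a minimal sketch; `pip install pyyaml` if it is not already installed):

```python
import yaml  # PyYAML

# Serialize the dict configuration to a YAML task file.
with open('your_task_config.yaml', 'w', encoding='utf-8') as f:
    yaml.safe_dump(your_task_config, f, sort_keys=False)
```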
### Run the task

```python
from evalscope.third_party.toolbench_static import run_task

# Run the task with dict configuration
run_task(task_cfg=your_task_config)

# Run the task with yaml configuration
run_task(task_cfg='/path/to/your_task_config.yaml')
```
## Results and metrics

Metrics:

- `Plan.EM`: Exact match score of the agent's planning decision at each step: invoking a tool, generating an answer, or giving up.
- `Act.EM`: Exact match score of the agent's actions, including the tool name and arguments.
- `HalluRate` (lower is better): The hallucination rate of the agent's answers at each step.
- `Avg.F1`: The average F1 score of the agent's tool calls at each step.
- `R-L`: The Rouge-L score of the agent's answers at each step.

Generally, we focus on the `Act.EM`, `HalluRate`, and `Avg.F1` metrics.
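For intuition, the sketch below shows how exact-match and F1 scores of this kind are typically computed from predicted and reference tool calls. It is illustrative only, not EvalScope's actual implementation:

```python
from collections import Counter

def action_exact_match(pred_actions, ref_actions):
    """Fraction of steps whose predicted (tool name, arguments) pair
    exactly matches the reference (cf. Act.EM)."""
    matches = sum(p == r for p, r in zip(pred_actions, ref_actions))
    return matches / max(len(ref_actions), 1)

def tool_call_f1(pred_calls, ref_calls):
    """F1 between the multisets of predicted and reference tool calls (cf. Avg.F1)."""
    overlap = sum((Counter(pred_calls) & Counter(ref_calls)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_calls)
    recall = overlap / len(ref_calls)
    return 2 * precision * recall / (precision + recall)
```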