ToolBench#

Description#

We evaluate effectiveness on the tool-learning benchmark ToolBench (Qin et al., 2023b). The task involves integrating API calls to accomplish tasks, where the agent must accurately select the appropriate API and compose the necessary API requests.

Moreover, we partition the ToolBench test set into in-domain and out-of-domain subsets based on whether the tools used in the test instances were seen during training.

This division allows us to evaluate performance in both in-distribution and out-of-distribution scenarios. We refer to this dataset as ToolBench-Static.

For more details, please refer to: Small LLMs Are Weak Tool Learners: A Multi-LLM Agent
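As an illustration of how such an in-domain / out-of-domain split can be constructed, here is a minimal sketch: a test instance is placed in the in-domain subset exactly when its tool appears in the training set. The tool_name field is an assumption for illustration, not the actual ToolBench schema.

def split_by_tool_coverage(train_instances, test_instances):
    # Tools observed during training (field name assumed for illustration).
    seen_tools = {inst["tool_name"] for inst in train_instances}
    in_domain = [t for t in test_instances if t["tool_name"] in seen_tools]
    out_of_domain = [t for t in test_instances if t["tool_name"] not in seen_tools]
    return in_domain, out_of_domain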

Dataset#

  • Dataset statistics:

    • Number of in-domain samples: 1,588

    • Number of out-of-domain samples: 781

Usage#

Installation#

pip install evalscope -U
pip install ms-swift -U
pip install rouge -U
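
To confirm the packages are installed, you can print their versions with the standard library (a small optional check, not part of the evalscope workflow):

from importlib.metadata import version

# Print the installed version of each required package.
for pkg in ("evalscope", "ms-swift", "rouge"):
    print(pkg, version(pkg))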

Download the dataset#

wget https://modelscope.oss-cn-beijing.aliyuncs.com/open_data/toolbench-static/data.zip

Unzip the dataset#

unzip data.zip
# The dataset will be unzipped to the `/path/to/data/toolbench_static` folder
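
To verify the download, you can check that the two dataset files are present and that their instance counts match the statistics above (a minimal sketch, assuming each JSON file is a list of test instances):

import json
from pathlib import Path

data_dir = Path("/path/to/data/toolbench_static")  # adjust to your unzip location

for name in ("in_domain.json", "out_of_domain.json"):
    with open(data_dir / name, "r", encoding="utf-8") as f:
        instances = json.load(f)
    # Expected: 1588 in-domain and 781 out-of-domain instances.
    print(f"{name}: {len(instances)} instances")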

Task configuration#

There are two ways to configure the task: with a Python dict or with a YAML file.

  1. Configuration with dict:

your_task_config = {
    'infer_args': {
        'model_name_or_path': '/path/to/model_dir',
        'model_type': 'qwen2-7b-instruct',
        'data_path': 'data/toolbench_static',
        'output_dir': 'output_res',
        'deploy_type': 'swift',
        'max_new_tokens': 2048,
        'num_infer_samples': None
    },
    'eval_args': {
        'input_path': 'output_res',
        'output_path': 'output_res'
    }
}
  • Arguments (a quick path sanity check is sketched after this list):

    • model_name_or_path: The path to the local model directory.

    • model_type: The model type; refer to the model type list.

    • data_path: The path to the dataset directory containing the in_domain.json and out_of_domain.json files.

    • output_dir: The path to the output directory. Defaults to output_res.

    • deploy_type: The deploy type. Defaults to swift.

    • max_new_tokens: The maximum number of tokens to generate.

    • num_infer_samples: The number of samples to infer. Defaults to None, which means all samples are inferred.

    • input_path: The path to the input directory for evaluation; it should be the same as output_dir in infer_args.

    • output_path: The path to the output directory for evaluation.
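
Before launching inference, you can sanity-check the paths referenced in the dict configuration above (a sketch, not part of the evalscope API):

import os

infer_args = your_task_config['infer_args']

# The model directory and both dataset files must exist before inference.
assert os.path.isdir(infer_args['model_name_or_path']), 'model directory not found'
for fname in ('in_domain.json', 'out_of_domain.json'):
    fpath = os.path.join(infer_args['data_path'], fname)
    assert os.path.isfile(fpath), f'missing dataset file: {fpath}'

# Create the output directory if it does not exist yet.
os.makedirs(infer_args['output_dir'], exist_ok=True)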

  2. Configuration with yaml:

infer_args:
  model_name_or_path: /path/to/model_dir      # absolute path is recommended
  model_type: qwen2-7b-instruct
  data_path: /path/to/data/toolbench_static   # absolute path is recommended
  deploy_type: swift
  max_new_tokens: 2048
  num_infer_samples: null
  output_dir: output_res
eval_args:
  input_path: output_res
  output_path: output_res

Refer to config_default.yaml for more details.

Run the task#

from evalscope.third_party.toolbench_static import run_task

# Run the task with dict configuration
run_task(task_cfg=your_task_config)

# Run the task with yaml configuration
run_task(task_cfg='/path/to/your_task_config.yaml')
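
After the task finishes, the inference and evaluation outputs are written under output_res; exact file names depend on the evalscope version, so the simplest way to locate the results is to list the directory (a small sketch):

from pathlib import Path

# List everything written to the output directory after the run.
for p in sorted(Path('output_res').rglob('*')):
    if p.is_file():
        print(p)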

Results and metrics#

  • Metrics:

    • Plan.EM: Exact-match score of the agent's planning decision at each step: invoking a tool, generating an answer, or giving up.

    • Act.EM: Action exact match score, including the tool name and arguments.

    • HalluRate (lower is better): The hallucination rate of the agent's answers at each step.

    • Avg.F1: The average F1 score of the agent's tool calls at each step.

    • R-L: The Rouge-L score of the agent's answers at each step.

Generally, we focus on the Act.EM, HalluRate, and Avg.F1 metrics.
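
To make the step-level metrics concrete, the sketch below shows how Act.EM and HalluRate could be computed from per-step records. The field names (pred_tool, gold_tool, pred_args, gold_args) and the definition of a hallucinated call as one naming a tool outside the available tool list are illustrative assumptions, not the benchmark's actual implementation.

def act_em(steps):
    # Fraction of steps where the predicted tool name and arguments both match exactly.
    hits = sum(1 for s in steps
               if s['pred_tool'] == s['gold_tool'] and s['pred_args'] == s['gold_args'])
    return hits / len(steps) if steps else 0.0

def hallu_rate(steps, available_tools):
    # Fraction of steps calling a tool not in the available tool list (lower is better).
    bad = sum(1 for s in steps if s['pred_tool'] not in available_tools)
    return bad / len(steps) if steps else 0.0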