Basic Usage
Simple Evaluation
Evaluate a model on specified datasets using default configurations. This framework supports two ways to initiate evaluation tasks: via command line or using Python code.
Method 1. Using Command Line
Execute the eval command from any directory:
evalscope eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--datasets gsm8k arc \
--limit 5
Execute from the evalscope root directory:
python evalscope/run.py \
--model Qwen/Qwen2.5-0.5B-Instruct \
--datasets gsm8k arc \
--limit 5
Method 2. Using Python Code
When using Python code, submit the evaluation task by calling the run_task function with a task_cfg argument. task_cfg can be a TaskConfig object, a Python dictionary, a YAML file path, or a JSON file path. For example:
Using a Python dictionary:
from evalscope.run import run_task
task_cfg = {
'model': 'Qwen/Qwen2.5-0.5B-Instruct',
'datasets': ['gsm8k', 'arc'],
'limit': 5
}
run_task(task_cfg=task_cfg)
Using a TaskConfig object:
from evalscope.run import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
model='Qwen/Qwen2.5-0.5B-Instruct',
datasets=['gsm8k', 'arc'],
limit=5
)
run_task(task_cfg=task_cfg)
Using a YAML file, e.g. config.yaml:
model: Qwen/Qwen2.5-0.5B-Instruct
datasets:
- gsm8k
- arc
limit: 5
from evalscope.run import run_task
run_task(task_cfg="config.yaml")
Using a JSON file, e.g. config.json:
{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"datasets": ["gsm8k", "arc"],
"limit": 5
}
from evalscope.run import run_task
run_task(task_cfg="config.json")
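When the configuration is produced by a script rather than written by hand, the JSON file can be generated from the same dictionary. The following is a minimal sketch, not part of evalscope: the helper name write_task_config is hypothetical, and the run_task call is left commented out because it requires evalscope to be installed and downloads the model and datasets.

```python
import json
from pathlib import Path


def write_task_config(path, model, datasets, limit=None):
    """Write a minimal task config as JSON (hypothetical helper, not an evalscope API)."""
    cfg = {"model": model, "datasets": list(datasets)}
    # Only include "limit" when set, so the default (all data) applies otherwise.
    if limit is not None:
        cfg["limit"] = limit
    Path(path).write_text(json.dumps(cfg, indent=2), encoding="utf-8")
    return cfg


cfg = write_task_config(
    "config.json", "Qwen/Qwen2.5-0.5B-Instruct", ["gsm8k", "arc"], limit=5
)

# To launch the evaluation (requires evalscope and network access):
# from evalscope.run import run_task
# run_task(task_cfg="config.json")
```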
Basic Parameter Descriptions
--model: The model_id of a model on ModelScope, which is downloaded automatically, for example Qwen/Qwen2.5-0.5B-Instruct; it can also be a local path to the model, e.g., /path/to/model.
--datasets: Dataset names; separate multiple datasets with spaces. Datasets are downloaded automatically from ModelScope; refer to the Dataset List for supported datasets.
--limit: Maximum number of samples evaluated per dataset, which is useful for quick validation. If not specified, all data is evaluated.
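To show how the three basic parameters fit together, here is a small sketch that assembles the eval command line from Python values. The helper build_eval_command is hypothetical and only builds the argument list; it uses just the flags described above.

```python
import shlex


def build_eval_command(model, datasets, limit=None):
    """Assemble the `evalscope eval` argument list (illustrative helper)."""
    # --datasets takes one or more names separated by spaces.
    cmd = ["evalscope", "eval", "--model", model, "--datasets", *datasets]
    # Omitting --limit means every sample in each dataset is evaluated.
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


cmd = build_eval_command("Qwen/Qwen2.5-0.5B-Instruct", ["gsm8k", "arc"], limit=5)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once evalscope is installed.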
Output Results
+-----------------------+-------------------+-----------------+
| Model | ai2_arc | gsm8k |
+=======================+===================+=================+
| Qwen2.5-0.5B-Instruct | (ai2_arc/acc) 0.6 | (gsm8k/acc) 0.6 |
+-----------------------+-------------------+-----------------+
Complex Evaluation
For more customized evaluations, such as setting custom model parameters or dataset parameters, use the following command. The evaluation is started the same way as in simple evaluation; the example below uses the eval command:
evalscope eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--model-args revision=master,precision=torch.float16,device_map=auto \
--generation-config do_sample=true,temperature=0.5 \
--dataset-args '{"gsm8k": {"few_shot_num": 0, "few_shot_random": false}}' \
--datasets gsm8k \
--limit 10
Parameter Descriptions
--model-args: Model loading parameters, given as comma-separated key=value pairs. Default parameters:
  revision: Model version, defaults to master
  precision: Model precision, defaults to auto
  device_map: Device allocation for the model, defaults to auto
--generation-config: Generation parameters, given as comma-separated key=value pairs. Default parameters:
  do_sample: Whether to use sampling, defaults to false
  max_length: Maximum length, defaults to 2048
  max_new_tokens: Maximum length for generation, defaults to 512
--dataset-args: Settings for the evaluation datasets, provided in JSON format, where each key is a dataset name and each value holds that dataset's parameters. The keys must correspond one-to-one with the values in the --datasets parameter:
  few_shot_num: Number of few-shot samples
  few_shot_random: Whether to randomly sample few-shot data; if not set, defaults to true
See also
Reference: All Parameter Descriptions
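The two argument formats described in this section (comma-separated key=value pairs and a JSON object keyed by dataset name) can be produced from Python dictionaries. This is a sketch under stated assumptions: both helper names are hypothetical, and the check on dataset names reflects the one-to-one rule noted for --dataset-args rather than any validation evalscope itself performs.

```python
import json


def format_kv_args(args):
    """Render a dict as comma-separated key=value pairs."""
    parts = []
    for key, value in args.items():
        # Booleans are written in lowercase, as in do_sample=true.
        if isinstance(value, bool):
            value = str(value).lower()
        parts.append(f"{key}={value}")
    return ",".join(parts)


def format_dataset_args(datasets, dataset_args):
    """Render per-dataset settings as JSON, rejecting names missing from datasets."""
    unknown = sorted(set(dataset_args) - set(datasets))
    if unknown:
        raise ValueError(f"dataset-args given for datasets not being evaluated: {unknown}")
    return json.dumps(dataset_args)


generation_config = format_kv_args({"do_sample": True, "temperature": 0.5})
dataset_args = format_dataset_args(
    ["gsm8k"], {"gsm8k": {"few_shot_num": 0, "few_shot_random": False}}
)
print(generation_config)
print(dataset_args)
```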
Output Results
+-----------------------+-----------------+
| Model | gsm8k |
+=======================+=================+
| Qwen2.5-0.5B-Instruct | (gsm8k/acc) 0.2 |
+-----------------------+-----------------+
Using Local Datasets and Models
By default, datasets are hosted on ModelScope and require internet access for loading. If you are in an offline environment, you can use local datasets. The process is as follows:
Assume the current local working path is /path/to/workdir.
Download the Dataset Locally
Execute the following commands:
wget https://modelscope.oss-cn-beijing.aliyuncs.com/open_data/benchmark/data.zip
unzip data.zip
The extracted dataset will be in the /path/to/workdir/data directory. This directory will be passed as the value of the local_path parameter in subsequent steps.
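On machines without wget or unzip, the same download and extraction can be done with the Python standard library. This is a minimal sketch, assuming the archive unpacks into a top-level data directory as described above; fetch_and_extract is a hypothetical helper name, and the actual download call is left commented out because it needs network access.

```python
import urllib.request
import zipfile
from pathlib import Path

DATA_URL = "https://modelscope.oss-cn-beijing.aliyuncs.com/open_data/benchmark/data.zip"


def fetch_and_extract(url, workdir):
    """Download a zip archive into workdir and unpack it there."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    archive = workdir / "data.zip"
    urllib.request.urlretrieve(url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(workdir)
    # Assumption: the archive contains a top-level "data" directory.
    return workdir / "data"


# Example (requires network access):
# data_dir = fetch_and_extract(DATA_URL, "/path/to/workdir")
```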
Download the Model Locally
Model files are hosted on the ModelScope Hub and require internet access for loading. To create evaluation tasks in an offline environment, download the model to your local machine in advance.
For example, use Git to download the Qwen2.5-0.5B-Instruct model locally:
git lfs install
git clone https://www.modelscope.cn/Qwen/Qwen2.5-0.5B-Instruct.git
Execute Evaluation Task
Run the following command to perform the evaluation, passing in the local dataset path and model path. Note that local_path must correspond one-to-one with the values in the --datasets parameter:
evalscope eval \
--model /path/to/workdir/Qwen2.5-0.5B-Instruct \
--datasets arc \
--dataset-args '{"arc": {"local_path": "/path/to/workdir/data/arc"}}' \
--limit 10
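The same offline run can be expressed as a Python task configuration. The sketch below rests on two assumptions: that the dictionary key dataset_args mirrors the --dataset-args flag, and that each dataset lives in a subdirectory of the data directory named after the dataset, as in the arc example above. The helper build_local_task_cfg is hypothetical, and the run_task call is commented out because it requires evalscope and the local files.

```python
def build_local_task_cfg(model_dir, data_root, datasets, limit=None):
    """Build a task config dict that points every dataset at a local directory."""
    root = data_root.rstrip("/")
    cfg = {
        "model": model_dir,
        "datasets": list(datasets),
        # One local_path entry per dataset, mirroring the --dataset-args JSON.
        "dataset_args": {name: {"local_path": f"{root}/{name}"} for name in datasets},
    }
    if limit is not None:
        cfg["limit"] = limit
    return cfg


task_cfg = build_local_task_cfg(
    "/path/to/workdir/Qwen2.5-0.5B-Instruct", "/path/to/workdir/data", ["arc"], limit=10
)

# To launch the evaluation (requires evalscope and the local files):
# from evalscope.run import run_task
# run_task(task_cfg=task_cfg)
```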