Basic Usage#
Simple Evaluation#
Evaluate a model on specified datasets using default configurations. This framework supports two ways to initiate evaluation tasks: via command line or using Python code.
Method 1. Using Command Line#
Execute the eval command from any directory:
evalscope eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--datasets gsm8k arc \
--limit 5
Alternatively, execute from the evalscope root directory:
python evalscope/run.py \
--model Qwen/Qwen2.5-0.5B-Instruct \
--datasets gsm8k arc \
--limit 5
Method 2. Using Python Code#
When using Python code for evaluation, submit the evaluation task with the run_task function, passing a TaskConfig as the parameter. The parameter can also be a Python dictionary, a YAML file path, or a JSON file path, for example:
Using a Python dictionary:
from evalscope.run import run_task
task_cfg = {
    'model': 'Qwen/Qwen2.5-0.5B-Instruct',
    'datasets': ['gsm8k', 'arc'],
    'limit': 5
}
run_task(task_cfg=task_cfg)

Using TaskConfig:
from evalscope.run import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['gsm8k', 'arc'],
    limit=5
)
run_task(task_cfg=task_cfg)

Using a YAML file (config.yaml):
model: Qwen/Qwen2.5-0.5B-Instruct
datasets:
  - gsm8k
  - arc
limit: 5

from evalscope.run import run_task
run_task(task_cfg="config.yaml")

Using a JSON file (config.json):
{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "datasets": ["gsm8k", "arc"],
    "limit": 5
}

from evalscope.run import run_task
run_task(task_cfg="config.json")
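When the same settings are used from both Python and a config file, it can be convenient to generate the JSON file from the dictionary instead of maintaining two copies. The following is a minimal sketch using only the standard library. The helper name write_task_config and the temporary directory are illustrative and not part of evalscope; the run_task call is left commented out because it needs evalscope installed and would download the model.

```python
import json
import tempfile
from pathlib import Path

# Same settings as the dictionary example in this section.
task_cfg = {
    'model': 'Qwen/Qwen2.5-0.5B-Instruct',
    'datasets': ['gsm8k', 'arc'],
    'limit': 5,
}


def write_task_config(cfg: dict, path: Path) -> Path:
    """Serialize a task dictionary to a JSON file and return its path.

    Hypothetical helper for illustration; not an evalscope function.
    """
    path.write_text(json.dumps(cfg, indent=2), encoding='utf-8')
    return path


# Hypothetical location; any writable path works.
config_path = write_task_config(task_cfg, Path(tempfile.mkdtemp()) / 'config.json')

# Reading the file back gives the original dictionary.
loaded_cfg = json.loads(config_path.read_text(encoding='utf-8'))

# With evalscope installed, the file path can then be passed to run_task:
# from evalscope.run import run_task
# run_task(task_cfg=str(config_path))
```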
Basic Parameter Descriptions#
--model: Specifies the model_id of the model in ModelScope, which can be automatically downloaded, for example, Qwen/Qwen2.5-0.5B-Instruct; it can also be a local path to the model, e.g., /path/to/model.
--datasets: Dataset names; multiple datasets are separated by spaces. Datasets are automatically downloaded from ModelScope; refer to the Dataset List for supported datasets.
--limit: Maximum number of evaluation samples per dataset. If not specified, all data is evaluated. Setting a small limit is useful for quick validation.
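To make the roles of the three flags concrete, here is a small sketch that assembles the eval command as an argument list. It only uses the flags described in this section; the helper name build_eval_command is an illustrative assumption and not part of evalscope. Nothing is executed, so it runs without evalscope installed.

```python
import shlex


def build_eval_command(model, datasets, limit=None):
    """Return the argument list for an `evalscope eval` invocation.

    Hypothetical helper for illustration; not an evalscope function.
    """
    # Multiple dataset names follow --datasets as separate arguments.
    cmd = ['evalscope', 'eval', '--model', model, '--datasets', *datasets]
    if limit is not None:
        # Leaving --limit out evaluates every sample in each dataset.
        cmd += ['--limit', str(limit)]
    return cmd


# Quick validation run on five samples per dataset.
quick_cmd = build_eval_command('Qwen/Qwen2.5-0.5B-Instruct', ['gsm8k', 'arc'], limit=5)

# Full run with a local model path and no limit.
full_cmd = build_eval_command('/path/to/model', ['gsm8k'])

print(shlex.join(quick_cmd))
print(shlex.join(full_cmd))

# To actually run it (requires evalscope installed):
# import subprocess
# subprocess.run(quick_cmd, check=True)
```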
Output Results#
+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
| Model Name | Dataset Name | Metric Name | Category Name | Subset Name | Num | Score |
+=======================+================+=================+=================+===============+=======+=========+
| Qwen2.5-0.5B-Instruct | gsm8k | AverageAccuracy | default | main | 5 | 0.4 |
+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
| Qwen2.5-0.5B-Instruct | ai2_arc | AverageAccuracy | default | ARC-Easy | 5 | 0.8 |
+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
| Qwen2.5-0.5B-Instruct | ai2_arc | AverageAccuracy | default | ARC-Challenge | 5 | 0.4 |
+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
Complex Evaluation#
For more customized evaluations, such as setting custom model parameters or dataset parameters, you can use the following command. The evaluation is started in the same way as in simple evaluation. Below is an example using the eval command:
evalscope eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--model-args revision=master,precision=torch.float16,device_map=auto \
--generation-config do_sample=true,temperature=0.5 \
--dataset-args '{"gsm8k": {"few_shot_num": 0, "few_shot_random": false}}' \
--datasets gsm8k \
--limit 10
Parameter Descriptions#
--model-args: Model loading parameters, given as comma-separated key=value pairs. Default parameters:
  revision: Model version, defaults to master
  precision: Model precision, defaults to auto
  device_map: Device allocation for the model, defaults to auto
--generation-config: Generation parameters, given as comma-separated key=value pairs. Default parameters:
  do_sample: Whether to use sampling, defaults to false
  max_length: Maximum length, defaults to 2048
  max_new_tokens: Maximum length for generation, defaults to 512
--dataset-args: Configuration parameters for the evaluation datasets, provided in JSON format, where each key is a dataset name and each value holds that dataset's parameters. Note that the keys must correspond one-to-one with the values in the --datasets parameter:
  few_shot_num: Number of few-shot samples
  few_shot_random: Whether to randomly sample few-shot data; if not set, defaults to true
See also
Reference: All Parameter Descriptions
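The two argument formats described here (comma-separated key=value pairs and a JSON object keyed by dataset name) can be checked before launching a long run. The sketch below is an illustration of the formats only: it is not evalscope's own parser, the helper names are hypothetical, and values are kept as plain strings rather than converted to the types evalscope uses internally.

```python
import json


def parse_key_value_args(arg_string):
    """Split 'k1=v1,k2=v2' into a dictionary of strings.

    Illustrative only; raises ValueError if an item has no '='.
    """
    pairs = (item.split('=', 1) for item in arg_string.split(',') if item)
    return {key.strip(): value.strip() for key, value in pairs}


def check_dataset_args(datasets, dataset_args_json):
    """Parse the JSON string and list keys that are not in `datasets`."""
    dataset_args = json.loads(dataset_args_json)
    unknown = sorted(set(dataset_args) - set(datasets))
    return dataset_args, unknown


# Values taken from the example command in this section.
model_args = parse_key_value_args('revision=master,precision=torch.float16,device_map=auto')
generation_config = parse_key_value_args('do_sample=true,temperature=0.5')
dataset_args, unknown = check_dataset_args(
    ['gsm8k'],
    '{"gsm8k": {"few_shot_num": 0, "few_shot_random": false}}',
)

print(model_args)
print(generation_config)
print(dataset_args, unknown)
```

An empty unknown list means every key in --dataset-args names a dataset that is also passed to --datasets.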
Output Results#
+-----------------------+-----------------+
| Model | gsm8k |
+=======================+=================+
| Qwen2.5-0.5B-Instruct | (gsm8k/acc) 0.2 |
+-----------------------+-----------------+
Model API Service Evaluation#
Specify the model API service address (api_url) and API key (api_key) to evaluate a deployed model API service. In this case, the eval-type parameter must be set to service.
For example, to launch a model service using vLLM:
export VLLM_USE_MODELSCOPE=True && python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-0.5B-Instruct --served-model-name qwen2.5 --trust_remote_code --port 8801
Then, you can use the following command to evaluate the model API service:
evalscope eval \
--model qwen2.5 \
--api-url http://127.0.0.1:8801/v1/chat/completions \
--api-key EMPTY \
--eval-type service \
--datasets gsm8k \
--limit 10
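Before starting an evaluation against a service, it can be useful to confirm that the endpoint answers a single chat request. The following sketch builds such a request with the standard library, assuming the service follows the OpenAI-compatible chat completions format that vLLM exposes. The helper names, the prompt, and the max_tokens value are illustrative choices; sending the request is left commented out because it needs the service to be running.

```python
import json
import urllib.request


def build_chat_request(api_url, model, api_key, prompt):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        'model': model,  # must match --served-model-name
        'messages': [{'role': 'user', 'content': prompt}],
        'max_tokens': 32,  # illustrative value to keep the reply short
    }
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode('utf-8'),
        headers={
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {api_key}',
        },
        method='POST',
    )


def send(probe, timeout=30):
    """Send the request and return the decoded JSON reply (needs a running service)."""
    with urllib.request.urlopen(probe, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


# Same URL, model name, and key as in the evalscope command in this section.
probe = build_chat_request(
    'http://127.0.0.1:8801/v1/chat/completions', 'qwen2.5', 'EMPTY', 'What is 2 + 3?'
)

# With the vLLM service running:
# reply = send(probe)
# print(reply['choices'][0]['message']['content'])
```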
Using Local Datasets and Models#
By default, datasets are hosted on ModelScope and require internet access for loading. If you are in an offline environment, you can use local datasets. The process is as follows:
Assume the current local working path is /path/to/workdir.
Download the Dataset Locally#
Execute the following commands:
wget https://modelscope.oss-cn-beijing.aliyuncs.com/open_data/benchmark/data.zip
unzip data.zip
The extracted datasets will be in the /path/to/workdir/data directory. Subdirectories of this directory (for example, /path/to/workdir/data/arc) are passed as the value of the local_path parameter in subsequent steps.
Download the Model Locally#
Model files are hosted on the ModelScope Hub and require internet access for loading. If you need to create evaluation tasks in an offline environment, you can download the model to your local machine in advance.
For example, use Git to download the Qwen2.5-0.5B-Instruct model locally:
git lfs install
git clone https://www.modelscope.cn/Qwen/Qwen2.5-0.5B-Instruct.git
Execute Evaluation Task#
Run the following command to perform the evaluation, passing in the local dataset path and model path. Note that local_path must correspond one-to-one with the values in the --datasets parameter:
evalscope eval \
--model /path/to/workdir/Qwen2.5-0.5B-Instruct \
--datasets arc \
--dataset-args '{"arc": {"local_path": "/path/to/workdir/data/arc"}}' \
--limit 10
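When several local datasets are evaluated, writing the --dataset-args JSON by hand becomes error-prone. The sketch below generates it from a list of dataset names so that each name in --datasets gets exactly one local_path entry. It assumes each dataset lives in a subdirectory of the data folder named after the dataset, as arc does in the example; that layout is an assumption and may not hold for every dataset. The helper name local_dataset_args is hypothetical.

```python
import json
import shlex
from pathlib import PurePosixPath


def local_dataset_args(data_root, datasets):
    """Map each dataset name to a local_path under data_root.

    Assumes the folder name equals the dataset name (true for arc
    in the example, not verified for other datasets).
    """
    root = PurePosixPath(data_root)
    return {name: {'local_path': str(root / name)} for name in datasets}


datasets = ['arc']
dataset_args = local_dataset_args('/path/to/workdir/data', datasets)

cmd = [
    'evalscope', 'eval',
    '--model', '/path/to/workdir/Qwen2.5-0.5B-Instruct',
    '--datasets', *datasets,
    '--dataset-args', json.dumps(dataset_args),
    '--limit', '10',
]

# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```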