Quick Start#
EvalScope is an evaluation framework designed for large models, aiming to provide a simple, easy-to-use, and comprehensive evaluation process. This guide will walk you through various evaluation tasks from simple to complex, helping you get started quickly.
Start Your First Evaluation#
You can evaluate a model on specified datasets using default configurations through either command line or Python code.
Method 1: Using Command Line#
Run the evalscope eval command from any directory to start an evaluation. This is the recommended way to get started.
evalscope eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--datasets gsm8k arc \
--limit 5
Basic Parameter Description:
--model: Specify the model’s ModelScope ID (e.g., Qwen/Qwen2.5-0.5B-Instruct) or a local path.
--datasets: Specify one or more dataset names, separated by spaces. For supported datasets, please refer to the Dataset List.
--limit: Maximum number of samples to evaluate per dataset, convenient for quick verification. If not set, all data will be evaluated.
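When the command is launched from a script, it can help to assemble the arguments programmatically. The following is a minimal sketch that uses only the three flags described above; the helper name build_eval_command is hypothetical and not part of EvalScope.

```python
import shlex


def build_eval_command(model, datasets, limit=None):
    """Return the argv list for an `evalscope eval` run (hypothetical helper)."""
    argv = ["evalscope", "eval", "--model", model, "--datasets", *datasets]
    if limit is not None:
        # Omitting --limit evaluates every sample in each dataset.
        argv += ["--limit", str(limit)]
    return argv


argv = build_eval_command("Qwen/Qwen2.5-0.5B-Instruct", ["gsm8k", "arc"], limit=5)
# Print a shell-safe version of the command for inspection.
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) once evalscope is installed.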
Method 2: Using Python Code#
Call the run_task function with a task configuration to run an evaluation from Python. The configuration can be a TaskConfig object, a Python dictionary, or the path to a YAML or JSON file.
Using a TaskConfig object:
from evalscope.run import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['gsm8k', 'arc'],
    limit=5
)
run_task(task_cfg=task_cfg)
Using a Python dictionary:
from evalscope.run import run_task
task_cfg = {
    'model': 'Qwen/Qwen2.5-0.5B-Instruct',
    'datasets': ['gsm8k', 'arc'],
    'limit': 5
}
run_task(task_cfg=task_cfg)
Using a YAML file (config.yaml):
model: Qwen/Qwen2.5-0.5B-Instruct
datasets:
  - gsm8k
  - arc
limit: 5
from evalscope.run import run_task
run_task(task_cfg="config.yaml")
Using a JSON file (config.json):
{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "datasets": ["gsm8k", "arc"],
    "limit": 5
}
from evalscope.run import run_task
run_task(task_cfg="config.json")
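If the configuration is produced by a script, the JSON file can be generated rather than written by hand. The sketch below writes the same three fields to a file and reads them back; write_config and run_from_file are hypothetical helper names, and only the run_task(task_cfg=...) call comes from the examples above.

```python
import json
import tempfile
from pathlib import Path


def write_config(path, config):
    """Serialize a task configuration dictionary to a JSON file."""
    path = Path(path)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


def run_from_file(path):
    """Run an evaluation from a config file (requires evalscope)."""
    # Imported lazily so write_config works without evalscope installed.
    from evalscope.run import run_task
    return run_task(task_cfg=str(path))


config = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "datasets": ["gsm8k", "arc"],
    "limit": 5,
}

# Round-trip through a temporary file to confirm the config is valid JSON.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = write_config(Path(tmp) / "config.json", config)
    loaded = json.loads(cfg_path.read_text(encoding="utf-8"))
```

In a real run, write the file to a permanent location and call run_from_file on it.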
View Evaluation Results#
After evaluation is complete, the terminal will print a score report in the following format:
+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
| Model Name | Dataset Name | Metric Name | Category Name | Subset Name | Num | Score |
+=======================+================+=================+=================+===============+=======+=========+
| Qwen2.5-0.5B-Instruct | gsm8k | AverageAccuracy | default | main | 5 | 0.4 |
+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
| Qwen2.5-0.5B-Instruct | ai2_arc | AverageAccuracy | default | ARC-Easy | 5 | 0.8 |
+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
| Qwen2.5-0.5B-Instruct | ai2_arc | AverageAccuracy | default | ARC-Challenge | 5 | 0.4 |
+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
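For quick post-processing, the printed report can be parsed back into rows. This is a rough sketch that assumes the grid layout shown above (cells separated by `|`, borders starting with `+`); parse_report and score_by_dataset are hypothetical helpers, and the structured report files written by EvalScope are likely a more robust source than terminal output.

```python
REPORT = """\
+-----------------------+--------------+-----------------+---------------+---------------+-----+-------+
| Model Name            | Dataset Name | Metric Name     | Category Name | Subset Name   | Num | Score |
+=======================+==============+=================+===============+===============+=====+=======+
| Qwen2.5-0.5B-Instruct | gsm8k        | AverageAccuracy | default       | main          | 5   | 0.4   |
+-----------------------+--------------+-----------------+---------------+---------------+-----+-------+
| Qwen2.5-0.5B-Instruct | ai2_arc      | AverageAccuracy | default       | ARC-Easy      | 5   | 0.8   |
+-----------------------+--------------+-----------------+---------------+---------------+-----+-------+
| Qwen2.5-0.5B-Instruct | ai2_arc      | AverageAccuracy | default       | ARC-Challenge | 5   | 0.4   |
+-----------------------+--------------+-----------------+---------------+---------------+-----+-------+
"""


def parse_report(text):
    """Turn the grid table into a list of dicts keyed by column name."""
    header, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +---+ and +===+ border lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
            continue
        row = dict(zip(header, cells))
        row["Num"] = int(row["Num"])
        row["Score"] = float(row["Score"])
        rows.append(row)
    return rows


def score_by_dataset(rows):
    """Average the subset scores of each dataset, weighted by sample count."""
    totals = {}
    for row in rows:
        weighted, count = totals.get(row["Dataset Name"], (0.0, 0))
        totals[row["Dataset Name"]] = (weighted + row["Score"] * row["Num"], count + row["Num"])
    return {name: weighted / count for name, (weighted, count) in totals.items()}


rows = parse_report(REPORT)
summary = score_by_dataset(rows)
```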
See also
In addition to command line reports, you can also use our visualization tools to analyze evaluation results in depth.
Reference: Evaluation Result Visualization
Advanced Usage#
Customize Evaluation Parameters#
Example 1. Adjusting Model Loading and Inference Generation Parameters
You can control model loading, inference generation, and dataset configuration in detail by passing JSON-formatted strings.
evalscope eval \
--model Qwen/Qwen3-0.6B \
--model-args '{"revision": "master", "precision": "torch.float16", "device_map": "auto"}' \
--generation-config '{"do_sample":true,"temperature":0.6,"max_tokens":512}' \
--dataset-args '{"gsm8k": {"few_shot_num": 0, "few_shot_random": false}}' \
--datasets gsm8k \
--limit 10
Common Parameter Descriptions:
--model-args: Model loading parameters, such as revision (version), precision (precision), device_map (device mapping).
--generation-config: Inference generation parameters, such as do_sample (sampling), temperature (temperature), max_tokens (maximum length).
--dataset-args: Dataset-specific parameters, keyed by dataset name. For example, few_shot_num (number of few-shot examples).
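Hand-written JSON strings are easy to get wrong (Python's True versus JSON's true, quoting inside the shell). One way to avoid this, sketched below with the standard library only, is to build the values as Python dictionaries and serialize them with json.dumps. The flag names and values are the ones from the example above; nothing here is an EvalScope API.

```python
import json
import shlex

model_args = {"revision": "master", "precision": "torch.float16", "device_map": "auto"}
generation_config = {"do_sample": True, "temperature": 0.6, "max_tokens": 512}
dataset_args = {"gsm8k": {"few_shot_num": 0, "few_shot_random": False}}

argv = [
    "evalscope", "eval",
    "--model", "Qwen/Qwen3-0.6B",
    # json.dumps converts True/False to the JSON literals true/false.
    "--model-args", json.dumps(model_args),
    "--generation-config", json.dumps(generation_config),
    "--dataset-args", json.dumps(dataset_args),
    "--datasets", "gsm8k",
    "--limit", "10",
]

# shlex.join adds the shell quoting needed around the JSON strings.
print(shlex.join(argv))
```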
Example 2. Adjusting Result Aggregation Methods
For tasks that require multiple generations per sample (such as math problems), use the --repeats parameter to set the number of repeated generations k, and choose the result aggregation method through the aggregation key in --dataset-args. Available methods include mean_and_vote_at_k, mean_and_pass_at_k, and mean_and_pass^k.
evalscope eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--datasets gsm8k \
--limit 10 \
--dataset-args '{"gsm8k": {"aggregation": "mean_and_vote_at_k"}}' \
--repeats 5
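To give an intuition for what these aggregations measure, the sketch below applies the commonly used definitions to the k answers generated for a single sample. This is an illustration under assumed definitions (mean: fraction of correct generations; vote@k: the majority answer is correct; pass@k: at least one generation is correct; pass^k: all k generations are correct), not EvalScope's implementation, which may differ in detail.

```python
from collections import Counter


def mean_at_k(correct):
    """Fraction of the k generations that are correct."""
    return sum(correct) / len(correct)


def vote_at_k(answers, reference):
    """1.0 if the most frequent answer matches the reference, else 0.0."""
    winner, _ = Counter(answers).most_common(1)[0]
    return float(winner == reference)


def pass_at_k(correct):
    """1.0 if at least one of the k generations is correct."""
    return float(any(correct))


def pass_hat_k(correct):
    """1.0 only if all k generations are correct."""
    return float(all(correct))


# Five generations (--repeats 5) for one hypothetical gsm8k-style question.
reference = "18"
answers = ["18", "18", "20", "18", "17"]
correct = [answer == reference for answer in answers]

scores = {
    "mean": mean_at_k(correct),
    "vote_at_k": vote_at_k(answers, reference),
    "pass_at_k": pass_at_k(correct),
    "pass^k": pass_hat_k(correct),
}
```

In this toy case three of five generations are correct, so the majority vote and pass@k succeed while pass^k fails.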
See also
To view all configurable options, please refer to: Complete Parameter Description
Evaluate Online Model APIs#
EvalScope supports evaluating model services that are compatible with the OpenAI API format. Specify the service address and API key, and set --eval-type to openai_api.
1. Start Model Service
Using vLLM as an example, start a model service:
# Please install vLLM first: pip install vllm
export VLLM_USE_MODELSCOPE=True
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-0.5B-Instruct \
--served-model-name qwen2.5 \
--trust-remote-code \
--port 8801
2. Run Evaluation
Use the following command to evaluate the API service:
evalscope eval \
--model qwen2.5 \
--api-url http://127.0.0.1:8801/v1 \
--api-key EMPTY \
--eval-type openai_api \
--datasets gsm8k \
--limit 10
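Before starting a long evaluation, it can be worth confirming that the service is up and that the served model name matches the value passed to --model. The sketch below queries the /models route that OpenAI-compatible servers such as vLLM expose; the helper names are hypothetical, and the response shape (a data list of objects with an id field) is assumed from the OpenAI API format.

```python
import json
from urllib import request


def models_endpoint(api_url):
    """Build the model-listing URL from the base URL given to --api-url."""
    return api_url.rstrip("/") + "/models"


def list_served_models(api_url, api_key="EMPTY", timeout=5):
    """Return the model IDs advertised by an OpenAI-compatible service."""
    req = request.Request(
        models_endpoint(api_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumed response shape: {"data": [{"id": "<model name>"}, ...]}
    return [entry["id"] for entry in payload.get("data", [])]


url = models_endpoint("http://127.0.0.1:8801/v1")
```

With the vLLM service from step 1 running, list_served_models("http://127.0.0.1:8801/v1") should include the name set through --served-model-name.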
Using Judge Models for Assessment#
For certain subjective or open-ended tasks (such as simple_qa), you can specify a powerful “judge model” to evaluate the target model’s output.
import os
from evalscope import TaskConfig, run_task
from evalscope.constants import JudgeStrategy
task_cfg = TaskConfig(
    model='qwen2.5-7b-instruct',
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    eval_type='openai_api',
    datasets=['chinese_simpleqa'],
    limit=5,
    judge_strategy=JudgeStrategy.AUTO,
    judge_model_args={
        'model_id': 'qwen2.5-72b-instruct',
        'api_url': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
        'api_key': os.getenv('DASHSCOPE_API_KEY'),
    }
)
run_task(task_cfg=task_cfg)
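Because os.getenv returns None when the variable is unset, a missing key would only surface once the first API request fails. A small guard, sketched below, fails earlier with a clearer message. The function build_judge_model_args is a hypothetical helper; the dictionary keys are the ones used in the example above.

```python
import os


def build_judge_model_args(model_id, api_url, env=None, key_name="DASHSCOPE_API_KEY"):
    """Return judge_model_args, raising early if the API key is missing."""
    env = os.environ if env is None else env
    api_key = env.get(key_name)
    if not api_key:
        raise RuntimeError(f"Environment variable {key_name} is not set")
    return {"model_id": model_id, "api_url": api_url, "api_key": api_key}


# Demonstration with an explicit mapping instead of the real environment.
judge_args = build_judge_model_args(
    "qwen2.5-72b-instruct",
    "https://dashscope.aliyuncs.com/compatible-mode/v1",
    env={"DASHSCOPE_API_KEY": "sk-example"},
)
```

The returned dictionary can be passed as judge_model_args to TaskConfig.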
See also
For detailed configuration of judge models, please refer to: Judge Model Parameters
Offline Evaluation#
In offline environments, you can use locally cached models and datasets for evaluation.
1. Prepare Local Datasets
Datasets are hosted on ModelScope by default and require an internet connection to load. For offline environments, prepare a local copy with the following process:
First, check the ModelScope ID of the dataset you want to use: find its Dataset ID in the Supported Datasets list. For example, the ID for mmlu_pro is modelscope/MMLU-Pro.
Use the modelscope command to download the dataset: click the “Dataset Files” tab -> click “Download Dataset” -> copy the command line.
# Download dataset
modelscope download --dataset modelscope/MMLU-Pro --local_dir ./data/mmlu_pro
Use the directory ./data/mmlu_pro as the value for the local_path parameter.
2. Prepare Local Models
Model files are hosted on the ModelScope Hub and require an internet connection to load. To run evaluation tasks in an offline environment, download the models locally in advance:
For example, use the modelscope command to download the Qwen2.5-0.5B-Instruct model locally:
modelscope download --model Qwen/Qwen2.5-0.5B-Instruct --local_dir ./model/qwen2.5
See also
For more download options, please refer to ModelScope Download Guide.
3. Run Offline Evaluation
In the evaluation command, point --model to the local model path and specify the dataset’s local_path through --dataset-args.
evalscope eval \
--model ./model/qwen2.5 \
--datasets mmlu_pro \
--dataset-args '{"mmlu_pro": {"local_path": "./data/mmlu_pro"}}' \
--limit 10
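In an offline environment, a mistyped path tends to show up as a failed download attempt rather than a clear error. The sketch below checks that both directories exist before assembling the command; offline_eval_args is a hypothetical helper, and the flags are the ones shown above.

```python
import json
from pathlib import Path


def offline_eval_args(model_dir, dataset_name, dataset_dir, limit=None):
    """Build argv for an offline run after verifying the local directories."""
    model_dir, dataset_dir = Path(model_dir), Path(dataset_dir)
    missing = [str(p) for p in (model_dir, dataset_dir) if not p.is_dir()]
    if missing:
        raise FileNotFoundError(f"Missing local directories: {missing}")
    # --dataset-args is keyed by dataset name, with local_path pointing at the copy.
    dataset_args = {dataset_name: {"local_path": str(dataset_dir)}}
    argv = [
        "evalscope", "eval",
        "--model", str(model_dir),
        "--datasets", dataset_name,
        "--dataset-args", json.dumps(dataset_args),
    ]
    if limit is not None:
        argv += ["--limit", str(limit)]
    return argv
```

For the example above, the call would be offline_eval_args("./model/qwen2.5", "mmlu_pro", "./data/mmlu_pro", limit=10).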
Continue Evaluation from Existing/Interrupted Results#
If a previous evaluation task was interrupted, or you want to continue evaluating based on existing results, you can use the --use-cache parameter to specify the previous output directory. This will skip completed samples and only evaluate the remaining ones. Additionally, you can use the --rerun-review parameter to force re-execution of the scoring step for all samples.
evalscope eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--datasets gsm8k \
--limit 10 \
--use-cache outputs/20230101_123456 \
--rerun-review
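To resume the most recent run without looking up its directory by hand, the newest output directory can be selected by name. This sketch assumes run directories live under outputs/ and are named with a YYYYMMDD_HHMMSS timestamp, as in the example path above; pick_latest and latest_run_dir are hypothetical helpers.

```python
import re
from pathlib import Path

# Assumed naming scheme, taken from the example path outputs/20230101_123456.
RUN_DIR = re.compile(r"\d{8}_\d{6}")


def pick_latest(names):
    """Return the newest timestamp-style name, or None if there is none."""
    runs = sorted(name for name in names if RUN_DIR.fullmatch(name))
    # Timestamps in this format sort chronologically as plain strings.
    return runs[-1] if runs else None


def latest_run_dir(outputs_root="outputs"):
    """Return the Path of the newest run directory under outputs_root."""
    root = Path(outputs_root)
    if not root.is_dir():
        return None
    name = pick_latest(p.name for p in root.iterdir() if p.is_dir())
    return root / name if name else None


latest = pick_latest(["20230101_123456", "20240315_090000", "notes", "20231231_235959"])
```

The returned path can then be passed as the value of --use-cache.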
Skip Samples with Evaluation Errors#
Inference can fail because of overly long inputs, timeouts, model service crashes, or other reasons, and unknown errors can also occur during the evaluation process. In such cases, the program throws an exception and interrupts the evaluation flow. Use the --ignore-errors parameter to skip the failing samples and continue evaluating the remaining ones. Skipped samples are left out of the evaluation report, so the final number of evaluated samples is reduced.
evalscope eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--datasets gsm8k \
--limit 10 \
--ignore-errors
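The effect on the reported sample count can be pictured with a generic skip-on-error loop. This is only an illustration of the pattern, not EvalScope's internal code; evaluate and score_fn are hypothetical names.

```python
def evaluate(samples, score_fn, ignore_errors=False):
    """Score each sample; optionally skip samples whose scoring raises."""
    scores, skipped = [], []
    for index, sample in enumerate(samples):
        try:
            scores.append(score_fn(sample))
        except Exception as exc:
            if not ignore_errors:
                raise  # without the flag, the first failure stops the run
            skipped.append((index, repr(exc)))
    mean = sum(scores) / len(scores) if scores else None
    # "num" counts only the samples that were actually scored.
    return {"num": len(scores), "score": mean, "skipped": skipped}


def score_fn(sample):
    """Toy scorer that fails on a None input, standing in for a timeout."""
    if sample is None:
        raise TimeoutError("model service did not respond")
    return float(sample)


result = evaluate([1, 0, None, 1], score_fn, ignore_errors=True)
```

Here four samples are submitted but only three are counted, which mirrors the reduced sample number in the report.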
Migration from v0.1.x#
If you previously used a v0.1.x version, note the following major changes after upgrading to v1.0+:
Local Datasets: The zip compressed package format is no longer supported. Please refer to the Offline Evaluation guide and use the local_path parameter to load local datasets.
Visualization Compatibility: The output report format in v1.0 has been updated, and old reports are incompatible with the new visualization tools.
Inference Parameters: The model inference parameter n has been removed. Please use the repeats parameter to control the number of repeated generations.
Evaluation Process Control: The stage parameter has been removed. The rerun_review parameter has been added to force re-execution of the scoring step when use_cache=True.
Dataset Names: The gpqa dataset has been renamed to gpqa_diamond and no longer requires manual specification of subset_list.
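For configurations stored as dictionaries, some of these changes can be applied mechanically. The sketch below is a hypothetical helper, not an official migration tool. It assumes that old configs kept n inside generation_config and stage at the top level, which may not match every setup, and it does not handle zip datasets or old reports.

```python
import copy

# Dataset renames listed in the migration notes above.
RENAMED_DATASETS = {"gpqa": "gpqa_diamond"}


def migrate_config(old_cfg):
    """Return an updated copy of an old-style config dictionary."""
    cfg = copy.deepcopy(old_cfg)
    # The stage parameter has been removed (assumed to be a top-level key).
    cfg.pop("stage", None)
    # n is replaced by repeats (n assumed to live in generation_config).
    generation = cfg.get("generation_config")
    if isinstance(generation, dict) and "n" in generation:
        n = generation.pop("n")
        cfg.setdefault("repeats", n)
    # Apply dataset renames to the dataset list and to dataset_args keys.
    if "datasets" in cfg:
        cfg["datasets"] = [RENAMED_DATASETS.get(d, d) for d in cfg["datasets"]]
    if isinstance(cfg.get("dataset_args"), dict):
        cfg["dataset_args"] = {
            RENAMED_DATASETS.get(name, name): args
            for name, args in cfg["dataset_args"].items()
        }
    return cfg


old = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "datasets": ["gpqa", "gsm8k"],
    "stage": "all",
    "generation_config": {"n": 5, "temperature": 0.6},
    "dataset_args": {"gpqa": {"few_shot_num": 0}},
}
new = migrate_config(old)
```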
Frequently Asked Questions#
Encountering issues during evaluation? We’ve compiled a list of frequently asked questions to help you solve problems.
See also
Reference: Frequently Asked Questions