Basic Usage
1. Simple Evaluation
To evaluate a model on the specified datasets with default settings, use one of the following commands.
You can execute this command from any directory:
python -m evalscope.run \
--model qwen/Qwen2-0.5B-Instruct \
--template-type qwen \
--datasets arc
If prompted with "Do you wish to run the custom code? [y/N]", type y.
Alternatively, execute this command from the evalscope directory:
python evalscope/run.py \
--model qwen/Qwen2-0.5B-Instruct \
--template-type qwen \
--datasets arc
If prompted with "Do you wish to run the custom code? [y/N]", type y.
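The same invocation can be assembled from a Python script. The following is a minimal sketch, assuming evalscope is installed in the environment of the current interpreter; build_eval_command and run_eval are hypothetical helpers written for illustration (they are not part of evalscope), and they use only the flags shown above.

```python
import shlex
import subprocess
import sys


def build_eval_command(model, template_type, datasets):
    """Assemble the `python -m evalscope.run` invocation as an argument list.

    Hypothetical helper; only the flags documented above are used.
    """
    return [
        sys.executable, "-m", "evalscope.run",
        "--model", model,
        "--template-type", template_type,
        "--datasets", *datasets,  # multiple datasets are passed space-separated
    ]


def run_eval(model, template_type, datasets):
    """Run the evaluation in a child process (requires evalscope to be installed)."""
    cmd = build_eval_command(model, template_type, datasets)
    # stdin is inherited, so the "run the custom code? [y/N]" prompt can still be answered.
    return subprocess.run(cmd, check=True)


# Print the command without executing it.
print(shlex.join(build_eval_command("qwen/Qwen2-0.5B-Instruct", "qwen", ["arc"])))
```

Calling run_eval("qwen/Qwen2-0.5B-Instruct", "qwen", ["arc"]) would then start the evaluation itself.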
Basic Parameter Descriptions
--model: Specifies the model_id of the model on ModelScope, allowing automatic download. For example, see the Qwen2-0.5B-Instruct model link; you can also use a local path, such as /path/to/model.
--template-type: Specifies the template type corresponding to the model. Refer to the Default Template field in the template table to fill in this value.
    Note
    You can also view the list of template_type values for models using the following code:
    from evalscope.models.template import TemplateType
    print(TemplateType.get_template_name_list())
--datasets: The dataset name. Multiple datasets can be specified, separated by spaces; they are downloaded automatically. Refer to the supported datasets list for available options.
2. Parameterized Evaluation
If you wish to conduct a more customized evaluation, such as modifying model parameters or dataset parameters, you can use the following commands:
Example 1:
python evalscope/run.py \
--model qwen/Qwen2-0.5B-Instruct \
--template-type qwen \
--model-args revision=master,precision=torch.float16,device_map=auto \
--datasets gsm8k ceval \
--use-cache true \
--limit 10
Example 2:
python evalscope/run.py \
--model qwen/Qwen2-0.5B-Instruct \
--template-type qwen \
--generation-config do_sample=false,temperature=0.0 \
--datasets ceval \
--dataset-args '{"ceval": {"few_shot_num": 0, "few_shot_random": false}}' \
--limit 10
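Hand-writing the JSON for --dataset-args is easy to get wrong (quoting, lowercase booleans). The following is a minimal sketch, assuming a POSIX shell, that generates the Example 2 command from Python values; it only formats the command and does not run it.

```python
import json
import shlex

# Python booleans are serialized as JSON true/false by json.dumps.
dataset_args = {"ceval": {"few_shot_num": 0, "few_shot_random": False}}
dataset_args_json = json.dumps(dataset_args)

argv = [
    "python", "evalscope/run.py",
    "--model", "qwen/Qwen2-0.5B-Instruct",
    "--template-type", "qwen",
    "--generation-config", "do_sample=false,temperature=0.0",
    "--datasets", "ceval",
    "--dataset-args", dataset_args_json,
    "--limit", "10",
]

# shlex.join quotes the JSON argument so that it survives shell parsing.
print(shlex.join(argv))
```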
Parameter Descriptions
In addition to the three basic parameters, the other parameters are as follows:
--model-args: Model loading parameters, separated by commas, in key=value format.
--generation-config: Generation parameters, separated by commas, in key=value format.
    do_sample: Whether to use sampling, default is false.
    max_new_tokens: Maximum generation length, default is 1024.
    temperature: Sampling temperature.
    top_p: Sampling threshold (nucleus sampling: cumulative probability cutoff).
    top_k: Sampling threshold (number of highest-probability tokens kept).
--use-cache: Whether to use the local cache, default is false. If set to true, model and dataset combinations that have already been evaluated are not evaluated again; their results are read directly from the local cache.
--dataset-args: Evaluation dataset configuration parameters, provided in JSON format, where each key is a dataset name and each value holds that dataset's parameters; the keys must correspond one-to-one with the names given in --datasets.
    few_shot_num: Number of few-shot examples.
    few_shot_random: Whether to randomly sample few-shot data; if not specified, defaults to true.
--limit: Maximum number of samples to evaluate per dataset, which is useful for quick validation; if not specified, all samples are evaluated.
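To make the two argument formats concrete, the following sketch shows what a comma-separated key=value string decomposes into, and how to check that the keys of --dataset-args match the names passed to --datasets. Both helpers are hypothetical and written for illustration; how evalscope parses these strings internally is not shown here, and values are left as plain strings.

```python
def parse_kv_args(text):
    """Split 'k1=v1,k2=v2' into a dict of strings (values are not type-converted)."""
    result = {}
    for item in text.split(","):
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {item!r}")
        result[key.strip()] = value.strip()
    return result


def unmatched_dataset_args(datasets, dataset_args):
    """Return the dataset_args keys that do not appear in the datasets list."""
    return sorted(set(dataset_args) - set(datasets))


print(parse_kv_args("revision=master,precision=torch.float16,device_map=auto"))
print(unmatched_dataset_args(["ceval"], {"ceval": {"few_shot_num": 0}, "arc": {}}))
```

A non-empty result from unmatched_dataset_args points to a dataset configured in --dataset-args but missing from --datasets.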
3. Use the run_task Function to Submit an Evaluation Task
The run_task function takes the same parameters as the command line, passed as a single dictionary with the following fields:
1. Configure the Task Dictionary Parameters
import torch
from evalscope.constants import DEFAULT_ROOT_CACHE_DIR

# Example
your_task_cfg = {
    'model_args': {'revision': None, 'precision': torch.float16, 'device_map': 'auto'},
    'generation_config': {'do_sample': False, 'repetition_penalty': 1.0, 'max_new_tokens': 512},
    'dataset_args': {},
    'dry_run': False,
    'model': 'qwen/Qwen2-0.5B-Instruct',
    'template_type': 'qwen',
    'datasets': ['arc', 'hellaswag'],
    'work_dir': DEFAULT_ROOT_CACHE_DIR,
    'outputs': DEFAULT_ROOT_CACHE_DIR,
    'mem_cache': False,
    'dataset_hub': 'ModelScope',
    'dataset_dir': DEFAULT_ROOT_CACHE_DIR,
    'limit': 10,
    'debug': False
}
Here, DEFAULT_ROOT_CACHE_DIR is set to '~/.cache/evalscope'.
2. Execute Task with run_task
from evalscope.run import run_task
run_task(task_cfg=your_task_cfg)
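When several runs differ only in a few fields, it can help to derive each configuration from a shared base instead of editing one dictionary in place. The following is a minimal sketch; with_overrides is a hypothetical helper, and the shortened base_cfg is for illustration only. Whether run_task accepts a dictionary with fields omitted is not confirmed here, so in practice start from the full dictionary shown above.

```python
import copy


def with_overrides(base_cfg, **overrides):
    """Return a deep copy of base_cfg with the given top-level fields replaced."""
    cfg = copy.deepcopy(base_cfg)
    cfg.update(overrides)
    return cfg


# Shortened for illustration; field names are taken from the dictionary above.
base_cfg = {
    'model': 'qwen/Qwen2-0.5B-Instruct',
    'template_type': 'qwen',
    'datasets': ['arc', 'hellaswag'],
    'dataset_args': {},
    'limit': 10,
}

# A quicker variant: one dataset, fewer samples. base_cfg itself is unchanged.
quick_cfg = with_overrides(base_cfg, datasets=['arc'], limit=5)
print(quick_cfg)

# To submit it (requires evalscope to be installed):
# from evalscope.run import run_task
# run_task(task_cfg=quick_cfg)
```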