Parameters#
Run evalscope eval --help to get a complete list of parameter descriptions.
Model Parameters#
- --model: Specifies the model_id of the model in ModelScope, which can be automatically downloaded, for example, Qwen/Qwen2.5-0.5B-Instruct; it can also be a local path to the model, e.g., /path/to/model.
- --model-args: Model loading parameters, separated by commas in key=value format. Default parameters:
  - revision: Model version, default is master.
  - precision: Model precision, default is auto.
  - device_map: Device allocation for the model, default is auto.
- --generation-config: Generation parameters, separated by commas in key=value format. Default parameters:
  - do_sample: Whether to use sampling, default is false.
  - max_length: Maximum length, default is 2048.
  - max_new_tokens: Maximum length of generated tokens, default is 512.
- --chat-template: Model inference template, default is None, which means using transformers’ apply_chat_template; supports passing a Jinja template string for custom inference templates.
- --template-type: Model inference template, deprecated; refer to --chat-template.
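The comma-separated key=value format used by --model-args and --generation-config can be assembled programmatically. The following is a minimal sketch, not part of EvalScope: the helper to_kv is a hypothetical name, the values are the documented defaults, and the command is only printed, not executed. A real run would also need the dataset parameters described in the next section.

```python
import shlex


def to_kv(params):
    """Join a dict into the comma-separated key=value form (hypothetical helper)."""
    return ",".join(f"{key}={value}" for key, value in params.items())


# Documented defaults, written out explicitly for illustration.
model_args = {"revision": "master", "precision": "auto", "device_map": "auto"}
generation_config = {"do_sample": "false", "max_length": 2048, "max_new_tokens": 512}

cmd = [
    "evalscope", "eval",
    "--model", "Qwen/Qwen2.5-0.5B-Instruct",
    "--model-args", to_kv(model_args),
    "--generation-config", to_kv(generation_config),
]

# shlex.join quotes arguments where needed, so the line can be pasted into a shell.
command_line = shlex.join(cmd)
print(command_line)
```

Building the argument list first and quoting it once at the end keeps each flag and its value together, which matters more once values contain spaces or quotes.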
Dataset Parameters#
- --datasets: Dataset names, supporting multiple datasets separated by spaces. Datasets will be automatically downloaded from ModelScope; refer to the Dataset List for supported datasets.
- --dataset-args: Settings for the evaluation dataset in JSON format, where the key is the dataset name and the value is the parameters. Note that the keys must correspond one-to-one with the values in the --datasets parameter:
  - local_path: Local path to the dataset. If specified, the data is loaded from this path.
  - prompt_template: Prompt template for the evaluation dataset. If specified, it will be prefixed to each evaluation data entry.
  - subset_list: List of subsets for evaluation data. If specified, only these subsets will be used.
  - few_shot_num: Number of few-shot samples.
  - few_shot_random: Whether to randomly sample few-shot data. Defaults to true if not set.
- --dataset-dir: Path for downloading datasets, default is ~/.cache/modelscope/datasets.
- --dataset-hub: Source for downloading datasets, default is modelscope, can also be huggingface.
- --limit: Maximum number of evaluation samples per dataset, which can be used for quick validation. If not specified, all data is evaluated.
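Because --dataset-args is a JSON object keyed by dataset name, it is easy to get the quoting or the keys wrong when typing it by hand. The sketch below builds it with json.dumps and checks that each key also appears in --datasets. The dataset names gsm8k and arc and the subset name ARC-Challenge are assumptions for illustration; confirm them against the Dataset List. The command is printed, not run.

```python
import json
import shlex

# Assumed dataset names; confirm against the Dataset List.
datasets = ["gsm8k", "arc"]

# Keys are dataset names, values are the per-dataset settings listed above.
dataset_args = {
    "gsm8k": {"few_shot_num": 4, "few_shot_random": False},
    "arc": {"subset_list": ["ARC-Challenge"]},  # hypothetical subset name
}

# Each key in --dataset-args should name a dataset passed to --datasets.
unknown = set(dataset_args) - set(datasets)
if unknown:
    raise ValueError(f"dataset-args keys not in --datasets: {sorted(unknown)}")

cmd = [
    "evalscope", "eval",
    "--model", "Qwen/Qwen2.5-0.5B-Instruct",
    "--datasets", *datasets,                    # space-separated on the command line
    "--dataset-args", json.dumps(dataset_args),
    "--limit", "10",                            # small sample for quick validation
]

# shlex.join wraps the JSON in single quotes so the shell passes it as one argument.
print(shlex.join(cmd))
```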
Evaluation Parameters#
- --eval-stage: Evaluation stage, options are all, infer, review:
  - all: Full evaluation, including inference and evaluation.
  - infer: Only perform inference, no evaluation.
  - review: Only perform data evaluation, no inference.
- --eval-type: Evaluation type, options are checkpoint, custom, default is checkpoint.
- --eval-backend: Evaluation backend, options are Native, OpenCompass, VLMEvalKit, RAGEval, ThirdParty, default is Native.
  - OpenCompass is used for evaluating large language models.
  - VLMEvalKit is used for evaluating multimodal models.
  - RAGEval is used for evaluating RAG processes, embedding models, reranker models, CLIP models.
  - ThirdParty is used for other special task evaluations, such as ToolBench and LongBench.
  See also: Other evaluation backend Usage Guide
- --eval-config: Required when using non-Native evaluation backends.
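The dependency between --eval-backend and --eval-config can be expressed as a small pre-flight check. This is an illustrative sketch rather than EvalScope's own validation logic: check_backend is a hypothetical function, and the eval_config value is a placeholder because its expected format is not described in this section.

```python
# Backend names as listed above.
BACKENDS = ("Native", "OpenCompass", "VLMEvalKit", "RAGEval", "ThirdParty")


def check_backend(eval_backend="Native", eval_config=None):
    """Hypothetical check: non-Native backends need an eval config."""
    if eval_backend not in BACKENDS:
        raise ValueError(f"unknown eval backend: {eval_backend!r}")
    if eval_backend != "Native" and eval_config is None:
        raise ValueError(f"--eval-config is required for backend {eval_backend!r}")
    return True


check_backend()                                        # Native needs no config
check_backend("OpenCompass", eval_config="placeholder")  # placeholder config value

try:
    check_backend("VLMEvalKit")                        # missing config
except ValueError as err:
    print(err)
```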
Other Parameters#
- --work-dir: Output path for model evaluation, default is ./outputs/{timestamp}.
- --use-cache: Use local cache path, default is None; if a path is specified, such as outputs/20241210_194434, it will reuse the model inference results from that path. If inference is not completed, it will continue inference and then proceed to evaluation.
- --seed: Random seed, default is 42.
- --debug: Whether to enable debug mode, default is false.
- --dry-run: Pre-check parameters without performing inference, only prints parameters, default is false.
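One way these options might combine is a two-step workflow: run inference only, then point --use-cache at the resulting output directory and run the review stage. The sketch below only prints the commands. It assumes that --eval-stage review can be combined with --use-cache in this way, that --dry-run is a bare boolean flag, and that gsm8k is a valid dataset name; outputs/20241210_194434 stands in for the directory produced by the first run.

```python
import shlex

base = [
    "evalscope", "eval",
    "--model", "Qwen/Qwen2.5-0.5B-Instruct",
    "--datasets", "gsm8k",  # assumed dataset name
]

# Directory written by an earlier run under the default ./outputs/{timestamp}.
run_dir = "outputs/20241210_194434"

# Step 1: inference only; results land in a timestamped output directory.
infer_cmd = base + ["--eval-stage", "infer"]

# Step 2: reuse the cached inference results and only score them (assumed combination).
review_cmd = base + ["--eval-stage", "review", "--use-cache", run_dir]

# Optional pre-check: print the parsed parameters without running inference.
dry_run_cmd = infer_cmd + ["--dry-run"]

for cmd in (dry_run_cmd, infer_cmd, review_cmd):
    print(shlex.join(cmd))
```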