# Parameters

Run `evalscope eval --help` to get a complete list of parameter descriptions.
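As a quick orientation before the individual parameters are described, a minimal invocation might look like the sketch below; the model and dataset are just the examples used throughout this page.

```shell
# Minimal sketch: evaluate a ModelScope model on a single dataset
evalscope eval \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --datasets gsm8k
```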
## Model Parameters
- `--model`: The name of the model being evaluated.
  - Specify the model's `id` on ModelScope, and the model will be downloaded automatically, for example `Qwen/Qwen2.5-0.5B-Instruct`;
  - Specify a local path to the model, for example `/path/to/model`, to load the model from the local environment;
  - When the evaluation target is a model API endpoint, specify the `model_id`, for example `Qwen2.5-0.5B-Instruct`.
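For example, the three ways of specifying `--model` could look like the sketch below; the local path is a placeholder and the endpoint URL is only illustrative.

```shell
# ModelScope model id (downloaded automatically)
evalscope eval --model Qwen/Qwen2.5-0.5B-Instruct --datasets gsm8k

# Local model path (placeholder path)
evalscope eval --model /path/to/model --datasets gsm8k

# Served model id, evaluated through an OpenAI-compatible endpoint
evalscope eval --model Qwen2.5-0.5B-Instruct --eval-type service \
  --api-url http://127.0.0.1:8000/v1 --datasets gsm8k
```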
- `--model-id`: An alias for the model being evaluated. Defaults to the last part of `model`; for example, the `model-id` of `Qwen/Qwen2.5-0.5B-Instruct` is `Qwen2.5-0.5B-Instruct`.
- `--model-task`: The task type of the model, defaults to `text_generation`; options are `text_generation` and `image_generation`.
- `--model-args`: Model loading parameters, separated by commas in `key=value` format, with the following defaults:
  - `revision`: Model version, defaults to `master`
  - `precision`: Model precision, defaults to `torch.float16`
  - `device_map`: Device allocation for the model, defaults to `auto`
- `--generation-config`: Generation parameters, separated by commas in `key=value` format, or passed as a JSON string that will be parsed into a dictionary:
  - If using local model inference (based on Transformers), the following parameters are supported (see the full parameter guide):
    - `do_sample`: Whether to use sampling, default is `false`
    - `max_length`: Maximum length, default is 2048
    - `max_new_tokens`: Maximum length of generated text, default is 512
    - `num_return_sequences`: Number of sequences to generate, default is 1; when set greater than 1, multiple sequences will be generated, which requires setting `do_sample=true`
    - `temperature`: Generation temperature
    - `top_k`: Top-k for generation
    - `top_p`: Top-p for generation
  - If using a model API service for inference (`eval-type` set to `service`), the following parameters are supported (refer to the deployed model service for specifics):
    - `max_tokens`: Maximum length of generated text, default is 512
    - `temperature`: Generation temperature, default is 0.0
    - `n`: Number of generated sequences, default is 1 (note: currently, lmdeploy only supports n=1)
```shell
# For example, pass arguments in key=value form
--model-args revision=master,precision=torch.float16,device_map=auto --generation-config do_sample=true,temperature=0.5
# Or pass more complex parameters as a JSON string
--model-args '{"revision": "master", "precision": "torch.float16", "device_map": "auto"}' --generation-config '{"do_sample": true, "temperature": 0.5, "chat_template_kwargs": {"enable_thinking": false}}'
```
- `--chat-template`: Model inference template, defaults to `None`, which means transformers' `apply_chat_template` is used; supports passing a Jinja template string to customize the inference template.
- `--template-type`: Model inference template (deprecated), refer to `--chat-template`.
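For instance, a custom Jinja template can be passed as a string. The template below is a deliberately minimal illustration, not the template of any particular model:

```shell
# Illustrative only: a minimal Jinja chat template passed as a string
--chat-template '{% for message in messages %}{{ message.role }}: {{ message.content }} {% endfor %}assistant:'
```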
The following parameters are only valid when `eval-type=service`:

- `--api-url`: Model API endpoint, default is `None`; supports local or remote OpenAI API format endpoints, for example `http://127.0.0.1:8000/v1`.
- `--api-key`: Model API endpoint key, default is `EMPTY`
- `--timeout`: Model API request timeout, default is `None`
- `--stream`: Whether to use streaming transmission, default is `False`
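Putting the service-only parameters together, an evaluation against a locally deployed OpenAI-compatible endpoint might look like this sketch (the endpoint URL comes from the example above; the timeout value is only illustrative):

```shell
evalscope eval \
  --model Qwen2.5-0.5B-Instruct \
  --eval-type service \
  --api-url http://127.0.0.1:8000/v1 \
  --api-key EMPTY \
  --timeout 600 \
  --datasets gsm8k
```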
## Dataset Parameters
- `--datasets`: Dataset names, supports passing multiple datasets separated by spaces; datasets are downloaded automatically from ModelScope. For supported datasets, refer to the Dataset List.
- `--dataset-args`: Configuration parameters for the evaluation datasets, passed in `json` format, where the key is the dataset name and the value is its parameters; note that the keys must correspond one-to-one with the values of the `--datasets` parameter:
  - `dataset_id` (or `local_path`): Local path of the dataset; once specified, local data will be loaded.
  - `prompt_template`: Prompt template for the evaluation dataset. When specified, it will be used to generate prompts. For example, the template for the `gsm8k` dataset is `Question: {query}\nLet's think step by step\nAnswer:`; the question from the dataset is filled into the `query` field of the template.
  - `query_template`: Query template for the evaluation dataset. When specified, it will be used to generate queries. For example, the template for `general_mcq` is `Question: {question}\n{choices}\nAnswer: {answer}\n\n`; the question from the dataset is inserted into the `question` field, options into the `choices` field, and the answer into the `answer` field (answer insertion is only effective in few-shot scenarios).
  - `system_prompt`: System prompt for the evaluation dataset.
  - `model_adapter`: Model adapter for the evaluation dataset. Once specified, the given model adapter will be used for evaluation. Currently supports `generation`, `multiple_choice_logits`, and `continuous_logits`. For service evaluation, only `generation` is supported at the moment; some multiple-choice datasets support `logits` output.
  - `subset_list`: List of subsets of the evaluation dataset; once specified, only data from those subsets will be used.
  - `few_shot_num`: Number of few-shot examples.
  - `few_shot_random`: Whether to randomly sample few-shot data, defaults to `False`.
  - `metric_list`: List of metrics for evaluating the dataset. When specified, the evaluation uses the given metrics. Currently supported metrics include `AverageAccuracy`, `AveragePass@1`, and `Pass@[1-16]`. For example, for the `humaneval` dataset you can specify `["Pass@1", "Pass@5"]`; note that in this case you need to set `n=5` so the model returns 5 results.
  - `filters`: Filters for the evaluation dataset. When specified, the filters are applied to the evaluation results; they can be used to handle the output of reasoning models. Currently supported filters are:
    - `remove_until {string}`: Removes the part of the model's output before the specified string.
    - `extract {regex}`: Extracts the part of the model's output that matches the specified regular expression.

    For example, the `ifeval` dataset can specify `{"remove_until": "</think>"}`, which filters out the part of the model's output before `</think>`, avoiding interference with scoring.
```shell
# For example
--datasets gsm8k arc ifeval
--dataset-args '{"gsm8k": {"few_shot_num": 4, "few_shot_random": false}, "arc": {"dataset_id": "/path/to/arc"}, "ifeval": {"filters": {"remove_until": "</think>"}}}'
```
- `--dataset-dir`: Dataset download path, defaults to `~/.cache/modelscope/datasets`.
- `--dataset-hub`: Dataset download source, defaults to `modelscope`; the alternative is `huggingface`.
- `--limit`: Maximum amount of evaluation data per dataset. If not specified, all data is used; can be used for quick validation.
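For example, a quick validation run that caps each dataset at 10 samples and downloads data from HuggingFace instead of ModelScope might look like the sketch below (the sample count is only illustrative):

```shell
evalscope eval \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --datasets gsm8k \
  --dataset-hub huggingface \
  --limit 10
```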
## Evaluation Parameters
- `--eval-batch-size`: Evaluation batch size, default is `1`; when `eval-type=service`, it indicates the number of concurrent evaluation requests, default is `8`.
- `--eval-stage`: (Deprecated, refer to `--use-cache`) Evaluation stage, options are `all`, `infer`, `review`; default is `all`.
- `--eval-type`: Evaluation type, options are `checkpoint`, `custom`, `service`; defaults to `checkpoint`.
- `--eval-backend`: Evaluation backend, options are `Native`, `OpenCompass`, `VLMEvalKit`, `RAGEval`, `ThirdParty`; defaults to `Native`.
  - `OpenCompass` is used for evaluating large language models.
  - `VLMEvalKit` is used for evaluating multimodal models.
  - `RAGEval` is used for evaluating RAG pipelines, embedding models, re-ranking models, and CLIP models. See also: Other Evaluation Backends User Guide.
  - `ThirdParty` is used for other special task evaluations, such as ToolBench and LongBench.
- `--eval-config`: This parameter needs to be passed when using a non-`Native` evaluation backend.
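As an illustration of the evaluation parameters, the sketch below runs a service evaluation with 16 concurrent requests while keeping the default `Native` backend, so no `--eval-config` is required (the concurrency value is only an example):

```shell
evalscope eval \
  --model Qwen2.5-0.5B-Instruct \
  --eval-type service \
  --api-url http://127.0.0.1:8000/v1 \
  --eval-batch-size 16 \
  --datasets gsm8k
```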
## Judge Parameters
The LLM-as-a-Judge evaluation parameters use a judge model to determine whether answers are correct, and include the following:

- `--judge-strategy`: Strategy for using the judge model, options include:
  - `auto`: The default strategy, which decides whether to use the judge model based on the dataset's requirements
  - `llm`: Always use the judge model
  - `rule`: Do not use the judge model; use rule-based judgment instead
  - `llm_recall`: First use rule-based judgment; if it fails, fall back to the judge model
- `--judge-worker-num`: Concurrency of the judge model, default is `1`
- `--judge-model-args`: Parameters for the judge model, passed as a `json` string and parsed into a dictionary, supporting the following fields:
  - `api_key`: API endpoint key for the judge model, default is `EMPTY`
  - `api_url`: API endpoint for the judge model, default is `https://api.openai.com/v1`
  - `model_id`: Model ID of the judge model, default is `gpt-3.5-turbo`
  - `system_prompt`: (Optional) System prompt for evaluating the dataset
  - `prompt_template`: (Optional) Prompt template for evaluating the dataset
  - `generation_config`: (Optional) Generation parameters
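For example, the sketch below always uses a judge model served behind an OpenAI-compatible endpoint; the endpoint URL and judge model id are placeholders:

```shell
# Illustrative judge configuration; URL and model id are placeholders
--judge-strategy llm \
--judge-worker-num 4 \
--judge-model-args '{"api_key": "EMPTY", "api_url": "http://127.0.0.1:8801/v1", "model_id": "qwen2.5-72b-instruct", "generation_config": {"temperature": 0.0}}'
```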
## Other Parameters
- `--work-dir`: Output path for model evaluation, default is `./outputs/{timestamp}`.
- `--use-cache`: Local cache path, default is `None`; if a path is specified, such as `outputs/20241210_194434`, the model inference results from that path will be reused. If inference was not completed, inference continues from there before evaluation proceeds.
- `--seed`: Random seed, default is `42`.
- `--debug`: Whether to enable debug mode, default is `false`.
- `--ignore-errors`: Whether to ignore errors during model generation, default is `false`.
- `--dry-run`: Pre-check parameters without running inference; only prints the parameters. Default is `false`.