Parameters#
Run evalscope eval --help to get the full list of parameters.
Model Parameters#
--model: The name of the model to be evaluated.
  - Specify the model's id on ModelScope to download the model automatically, e.g. Qwen/Qwen2.5-0.5B-Instruct.
  - Specify a local path to load the model from, e.g. /path/to/model.
  - For evaluating a model API service, specify the model id corresponding to the service, e.g. Qwen2.5-0.5B-Instruct.
--model-id: Alias for the evaluated model, used in reports. Defaults to the last part of model, e.g. for Qwen/Qwen2.5-0.5B-Instruct, the model-id is Qwen2.5-0.5B-Instruct.
--api-url: The model API endpoint, default is None; supports local or remote OpenAI API format endpoints, e.g. http://127.0.0.1:8000/v1.
--api-key: The model API endpoint key, default is EMPTY.
--model-args: Model loading parameters, either as comma-separated key=value pairs, or as a JSON string which will be parsed as a dictionary. Default parameters:
  - revision: Model revision, default is master.
  - precision: Model precision, default is torch.float16.
  - device_map: Device allocation for the model, default is auto.
--model-task: Model task type, default is text_generation, optional values are text_generation, image_generation.
--chat-template: Model inference template, default is None (uses transformers' apply_chat_template); you can pass a Jinja template string to customize the inference template.
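As an illustration of the --model-id default described above, the alias can be thought of as the last path component of --model. The helper name default_model_id is hypothetical, and this is a sketch of the documented behavior rather than the tool's own code:

```python
def default_model_id(model: str) -> str:
    """Return the last path component of a model name or path (illustrative only)."""
    # Strip a trailing slash so "/path/to/model/" still yields "model".
    return model.rstrip("/").split("/")[-1]


print(default_model_id("Qwen/Qwen2.5-0.5B-Instruct"))  # Qwen2.5-0.5B-Instruct
```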
Model Inference Parameters#
--generation-config: Generation parameters, either as comma-separated key=value pairs, or as a JSON string (parsed as a dictionary):
  - timeout: Optional integer, request timeout (seconds).
  - stream: Optional boolean, whether to return responses in streaming mode (depends on the model).
  - max_tokens: Optional integer, the maximum number of tokens generated (depends on the model).
  - top_p: Optional float, nucleus sampling; the model only considers tokens accounting for top_p probability mass.
  - temperature: Optional float, sampling temperature (0~2); higher means more randomness, lower means more deterministic output.
  - frequency_penalty: Optional float (-2.0~2.0); positive values penalize repeated tokens to reduce repetition. Supported by OpenAI, Google, Grok, Groq, vLLM, SGLang.
  - presence_penalty: Optional float (-2.0~2.0); positive values penalize already appeared tokens to encourage new topics. Supported by OpenAI, Google, Grok, Groq, vLLM, SGLang.
  - logit_bias: Optional dict mapping token ids to bias values (-100~100), e.g. "42=10,43=-10". Supported by OpenAI, Grok, vLLM.
  - seed: Optional integer, random seed. Supported by OpenAI, Google, Mistral, Groq, HuggingFace, vLLM.
  - do_sample: Optional boolean, whether to use sampling strategy (otherwise greedy decoding). Supported by Transformers models.
  - top_k: Optional integer, sample next token from the top_k most likely candidates. Supported by Anthropic, Google, HuggingFace, vLLM, SGLang.
  - logprobs: Optional boolean, whether to return log probabilities for output tokens. Supported by OpenAI, Grok, TogetherAI, HuggingFace, llama-cpp-python, vLLM, SGLang.
  - top_logprobs: Optional integer (0~20), return the top top_logprobs tokens and their probabilities for each position. Supported by OpenAI, Grok, HuggingFace, vLLM, SGLang.
  - parallel_tool_calls: Optional boolean, whether to support parallel tool calls (default True). Supported by OpenAI, Groq.
  - max_tool_output: Optional integer, maximum bytes for tool output. Default is 16*1024.
  - cache_prompt: Optional "auto", boolean, or None; whether to cache prompt prefix. Default is "auto", enabled when using tools. Supported by Anthropic.
  - reasoning_effort: Optional enum ('low', 'medium', 'high'), restricts reasoning effort, default 'medium'. Supported by OpenAI o1 models.
  - reasoning_tokens: Optional integer, max tokens for reasoning. Supported by Anthropic Claude models.
  - reasoning_summary: Optional enum ('concise', 'detailed', 'auto'); whether to provide a summary of reasoning steps. 'auto' uses the most detailed summary available. Supported by OpenAI reasoning models.
  - reasoning_history: Optional enum ('none', 'all', 'last', 'auto'); whether reasoning content is included in chat message history.
  - response_schema: Optional ResponseSchema object, request returns formatted according to JSONSchema (output still requires validation). Supported by OpenAI, Google, Mistral.
  - extra_body: Optional dict, extra request body for OpenAI-compatible services.
  - extra_query: Optional dict, extra query parameters for OpenAI-compatible services.
  - extra_headers: Optional dict, extra headers for OpenAI-compatible services.
  - height: Optional integer, for image generation models, specifies image height.
  - width: Optional integer, for image generation models, specifies image width.
  - num_inference_steps: Optional integer, for image models, number of inference steps.
  - guidance_scale: Optional float, for image models, guidance scale.
Example usage:
# Pass as key=value form
--model-args revision=master,precision=torch.float16,device_map=auto
--generation-config do_sample=true,temperature=0.5
# Or pass as JSON string for more complex parameters
--model-args '{"revision": "master", "precision": "torch.float16", "device_map": "auto"}'
--generation-config '{"do_sample":true,"temperature":0.5,"chat_template_kwargs":{"enable_thinking": false}}'
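To make the two accepted forms concrete, the following sketch turns either a JSON string or comma-separated key=value pairs into a dictionary. The function name parse_arg_string and the value coercion rules are assumptions made for illustration; the tool's real parser may treat values differently.

```python
import json
from typing import Any, Dict


def parse_arg_string(raw: str) -> Dict[str, Any]:
    """Parse a JSON object string or comma-separated key=value pairs into a dict."""
    raw = raw.strip()
    if raw.startswith("{"):
        # JSON form: supports nested values such as chat_template_kwargs.
        return json.loads(raw)
    result: Dict[str, Any] = {}
    for pair in raw.split(","):
        key, _, value = pair.partition("=")
        try:
            # Assumed coercion: "true" -> True, "0.5" -> 0.5, and so on.
            result[key.strip()] = json.loads(value)
        except json.JSONDecodeError:
            # Anything that is not valid JSON (e.g. torch.float16) stays a string.
            result[key.strip()] = value.strip()
    return result


print(parse_arg_string("do_sample=true,temperature=0.5"))
print(parse_arg_string('{"do_sample": true, "chat_template_kwargs": {"enable_thinking": false}}'))
```

The key=value form is convenient for flat settings; nested settings need the JSON form, because a plain split on commas cannot represent nested dictionaries.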
Dataset Parameters#
--datasets: Dataset name(s), supports multiple datasets separated by spaces. Datasets will be automatically downloaded from ModelScope. Supported datasets are listed at Dataset List.
--dataset-args: Evaluation dataset settings, passed as a JSON string and parsed into a dictionary. Must correspond to values in --datasets:
  - dataset_id (or local_path): Specify local path for the dataset; if set, data will be loaded locally.
  - prompt_template: Prompt template for the evaluation dataset; will be used to generate prompts. For example, gsm8k's template is Question: {query}\nLet's think step by step\nAnswer: and the dataset question will be filled in the query field.
  - system_prompt: System prompt for the evaluation dataset.
  - subset_list: List of dataset subsets; only data from specified subsets will be used.
  - few_shot_num: Number of few-shot samples.
  - few_shot_random: Whether to sample few-shot data randomly (default False).
  - shuffle: Indicates whether to randomize the data before evaluation, with the default setting being False.
  - shuffle_choices: Specifies whether to rearrange the order of choices before evaluation, defaulting to False. This option is only applicable to multiple-choice datasets.
  - metric_list: The metric list for the evaluation dataset. When specified, the given metrics will be used for evaluation. Currently, acc is supported by default. Other computed metrics can be found in the dataset list.
  - aggregation: The aggregation method for evaluation results. Default is mean. Other options include:
    - mean_and_pass_at_k: Calculates the mean and pass_at_k. Requires specifying repeats=k, which automatically calculates the probability of passing at least once across k attempts for the same example. For instance, the humaneval dataset can specify repeats=5 to calculate the pass rate among 5 generated results for the same example.
    - mean_and_vote_at_k: Calculates the mean and vote_at_k. Requires specifying repeats=k, which automatically calculates the voting result across k attempts for the same example to determine the final result.
    - mean_and_pass_hat_k: Calculates the mean and pass_hat_k. Requires specifying repeats=k, which automatically calculates the probability of passing all k attempts for the same example. For instance, the tau2_bench dataset can specify repeats=3 to calculate the pass rate among 3 generated results for the same example.
  - filters: Filters for the evaluation dataset. When specified, the given filters will be used to filter evaluation results and can be used to process the output of inference models. Currently supported filters include:
    - remove_until {string}: Removes the portion before the specified string in the model output. For example, the ifeval dataset can specify {"remove_until": "</think>"} to filter out the portion before </think> in the model output, preventing it from affecting scoring.
    - extract {regex}: Extracts the portion of the model output that matches the specified regular expression.
  - extra_params: Extra parameters related to the dataset. Refer to the documentation of each dataset for specific parameters, e.g., the include_multi_modal parameter for the hle dataset.
--dataset-dir: Dataset download path, default is ~/.cache/modelscope/datasets.
--dataset-hub: Dataset source, default is modelscope, optional value is huggingface.
--limit: Max number of samples to evaluate per dataset. If not set, evaluates all data. Supports int and float. Int means the first N samples, float means the first N% of samples in the dataset. For example, 0.1 means the first 10% of samples, 100 means the first 100 samples.
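The three repeat-based aggregation options described above differ only in how the k attempts for one example are combined. The sketch below is a simplified reading of those descriptions, with hypothetical function names and made-up sample results; the tool may use a different estimator internally.

```python
from collections import Counter
from statistics import mean
from typing import Hashable, List, Sequence


def pass_at_k(passed: Sequence[bool]) -> float:
    """1.0 if at least one of the k attempts passed."""
    return 1.0 if any(passed) else 0.0


def pass_hat_k(passed: Sequence[bool]) -> float:
    """1.0 only if all k attempts passed."""
    return 1.0 if all(passed) else 0.0


def vote_at_k(answers: Sequence[Hashable], gold: Hashable) -> float:
    """1.0 if the most common answer across the k attempts equals the reference."""
    winner, _count = Counter(answers).most_common(1)[0]
    return 1.0 if winner == gold else 0.0


# Three illustrative examples, each evaluated with repeats=3.
runs: List[List[bool]] = [
    [True, False, False],
    [False, False, False],
    [True, True, True],
]

print(round(mean(float(p) for run in runs for p in run), 3))  # 0.444 (plain mean)
print(round(mean(pass_at_k(r) for r in runs), 3))             # 0.667
print(round(mean(pass_hat_k(r) for r in runs), 3))            # 0.333
print(vote_at_k(["42", "41", "42"], gold="42"))               # 1.0
```

On the same data, pass_at_k is the most lenient score and pass_hat_k the strictest, which is why the latter suits benchmarks that care about consistency across attempts.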
Example usage:
--datasets gsm8k arc ifeval hle \
--dataset-args '{
    "gsm8k": {
        "few_shot_num": 4,
        "few_shot_random": false
    },
    "arc": {
        "dataset_id": "/path/to/arc"
    },
    "ifeval": {
        "filters": {
            "remove_until": "</think>"
        }
    },
    "hle": {
        "extra_params": {
            "include_multi_modal": false
        }
    }
}'
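The remove_until and extract filters and the --limit rule can be pictured with a few small functions. This is a sketch under stated assumptions, not the tool's implementation: the function names are hypothetical, remove_until is assumed to drop the marker itself and to act on its first occurrence, and extract is assumed to return nothing when the pattern does not match.

```python
import re
from typing import List, Optional, Sequence, TypeVar, Union

T = TypeVar("T")


def remove_until(text: str, marker: str) -> str:
    """Drop everything up to and including the first occurrence of marker."""
    _head, found, tail = text.partition(marker)
    # Assumption: output that does not contain the marker is returned unchanged.
    return tail if found else text


def extract(text: str, pattern: str) -> Optional[str]:
    """Return the first substring matching pattern, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


def apply_limit(samples: Sequence[T], limit: Union[int, float, None]) -> List[T]:
    """Keep the first N samples (int) or the first fraction of samples (float)."""
    if limit is None:
        return list(samples)
    if isinstance(limit, float):
        return list(samples[: int(len(samples) * limit)])
    return list(samples[:limit])


print(remove_until("<think>draft reasoning</think>Final answer", "</think>"))  # Final answer
print(extract("The answer is 42.", r"\d+"))                                    # 42
print(len(apply_limit(list(range(200)), 0.1)))                                 # 20
print(len(apply_limit(list(range(200)), 100)))                                 # 100
```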
Evaluation Parameters#
--eval-type: Evaluation type, choose based on model inference method, default is llm_ckpt:
  - llm_ckpt: Local model inference; downloads the model from ModelScope and uses Transformers for inference.
  - openai_api: Online model service inference; supports any OpenAI API-compatible service.
  - text2image: Local text-to-image model inference; downloads the model from ModelScope and uses Diffusers pipeline for inference.
  - mock_llm: Simulated LLM inference, for function verification.
--eval-batch-size: Evaluation batch size, default is 1; for eval-type=service, this means concurrent evaluation requests, default is 8.
--eval-backend: Evaluation backend, optional values are Native, OpenCompass, VLMEvalKit, RAGEval, ThirdParty; default is Native.
  - OpenCompass: for LLM evaluation
  - VLMEvalKit: for multimodal model evaluation
  - RAGEval: for RAG pipeline, embedding, reranker, CLIP model evaluation
    See also: Refer to the other backend usage guide.
  - ThirdParty: for other special tasks, e.g. ToolBench, LongBench
--eval-config: Required when using a non-Native evaluation backend.
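As a sketch of how the flags on this page fit together for evaluating an API service, the snippet below assembles a command line without running it. Every flag comes from this page, but the particular combination and values (a local endpoint, the gsm8k dataset, a limit of 10) are assumptions chosen for illustration.

```python
import shlex

# Each flag appears in the parameter reference; the values are illustrative.
cmd = [
    "evalscope", "eval",
    "--model", "Qwen2.5-0.5B-Instruct",
    "--api-url", "http://127.0.0.1:8000/v1",
    "--api-key", "EMPTY",
    "--eval-type", "openai_api",
    "--eval-batch-size", "8",
    "--datasets", "gsm8k",
    "--limit", "10",
]

# shlex.join quotes arguments where needed, giving a copy-pasteable command.
command_line = shlex.join(cmd)
print(command_line)
```

Building the argument list first and joining it afterwards avoids quoting mistakes; the same list could be handed to subprocess.run once an endpoint is actually available.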
Judge Parameters#
LLM-as-a-Judge evaluation uses a judge model to determine whether an answer is correct. It is controlled by the following parameters:
--judge-strategy: The strategy for using the judge model, options include:
  - auto: The default strategy, which decides whether to use the judge model based on the dataset requirements
  - llm: Always use the judge model
  - rule: Do not use the judge model, use rule-based judgment instead
  - llm_recall: First use rule-based judgment, and if it fails, then use the judge model
--judge-worker-num: The concurrency number for the judge model, default is 1
--judge-model-args: Sets the parameters for the judge model, passed in as a json string and parsed as a dictionary, supporting the following fields:
  - api_key: The API endpoint key for the model. If not set, it will be retrieved from the environment variable MODELSCOPE_SDK_TOKEN, with a default value of EMPTY.
  - api_url: The API endpoint for the model. If not set, it will be retrieved from the environment variable MODELSCOPE_API_BASE, with a default value of https://api-inference.modelscope.cn/v1/.
  - model_id: The model ID. If not set, it will be retrieved from the environment variable MODELSCOPE_JUDGE_LLM, with a default value of Qwen/Qwen3-235B-A22B.
    See also: For more information on ModelScope's model inference services, please refer to ModelScope API Inference Services.
  - system_prompt: System prompt for evaluating the dataset
  - prompt_template: Prompt template for evaluating the dataset
  - generation_config: Model generation parameters, same as the --generation-config parameter.
  - score_type: Preset model scoring method, options include:
    - pattern: (Default option) Directly judge whether the model output matches the reference answer, suitable for evaluations with reference answers.
      Default prompt_template:
        Your job is to look at a question, a gold target, and a predicted answer, and return a letter "A" or "B" to indicate whether the predicted answer is correct or incorrect.

        [Question]
        {question}

        [Reference Answer]
        {gold}

        [Predicted Answer]
        {pred}

        Evaluate the model's answer based on correctness compared to the reference answer.
        Grade the predicted answer of this new question as one of:
        A: CORRECT
        B: INCORRECT

        Just return the letters "A" or "B", with no text around it.
    - numeric: Judge the model output score under prompt conditions, suitable for evaluations without reference answers.
      Default prompt_template:
        Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response.
        Begin your evaluation by providing a short explanation. Be as objective as possible.
        After providing your explanation, you must rate the response on a scale of 0 (worst) to 1 (best) by strictly following this format: \"[[rating]]\", for example: \"Rating: [[0.5]]\"

        [Question]
        {question}

        [Response]
        {pred}
  - score_pattern: Regular expression for parsing model output, default for pattern mode is (A|B); default for numeric mode is \[\[(\d+(?:\.\d+)?)\]\], used to extract model scoring results.
  - score_mapping: Score mapping dictionary for pattern mode, default is {'A': 1.0, 'B': 0.0}
--analysis-report: Whether to generate an analysis report, default is false; if this parameter is set, an analysis report will be generated using the judge model, including analysis interpretation and suggestions for the model evaluation results. The report output language will be automatically determined based on locale.getlocale().
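To show how the default score_pattern and score_mapping values turn judge output into a number, here is a minimal sketch using the two regular expressions quoted above. The function name parse_judge_output and the choice to return None on a failed match are assumptions for illustration, not the tool's documented behavior.

```python
import re
from typing import Dict, Optional

# Defaults quoted from the score_pattern and score_mapping descriptions.
PATTERN_REGEX = r"(A|B)"
NUMERIC_REGEX = r"\[\[(\d+(?:\.\d+)?)\]\]"
SCORE_MAPPING: Dict[str, float] = {"A": 1.0, "B": 0.0}


def parse_judge_output(text: str, score_type: str = "pattern") -> Optional[float]:
    """Extract a score from judge output; None when nothing matches (assumed)."""
    if score_type == "pattern":
        match = re.search(PATTERN_REGEX, text)
        return SCORE_MAPPING[match.group(1)] if match else None
    if score_type == "numeric":
        match = re.search(NUMERIC_REGEX, text)
        return float(match.group(1)) if match else None
    raise ValueError(f"unknown score_type: {score_type}")


print(parse_judge_output("A"))                                      # 1.0
print(parse_judge_output("B"))                                      # 0.0
print(parse_judge_output("Rating: [[0.5]]", score_type="numeric"))  # 0.5
```

Because (A|B) matches the first "A" or "B" anywhere in the text, a judge reply with surrounding prose could be misread in this sketch, which is consistent with the default prompt asking for the bare letter only.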
Sandbox Parameters#
--use-sandbox: Whether to use a Sandbox for model evaluation, default is false. If this parameter is set, ms-enclave will be activated to isolate the code execution environment, enhancing security. This is currently effective only for code evaluation tasks such as humaneval.
--sandbox-manager-config: Configuration parameters for the Sandbox manager, passed as a json string and parsed into a dictionary. It supports the following fields:
  - base_url: The base URL for the Sandbox manager, defaulting to None, which indicates the use of a local manager. Configuring this parameter will enable the use of a remote manager.
--sandbox-type: Specifies the type of Sandbox, with a default value of docker.
--sandbox-config: Configuration parameters for the Sandbox, passed as a json string and parsed into a dictionary. It supports the following fields:
  - image: The name of the Docker image, defaulting to python:3.11-slim.
  - network_enabled: Whether to enable networking, default is true.
  - tools_config: Tool configuration dictionary, defaulting to {'shell_executor': {}, 'python_executor': {}} (each executor with an empty configuration), which indicates that both shell and python executors are enabled.
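Since --sandbox-config is passed as a JSON string, it can be easier to build it from a dictionary than to type it by hand. The sketch below assembles (but does not run) such a command. The field names come from the list above; treating --use-sandbox as a value-less switch and disabling networking are assumptions for illustration.

```python
import json
import shlex

# Field names are taken from the --sandbox-config description above.
sandbox_config = {
    "image": "python:3.11-slim",
    "network_enabled": False,  # illustrative override of the documented default (true)
    "tools_config": {"shell_executor": {}, "python_executor": {}},
}

cmd = [
    "evalscope", "eval",
    "--model", "Qwen/Qwen2.5-0.5B-Instruct",
    "--datasets", "humaneval",
    "--use-sandbox",  # assumed to be a switch that takes no value
    "--sandbox-type", "docker",
    "--sandbox-config", json.dumps(sandbox_config),
]

# shlex.join adds the shell quoting that the embedded JSON needs.
command_line = shlex.join(cmd)
print(command_line)
```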
Other Parameters#
--work-dir: Model evaluation output path, default is ./outputs/{timestamp}, folder structure example is as follows:
    .
    ├── configs
    │   └── task_config_b6f42c.yaml
    ├── logs
    │   └── eval_log.log
    ├── predictions
    │   └── Qwen2.5-0.5B-Instruct
    │       └── general_qa_example.jsonl
    ├── reports
    │   └── Qwen2.5-0.5B-Instruct
    │       └── general_qa.json
    └── reviews
        └── Qwen2.5-0.5B-Instruct
            └── general_qa_example.jsonl
--use-cache: Path to use for local caching, default is None. If a specific path is provided (e.g. outputs/20241210_194434), the model inference results and evaluation results in that path will be reused; if inference is incomplete, it will continue from where it left off, and then proceed to evaluation.
--rerun-review: Boolean value. Set to True if you want to reuse the model inference results and only rerun the evaluation. Default is False; if evaluation results exist locally, evaluation will be skipped.
--seed: Random seed, default is 42.
--debug: Whether to enable debug mode, default is false.
--ignore-errors: Whether to ignore errors during model generation, default is false.
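Given the --work-dir layout shown above, report files for a model can be located with a short helper. The function name list_reports is hypothetical, and the code assumes only the reports/<model-id>/<dataset>.json structure from the example tree; it does not interpret the report contents.

```python
from pathlib import Path
from typing import List


def list_reports(work_dir: str, model_id: str) -> List[Path]:
    """Return report files under <work_dir>/reports/<model_id>, sorted by name."""
    report_dir = Path(work_dir) / "reports" / model_id
    if not report_dir.is_dir():
        # Nothing has been evaluated for this model in this work directory.
        return []
    return sorted(report_dir.glob("*.json"))


# Prints one path per dataset report, or nothing if the directory is absent.
for report in list_reports("outputs/20241210_194434", "Qwen2.5-0.5B-Instruct"):
    print(report)
```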