Parameters
Run evalscope eval --help to get the full list of parameters.
Environment Variables
The following environment variables can be set before launch to control global default behavior:
| Environment Variable | Description | Default |
|---|---|---|
| | Root cache directory for EvalScope, used to store datasets, intermediate evaluation files, etc. | |
| | Global default language, affects output language for reports, etc. ( | |
| | Heartbeat reporting interval (seconds) | |
| | Root cache directory for ModelScope models and datasets | |
Model Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| | | Name of the model to be evaluated | - |
| | | Alias for the evaluated model, used in reports | Last part of |
| | | Model API endpoint, supports OpenAI-compatible format | |
| | | Model API endpoint key | |
| | | Model loading parameters, comma-separated | |
| | | Model task type | |
| | | Model inference template, supports Jinja template string | |
Example:
# key=value format
--model-args revision=master,precision=torch.float16,device_map=auto
# JSON string format
--model-args '{"revision": "master", "precision": "torch.float16", "device_map": "auto"}'
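The two accepted formats carry the same information. The following is a minimal Python sketch of how such a value could be normalized into a dictionary. The helper name `parse_args_string` is hypothetical and this is not EvalScope's own parser; in particular, this sketch keeps every key=value entry as a string, whereas the real tool may convert types.

```python
import json


def parse_args_string(value: str) -> dict:
    """Parse a JSON object string or comma-separated key=value pairs.

    Hypothetical helper for illustration only, not part of EvalScope.
    """
    value = value.strip()
    if value.startswith("{"):
        parsed = json.loads(value)
        if not isinstance(parsed, dict):
            raise ValueError("expected a JSON object")
        return parsed
    result = {}
    for item in value.split(","):
        if not item:
            continue  # tolerate a trailing comma
        key, sep, val = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        result[key.strip()] = val.strip()  # values stay strings in this sketch
    return result


kv = parse_args_string("revision=master,precision=torch.float16,device_map=auto")
js = parse_args_string(
    '{"revision": "master", "precision": "torch.float16", "device_map": "auto"}'
)
print(kv == js)
```

The JSON form is the safer choice whenever a value contains a comma or needs a non-string type.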
Model Inference Parameters
The --generation-config parameter supports the following options (comma-separated key=value or JSON string):
| Parameter | Type | Description | Supported Backends |
|---|---|---|---|
| | | Request timeout (seconds) | All |
| | | Number of retries, default is 5. | OpenAI-compatible |
| | | Retry interval (seconds), default is 10. | OpenAI-compatible |
| | | Whether to return responses in streaming mode | All |
| | | Maximum number of tokens generated | All |
| | | Nucleus sampling; only considers tokens accounting for top_p probability mass | All |
| | | Sampling temperature, range 0~2; higher means more randomness | All |
| | | Range -2.0~2.0; positive values penalize repeated tokens | OpenAI-compatible |
| | | Range -2.0~2.0; positive values penalize already appeared tokens | OpenAI-compatible |
| | | Mapping of token IDs to bias values (-100~100) | OpenAI, Grok, vLLM |
| | | Random seed | OpenAI, Google, Mistral, Groq, HuggingFace, vLLM |
| | | Whether to use sampling strategy (otherwise greedy decoding) | Transformers |
| | | Sample next token from the top_k most likely candidates | Anthropic, Google, HuggingFace, vLLM, SGLang |
| | | Whether to return log probabilities for output tokens | OpenAI, Grok, TogetherAI, HuggingFace, llama-cpp-python, vLLM, SGLang |
| | | Return the top N tokens and their probabilities (range 0~20) | OpenAI, Grok, HuggingFace, vLLM, SGLang |
| | | Whether to support parallel tool calls | OpenAI, Groq |
| | | Maximum bytes for tool output | All (default 16*1024) |
| | | Extra request body for OpenAI-compatible services | OpenAI-compatible services |
| | | Extra query parameters for OpenAI-compatible services | OpenAI-compatible services |
| | | Extra headers for OpenAI-compatible services | OpenAI-compatible services |
| | | For image generation models, specifies image height | Image generation models |
| | | For image generation models, specifies image width | Image generation models |
| | | For image models, number of inference steps | Image generation models |
| | | For image models, guidance scale | Image generation models |
Example:
# key=value format
--generation-config do_sample=true,temperature=0.5
# JSON string format (supports more complex parameters)
--generation-config '{"do_sample":true,"temperature":0.5,"extra_body": {"chat_template_kwargs":{"enable_thinking": false}}}'
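Nested values such as extra_body are easy to mis-quote by hand. As a sketch, the JSON string can be generated from a Python dictionary and shell-quoted with the standard library. The option names are taken from the example above; the variable names are illustrative.

```python
import json
import shlex

# Same options as the JSON example above, written as a Python dict.
generation_config = {
    "do_sample": True,
    "temperature": 0.5,
    "extra_body": {"chat_template_kwargs": {"enable_thinking": False}},
}

# json.dumps emits lowercase true/false, which is what a JSON parser expects.
flag_value = json.dumps(generation_config)

# shlex.quote wraps the value so a POSIX shell passes it as one argument.
command_fragment = "--generation-config " + shlex.quote(flag_value)
print(command_fragment)
```

When the command is launched from Python with an argument list instead of a shell string, the quoting step is unnecessary and `flag_value` can be passed directly.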
Dataset Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| | | Dataset name list, space-separated | - |
| | | Dataset download path | |
| | | Dataset source | |
| | | Maximum samples to evaluate per dataset | |
| | | Number of times to repeat inference on the same sample | |
| | | Dataset configuration parameters (JSON string), see table below | |
dataset-args Configuration Options
--dataset-args is a JSON string; each dataset can be configured with the following parameters:
| Parameter | Type | Description |
|---|---|---|
| | | ModelScope dataset ID or local path |
| | | Local dataset path, deprecated, please use |
| | | Timeout for evaluation samples (seconds), recommended for code tasks |
| | | Prompt template, example: |
| | | System prompt |
| | | List of dataset subsets to evaluate |
| | | Number of few-shot examples |
| | | Whether to randomly sample few-shot data |
| | | Whether to shuffle the data |
| | | Whether to shuffle choice order (multiple-choice only) |
| | | Metric list, default supports |
| | | Aggregation method for evaluation results, default is |
| | | Output filters |
| | | Whether to force re-download the dataset |
| | | Dataset-related extra parameters, refer to dataset documentation, specify |
| | | Sandbox configuration (see Sandbox Parameters below) |
sandbox_config Options:
| Parameter | Type | Description | Default |
|---|---|---|---|
| | | Docker image name | |
| | | Whether to enable networking | |
| | | Tool configuration dictionary | |
Example:
--datasets gsm8k arc ifeval hle \
--dataset-args '{
  "gsm8k": {
    "few_shot_num": 4,
    "few_shot_random": false
  },
  "arc": {
    "dataset_id": "/path/to/arc"
  },
  "ifeval": {
    "filters": {
      "remove_until": "</think>"
    }
  },
  "hle": {
    "extra_params": {
      "include_multi_modal": false
    }
  }
}'
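Because the keys of --dataset-args must line up with the names passed to --datasets, it can help to build both from one place. The sketch below assembles the two flags as an argument list and flags any configuration key that has no matching dataset. Treating such a key as an error is an assumption of this sketch, not documented EvalScope behavior.

```python
import json

datasets = ["gsm8k", "arc", "ifeval", "hle"]

# Per-dataset settings, mirroring the example above.
dataset_args = {
    "gsm8k": {"few_shot_num": 4, "few_shot_random": False},
    "arc": {"dataset_id": "/path/to/arc"},
    "ifeval": {"filters": {"remove_until": "</think>"}},
    "hle": {"extra_params": {"include_multi_modal": False}},
}

# Assumption of this sketch: a config key without a matching dataset is a typo.
unknown = sorted(set(dataset_args) - set(datasets))
if unknown:
    raise ValueError(f"dataset-args keys not listed in datasets: {unknown}")

# Argument list suitable for appending to an `evalscope eval` invocation.
argv = ["--datasets", *datasets, "--dataset-args", json.dumps(dataset_args)]
print(argv)
```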
Evaluation Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| | | Evaluation type | |
| | | Evaluation batch size, applies to the following stages: | |
| | | Evaluation backend | |
| | | Configuration file path for non-Native backends | - |
See also
Refer to the other backend usage guide
Judge Parameters
Parameters for LLM-as-a-Judge evaluation, in which a judge model determines whether an answer is correct:
| Parameter | Type | Description | Default |
|---|---|---|---|
| | | Judge model strategy | |
| | | [Deprecated] Use | |
| | | Judge model configuration (JSON string), see table below | - |
| | | Whether to generate analysis report (language auto-detected) | |
judge-model-args Configuration Options
| Parameter | Type | Description | Default |
|---|---|---|---|
| | | API key | Read from |
| | | API endpoint | Read from |
| | | Model ID | Read from |
| | | System prompt | - |
| | | Prompt template | Auto-selected based on |
| | | Generation parameters (same as | - |
| | | Judge model loading parameters (same as | |
| | | Scoring method | |
| | | Regex to parse output | |
| | | Score mapping for | |
See also
For more information on ModelScope model inference services, refer to ModelScope API Inference Services
pattern Mode Default Prompt Template
Your job is to look at a question, a gold target, and a predicted answer, and return a letter "A" or "B" to indicate whether the predicted answer is correct or incorrect.
[Question]
{question}
[Reference Answer]
{gold}
[Predicted Answer]
{pred}
Evaluate the model's answer based on correctness compared to the reference answer.
Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
numeric Mode Default Prompt Template
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response.
Begin your evaluation by providing a short explanation. Be as objective as possible.
After providing your explanation, you must rate the response on a scale of 0 (worst) to 1 (best) by strictly following this format: "[[rating]]", for example: "Rating: [[0.5]]"
[Question]
{question}
[Response]
{pred}
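To make the two templates concrete, here is a hedged Python sketch of how the placeholders could be filled and how a judge reply could be parsed in each mode. The regular expressions and function names are assumptions chosen to match the output formats the prompts request; they are not the tool's actual defaults.

```python
import re

# Placeholder section shared by the pattern-mode template above.
prompt_template = (
    "[Question]\n{question}\n"
    "[Reference Answer]\n{gold}\n"
    "[Predicted Answer]\n{pred}\n"
)
prompt = prompt_template.format(
    question="What is 2 + 2?", gold="4", pred="The answer is 4."
)


def parse_pattern_verdict(reply: str):
    """Return True for "A" (correct), False for "B" (incorrect), else None."""
    match = re.fullmatch(r'\s*"?([AB])"?\s*', reply)  # assumed regex
    if match is None:
        return None
    return match.group(1) == "A"


def parse_numeric_rating(reply: str):
    """Extract the number inside [[...]] as a float, or None if absent."""
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", reply)  # assumed regex
    return float(match.group(1)) if match else None


print(parse_pattern_verdict("A"))
print(parse_numeric_rating("The response is partly correct. Rating: [[0.5]]"))
```

Returning None for an unparseable reply keeps malformed judge output distinguishable from an "incorrect" verdict or a zero score.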
Sandbox Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| | | Whether to use ms-enclave to isolate code execution environment | |
| | | Sandbox manager configuration (JSON string) | |
| | | Sandbox type | |
Other Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| | | Evaluation output path (see directory structure below) | |
| | | Do not add timestamp to work_dir | |
| | | Reuse local cache path (e.g., | |
| | | Only rerun evaluation (reuses inference results) | |
| | | Whether to enable progress tracking, writing hierarchical evaluation progress to | |
| | | Random seed | |
| | | Whether to enable debug mode | |
| | | Whether to ignore errors during generation | |
| | | Dry run to check parameters without executing inference | |
work-dir Directory Structure Example
./outputs/{timestamp}/
├── configs/
│   └── task_config_b6f42c.yaml  # Task configuration
├── logs/
│   └── eval_log.log  # Evaluation log
├── predictions/
│   └── {model_id}/
│       └── {dataset}.jsonl  # Model inference results
├── reports/
│   └── {model_id}/
│       └── {dataset}.json  # Evaluation report
├── reviews/
│   └── {model_id}/
│       └── {dataset}.jsonl  # Evaluation result details
└── progress.json  # Progress tracking file (generated when --enable-progress-tracker is enabled)
Example progress.json format:
{
  "status": "running",
  "pipeline": "eval",
  "total_count": 14042,
  "processed_count": 5200,
  "percent": 37.03,
  "stage": {
    "name": "Evaluating", "label": "mmlu",
    "current": 1, "total": 3, "status": "running",
    "children": [
      {"name": "Predicting", "label": "mmlu@test", "current": 1000, "total": 1000, "status": "completed", "children": []},
      {"name": "Reviewing", "label": "mmlu@test", "current": 320, "total": 1000, "status": "running", "children": []}
    ]
  },
  "updated_at": "2026-03-09T10:05:42Z"
}
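The stage field is a tree, so a small recursive walk is enough to render it. The following Python sketch parses the sample above and prints one indented line per stage. It assumes only the fields visible in the sample; in practice the text would be read from the progress.json file in the work directory rather than from an inline string.

```python
import json

# Sample copied from the progress.json example above.
sample = """
{
  "status": "running",
  "pipeline": "eval",
  "total_count": 14042,
  "processed_count": 5200,
  "percent": 37.03,
  "stage": {
    "name": "Evaluating", "label": "mmlu",
    "current": 1, "total": 3, "status": "running",
    "children": [
      {"name": "Predicting", "label": "mmlu@test", "current": 1000,
       "total": 1000, "status": "completed", "children": []},
      {"name": "Reviewing", "label": "mmlu@test", "current": 320,
       "total": 1000, "status": "running", "children": []}
    ]
  },
  "updated_at": "2026-03-09T10:05:42Z"
}
"""
progress = json.loads(sample)


def walk(stage, depth=0):
    """Yield (depth, stage) for a stage and all of its descendants."""
    yield depth, stage
    for child in stage.get("children", []):
        yield from walk(child, depth + 1)


lines = [
    f"{'  ' * depth}{s['name']} [{s['label']}] {s['current']}/{s['total']} ({s['status']})"
    for depth, s in walk(progress["stage"])
]
summary = (
    f"{progress['status']}: {progress['processed_count']}/"
    f"{progress['total_count']} ({progress['percent']}%)"
)
print(summary)
print("\n".join(lines))
```

In the sample, percent is consistent with processed_count divided by total_count (5200 / 14042, rounded to two decimals).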