Parameters#

Run evalscope eval --help to get a complete list of parameter descriptions.
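
A minimal run might look like the following (the model and dataset names are only illustrative; substitute your own):

    # Evaluate a ModelScope model on gsm8k, using only 10 samples for a quick check
    evalscope eval \
      --model Qwen/Qwen2.5-0.5B-Instruct \
      --datasets gsm8k \
      --limit 10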

Model Parameters#

  • --model: The name of the model being evaluated.

    • Specify the model’s id in ModelScope, and it will automatically download the model, for example, Qwen/Qwen2.5-0.5B-Instruct;

    • Specify the local path to the model, for example, /path/to/model, to load the model from the local environment;

    • When the evaluation target is a model API endpoint (eval-type=service), specify the model_id served by that endpoint, for example, Qwen2.5-0.5B-Instruct.

  • --model-id: An alias for the model being evaluated. Defaults to the last part of model, for example, the model-id for Qwen/Qwen2.5-0.5B-Instruct is Qwen2.5-0.5B-Instruct.

  • --model-task: The task type of the model, defaults to text_generation, options are text_generation, image_generation.

  • --model-args: Model loading parameters, separated by commas in key=value format, with default parameters:

    • revision: Model version, defaults to master

    • precision: Model precision, defaults to torch.float16

    • device_map: Device allocation for the model, defaults to auto

  • --generation-config: Generation parameters, passed either as comma-separated key=value pairs or as a JSON string; both forms are parsed into a dictionary:

    • If using local model inference (based on Transformers), the following parameters are included (Full parameter guide):

      • do_sample: Whether to use sampling, default is false

      • max_length: Maximum length, default is 2048

      • max_new_tokens: Maximum length of generated text, default is 512

      • num_return_sequences: Number of sequences to generate, default is 1; setting a value greater than 1 also requires do_sample=true

      • temperature: Generation temperature

      • top_k: Top-k for generation

      • top_p: Top-p for generation

    • If using model API service for inference (eval-type set to service), the following parameters are included (please refer to the deployed model service for specifics):

      • max_tokens: Maximum length of generated text, default is 512

      • temperature: Generation temperature, default is 0.0

      • n: Number of sequences to generate, default is 1 (note: lmdeploy currently only supports n=1)

    # For example, pass arguments in the form of key=value
    --model-args revision=master,precision=torch.float16,device_map=auto
    --generation-config do_sample=true,temperature=0.5
    
    # Or pass more complex parameters using a JSON string
    --model-args '{"revision": "master", "precision": "torch.float16", "device_map": "auto"}'
    --generation-config '{"do_sample":true,"temperature":0.5,"chat_template_kwargs":{"enable_thinking": false}}'
    
  • --chat-template: Model inference template, defaults to None, indicating the use of transformers’ apply_chat_template; supports passing in a jinja template string to customize the inference template.
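
For example, a simple jinja template can be passed inline; the template below is only a sketch and should be adapted to the conversation format your model expects:

    # Pass a custom jinja chat template as a string (illustrative only)
    --chat-template '{% for message in messages %}{{ message.role }}: {{ message.content }}{% endfor %}'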

  • --template-type: Model inference template (deprecated); refer to --chat-template instead.

The following parameters are only valid when eval-type=service:

  • --api-url: Model API endpoint, default is None; supports local or remote OpenAI API format endpoints, for example http://127.0.0.1:8000/v1.

  • --api-key: Model API endpoint key, default is EMPTY

  • --timeout: Model API request timeout, default is None

  • --stream: Whether to use streaming transmission, default is False
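
For example, a service evaluation might be configured as follows; the URL, key, and model name are placeholders for your own deployment:

    # Evaluate a deployed OpenAI-compatible endpoint
    --eval-type service
    --model Qwen2.5-0.5B-Instruct
    --api-url http://127.0.0.1:8000/v1
    --api-key EMPTY
    --timeout 600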

Dataset Parameters#

  • --datasets: Dataset names; multiple datasets can be specified, separated by spaces. Datasets are automatically downloaded from ModelScope; for supported datasets, refer to the Dataset List.

  • --dataset-args: Configuration parameters for the evaluation datasets, passed as a JSON string in which each key is a dataset name and its value is that dataset's parameters; note that the keys must correspond one-to-one with the entries in the --datasets parameter:

    • dataset_id (or local_path): Local path for the dataset, once specified, it will attempt to load local data.

    • prompt_template: The prompt template for the evaluation dataset. When specified, it will be used to generate prompts. For example, the template for the gsm8k dataset is Question: {query}\nLet's think step by step\nAnswer:. The question from the dataset will be filled into the query field of the template.

    • query_template: The query template for the evaluation dataset. When specified, it will be used to generate queries. For example, the template for general_mcq is Question: {question}\n{choices}\nAnswer: {answer}\n\n. The questions from the dataset will be inserted into the question field of the template, options will be inserted into the choices field, and answers will be inserted into the answer field (answer insertion is only effective for few-shot scenarios).

    • system_prompt: System prompt for the evaluation dataset.

    • model_adapter: The model adapter for the evaluation dataset. Once specified, the given model adapter will be used for evaluation. Currently, it supports generation, multiple_choice_logits, and continuous_logits. For service evaluation, only generation is supported at the moment. Some multiple-choice datasets support logits output.

    • subset_list: List of subsets for the evaluation dataset, once specified, only subset data will be used.

    • few_shot_num: Number of few-shots.

    • few_shot_random: Whether to randomly sample few-shot data, defaults to False.

    • metric_list: A list of metrics for evaluating the dataset. When specified, the evaluation will use the given metrics. Currently supported metrics include AverageAccuracy, AveragePass@1, and Pass@[1-16]. For example, for the humaneval dataset you can specify ["Pass@1", "Pass@5"]; note that in this case you also need to set n=5 in the generation config so the model returns 5 results (see the examples below).

    • filters: Filters for the evaluation dataset. When specified, the filters are applied to the model output before scoring; this is useful for processing the output of reasoning models. Currently supported filters are:

      • remove_until {string}: Removes the part of the model’s output before the specified string. For example, the ifeval dataset can specify {"remove_until": "</think>"}, which filters out everything before </think> in the model’s output so that it does not interfere with scoring.

      • extract {regex}: Extracts the part of the model’s output that matches the specified regular expression.

    # For example
    --datasets gsm8k arc ifeval
    --dataset-args '{"gsm8k": {"few_shot_num": 4, "few_shot_random": false}, "arc": {"dataset_id": "/path/to/arc"}, "ifeval": {"filters": {"remove_until": "</think>"}}}'
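
Metrics can be configured in the same way; the example below assumes the humaneval dataset and service-based inference, where n=5 in the generation config makes the model return 5 samples per question (a non-zero temperature is assumed so the samples differ):

    # For example, compute Pass@1 and Pass@5 on humaneval
    --datasets humaneval
    --dataset-args '{"humaneval": {"metric_list": ["Pass@1", "Pass@5"]}}'
    --generation-config n=5,temperature=0.8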
    
  • --dataset-dir: Dataset download path, defaults to ~/.cache/modelscope/datasets.

  • --dataset-hub: Dataset download source, defaults to modelscope, alternative is huggingface.

  • --limit: Maximum number of samples to evaluate per dataset; if not specified, all data is evaluated. Can be used for quick validation.
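
For example, the download source and the sample limit can be combined for a quick sanity check:

    # Download datasets from HuggingFace and evaluate only the first 100 samples of each
    --dataset-hub huggingface
    --limit 100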

Evaluation Parameters#

  • --eval-batch-size: Evaluation batch size, default is 1; when eval-type=service, it indicates the number of concurrent evaluation requests, default is 8.

  • --eval-stage: (Deprecated, refer to --use-cache) Evaluation stage, options are all, infer, review, default is all.

  • --eval-type: Evaluation type, options are checkpoint, custom, service; defaults to checkpoint.

  • --eval-backend: Evaluation backend, options are Native, OpenCompass, VLMEvalKit, RAGEval, ThirdParty, defaults to Native.

    • OpenCompass is used for evaluating large language models.

    • VLMEvalKit is used for evaluating multimodal models.

    • RAGEval is used for evaluating RAG pipelines, embedding models, re-ranking models, and CLIP models.

      See also: Other evaluation backends User Guide

    • ThirdParty is used for other special task evaluations, such as ToolBench, LongBench.

  • --eval-config: This parameter needs to be passed when using a non-Native evaluation backend.
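
For example, when evaluating a deployed service, the request concurrency can be raised (the value below is illustrative):

    # Evaluate a service with 16 concurrent requests
    --eval-type service
    --eval-batch-size 16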

Judge Parameters#

LLM-as-a-Judge evaluation uses a judge model to determine whether answers are correct, and is controlled by the following parameters:

  • --judge-strategy: The strategy for using the judge model, options include:

    • auto: The default strategy, which decides whether to use the judge model based on the dataset requirements

    • llm: Always use the judge model

    • rule: Do not use the judge model, use rule-based judgment instead

    • llm_recall: First use rule-based judgment, and if it fails, then use the judge model

  • --judge-worker-num: The concurrency number for the judge model, default is 1

  • --judge-model-args: Sets the parameters for the judge model, passed as a JSON string and parsed into a dictionary, supporting the following fields:

    • api_key: API endpoint key for the model, default is EMPTY

    • api_url: API endpoint for the model, default is https://api.openai.com/v1

    • model_id: Model ID, default is gpt-3.5-turbo

    • system_prompt: (Optional) System prompt for evaluating the dataset

    • prompt_template: (Optional) Prompt template for evaluating the dataset

    • generation_config: (Optional) Generation parameters
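
For example, a judge model can be configured as follows; the model ID, URL, and key are placeholders for your own judge service:

    # Always use an LLM judge, with 8 concurrent judge requests
    --judge-strategy llm
    --judge-worker-num 8
    --judge-model-args '{"model_id": "gpt-4o-mini", "api_url": "https://api.openai.com/v1", "api_key": "sk-xxx"}'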

Other Parameters#

  • --work-dir: Output path for model evaluation, default is ./outputs/{timestamp}.

  • --use-cache: Use local cache path, default is None; if a path is specified, such as outputs/20241210_194434, it will reuse the model inference results from that path. If inference is not completed, it will continue inference and then proceed to evaluation.

  • --seed: Random seed, default is 42.

  • --debug: Whether to enable debug mode, default is false.

  • --ignore-errors: Whether to ignore errors during model generation, default is false.

  • --dry-run: Pre-check parameters without running inference; only prints the parsed parameters. Default is false.
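
For example, to resume a previous run and reuse its cached inference results (the path below is the example path mentioned above):

    # Reuse the inference results from an earlier run
    --use-cache outputs/20241210_194434
    --seed 42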