Parameter#

Execute evalscope perf --help to get a full parameter description.

Basic Settings#

Parameter	Type	Description	Default
`--model`	`str`	Name or path of the test model	-
`--url`	`str`	API address, supporting `/chat/completion` and `/completion` endpoints	-
`--name`	`str`	Name for wandb/swanlab database result and result database	`{model_name}_{current_time}`
`--api`	`str`	Service API type • `openai`: OpenAI-compatible API (requires `--url`) • `openai_embedding`: OpenAI-compatible Embedding API • `openai_rerank`: OpenAI/Cohere-compatible Rerank API • `local`: Start local transformers inference • `local_vllm`: Start local vLLM inference service • Custom: See Custom API Guide	-
`--port`	`int`	Port for local inference service Only applicable to `local` and `local_vllm`	`8877`
`--attn-implementation`	`str`	Attention implementation method Only effective when `api=local`	`None` (Optional: `flash_attention_2`, `eager`, `sdpa`)
`--api-key`	`str`	API key	`None`
`--debug`	`bool`	Whether to output debug information	`False`

Network Configuration#

Parameter	Type	Description	Default
`--total-timeout`	`int`	Total timeout for each request (seconds)	`21600` (6 hours)
`--connect-timeout`	`int`	Network connection timeout (seconds)	`None`
`--read-timeout`	`int`	Network read timeout (seconds)	`None`
`--headers`	`str`	Additional HTTP headers Format: `key1=value1 key2=value2` Will be used for each query	-
`--no-test-connection`	`bool`	Do not send connection test, start stress test directly	`False`

Request Control#

Parameter	Type	Description	Default
`--parallel`	`list[int]`	Number of concurrent requests Can input multiple values separated by spaces	`1`
`--number`	`list[int]`	Total number of requests to be sent Can input multiple values (must correspond one-to-one with `parallel`)	`1000`
`--rate`	`float`	Request generation rate (requests/second) • `-1`: No rate limit; all requests are generated immediately and placed in the queue • `> 0`: Requests are generated following a Poisson arrival model — the inter-arrival interval follows an exponential distribution with mean `1/rate`, resulting in an average of `rate` requests per second	`-1`
`--log-every-n-query`	`int`	Log every N queries	`100`
`--stream`	`bool`	Whether to use SSE stream output Must be enabled to measure TTFT (Time to First Token) metric	`True`
`--sleep-interval`	`int`	Sleep time between each performance test (seconds) Helps avoid overloading the server	`5`
`--open-loop`	`bool`	Enable open-loop mode: dispatch requests following a Poisson arrival schedule without semaphore backpressure. Requests are fired at the rate set by `--rate` regardless of whether the server has finished processing previous requests. • `--rate` becomes the sweep variable (accepts multiple values), replacing `--parallel` to drive multi-run iterations • `--number` must have the same length as `--rate`; each pair `(rate, number)` corresponds to one independent run • `--parallel` is ignored in this mode (internally set to -1 / INF) See Usage Example	`False`
`--warmup-num`	`float`	Number or ratio of warmup requests: • `0`: disabled (default) • `>= 1`: absolute count, e.g. `--warmup-num 10` sends 10 warmup requests • `0 < value < 1`: ratio mode, e.g. `--warmup-num 0.1` = 10% of `--number` Warmup requests are sent with the same concurrency/rate as the benchmark but excluded from performance metrics Useful for eliminating cold-start effects (KV-cache filling, JIT compilation, etc.) See Usage Example	`0`

Tip

Closed-loop (default) vs Open-loop (--open-loop) — parameter behaviour comparison:

	Closed-loop (default)	Open-loop (`--open-loop`)
`--rate`	Controls enqueue rate (`-1` = unlimited; `R` = Poisson-arrival mean)	Controls dispatch rate; must be > 0; accepts multiple values (e.g. `5 10 20`), each driving one independent run
`--number`	Total requests per run; must match `--parallel` in length	Total requests per run; must match `--rate` in length
`--parallel`	Max in-flight requests; each worker waits for a response before sending the next (backpressure)	Ignored; concurrency is unbounded (INF); requests are fired on schedule without waiting for responses
Use case	Measure latency and throughput under controlled concurrency	Simulate realistic traffic (arrivals independent of service time); sweep throughput-latency curve across multiple rates

SLA Settings#

Parameter	Type	Description	Default
`--sla-auto-tune`	`bool`	Whether to enable SLA auto-tuning mode	`False`
`--sla-variable`	`str`	Variable for auto-tuning Options: `parallel` (concurrency), `rate` (request rate)	`parallel`
`--sla-params`	`str`	SLA constraint conditions JSON string Supported metrics: `avg_latency`, `p99_latency`, `avg_ttft`, `p99_ttft`, `avg_tpot`, `p99_tpot`, `rps`, `tps` Supported operators: `<=`, `<`, `min` (for latency metrics); `>=`, `>`, `max` (for throughput metrics) Example: `'[{"p99_latency": "<=2"}]'`	`None`
`--sla-upper-bound`	`int`	Upper bound of the tuned SLA variable search range	`65536`
`--sla-lower-bound`	`int`	Lower bound of the tuned SLA variable search range	`1`
`--sla-fixed-parallel`	`int`	Fixed parallel workers used when `--sla-variable=rate`; defaults to `--sla-upper-bound` for backward compatibility	`None`
`--sla-num-runs`	`int`	Number of runs per concurrency level (average taken)	`3`
`--sla-number-multiplier`	`float`	Multiplier of total requests relative to the tuned variable (concurrency or rate), i.e. `number = round(variable × N)`; defaults to `2` when not set	`None`

Prompt Settings#

Parameter	Type	Description	Default
`--max-prompt-length`	`int`	Maximum input prompt length Prompts exceeding this length will be discarded	`131072`
`--min-prompt-length`	`int`	Minimum input prompt length Prompts shorter than this will be discarded	`0`
`--prefix-length`	`int`	Length of the prompt prefix Only effective for `random` dataset	`0`
`--prompt`	`str`	Specify request prompt String or local file (specify via `@/path/to/file`) Higher priority than `dataset` Example: `@./prompt.txt`	-
`--query-template`	`str`	Specify query template JSON string or local file (specify via `@/path/to/file`) Example: `@./query_template.json`	-
`--apply-chat-template`	`bool`	Whether to apply chat template	`None` (automatically determined based on URL suffix)
`--image-width`	`int`	Image width for random VL dataset	`224`
`--image-height`	`int`	Image height for random VL dataset	`224`
`--image-format`	`str`	Image format for random VL dataset	`RGB`
`--image-num`	`int`	Number of images for random VL dataset	`1`
`--image-patch-size`	`int`	Patch size for the image Only used for local image token calculation	`28`

Dataset Configuration#

Parameter	Type	Description	Default
`--dataset`	`str`	Dataset mode, see table below for details	-
`--dataset-path`	`str`	Dataset file path Used in conjunction with dataset	-

Dataset Mode Description#

Text / Chat

Mode	Description	Supports dataset-path
`openqa`	Automatically downloads OpenQA from ModelScope Prompts are relatively short (usually <100 tokens) Uses `question` field from jsonl file when `dataset_path` is specified	✓
`longalpaca`	Automatically downloads LongAlpaca-12k from ModelScope Prompts are much longer (generally >6000 tokens) Uses `instruction` field from jsonl file when `dataset_path` is specified	✓
`line_by_line`	Each line in txt file is used as a separate prompt Requires `dataset_path`	✓ (Required)
`random`	Randomly generates prompts based on `prefix-length`, `max-prompt-length`, and `min-prompt-length` Requires `tokenizer-path` Usage example	✗
`custom`	Custom dataset parser See Custom Dataset Guide	✓

Multimodal

Mode	Description	Supports dataset-path
`flickr8k`	Automatically downloads Flick8k from ModelScope Builds image-text inputs; large dataset suitable for evaluating multimodal models	✗
`kontext_bench`	Automatically downloads Kontext-Bench from ModelScope Builds image-text inputs; approximately 1,000 samples, suitable for quick evaluation of multimodal models	✗
`random_vl`	Randomly generates both image and text inputs Based on `random`, with additional image-related parameters Usage example	✗

Embedding

Mode	Description	Supports dataset-path
`embedding`	Load text data from file to evaluate Embedding model Supports Line-by-line (TXT) or JSONL format (with `text` field)	✓ (Required)
`random_embedding`	Randomly generate queries based on `max-prompt-length` and `min-prompt-length` to evaluate Embedding model Must specify `tokenizer-path`	✗
`embedding_batch`	Batch send text data to evaluate Embedding model Load data from file Supports `--extra-args '{"batch_size": 8}'` to set batch size	✓ (Required)
`random_embedding_batch`	Batch send randomly generated query data to evaluate Embedding model Must specify `tokenizer-path` Supports `--extra-args '{"batch_size": 8}'` to set batch size	✗

Rerank

Mode	Description	Supports dataset-path
`rerank`	Load Query-Document pairs from file to evaluate Rerank model Supports JSONL format (with `query` and `documents` fields)	✓ (Required)
`random_rerank`	Randomly generate query data to evaluate Rerank model Must specify `tokenizer-path` Supports `--extra-args '{"num_documents": 10, "document_length_ratio": 5}'` to set number of documents and length ratio	✗

Multi-turn Conversation

Must be used with --multi-turn. See the Multi-turn Benchmark Guide for details.

Mode	Description	Supports dataset-path
`random_multi_turn`	Synthetic multi-turn conversations; each turn randomly generates a token sequence Requires `--tokenizer-path` and `--max-turns` Usage example	✗
`share_gpt_zh_multi_turn`	Automatically downloads the Chinese ShareGPT dataset (~70k conversations) from ModelScope, preserving full multi-turn conversations Usage example	✓
`share_gpt_en_multi_turn`	Automatically downloads the English ShareGPT dataset (~70k conversations) from ModelScope, preserving full multi-turn conversations	✓
`custom_multi_turn`	Uses a local JSONL file as a custom multi-turn dataset Each line must be a JSON array of OpenAI message dicts; ideal for benchmarking with your own conversation data Requires `--dataset-path` Usage example	✓ (Required)

Model Settings#

Parameter	Type	Description	Default
`--tokenizer-path`	`str`	Tokenizer weights path Used to calculate the number of tokens in input and output Usually located in the same directory as model weights	`None`
`--frequency-penalty`	`float`	frequency_penalty value	-
`--logprobs`	`bool`	Whether to return logarithmic probabilities	-
`--max-tokens`	`int`	Maximum number of tokens that can be generated	-
`--min-tokens`	`int`	Minimum number of tokens to generate Note: Not all model services support this parameter For `vLLM>=0.8.1`, you need to additionally set `--extra-args '{"ignore_eos": true}'`	-
`--n-choices`	`int`	Number of completion choices to generate	-
`--seed`	`int`	Random seed	`None`
`--stop`	`str`	Tokens that stop the generation	-
`--stop-token-ids`	`list[int]`	IDs of tokens that stop the generation	-
`--temperature`	`float`	Sampling temperature	`0`
`--top-p`	`float`	Top-p sampling	-
`--top-k`	`int`	Top-k sampling	-
`--extra-args`	`str`	Additional parameters to be passed in the request body JSON string format Example: `'{"ignore_eos": true}'`	-
`--tokenize-prompt`	`bool`	Tokenize the prompt client-side into a token-ID list and send it directly via `/v1/completions`, bypassing server-side re-tokenization	`False`

Data Storage#

Parameter	Type	Description	Default
`--visualizer`	`str`	Visualizer to use Options: `wandb`, `swanlab`, `clearml` If set, metrics will be saved to the specified visualizer	`None`
`--enable-progress-tracker`	`bool`	Whether to enable progress tracking, writing hierarchical stress-test progress to `progress.json` in real time, queryable via the service API	`False`
`--wandb-api-key`	`str`	wandb API key for logging metrics to wandb Deprecated, please use `--visualizer wandb` instead	-
`--swanlab-api-key`	`str`	swanlab API key for logging metrics to swanlab Deprecated, please use `--visualizer swanlab` instead	-
`--outputs-dir`	`str`	Output file path	`./outputs`
`--no-timestamp`	`bool`	Exclude timestamp from output directory name	`False`

Multi-turn Settings#

Parameter	Type	Description	Default
`--multi-turn`	`bool`	Enable multi-turn conversation benchmark mode; `--number` is the total number of turns to send and `--parallel` is the number of concurrent turn-level requests	`False`
`--min-turns`	`int`	Minimum number of user turns per conversation; used by `random_multi_turn` only	`1`
`--max-turns`	`int`	Maximum number of user turns per conversation; required for `random_multi_turn`; optional for ShareGPT datasets to truncate long conversations	`None`

Other Parameters#

Parameter	Type	Description	Default
`--db-commit-interval`	`int`	Number of rows buffered before writing results to SQLite database	`1000`
`--queue-size-multiplier`	`int`	Maximum size of the request queue Calculated as: `parallel * multiplier`	`5`
`--in-flight-task-multiplier`	`int`	Maximum number of in-flight tasks Calculated as: `parallel * multiplier`	`2`