Parameters

Run `evalscope perf --help` for a full description of every parameter.

Basic Settings

Parameter	Type	Description	Default
`--model`	`str`	Name or path of the test model	-
`--url`	`str`	API address; supports the `/chat/completions` and `/completions` endpoints	-
`--name`	`str`	Name used for the wandb/swanlab run and for the result database	`{model_name}_{current_time}`
`--api`	`str`	Service API type • `openai`: OpenAI-compatible API (requires `--url`) • `openai_embedding`: OpenAI-compatible Embedding API • `openai_rerank`: OpenAI/Cohere-compatible Rerank API • `local`: Start local transformers inference • `local_vllm`: Start local vLLM inference service • Custom: See Custom API Guide	-
`--port`	`int`	Port for the local inference service; only applicable to `local` and `local_vllm`	`8877`
`--attn-implementation`	`str`	Attention implementation method; only effective when `api=local`	`None` (options: `flash_attention_2`, `eager`, `sdpa`)
`--api-key`	`str`	API key	`None`
`--debug`	`bool`	Whether to output debug information	`False`

Network Configuration

Parameter	Type	Description	Default
`--total-timeout`	`int`	Total timeout for each request (seconds)	`21600` (6 hours)
`--connect-timeout`	`int`	Network connection timeout (seconds)	`None`
`--read-timeout`	`int`	Network read timeout (seconds)	`None`
`--headers`	`str`	Additional HTTP headers, in the format `key1=value1 key2=value2`; sent with every request	-
`--no-test-connection`	`bool`	Skip the connection test and start the stress test directly	`False`
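
The `--headers` format can be illustrated with a short sketch. The `parse_headers` helper below is hypothetical and is not evalscope's own parser; it only shows how a `key1=value1 key2=value2` string could map onto a header dictionary, assuming that values contain no spaces.

```python
def parse_headers(raw: str) -> dict[str, str]:
    """Hypothetical helper: turn 'k1=v1 k2=v2' into a header dictionary."""
    headers = {}
    for item in raw.split():
        # Split on the first '=' only, so a value may itself contain '='.
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        headers[key] = value
    return headers


print(parse_headers("X-Request-Source=perf-test X-Trace=abc=="))
```

Splitting on the first `=` only is a design choice of this sketch, made so that values such as base64 strings survive intact.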

Request Control

Parameter	Type	Description	Default
`--parallel`	`list[int]`	Number of concurrent requests; accepts multiple values separated by spaces	`1`
`--number`	`list[int]`	Total number of requests to send; accepts multiple values, which must correspond one-to-one with `parallel`	`1000`
`--rate`	`float`	Request generation rate (requests/second) • `-1`: No rate limit; all requests are generated immediately and placed in the queue • `> 0`: Requests are generated following a Poisson arrival model: the inter-arrival interval follows an exponential distribution with mean `1/rate` seconds, giving an average of `rate` requests per second	`-1`
`--log-every-n-query`	`int`	Log every N queries	`10`
`--stream`	`bool`	Whether to use SSE stream output; must be enabled to measure the TTFT (Time to First Token) metric	`True`
`--sleep-interval`	`int`	Sleep time between consecutive performance tests (seconds); helps avoid overloading the server	`5`
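
The one-to-one rule between `--parallel` and `--number` can be sketched as follows. The `pair_runs` helper is hypothetical, and evalscope's own validation may differ; the sketch only shows that the i-th concurrency level is paired with the i-th request count.

```python
def pair_runs(parallel: list[int], number: list[int]) -> list[tuple[int, int]]:
    """Hypothetical helper: pair each concurrency level with its request count."""
    if len(parallel) != len(number):
        raise ValueError(
            f"--parallel has {len(parallel)} values but --number has {len(number)}"
        )
    return list(zip(parallel, number))


# Values as they would appear after `--parallel 1 8 32 --number 10 80 320`.
for p, n in pair_runs([1, 8, 32], [10, 80, 320]):
    print(f"run: parallel={p}, number={n}")
```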

Tip

`--rate` and `--parallel` control two independent phases:

Generation phase (controlled by `--rate`): Requests are generated and placed into a queue at the specified rate.
- `rate=-1`: No rate limit; all requests are enqueued immediately.
- `rate=R`: Inter-arrival intervals follow an exponential distribution with mean `1/R` seconds (Poisson arrival model), giving an average of `R` requests enqueued per second.
Sending phase (controlled by `--parallel`): At most `parallel` requests are in flight at once (sent but not yet answered); each worker fetches the next request from the queue only after receiving the response to the previous one.

In short, `--rate` determines how quickly requests enter the queue, while `--parallel` determines how many requests are being sent at any given time.
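
The generation phase can be sketched with the standard library. This is a minimal illustration of the Poisson arrival model described above, not evalscope's implementation; the function name `arrival_times` and its `seed` argument are assumptions made for the example.

```python
import random


def arrival_times(rate: float, count: int, seed: int = 0) -> list[float]:
    """Return the enqueue time (seconds) of each of `count` requests."""
    if rate == -1:
        # No rate limit: every request is enqueued immediately.
        return [0.0] * count
    if rate <= 0:
        raise ValueError("rate must be -1 or a positive number")
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(count):
        # Exponentially distributed gap with mean 1/rate seconds.
        now += rng.expovariate(rate)
        times.append(now)
    return times


times = arrival_times(rate=5.0, count=10_000)
print(f"observed rate: {len(times) / times[-1]:.2f} requests/second")
```

With enough requests the observed rate settles close to the configured `rate`, while individual gaps vary widely, which is what distinguishes Poisson arrivals from evenly spaced ones.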

SLA Settings

Parameter	Type	Description	Default
`--sla-auto-tune`	`bool`	Whether to enable SLA auto-tuning mode	`False`
`--sla-variable`	`str`	Variable to auto-tune. Options: `parallel` (concurrency), `rate` (request rate)	`parallel`
`--sla-params`	`str`	SLA constraints as a JSON string. Supported metrics: `avg_latency`, `p99_latency`, `avg_ttft`, `p99_ttft`, `avg_tpot`, `p99_tpot`, `rps`, `tps`. Supported operators: `<=`, `<`, `min` (for latency metrics); `>=`, `>`, `max` (for throughput metrics). Example: `'[{"p99_latency": "<=2"}]'`	`None`
`--sla-upper-bound`	`int`	Upper bound of the tuned SLA variable search range	`65536`
`--sla-lower-bound`	`int`	Lower bound of the tuned SLA variable search range	`1`
`--sla-fixed-parallel`	`int`	Fixed parallel workers used when `--sla-variable=rate`; defaults to `--sla-upper-bound` for backward compatibility	`None`
`--sla-num-runs`	`int`	Number of runs per concurrency level; the results are averaged	`3`
`--sla-number-multiplier`	`float`	Multiplier of total requests relative to the tuned variable (concurrency or rate), i.e. `number = round(variable × N)`; defaults to `2` when not set	`None`
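
The constraint rules and the `number = round(variable × N)` formula can be checked with a small sketch. Both helpers are hypothetical and are not part of evalscope; the grouping of metrics into latency and throughput sets, and the assumption that `min` and `max` appear as the whole condition string, are inferred from the table above.

```python
import json

LATENCY_METRICS = {"avg_latency", "p99_latency", "avg_ttft",
                   "p99_ttft", "avg_tpot", "p99_tpot"}
THROUGHPUT_METRICS = {"rps", "tps"}
# Two-character operators come first so "<=" is not matched as "<".
LATENCY_OPS = ("<=", "<", "min")
THROUGHPUT_OPS = (">=", ">", "max")


def check_sla_params(raw: str) -> list[tuple[str, str]]:
    """Return (metric, operator) pairs; raise ValueError on unsupported input."""
    pairs = []
    for constraint in json.loads(raw):
        for metric, condition in constraint.items():
            if metric in LATENCY_METRICS:
                allowed = LATENCY_OPS
            elif metric in THROUGHPUT_METRICS:
                allowed = THROUGHPUT_OPS
            else:
                raise ValueError(f"unsupported metric: {metric}")
            op = next((o for o in allowed if condition.startswith(o)), None)
            if op is None:
                raise ValueError(f"operator not allowed for {metric}: {condition}")
            pairs.append((metric, op))
    return pairs


def requests_per_run(variable: int, multiplier: float = 2) -> int:
    """number = round(variable * N), with N = 2 when no multiplier is given."""
    return round(variable * multiplier)


print(check_sla_params('[{"p99_latency": "<=2"}]'))  # [('p99_latency', '<=')]
print(requests_per_run(16))  # 32
```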

Prompt Settings

Parameter	Type	Description	Default
`--max-prompt-length`	`int`	Maximum input prompt length; prompts exceeding this length are discarded	`131072`
`--min-prompt-length`	`int`	Minimum input prompt length; prompts shorter than this are discarded	`0`
`--prefix-length`	`int`	Length of the prompt prefix; only effective for the `random` dataset	`0`
`--prompt`	`str`	Request prompt, given as a string or a local file (via `@/path/to/file`); takes priority over `dataset`. Example: `@./prompt.txt`	-
`--query-template`	`str`	Query template, given as a JSON string or a local file (via `@/path/to/file`). Example: `@./query_template.json`	-
`--apply-chat-template`	`bool`	Whether to apply chat template	`None` (automatically determined based on URL suffix)
`--image-width`	`int`	Image width for random VL dataset	`224`
`--image-height`	`int`	Image height for random VL dataset	`224`
`--image-format`	`str`	Image format for random VL dataset	`RGB`
`--image-num`	`int`	Number of images for random VL dataset	`1`
`--image-patch-size`	`int`	Patch size for the image; only used for local image token calculation	`28`

Dataset Configuration

Parameter	Type	Description	Default
`--dataset`	`str`	Dataset mode, see table below for details	-
`--dataset-path`	`str`	Dataset file path; used together with `--dataset`	-

Dataset Mode Description

Mode	Description	Supports dataset-path
`openqa`	Automatically downloads OpenQA from ModelScope. Prompts are relatively short (usually <100 tokens). Uses the `question` field of the jsonl file when `dataset-path` is specified	✓
`longalpaca`	Automatically downloads LongAlpaca-12k from ModelScope. Prompts are much longer (generally >6000 tokens). Uses the `instruction` field of the jsonl file when `dataset-path` is specified	✓
`line_by_line`	Each line of the txt file is used as a separate prompt. Requires `dataset-path`	✓ (Required)
`flickr8k`	Automatically downloads Flickr8k from ModelScope. Builds image-text inputs; a large dataset suitable for evaluating multimodal models	✗
`kontext_bench`	Automatically downloads Kontext-Bench from ModelScope. Builds image-text inputs; approximately 1,000 samples, suitable for quick evaluation of multimodal models	✗
`random`	Randomly generates prompts based on `prefix-length`, `max-prompt-length`, and `min-prompt-length`. Requires `tokenizer-path` (see usage example)	✗
`random_vl`	Randomly generates both image and text inputs. Based on `random`, with additional image-related parameters (see usage example)	✗
`embedding`	Loads text data from a file to evaluate an Embedding model. Supports line-by-line TXT or JSONL format (with a `text` field)	✓ (Required)
`random_embedding`	Randomly generates queries based on `max-prompt-length` and `min-prompt-length` to evaluate an Embedding model. Requires `tokenizer-path`	✗
`embedding_batch`	Sends text data loaded from a file in batches to evaluate an Embedding model. Supports `--extra-args '{"batch_size": 8}'` to set the batch size	✓ (Required)
`random_embedding_batch`	Sends randomly generated queries, based on `max-prompt-length` and `min-prompt-length`, in batches to evaluate an Embedding model. Requires `tokenizer-path`. Supports `--extra-args '{"batch_size": 8}'` to set the batch size	✗
`rerank`	Loads Query-Document pairs from a file to evaluate a Rerank model. Supports JSONL format (with `query` and `documents` fields)	✓ (Required)
`random_rerank`	Randomly generates query data based on `max-prompt-length` and `min-prompt-length` to evaluate a Rerank model. Requires `tokenizer-path`. Supports `--extra-args '{"num_documents": 10, "document_length_ratio": 5}'` to set the number of documents and their length ratio relative to the query	✗
`custom`	Custom dataset parser See Custom Dataset Guide	✓
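
For the file-based modes, the expected JSONL layout can be sketched as below. The field names `text`, `query`, and `documents` come from the table above; the sample contents, the file names, and the assumption that `documents` is a list of strings are illustrative only.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path: Path, rows: list[dict]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


out_dir = Path(tempfile.mkdtemp())

# `embedding` mode: each line carries a `text` field.
embedding_file = out_dir / "embedding.jsonl"
write_jsonl(embedding_file, [
    {"text": "What is a tokenizer?"},
    {"text": "How does streaming output work?"},
])

# `rerank` mode: each line carries `query` and `documents` fields
# (documents assumed here to be a list of strings).
rerank_file = out_dir / "rerank.jsonl"
write_jsonl(rerank_file, [
    {"query": "What is a tokenizer?",
     "documents": ["A tokenizer splits text into tokens.",
                   "Streaming sends partial responses."]},
])

print(embedding_file)
print(rerank_file)
```

Either path would then be passed through `--dataset-path` together with the matching `--dataset` mode.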

Model Settings

Parameter	Type	Description	Default
`--tokenizer-path`	`str`	Tokenizer weights path, used to calculate the number of tokens in the input and output; usually located in the same directory as the model weights	`None`
`--frequency-penalty`	`float`	frequency_penalty value	-
`--logprobs`	`bool`	Whether to return log probabilities	-
`--max-tokens`	`int`	Maximum number of tokens that can be generated	-
`--min-tokens`	`int`	Minimum number of tokens to generate. Note: not all model services support this parameter. For `vLLM>=0.8.1`, you also need to set `--extra-args '{"ignore_eos": true}'`	-
`--n-choices`	`int`	Number of completion choices to generate	-
`--seed`	`int`	Random seed	`None`
`--stop`	`str`	Tokens that stop the generation	-
`--stop-token-ids`	`list[int]`	IDs of tokens that stop the generation	-
`--temperature`	`float`	Sampling temperature	`0`
`--top-p`	`float`	Top-p sampling	-
`--top-k`	`int`	Top-k sampling	-
`--extra-args`	`str`	Additional parameters to pass in the request body, as a JSON string. Example: `'{"ignore_eos": true}'`	-
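
Because `--extra-args` expects JSON rather than Python literals, booleans must be written as `true`/`false`. A small sketch, using only the standard library, of producing a correctly quoted value from a Python dictionary:

```python
import json
import shlex

# Python's True becomes JSON's true when serialized.
extra_args = {"ignore_eos": True}
value = json.dumps(extra_args)

print(value)               # {"ignore_eos": true}
print(shlex.quote(value))  # '{"ignore_eos": true}'
```

The single-quoted form printed last is what would be pasted after `--extra-args` in a POSIX shell.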

Data Storage

Parameter	Type	Description	Default
`--visualizer`	`str`	Visualizer to use. Options: `wandb`, `swanlab`, `clearml`. If set, metrics are saved to the specified visualizer	`None`
`--enable-progress-tracker`	`bool`	Whether to enable progress tracking, writing hierarchical stress-test progress to `progress.json` in real time, queryable via the service API	`False`
`--wandb-api-key`	`str`	wandb API key for logging metrics to wandb. Deprecated; use `--visualizer wandb` instead	-
`--swanlab-api-key`	`str`	swanlab API key for logging metrics to swanlab. Deprecated; use `--visualizer swanlab` instead	-
`--outputs-dir`	`str`	Output file path	`./outputs`

Other Parameters

Parameter	Type	Description	Default
`--db-commit-interval`	`int`	Number of rows buffered before writing results to SQLite database	`1000`
`--queue-size-multiplier`	`int`	Multiplier for the maximum size of the request queue, calculated as `parallel * multiplier`	`5`
`--in-flight-task-multiplier`	`int`	Multiplier for the maximum number of in-flight tasks, calculated as `parallel * multiplier`	`2`
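
The two multipliers translate into concrete limits as follows. The `queue_limits` helper and its dictionary keys are hypothetical names chosen for this sketch; only the `parallel * multiplier` formula and the default values come from the table above.

```python
def queue_limits(
    parallel: int,
    queue_size_multiplier: int = 5,
    in_flight_task_multiplier: int = 2,
) -> dict[str, int]:
    """Apply the `parallel * multiplier` formula to both limits."""
    return {
        "max_queue_size": parallel * queue_size_multiplier,
        "max_in_flight_tasks": parallel * in_flight_task_multiplier,
    }


print(queue_limits(parallel=8))
# {'max_queue_size': 40, 'max_in_flight_tasks': 16}
```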