API address, supporting /chat/completions, /completions, and /responses endpoints
-
--name
str
Name for wandb/swanlab database result and result database
{model_name}_{current_time}
--api
str
Service API type • openai: OpenAI-compatible Chat Completions API (requires --url) • openai_responses: OpenAI official Responses API • openai_embedding: OpenAI-compatible Embedding API • openai_rerank: OpenAI/Cohere-compatible Rerank API • local: Start local transformers inference • local_vllm: Start local vLLM inference service • Custom: See Custom API Guide
-
--port
int
Port for local inference service Only applicable to local and local_vllm
8877
--attn-implementation
str
Attention implementation method Only effective when api=local
Number of concurrent requests Can input multiple values separated by spaces
1
--number
list[int]
Total number of requests to be sent Can input multiple values (must correspond one-to-one with parallel)
1000
--rate
float
Request scheduling rate (requests/second) • -1: No rate pacing; in the default closed-loop mode, requests are scheduled as fast as possible, but the number of in-flight HTTP requests is still capped by --parallel, so requests are not all sent to the server at once • >0: Requests are scheduled following a Poisson arrival model — the inter-arrival interval follows an exponential distribution with mean 1/rate, resulting in an average of rate scheduled requests per second
-1
--log-every-n-query
int
Log every N queries
100
--stream
bool
Whether to use SSE stream output Must be enabled to measure TTFT (Time to First Token) metric
True
--sleep-interval
int
Sleep time between each performance test (seconds) Helps avoid overloading the server
5
--open-loop
bool
Enable open-loop mode: dispatch requests following a Poisson arrival schedule without semaphore backpressure. Requests are fired at the rate set by --rate regardless of whether the server has finished processing previous requests. • --rate becomes the sweep variable (accepts multiple values), replacing --parallel to drive multi-run iterations • --number must have the same length as --rate; each pair (rate,number) corresponds to one independent run • --parallel is ignored in this mode (internally set to -1 / INF) See Usage Example
False
--warmup-num
float
Number or ratio of warmup requests: • 0: disabled (default) • >=1: absolute count, e.g. --warmup-num10 sends 10 warmup requests • 0<value<1: ratio mode, e.g. --warmup-num0.1 = 10% of --number Warmup requests are sent with the same concurrency/rate as the benchmark but excluded from performance metrics Useful for eliminating cold-start effects (KV-cache filling, JIT compilation, etc.) See Usage Example
0
--duration
float
Wall-clock budget for one benchmark run (seconds) Soft-exit semantics: once the deadline elapses no new requests are dispatched, but already in-flight requests are allowed to finish before exit In multi-turn mode “in-flight” means already-claimed traces run every remaining turn (trace-level soft exit, aligned with upstream trie) When combined with --number, whichever cap is hit first ends the run
None
Tip
Closed-loop (default) vs Open-loop (--open-loop) — parameter behaviour comparison:
Closed-loop (default)
Open-loop (--open-loop)
--rate
Controls request scheduling rate (-1 = no pacing, but still bounded by the --parallel concurrency cap; R = Poisson-arrival mean)
Controls dispatch rate; must be > 0; accepts multiple values (e.g. 51020), each driving one independent run
--number
Total requests per run; must match --parallel in length
Total requests per run; must match --rate in length
--parallel
Max in-flight requests; each worker waits for a response before sending the next (backpressure)
Ignored; concurrency is unbounded (INF); requests are fired on schedule without waiting for responses
Use case
Measure latency and throughput under controlled concurrency
Simulate realistic traffic (arrivals independent of service time); sweep throughput-latency curve across multiple rates
Automatically downloads OpenQA from ModelScope Prompts are relatively short (usually <100 tokens) Uses question field from jsonl file when dataset_path is specified
✓
longalpaca
Automatically downloads LongAlpaca-12k from ModelScope Prompts are much longer (generally >6000 tokens) Uses instruction field from jsonl file when dataset_path is specified
✓
line_by_line
Each line in txt file is used as a separate prompt Requires dataset_path
✓ (Required)
random
Randomly generates prompts based on prefix-length, max-prompt-length, and min-prompt-length Requires tokenizer-path Usage example
Automatically downloads Flick8k from ModelScope Builds image-text inputs; large dataset suitable for evaluating multimodal models
✗
kontext_bench
Automatically downloads Kontext-Bench from ModelScope Builds image-text inputs; approximately 1,000 samples, suitable for quick evaluation of multimodal models
✗
random_vl
Randomly generates both image and text inputs Based on random, with additional image-related parameters Usage example
✗
Embedding
Mode
Description
Supports dataset-path
embedding
Load text data from file to evaluate Embedding model Supports Line-by-line (TXT) or JSONL format (with text field)
✓ (Required)
random_embedding
Randomly generate queries based on max-prompt-length and min-prompt-length to evaluate Embedding model Must specify tokenizer-path
✗
embedding_batch
Batch send text data to evaluate Embedding model Load data from file Supports --extra-args'{"batch_size":8}' to set batch size
✓ (Required)
random_embedding_batch
Batch send randomly generated query data to evaluate Embedding model Must specify tokenizer-path Supports --extra-args'{"batch_size":8}' to set batch size
✗
Rerank
Mode
Description
Supports dataset-path
rerank
Load Query-Document pairs from file to evaluate Rerank model Supports JSONL format (with query and documents fields)
✓ (Required)
random_rerank
Randomly generate query data to evaluate Rerank model Must specify tokenizer-path Supports --extra-args'{"num_documents":10,"document_length_ratio":5}' to set number of documents and length ratio
Synthetic multi-turn conversations; each turn randomly generates a token sequence Requires --tokenizer-path and --max-turns Usage example
✗
share_gpt_zh_multi_turn
Automatically downloads the Chinese ShareGPT dataset (~70k conversations) from ModelScope, preserving full multi-turn conversations Usage example
✓
share_gpt_en_multi_turn
Automatically downloads the English ShareGPT dataset (~70k conversations) from ModelScope, preserving full multi-turn conversations
✓
custom_multi_turn
Uses a local JSONL file as a custom multi-turn dataset Each line must be a JSON array of OpenAI message dicts; ideal for benchmarking with your own conversation data Requires --dataset-path Usage example
Tokenizer weights path Used to calculate the number of tokens in input and output Usually located in the same directory as model weights
None
--frequency-penalty
float
frequency_penalty value
-
--logprobs
bool
Whether to return logarithmic probabilities
-
--max-tokens
int
Maximum number of tokens that can be generated
-
--min-tokens
int
Minimum number of tokens to generate Note: Not all model services support this parameter For vLLM>=0.8.1, you need to additionally set --extra-args'{"ignore_eos":true}'
-
--n-choices
int
Number of completion choices to generate
-
--seed
int
Random seed
None
--stop
str
Tokens that stop the generation
-
--stop-token-ids
list[int]
IDs of tokens that stop the generation
-
--temperature
float
Sampling temperature
0
--top-p
float
Top-p sampling
-
--top-k
int
Top-k sampling
-
--extra-args
str
Additional parameters to be passed in the request body JSON string format Example: '{"ignore_eos":true}'
-
--tokenize-prompt
bool
Tokenize the prompt client-side into a token-ID list and send it directly via /v1/completions, bypassing server-side re-tokenization
Enable multi-turn conversation benchmark mode; --number is the total number of turns to send and --parallel is the number of concurrent turn-level requests
False
--min-turns
int
Minimum number of user turns per conversation; used by random_multi_turn and swe_smith
1
--max-turns
int
Maximum number of user turns per conversation; required for random_multi_turn; optional for ShareGPT / custom_multi_turn (truncates long conversations); for swe_smith it’s the upper bound for per-conversation turn sampling, falling back to --min-turns when unset