Parameter Description#
Execute evalscope perf --help to get a full parameter description:
Basic Settings#
--model: Name of the test model.--urlspecifies the API address, supporting two types of endpoints:/chat/completionand/completion.--name: Name for the wandb/swanlab database result and result database, default is{model_name}_{current_time}, optional.--api: Specify the service API, currently supports [openai|dashscope|local|local_vllm].Select
openaito use the API supporting OpenAI, requiring the--urlparameter.Select
dashscopeto use the API supporting DashScope, requiring the--urlparameter.Select
localto use local files as models and perform inference using transformers.--modelshould be the model file path or model_id, which will be automatically downloaded from modelscope, e.g.,Qwen/Qwen2.5-0.5B-Instruct.Select
local_vllmto use local files as models and start the vllm inference service.--modelshould be the model file path or model_id, which will be automatically downloaded from modelscope, e.g.,Qwen/Qwen2.5-0.5B-Instruct.You can also use a custom API, refer to Custom API Guide.
--port: The port for the local inference service, defaulting to 8877. This is only applicable tolocalandlocal_vllm.--attn-implementation: Attention implementation method, default is None, optional [flash_attention_2|eager|sdpa], only effective whenapiislocal.--api-key: API key, optional.--debug: Output debug information.
Network Configuration#
--connect-timeout: Network connection timeout, default is 600 seconds.--read-timeout: Network read timeout, default is 600 seconds.--headers: Additional HTTP headers, formatted askey1=value1 key2=value2. This header will be used for each query.--no-test-connection: Do not send a connection test, start the stress test directly, default is False.
Request Control#
--parallelspecifies the number of concurrent requests, and you can input multiple values separated by spaces; the default is 1.--numberindicates the total number of requests to be sent, and you can input multiple values separated by spaces (must correspond one-to-one withparallel); the default is 1000.--ratedefines the number of requests generated per second (without sending them), with a default of -1, meaning all requests are generated at time 0 with no interval; otherwise, a Poisson process is used to generate request intervals.Tip
In the implementation of this tool, request generation and sending are separate: The
--rateparameter controls the number of requests generated per second, which are placed in a request queue. The--parallelparameter controls the number of workers sending requests, with each worker retrieving requests from the queue and sending them, only proceeding to the next request after receiving a response to the previous one.--log-every-n-query: Log every n queries, default is 10.--streamuses SSE (Server-Sent Events) stream output, default is True. Note: Setting--streamis necessary to measure the Time to First Token (TTFT) metric; setting--no-streamwill disable streaming output.--sleep-interval: The sleep time between each performance test, in seconds, default is 5 seconds. This parameter can help avoid overloading the server.
Prompt Settings#
--max-prompt-length: The maximum input prompt length, default is131072. Prompts exceeding this length will be discarded.--min-prompt-length: The minimum input prompt length, default is 0. Prompts shorter than this will be discarded.--prefix-length: The length of the prompt prefix, default is 0. This is only effective for therandomdataset.--prompt: Specifies the request prompt, which can be a string or a local file. This has higher priority thandataset. When using a local file, specify the file path with@/path/to/file, e.g.,@./prompt.txt.--query-template: Specifies the query template, which can be aJSONstring or a local file. When using a local file, specify the file path with@/path/to/file, e.g.,@./query_template.json.--apply-chat-templatedetermines whether to apply the chat template, default is None. It will automatically choose based on whether the URL suffix ischat/completion.--image-widthThe image width for the random VL dataset. Default is 224.--image-heightThe image height for the random VL dataset. Default is 224.--image-formatThe image format for the random VL dataset. Default is ‘RGB’.--image-numThe number of images for the random VL dataset. Default is 1.--image-patch-sizePatch size for the image, only used for local image token calculation, default is 28.
Dataset Configuration#
--datasetsupports the following dataset modes:openqa: Automatically downloads OpenQA from ModelScope. Prompts are relatively short, usually under 100 tokens. Ifdataset_pathis specified, thequestionfield in your jsonl file will be used as the prompt.longalpaca: Automatically downloads LongAlpaca-12k from ModelScope. Prompts are much longer, generally over 6000 tokens. Ifdataset_pathis specified, theinstructionfield in your jsonl file will be used as the prompt.line_by_line: Requiresdataset_path. Each line in the txt file is used as a separate prompt.flickr8k: Automatically downloads Flick8k from ModelScope. Builds image-text inputs; this dataset is large and suitable for evaluating multimodal models.dataset_pathis not supported.kontext_bench: Automatically downloads Kontext-Bench from ModelScope. Builds image-text inputs; this dataset is smaller (about 1,000 samples), making it suitable for quick evaluation of multimodal models.dataset_pathis not supported.random: Randomly generates prompts based onprefix-length,max-prompt-length, andmin-prompt-length.tokenizer-pathis required. Usage example.random_vl: Randomly generates both image and text inputs. Based onrandom, with additional image-related parameters (image-width,image-height,image-format,image-num). Usage example.custom: Custom dataset parser. See the Custom Dataset Guide.
Model Settings#
--tokenizer-path: Optional. Specifies the tokenizer weights path, used to calculate the number of tokens in the input and output, usually located in the same directory as the model weights.--frequency-penalty: The frequency_penalty value.--logprobs: Logarithmic probabilities.--max-tokens: The maximum number of tokens that can be generated.--min-tokens: The minimum number of tokens to generate. Not all model services support this parameter; please check the corresponding API documentation. ForvLLM>=0.8.1versions, you need to additionally set--extra-args '{"ignore_eos": true}'.--n-choices: The number of completion choices to generate.--seed: The random seed, default is 0.--stop: Tokens that stop the generation.--stop-token-ids: Sets the IDs of tokens that stop the generation.--temperature: Sampling temperature, default is 0.0--top-p: Top-p sampling.--top-k: Top-k sampling.--extra-args: Additional parameters to be passed in the request body, formatted as a JSON string. For example:'{"ignore_eos": true}'.
Data Storage#
--wandb-api-key: wandb API key, if set, metrics will be saved to wandb.--swanlab-api-key: swanlab API key, if set, metrics will be saved to swanlab.--outputs-dirspecifies the output file path, with a default value of./outputs.