Quick Start#
Below is a quick guide for using EvalScope to conduct model inference performance testing. It supports OpenAI API format model services and various dataset formats, making it convenient for users to perform performance evaluations.
Environment Preparation#
EvalScope runs in a Python environment and can be installed via pip or from source. Both installation methods are shown below:
# Method 1: install from PyPI, including the additional dependencies for performance testing
pip install 'evalscope[perf]' -U
# Method 2: install from source
git clone https://github.com/modelscope/evalscope.git
cd evalscope
pip install -e '.[perf]'
Basic Usage#
You can start the model inference performance testing tool using the following two methods (command line/Python script):
The example below demonstrates performance testing of the Qwen2.5-0.5B-Instruct model using the vLLM framework on an A100, with fixed input of 1024 tokens and output of 1024 tokens. Users can modify parameters according to their needs.
evalscope perf \
--parallel 1 10 50 100 200 \
--number 10 20 100 200 400 \
--model Qwen2.5-0.5B-Instruct \
--url http://127.0.0.1:8801/v1/chat/completions \
--api openai \
--dataset random \
--max-tokens 1024 \
--min-tokens 1024 \
--prefix-length 0 \
--min-prompt-length 1024 \
--max-prompt-length 1024 \
--tokenizer-path Qwen/Qwen2.5-0.5B-Instruct \
--extra-args '{"ignore_eos": true}'
from evalscope.perf.main import run_perf_benchmark
from evalscope.perf.arguments import Arguments

task_cfg = Arguments(
    parallel=[1, 10, 50, 100, 200],
    number=[10, 20, 100, 200, 400],
    model='Qwen2.5-0.5B-Instruct',
    url='http://127.0.0.1:8801/v1/chat/completions',
    api='openai',
    dataset='random',
    min_tokens=1024,
    max_tokens=1024,
    prefix_length=0,
    min_prompt_length=1024,
    max_prompt_length=1024,
    tokenizer_path='Qwen/Qwen2.5-0.5B-Instruct',
    extra_args={'ignore_eos': True},
)
results = run_perf_benchmark(task_cfg)
Parameter description:
- parallel: Number of concurrent requests; multiple values can be passed, separated by spaces.
- number: Total number of requests for each concurrency level; multiple values can be passed, separated by spaces (corresponding one-to-one with parallel).
- url: Request URL.
- model: Name of the model used.
- api: API service used; the default is openai.
- dataset: Dataset name; here it is random, which indicates a randomly generated dataset. For more available (multimodal) datasets, refer to Dataset Configuration.
- tokenizer-path: Path to the model's tokenizer, used to count tokens (required for the random dataset).
- extra-args: Additional request parameters, passed as a JSON string; for example, {"ignore_eos": true} means the end token is ignored.
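The one-to-one correspondence between parallel and number can be illustrated with a small sketch. The helper name pair_runs is hypothetical and not part of EvalScope; whether EvalScope itself rejects lists of different lengths is an assumption not confirmed here.

```python
# Illustrative only: pair each concurrency level with its total request count.
# `pair_runs` is a hypothetical helper, not an EvalScope function.

def pair_runs(parallel, number):
    """Return (concurrency, total_requests) pairs, rejecting mismatched lists."""
    if len(parallel) != len(number):
        raise ValueError("parallel and number must have the same length")
    return list(zip(parallel, number))


# Values taken from the example command above.
parallel = [1, 10, 50, 100, 200]
number = [10, 20, 100, 200, 400]

for concurrency, total in pair_runs(parallel, number):
    print(f"concurrency={concurrency:>3}  total requests={total}")
```

With the values above, the run at concurrency 200 sends 400 requests in total, which matches the sample report in the Output Results section.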
Output Results#
The output test report summary is shown in the image below, including basic information, metrics for each concurrency, and performance test suggestions:

Note
The stress test report shown in the image summarizes test results across multiple concurrency levels, so users can compare model performance under different concurrency settings. No summary report is generated for a single concurrency level.
This report is saved to outputs/<timestamp>/<model>/performance_summary.txt, which users can view as needed.
For explanations of the metrics in the tables below, refer to the "Metric Descriptions" section that follows. Results are saved to outputs/<timestamp>/<model>/benchmark.log.
Additionally, the test results for each concurrency level are output separately, including metrics such as the number of requests, successful requests, failed requests, average latency, and average latency per token for each concurrency level.
Benchmarking summary:
┌───────────────────────────┬───────────────┐
│ Metric                    │ Value         │
├───────────────────────────┼───────────────┤
│ ── General ──             │               │
│ Test Duration (s)         │ 38.31         │
│ Concurrency               │ 200           │
│ Request Rate (req/s)      │ -1.00         │
│ Total / Success / Failed  │ 400 / 400 / 0 │
│ Req Throughput (req/s)    │ 10.44         │
│ ── Latency ──             │               │
│ Avg Latency (s)           │ 18.78         │
│ TTFT (ms)                 │ 717.63        │
│ TPOT (ms)                 │ 17.66         │
│ ITL (ms)                  │ 17.64         │
│ ── Tokens ──              │               │
│ Avg Input Tokens          │ 1024.00       │
│ Avg Output Tokens         │ 1024.00       │
│ Output Throughput (tok/s) │ 10692.45      │
│ Total Throughput (tok/s)  │ 21384.91      │
└───────────────────────────┴───────────────┘
Percentile results:
┌────────────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Metric         │ 1%      │ 5%      │ 10%     │ 25%     │ 50%     │ 75%     │ 90%     │ 95%     │ 99%     │
├────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Latency (s)    │ 17.25   │ 17.46   │ 17.74   │ 18.14   │ 18.71   │ 19.23   │ 20.28   │ 20.62   │ 20.84   │
│ TTFT (ms)      │ 176.30  │ 212.60  │ 246.76  │ 279.87  │ 333.79  │ 1082.24 │ 1812.13 │ 2097.23 │ 2319.52 │
│ ITL (ms)       │ 0.00    │ 0.01    │ 0.01    │ 10.62   │ 15.67   │ 20.37   │ 26.97   │ 32.50   │ 50.11   │
│ TPOT (ms)      │ 16.61   │ 16.80   │ 17.02   │ 17.31   │ 17.76   │ 18.06   │ 18.18   │ 18.28   │ 18.36   │
│ Input tokens   │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │
│ Output tokens  │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │
│ Output (tok/s) │ 49.15   │ 49.75   │ 50.55   │ 53.33   │ 54.79   │ 56.48   │ 57.73   │ 58.73   │ 59.45   │
│ Total (tok/s)  │ 98.30   │ 99.50   │ 101.09  │ 106.66  │ 109.58  │ 112.97  │ 115.46  │ 117.45  │ 118.89  │
│ Decode (tok/s) │ 54.46   │ 54.74   │ 55.07   │ 55.38   │ 56.31   │ 57.80   │ 58.77   │ 59.55   │ 60.24   │
└────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
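The saved report files mentioned above can be located programmatically. The following is a minimal sketch that assumes the outputs/<timestamp>/<model>/ layout described in the note, and that the model directory name contains no path separators. find_reports is a hypothetical helper, not an EvalScope function.

```python
from pathlib import Path


def find_reports(output_root="outputs", filename="performance_summary.txt"):
    """Return report files under <output_root>/<timestamp>/<model>/, newest first.

    Hypothetical helper: assumes exactly two directory levels below output_root.
    """
    root = Path(output_root)
    # One wildcard for <timestamp>, one for <model>.
    matches = root.glob(f"*/*/{filename}")
    return sorted(matches, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    reports = find_reports()
    if reports:
        # Show the most recent summary report.
        print(reports[0].read_text(encoding="utf-8"))
    else:
        print("No performance_summary.txt found under outputs/")
```

Passing filename="benchmark.log" looks up the log file in the same way.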
Metric Descriptions#
Metrics
| Metric | Explanation | Formula |
|---|---|---|
| Test Duration (s) | Total time from the start to the end of the test process | Last request end time - First request start time |
| Concurrency | Number of clients sending requests simultaneously | Preset value |
| Request Rate (req/s) | Target request sending rate, -1 means unlimited | Preset value |
| Total / Success / Failed | Total number of requests sent during the test and their success/failure counts | Successful requests + Failed requests |
| Req Throughput (req/s) | Average number of successful requests processed per second | Successful requests / Test Duration |
| Avg Latency (s) | Average time from sending a request to receiving a complete response | Total latency / Successful requests |
| TTFT (ms) | Average time from sending a request to receiving the first response token | Total first chunk latency / Successful requests |
| TPOT (ms) | Average time required to generate each output token (excluding first token) | Total time per output token / Successful requests |
| ITL (ms) | Average time interval between generating each output token | Total inter-token latency / Successful requests |
| Avg Input Tokens | Average number of input tokens per request | Total input tokens / Successful requests |
| Avg Output Tokens | Average number of output tokens per request | Total output tokens / Successful requests |
| Output Throughput (tok/s) | Average number of output tokens processed per second | Total output tokens / Test Duration |
| Total Throughput (tok/s) | Average number of tokens (input + output) processed per second | (Total input tokens + Total output tokens) / Test Duration |
| Avg Input Turns | Average number of historical dialogue turns per request in multi-turn conversations (1 for single-turn) | Total input turns / Successful requests |
| KV Cache Hit Rate (%) | Estimated prefix cache hit rate based on the ratio of cached tokens to input tokens (requires prefix caching enabled on the server and | Total cached tokens / Total input tokens × 100% |
| Avg Decoded Tokens/Iter | In speculative decoding, the average number of tokens accepted per model forward pass (iteration), reflecting draft model accuracy | (Total output tokens - 1) / (Total chunks - 1) |
| Spec Decode Acceptance (%) | Approximate token acceptance rate for speculative decoding, derived from decoded tokens per iter; closer to 1 means a more accurate draft model | 1 - 1 / (Avg Decoded Tokens/Iter) |
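As a rough check, the throughput formulas above can be applied to the sample report (400 successful requests, 1024 input and 1024 output tokens each, 38.31 s test duration). This is a sketch with hypothetical function names, not EvalScope's implementation. Because the reported duration is rounded to two decimals, the results agree with the report only approximately.

```python
# Sketch of the table's formulas; function names are illustrative, not EvalScope APIs.

def req_throughput(successful_requests, test_duration_s):
    """Successful requests / Test Duration."""
    return successful_requests / test_duration_s


def output_throughput(total_output_tokens, test_duration_s):
    """Total output tokens / Test Duration."""
    return total_output_tokens / test_duration_s


def total_throughput(total_input_tokens, total_output_tokens, test_duration_s):
    """(Total input tokens + Total output tokens) / Test Duration."""
    return (total_input_tokens + total_output_tokens) / test_duration_s


def spec_decode_acceptance(avg_decoded_tokens_per_iter):
    """1 - 1 / (Avg Decoded Tokens/Iter), returned as a fraction."""
    return 1 - 1 / avg_decoded_tokens_per_iter


# Inputs read from the sample report above.
success = 400
duration_s = 38.31
total_in = success * 1024
total_out = success * 1024

print(f"Req throughput:    {req_throughput(success, duration_s):.2f} req/s")
print(f"Output throughput: {output_throughput(total_out, duration_s):.2f} tok/s")
print(f"Total throughput:  {total_throughput(total_in, total_out, duration_s):.2f} tok/s")
```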
Percentile Metrics
| Metric | Explanation |
|---|---|
| Latency (s) | The time from sending a request to receiving a complete response (in seconds): TTFT + TPOT * Output tokens. |
| TTFT (ms) | The time from sending a request to generating the first token (in milliseconds), assessing the initial packet delay. |
| ITL (ms) | The time interval between generating each output token (in milliseconds), assessing the smoothness of output. |
| TPOT (ms) | The time required to generate each output token (excluding the first token, in milliseconds), assessing decoding speed. |
| Input tokens | The number of tokens input in the request. |
| Output tokens | The number of tokens generated in the response. |
| Output (tok/s) | The number of tokens output per second: Output tokens / Latency. |
| Total (tok/s) | The number of tokens processed per second: (Input tokens + Output tokens) / Latency. |
| Decode (tok/s) | The number of decoded output tokens per second. |
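The per-request relationships in the table above can be sketched as follows. The function names are hypothetical, and the inputs are the averages and medians from the sample report, so the outputs only approximate the reported values (percentiles are computed per metric, not derived from one another).

```python
# Sketch of the per-request formulas; names are illustrative, not EvalScope APIs.

def estimated_latency_s(ttft_ms, tpot_ms, output_tokens):
    """Latency = TTFT + TPOT * Output tokens, converted from ms to seconds."""
    return (ttft_ms + tpot_ms * output_tokens) / 1000.0


def output_tokens_per_s(output_tokens, latency_s):
    """Output (tok/s) = Output tokens / Latency."""
    return output_tokens / latency_s


def total_tokens_per_s(input_tokens, output_tokens, latency_s):
    """Total (tok/s) = (Input tokens + Output tokens) / Latency."""
    return (input_tokens + output_tokens) / latency_s


# Average TTFT and TPOT from the summary table; compare with Avg Latency (18.78 s).
print(f"Estimated latency: {estimated_latency_s(717.63, 17.66, 1024):.2f} s")

# Median latency from the percentile table; compare with the 50% columns.
print(f"Output rate: {output_tokens_per_s(1024, 18.71):.2f} tok/s")
print(f"Total rate:  {total_tokens_per_s(1024, 1024, 18.71):.2f} tok/s")
```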
Visualizing Test Results#
Using WandB#
First, install wandb and obtain the corresponding API Key:
pip install wandb
To upload the test results to the wandb server and visualize them, add the following parameters when launching the evaluation:
# ...
--visualizer wandb
--name 'name_of_wandb_log'
For example:

Using SwanLab#
First, install SwanLab and obtain the corresponding API Key:
pip install swanlab
To upload the test results to the swanlab server and visualize them, add the following parameters when launching the evaluation:
# ...
--visualizer swanlab
--name 'name_of_swanlab_log'
For example:

If you prefer to use SwanLab in local dashboard mode, install swanlab dashboard first:
pip install 'swanlab[dashboard]'
and set the following parameters instead:
--swanlab-api-key local
Then, use swanlab watch <log_path> to launch the local visualization dashboard.
Using ClearML#
Please install ClearML using the following command:
pip install clearml
Initialize the ClearML client configuration (the credentials used to connect to a ClearML server):
clearml-init
Add the following parameters before starting the test:
# You can use the CLEARML_PROJECT_NAME environment variable to specify the project name
--visualizer clearml
--name 'name_of_clearml_task'
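The visualizer options above are appended to a regular evalscope perf command. As a hedged sketch, the snippet below assembles such a command from Python using only the flags shown in this guide. visualizer_args is a hypothetical helper, and the base options are copied from the Basic Usage example with a single concurrency level.

```python
import shlex

# Visualizer names taken from the sections above.
SUPPORTED_VISUALIZERS = ("wandb", "swanlab", "clearml")


def visualizer_args(visualizer, name):
    """Return the extra CLI arguments for the chosen visualizer (hypothetical helper)."""
    if visualizer not in SUPPORTED_VISUALIZERS:
        raise ValueError(f"unsupported visualizer: {visualizer!r}")
    return ["--visualizer", visualizer, "--name", name]


# Base command reusing options from the Basic Usage example.
base_cmd = [
    "evalscope", "perf",
    "--parallel", "1",
    "--number", "10",
    "--model", "Qwen2.5-0.5B-Instruct",
    "--url", "http://127.0.0.1:8801/v1/chat/completions",
    "--api", "openai",
    "--dataset", "random",
    "--max-tokens", "1024",
    "--min-tokens", "1024",
    "--prefix-length", "0",
    "--min-prompt-length", "1024",
    "--max-prompt-length", "1024",
    "--tokenizer-path", "Qwen/Qwen2.5-0.5B-Instruct",
]

cmd = base_cmd + visualizer_args("swanlab", "name_of_swanlab_log")
# Print a copy-pasteable shell command; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```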
