Quick Start#

Environment Preparation#

# Install additional dependencies
pip install evalscope[perf] -U
git clone https://github.com/modelscope/evalscope.git
cd evalscope
pip install -e '.[perf]'

Basic Usage#

You can start the model inference performance stress testing tool using the following two methods:

evalscope perf \
    --url "http://127.0.0.1:8000/v1/chat/completions" \
    --parallel 5 \
    --model qwen2.5 \
    --number 20 \
    --api openai \
    --dataset openqa \
    --stream
from evalscope.perf.main import run_perf_benchmark

task_cfg = {"url": "http://127.0.0.1:8000/v1/chat/completions",
            "parallel": 5,
            "model": "qwen2.5",
            "number": 20,
            "api": "openai",
            "dataset": "openqa",
            "stream": True}
run_perf_benchmark(task_cfg)

Parameter Descriptions:

  • url: The URL for the request

  • parallel: The number of tasks for concurrent requests

  • model: The name of the model used

  • number: The number of requests

  • api: The API service used

  • dataset: The name of the dataset

  • stream: Whether to enable streaming

Important

To accurately measure the Time to First Token (TTFT) metric, you need to include the --stream parameter in your request. Otherwise, TTFT will be the same as Latency.

Output Results#

Benchmarking summary:
+-----------------------------------+-----------------------------------------------------------------+
| Key                               | Value                                                           |
+===================================+=================================================================+
| Time taken for tests (s)          | 3.882                                                           |
+-----------------------------------+-----------------------------------------------------------------+
| Number of concurrency             | 5                                                               |
+-----------------------------------+-----------------------------------------------------------------+
| Total requests                    | 20                                                              |
+-----------------------------------+-----------------------------------------------------------------+
| Succeed requests                  | 20                                                              |
+-----------------------------------+-----------------------------------------------------------------+
| Failed requests                   | 0                                                               |
+-----------------------------------+-----------------------------------------------------------------+
| Output token throughput (tok/s)   | 983.7757                                                        |
+-----------------------------------+-----------------------------------------------------------------+
| Total token throughput (tok/s)    | 1242.6641                                                       |
+-----------------------------------+-----------------------------------------------------------------+
| Request throughput (req/s)        | 5.152                                                           |
+-----------------------------------+-----------------------------------------------------------------+
| Average latency (s)               | 0.8416                                                          |
+-----------------------------------+-----------------------------------------------------------------+
| Average time to first token (s)   | 0.0247                                                          |
+-----------------------------------+-----------------------------------------------------------------+
| Average time per output token (s) | 0.0044                                                          |
+-----------------------------------+-----------------------------------------------------------------+
| Average input tokens per request  | 50.25                                                           |
+-----------------------------------+-----------------------------------------------------------------+
| Average output tokens per request | 190.95                                                          |
+-----------------------------------+-----------------------------------------------------------------+
| Average package latency (s)       | 0.0043                                                          |
+-----------------------------------+-----------------------------------------------------------------+
| Average package per request       | 190.95                                                          |
+-----------------------------------+-----------------------------------------------------------------+
| Expected number of requests       | 20                                                              |
+-----------------------------------+-----------------------------------------------------------------+
| Result DB path                    | outputs/20250410_133556/Qwen2.5-0.5B-Instruct/benchmark_data.db |
+-----------------------------------+-----------------------------------------------------------------+

Percentile results:
+------------+----------+---------+-------------+--------------+---------------+----------------------+
| Percentile | TTFT (s) | ITL (s) | Latency (s) | Input tokens | Output tokens | Throughput(tokens/s) |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
|    10%     |  0.0118  |  0.004  |   0.3434    |      42      |      80       |       215.1057       |
|    25%     |  0.0122  | 0.0041  |   0.5027    |      47      |      113      |       223.166        |
|    50%     |  0.0157  | 0.0042  |   0.7857    |      49      |      194      |       226.2531       |
|    66%     |  0.0161  | 0.0043  |   1.0707    |      52      |      246      |       229.2028       |
|    75%     |  0.0293  | 0.0044  |   1.1826    |      55      |      268      |       229.7484       |
|    80%     |  0.0633  | 0.0044  |   1.2737    |      58      |      290      |       230.8304       |
|    90%     |  0.0654  | 0.0046  |   1.4551    |      62      |      328      |       232.939        |
|    95%     |  0.0655  | 0.0049  |   1.4913    |      66      |      335      |       250.1984       |
|    98%     |  0.0655  | 0.0065  |   1.4913    |      66      |      335      |       250.1984       |
|    99%     |  0.0655  | 0.0072  |   1.4913    |      66      |      335      |       250.1984       |
+------------+----------+---------+-------------+--------------+---------------+----------------------+

Metric Descriptions#

Metrics

Metric

Explanation

Formula

Time taken for tests

Total time from the start to the end of the test process

Last request end time - First request start time

Number of concurrency

Number of clients sending requests simultaneously

Preset value

Total requests

Total number of requests sent during the testing process

Successful requests + Failed requests

Succeed requests

Number of requests completed successfully and returning expected results

Direct count

Failed requests

Number of requests that failed to complete successfully

Direct count

Output token throughput (tok/s)

Average number of tokens processed per second

Total output tokens / Time taken for tests

Total token throughput (tok/s)

Average number of tokens processed per second

Total tokens / Time taken for tests

Request throughput (req/s)

Average number of successful queries processed per second

Successful requests / Time taken for tests

Total latency

Sum of latency times for all successful requests

Sum of all successful request latencies

Average latency

Average time from sending a request to receiving a complete response

Total latency / Successful requests

Average time to first token

Average time from sending a request to receiving the first response token

Total first chunk latency / Successful requests

Average time per output token

Average time required to generate each output token

Total time per output token / Successful requests

Average input tokens per request

Average number of input tokens per request

Total input tokens / Successful requests

Average output tokens per request

Average number of output tokens per request

Total output tokens / Successful requests

Average package latency

Average delay time for receiving each data package

Total package time / Total packages

Average package per request

Average number of data packages received per request

Total packages / Successful requests

Percentile Metrics

Metric

Explanation

Time to First Token

Time from sending a request to generating the first token (in seconds)

Inter-token Latency

Time needed to generate each output token (in seconds)

Latency

Time from sending a request to receiving a complete response (in seconds)

Input tokens

Number of tokens inputted in the request

Output tokens

Number of tokens generated in the response

Throughput

Number of tokens outputted per second (tokens/s)

Visualizing Test Results#

Visualizing with WandB#

First, install wandb and obtain the corresponding API Key:

pip install wandb

To upload the test results to the wandb server and visualize them, add the following parameters when launching the evaluation:

# ...
--wandb-api-key 'wandb_api_key'
--name 'name_of_wandb_log'

For example:

wandb sample

Visualizing Test Results with SwanLab#

First, install SwanLab and obtain the corresponding API Key:

pip install swanlab

To upload the test results to the swanlab server and visualize them, add the following parameters when launching the evaluation:

# ...
--swanlab-api-key 'swanlab_api_key'
--name 'name_of_swanlab_log'

For example:

swanlab sample

If you prefer to use SwanLab in local dashboard mode, install swanlab dashboard first:

pip install 'swanlab[dashboard]'

and set the following parameters instead:

--swanlab-api-key local

Then, use swanlab watch <log_path> to launch the local visualization dashboard.