Quick Start

Environment Setup

# Option 1: install from PyPI with the extra perf dependencies
pip install 'evalscope[perf]' -U

# Option 2: install from source
git clone https://github.com/modelscope/evalscope.git
cd evalscope
pip install -e '.[perf]'

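Basic Usage

The model inference performance stress-testing tool can be launched in either of two ways, from the command line or from Python: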

Command line:

evalscope perf \
    --url "http://127.0.0.1:8000/v1/chat/completions" \
    --parallel 1 \
    --model qwen2.5 \
    --number 15 \
    --api openai \
    --dataset openqa \
    --stream

Python:

from evalscope.perf.main import run_perf_benchmark

task_cfg = {
    "url": "http://127.0.0.1:8000/v1/chat/completions",
    "parallel": 1,
    "model": "qwen2.5",
    "number": 15,
    "api": "openai",
    "dataset": "openqa",
    "stream": True,
}
run_perf_benchmark(task_cfg)

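Parameters:

  • url: URL of the endpoint that requests are sent to

  • parallel: number of requests issued in parallel

  • model: name of the model to use

  • number: total number of requests to send

  • api: API service to use

  • dataset: name of the dataset

  • stream: whether to enable streaming output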
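To compare results across concurrency levels, the parameters above can be varied programmatically. The following is a minimal sketch and not part of evalscope: `build_task_cfgs` and `run_sweep` are hypothetical helper names, and the benchmark function is passed in as an argument so the helpers can be tried without a running model service.

```python
from typing import Any, Callable, Dict, List


def build_task_cfgs(parallel_levels: List[int],
                    requests_per_level: int = 15) -> List[Dict[str, Any]]:
    """Return one task config per concurrency level (hypothetical helper)."""
    # Settings copied from the basic usage example; adjust to your service.
    base_cfg = {
        "url": "http://127.0.0.1:8000/v1/chat/completions",
        "model": "qwen2.5",
        "api": "openai",
        "dataset": "openqa",
        "stream": True,
    }
    return [
        {**base_cfg, "parallel": parallel, "number": requests_per_level}
        for parallel in parallel_levels
    ]


def run_sweep(parallel_levels: List[int],
              run_benchmark: Callable[[Dict[str, Any]], Any]) -> List[Any]:
    """Call run_benchmark once per concurrency level and collect what it returns."""
    return [run_benchmark(cfg) for cfg in build_task_cfgs(parallel_levels)]
```

With evalscope installed and a service listening at the configured URL, `run_sweep([1, 2, 4], run_perf_benchmark)` would run three benchmarks in sequence, where `run_perf_benchmark` is the function imported from `evalscope.perf.main`.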

Output

Benchmarking summary: 
+-----------------------------------+-----------------------------------------------------+
| Key                               | Value                                               |
+===================================+=====================================================+
| Time taken for tests (s)          | 10.739                                              |
+-----------------------------------+-----------------------------------------------------+
| Number of concurrency             | 1                                                   |
+-----------------------------------+-----------------------------------------------------+
| Total requests                    | 15                                                  |
+-----------------------------------+-----------------------------------------------------+
| Succeed requests                  | 15                                                  |
+-----------------------------------+-----------------------------------------------------+
| Failed requests                   | 0                                                   |
+-----------------------------------+-----------------------------------------------------+
| Throughput(average tokens/s)      | 324.059                                             |
+-----------------------------------+-----------------------------------------------------+
| Average QPS                       | 1.397                                               |
+-----------------------------------+-----------------------------------------------------+
| Average latency (s)               | 0.696                                               |
+-----------------------------------+-----------------------------------------------------+
| Average time to first token (s)   | 0.029                                               |
+-----------------------------------+-----------------------------------------------------+
| Average time per output token (s) | 0.00309                                             |
+-----------------------------------+-----------------------------------------------------+
| Average input tokens per request  | 50.133                                              |
+-----------------------------------+-----------------------------------------------------+
| Average output tokens per request | 232.0                                               |
+-----------------------------------+-----------------------------------------------------+
| Average package latency (s)       | 0.003                                               |
+-----------------------------------+-----------------------------------------------------+
| Average package per request       | 232.0                                               |
+-----------------------------------+-----------------------------------------------------+
| Expected number of requests       | 15                                                  |
+-----------------------------------+-----------------------------------------------------+
| Result DB path                    | ./outputs/20241216_194204/qwen2.5/benchmark_data.db |
+-----------------------------------+-----------------------------------------------------+

Percentile results: 
+------------+----------+----------+-------------+--------------+---------------+----------------------+
| Percentile | TTFT (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Throughput(tokens/s) |
+------------+----------+----------+-------------+--------------+---------------+----------------------+
|    10%     |  0.0202  |  0.0027  |   0.1846    |      41      |      50       |       270.8324       |
|    25%     |  0.0209  |  0.0028  |   0.2861    |      44      |      83       |       290.0714       |
|    50%     |  0.0233  |  0.0028  |   0.7293    |      49      |      250      |       335.644        |
|    66%     |  0.0267  |  0.0029  |   0.9052    |      50      |      308      |       340.2603       |
|    75%     |  0.0437  |  0.0029  |   0.9683    |      53      |      325      |       341.947        |
|    80%     |  0.0438  |  0.003   |   1.0799    |      58      |      376      |       342.7985       |
|    90%     |  0.0439  |  0.0032  |   1.2474    |      62      |      424      |       345.5268       |
|    95%     |  0.0463  |  0.0033  |   1.3038    |      66      |      431      |       348.1648       |
|    98%     |  0.0463  |  0.0035  |   1.3038    |      66      |      431      |       348.1648       |
|    99%     |  0.0463  |  0.0037  |   1.3038    |      66      |      431      |       348.1648       |
+------------+----------+----------+-------------+--------------+---------------+----------------------+
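As a sanity check, some summary values can be approximately recomputed from others. The sketch below assumes that QPS is successful requests divided by total test time and that throughput is total output tokens divided by total test time. These formulas are assumptions made for illustration; the tool's documentation shown here does not state how the values are computed.

```python
# Values copied from the "Benchmarking summary" table above.
time_taken_s = 10.739
succeed_requests = 15
avg_output_tokens_per_request = 232.0

# Assumed relationships (not confirmed by the tool's documentation):
#   QPS        = successful requests / total test time
#   throughput = total output tokens / total test time
qps = succeed_requests / time_taken_s
throughput = succeed_requests * avg_output_tokens_per_request / time_taken_s

print(f"QPS: {qps:.3f}")
print(f"Throughput: {throughput:.3f} tokens/s")
```

Under these assumptions the recomputed QPS rounds to the reported 1.397, and the recomputed throughput lands within about 0.01 tokens/s of the reported 324.059. The small gap is plausibly due to the test time being rounded to three decimals in the table.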

Metric Descriptions

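+-----------------------------------+--------------------------------------------------------------+
| Metric                            | Description                                                  |
+-----------------------------------+--------------------------------------------------------------+
| Time taken for tests (s)          | Total duration of the test, in seconds                       |
| Number of concurrency             | Number of concurrent requests                                |
| Total requests                    | Total number of requests                                     |
| Succeed requests                  | Number of successful requests                                |
| Failed requests                   | Number of failed requests                                    |
| Throughput(average tokens/s)      | Throughput: average tokens processed per second              |
| Average QPS                       | Average queries per second (QPS)                             |
| Average latency (s)               | Average latency, in seconds                                  |
| Average time to first token (s)   | Average time to first token, in seconds                      |
| Average time per output token (s) | Average time per output token, in seconds                    |
| Average input tokens per request  | Average number of input tokens per request                   |
| Average output tokens per request | Average number of output tokens per request                  |
| Average package latency (s)       | Average latency per package, in seconds                      |
| Average package per request       | Average number of packages per request                       |
| Expected number of requests       | Expected number of requests                                  |
| Result DB path                    | Path to the results database                                 |
| Percentile                        | Data is split into 100 equal parts; n% of the data points    |
|                                   | lie below the n-th percentile                                |
| TTFT (s)                          | Time to First Token: time until the first token is generated |
| TPOT (s)                          | Time Per Output Token: time to generate each output token    |
| Latency (s)                       | Time between sending a request and receiving its response    |
| Input tokens                      | Number of input tokens                                       |
| Output tokens                     | Number of output tokens                                      |
| Throughput(tokens/s)              | Throughput: tokens processed per second                      |
+-----------------------------------+--------------------------------------------------------------+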
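To make the percentile definition concrete, the sketch below computes percentiles with the nearest-rank method. This is one common convention, chosen here for illustration; it is not necessarily the method the tool uses. The function name `nearest_rank_percentile` and the latency samples are hypothetical.

```python
from typing import Sequence


def nearest_rank_percentile(values: Sequence[float], percent: int) -> float:
    """Return the smallest sample such that at least percent% of samples are <= it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100; avoids floating-point rounding issues.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical latencies in seconds for 15 requests (not taken from the run above).
latencies = [0.18, 0.21, 0.29, 0.35, 0.48, 0.61, 0.70, 0.73,
             0.85, 0.91, 0.97, 1.08, 1.17, 1.25, 1.30]

for percent in (10, 50, 90, 95, 99):
    print(f"{percent}%: {nearest_rank_percentile(latencies, percent)}")
```

With only 15 samples, the 95th and 99th percentiles both map to the largest sample under this method. That is consistent with the Percentile results table, where the 95%, 98% and 99% rows share the same Latency value.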