Speed Benchmark Testing#
To conduct speed tests and obtain a speed benchmark report similar to the official Qwen report, as shown below:
You can specify the dataset for speed testing using --dataset [speed_benchmark|speed_benchmark_long]
:
speed_benchmark
: Tests prompts of lengths [1, 6144, 14336, 30720], with a fixed output of 2048 tokens.speed_benchmark_long
: Tests prompts of lengths [63488, 129024], with a fixed output of 2048 tokens.
Online API Inference#
Note
For speed testing, the --url
should use the /v1/completions
endpoint instead of /v1/chat/completions
to avoid the additional processing of the chat template affecting input length.
evalscope perf \
--rate 1 \
--url http://127.0.0.1:8000/v1/completions \
--model qwen2.5 \
--log-every-n-query 5 \
--connect-timeout 6000 \
--read-timeout 6000 \
--max-tokens 2048 \
--min-tokens 2048 \
--api openai \
--dataset speed_benchmark \
--debug
Local Transformer Inference#
CUDA_VISIBLE_DEVICES=0 evalscope perf \
--rate 1 \
--model Qwen/Qwen2.5-0.5B-Instruct \
--attn-implementation flash_attention_2 \
--log-every-n-query 5 \
--connect-timeout 6000 \
--read-timeout 6000 \
--max-tokens 2048 \
--min-tokens 2048 \
--api local \
--dataset speed_benchmark \
--debug
Example Output:
Speed Benchmark Results:
+---------------+-----------------+----------------+
| Prompt Tokens | Speed(tokens/s) | GPU Memory(GB) |
+---------------+-----------------+----------------+
| 1 | 50.69 | 0.97 |
| 6144 | 51.36 | 1.23 |
| 14336 | 49.93 | 1.59 |
| 30720 | 49.56 | 2.34 |
+---------------+-----------------+----------------+
Local vLLM Inference#
CUDA_VISIBLE_DEVICES=0 evalscope perf \
--rate 1 \
--model Qwen/Qwen2.5-0.5B-Instruct \
--log-every-n-query 5 \
--connect-timeout 6000 \
--read-timeout 6000 \
--max-tokens 2048 \
--min-tokens 2048 \
--api local_vllm \
--dataset speed_benchmark
Example Output:
Tip
vLLM will pre-allocate GPU memory, so GPU usage is not displayed here.
Speed Benchmark Results:
+---------------+-----------------+----------------+
| Prompt Tokens | Speed(tokens/s) | GPU Memory(GB) |
+---------------+-----------------+----------------+
| 1 | 343.08 | 0.0 |
| 6144 | 334.71 | 0.0 |
| 14336 | 318.88 | 0.0 |
| 30720 | 292.86 | 0.0 |
+---------------+-----------------+----------------+