vLLM Bench vs Evalscope Perf Load Testing Comparison
Goal: To present a reproducible, comparable, and extensible load testing methodology, enabling evalscope perf and vllm bench serve to output consistent request metrics and statistics for the same vLLM model service.
Conclusion:
With identical request parameters and concurrency configurations, evalscope perf achieves metrics (TTFT, TPOT, ITL, throughput, etc.) consistent with vllm bench serve.
This guide provides parameter mappings and validation steps to help you quickly reproduce and extend the tests.
TL;DR: Quick Comparison Recipe
Start vLLM OpenAI Chat Service
VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0 \
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-0.5B-Instruct \
--gpu-memory-utilization 0.5 \
--served-model-name Qwen2.5-0.5B-Instruct \
--trust-remote-code \
--port 8801 \
--no-enable-prefix-caching
Load Test with vLLM Bench (Random Input)
vllm bench serve \
--max-concurrency 50 \
--num-prompts 1000 \
--host 127.0.0.1 \
--port 8801 \
--backend openai-chat \
--model Qwen2.5-0.5B-Instruct \
--dataset-name random \
--random-input-len 100 \
--random-output-len 100 \
--tokenizer /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-0.5B-Instruct \
--endpoint /v1/chat/completions \
--ignore-eos
Load Test with Evalscope Perf (Random Input)
evalscope perf \
--parallel 50 \
--number 1000 \
--log-every-n-query 1000 \
--model Qwen2.5-0.5B-Instruct \
--url http://127.0.0.1:8801/v1/chat/completions \
--api openai \
--dataset random \
--max-tokens 100 \
--prefix-length 0 \
--min-prompt-length 100 \
--max-prompt-length 100 \
--tokenizer-path Qwen/Qwen2.5-0.5B-Instruct \
--extra-args '{"ignore_eos": true}'
Environment and Prerequisites
Hardware: A100 80GB GPU
Versions:
vLLM: v0.17.0
evalscope: v1.5.0
Notes:
When using ModelScope weights, set VLLM_USE_MODELSCOPE=True and provide the corresponding tokenizer path.
Ensure the endpoint uses OpenAI Chat /v1/chat/completions to maintain consistent request structures.
Unified Server Configuration
Using the OpenAI Chat endpoint (recommended):
VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0 \
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-0.5B-Instruct \
--gpu-memory-utilization 0.5 \
--served-model-name Qwen2.5-0.5B-Instruct \
--trust-remote-code \
--port 8801 \
--no-enable-prefix-caching
Tips:
Set --gpu-memory-utilization conservatively to prevent OOM issues.
If customizing --served-model-name, ensure the --model value passed to the benchmarking tools matches it.
Use --no-enable-prefix-caching to avoid caching effects on load test results.
Parameter Alignment Guide (Key Mappings)
To maintain an apples-to-apples comparison, align the following parameters:
Concurrency and Requests
vLLM: --max-concurrency ↔ Evalscope: --parallel
vLLM: --num-prompts ↔ Evalscope: --number
Endpoint and Protocol
vLLM: --backend openai-chat / --endpoint /v1/chat/completions
Evalscope: --api openai + --url http://host:port/v1/chat/completions
Model Name
vLLM: --model ...
Evalscope: --model ... (used for populating model field in request body)
Data Generation (Random)
vLLM: --dataset-name random --random-input-len N --random-output-len M
Evalscope: --dataset random --min-prompt-length N --max-prompt-length N --max-tokens M --prefix-length 0
Tokenizer (Consistent Template)
vLLM: --tokenizer /path/to/tokenizer
Evalscope: --tokenizer-path <same-as-above-or-model-id>
Decoding Control
vLLM: --ignore-eos
Evalscope: --extra-args '{"ignore_eos": true}'
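One way to keep the two command lines aligned is to generate both from a single set of shared values. The following is a minimal sketch, not part of either tool: the dictionary keys and the helper names vllm_bench_args and evalscope_perf_args are hypothetical, while every flag is taken from the mappings above. The optional --log-every-n-query flag is left out because it only affects logging.

```python
import shlex

# Shared settings; the values mirror the commands used in this guide.
# The dictionary keys are illustrative names, not options of either tool.
shared = {
    "host": "127.0.0.1",
    "port": 8801,
    "model": "Qwen2.5-0.5B-Instruct",
    "concurrency": 50,
    "requests": 1000,
    "input_len": 100,
    "output_len": 100,
    "vllm_tokenizer": "/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-0.5B-Instruct",
    "evalscope_tokenizer": "Qwen/Qwen2.5-0.5B-Instruct",
}


def vllm_bench_args(cfg):
    """Build the vllm bench serve argument list from the shared settings."""
    return [
        "vllm", "bench", "serve",
        "--max-concurrency", str(cfg["concurrency"]),
        "--num-prompts", str(cfg["requests"]),
        "--host", cfg["host"],
        "--port", str(cfg["port"]),
        "--backend", "openai-chat",
        "--model", cfg["model"],
        "--dataset-name", "random",
        "--random-input-len", str(cfg["input_len"]),
        "--random-output-len", str(cfg["output_len"]),
        "--tokenizer", cfg["vllm_tokenizer"],
        "--endpoint", "/v1/chat/completions",
        "--ignore-eos",
    ]


def evalscope_perf_args(cfg):
    """Build the evalscope perf argument list from the same settings."""
    url = f"http://{cfg['host']}:{cfg['port']}/v1/chat/completions"
    return [
        "evalscope", "perf",
        "--parallel", str(cfg["concurrency"]),
        "--number", str(cfg["requests"]),
        "--model", cfg["model"],
        "--url", url,
        "--api", "openai",
        "--dataset", "random",
        "--max-tokens", str(cfg["output_len"]),
        "--prefix-length", "0",
        # A fixed prompt length is expressed as min == max.
        "--min-prompt-length", str(cfg["input_len"]),
        "--max-prompt-length", str(cfg["input_len"]),
        "--tokenizer-path", cfg["evalscope_tokenizer"],
        "--extra-args", '{"ignore_eos": true}',
    ]


# Print shell-quoted commands that can be pasted into a terminal.
print(shlex.join(vllm_bench_args(shared)))
print(shlex.join(evalscope_perf_args(shared)))
```

Changing a value such as the concurrency in one place then updates both commands, which reduces the chance of a mismatch between the two runs.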
Consistency Validation: Minimum Example (1 Concurrent / 1 Request)
vLLM:
vllm bench serve \
--max-concurrency 1 \
--num-prompts 1 \
--host 127.0.0.1 \
--port 8801 \
--backend openai-chat \
--model Qwen2.5-0.5B-Instruct \
--dataset-name random \
--random-input-len 100 \
--random-output-len 100 \
--tokenizer /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-0.5B-Instruct \
--endpoint /v1/chat/completions \
--ignore-eos
Sample Logs:
INFO ... Received request ... params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.1, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
Evalscope:
evalscope perf \
--parallel 1 \
--number 1 \
--log-every-n-query 500 \
--model Qwen2.5-0.5B-Instruct \
--url http://127.0.0.1:8801/v1/chat/completions \
--api openai \
--dataset random \
--max-tokens 100 \
--prefix-length 0 \
--min-prompt-length 100 \
--max-prompt-length 100 \
--tokenizer-path Qwen/Qwen2.5-0.5B-Instruct \
--extra-args '{"ignore_eos": true}'
Sample Logs:
INFO ... Received request ... params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.1, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
Comparison Results: The server logs show identical SamplingParams for both tools, so the request parameters are consistent (metric methodologies for TTFT, TPOT, and ITL also align).
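Comparing two long SamplingParams log lines by eye is error-prone, so the check can be automated. The sketch below is an illustration under stated assumptions: the helper names parse_sampling_params and diff_sampling_params are hypothetical, the parser assumes the flat key=value layout seen in the sample logs (no nested parentheses), and the two example lines are abbreviated to a few fields rather than copied in full.

```python
import re


def parse_sampling_params(log_line):
    """Extract key=value pairs from the SamplingParams(...) part of a log line."""
    match = re.search(r"SamplingParams\((.*?)\)", log_line)
    if match is None:
        raise ValueError("no SamplingParams(...) found in log line")
    # A value is either a bracketed list (which may contain commas) or a scalar.
    pairs = re.findall(r"(\w+)=(\[[^\]]*\]|[^,]+)", match.group(1))
    return {key: value.strip() for key, value in pairs}


def diff_sampling_params(line_a, line_b):
    """Return {field: (value_a, value_b)} for every field that differs."""
    a = parse_sampling_params(line_a)
    b = parse_sampling_params(line_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Abbreviated stand-ins for the two server log lines (subset of fields only).
vllm_line = (
    "INFO ... params: SamplingParams(n=1, repetition_penalty=1.1, "
    "temperature=0.0, stop=[], ignore_eos=True, max_tokens=100), "
    "prompt_token_ids: None"
)
evalscope_line = vllm_line

# An empty dict means every parsed field matches.
print(diff_sampling_params(vllm_line, evalscope_line))
```

In practice the two arguments would be the full lines copied from the vLLM server log, one per tool.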
Full Load Test: 50 Concurrency / 1000 Requests
vLLM:
vllm bench serve \
--max-concurrency 50 \
--num-prompts 1000 \
--host 127.0.0.1 \
--port 8801 \
--backend openai-chat \
--model Qwen2.5-0.5B-Instruct \
--dataset-name random \
--random-input-len 100 \
--random-output-len 100 \
--tokenizer /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-0.5B-Instruct \
--endpoint /v1/chat/completions \
--ignore-eos
Evalscope:
evalscope perf \
--parallel 50 \
--number 1000 \
--log-every-n-query 1000 \
--model Qwen2.5-0.5B-Instruct \
--url http://127.0.0.1:8801/v1/chat/completions \
--api openai \
--dataset random \
--max-tokens 100 \
--prefix-length 0 \
--min-prompt-length 100 \
--max-prompt-length 100 \
--tokenizer-path Qwen/Qwen2.5-0.5B-Instruct \
--extra-args '{"ignore_eos": true}'
vLLM Output:
============ Serving Benchmark Result ============
Successful requests: 1000
Failed requests: 0
Maximum request concurrency: 50
Benchmark duration (s): 9.25
Total input tokens: 100000
Total generated tokens: 100000
Request throughput (req/s): 108.08
Output token throughput (tok/s): 10808.22
Peak output token throughput (tok/s): 11399.00
Peak concurrent requests: 176.00
Total token throughput (tok/s): 21616.43
---------------Time to First Token----------------
Mean TTFT (ms): 73.18
Median TTFT (ms): 74.81
P99 TTFT (ms): 144.48
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 3.85
Median TPOT (ms): 3.85
P99 TPOT (ms): 4.14
---------------Inter-token Latency----------------
Mean ITL (ms): 3.86
Median ITL (ms): 3.63
P99 ITL (ms): 12.26
==================================================
Evalscope Output:
Benchmarking summary:
+-----------------------------------+------------+
| Key | Value |
+===================================+============+
| Time taken for tests (s) | 9.4961 |
+-----------------------------------+------------+
| Number of concurrency | 50 |
+-----------------------------------+------------+
| Request rate (req/s) | -1 |
+-----------------------------------+------------+
| Total requests | 1000 |
+-----------------------------------+------------+
| Succeed requests | 1000 |
+-----------------------------------+------------+
| Failed requests | 0 |
+-----------------------------------+------------+
| Output token throughput (tok/s) | 10530.7 |
+-----------------------------------+------------+
| Total token throughput (tok/s) | 21061.2 |
+-----------------------------------+------------+
| Request throughput (req/s) | 105.307 |
+-----------------------------------+------------+
| Average latency (s) | 0.4663 |
+-----------------------------------+------------+
| Average time to first token (s) | 0.1131 |
+-----------------------------------+------------+
| Average time per output token (s) | 0.0036 |
+-----------------------------------+------------+
| Average inter-token latency (s) | 0.0037 |
+-----------------------------------+------------+
| Average input tokens per request | 99.999 |
+-----------------------------------+------------+
| Average output tokens per request | 100 |
+-----------------------------------+------------+
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 0.0532 | 0.0 | 0.003 | 0.3699 | 100 | 100 | 167.0738 | 334.1475 |
| 25% | 0.0802 | 0.0025 | 0.003 | 0.3836 | 100 | 100 | 190.1434 | 380.2868 |
| 50% | 0.0949 | 0.0029 | 0.0032 | 0.4225 | 100 | 100 | 236.7219 | 473.4438 |
| 66% | 0.1054 | 0.0031 | 0.0039 | 0.4846 | 100 | 100 | 253.8581 | 507.7162 |
| 75% | 0.1136 | 0.0033 | 0.004 | 0.526 | 100 | 100 | 260.8281 | 521.6561 |
| 80% | 0.1398 | 0.0036 | 0.0041 | 0.5509 | 100 | 100 | 264.4042 | 528.8084 |
| 90% | 0.163 | 0.0052 | 0.0043 | 0.5985 | 100 | 100 | 270.4783 | 540.9567 |
| 95% | 0.4063 | 0.0067 | 0.005 | 0.738 | 100 | 100 | 276.9653 | 553.9306 |
| 98% | 0.4287 | 0.0108 | 0.0055 | 0.8134 | 100 | 100 | 287.3055 | 574.6111 |
| 99% | 0.4302 | 0.0141 | 0.0059 | 0.8161 | 100 | 100 | 293.9073 | 587.8146 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
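As a sanity check on the two reports, the throughput figures can be re-derived from the request and token counts and the test duration. The sketch below assumes throughput is simply a count divided by wall-clock duration; the helper name throughput is hypothetical, and all numbers are copied from the outputs above. The Evalscope input token total of 99999 is inferred from 1000 requests at an average of 99.999 input tokens.

```python
def throughput(count, duration_s):
    """Items per second over the whole benchmark run."""
    return count / duration_s


# Values copied from the two reports above.
reports = {
    "vLLM": {
        "duration_s": 9.25,
        "requests": 1000,
        "output_tokens": 100000,
        "reported_req_s": 108.08,
        "reported_out_tok_s": 10808.22,
    },
    "Evalscope": {
        "duration_s": 9.4961,
        "requests": 1000,
        "output_tokens": 100000,
        "reported_req_s": 105.307,
        "reported_out_tok_s": 10530.7,
    },
}

for tool, r in reports.items():
    req_s = throughput(r["requests"], r["duration_s"])
    tok_s = throughput(r["output_tokens"], r["duration_s"])
    print(
        f"{tool}: derived {req_s:.2f} req/s (reported {r['reported_req_s']}), "
        f"derived {tok_s:.1f} tok/s (reported {r['reported_out_tok_s']})"
    )

# Evalscope total token throughput: input plus output tokens over the duration.
evalscope_total = throughput(99999 + 100000, 9.4961)
print(f"Evalscope total: derived {evalscope_total:.1f} tok/s (reported 21061.2)")
```

Small gaps between derived and reported values are expected because the printed durations are rounded.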
Metric Definitions and Naming Correspondence
To better understand the outputs from both tools, here’s a comparison of key metric definitions and the naming conventions used by vLLM and Evalscope:
Successful/Failed Requests
vLLM: Successful requests / Failed requests
Evalscope: Succeed requests / Failed requests
Request Throughput
vLLM: Request throughput (req/s)
Evalscope: Request throughput (req/s)
Token Throughput
vLLM: Output token throughput / Total token throughput
Evalscope: Output token throughput / Total token throughput
Latency Metrics (Unit Differences)
vLLM: Reports in milliseconds (ms) for TTFT (Time to First Token), TPOT (Time Per Output Token), and ITL (Inter-Token Latency).
Evalscope: Reports in seconds (s) for TTFT, TPOT, and ITL, as well as for overall latency.
Conversion between the two: 1 ms = 0.001 s
Percentiles
vLLM: Reports the mean, median, and P99 of each latency metric.
Evalscope: Provides a more comprehensive percentile breakdown (e.g., 10%, 25%, 50% (median), 75%, 90%, 95%, 98%, 99%). This is helpful for analyzing tail latencies and variance.
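Because the two tools report latencies in different units, it helps to normalize before comparing. The sketch below converts the Evalscope averages to milliseconds and prints the relative difference against the vLLM means; the helper names seconds_to_ms and relative_diff are hypothetical, and the numbers are the mean values copied from the two reports above. Note that Evalscope prints four decimal places, so values such as 0.0036 s carry only 0.1 ms of resolution, which limits how precisely TPOT and ITL can be compared.

```python
def seconds_to_ms(value_s):
    """Convert seconds (Evalscope) to milliseconds (vLLM)."""
    return value_s * 1000.0


def relative_diff(reference, other):
    """Signed difference of `other` relative to `reference`."""
    return (other - reference) / reference


# Mean values copied from the reports above.
vllm_mean_ms = {"TTFT": 73.18, "TPOT": 3.85, "ITL": 3.86}
evalscope_mean_s = {"TTFT": 0.1131, "TPOT": 0.0036, "ITL": 0.0037}

for name, vllm_ms in vllm_mean_ms.items():
    evalscope_ms = seconds_to_ms(evalscope_mean_s[name])
    print(
        f"{name}: vLLM {vllm_ms:.2f} ms, Evalscope {evalscope_ms:.2f} ms, "
        f"relative difference {relative_diff(vllm_ms, evalscope_ms):+.1%}"
    )
```

The same helpers can be applied to the P99 row of the Evalscope percentile table and the P99 lines of the vLLM report.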
Common Sources of Discrepancies and Troubleshooting Suggestions
The following are common sources of discrepancies between vLLM and Evalscope performance results, along with tips for resolving them:
Streaming vs. Non-Streaming
Both tools use token streaming by default to measure token-level outputs.
Ensure that no non-streaming paths are mistakenly mixed in, as this could affect metrics like TTFT, ITL, and throughput.
Consistent Termination Criteria
Both tools must use identical settings for ending conditions. For example, ensure that ignore_eos and stop_tokens are consistently configured; otherwise, output lengths and metrics (e.g., TPOT/ITL) can diverge.
Tokenizer and Prompt Template Consistency
Use the same tokenizer and apply a consistent prompt formatting template (e.g., for models like Qwen).
Specify the correct tokenizer in both tools: --tokenizer for vLLM and --tokenizer-path for Evalscope.
Decoding Parameters
Match decoding parameters such as temperature=0, top_p=1, top_k=0, and repetition_penalty.
Any inconsistency here can significantly affect both output length and latency metrics.
Warm-Up
Run a small batch of warm-up queries before large-scale benchmarking to eliminate the impact of initial loading or model compilation overhead.
Connection and Concurrency Settings
Ensure that both client-side connection pooling and server-side Keep-Alive settings are properly configured if running tests over a network.
Avoid DNS resolution delays or network jitter by running benchmarks locally (e.g., using 127.0.0.1 as the host).
Resource Contention
Background processes, NVIDIA NVLink bandwidth, and PCIe throughput may introduce non-deterministic delays.
For consistent measurements, perform benchmarks in a stable and isolated environment.