Multi-turn Conversation Benchmark
The multi-turn conversation benchmark allows you to test a model service in realistic multi-turn interaction scenarios. Unlike a standard benchmark, multi-turn mode appends the model's actual replies to the conversation context so that every subsequent request carries the full history. This faithfully simulates a user's continuous conversation with the model and enables measurement of latency, throughput, and KV-cache utilization as context grows.
Features
- Real context accumulation: After each successful turn, the model's actual output is appended to the conversation history; the next turn sends the complete history rather than just the current user message.
- Approx KV cache hit rate estimation: Based on client-side token counts, estimates the proportion of history tokens relative to the total input tokens in each request, i.e., the theoretical upper bound of tokens that could benefit from server-side prefix caching. Whether caching actually occurs depends on whether the server has prefix caching enabled and has retained the relevant cache.
- Multiple dataset support: Provides five datasets: random synthetic (random_multi_turn), real conversations (share_gpt_zh_multi_turn / share_gpt_en_multi_turn), custom local data (custom_multi_turn), and real Agent trajectories (swe_smith).
- Consistent parameter semantics: --number is the total number of conversations and --parallel is the number of concurrent conversations, keeping the same semantics as standard benchmark mode.
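The cache hit estimate described above can be illustrated with a small sketch. The function name and the token-weighted aggregation across requests are assumptions made for illustration; the tool may combine per-request ratios differently.

```python
def approx_cache_hit_rate(requests):
    """Estimate the share of input tokens that came from conversation history.

    `requests` is a list of (history_tokens, total_input_tokens) pairs, one per
    request. Aggregating by token totals is an assumption of this sketch.
    """
    history = sum(h for h, _ in requests)
    total = sum(t for _, t in requests)
    return history / total if total else 0.0


# Three requests from one conversation: turn 1 carries no history.
requests = [(0, 300), (400, 700), (900, 1200)]
rate = approx_cache_hit_rate(requests)
print(f"Approx cache hit: {rate:.1%}")
```

This is an upper bound only: it counts tokens that could be served from a prefix cache, not tokens the server actually reused.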
Parameters
Multi-turn Specific Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| --multi-turn | | Enable multi-turn conversation benchmark mode | |
| --min-turns | | Minimum number of user turns per conversation; used by random_multi_turn and swe_smith | 1 |
| --max-turns | | Maximum number of user turns per conversation; required for random_multi_turn | |
| --dataset-offset | | Skip the first N conversations in the dataset; useful for sharded testing or avoiding cache hits | |
multi_turn_args (swe_smith-specific parameters)
The swe_smith dataset's live construction mode supports fine-grained control of conversation structure and token-length targets via a MultiTurnArgs object.
The number of turns per conversation is sampled from [--min-turns, --max-turns]; the amount of content filled per turn is controlled by the token-length parameters below.
| Parameter | Type | Description | Default |
|---|---|---|---|
| first_turn_length | | Target prompt token count for turn 1; trajectory messages are sliced until this length is reached | |
| subsequent_turn_length | | Target token increment per subsequent turn; controls the delta size added each round | |
| | | Characters-per-token estimate used for pre-filtering trajectories when no tokenizer is available | |
| | | Number of parallel workers for live conversation building (>1 uses multiprocessing.Pool) | |
Semantics of Existing Parameters in Multi-turn Mode
| Parameter | Meaning in multi-turn mode |
|---|---|
| --number | Total number of conversations to run; all workers stop once this many conversations have been completed |
| --parallel | Number of concurrently active conversations (each worker owns one conversation) |
Workflow
1. Load conversation pool: At startup, conversations are read sequentially from the dataset file and pre-loaded into memory, up to a maximum of --number conversations (to avoid excessive memory usage with large datasets).
2. Start workers: --parallel concurrent coroutines (workers) are launched, each running independently and sharing a single global conversation counter.
3. Conversation assignment (first-come, first-served, sequential cycling): Conversations are assigned to workers in order. Whichever worker finishes its current conversation first picks up the next one. Once all conversations are exhausted, cycling restarts from the beginning.

   Example with --parallel 3 and 5 conversations:

       Startup: 3 workers begin simultaneously
         Worker 0 → takes conversation 1
         Worker 1 → takes conversation 2
         Worker 2 → takes conversation 3
         ↓ Each sends requests (other workers can continue executing while waiting for replies)
       Suppose Worker 1 finishes all turns of conversation 2 first:
         Worker 1 → takes conversation 4
       Suppose Worker 0 finishes conversation 1 second:
         Worker 0 → takes conversation 5
       If --number is not yet reached after conversation 5:
         Worker 0 → cycles back to conversation 1
       ... and so on: whoever finishes first picks up the next one

4. Single-conversation execution (turn by turn): Each worker maintains its own independent conversation history, not shared with other workers:

       Turn 1: send [user_msg_1] → receive model_reply_1
       Turn 2: send [user_msg_1, model_reply_1, user_msg_2] → receive model_reply_2
       Turn 3: send [user_msg_1, model_reply_1, user_msg_2, model_reply_2, user_msg_3] → ...

   In each turn, the user message is appended to the history before the request is sent. The model's actual reply is then appended for use in the next turn. Dataset assistant content is never sent to the model; only real model outputs build the context.
5. Budget control and stopping: The global conversation counter is incremented synchronously before each conversation starts, ensuring the total number of completed conversations across all workers does not exceed --number. Once the limit is reached, all workers stop and no new conversations are started.
6. Failure handling: If a turn request fails, the current conversation is immediately abandoned and the worker starts a new conversation from scratch, so no failed context is carried into subsequent requests.
7. Metrics aggregation: After all workers finish, all latency, throughput, and multi-turn-specific metrics (average context turns, KV cache hit rate) are aggregated and output.

Note: When the request success rate is below 100%, interrupted conversations do not contribute subsequent turns to the context, which may result in lower reported KV cache hit rates.
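The workflow above can be sketched with asyncio. This is a simplified illustration, not the tool's actual implementation: fake_model is a hypothetical stand-in for the chat completion request, and the names run_benchmark and worker are invented for the example.

```python
import asyncio
import itertools


async def fake_model(messages):
    """Hypothetical stand-in for a chat completion call (no network)."""
    await asyncio.sleep(0)
    return f"reply-to-{len(messages)}-messages"


async def run_benchmark(conversations, number, parallel):
    counter = itertools.count()  # global conversation counter shared by workers
    sent = []                    # (conversation slot, messages in the request)

    async def worker():
        while True:
            slot = next(counter)  # claimed before the conversation starts
            if slot >= number:
                return            # budget reached: start nothing new
            user_turns = conversations[slot % len(conversations)]  # cycling
            history = []          # private to this worker and conversation
            for user_msg in user_turns:
                history.append({"role": "user", "content": user_msg})
                try:
                    reply = await fake_model(list(history))
                except Exception:
                    break         # abandon the conversation on failure
                sent.append((slot, len(history)))
                # Only the model's actual reply enters the context.
                history.append({"role": "assistant", "content": reply})

    await asyncio.gather(*(worker() for _ in range(parallel)))
    return sent


conversations = [["q1", "q2"], ["a1", "a2", "a3"]]
sent = asyncio.run(run_benchmark(conversations, number=3, parallel=2))
print(sorted(sent))
```

Because coroutines only switch at await points, claiming a slot with next(counter) needs no lock here; a multi-threaded variant would need one.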
Datasets
random_multi_turn
Generates synthetic token sequences based on the random dataset. The number of user turns in each conversation falls in the range [min_turns, max_turns]. No external data file is required, making it ideal for quick benchmarking and performance comparisons.
Required: --tokenizer-path, --max-turns
Optional: --min-turns (default 1), --min-prompt-length, --max-prompt-length (control the token length range of each user message)
Each conversation produced by the dataset has the following structure:
[
{"role": "user", "content": "...turn 1 random token sequence..."},
{"role": "user", "content": "...turn 2 random token sequence..."}
]
Note: --tokenize-prompt is not supported in multi-turn mode and will be silently ignored. Multi-turn conversations are always sent as message dicts to the /v1/chat/completions endpoint.
Usage example: Quickly evaluate service performance at a specified prompt length distribution and conversation depth without a real dataset.
evalscope perf \
--model YOUR_MODEL \
--tokenizer-path YOUR_MODEL \
--url OPENAI_API_COMPAT_URL \
--api openai \
--dataset random_multi_turn \
--min-prompt-length 256 \
--max-prompt-length 512 \
--max-tokens 256 \
--multi-turn \
--min-turns 2 \
--max-turns 5 \
--number 20 \
--parallel 10
Example output:
Detailed Performance Metrics
| Conc. | Rate | RPS | Avg Lat.(s) | P99 Lat.(s) | Avg TTFT(s) | P99 TTFT(s) | Avg TPOT(s) | P99 TPOT(s) | Gen. toks/s | Success Rate |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | INF | 4.32 | 2.289 | 3.541 | 0.041 | 0.072 | 0.009 | 0.011 | 1103.48 | 100.0% |

Request Metrics
| Conc. | Num Reqs | Avg In Toks | P99 In Toks | Avg Out Toks | P99 Out Toks | Avg Turns/Req | Approx Cache Hit |
|---|---|---|---|---|---|---|---|
| 10 | 20 | 315.4 | 650.0 | 92.0 | 128.0 | 1.60 | 58.1% |
Interpreting the metrics:
- Avg Turns/Req: 1.60: Each request carried an average of 1.60 turns of context during the test, consistent with the --min-turns 2 --max-turns 5 random sampling distribution.
- Approx Cache Hit: 58.1%: About 58% of input tokens came from conversation history.
custom_multi_turn
Uses a local JSONL file as a custom multi-turn conversation dataset. Each line stores a complete conversation directly in OpenAI messages format, so no format conversion is required. Ideal for benchmarking with your own existing conversation data.
- --dataset-path is required and must point to a local JSONL file.
- Optional truncation: Use --max-turns to limit the maximum number of user turns per conversation.
JSONL dataset format (one conversation per line, as an OpenAI messages array):
[{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi! How can I help you?"}, {"role": "user", "content": "Write me a poem"}]
[{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris."}, {"role": "user", "content": "Tell me more about it."}]
Each line must satisfy:
- Must be a JSON array.
- Every element must have role and content fields.
- role must be either user or assistant.
- Must contain at least one user message.
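The rules above can be checked before a run with a short validator. This is a sketch for pre-flight checking; the function name validate_conversation_line is hypothetical and is not part of evalscope.

```python
import json


def validate_conversation_line(line):
    """Parse one JSONL line and enforce the four format rules listed above."""
    messages = json.loads(line)
    if not isinstance(messages, list):
        raise ValueError("line must be a JSON array")
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
            raise ValueError(f"message {i} must have 'role' and 'content' fields")
        if msg["role"] not in ("user", "assistant"):
            raise ValueError(f"message {i} has unsupported role {msg['role']!r}")
    if not any(msg["role"] == "user" for msg in messages):
        raise ValueError("conversation must contain at least one user message")
    return messages


good = '[{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi!"}]'
print(len(validate_conversation_line(good)))
```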
Runtime context structure (when sending turn 2):
[
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "<model's actual reply to turn 1>"},
{"role": "user", "content": "Write me a poem"}
]
Note: The assistant messages in the dataset are used only to identify conversation structure and are never sent directly to the model. At runtime, workers always append the model's actual output to the context to ensure accurate history.
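A minimal sketch of how a dataset line maps to runtime requests follows. The helper names are hypothetical, and keeping the first max_turns user turns is an assumed reading of the truncation behaviour.

```python
def extract_user_turns(messages, max_turns=None):
    """Keep only user contents, in order; dataset assistant text is dropped.

    Truncating to the first `max_turns` turns is an assumption of this sketch.
    """
    turns = [m["content"] for m in messages if m["role"] == "user"]
    return turns if max_turns is None else turns[:max_turns]


def build_request(history, user_msg):
    """Return the message list for the next request without mutating history."""
    return history + [{"role": "user", "content": user_msg}]


dataset_line = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi! How can I help you?"},
    {"role": "user", "content": "Write me a poem"},
]

turns = extract_user_turns(dataset_line, max_turns=3)
history = build_request([], turns[0])
# The real reply would come from the model; a placeholder stands in here.
history.append({"role": "assistant", "content": "<model's actual reply to turn 1>"})
turn_2_request = build_request(history, turns[1])
print(turn_2_request)
```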
Usage example: You have conversation data already in OpenAI messages format and want to benchmark directly without any format conversion.
First, prepare the JSONL data file (one conversation per line):
[{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi! How can I help you?"}, {"role": "user", "content": "Write me a poem"}, {"role": "assistant", "content": "Sure, ..."}, {"role": "user", "content": "Write another one"}]
[{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris."}, {"role": "user", "content": "What are some famous landmarks there?"}]
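If the conversations already live in memory, a file in this shape could be produced with a few lines of Python. The write_jsonl helper and the temporary path are illustrative choices, not part of the tool.

```python
import json
import os
import tempfile


def write_jsonl(path, conversations):
    """Write one conversation (an OpenAI messages array) per line."""
    with open(path, "w", encoding="utf-8") as f:
        for messages in conversations:
            f.write(json.dumps(messages, ensure_ascii=False) + "\n")


conversations = [
    [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
        {"role": "user", "content": "What are some famous landmarks there?"},
    ],
]

path = os.path.join(tempfile.mkdtemp(), "my_conversations.jsonl")
write_jsonl(path, conversations)
print(path)
```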
Then run the benchmark:
evalscope perf \
--model YOUR_MODEL \
--url OPENAI_API_COMPAT_URL \
--api openai \
--dataset custom_multi_turn \
--dataset-path /path/to/my_conversations.jsonl \
--max-tokens 512 \
--multi-turn \
--max-turns 3 \
--number 100 \
--parallel 10
swe_smith
Uses real Agent code-repair trajectory data from SWE-bench/SWE-smith-trajectories, designed specifically for long-context, multi-turn Agent scenario benchmarking. Each trajectory consists of tool calls, code snippets, patch results, etc. A single prompt typically reaches tens of thousands of tokens, making it ideal for evaluating prefill throughput and KV cache hit rates under large contexts.
Two data source modes are supported:
- Pre-built JSON mode (recommended): Specify --dataset-path to load a pre-generated agentic_dataset.json. No tokenizer is required and startup is fast.
- Live construction mode (no --dataset-path): Pulls raw trajectories from ModelScope at runtime and dynamically builds conversations. --tokenizer-path is required for accurate token counting.

Common features of both modes:
- Offset support: Skip the first N conversations via --dataset-offset, useful for sharded testing or avoiding KV cache hot-spots.

Note: The turn count for each conversation is sampled from [--min-turns, --max-turns]; the amount of content per turn is determined by first_turn_length / subsequent_turn_length.
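As a rough illustration of how these parameters interact, the sketch below plans cumulative prompt-length targets for one conversation. Uniform sampling over the inclusive range and the cumulative reading of the per-turn increment are assumptions; plan_turn_targets is a hypothetical helper, not the builder's real code.

```python
import random


def plan_turn_targets(min_turns, max_turns, first_turn_length,
                      subsequent_turn_length, rng):
    """Return the target prompt token count for each turn of one conversation."""
    n_turns = rng.randint(min_turns, max_turns)  # inclusive on both ends
    return [first_turn_length + k * subsequent_turn_length
            for k in range(n_turns)]


rng = random.Random(42)  # seeded for reproducibility
targets = plan_turn_targets(3, 8, 8192, 1024, rng)
print(targets)
```

Under these assumptions, turn 1 targets 8192 prompt tokens and each later turn adds about 1024 more trajectory tokens, before counting the model's own replies.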
Building the Dataset
It is recommended to pre-build agentic_dataset.json using examples/perf/build_swe_smith_dataset.py before running a benchmark: build once and reuse many times, avoiding repeated downloads and on-the-fly construction.
Key parameters:
| Parameter | Description | Default |
|---|---|---|
| --model-path | Tokenizer path for accurate token counting (ModelScope model ID or local path) | |
| --first-turn-length | Target prompt token count for turn 1 | |
| --subsequent-turn-length | Target token increment per subsequent turn | |
| --min-turns | Minimum number of turns per conversation | |
| --max-turns | Maximum number of turns per conversation; the actual turn count is sampled from [--min-turns, --max-turns] | |
| --number | Number of conversations to generate | |
| --output-path | Output file path | |
| --seed | Random seed for reproducibility | |
| --num-workers | Number of parallel workers | CPU count |
python examples/perf/build_swe_smith_dataset.py \
--model-path Qwen/Qwen2.5-7B-Instruct \
--first-turn-length 8192 \
--subsequent-turn-length 1024 \
--min-turns 3 \
--max-turns 8 \
--number 128 \
--output-path agentic_dataset.json \
--seed 42 \
--num-workers 8
Usage Example: Pre-built JSON Mode (Recommended)
After generating agentic_dataset.json, load it via --dataset-path. No tokenizer is needed and startup is faster:
evalscope perf \
--model YOUR_MODEL \
--url OPENAI_API_COMPAT_URL \
--api openai \
--dataset swe_smith \
--dataset-path /path/to/agentic_dataset.json \
--max-tokens 512 \
--multi-turn \
--dataset-offset 100 \
--number 200 \
--parallel 20
Note: --dataset-offset skips the first N conversations in the dataset, making it suitable for multi-machine sharded benchmarking or avoiding KV cache hot-spots.
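For sharded runs, one simple scheme is to give each machine a disjoint slice by spacing the offsets --number apart. The helper below is a hypothetical planning aid; it assumes the dataset holds at least num_shards × conversations_per_shard conversations.

```python
def shard_offsets(num_shards, conversations_per_shard, start=0):
    """Return one --dataset-offset value per shard, giving disjoint slices."""
    return [start + i * conversations_per_shard for i in range(num_shards)]


# Three machines, 200 conversations each.
offsets = shard_offsets(3, 200)
for machine, offset in enumerate(offsets):
    print(f"machine {machine}: --dataset-offset {offset} --number 200")
```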
Usage Example: Live Construction Mode
Automatically pulls SWE-smith-trajectories from ModelScope and constructs conversations at runtime. --tokenizer-path is required:
evalscope perf \
--model YOUR_MODEL \
--url OPENAI_API_COMPAT_URL \
--api openai \
--dataset swe_smith \
--tokenizer-path YOUR_MODEL \
--max-tokens 512 \
--min-tokens 512 \
--multi-turn \
--multi-turn-args '{
"first_turn_length": 8192,
"subsequent_turn_length": 1024
}' \
--min-turns 3 \
--max-turns 8 \
--seed 42 \
--number 10 20 \
--parallel 5 10 \
--extra-args '{"ignore_eos": true}'