Multi-turn Conversation Benchmark#
The multi-turn conversation benchmark allows you to test a model service in realistic multi-turn interaction scenarios. Unlike a standard benchmark, multi-turn mode appends the model's actual replies to the conversation context so that every subsequent request carries the full history. This faithfully simulates a user's continuous conversation with the model and enables measurement of latency, throughput, and KV-cache utilization as context grows.
Features#
Real context accumulation: After each successful turn, the model's actual output is appended to the conversation history; the next turn sends the complete history rather than just the current user message.
Approximate KV cache hit rate estimation: Based on client-side token counts, estimates the proportion of history tokens relative to the total input tokens in each request, i.e., the theoretical upper bound on the share of tokens that could benefit from server-side prefix caching. Whether caching actually occurs depends on whether the server has prefix caching enabled and has retained the relevant cache.
Multiple dataset support: Provides random synthetic (random_multi_turn), real conversation (share_gpt_zh_multi_turn/share_gpt_en_multi_turn), custom local data (custom_multi_turn), real Agent trajectory (swe_smith), and production agentic trace replay (trie_agentic_coding/trie_code_qa/trie_office_work) datasets.
Consistent parameter semantics: --number is the total number of conversations and --parallel is the number of concurrent conversations, keeping the same semantics as standard benchmark mode.
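The approximate hit-rate estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not evalscope's implementation: the function name approx_cache_hit_rate and its two input lists are hypothetical, token counts are assumed to be already computed on the client, and chat-template overhead is ignored.

```python
def approx_cache_hit_rate(user_tokens, reply_tokens):
    """Return, per turn, the share of input tokens that are conversation history.

    user_tokens[i]  -- token count of the user message in turn i (assumed known)
    reply_tokens[i] -- token count of the model's actual reply in turn i
    """
    rates = []
    history = 0  # tokens sent or received in earlier turns
    for user, reply in zip(user_tokens, reply_tokens):
        total_input = history + user
        rates.append(history / total_input)
        # The user message and the real reply both become history for the next turn.
        history = total_input + reply
    return rates


rates = approx_cache_hit_rate([300, 300, 300], [100, 100, 100])
print([round(r, 3) for r in rates])
```

The first turn always yields 0 because nothing has been sent before; the ratio then grows with every turn as the history dominates the input.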
Parameters#
Multi-turn Specific Parameters#
| Parameter | Type | Description | Default |
|---|---|---|---|
| --multi-turn | | Enable multi-turn conversation benchmark mode | |
| --min-turns | | Minimum number of user turns per conversation; used by | |
| --max-turns | | Maximum number of user turns per conversation; required for | |
| --dataset-offset | | Skip the first N conversations in the dataset; useful for sharded testing or avoiding cache hits | |
multi_turn_args (swe_smith-specific parameters)#
The swe_smith dataset's live construction mode supports fine-grained control of conversation structure and token-length targets via a MultiTurnArgs object.
The number of turns per conversation is sampled from [--min-turns, --max-turns]; the amount of content filled per turn is controlled by the token-length parameters below.
| Parameter | Type | Description | Default |
|---|---|---|---|
| first_turn_length | | Target prompt token count for turn 1; trajectory messages are sliced until this length is reached | |
| subsequent_turn_length | | Target token increment per subsequent turn; controls the delta size added each round | |
| | | Characters-per-token estimate used for pre-filtering trajectories when no tokenizer is available | |
| | | Number of parallel workers for live conversation building (>1 uses multiprocessing.Pool) | |
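To make the interaction between the turn range and the two length targets concrete, here is a small sketch. It assumes the turn count is drawn uniformly from the inclusive range and that each later turn adds exactly subsequent_turn_length tokens of dataset content on top of the first turn; the helper name target_prompt_lengths is hypothetical, and model replies (added at runtime) are not counted.

```python
import random


def target_prompt_lengths(first_turn_length, subsequent_turn_length,
                          min_turns, max_turns, rng):
    """Return the cumulative dataset-content target (in tokens) for each turn."""
    # Assumption: uniform sampling, inclusive on both ends.
    num_turns = rng.randint(min_turns, max_turns)
    return [first_turn_length + i * subsequent_turn_length
            for i in range(num_turns)]


lengths = target_prompt_lengths(8192, 1024, min_turns=3, max_turns=8,
                                rng=random.Random(42))
print(len(lengths), lengths[0], lengths[-1])
```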
Semantics of Existing Parameters in Multi-turn Mode#
| Parameter | Meaning in multi-turn mode |
|---|---|
| --number | Total number of conversations to run; all workers stop once this many conversations have been completed |
| --parallel | Number of concurrently active conversations (each worker owns one conversation) |
Workflow#
Load conversation pool: At startup, conversations are read sequentially from the dataset file and pre-loaded into memory, up to a maximum of --number conversations (to avoid excessive memory usage with large datasets).
Start workers: --parallel concurrent coroutines (workers) are launched, each running independently and sharing a single global conversation counter.
Conversation assignment (first-come, first-served, sequential cycling): Conversations are assigned to workers in order. Whichever worker finishes its current conversation first picks up the next one. Once all conversations are exhausted, cycling restarts from the beginning.
Example with --parallel 3 and 5 conversations:
    Startup: 3 workers begin simultaneously
      Worker 0 → takes conversation 1
      Worker 1 → takes conversation 2
      Worker 2 → takes conversation 3
    Each worker sends its requests (other workers can continue executing while one waits for a reply)
    Suppose Worker 1 finishes all turns of conversation 2 first:
      Worker 1 → takes conversation 4
    Suppose Worker 0 finishes conversation 1 second:
      Worker 0 → takes conversation 5
    If --number is not yet reached after conversation 5:
      Worker 0 → cycles back to conversation 1
    ... and so on: whoever finishes first picks up the next one
Single-conversation execution (turn by turn): Each worker maintains its own independent conversation history, not shared with other workers:
    Turn 1: send [user_msg_1] → receive model_reply_1
    Turn 2: send [user_msg_1, model_reply_1, user_msg_2] → receive model_reply_2
    Turn 3: send [user_msg_1, model_reply_1, user_msg_2, model_reply_2, user_msg_3] → ...
In each turn, the user message is appended to the history before the request is sent. The model's actual reply is then appended for use in the next turn. Dataset assistant content is never sent to the model; only real model outputs build the context.
Budget control and stopping: The global conversation counter is incremented synchronously before each conversation starts, ensuring the total number of completed conversations across all workers does not exceed --number. Once the limit is reached, all workers stop and no new conversations are started.
Failure handling: If a turn request fails, the current conversation is immediately abandoned and the worker starts a new conversation from scratch, so no failed context is carried into subsequent requests.
Metrics aggregation: After all workers finish, all latency, throughput, and multi-turn-specific metrics (average context turns, KV cache hit rate) are aggregated and output.
Note: When the request success rate is below 100%, interrupted conversations do not contribute subsequent turns to the context, which may result in lower reported KV cache hit rates.
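The assignment, cycling, budget, and failure rules of this workflow can be sketched with asyncio. This is an illustrative simulation rather than evalscope's code: run_benchmark and fake_model are hypothetical names, conversations are plain lists of user strings, and the network call is replaced by a stub.

```python
import asyncio
import itertools


async def fake_model(messages):
    """Stand-in for a chat-completions request; a real client would do HTTP here."""
    await asyncio.sleep(0)
    return f"reply-{len(messages)}"


async def run_benchmark(conversations, number, parallel, send=fake_model):
    counter = itertools.count()  # single global conversation counter
    finished = []                # (worker_id, conversation_index, turns_done)

    async def worker(worker_id):
        while True:
            # No await between claiming and checking, so the claim is
            # atomic within one event loop.
            claimed = next(counter)
            if claimed >= number:
                return
            conv_index = claimed % len(conversations)  # cycle when exhausted
            history = []  # private to this conversation
            turns_done = 0
            for user_msg in conversations[conv_index]:
                history.append({"role": "user", "content": user_msg})
                try:
                    reply = await send(history)
                except Exception:
                    break  # abandon the conversation on the first failed turn
                history.append({"role": "assistant", "content": reply})
                turns_done += 1
            finished.append((worker_id, conv_index, turns_done))

    await asyncio.gather(*(worker(i) for i in range(parallel)))
    return finished


convs = [["a1", "a2"], ["b1"], ["c1", "c2", "c3"], ["d1"], ["e1", "e2"]]
results = asyncio.run(run_benchmark(convs, number=7, parallel=3))
print(results)
```

With 5 conversations and number=7, the last two claims wrap around to conversations 0 and 1, which mirrors the cycling behaviour described in the example.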
Datasets#
evalscope provides the following multi-turn datasets:
random_multi_turn#
Generates synthetic token sequences based on the random dataset. Each conversation contains between min_turns and max_turns user turns. No external data file is required, making it ideal for quick benchmarking and performance comparisons.
Required: --tokenizer-path, --max-turns
Optional: --min-turns (default 1), --min-prompt-length, --max-prompt-length (control the token length range of each user message)
Each conversation produced by the dataset has the following structure:
[
{"role": "user", "content": "...turn 1 random token sequence..."},
{"role": "user", "content": "...turn 2 random token sequence..."}
]
Note: --tokenize-prompt is not supported in multi-turn mode and will be silently ignored. Multi-turn conversations are always sent as message dicts to the /v1/chat/completions endpoint.
Usage example: Quickly evaluate service performance at a specified prompt length distribution and conversation depth without a real dataset.
evalscope perf \
--model YOUR_MODEL \
--tokenizer-path YOUR_MODEL \
--url OPENAI_API_COMPAT_URL \
--api openai \
--dataset random_multi_turn \
--min-prompt-length 256 \
--max-prompt-length 512 \
--max-tokens 256 \
--multi-turn \
--min-turns 2 \
--max-turns 5 \
--number 20 \
--parallel 10
Example output:
Performance Overview
┏━━━━━━━┳━━━━━━┳━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Conc. ┃ Rate ┃ Num ┃  RPS ┃   Gen/s ┃ Success ┃
┡━━━━━━━╇━━━━━━╇━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│    10 │    - │  20 │ 4.32 │ 1103.48 │  100.0% │
└───────┴──────┴─────┴──────┴─────────┴─────────┘
Per-Request Metrics
┏━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ Conc. ┃ Rate ┃ Metric        ┃   avg ┃   p50 ┃   p99 ┃
┡━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│    10 │    - │ Latency (s)   │ 2.289 │ 2.180 │ 3.541 │
│       │      │ TTFT (ms)     │  41.0 │  38.0 │  72.0 │
│       │      │ TPOT (ms)     │   9.0 │   8.8 │  11.0 │
│       │      │ Input Tokens  │ 315.4 │ 280.0 │ 650.0 │
│       │      │ Output Tokens │  92.0 │  85.0 │ 128.0 │
│       │      │ Turns/Req     │  1.60 │     - │     - │
│       │      │ Cache Hit (%) │ 58.1% │     - │     - │
└───────┴──────┴───────────────┴───────┴───────┴───────┘
Interpreting the metrics:
Turns/Req: 1.60: Each request carried an average of 1.60 turns of context during the test, consistent with the --min-turns 2 --max-turns 5 random sampling distribution.
Cache Hit (%): 58.1%: About 58% of input tokens came from conversation history.
custom_multi_turn#
Uses a local JSONL file as a custom multi-turn conversation dataset. Each line stores a complete conversation directly in OpenAI messages format, so no format conversion is required. Ideal for benchmarking with your own existing conversation data.
--dataset-path is required and must point to a local JSONL file.
Optional truncation: Use --max-turns to limit the maximum number of user turns per conversation.
JSONL dataset format (one conversation per line, as an OpenAI messages array):
[{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi! How can I help you?"}, {"role": "user", "content": "Write me a poem"}]
[{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris."}, {"role": "user", "content": "Tell me more about it."}]
Each line must satisfy:
Must be a JSON array.
Every element must have role and content fields.
role must be either user or assistant.
Must contain at least one user message.
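The four rules above are easy to check before a run. The following validator is a minimal sketch; validate_line and is_valid are hypothetical helper names and are not part of evalscope.

```python
import json

ALLOWED_ROLES = {"user", "assistant"}


def validate_line(line):
    """Parse one JSONL line and return its messages, or raise ValueError."""
    try:
        messages = json.loads(line)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(messages, list):
        raise ValueError("line must be a JSON array")
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
            raise ValueError(f"message {i} needs 'role' and 'content' fields")
        if msg["role"] not in ALLOWED_ROLES:
            raise ValueError(f"message {i} has unsupported role {msg['role']!r}")
    if not any(msg["role"] == "user" for msg in messages):
        raise ValueError("conversation needs at least one user message")
    return messages


def is_valid(line):
    """Boolean wrapper around validate_line for quick filtering."""
    try:
        validate_line(line)
    except ValueError:
        return False
    return True


print(is_valid('[{"role": "user", "content": "Hello"}]'))
```

Running validate_line over every line of the file before the benchmark surfaces the offending line and reason early, instead of failing partway through a long run.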
Runtime context structure (when sending turn 2):
[
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "<model's actual reply to turn 1>"},
{"role": "user", "content": "Write me a poem"}
]
Note: The assistant messages in the dataset are used only to identify conversation structure and are never sent directly to the model. At runtime, workers always append the model's actual output to the context to ensure accurate history.
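The way dataset assistant messages are skipped while real replies build the context can be sketched as a generator. The names iter_requests and get_reply are hypothetical; get_reply stands in for the actual model call, and max_turns mimics the --max-turns truncation under the assumption that it simply keeps the first N user messages.

```python
def iter_requests(dataset_conversation, get_reply, max_turns=None):
    """Yield the message list that would be sent on each turn."""
    # Only user messages are taken from the dataset; assistant entries are dropped.
    user_messages = [m for m in dataset_conversation if m["role"] == "user"]
    if max_turns is not None:
        user_messages = user_messages[:max_turns]

    history = []
    for user_msg in user_messages:
        history.append({"role": "user", "content": user_msg["content"]})
        yield list(history)  # snapshot of what is sent this turn
        reply = get_reply(history)  # the model's actual output
        history.append({"role": "assistant", "content": reply})


conversation = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi! How can I help you?"},
    {"role": "user", "content": "Write me a poem"},
]
requests = list(iter_requests(conversation, lambda history: "<real reply>"))
print(requests[1])
```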
Usage example: You have conversation data already in OpenAI messages format and want to benchmark directly without any format conversion.
First, prepare the JSONL data file (one conversation per line):
[{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi! How can I help you?"}, {"role": "user", "content": "Write me a poem"}, {"role": "assistant", "content": "Sure, ..."}, {"role": "user", "content": "Write another one"}]
[{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris."}, {"role": "user", "content": "What are some famous landmarks there?"}]
Then run the benchmark:
evalscope perf \
--model YOUR_MODEL \
--url OPENAI_API_COMPAT_URL \
--api openai \
--dataset custom_multi_turn \
--dataset-path /path/to/my_conversations.jsonl \
--max-tokens 512 \
--multi-turn \
--max-turns 3 \
--number 100 \
--parallel 10
swe_smith#
Uses real Agent code-repair trajectory data from SWE-bench/SWE-smith-trajectories, designed specifically for long-context + multi-turn Agent scenario benchmarking. Each trajectory consists of tool calls, code snippets, patch results, etc. A single prompt typically exceeds tens of thousands of tokens, making it ideal for evaluating prefill throughput and KV cache hit rates under large contexts.
Two data source modes are supported:
Pre-built JSON mode (recommended): Specify --dataset-path to load a pre-generated agentic_dataset.json. No tokenizer is required and startup is fast.
Live construction mode (no --dataset-path): Pulls raw trajectories from ModelScope at runtime and dynamically builds conversations. --tokenizer-path is required for accurate token counting.
Common features of both modes:
Offset support: Skip the first N conversations via --dataset-offset, useful for sharded testing or avoiding KV cache hot-spots.
Note: The turn count for each conversation is sampled from [--min-turns, --max-turns]; the amount of content per turn is determined by first_turn_length / subsequent_turn_length.
Building the Dataset
It is recommended to pre-build agentic_dataset.json using examples/perf/build_swe_smith_dataset.py before running a benchmark: build once and reuse many times, avoiding repeated downloads and on-the-fly construction.
Key parameters:
| Parameter | Description | Default |
|---|---|---|
| --model-path | Tokenizer path for accurate token counting (ModelScope model ID or local path) | |
| --first-turn-length | Target prompt token count for turn 1 | |
| --subsequent-turn-length | Target token increment per subsequent turn | |
| --min-turns | Minimum number of turns per conversation | |
| --max-turns | Maximum number of turns per conversation; the actual turn count is sampled from [--min-turns, --max-turns] | |
| --number | Number of conversations to generate | |
| --output-path | Output file path | |
| --seed | Random seed for reproducibility | |
| --num-workers | Number of parallel workers | CPU count |
python examples/perf/build_swe_smith_dataset.py \
--model-path Qwen/Qwen2.5-7B-Instruct \
--first-turn-length 8192 \
--subsequent-turn-length 1024 \
--min-turns 3 \
--max-turns 8 \
--number 128 \
--output-path agentic_dataset.json \
--seed 42 \
--num-workers 8
Usage Example: Pre-built JSON Mode (Recommended)
After generating agentic_dataset.json, load it via --dataset-path. No tokenizer is needed and startup is faster:
evalscope perf \
--model YOUR_MODEL \
--url OPENAI_API_COMPAT_URL \
--api openai \
--dataset swe_smith \
--dataset-path /path/to/agentic_dataset.json \
--max-tokens 512 \
--multi-turn \
--dataset-offset 100 \
--number 200 \
--parallel 20
Note: --dataset-offset skips the first N conversations in the dataset, making it suitable for multi-machine sharded benchmarking or avoiding KV cache hot-spots.
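For multi-machine sharding, one simple scheme is to give each machine a disjoint window of the dataset. The sketch below assumes each shard reads exactly --number conversations starting at its offset and that the pre-built file holds enough conversations for all shards; shard_offsets is a hypothetical helper.

```python
def shard_offsets(num_shards, conversations_per_shard, base_offset=0):
    """Return one --dataset-offset value per shard, with non-overlapping windows."""
    return [base_offset + i * conversations_per_shard for i in range(num_shards)]


offsets = shard_offsets(num_shards=3, conversations_per_shard=200)
for shard, offset in enumerate(offsets):
    print(f"machine {shard}: --dataset-offset {offset} --number 200")
```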
Usage Example: Live Construction Mode
Automatically pulls SWE-smith-trajectories from ModelScope and constructs conversations at runtime. --tokenizer-path is required:
evalscope perf \
--model YOUR_MODEL \
--url OPENAI_API_COMPAT_URL \
--api openai \
--dataset swe_smith \
--tokenizer-path YOUR_MODEL \
--max-tokens 512 \
--min-tokens 512 \
--multi-turn \
--multi-turn-args '{
"first_turn_length": 8192,
"subsequent_turn_length": 1024
}' \
--min-turns 3 \
--max-turns 8 \
--seed 42 \
--number 10 20 \
--parallel 5 10 \
--extra-args '{"ignore_eos": true}'
trie Trace Replay#
Replays the token-length sequences of real production agentic workloads to measure inference-service performance under multi-turn, long-tail, tool-call-paced load. The traces come from applied-compute/trie (Apache-2.0); evalscope re-hosts the three workloads on ModelScope dataset evalscope/trie-workloads and downloads them on demand.
Each trace contains only token-length metadata (per-turn prompt length, output length, tool-call wait time, etc.) without real conversation text. At benchmark time the client automatically synthesizes prompts to the recorded lengths, preserving the original load characteristics.
Three built-in workloads:
| Dataset alias | Source jsonl | Typical num_turns | When to use |
|---|---|---|---|
| trie_agentic_coding | | 8-15 | Coding-agent apps (IDE completion, refactor agents) |
| trie_code_qa | | 5-12 | Code-Q&A agents (repo analysis, doc generation) |
| trie_office_work | | 18-41 | Office-automation agents (longest, heaviest; best for stress-ceiling tests) |
Each trace starts with ~8 k initial-prompt tokens; total input volume per trace is ~25 k-50 k tokens. Each jsonl has ~8000 traces, more than enough for long-running statistically meaningful tests.
Bring your own trace? Feed any jsonl in the same format via --dataset-path:
evalscope perf ... --dataset trie_office_work --dataset-path /path/to/your_traces.jsonl
Required: --tokenizer-path (for synthesizing prompts to exact target lengths)
Usage example:
evalscope perf \
--model qwen-plus \
--url https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
--api-key $DASHSCOPE_API_KEY \
--api openai \
--dataset trie_office_work \
--multi-turn \
--parallel 4 \
--number 10 \
--duration 600 \
--tokenizer-path Qwen/Qwen2.5-7B-Instruct \
--extra-args '{"ignore_eos": true}' \
--stream
Note: trie trace replay relies on ignore_eos to force the model to generate to the exact recorded length. vLLM and SGLang support this parameter; DashScope / OpenAI API and most cloud services do not, so the model stops at its natural EOS and the actual output will be shorter than the recorded length. This does not affect latency / cache hit / decode TPS correctness.
Multi-turn Output Metrics#
In addition to the Performance Overview and Per-Request Metrics tables, multi-turn mode outputs the following additional tables.
Per-Trace Metrics#
Per-conversation (trace) metrics, aggregated as mean / p50 / p90 / p99 / max across all completed conversations:
| Column | Meaning | Where to look when it's off |
|---|---|---|
| Latency (s) | Wall-clock from conversation's first-turn start to last-turn completion | Slow conversation → check TTFAT split (is prefill slow or decode slow?) |
| First-Turn TTFT (s) | TTFT of the first turn; reflects cold-prefill performance | High → server's prefill stage is slow |
| TTFAT (s) | Wall-clock from conversation start to the first token of the final reply; end-to-end "how long until the final answer starts streaming" | Any slow turn in the conversation inflates this |
| Decode TPS | Arithmetic mean over the conversation's turns of | Reflects steady-state decode speed |
| Cache Hit Rate (%) | | Low → server prefix cache isn't enabled or evicts too aggressively |
| Eligible Cache Hit Rate (%) | Denominator only counts the theoretically cacheable prefix, excluding turn 1 and the current turn's new content | Both this and Cache Hit Rate low → cache off; this high but Cache Hit Rate low → cache enabled but capacity-bound |
Also outputs trace_summary.json to the results directory.
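The three timing columns can be derived from per-turn timestamps as sketched below. TurnTiming and trace_metrics are hypothetical names, and the timestamps are assumed to be client-side wall-clock seconds; Decode TPS and the cache columns are left out because their exact formulas are not given here.

```python
from dataclasses import dataclass


@dataclass
class TurnTiming:
    start: float        # when the turn's request was sent
    first_token: float  # when the first streamed token arrived
    end: float          # when the last token arrived


def trace_metrics(turns):
    """Compute conversation-level timings from an ordered list of turns."""
    first, last = turns[0], turns[-1]
    return {
        "latency_s": last.end - first.start,
        "first_turn_ttft_s": first.first_token - first.start,
        # Time from conversation start to the first token of the final reply.
        "ttfat_s": last.first_token - first.start,
    }


turns = [
    TurnTiming(start=0.0, first_token=0.5, end=2.0),
    TurnTiming(start=2.0, first_token=2.1, end=3.5),
    TurnTiming(start=3.5, first_token=3.6, end=5.0),
]
metrics = trace_metrics(turns)
print(metrics)
```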
Workload Throughput#
Time-based token throughput rates, output in all modes (single-turn and multi-turn):
┌───────┬──────┬─────────────────────┬─────────┬──────────┬───────────────────┐
│ Conc. │ Rate │ Metric (tok/s)      │ Overall │ Last 30s │ Steady (drop 20%) │
├───────┼──────┼─────────────────────┼─────────┼──────────┼───────────────────┤
│ 10    │ -    │ Total Prompt tok/s  │ ...     │ ...      │ ...               │
│       │      │ New Prompt tok/s    │ ...     │ ...      │ ...               │
│       │      │ Cached Prompt tok/s │ ...     │ ...      │ ...               │
│       │      │ Completion tok/s    │ ...     │ ...      │ ...               │
└───────┴──────┴─────────────────────┴─────────┴──────────┴───────────────────┘
| Column | Meaning |
|---|---|
| Overall | total tokens / wall_time over the whole run |
| Last 30s | 30-second tail-window tok/s; falls back to Overall when the run is shorter than 30s |
| Steady (drop 20%) | tok/s after dropping the first 20% of wall_time; removes the cold-start phase |

| Row | Meaning |
|---|---|
| Total Prompt | New Prompt + Cached Prompt |
| New Prompt | |
| Cached Prompt | the part the server skipped via prefix cache hit |
| Completion | output-token rate |
Steady-state is the fairest number for "how fast can this endpoint sustainably run"; Last 30s is good for tail jitter. Also outputs workload_throughput.json and workload_timeline.json (raw cumulative-token timeline, pandas-friendly) to the results directory.
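The three columns can be computed from a cumulative-token timeline roughly as follows. This is a sketch under assumptions: the timeline is taken to be a list of (seconds_since_start, cumulative_tokens) samples, which may not match the actual layout of workload_timeline.json, and the function names are hypothetical.

```python
from bisect import bisect_right


def cumulative_at(timeline, t):
    """Cumulative tokens at time t, using the last sample at or before t."""
    times = [ts for ts, _ in timeline]
    i = bisect_right(times, t)
    return timeline[i - 1][1] if i else 0


def throughput(timeline, tail_window=30.0, drop_fraction=0.2):
    wall_time, total = timeline[-1]
    overall = total / wall_time
    if wall_time < tail_window:
        last = overall  # run too short: fall back to the overall rate
    else:
        tail_tokens = total - cumulative_at(timeline, wall_time - tail_window)
        last = tail_tokens / tail_window
    steady_start = wall_time * drop_fraction
    steady_tokens = total - cumulative_at(timeline, steady_start)
    steady = steady_tokens / (wall_time - steady_start)
    return {"overall": overall, "last_30s": last, "steady": steady}


# 50 tok/s for the first 20 s (cold start), then 150 tok/s.
timeline = [(10, 500), (20, 1000), (30, 2500), (40, 4000), (50, 5500),
            (60, 7000), (70, 8500), (80, 10000), (90, 11500), (100, 13000)]
rates = throughput(timeline)
print(rates)
```

In this synthetic run the overall rate is pulled down by the slow first 20 seconds, while the tail-window and steady-state figures both recover the sustained rate.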
First-Turn vs Subsequent-Turn TTFT#
The TTFT (ms) in the Per-Request Metrics table averages every turn. In multi-turn scenarios, the first turn is a cold prefill while subsequent turns benefit heavily from prefix cache, so the two TTFT types can differ by 2-10x. The Per-Request Metrics table shows:
First-Turn TTFT (ms)
Subsequent-Turn TTFT (ms)
These appear only in multi-turn runs; single-turn benchmarks omit them.
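The split itself is a simple partition of per-request TTFT values by turn index. The sketch below assumes each record is a (turn_index, ttft_ms) pair with turns numbered from 1 and at least one first-turn record; split_ttft is a hypothetical helper, and the sample values are invented for illustration only.

```python
from statistics import mean


def split_ttft(records):
    """records: list of (turn_index, ttft_ms) pairs, turn_index starting at 1."""
    first = [ttft for turn, ttft in records if turn == 1]
    later = [ttft for turn, ttft in records if turn > 1]
    return {
        "all_turns_ms": mean(first + later),
        "first_turn_ms": mean(first),
        "subsequent_turn_ms": mean(later) if later else None,
    }


# Illustrative values only, not measured data.
records = [(1, 400.0), (2, 50.0), (3, 60.0), (1, 420.0), (2, 40.0)]
summary = split_ttft(records)
print(summary)
```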
--duration Soft Exit#
In multi-turn mode, --duration uses soft-exit semantics: once the deadline elapses, no new conversations are claimed, but already in-flight conversations run every remaining turn before exit. This keeps every conversation complete, so partial conversations don't skew statistics.
Three cap combinations:
| Command | Meaning |
|---|---|
| --number 50 | run 50 conversations, no time limit |
| --duration 300 | run for 300 seconds, however many conversations fit |
| --number 50 --duration 300 | both caps; whichever is reached first ends the run (recommended) |
Side effect: actual wall_time may overshoot --duration slightly (the overshoot equals the time needed for in-flight conversations to finish).
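The soft-exit behaviour and its overshoot can be reproduced with a small asyncio simulation. This is a sketch, not evalscope's implementation: run_with_soft_deadline is a hypothetical function, each turn is replaced by a sleep, and the deadline is checked only between conversations.

```python
import asyncio
import time


async def run_with_soft_deadline(turns_per_conversation, turn_seconds,
                                 duration, parallel):
    start = time.monotonic()
    deadline = start + duration
    finished = []
    next_id = 0

    async def worker():
        nonlocal next_id
        # The deadline gates only the claiming of new conversations.
        while time.monotonic() < deadline:
            conv_id = next_id
            next_id += 1
            for _ in range(turns_per_conversation):
                await asyncio.sleep(turn_seconds)  # stand-in for one request
            finished.append(conv_id)  # every claimed conversation completes

    await asyncio.gather(*(worker() for _ in range(parallel)))
    return finished, time.monotonic() - start


finished, wall_time = asyncio.run(
    run_with_soft_deadline(turns_per_conversation=3, turn_seconds=0.05,
                           duration=0.1, parallel=2)
)
print(finished, round(wall_time, 2))
```

Each conversation here needs about 0.15 s, so both workers are still mid-conversation when the 0.1 s deadline passes; they finish their remaining turns and the measured wall time exceeds the requested duration.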
FAQs#
Chat-template token overhead#
When using the /v1/chat/completions endpoint, the chat template adds 10-50 extra tokens per turn (role markers, special tokens, etc.), causing Cache Hit Rate to be 2-3pp lower than raw completions. This is expected behaviour.