SLA Auto-Tuning

The SLA (Service Level Agreement) auto-tuning feature lets users define service quality targets (such as latency and throughput). The tool then automatically adjusts the request pressure (concurrency or request rate) to find the maximum pressure the service can sustain while still meeting those targets.

Features

  • Automatic Detection: Uses a binary search to find the maximum concurrency (parallel) or request rate (rate) that satisfies the SLA constraints.

  • Multi-Metric Support: Supports end-to-end latency (Latency), time to first token (TTFT), and time per output token (TPOT), as well as request throughput (RPS) and token throughput (TPS).

  • Flexible Constraints: Supports threshold constraints (e.g., p99_latency <= 2s) and extremum targets (e.g., tps: max).

  • Stable Results: Each test point is run multiple times by default (--sla-num-runs, default 3) and the results are averaged to reduce the impact of network fluctuations.

Parameter Description

| Parameter | Type | Description | Default |
|---|---|---|---|
| --sla-auto-tune | bool | Whether to enable SLA auto-tuning mode | False |
| --sla-variable | str | Variable for auto-tuning. Options: parallel (concurrency), rate (request rate) | parallel |
| --sla-params | str | SLA constraint conditions, JSON string, supports multiple constraint groups (AND/OR logic), see description below | None |
| --sla-upper-bound | int | Upper bound of the tuned SLA variable search range | 65536 |
| --sla-lower-bound | int | Lower bound of the tuned SLA variable search range | 1 |
| --sla-fixed-parallel | int | Fixed parallel workers used when --sla-variable=rate; defaults to --sla-upper-bound for backward compatibility | None |
| --sla-num-runs | int | Number of repeated runs per test point (average taken to reduce fluctuation) | 3 |
| --sla-number-multiplier | float | Multiplier of total requests relative to the tuned variable (concurrency or rate), i.e. number = round(variable × N); defaults to 2 when not set | None |
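The request-count rule for --sla-number-multiplier is simple arithmetic, and a short sketch may make it concrete. The function name and constant below are hypothetical, not part of evalscope; only the formula number = round(variable × N) and the fallback of 2 come from the parameter description. Python's built-in round is used here, and whether the tool rounds half values the same way is an assumption.

```python
from typing import Optional

# Documented fallback when --sla-number-multiplier is not set.
DEFAULT_MULTIPLIER = 2.0


def total_requests(variable: float, multiplier: Optional[float] = None) -> int:
    """Return the request count for one test point: round(variable * N).

    `variable` is the tuned value (concurrency or request rate).
    """
    n = DEFAULT_MULTIPLIER if multiplier is None else multiplier
    return round(variable * n)


print(total_requests(8))        # 16 (multiplier falls back to 2)
print(total_requests(10, 1.5))  # 15
```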

Supported Metrics and Operators

| Metric Category | Metric Name | Description | Supported Operators |
|---|---|---|---|
| Latency | avg_latency | Average request latency | <=, <, min |
| Latency | p99_latency | 99th percentile request latency | <=, <, min |
| Latency | avg_ttft | Average time to first token | <=, <, min |
| Latency | p99_ttft | 99th percentile time to first token | <=, <, min |
| Latency | avg_tpot | Average time per output token | <=, <, min |
| Latency | p99_tpot | 99th percentile time per output token | <=, <, min |
| Throughput | rps | Requests per second | >=, >, max |
| Throughput | tps | Tokens per second | >=, >, max |
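The metric and operator pairings listed above can be expressed as a small lookup, which is handy for checking a constraint before launching a long run. This is a sketch only: the names ALLOWED_OPERATORS and is_supported are hypothetical helpers, and the mapping is copied from the list of supported metrics rather than from the tool's source.

```python
# Metric names and operators as documented above.
LATENCY_METRICS = (
    "avg_latency", "p99_latency",
    "avg_ttft", "p99_ttft",
    "avg_tpot", "p99_tpot",
)
THROUGHPUT_METRICS = ("rps", "tps")

# Latency metrics are bounded from above; throughput metrics from below.
ALLOWED_OPERATORS = {
    **{metric: ("<=", "<", "min") for metric in LATENCY_METRICS},
    **{metric: (">=", ">", "max") for metric in THROUGHPUT_METRICS},
}


def is_supported(metric: str, operator: str) -> bool:
    """Return True if `operator` is documented as valid for `metric`."""
    return operator in ALLOWED_OPERATORS.get(metric, ())


print(is_supported("p99_latency", "<="))  # True
print(is_supported("tps", "<="))          # False
```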

--sla-params Logic

--sla-params accepts a JSON array in which each element is an object (a constraint group). The logic rules are as follows:

  • Multiple metrics within the same object: AND (all must be satisfied simultaneously)

  • Between different objects: OR (the groups are alternatives; each group is searched independently and reports its own result)

The overall semantics are: (Group1 ConditionA AND Group1 ConditionB) OR (Group2 ConditionC AND Group2 ConditionD) OR ...
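The grouping rules above can be sketched in a few lines: parse the JSON array into groups, then treat a group as satisfied only when every constraint in it holds. This is an illustrative reading of the documented semantics, not evalscope's parser; parse_groups, group_satisfied, and the regular expression are hypothetical, and only threshold operators are handled here (max and min are left out).

```python
import json
import re
from typing import Dict, List, Tuple

Constraint = Tuple[str, str, float]  # (metric, operator, threshold)

# Matches strings such as "<=2" or "<0.05"; "<=" must be tried before "<".
_PATTERN = re.compile(r"^\s*(<=|>=|<|>)\s*([0-9]*\.?[0-9]+)\s*$")

_COMPARE = {
    "<=": lambda value, limit: value <= limit,
    "<": lambda value, limit: value < limit,
    ">=": lambda value, limit: value >= limit,
    ">": lambda value, limit: value > limit,
}


def parse_groups(sla_params: str) -> List[List[Constraint]]:
    """Turn the JSON array into a list of constraint groups."""
    groups = []
    for obj in json.loads(sla_params):
        group = []
        for metric, expr in obj.items():
            match = _PATTERN.match(expr)
            if match is None:
                raise ValueError(f"unsupported constraint: {metric} {expr!r}")
            group.append((metric, match.group(1), float(match.group(2))))
        groups.append(group)
    return groups


def group_satisfied(group: List[Constraint], measured: Dict[str, float]) -> bool:
    """AND semantics: every constraint in the group must hold."""
    return all(_COMPARE[op](measured[metric], limit) for metric, op, limit in group)


groups = parse_groups(
    '[{"avg_ttft": "<=1", "avg_tpot": "<=0.05"}, {"p99_latency": "<=5"}]'
)
measured = {"avg_ttft": 0.8, "avg_tpot": 0.07, "p99_latency": 4.2}
# Each group is judged on its own; group 1 fails on avg_tpot, group 2 passes.
print([group_satisfied(g, measured) for g in groups])  # [False, True]
```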

AND Example: Satisfy TTFT and TPOT Simultaneously

Write multiple metrics in the same object to indicate they must all be satisfied:

--sla-params '[{"avg_ttft": "<=2", "avg_tpot": "<=0.05"}]'

Meaning: Find the maximum concurrency satisfying avg_ttft <= 2s AND avg_tpot <= 0.05s. Only when both metrics are met does that concurrency level pass.

OR Example: Independently Evaluate Multiple TTFT Thresholds

Write each metric in a different object so each group of conditions is evaluated independently:

--sla-params '[{"p99_ttft": "<0.05"}, {"p99_ttft": "<0.01"}]'

Meaning: Run two independent searches, one for the maximum request rate satisfying p99_ttft < 0.05s and one for the maximum request rate satisfying p99_ttft < 0.01s. Each search reports its own result.

AND + OR Combined Example

--sla-params '[{"avg_ttft": "<=1", "avg_tpot": "<=0.05"}, {"p99_latency": "<=5"}]'

Meaning:

  • Group 1: avg_ttft <= 1s AND avg_tpot <= 0.05s (both satisfied simultaneously)

  • Group 2: p99_latency <= 5s

  • Each group independently completes a binary search and outputs its maximum concurrency value separately.

Extremum Optimization Mode

When the array contains exactly one object with exactly one metric, and the operator is max or min, the tool enters extremum optimization mode and searches directly for the pressure value at which that metric is optimal:

--sla-params '[{"tps": "max"}]'

Meaning: Find the concurrency corresponding to maximum TPS (token throughput).
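The condition for entering extremum mode (one object, one metric, operator max or min) can be written as a small check. The helper name extremum_target is hypothetical and this is only a sketch of the documented rule, not the tool's own detection code.

```python
import json
from typing import Optional, Tuple


def extremum_target(sla_params: str) -> Optional[Tuple[str, str]]:
    """Return (metric, "max" | "min") if the params select extremum mode.

    Returns None for ordinary threshold constraints or multi-group input.
    """
    groups = json.loads(sla_params)
    # Exactly one group holding exactly one metric is required.
    if len(groups) != 1 or len(groups[0]) != 1:
        return None
    metric, expr = next(iter(groups[0].items()))
    operator = expr.strip()
    return (metric, operator) if operator in ("max", "min") else None


print(extremum_target('[{"tps": "max"}]'))           # ('tps', 'max')
print(extremum_target('[{"p99_latency": "<=2"}]'))   # None
```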

Workflow

  1. Baseline Test: Start testing from the user-specified initial parallel or rate value (a small value such as 1 or 2 is recommended).

  2. Boundary Detection:

    • If current metrics meet SLA, double the pressure until SLA is first violated or --sla-upper-bound is reached.

    • If initial metrics violate SLA, halve the pressure to find a lower bound that satisfies conditions.

  3. Binary Search: Perform a binary search within the boundary window determined in step 2 to pin down the maximum pressure value that “just doesn’t violate” the SLA.

  4. Result Confirmation: Each test point is run --sla-num-runs times (default 3), and the averaged metrics are used for the pass/fail decision.

  5. Report Output: After testing, output a summary of the tuning process and final results.
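Steps 1 to 3 above can be sketched as a doubling (or halving) phase followed by a binary search. The code below is a minimal illustration under stated assumptions, not evalscope's implementation: find_max_pressure and its passes callback are hypothetical, the starting value is assumed to be clamped into the search range, and the SLA check is assumed to be monotonic (once a pressure fails, every higher pressure fails too). The sample data reuses the P99 latencies shown in the first usage example of this page.

```python
from typing import Callable, Optional


def find_max_pressure(
    passes: Callable[[int], bool],
    start: int,
    lower: int = 1,
    upper: int = 65536,
) -> Optional[int]:
    """Largest value in [lower, upper] that passes, or None if none does."""
    if lower < 1 or lower > upper:
        raise ValueError("expected 1 <= lower <= upper")
    current = min(max(start, lower), upper)  # assumed clamping of the start

    if passes(current):
        # Boundary detection: double until the SLA is violated or upper is hit.
        good, bad = current, None
        while good < upper:
            candidate = min(good * 2, upper)
            if passes(candidate):
                good = candidate
            else:
                bad = candidate
                break
        if bad is None:
            return good  # never violated within the search range
    else:
        # Boundary detection: halve until a passing value is found.
        good, bad = None, current
        while bad > lower:
            candidate = max(bad // 2, lower)
            if passes(candidate):
                good = candidate
                break
            bad = candidate
        if good is None:
            return None  # fails even at the lower bound

    # Binary search: good always passes, bad always fails.
    while bad - good > 1:
        mid = (good + bad) // 2
        if passes(mid):
            good = mid
        else:
            bad = mid
    return good


# P99 latency (s) per concurrency, taken from usage example 1 on this page.
p99_latency_by_concurrency = {2: 1.413, 4: 1.635, 5: 1.657, 6: 3.001, 8: 3.181}
best = find_max_pressure(
    lambda c: p99_latency_by_concurrency[c] <= 2.0, start=2, upper=64
)
print(best)  # 5
```

With this data the sketch probes concurrency 2, 4, 8, 6 and 5, the same set of points that appears in that example's metrics table.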

Note: If the request success rate during testing is below 100%, that test point will be considered failed (violating SLA).
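The averaging in step 4 and the success-rate rule in the note above can be combined into a single pass/fail check for one test point. This is a sketch: the dictionary keys, the helper names, and the choice to reject the point if any individual run falls below a 100% success rate are assumptions, since the text does not say whether the success rate is checked per run or on the average.

```python
from statistics import mean
from typing import Dict, List

RunMetrics = Dict[str, float]


def average_runs(runs: List[RunMetrics]) -> RunMetrics:
    """Average every metric across the repeated runs of one test point."""
    return {key: mean(run[key] for run in runs) for key in runs[0]}


def point_passes(runs: List[RunMetrics], metric: str, limit: float) -> bool:
    """Apply an upper-limit constraint (metric <= limit) to averaged runs."""
    # Assumption: any run with failed requests fails the whole test point.
    if any(run["success_rate"] < 1.0 for run in runs):
        return False
    return average_runs(runs)[metric] <= limit


stable = [
    {"p99_latency": 1.8, "success_rate": 1.0},
    {"p99_latency": 2.1, "success_rate": 1.0},
    {"p99_latency": 1.9, "success_rate": 1.0},
]
flaky = stable[:2] + [{"p99_latency": 1.2, "success_rate": 0.98}]

print(point_passes(stable, "p99_latency", 2.0))  # True (average is about 1.93)
print(point_passes(flaky, "p99_latency", 2.0))   # False (a run had failures)
```

Averaging lets a single noisy run (2.1 s here) be absorbed by the others instead of ending the search early.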

When --sla-variable=rate, use --sla-fixed-parallel to explicitly control the fixed concurrency. If not set, the implementation falls back to --sla-upper-bound for backward compatibility.

Usage Examples

1. Find Maximum Concurrency Meeting P99 Latency <= 2s

evalscope perf \
 --model Qwen2.5-0.5B-Instruct \
 --tokenizer-path Qwen/Qwen2.5-0.5B-Instruct \
 --url http://127.0.0.1:8801/v1/chat/completions \
 --api openai \
 --dataset random \
 --max-tokens 1024 \
 --prefix-length 0 \
 --min-prompt-length 1024 \
 --max-prompt-length 1024 \
 --sla-auto-tune \
 --sla-variable parallel \
 --sla-params '[{"p99_latency": "<=2"}]' \
 --parallel 2 \
 --sla-upper-bound 64

Example output:

                                    Detailed Performance Metrics                                    
┏━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃      ┃      ┃      ┃     Avg ┃     P99 ┃     Avg ┃     P99 ┃     Avg ┃    P99 ┃    Gen. ┃ Success┃
┃Conc. ┃ Rate ┃  RPS ┃ Lat.(s) ┃ Lat.(s) ┃ TTFT(s) ┃ TTFT(s) ┃ TPOT(s) ┃ TPOT(… ┃  toks/s ┃    Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│    2 │  INF │ 2.19 │   0.928 │   1.413 │   0.030 │   0.038 │   0.003 │  0.003 │  640.20 │  100.0%│
│    4 │  INF │ 7.18 │   0.783 │   1.635 │   0.033 │   0.050 │   0.003 │  0.004 │ 1013.67 │  100.0%│
│    5 │  INF │ 6.39 │   0.743 │   1.657 │   0.038 │   0.061 │   0.003 │  0.004 │ 1210.93 │  100.0%│
│    6 │  INF │ 3.86 │   0.893 │   3.001 │   0.039 │   0.064 │   0.003 │  0.004 │ 1095.79 │  100.0%│
│    8 │  INF │ 4.03 │   1.286 │   3.181 │   0.044 │   0.081 │   0.003 │  0.004 │ 1615.33 │  100.0%│
└──────┴──────┴──────┴─────────┴─────────┴─────────┴─────────┴─────────┴────────┴─────────┴────────┘


               Best Performance Configuration               
 Highest RPS         Concurrency 2 (INF req/sec)            
 Lowest Latency      Concurrency 2 (2.19 seconds)           

Performance Recommendations:
• Consider lowering concurrency, current load may be too high
• Success rate is low at high concurrency, check system resources or reduce concurrency
2025-12-18 16:32:39 - evalscope - INFO: Performance summary saved to: outputs/20251218_163037/Qwen2.5-0.5B-Instruct/performance_summary.txt
2025-12-18 16:32:39 - evalscope - INFO: SLA Auto-tune Summary:
+--------------------+------------+-----------------+-----------+
| Criteria           | Variable   |   Max Satisfied | Note      |
+====================+============+=================+===========+
| p99_latency <= 2.0 | parallel   |               5 | Satisfied |
+--------------------+------------+-----------------+-----------+

2. Find Concurrency with Maximum TPS

evalscope perf \
 --model Qwen2.5-0.5B-Instruct \
 --tokenizer-path Qwen/Qwen2.5-0.5B-Instruct \
 --url http://127.0.0.1:8801/v1/chat/completions \
 --api openai \
 --dataset random \
 --max-tokens 1024 \
 --prefix-length 0 \
 --min-prompt-length 1024 \
 --max-prompt-length 1024 \
 --sla-auto-tune \
 --sla-variable parallel \
 --sla-params '[{"tps": "max"}]' \
 --parallel 4

Example output:

                                    Detailed Performance Metrics                                    
┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│ 32.0 │ 5.68 │    5.590 │    6.250 │ 5813.67 │    0.194 │   0.294 │    0.005 │   0.006 │    100.0%│
│ 64.0 │ 5.76 │   11.040 │   11.875 │ 5902.57 │    0.220 │   0.580 │    0.011 │   0.011 │    100.0%│
│128.0 │ 6.96 │   18.212 │   19.624 │ 7124.25 │    0.429 │   1.379 │    0.017 │   0.019 │    100.0%│
│256.0 │ 7.81 │   32.062 │   37.111 │ 8000.89 │    1.567 │   4.752 │    0.030 │   0.032 │    100.0%│
│384.0 │ 7.87 │   43.487 │   65.909 │ 8057.14 │   12.210 │  32.809 │    0.031 │   0.033 │    100.0%│
│385.0 │ 7.58 │   43.956 │   66.302 │ 7766.79 │   12.487 │  32.461 │    0.031 │   0.033 │    100.0%│
│386.0 │ 7.69 │   43.658 │   66.308 │ 7869.97 │   12.541 │  32.859 │    0.030 │   0.033 │    100.0%│
│388.0 │ 7.60 │   44.322 │   66.909 │ 7784.20 │   12.873 │  33.074 │    0.031 │   0.033 │    100.0%│
│392.0 │ 7.57 │   45.006 │   67.572 │ 7748.62 │   13.501 │  33.293 │    0.031 │   0.034 │    100.0%│
│400.0 │ 7.76 │   44.831 │   66.181 │ 7945.06 │   14.184 │  33.387 │    0.030 │   0.033 │    100.0%│
│416.0 │ 7.56 │   46.748 │   67.288 │ 7738.68 │   16.175 │  33.624 │    0.030 │   0.033 │    100.0%│
│448.0 │ 7.61 │   50.028 │   68.255 │ 7790.59 │   19.657 │  35.208 │    0.030 │   0.032 │    100.0%│
│512.0 │ 7.76 │   57.843 │   70.814 │ 7941.28 │   25.967 │  37.136 │    0.031 │   0.033 │    100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘


               Best Performance Configuration               
 Highest RPS         Concurrency 384.0 (7.87 req/sec)       
 Lowest Latency      Concurrency 32.0 (5.590 seconds)       

Performance Recommendations:
• Optimal concurrency range is around 384.0
2025-12-18 15:06:49 - evalscope - INFO: Performance summary saved to: ./outputs/20251218_144530/Qwen2.5-0.5B-Instruct/performance_summary.txt
2025-12-18 15:06:49 - evalscope - INFO: SLA Auto-tune Summary:
+------------+------------+-----------------+---------------------+
| Criteria   | Variable   |   Max Satisfied | Note                |
+============+============+=================+=====================+
| tps -> max | parallel   |             384 | Best tps: 8057.1438 |
+------------+------------+-----------------+---------------------+

3. Find Maximum Request Rate Meeting P99 TTFT < 0.05s and P99 TTFT < 0.01s Within a Specific Range

evalscope perf \
 --model Qwen2.5-0.5B-Instruct \
 --tokenizer-path Qwen/Qwen2.5-0.5B-Instruct \
 --url http://127.0.0.1:8801/v1/chat/completions \
 --api openai \
 --dataset random \
 --max-tokens 512 \
 --prefix-length 0 \
 --min-prompt-length 512 \
 --max-prompt-length 512 \
 --sla-auto-tune \
 --sla-variable rate \
 --sla-params '[{"p99_ttft": "<0.05"}, {"p99_ttft": "<0.01"}]' \
 --rate 2 \
 --sla-num-runs 1 \
 --sla-fixed-parallel 40 \
 --sla-lower-bound 10 \
 --sla-upper-bound 40

Example output:

                                    Detailed Performance Metrics                                    
┏━━━━━━┳━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃      ┃      ┃       ┃     Avg ┃     P99 ┃     Avg ┃    P99 ┃     Avg ┃    P99 ┃    Gen. ┃ Success┃
┃Conc. ┃ Rate ┃   RPS ┃ Lat.(s) ┃ Lat.(s) ┃ TTFT(s) ┃ TTFT(… ┃ TPOT(s) ┃ TPOT(… ┃  toks/s ┃    Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│   40 │   10 │  5.16 │   0.570 │   1.530 │   0.021 │  0.029 │   0.003 │  0.004 │  948.33 │  100.0%│
│   40 │   15 │  9.78 │   0.793 │   1.743 │   0.024 │  0.034 │   0.003 │  0.004 │ 2249.29 │  100.0%│
│   40 │   17 │  8.17 │   0.606 │   1.623 │   0.022 │  0.031 │   0.003 │  0.004 │ 1530.79 │  100.0%│
│   40 │   18 │ 10.30 │   0.799 │   1.712 │   0.023 │  0.042 │   0.003 │  0.004 │ 2466.09 │  100.0%│
│   40 │   19 │ 12.83 │   0.611 │   1.682 │   0.021 │  0.027 │   0.004 │  0.005 │ 2296.22 │  100.0%│
│   40 │   20 │ 11.81 │   0.744 │   1.861 │   0.023 │  0.054 │   0.004 │  0.006 │ 2435.94 │  100.0%│
└──────┴──────┴───────┴─────────┴─────────┴─────────┴────────┴─────────┴────────┴─────────┴────────┘


               Best Performance Configuration               
 Highest RPS         Concurrency 40 (20 req/sec)            
 Lowest Latency      Concurrency 40 (5.16 seconds)          

Performance Recommendations:
• The system seems not to have reached its performance bottleneck, try higher concurrency
• Success rate is low at high concurrency, check system resources or reduce concurrency
2025-12-18 16:19:48 - evalscope - INFO: Performance summary saved to: outputs/20251218_161909/Qwen2.5-0.5B-Instruct/performance_summary.txt
2025-12-18 16:19:48 - evalscope - INFO: SLA Auto-tune Summary:
+-----------------+------------+-----------------+----------------------------+
| Criteria        | Variable   | Max Satisfied   | Note                       |
+=================+============+=================+============================+
| p99_ttft < 0.05 | rate       | 19              | Satisfied                  |
+-----------------+------------+-----------------+----------------------------+
| p99_ttft < 0.01 | rate       | None            | Failed at lower bound (10) |
+-----------------+------------+-----------------+----------------------------+