SLA Auto-Tuning#
The SLA (Service Level Agreement) auto-tuning feature allows users to define service quality metrics (such as latency and throughput), and the tool will automatically adjust request pressure (concurrency or request rate) to find the maximum pressure value that the service can sustain while meeting these metrics.
Features#
- Automatic Detection: Uses a binary search algorithm to automatically find the maximum concurrency (parallel) or request rate (rate) that satisfies the SLA constraints.
- Multi-Metric Support: Supports end-to-end latency (Latency), time to first token (TTFT), time per output token (TPOT), as well as request throughput (RPS) and token throughput (TPS).
- Flexible Constraints: Supports setting upper limits (e.g., p99_latency <= 2s) or finding extremes (e.g., tps: max).
- Stable Results: Each test point is run multiple times by default and the results are averaged to reduce the effect of network fluctuations.
Parameter Description#
See Parameter Description for details.
Main parameters:
- --sla-auto-tune: Enable auto-tuning.
- --sla-variable: Adjustment variable, parallel or rate.
- --sla-params: Define SLA rules as a JSON list, e.g. '[{"p99_latency": "<=2"}]'.
Supported Metrics and Operators#
| Metric Category | Metric Name | Description | Supported Operators |
|---|---|---|---|
| Latency | | Average request latency | |
| | p99_latency | 99th percentile request latency | |
| | | Average time to first token | |
| | p99_ttft | 99th percentile time to first token | |
| | | Average time per output token | |
| | | 99th percentile time per output token | |
| Throughput | | Requests per second | |
| | tps | Tokens per second | |
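As a rough illustration of how a rule from --sla-params might be interpreted, the sketch below parses the JSON list and checks a measured value against a rule. It is not evalscope's own parser: parse_rule and satisfies are hypothetical names, and only the operators that appear in this document's examples (<=, < and max) are handled.

```python
import json
import operator

# Only the operators seen in this document's examples are covered here.
_COMPARATORS = {"<=": operator.le, "<": operator.lt}


def parse_rule(expr):
    """Split a rule string such as "<=2" into (operator, threshold)."""
    expr = expr.strip()
    if expr == "max":
        return ("max", None)
    # Try the two-character operator first so "<=" is not read as "<".
    for op in ("<=", "<"):
        if expr.startswith(op):
            return (op, float(expr[len(op):]))
    raise ValueError(f"unsupported rule: {expr!r}")


def satisfies(rule, value):
    """Return True if a measured value meets the parsed rule."""
    op, threshold = rule
    if op == "max":
        # Extreme-seeking rules have no pass/fail threshold.
        return True
    return _COMPARATORS[op](value, threshold)


rules = json.loads('[{"p99_latency": "<=2"}, {"tps": "max"}]')
parsed = [(name, parse_rule(expr)) for rule in rules for name, expr in rule.items()]
print(parsed)
```

With the values from the first usage example, a P99 latency of 1.657 s would pass the "<=2" rule and 3.001 s would not.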
Workflow#
1. Baseline Test: Start testing with the user-specified initial parallel or rate (a small value such as 1 or 2 is recommended).
2. Boundary Detection:
   - If the current metrics meet the SLA, double the pressure until the SLA is first violated or --sla-upper-bound is reached.
   - If the initial metrics violate the SLA, halve the pressure to find a lower bound that satisfies the conditions.
3. Binary Search: Perform a binary search within the determined boundary window to pin down the maximum pressure value that "just doesn't violate" the SLA.
4. Result Confirmation: Each test point is run --sla-num-runs times (default 3), and the average is used for judgment.
5. Report Output: After testing, output a summary of the tuning process and the final results.
Note: If the request success rate during testing is below 100%, that test point will be considered failed (violating SLA).
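The workflow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not evalscope's implementation: find_max_pressure, run_once and passes are hypothetical names, pressure is treated as an integer, and the stand-in service simply has a latency that grows linearly with concurrency.

```python
from statistics import mean


def find_max_pressure(run_once, passes, start, lower_bound=1, upper_bound=64, num_runs=3):
    """Return the largest integer pressure whose averaged metric passes, or None."""
    cache = {}

    def ok(pressure):
        # Average num_runs measurements before judging a test point.
        if pressure not in cache:
            cache[pressure] = passes(mean(run_once(pressure) for _ in range(num_runs)))
        return cache[pressure]

    start = max(lower_bound, min(start, upper_bound))
    if ok(start):
        # Double until the first violation or until the upper bound is reached.
        good, bad = start, None
        while good < upper_bound:
            candidate = min(good * 2, upper_bound)
            if ok(candidate):
                good = candidate
            else:
                bad = candidate
                break
        if bad is None:
            return good
    else:
        # Halve until a passing point is found or the lower bound is reached.
        bad, good = start, None
        while bad > lower_bound:
            candidate = max(bad // 2, lower_bound)
            if ok(candidate):
                good = candidate
                break
            bad = candidate
        if good is None:
            return None

    # Binary search between the last passing and the first failing point.
    while bad - good > 1:
        mid = (good + bad) // 2
        if ok(mid):
            good = mid
        else:
            bad = mid
    return good


# Hypothetical service: P99 latency grows by 0.35 s per concurrent request.
result = find_max_pressure(lambda p: 0.35 * p, lambda latency: latency <= 2, start=2)
print(result)  # 5
```

With this stand-in service the search visits concurrency 2, 4, 8, 6 and 5. A real check would also treat any test point with a success rate below 100% as failed, as the note above describes.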
Usage Examples#
1. Find Maximum Concurrency Meeting P99 Latency <= 2s#
evalscope perf \
--model Qwen2.5-0.5B-Instruct \
--tokenizer-path Qwen/Qwen2.5-0.5B-Instruct \
--url http://127.0.0.1:8801/v1/chat/completions \
--api openai \
--dataset random \
--max-tokens 1024 \
--prefix-length 0 \
--min-prompt-length 1024 \
--max-prompt-length 1024 \
--sla-auto-tune \
--sla-variable parallel \
--sla-params '[{"p99_latency": "<=2"}]' \
--parallel 2 \
--sla-upper-bound 64
Example output:
Detailed Performance Metrics
┏━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃      ┃      ┃      ┃     Avg ┃     P99 ┃     Avg ┃     P99 ┃     Avg ┃    P99 ┃    Gen. ┃ Success┃
┃Conc. ┃ Rate ┃  RPS ┃ Lat.(s) ┃ Lat.(s) ┃ TTFT(s) ┃ TTFT(s) ┃ TPOT(s) ┃ TPOT(… ┃  toks/s ┃    Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│    2 │  INF │ 2.19 │   0.928 │   1.413 │   0.030 │   0.038 │   0.003 │  0.003 │  640.20 │  100.0%│
│    4 │  INF │ 7.18 │   0.783 │   1.635 │   0.033 │   0.050 │   0.003 │  0.004 │ 1013.67 │  100.0%│
│    5 │  INF │ 6.39 │   0.743 │   1.657 │   0.038 │   0.061 │   0.003 │  0.004 │ 1210.93 │  100.0%│
│    6 │  INF │ 3.86 │   0.893 │   3.001 │   0.039 │   0.064 │   0.003 │  0.004 │ 1095.79 │  100.0%│
│    8 │  INF │ 4.03 │   1.286 │   3.181 │   0.044 │   0.081 │   0.003 │  0.004 │ 1615.33 │  100.0%│
└──────┴──────┴──────┴─────────┴─────────┴─────────┴─────────┴─────────┴────────┴─────────┴────────┘
Best Performance Configuration
Highest RPS Concurrency 2 (INF req/sec)
Lowest Latency Concurrency 2 (2.19 seconds)
Performance Recommendations:
• Consider lowering concurrency, current load may be too high
• Success rate is low at high concurrency, check system resources or reduce concurrency
2025-12-18 16:32:39 - evalscope - INFO: Performance summary saved to: outputs/20251218_163037/Qwen2.5-0.5B-Instruct/performance_summary.txt
2025-12-18 16:32:39 - evalscope - INFO: SLA Auto-tune Summary:
+--------------------+------------+-----------------+-----------+
| Criteria           | Variable   |   Max Satisfied | Note      |
+====================+============+=================+===========+
| p99_latency <= 2.0 | parallel   |               5 | Satisfied |
+--------------------+------------+-----------------+-----------+
2. Find Concurrency with Maximum TPS#
evalscope perf \
--model Qwen2.5-0.5B-Instruct \
--tokenizer-path Qwen/Qwen2.5-0.5B-Instruct \
--url http://127.0.0.1:8801/v1/chat/completions \
--api openai \
--dataset random \
--max-tokens 1024 \
--prefix-length 0 \
--min-prompt-length 1024 \
--max-prompt-length 1024 \
--sla-auto-tune \
--sla-variable parallel \
--sla-params '[{"tps": "max"}]' \
--parallel 4
Example output:
Detailed Performance Metrics
┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│ 32.0 │ 5.68 │    5.590 │    6.250 │ 5813.67 │    0.194 │   0.294 │    0.005 │   0.006 │    100.0%│
│ 64.0 │ 5.76 │   11.040 │   11.875 │ 5902.57 │    0.220 │   0.580 │    0.011 │   0.011 │    100.0%│
│128.0 │ 6.96 │   18.212 │   19.624 │ 7124.25 │    0.429 │   1.379 │    0.017 │   0.019 │    100.0%│
│256.0 │ 7.81 │   32.062 │   37.111 │ 8000.89 │    1.567 │   4.752 │    0.030 │   0.032 │    100.0%│
│384.0 │ 7.87 │   43.487 │   65.909 │ 8057.14 │   12.210 │  32.809 │    0.031 │   0.033 │    100.0%│
│385.0 │ 7.58 │   43.956 │   66.302 │ 7766.79 │   12.487 │  32.461 │    0.031 │   0.033 │    100.0%│
│386.0 │ 7.69 │   43.658 │   66.308 │ 7869.97 │   12.541 │  32.859 │    0.030 │   0.033 │    100.0%│
│388.0 │ 7.60 │   44.322 │   66.909 │ 7784.20 │   12.873 │  33.074 │    0.031 │   0.033 │    100.0%│
│392.0 │ 7.57 │   45.006 │   67.572 │ 7748.62 │   13.501 │  33.293 │    0.031 │   0.034 │    100.0%│
│400.0 │ 7.76 │   44.831 │   66.181 │ 7945.06 │   14.184 │  33.387 │    0.030 │   0.033 │    100.0%│
│416.0 │ 7.56 │   46.748 │   67.288 │ 7738.68 │   16.175 │  33.624 │    0.030 │   0.033 │    100.0%│
│448.0 │ 7.61 │   50.028 │   68.255 │ 7790.59 │   19.657 │  35.208 │    0.030 │   0.032 │    100.0%│
│512.0 │ 7.76 │   57.843 │   70.814 │ 7941.28 │   25.967 │  37.136 │    0.031 │   0.033 │    100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘
Best Performance Configuration
Highest RPS Concurrency 384.0 (7.87 req/sec)
Lowest Latency Concurrency 32.0 (5.590 seconds)
Performance Recommendations:
• Optimal concurrency range is around 384.0
2025-12-18 15:06:49 - evalscope - INFO: Performance summary saved to: ./outputs/20251218_144530/Qwen2.5-0.5B-Instruct/performance_summary.txt
2025-12-18 15:06:49 - evalscope - INFO: SLA Auto-tune Summary:
+------------+------------+-----------------+---------------------+
| Criteria   | Variable   |   Max Satisfied | Note                |
+============+============+=================+=====================+
| tps -> max | parallel   |             384 | Best tps: 8057.1438 |
+------------+------------+-----------------+---------------------+
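To make the "max" rule concrete, the short sketch below picks the measured point with the highest token throughput from the table above. The (concurrency, tps) pairs are copied from the Conc. and Gen. toks/s columns; best_point is a hypothetical helper, not part of evalscope, and it only inspects points that were already measured.

```python
# (concurrency, tps) pairs copied from the detailed metrics table above.
measured = [
    (32, 5813.67), (64, 5902.57), (128, 7124.25), (256, 8000.89),
    (384, 8057.14), (385, 7766.79), (386, 7869.97), (388, 7784.20),
    (392, 7748.62), (400, 7945.06), (416, 7738.68), (448, 7790.59),
    (512, 7941.28),
]


def best_point(points):
    """Return the (pressure, metric) pair with the highest metric value."""
    return max(points, key=lambda point: point[1])


best_concurrency, best_tps = best_point(measured)
print(best_concurrency, best_tps)  # 384 8057.14
```

This agrees with the summary row, which reports 384 as the best concurrency for tps.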
3. Find Maximum Request Rate Meeting P99 TTFT < 0.05s and P99 TTFT < 0.01s within a Specific Range#
evalscope perf \
--model Qwen2.5-0.5B-Instruct \
--tokenizer-path Qwen/Qwen2.5-0.5B-Instruct \
--url http://127.0.0.1:8801/v1/chat/completions \
--api openai \
--dataset random \
--max-tokens 512 \
--prefix-length 0 \
--min-prompt-length 512 \
--max-prompt-length 512 \
--sla-auto-tune \
--sla-variable rate \
--sla-params '[{"p99_ttft": "<0.05"}, {"p99_ttft": "<0.01"}]' \
--rate 2 \
--sla-num-runs 1 \
--sla-lower-bound 10 \
--sla-upper-bound 40
Example output:
Detailed Performance Metrics
┏━━━━━━┳━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃      ┃      ┃       ┃     Avg ┃     P99 ┃     Avg ┃    P99 ┃     Avg ┃    P99 ┃    Gen. ┃ Success┃
┃Conc. ┃ Rate ┃   RPS ┃ Lat.(s) ┃ Lat.(s) ┃ TTFT(s) ┃ TTFT(… ┃ TPOT(s) ┃ TPOT(… ┃  toks/s ┃    Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│   40 │   10 │  5.16 │   0.570 │   1.530 │   0.021 │  0.029 │   0.003 │  0.004 │  948.33 │  100.0%│
│   40 │   15 │  9.78 │   0.793 │   1.743 │   0.024 │  0.034 │   0.003 │  0.004 │ 2249.29 │  100.0%│
│   40 │   17 │  8.17 │   0.606 │   1.623 │   0.022 │  0.031 │   0.003 │  0.004 │ 1530.79 │  100.0%│
│   40 │   18 │ 10.30 │   0.799 │   1.712 │   0.023 │  0.042 │   0.003 │  0.004 │ 2466.09 │  100.0%│
│   40 │   19 │ 12.83 │   0.611 │   1.682 │   0.021 │  0.027 │   0.004 │  0.005 │ 2296.22 │  100.0%│
│   40 │   20 │ 11.81 │   0.744 │   1.861 │   0.023 │  0.054 │   0.004 │  0.006 │ 2435.94 │  100.0%│
└──────┴──────┴───────┴─────────┴─────────┴─────────┴────────┴─────────┴────────┴─────────┴────────┘
Best Performance Configuration
Highest RPS Concurrency 40 (20 req/sec)
Lowest Latency Concurrency 40 (5.16 seconds)
Performance Recommendations:
• The system seems not to have reached its performance bottleneck, try higher concurrency
• Success rate is low at high concurrency, check system resources or reduce concurrency
2025-12-18 16:19:48 - evalscope - INFO: Performance summary saved to: outputs/20251218_161909/Qwen2.5-0.5B-Instruct/performance_summary.txt
2025-12-18 16:19:48 - evalscope - INFO: SLA Auto-tune Summary:
+-----------------+------------+-----------------+----------------------------+
| Criteria        | Variable   | Max Satisfied   | Note                       |
+=================+============+=================+============================+
| p99_ttft < 0.05 | rate       | 19              | Satisfied                  |
+-----------------+------------+-----------------+----------------------------+
| p99_ttft < 0.01 | rate       | None            | Failed at lower bound (10) |
+-----------------+------------+-----------------+----------------------------+
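The summary can be cross-checked against the measured rows. The sketch below applies each threshold separately to the (rate, P99 TTFT) pairs copied from the table above; max_satisfied is a hypothetical helper, and unlike the real tool it only filters points that were already measured rather than searching for new ones.

```python
# (rate, p99_ttft) pairs copied from the detailed metrics table above.
measured = [(10, 0.029), (15, 0.034), (17, 0.031), (18, 0.042), (19, 0.027), (20, 0.054)]


def max_satisfied(points, threshold):
    """Largest rate whose P99 TTFT is strictly below the threshold, or None."""
    passing = [rate for rate, p99_ttft in points if p99_ttft < threshold]
    return max(passing) if passing else None


# Each rule is evaluated on its own, mirroring the one-row-per-rule summary.
for threshold in (0.05, 0.01):
    print(f"p99_ttft < {threshold}: {max_satisfied(measured, threshold)}")
```

For these rows the first rule yields 19 and the second yields None, since even the lowest tested rate (10) has a P99 TTFT of 0.029 s.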