EvalScope Service Deployment#
Introduction#
EvalScope service mode provides HTTP API-based evaluation and stress testing capabilities, designed to address the following scenarios:
Remote Invocation: Run evaluations over the network without configuring a complex evaluation environment locally
Service Integration: Easily integrate evaluation capabilities into existing workflows, CI/CD pipelines, or automated testing systems
Multi-user Collaboration: Support multiple users or systems calling the evaluation service simultaneously, improving resource utilization
Unified Management: Centrally manage evaluation resources and configurations for easier maintenance and monitoring
Flexible Deployment: Can be deployed on dedicated servers or container environments, decoupled from business systems
The service is built on Flask and wraps EvalScope’s core evaluation (eval) and stress testing (perf) functionality behind standard RESTful APIs, so evaluation capabilities can be called and integrated like any other microservice.
Features#
Model Evaluation (/api/v1/eval): Supports evaluation of OpenAI API-compatible models; for request parameters, refer to the documentation
Performance Testing (/api/v1/perf): Supports performance benchmarking of OpenAI API-compatible models; for request parameters, refer to the documentation
Environment Setup#
Full Installation (Recommended)#
pip install 'evalscope[service]'
Development Environment Installation#
# Clone repository
git clone https://github.com/modelscope/evalscope.git
cd evalscope
# Install development version with service
pip install -e '.[service]'
Starting the Service#
Command Line Launch#
# Use default configuration (host: 0.0.0.0, port: 9000)
evalscope service
# Custom host and port
evalscope service --host 127.0.0.1 --port 9000
# Enable debug mode
evalscope service --debug
Python Code Launch#
from evalscope.service import run_service
# Start service
run_service(host='0.0.0.0', port=9000, debug=False)
API Endpoints#
1. Health Check#
GET /health
Response Example:
{
  "status": "ok",
  "service": "evalscope",
  "timestamp": "2025-12-04T10:00:00"
}
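A deployment script usually has to wait for the service to come up before sending work. The sketch below polls GET /health until the status field reads "ok". Only the endpoint path and the "status" field come from the response example above; the helper names (is_healthy, fetch_health, wait_until_healthy), the retry count, and the delay are illustrative assumptions, not part of EvalScope.

```python
import json
import time
import urllib.error
import urllib.request


def is_healthy(payload) -> bool:
    """Return True when a /health payload reports status "ok"."""
    return isinstance(payload, dict) and payload.get("status") == "ok"


def fetch_health(base_url: str, timeout: float = 2.0) -> dict:
    """GET {base_url}/health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_healthy(base_url: str, attempts: int = 10, delay: float = 1.0,
                       fetch=fetch_health) -> bool:
    """Poll the health endpoint until it reports "ok" or attempts run out.

    `fetch` is injectable so the loop can be exercised without a live service.
    """
    for attempt in range(attempts):
        try:
            if is_healthy(fetch(base_url)):
                return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # service not reachable yet, or the body was not valid JSON
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling wait_until_healthy("http://localhost:9000") after launching evalscope service should return True once the service answers, and False if it never does within the allotted attempts.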
2. Model Evaluation#
POST /api/v1/eval
Request Body Example:
{
  "model": "qwen-plus",
  "api_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
  "api_key": "your-api-key",
  "datasets": ["gsm8k", "iquiz"],
  "limit": 10,
  "generation_config": {
    "temperature": 0.0,
    "max_tokens": 2048
  }
}
Required Parameters:
model: Model name
datasets: List of datasets
api_url: API endpoint URL (OpenAI-compatible)
Optional Parameters:
api_key: API key (default: "EMPTY")
limit: Evaluation sample quantity limit
eval_batch_size: Batch size (default: 1)
generation_config: Generation configuration
  temperature: Temperature parameter (default: 0.0)
  max_tokens: Maximum generation tokens (default: 2048)
  top_p: Nucleus sampling parameter
  top_k: Top-k sampling parameter
work_dir: Output directory
debug: Debug mode
seed: Random seed (default: 42)
See also
For detailed parameter descriptions, refer to: Evaluation Parameter Documentation
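Since a request missing a required parameter is rejected by the service, it can be convenient to check the body on the client before sending it. The following is a minimal sketch under that assumption: build_eval_payload and REQUIRED_EVAL_FIELDS are hypothetical names, and the only facts taken from this page are the three required parameters. Optional parameters are passed through unchanged so the service applies its own defaults.

```python
REQUIRED_EVAL_FIELDS = ("model", "datasets", "api_url")


def build_eval_payload(model: str, datasets, api_url: str, **optional) -> dict:
    """Assemble a request body for POST /api/v1/eval.

    Optional keys (api_key, limit, eval_batch_size, generation_config, ...)
    are copied as given; keys whose value is None are dropped so the
    service falls back to its documented defaults.
    """
    if not isinstance(datasets, (list, tuple)) or not datasets:
        raise ValueError("datasets must be a non-empty list of dataset names")
    payload = {"model": model, "datasets": list(datasets), "api_url": api_url}
    payload.update({k: v for k, v in optional.items() if v is not None})
    # Reject empty required values before any network call is made.
    missing = [name for name in REQUIRED_EVAL_FIELDS if not payload.get(name)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    return payload
```

The returned dictionary can be passed directly as the json argument of requests.post, as in the usage examples on this page.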
Response Example:
{
  "status": "success",
  "message": "Evaluation completed",
  "result": {"...": "..."},
  "output_dir": "/path/to/outputs/20251204_100000"
}
3. Performance Testing#
POST /api/v1/perf
Request Body Example:
{
  "model": "qwen-plus",
  "url": "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions",
  "api": "openai",
  "api_key": "your-api-key",
  "number": 100,
  "parallel": 10,
  "dataset": "openqa",
  "max_tokens": 2048,
  "temperature": 0.0
}
Required Parameters:
model: Model name
url: Complete API endpoint URL
Optional Parameters:
api: API type (openai/dashscope/anthropic/gemini, default: "openai")
api_key: API key
number: Total number of requests (default: 1000)
parallel: Concurrency level (default: 1)
rate: Requests per second limit (default: -1, unlimited)
dataset: Dataset name (default: "openqa")
max_tokens: Maximum generation tokens (default: 2048)
temperature: Temperature parameter (default: 0.0)
stream: Whether to use streaming output (default: true)
debug: Debug mode
See also
For detailed parameter descriptions, refer to: Performance Parameter Documentation
Response Example:
{
  "status": "success",
  "message": "Performance test completed",
  "output_dir": "/path/to/outputs",
  "results": {
    "parallel_10_number_100": {
      "metrics": {"...": "..."},
      "percentiles": {"...": "..."}
    }
  }
}
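The results object in this example is keyed by a string that appears to combine the concurrency level and the request count. The sketch below splits such keys back into numbers so runs can be sorted or compared. The key pattern parallel_<P>_number_<N> is inferred from the single example above and may not cover every case; parse_run_key and summarize_runs are hypothetical helper names, and the contents of metrics and percentiles are passed through untouched because the example elides them.

```python
import re

# Pattern inferred from the example key "parallel_10_number_100".
_RUN_KEY = re.compile(r"^parallel_(\d+)_number_(\d+)$")


def parse_run_key(key: str) -> tuple:
    """Split a results key into (parallel, number)."""
    match = _RUN_KEY.match(key)
    if match is None:
        raise ValueError(f"unrecognized results key: {key!r}")
    return int(match.group(1)), int(match.group(2))


def summarize_runs(response: dict) -> list:
    """Return one entry per run, sorted by concurrency level and request count."""
    runs = []
    for key, body in response.get("results", {}).items():
        parallel, number = parse_run_key(key)
        runs.append({
            "parallel": parallel,
            "number": number,
            "metrics": body.get("metrics", {}),
            "percentiles": body.get("percentiles", {}),
        })
    return sorted(runs, key=lambda run: (run["parallel"], run["number"]))
```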
Usage Examples#
Testing Evaluation Endpoint with curl#
curl -X POST http://localhost:9000/api/v1/eval \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-plus",
    "api_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "api_key": "your-api-key",
    "datasets": ["gsm8k"],
    "limit": 5
  }'
Testing Performance Endpoint with curl#
curl -X POST http://localhost:9000/api/v1/perf \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-plus",
    "url": "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions",
    "api": "openai",
    "number": 50,
    "parallel": 5
  }'
Using Python requests#
import requests

# Evaluation request
eval_response = requests.post(
    'http://localhost:9000/api/v1/eval',
    json={
        'model': 'qwen-plus',
        'api_url': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
        'api_key': 'your-api-key',
        'datasets': ['gsm8k', 'iquiz'],
        'limit': 10,
        'generation_config': {
            'temperature': 0.0,
            'max_tokens': 2048
        }
    }
)
print(eval_response.json())

# Performance test request
perf_response = requests.post(
    'http://localhost:9000/api/v1/perf',
    json={
        'model': 'qwen-plus',
        'url': 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions',
        'api': 'openai',
        'number': 100,
        'parallel': 10,
        'dataset': 'openqa'
    }
)
print(perf_response.json())
Important Notes#
OpenAI API-Compatible Models Only: This service is designed specifically for OpenAI API-compatible models
Long-Running Tasks: Evaluation and performance testing tasks may take considerable time. We recommend setting appropriate HTTP timeout values on the client side, as the API calls are synchronous and will block until completion.
Output Directory: Evaluation results are saved in the configured work_dir (default: outputs/)
Error Handling: The service returns detailed error messages and stack traces (in debug mode)
Resource Management: Pay attention to concurrency settings during stress testing to avoid server overload
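Because the API calls block until the task finishes, the client timeout has to be long enough to cover the whole run. The sketch below shows one way to set it using only the Python standard library. The names build_request and post_json are hypothetical, and the one-hour default is an arbitrary placeholder to tune per workload. Note that urllib applies the timeout to each blocking socket operation rather than to the request as a whole.

```python
import json
import socket
import urllib.error
import urllib.request


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Create a JSON POST request for one of the service endpoints."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(url: str, payload: dict, timeout: float = 3600.0) -> dict:
    """Send the request and block until the service replies or the timeout expires."""
    try:
        with urllib.request.urlopen(build_request(url, payload), timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError:
        raise  # 4xx/5xx replies: leave status handling to the caller
    except urllib.error.URLError as exc:
        # A timeout while connecting is wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            raise TimeoutError(f"no reply from {url} within {timeout}s") from exc
        raise
    except socket.timeout as exc:
        # A timeout while waiting for the response body is raised directly.
        raise TimeoutError(f"no reply from {url} within {timeout}s") from exc
```

With the requests library used elsewhere on this page, the equivalent is to pass timeout= to requests.post.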
Error Codes#
400: Invalid request parameters
404: Endpoint not found
500: Internal server error
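A client can turn these status codes into a single exception type so that callers do not need to inspect raw HTTP responses. This is only a sketch: ServiceError, STATUS_MEANINGS, and check_status are hypothetical names, and because the shape of the error body is not documented here, the helper simply carries the raw response text along.

```python
# Meanings copied from the error code list above.
STATUS_MEANINGS = {
    400: "Invalid request parameters",
    404: "Endpoint not found",
    500: "Internal server error",
}


class ServiceError(Exception):
    """Raised when the evaluation service answers with an error status."""

    def __init__(self, status_code: int, detail: str = ""):
        self.status_code = status_code
        self.detail = detail
        meaning = STATUS_MEANINGS.get(status_code, "Unexpected error")
        message = f"HTTP {status_code}: {meaning}"
        if detail:
            message = f"{message} ({detail})"
        super().__init__(message)


def check_status(status_code: int, detail: str = "") -> None:
    """Raise ServiceError for 4xx/5xx codes; do nothing otherwise."""
    if status_code >= 400:
        raise ServiceError(status_code, detail)
```

With requests, this could be called as check_status(response.status_code, response.text) before reading response.json().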
Example Scenarios#
Scenario 1: Quick Evaluation of Qwen Model#
curl -X POST http://localhost:9000/api/v1/eval \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-plus",
    "api_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "api_key": "sk-...",
    "datasets": ["gsm8k"],
    "limit": 100
  }'
Scenario 2: Stress Testing Locally Deployed Model#
curl -X POST http://localhost:9000/api/v1/perf \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5",
    "url": "http://localhost:8000/v1/chat/completions",
    "api": "openai",
    "number": 1000,
    "parallel": 20,
    "max_tokens": 2048
  }'
Scenario 3: Multi-Dataset Evaluation#
curl -X POST http://localhost:9000/api/v1/eval \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-plus",
    "api_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "datasets": ["gsm8k", "iquiz", "ceval"],
    "limit": 50,
    "eval_batch_size": 4
  }'