Model Inference Stress Testing

A stress-testing tool for large language models. It supports the OpenAI API format by default and can be customized to work with other API protocols and dataset formats.

  • Quick Start
    • Environment Preparation
    • Basic Usage
    • Visualizing Test Results
  • Parameter Description
    • Basic Settings
    • Network Configuration
    • Request Control
    • Prompt Settings
    • Dataset Configuration
    • Model Settings
    • Data Storage
    • Other Parameters
  • Examples
    • Using Local Model Inference
    • Using prompt
    • Complex Requests
    • Using query-template
    • Using the Random Dataset
    • Using the Random Multimodal Dataset
    • Using wandb to Record Test Results
    • Using swanlab to Record Test Results
    • Debugging Requests
  • Speed Benchmark Testing
    • Online API Inference
    • Local Transformer Inference
    • Local vLLM Inference
  • vLLM Bench vs Evalscope Perf Load Testing Comparison
    • TL;DR: Quick Comparison Recipe
    • Environment and Prerequisites
    • Unified Server Configuration
    • Parameter Alignment Guide (Key Mappings)
    • Consistency Validation: Minimum Example (1 Concurrent / 1 Request)
    • Full Load Test: 50 Concurrency / 500 Requests
    • Metric Definitions and Naming Correspondence
    • Common Sources of Discrepancies and Troubleshooting Suggestions
  • Custom Usage
    • Custom Result Analysis
    • Custom API Requests
    • Custom Dataset
    • Notes

© 2022-2024, Alibaba ModelScope. Built with Sphinx 8.2.3.