EvalScope

Other Datasets

  • OpenCompass
  • VLMEvalKit Backend
    • Image Understanding Dataset
    • Video Understanding Dataset
  • MTEB
    • CMTEB Evaluation Dataset
    • MTEB Evaluation Dataset
  • CLIP-Benchmark

© 2022-2024, Alibaba ModelScope. Built with Sphinx 8.2.3.