AIGC Evaluation

EvalScope supports evaluation of AI-generated content (AIGC). Current coverage includes text-to-image consistency and image aesthetics, among others.

  • Text-to-Image Evaluation
    • Supported Evaluation Datasets
    • Supported Evaluation Metrics
    • Install Dependencies
    • Benchmark
    • Custom Evaluation
    • Visualization
  • Image Editing Evaluation
    • Supported Evaluation Datasets
    • Installation of Dependencies
    • End-to-End Evaluation
    • Custom Evaluation
    • Visualization

© 2022-2024, Alibaba ModelScope Built with Sphinx 9.1.0