BLINK

Overview

BLINK is a benchmark that evaluates the core visual perception abilities of Multimodal Large Language Models (MLLMs). It recasts 14 classic computer vision tasks as 3,807 multiple-choice questions, each paired with one or more images and visual prompts. The val split evaluated by default contains 1,901 of these questions (see Data Statistics).

Task Description

  • Task Type: Visual Perception Multiple-Choice QA

  • Input: One or more images + multiple-choice question

  • Output: Single answer letter

  • Domains: Visual perception, correspondence, reasoning, detection

Key Features

  • Covers 14 diverse visual perception tasks

  • Supports single and multi-image inputs (up to 4 images)

  • Tests fundamental visual understanding capabilities

  • Categories include: Art Style, Counting, Forensic Detection, IQ Test, Jigsaw, Multi-view Reasoning, Object Localization, and more

  • Questions derived from classic computer vision benchmarks

Evaluation Notes

  • Default evaluation uses the val split

  • Primary metric: Accuracy on multiple-choice questions

  • Responses must end with a line of the form “ANSWER: [LETTER]”, from which the answer letter is read

  • Results can be analyzed across 14 different perception categories
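
The scoring described above can be sketched in a few lines. This is a minimal illustration, not EvalScope's own parser: the regular expression and the helper names `extract_letter` and `accuracy` are assumptions made for the example. It reads the letter from the last non-empty line of each response and counts a missing or malformed answer as wrong.

```python
import re
from typing import Iterable, Optional

# Hypothetical pattern: accepts "ANSWER: B" or "answer: (b)" on the final line.
ANSWER_RE = re.compile(r"ANSWER\s*:\s*\(?([A-Z])\)?\s*$", re.IGNORECASE)


def extract_letter(response: str) -> Optional[str]:
    """Return the answer letter from the last non-empty line, or None."""
    lines = [ln.strip() for ln in response.strip().splitlines() if ln.strip()]
    if not lines:
        return None
    match = ANSWER_RE.search(lines[-1])
    return match.group(1).upper() if match else None


def accuracy(responses: Iterable[str], targets: Iterable[str]) -> float:
    """Fraction of responses whose extracted letter equals the target."""
    pairs = list(zip(responses, targets))
    if not pairs:
        return 0.0
    correct = sum(extract_letter(resp) == tgt for resp, tgt in pairs)
    return correct / len(pairs)
```

Grouping the same comparison by subset name would give the per-category breakdown mentioned in the last bullet.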

Properties

| Property | Value |
|---|---|
| Benchmark Name | blink |
| Dataset ID | evalscope/BLINK |
| Paper | N/A |
| Tags | Knowledge, MCQ, MultiModal |
| Metrics | acc |
| Default Shots | 0-shot |
| Evaluation Split | val |

Data Statistics

| Metric | Value |
|---|---|
| Total Samples | 1,901 |
| Prompt Length (Mean) | 577.53 chars |
| Prompt Length (Min/Max) | 252 / 1125 chars |

Per-Subset Statistics:

| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
|---|---|---|---|---|
| Art_Style | 117 | 553 | 553 | 553 |
| Counting | 120 | 285.21 | 270 | 317 |
| Forensic_Detection | 132 | 480 | 480 | 480 |
| Functional_Correspondence | 130 | 1118.34 | 1113 | 1125 |
| IQ_Test | 150 | 884.6 | 548 | 922 |
| Jigsaw | 150 | 543 | 543 | 543 |
| Multi-view_Reasoning | 133 | 549 | 549 | 549 |
| Object_Localization | 122 | 531.86 | 527 | 548 |
| Relative_Depth | 124 | 359 | 359 | 359 |
| Relative_Reflectance | 134 | 498 | 498 | 498 |
| Semantic_Correspondence | 139 | 952 | 952 | 952 |
| Spatial_Relation | 143 | 263.97 | 252 | 282 |
| Visual_Correspondence | 172 | 587 | 587 | 587 |
| Visual_Similarity | 135 | 414 | 414 | 414 |

Image Statistics:

| Metric | Value |
|---|---|
| Total Images | 3,675 |
| Images per Sample | min: 1, max: 4, mean: 1.93 |
| Resolution Range | 200x83 - 3072x4096 |
| Formats | jpeg |
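
As a quick consistency check, the overall figures can be recomputed from the per-subset rows. The snippet below is only a sketch that copies the sample counts and mean prompt lengths from the statistics on this page; the variable names are illustrative. Because the per-subset means are themselves rounded, the weighted mean is expected to agree only to the reported precision.

```python
# (samples, mean prompt length in chars) per subset, copied from this page.
SUBSETS = {
    "Art_Style": (117, 553.0),
    "Counting": (120, 285.21),
    "Forensic_Detection": (132, 480.0),
    "Functional_Correspondence": (130, 1118.34),
    "IQ_Test": (150, 884.6),
    "Jigsaw": (150, 543.0),
    "Multi-view_Reasoning": (133, 549.0),
    "Object_Localization": (122, 531.86),
    "Relative_Depth": (124, 359.0),
    "Relative_Reflectance": (134, 498.0),
    "Semantic_Correspondence": (139, 952.0),
    "Spatial_Relation": (143, 263.97),
    "Visual_Correspondence": (172, 587.0),
    "Visual_Similarity": (135, 414.0),
}
TOTAL_IMAGES = 3675

total_samples = sum(n for n, _ in SUBSETS.values())
# Sample-weighted mean of the per-subset mean prompt lengths.
weighted_mean = sum(n * mean for n, mean in SUBSETS.values()) / total_samples
images_per_sample = TOTAL_IMAGES / total_samples

print(total_samples)               # 1901
print(f"{weighted_mean:.2f}")      # 577.53
print(f"{images_per_sample:.2f}")  # 1.93
```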

Sample Example

Subset: Art_Style

{
  "input": [
    {
      "id": "a522940e",
      "content": [
        {
          "text": "Answer the following multiple choice question. The last line of your response should be of the following format:\n'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B.\n\nSome most common art painting styles include Realism, Impressi ... [TRUNCATED] ...  of art paintings, use the first image as the reference image, and determine which one of the second or the third image shares the same style as the reference image?\nSelect from the following choices.\n(A) the second image\n(B) the third image\n"
        },
        {
          "image": "[BASE64_IMAGE: jpeg, ~477.8KB]"
        },
        {
          "image": "[BASE64_IMAGE: jpeg, ~876.1KB]"
        },
        {
          "image": "[BASE64_IMAGE: jpeg, ~329.2KB]"
        }
      ]
    }
  ],
  "choices": [
    "the second image",
    "the third image"
  ],
  "target": "A",
  "id": 0,
  "group_id": 0
}

Note: Some content was truncated for display.
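
To make the record layout concrete, here is a small sketch that walks a sample shaped like the one above. It assumes only the fields visible in that example (`input`, `content`, `text`, `image`, `choices`, `target`); the helper names and the shortened placeholder strings are made up for illustration and are not part of EvalScope.

```python
import string
from typing import Any, Dict, List, Tuple


def split_parts(sample: Dict[str, Any]) -> Tuple[str, int]:
    """Return the concatenated prompt text and the number of image parts."""
    texts: List[str] = []
    n_images = 0
    for message in sample["input"]:
        for part in message["content"]:
            if "text" in part:
                texts.append(part["text"])
            elif "image" in part:
                n_images += 1
    return "\n".join(texts), n_images


def choice_letters(sample: Dict[str, Any]) -> List[str]:
    """Letters A, B, ... matching the order of the `choices` list."""
    return list(string.ascii_uppercase[: len(sample["choices"])])


# Abbreviated stand-in for the Art_Style record (images replaced by placeholders).
sample = {
    "input": [{
        "id": "a522940e",
        "content": [
            {"text": "Answer the following multiple choice question. ..."},
            {"image": "<image 1>"},
            {"image": "<image 2>"},
            {"image": "<image 3>"},
        ],
    }],
    "choices": ["the second image", "the third image"],
    "target": "A",
}

prompt_text, image_count = split_parts(sample)
letters = choice_letters(sample)
# Map the target letter back to the choice it stands for.
target_choice = sample["choices"][letters.index(sample["target"])]
```

For this record, `image_count` is 3 and `target_choice` is "the second image".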

Prompt Template

Prompt Template:

Answer the following multiple choice question. The last line of your response should be of the following format:
'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}.

{question}
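
The following sketch shows how the two placeholders might be filled. It assumes, based on the Art_Style sample text, that the option lines ("Select from the following choices." followed by "(A) ...", "(B) ...") are folded into `{question}` and that `{letters}` is a comma-separated list; `render_prompt` is a hypothetical helper, not an EvalScope function.

```python
import string
from typing import Sequence

PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format:\n"
    "'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}.\n"
    "\n"
    "{question}"
)


def render_prompt(question: str, choices: Sequence[str]) -> str:
    """Fill the template; the option layout mirrors the sample text (assumed)."""
    letters = string.ascii_uppercase[: len(choices)]
    options = "\n".join(f"({ltr}) {text}" for ltr, text in zip(letters, choices))
    body = f"{question}\nSelect from the following choices.\n{options}\n"
    return PROMPT_TEMPLATE.format(letters=",".join(letters), question=body)


prompt = render_prompt(
    "Which image shares the same style as the reference image?",
    ["the second image", "the third image"],
)
```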

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets blink \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['blink'],
    dataset_args={
        'blink': {
            # 'subset_list': ['Art_Style', 'Counting', 'Forensic_Detection'],  # optional: evaluate only these subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
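
To restrict a run to a few perception categories, the `subset_list` entry hinted at in the comment above can be set explicitly. The sketch below builds the same keyword arguments as the example and only adds that entry; `blink_task_kwargs` and `main` are hypothetical helpers, and the subset names are taken from the Per-Subset Statistics. `evalscope` is imported inside `main` so the helper can be inspected without the package installed.

```python
from typing import Any, Dict, List, Optional


def blink_task_kwargs(subsets: List[str], limit: Optional[int] = None) -> Dict[str, Any]:
    """Keyword arguments for TaskConfig, limited to the given BLINK subsets."""
    kwargs: Dict[str, Any] = {
        "model": "YOUR_MODEL",
        "api_url": "OPENAI_API_COMPAT_URL",
        "api_key": "EMPTY_TOKEN",
        "datasets": ["blink"],
        "dataset_args": {"blink": {"subset_list": list(subsets)}},
    }
    if limit is not None:
        kwargs["limit"] = limit  # omit for a formal evaluation
    return kwargs


def main() -> None:
    # Lazy import: requires evalscope to be installed when actually running.
    from evalscope import run_task
    from evalscope.config import TaskConfig

    kwargs = blink_task_kwargs(["Art_Style", "Counting"], limit=10)
    run_task(task_cfg=TaskConfig(**kwargs))
```

Calling `main()` starts the evaluation against the configured endpoint.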

© 2022-2024, Alibaba ModelScope Built with Sphinx 9.1.0