MMStar#

Overview#

MMStar is a vision-indispensable multimodal benchmark built so that every question genuinely depends on the image. Its 1,500 samples are curated to require actual visual understanding, which limits the effect of training-data leakage and lets the benchmark test advanced multimodal capabilities.

Task Description#

  • Task Type: Vision-Dependent Multiple-Choice QA

  • Input: Image + multiple-choice question requiring visual understanding

  • Output: Single answer letter (A/B/C/D)

  • Domains: Perception, reasoning, math, science & technology

Key Features#

  • Ensures visual dependency: questions cannot be answered without the image

  • Minimal data leakage from training corpora

  • Tests advanced multimodal reasoning capabilities

  • Six categories: coarse perception, fine-grained perception, instance reasoning, logical reasoning, math, science & technology

  • High-quality curated samples with verified visual necessity

Evaluation Notes#

  • Default evaluation uses the val split

  • Primary metric: Accuracy on multiple-choice questions

  • Uses Chain-of-Thought (CoT) prompting; the response must end with a line of the form 'ANSWER: [LETTER]'

  • Results reported per category and overall
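
The scoring described in these notes can be sketched as follows. This is a minimal illustration, not EvalScope's actual implementation: the names extract_answer and accuracy_by_category and the regular expression are assumptions made for this example. Only the 'ANSWER: [LETTER]' format and the per-category and overall reporting come from the notes above.

```python
import re
from collections import defaultdict
from typing import Optional

# Hypothetical pattern: finds "ANSWER: <letter>" where the letter is A, B, C, or D.
ANSWER_RE = re.compile(r"ANSWER:\s*([ABCD])\b")


def extract_answer(response: str) -> Optional[str]:
    """Return the answer letter from the last non-empty line, or None."""
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    if not lines:
        return None
    match = ANSWER_RE.search(lines[-1])
    return match.group(1) if match else None


def accuracy_by_category(records):
    """records: iterable of (category, target_letter, model_response)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, target, response in records:
        total[category] += 1
        total["overall"] += 1
        if extract_answer(response) == target:
            correct[category] += 1
            correct["overall"] += 1
    return {name: correct[name] / total[name] for name in total}


# Made-up responses, purely for illustration.
records = [
    ("math", "B", "2 + 2 = 4, so option B.\nANSWER: B"),
    ("math", "C", "I think it is A.\nANSWER: A"),
    ("coarse perception", "A", "The suitcase is on the book.\nANSWER: A"),
]
scores = accuracy_by_category(records)
print(scores)
```

A response with no parsable final line is counted as incorrect here; a real scorer may treat such cases differently.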

Properties#

| Property | Value |
|---|---|
| Benchmark Name | mm_star |
| Dataset ID | evalscope/MMStar |
| Paper | N/A |
| Tags | Knowledge, MCQ, MultiModal |
| Metrics | acc |
| Default Shots | 0-shot |
| Evaluation Split | val |

Data Statistics#

| Metric | Value |
|---|---|
| Total Samples | 1,500 |
| Prompt Length (Mean) | 390.23 chars |
| Prompt Length (Min/Max) | 272 / 2023 chars |

Per-Subset Statistics:

| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
|---|---|---|---|---|
| coarse perception | 250 | 350.8 | 282 | 784 |
| fine-grained perception | 250 | 334.98 | 277 | 608 |
| instance reasoning | 250 | 379.02 | 273 | 684 |
| logical reasoning | 250 | 427.22 | 284 | 2023 |
| math | 250 | 467.36 | 292 | 891 |
| science & technology | 250 | 381.98 | 272 | 1173 |
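
Because every subset has the same number of samples, the overall figures should follow directly from the per-subset ones. The short check below uses only the numbers reported in this section; since the reported means are already rounded, agreement is only expected to about two decimal places.

```python
# (samples, mean, min, max) prompt lengths in chars, copied from the per-subset statistics.
subsets = {
    "coarse perception": (250, 350.8, 282, 784),
    "fine-grained perception": (250, 334.98, 277, 608),
    "instance reasoning": (250, 379.02, 273, 684),
    "logical reasoning": (250, 427.22, 284, 2023),
    "math": (250, 467.36, 292, 891),
    "science & technology": (250, 381.98, 272, 1173),
}

total_samples = sum(n for n, _, _, _ in subsets.values())
# Sample-weighted mean of the subset means.
weighted_mean = sum(n * mean for n, mean, _, _ in subsets.values()) / total_samples
overall_min = min(lo for _, _, lo, _ in subsets.values())
overall_max = max(hi for _, _, _, hi in subsets.values())

print(total_samples, round(weighted_mean, 2), overall_min, overall_max)
# 1500 390.23 272 2023
```

These values match the reported totals: 1,500 samples, a mean of 390.23 chars, and a min/max of 272 / 2023 chars.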

Image Statistics:

| Metric | Value |
|---|---|
| Total Images | 1,500 |
| Images per Sample | min: 1, max: 1, mean: 1 |
| Resolution Range | 114x66 - 3160x2136 |
| Formats | jpeg |

Sample Example#

Subset: coarse perception

{
  "input": [
    {
      "id": "57e256b8",
      "content": [
        {
          "text": "Answer the following multiple choice question.\nThe last line of your response should be of the following format:\n'ANSWER: [LETTER]' (without quotes)\nwhere [LETTER] is one of A,B,C,D. Think step by step before answering.\n\nWhich option describe the object relationship in the image correctly?\nOptions: A: The suitcase is on the book., B: The suitcase is beneath the cat., C: The suitcase is beneath the bed., D: The suitcase is beneath the book."
        },
        {
          "image": "[BASE64_IMAGE: jpeg, ~37.2KB]"
        }
      ]
    }
  ],
  "choices": [
    "A",
    "B",
    "C",
    "D"
  ],
  "target": "A",
  "id": 0,
  "group_id": 0,
  "subset_key": "coarse perception",
  "metadata": {
    "index": 0,
    "category": "coarse perception",
    "l2_category": "image scene and topic",
    "source": "MMBench",
    "split": "val",
    "image_path": "images/0.jpg"
  }
}
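
To make the record layout concrete, here is a sketch that separates the text and image parts of a record shaped like the example above. The helper name split_content is hypothetical, and the record is an abbreviated copy of the sample (the prompt text is shortened and the metadata omitted); the field names are taken from the sample as shown.

```python
import json

# Abbreviated copy of the sample above; the prompt text is shortened for brevity.
record = json.loads("""
{
  "input": [
    {
      "id": "57e256b8",
      "content": [
        {"text": "Answer the following multiple choice question. (shortened)"},
        {"image": "[BASE64_IMAGE: jpeg, ~37.2KB]"}
      ]
    }
  ],
  "choices": ["A", "B", "C", "D"],
  "target": "A",
  "subset_key": "coarse perception"
}
""")


def split_content(record):
    """Collect the text parts and image parts from every message in `input`."""
    texts, images = [], []
    for message in record["input"]:
        for part in message["content"]:
            if "text" in part:
                texts.append(part["text"])
            elif "image" in part:
                images.append(part["image"])
    return texts, images


texts, images = split_content(record)
print(len(texts), len(images), record["subset_key"], record["target"])
# 1 1 coarse perception A
```

Each sample carries exactly one image according to the image statistics, so a single text part and a single image part are expected per record.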

Prompt Template#

Prompt Template:

Answer the following multiple choice question.
The last line of your response should be of the following format:
'ANSWER: [LETTER]' (without quotes)
where [LETTER] is one of A,B,C,D. Think step by step before answering.

{question}
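
Judging from the sample example, the {question} placeholder is filled with the question text followed by an "Options:" line. The sketch below reproduces that substitution under this assumption; build_prompt is a hypothetical helper, not part of EvalScope's API, and the option layout ("A: ..., B: ...") is inferred from the single sample shown.

```python
PROMPT_TEMPLATE = (
    "Answer the following multiple choice question.\n"
    "The last line of your response should be of the following format:\n"
    "'ANSWER: [LETTER]' (without quotes)\n"
    "where [LETTER] is one of A,B,C,D. Think step by step before answering.\n"
    "\n"
    "{question}"
)


def build_prompt(question: str, options: dict) -> str:
    """Fill the template; `options` maps answer letters to option text."""
    option_line = ", ".join(f"{letter}: {text}" for letter, text in options.items())
    return PROMPT_TEMPLATE.format(question=f"{question}\nOptions: {option_line}")


# Question and options copied verbatim from the sample example.
prompt = build_prompt(
    "Which option describe the object relationship in the image correctly?",
    {
        "A": "The suitcase is on the book.",
        "B": "The suitcase is beneath the cat.",
        "C": "The suitcase is beneath the bed.",
        "D": "The suitcase is beneath the book.",
    },
)
print(prompt)
```

The question string is passed as a format argument rather than concatenated into the template, so braces inside a question do not interfere with the substitution.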

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets mm_star \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mm_star'],
    dataset_args={
        'mm_star': {
            # 'subset_list': ['coarse perception', 'fine-grained perception', 'instance reasoning'],  # optional: evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
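
The commented-out subset_list entry above indicates that evaluation can be restricted to specific subsets. The sketch below builds such a dataset_args value and checks the names against the six categories listed in this document; make_dataset_args is a hypothetical helper written for this example, and the resulting dict is intended to be passed as the dataset_args argument of TaskConfig in the same way as above.

```python
# The six MMStar categories, as listed in the per-subset statistics.
MMSTAR_SUBSETS = (
    "coarse perception",
    "fine-grained perception",
    "instance reasoning",
    "logical reasoning",
    "math",
    "science & technology",
)


def make_dataset_args(subsets):
    """Return a dataset_args dict that limits mm_star to the given subsets."""
    unknown = [name for name in subsets if name not in MMSTAR_SUBSETS]
    if unknown:
        raise ValueError(f"Unknown MMStar subsets: {unknown}")
    return {"mm_star": {"subset_list": list(subsets)}}


dataset_args = make_dataset_args(["math", "logical reasoning"])
print(dataset_args)
# {'mm_star': {'subset_list': ['math', 'logical reasoning']}}
```

Validating the names up front turns a misspelled subset into an immediate error instead of a run that may evaluate fewer samples than intended.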