MMStar#
Overview#
MMStar is an elite vision-indispensable multimodal benchmark designed to ensure genuine visual dependency in evaluation. Each sample is carefully curated to require actual visual understanding, minimizing data leakage and testing advanced multimodal capabilities.
Task Description#
Task Type: Vision-Dependent Multiple-Choice QA
Input: Image + multiple-choice question requiring visual understanding
Output: Single answer letter (A/B/C/D)
Domains: Perception, reasoning, math, science & technology
Key Features#
Ensures visual dependency - questions cannot be answered without images
Minimal data leakage from training corpora
Tests advanced multimodal reasoning capabilities
Six categories: coarse perception, fine-grained perception, instance reasoning, logical reasoning, math, science & technology
High-quality curated samples with verified visual necessity
Evaluation Notes#
Default evaluation uses the val split
Primary metric: Accuracy on multiple-choice questions
Uses Chain-of-Thought (CoT) prompting with “ANSWER: [LETTER]” format
Results reported per category and overall
Properties#
Property |
Value |
|---|---|
Benchmark Name |
|
Dataset ID |
|
Paper |
N/A |
Tags |
|
Metrics |
|
Default Shots |
0-shot |
Evaluation Split |
|
Data Statistics#
Metric |
Value |
|---|---|
Total Samples |
1,500 |
Prompt Length (Mean) |
390.23 chars |
Prompt Length (Min/Max) |
272 / 2023 chars |
Per-Subset Statistics:
Subset |
Samples |
Prompt Mean |
Prompt Min |
Prompt Max |
|---|---|---|---|---|
|
250 |
350.8 |
282 |
784 |
|
250 |
334.98 |
277 |
608 |
|
250 |
379.02 |
273 |
684 |
|
250 |
427.22 |
284 |
2023 |
|
250 |
467.36 |
292 |
891 |
|
250 |
381.98 |
272 |
1173 |
Image Statistics:
Metric |
Value |
|---|---|
Total Images |
1,500 |
Images per Sample |
min: 1, max: 1, mean: 1 |
Resolution Range |
114x66 - 3160x2136 |
Formats |
jpeg |
Sample Example#
Subset: coarse perception
{
"input": [
{
"id": "57e256b8",
"content": [
{
"text": "Answer the following multiple choice question.\nThe last line of your response should be of the following format:\n'ANSWER: [LETTER]' (without quotes)\nwhere [LETTER] is one of A,B,C,D. Think step by step before answering.\n\nWhich option describe the object relationship in the image correctly?\nOptions: A: The suitcase is on the book., B: The suitcase is beneath the cat., C: The suitcase is beneath the bed., D: The suitcase is beneath the book."
},
{
"image": "[BASE64_IMAGE: jpeg, ~37.2KB]"
}
]
}
],
"choices": [
"A",
"B",
"C",
"D"
],
"target": "A",
"id": 0,
"group_id": 0,
"subset_key": "coarse perception",
"metadata": {
"index": 0,
"category": "coarse perception",
"l2_category": "image scene and topic",
"source": "MMBench",
"split": "val",
"image_path": "images/0.jpg"
}
}
Prompt Template#
Prompt Template:
Answer the following multiple choice question.
The last line of your response should be of the following format:
'ANSWER: [LETTER]' (without quotes)
where [LETTER] is one of A,B,C,D. Think step by step before answering.
{question}
Usage#
Using CLI#
evalscope eval \
--model YOUR_MODEL \
--api-url OPENAI_API_COMPAT_URL \
--api-key EMPTY_TOKEN \
--datasets mm_star \
--limit 10 # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
model='YOUR_MODEL',
api_url='OPENAI_API_COMPAT_URL',
api_key='EMPTY_TOKEN',
datasets=['mm_star'],
dataset_args={
'mm_star': {
# subset_list: ['coarse perception', 'fine-grained perception', 'instance reasoning'] # optional, evaluate specific subsets
}
},
limit=10, # Remove this line for formal evaluation
)
run_task(task_cfg=task_cfg)