MMMU#
Overview#
MMMU (Massive Multi-discipline Multimodal Understanding) is a comprehensive benchmark designed to evaluate multimodal models on expert-level tasks requiring college-level subject knowledge and deliberate reasoning. It covers 30 subjects across 6 core disciplines.
Task Description#
Task Type: Multimodal Question Answering (Multiple-Choice and Open-Ended)
Input: Questions with diverse images (charts, diagrams, maps, tables, etc.)
Output: Answer letter (MC) or free-form text (Open)
Disciplines: Art & Design, Business, Science, Health & Medicine, Humanities, Tech & Engineering
Key Features#
11.5K meticulously collected multimodal questions
From college exams, quizzes, and textbooks
30 subjects and 183 subfields covered
30 heterogeneous image types (charts, diagrams, music sheets, chemical structures, etc.)
Tests both perception and expert-level reasoning
Evaluation Notes#
Default configuration uses 0-shot evaluation
Supports both multiple-choice and open-ended question types
Multiple images per question supported (up to 7)
For open questions: “ANSWER: [ANSWER]” format expected
Evaluates on validation split (test set requires submission)
Properties#
Property |
Value |
|---|---|
Benchmark Name |
|
Dataset ID |
|
Paper |
N/A |
Tags |
|
Metrics |
|
Default Shots |
0-shot |
Evaluation Split |
|
Data Statistics#
Metric |
Value |
|---|---|
Total Samples |
900 |
Prompt Length (Mean) |
527.84 chars |
Prompt Length (Min/Max) |
247 / 3011 chars |
Per-Subset Statistics:
Subset |
Samples |
Prompt Mean |
Prompt Min |
Prompt Max |
|---|---|---|---|---|
|
30 |
525.2 |
356 |
899 |
|
30 |
472.73 |
321 |
747 |
|
30 |
643.33 |
279 |
927 |
|
30 |
384.17 |
297 |
1098 |
|
30 |
370.13 |
297 |
588 |
|
30 |
420.1 |
277 |
1119 |
|
30 |
524.87 |
294 |
1239 |
|
30 |
522.27 |
286 |
1220 |
|
30 |
549.37 |
311 |
914 |
|
30 |
512.73 |
273 |
1285 |
|
30 |
401.77 |
292 |
613 |
|
30 |
440.73 |
302 |
741 |
|
30 |
503.63 |
354 |
730 |
|
30 |
476.83 |
314 |
659 |
|
30 |
529.17 |
347 |
814 |
|
30 |
628.67 |
361 |
985 |
|
30 |
460.37 |
299 |
759 |
|
30 |
590.63 |
360 |
899 |
|
30 |
425.57 |
275 |
541 |
|
30 |
784 |
414 |
2261 |
|
30 |
530.6 |
303 |
832 |
|
30 |
474.63 |
308 |
667 |
|
30 |
500.6 |
247 |
1167 |
|
30 |
504.73 |
272 |
861 |
|
30 |
315.93 |
250 |
455 |
|
30 |
499.4 |
311 |
981 |
|
30 |
525.83 |
332 |
1086 |
|
30 |
1173.9 |
280 |
3011 |
|
30 |
678.17 |
282 |
2684 |
|
30 |
465.27 |
299 |
1007 |
Image Statistics:
Metric |
Value |
|---|---|
Total Images |
980 |
Images per Sample |
min: 1, max: 5, mean: 1.09 |
Resolution Range |
70x67 - 2560x2133 |
Formats |
png |
Sample Example#
Subset: Accounting
{
"input": [
{
"id": "be505d26",
"content": [
{
"text": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C,D. Think step by step before answering.\n\n"
},
{
"image": "[BASE64_IMAGE: png, ~65.8KB]"
},
{
"text": " Baxter Company has a relevant range of production between 15,000 and 30,000 units. The following cost data represents average variable costs per unit for 25,000 units of production. If 30,000 units are produced, what are the per unit manufacturing overhead costs incurred?\n\nA) $6\nB) $7\nC) $8\nD) $9"
}
]
}
],
"choices": [
"$6",
"$7",
"$8",
"$9"
],
"target": "B",
"id": 0,
"group_id": 0,
"metadata": {
"id": "validation_Accounting_1",
"question_type": "multiple-choice",
"subfield": "Managerial Accounting",
"explanation": "",
"img_type": "['Tables']",
"topic_difficulty": "Medium"
}
}
Prompt Template#
Prompt Template:
Solve the following problem step by step. The last line of your response should be of the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer to the problem.
{question}
Remember to put your answer on its own line at the end in the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer to the problem, and you do not need to use a \boxed command.
Usage#
Using CLI#
evalscope eval \
--model YOUR_MODEL \
--api-url OPENAI_API_COMPAT_URL \
--api-key EMPTY_TOKEN \
--datasets mmmu \
--limit 10 # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
model='YOUR_MODEL',
api_url='OPENAI_API_COMPAT_URL',
api_key='EMPTY_TOKEN',
datasets=['mmmu'],
dataset_args={
'mmmu': {
# subset_list: ['Accounting', 'Agriculture', 'Architecture_and_Engineering'] # optional, evaluate specific subsets
}
},
limit=10, # Remove this line for formal evaluation
)
run_task(task_cfg=task_cfg)