MMMU#

Overview#

MMMU (Massive Multi-discipline Multimodal Understanding) is a comprehensive benchmark designed to evaluate multimodal models on expert-level tasks requiring college-level subject knowledge and deliberate reasoning. It covers 30 subjects across 6 core disciplines.

Task Description#

  • Task Type: Multimodal Question Answering (Multiple-Choice and Open-Ended)

  • Input: Questions with diverse images (charts, diagrams, maps, tables, etc.)

  • Output: Answer letter (MC) or free-form text (Open)

  • Disciplines: Art & Design, Business, Science, Health & Medicine, Humanities, Tech & Engineering

Key Features#

  • 11.5K meticulously collected multimodal questions

  • From college exams, quizzes, and textbooks

  • 30 subjects and 183 subfields covered

  • 30 heterogeneous image types (charts, diagrams, music sheets, chemical structures, etc.)

  • Tests both perception and expert-level reasoning

Evaluation Notes#

  • Default configuration uses 0-shot evaluation

  • Supports both multiple-choice and open-ended question types

  • Multiple images per question supported (up to 7)

  • For open questions: “ANSWER: [ANSWER]” format expected

  • Evaluates on validation split (test set requires submission)

Properties#

Property

Value

Benchmark Name

mmmu

Dataset ID

AI-ModelScope/MMMU

Paper

N/A

Tags

Knowledge, MultiModal, QA

Metrics

acc

Default Shots

0-shot

Evaluation Split

validation

Data Statistics#

Metric

Value

Total Samples

900

Prompt Length (Mean)

527.84 chars

Prompt Length (Min/Max)

247 / 3011 chars

Per-Subset Statistics:

Subset

Samples

Prompt Mean

Prompt Min

Prompt Max

Accounting

30

525.2

356

899

Agriculture

30

472.73

321

747

Architecture_and_Engineering

30

643.33

279

927

Art

30

384.17

297

1098

Art_Theory

30

370.13

297

588

Basic_Medical_Science

30

420.1

277

1119

Biology

30

524.87

294

1239

Chemistry

30

522.27

286

1220

Clinical_Medicine

30

549.37

311

914

Computer_Science

30

512.73

273

1285

Design

30

401.77

292

613

Diagnostics_and_Laboratory_Medicine

30

440.73

302

741

Economics

30

503.63

354

730

Electronics

30

476.83

314

659

Energy_and_Power

30

529.17

347

814

Finance

30

628.67

361

985

Geography

30

460.37

299

759

History

30

590.63

360

899

Literature

30

425.57

275

541

Manage

30

784

414

2261

Marketing

30

530.6

303

832

Materials

30

474.63

308

667

Math

30

500.6

247

1167

Mechanical_Engineering

30

504.73

272

861

Music

30

315.93

250

455

Pharmacy

30

499.4

311

981

Physics

30

525.83

332

1086

Psychology

30

1173.9

280

3011

Public_Health

30

678.17

282

2684

Sociology

30

465.27

299

1007

Image Statistics:

Metric

Value

Total Images

980

Images per Sample

min: 1, max: 5, mean: 1.09

Resolution Range

70x67 - 2560x2133

Formats

png

Sample Example#

Subset: Accounting

{
  "input": [
    {
      "id": "be505d26",
      "content": [
        {
          "text": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C,D. Think step by step before answering.\n\n"
        },
        {
          "image": "[BASE64_IMAGE: png, ~65.8KB]"
        },
        {
          "text": " Baxter Company has a relevant range of production between 15,000 and 30,000 units. The following cost data represents average variable costs per unit for 25,000 units of production. If 30,000 units are produced, what are the per unit manufacturing overhead costs incurred?\n\nA) $6\nB) $7\nC) $8\nD) $9"
        }
      ]
    }
  ],
  "choices": [
    "$6",
    "$7",
    "$8",
    "$9"
  ],
  "target": "B",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "id": "validation_Accounting_1",
    "question_type": "multiple-choice",
    "subfield": "Managerial Accounting",
    "explanation": "",
    "img_type": "['Tables']",
    "topic_difficulty": "Medium"
  }
}

Prompt Template#

Prompt Template:

Solve the following problem step by step. The last line of your response should be of the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer to the problem.

{question}

Remember to put your answer on its own line at the end in the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer to the problem, and you do not need to use a \boxed command.

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets mmmu \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mmmu'],
    dataset_args={
        'mmmu': {
            # subset_list: ['Accounting', 'Agriculture', 'Architecture_and_Engineering']  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)