MMMU-PRO#

Overview#

MMMU-PRO is a harder, more robust version of the MMMU multimodal benchmark. It is designed to test whether a model genuinely understands combined image and text input rather than relying on text-only shortcuts, and it adds dataset formats that make evaluation more challenging and realistic.

Task Description#

  • Task Type: Multimodal Academic Question Answering

  • Input: Images (up to 7) + multiple-choice question

  • Output: Correct answer choice letter

  • Domains: 30 academic subjects across STEM, humanities, and social sciences

Key Features#

  • Enhanced version of MMMU with more rigorous evaluation

  • Covers 30 subjects: Accounting, Biology, Chemistry, Computer Science, Economics, Physics, etc.

  • Multiple dataset formats available:

    • standard (4 options): Traditional 4-choice format

    • standard (10 options): Extended 10-choice format for harder evaluation

    • vision: Questions embedded in images

  • Tests genuine multimodal understanding, not just text shortcuts

Evaluation Notes#

  • Default evaluation uses the test split

  • Primary metric: Accuracy on multiple-choice questions

  • The dataset format is selected with the dataset_format parameter (see Extra Parameters)

  • Uses Chain-of-Thought (CoT) prompting for reasoning

  • Rich metadata includes topic difficulty and subject information
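
As a rough illustration of the accuracy metric, the sketch below scores predicted letters against target letters and also groups the result by subject. The record layout (the `pred`, `target`, and `subject` keys) and the function name are assumptions made for this example; they are not EvalScope's internal format.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """Return (overall_accuracy, {subject: accuracy}) for a list of records.

    Each record is a dict with the hypothetical keys 'pred', 'target', 'subject'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["subject"]] += 1
        # A prediction counts only if its letter matches the target exactly.
        if rec["pred"] == rec["target"]:
            correct[rec["subject"]] += 1
    per_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values()) if total else 0.0
    return overall, per_subject


# Made-up predictions, for illustration only.
records = [
    {"pred": "A", "target": "A", "subject": "Accounting"},
    {"pred": "B", "target": "C", "subject": "Accounting"},
    {"pred": "D", "target": "D", "subject": "Physics"},
]
overall, per_subject = accuracy_by_subject(records)
print(overall, per_subject)
```

Grouping by subject is optional; it simply mirrors the per-subset breakdown used elsewhere on this page.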

Properties#

| Property | Value |
|---|---|
| Benchmark Name | mmmu_pro |
| Dataset ID | AI-ModelScope/MMMU_Pro |
| Paper | N/A |
| Tags | Knowledge, MCQ, MultiModal |
| Metrics | acc |
| Default Shots | 0-shot |
| Evaluation Split | test |

Data Statistics#

| Metric | Value |
|---|---|
| Total Samples | 1,730 |
| Prompt Length (Mean) | 521.89 chars |
| Prompt Length (Min/Max) | 249 / 3749 chars |

Per-Subset Statistics:

| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
|---|---|---|---|---|
| Accounting | 58 | 518.48 | 320 | 899 |
| Agriculture | 60 | 477.05 | 289 | 747 |
| Architecture_and_Engineering | 60 | 592.85 | 281 | 1177 |
| Art | 53 | 358.34 | 297 | 919 |
| Art_Theory | 55 | 362.53 | 289 | 619 |
| Basic_Medical_Science | 52 | 401.96 | 277 | 867 |
| Biology | 59 | 476.81 | 269 | 1387 |
| Chemistry | 60 | 453.75 | 264 | 1217 |
| Clinical_Medicine | 59 | 525.58 | 311 | 977 |
| Computer_Science | 60 | 441.37 | 262 | 1077 |
| Design | 60 | 408.25 | 285 | 1449 |
| Diagnostics_and_Laboratory_Medicine | 60 | 444.17 | 274 | 789 |
| Economics | 59 | 506.37 | 284 | 900 |
| Electronics | 60 | 455.6 | 314 | 668 |
| Energy_and_Power | 58 | 506.86 | 347 | 816 |
| Finance | 60 | 637.75 | 317 | 1864 |
| Geography | 52 | 409.9 | 267 | 929 |
| History | 56 | 611.3 | 328 | 1077 |
| Literature | 52 | 429.87 | 274 | 564 |
| Manage | 50 | 666.56 | 282 | 2198 |
| Marketing | 59 | 596.53 | 303 | 1060 |
| Materials | 60 | 484.02 | 296 | 1351 |
| Math | 60 | 511.95 | 249 | 1172 |
| Mechanical_Engineering | 59 | 527.95 | 272 | 1418 |
| Music | 60 | 336.55 | 250 | 672 |
| Pharmacy | 57 | 474.7 | 282 | 902 |
| Physics | 60 | 499.07 | 341 | 737 |
| Psychology | 60 | 1355.12 | 280 | 3749 |
| Public_Health | 58 | 716.98 | 282 | 2510 |
| Sociology | 54 | 416.48 | 279 | 708 |
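
The overall figures under Data Statistics can be cross-checked against the per-subset rows. The sketch below assumes the overall prompt mean is the sample-weighted average of the per-subset means; the values are copied from the table above, and the variable names are chosen for this example only.

```python
# (subset, samples, prompt_mean, prompt_min, prompt_max), copied from the table above.
SUBSETS = [
    ("Accounting", 58, 518.48, 320, 899),
    ("Agriculture", 60, 477.05, 289, 747),
    ("Architecture_and_Engineering", 60, 592.85, 281, 1177),
    ("Art", 53, 358.34, 297, 919),
    ("Art_Theory", 55, 362.53, 289, 619),
    ("Basic_Medical_Science", 52, 401.96, 277, 867),
    ("Biology", 59, 476.81, 269, 1387),
    ("Chemistry", 60, 453.75, 264, 1217),
    ("Clinical_Medicine", 59, 525.58, 311, 977),
    ("Computer_Science", 60, 441.37, 262, 1077),
    ("Design", 60, 408.25, 285, 1449),
    ("Diagnostics_and_Laboratory_Medicine", 60, 444.17, 274, 789),
    ("Economics", 59, 506.37, 284, 900),
    ("Electronics", 60, 455.6, 314, 668),
    ("Energy_and_Power", 58, 506.86, 347, 816),
    ("Finance", 60, 637.75, 317, 1864),
    ("Geography", 52, 409.9, 267, 929),
    ("History", 56, 611.3, 328, 1077),
    ("Literature", 52, 429.87, 274, 564),
    ("Manage", 50, 666.56, 282, 2198),
    ("Marketing", 59, 596.53, 303, 1060),
    ("Materials", 60, 484.02, 296, 1351),
    ("Math", 60, 511.95, 249, 1172),
    ("Mechanical_Engineering", 59, 527.95, 272, 1418),
    ("Music", 60, 336.55, 250, 672),
    ("Pharmacy", 57, 474.7, 282, 902),
    ("Physics", 60, 499.07, 341, 737),
    ("Psychology", 60, 1355.12, 280, 3749),
    ("Public_Health", 58, 716.98, 282, 2510),
    ("Sociology", 54, 416.48, 279, 708),
]

total_samples = sum(row[1] for row in SUBSETS)
# Weight each subset mean by its sample count to recover the overall mean.
weighted_mean = sum(row[1] * row[2] for row in SUBSETS) / total_samples
overall_min = min(row[3] for row in SUBSETS)
overall_max = max(row[4] for row in SUBSETS)

print(total_samples, round(weighted_mean, 2), overall_min, overall_max)
# prints: 1730 521.89 249 3749
```

The Psychology subset, with a mean of 1355.12 characters and the longest prompt at 3749 characters, pulls the overall mean well above most per-subset means.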

Image Statistics:

| Metric | Value |
|---|---|
| Total Images | 2,048 |
| Images per Sample | min: 1, max: 35, mean: 1.18 |
| Resolution Range | 43x50 - 2560x2545 |
| Formats | png |

Sample Example#

Subset: Accounting

{
  "input": [
    {
      "id": "bae49033",
      "content": [
        {
          "text": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C. Think step by step before answering.\n\nPrices of zero-coupon bonds reveal the following pattern of forward rates: "
        },
        {
          "image": "[BASE64_IMAGE: png, ~8.0KB]"
        },
        {
          "text": " In addition to the zero-coupon bond, investors also may purchase a 3-year bond making annual payments of $60 with par value $1,000. Under the expectations hypothesis, what is the expected realized compound yield of the coupon bond?\n\nA) 6.66%\nB) 6.79%\nC) 6.91%"
        }
      ]
    }
  ],
  "choices": [
    "6.66%",
    "6.79%",
    "6.91%"
  ],
  "target": "A",
  "id": 0,
  "group_id": 0,
  "subset_key": "Accounting",
  "metadata": {
    "id": "test_Accounting_42",
    "explanation": "?",
    "img_type": "['Tables']",
    "topic_difficulty": "Hard",
    "subject": "Accounting"
  }
}
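
To show how the interleaved `content` list in a record like the one above can be read, the following sketch flattens the text parts, substitutes a placeholder for each image, and counts the images. The helper name and the `<image>` placeholder are assumptions for illustration, and the sample text is shortened; this is not how EvalScope itself sends images to a model.

```python
def summarize_sample(sample, image_token="<image>"):
    """Flatten a sample's content parts into one string and count its images."""
    parts = []
    num_images = 0
    for message in sample["input"]:
        for part in message["content"]:
            if "text" in part:
                parts.append(part["text"])
            elif "image" in part:
                # Keep the image's position in the prompt with a placeholder.
                num_images += 1
                parts.append(image_token)
    return "".join(parts), num_images


# Trimmed copy of the Accounting sample (text shortened for the example).
sample = {
    "input": [
        {
            "id": "bae49033",
            "content": [
                {"text": "Prices of zero-coupon bonds reveal the following pattern of forward rates: "},
                {"image": "[BASE64_IMAGE: png, ~8.0KB]"},
                {"text": " In addition to the zero-coupon bond, investors also may purchase a 3-year bond."},
            ],
        }
    ],
    "target": "A",
}

text, num_images = summarize_sample(sample)
print(num_images)
print(text)
```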

Prompt Template#

Prompt Template:

Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}
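
The template can be filled and its 'ANSWER: [LETTER]' convention parsed as sketched below. The helper names and the regular expression are assumptions made for this example, and the "A) ..." choice layout follows the sample shown earlier; EvalScope's own answer extraction may differ.

```python
import re

PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'ANSWER: [LETTER]' "
    "(without quotes) where [LETTER] is one of {letters}. "
    "Think step by step before answering.\n\n{question}\n\n{choices}"
)


def build_prompt(question, choices):
    """Fill the template with one letter per choice (A, B, C, ...)."""
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    choice_lines = "\n".join(f"{letter}) {choice}" for letter, choice in zip(letters, choices))
    return PROMPT_TEMPLATE.format(
        letters=",".join(letters), question=question, choices=choice_lines
    )


def extract_answer(response, num_choices):
    """Return the last 'ANSWER: X' letter within the valid range, or None."""
    last_letter = chr(ord("A") + num_choices - 1)
    matches = re.findall(rf"ANSWER:\s*([A-{last_letter}])\b", response)
    return matches[-1] if matches else None


prompt = build_prompt("What is the yield?", ["6.66%", "6.79%", "6.91%"])
print(prompt)
print(extract_answer("The rates compound to about 6.66%.\nANSWER: A", num_choices=3))
```

Taking the last match tolerates a model that mentions 'ANSWER:' in its reasoning before giving the final line, and restricting the letter range rejects answers outside the offered choices.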

Extra Parameters#

| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_format | str | standard (4 options) | Dataset format variant. Choices: ['standard (4 options)', 'standard (10 options)', 'vision'] |

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets mmmu_pro \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mmmu_pro'],
    dataset_args={
        'mmmu_pro': {
            # 'subset_list': ['Accounting', 'Agriculture', 'Architecture_and_Engineering'],  # optional, evaluate specific subsets
            # 'extra_params': {},  # optional, empty dict keeps the default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
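
To switch to another dataset format, the `dataset_args` entry can carry a `dataset_format` value under `extra_params`. The sketch below only builds and validates that dictionary. The helper name is hypothetical, and the nesting is an assumption based on the commented `subset_list` and `extra_params` keys in the example above.

```python
# Allowed values, as listed under Extra Parameters.
ALLOWED_FORMATS = ("standard (4 options)", "standard (10 options)", "vision")


def build_dataset_args(dataset_format="standard (4 options)", subset_list=None):
    """Build a dataset_args dict for mmmu_pro (hypothetical helper)."""
    if dataset_format not in ALLOWED_FORMATS:
        raise ValueError(
            f"dataset_format must be one of {ALLOWED_FORMATS}, got {dataset_format!r}"
        )
    args = {"extra_params": {"dataset_format": dataset_format}}
    if subset_list:
        # Restrict evaluation to the named subsets only when some are given.
        args["subset_list"] = list(subset_list)
    return {"mmmu_pro": args}


dataset_args = build_dataset_args("vision", subset_list=["Accounting", "Physics"])
print(dataset_args)
```

The result would be passed as `dataset_args=dataset_args` when constructing `TaskConfig`, in place of the inline dictionary shown above.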