MMLU-Pro#

Overview#

MMLU-Pro is an enhanced version of MMLU with increased difficulty and reasoning requirements. It features 10 answer choices instead of 4 and includes more challenging questions requiring deeper reasoning across 14 diverse domains.

Task Description#

  • Task Type: Multiple-Choice Question Answering (10 options)

  • Input: Question with up to 10 answer choices

  • Output: Single correct answer letter (A-J)

  • Domains: 14 subjects including science, law, engineering, psychology

Key Features#

  • 10 answer choices per question (vs 4 in original MMLU)

  • More challenging questions requiring reasoning

  • Includes Chain-of-Thought (CoT) explanations

  • 14 diverse subjects covered

  • Reduced impact of random guessing (10% vs 25%)

Evaluation Notes#

  • Default configuration uses 5-shot examples

  • Answers should follow “ANSWER: [LETTER]” format

  • Uses subject-specific few-shot examples

  • Step-by-step reasoning encouraged

  • Evaluates on test split with validation for few-shot

Properties#

Property

Value

Benchmark Name

mmlu_pro

Dataset ID

TIGER-Lab/MMLU-Pro

Paper

N/A

Tags

Knowledge, MCQ

Metrics

acc

Default Shots

5-shot

Evaluation Split

test

Train Split

validation

Data Statistics#

Metric

Value

Total Samples

12,032

Prompt Length (Mean)

4959.66 chars

Prompt Length (Min/Max)

3048 / 12100 chars

Per-Subset Statistics:

Subset

Samples

Prompt Mean

Prompt Min

Prompt Max

computer science

410

5184.05

4696

7303

math

1,351

4853.54

4574

7519

chemistry

1,132

4567.2

4182

5975

engineering

969

3595.66

3048

5775

law

1,101

7001.35

5581

9200

biology

717

5881.08

5179

7850

health

818

4624.44

4134

9310

physics

1,299

4389.59

4004

6511

business

789

4646.01

4316

6915

philosophy

499

3760.1

3360

5289

economics

844

4680.21

4118

6907

other

924

3730.86

3316

6108

psychology

798

5274.19

4656

6942

history

381

9919.94

8711

12100

Sample Example#

Subset: computer science

{
  "input": [
    {
      "id": "79336f87",
      "content": "The following are multiple choice questions (with answers) about computer science. Think step by step and then finish your answer with 'ANSWER: [LETTER]' (without quotes) where [LETTER] is the correct letter choice.\n\nQuestion:\nA certain pipel ... [TRUNCATED] ...  random index of a larger value.\nG) The method should be written to return the largest index among larger values.\nH) The method should be written on the assumption that there is only one value in the array that is larger than the given item.\n"
    }
  ],
  "choices": [
    "The method should return an error if more than one larger value is found.",
    "The specification should be modified to indicate what should be done if there is more than one index of larger values.",
    "The method should be written to output a message if more than one larger value is found.",
    "The method should be written so as to return the index of every occurrence of a larger value.",
    "The method should be written to return the last occurrence of a larger value.",
    "The method should return a random index of a larger value.",
    "The method should be written to return the largest index among larger values.",
    "The method should be written on the assumption that there is only one value in the array that is larger than the given item."
  ],
  "target": "B",
  "id": 0,
  "group_id": 0,
  "subset_key": "computer science",
  "metadata": {
    "cot_content": "",
    "subject": "computer science",
    "question_id": 10356
  }
}

Note: Some content was truncated for display.

Prompt Template#

Prompt Template:

Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

Question:
{question}
Options:
{choices}
Few-shot Template
The following are multiple choice questions (with answers) about {subject}. Think step by step and then finish your answer with 'ANSWER: [LETTER]' (without quotes) where [LETTER] is the correct letter choice.

{examples}
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

Question:
{question}
Options:
{choices}

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets mmlu_pro \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mmlu_pro'],
    dataset_args={
        'mmlu_pro': {
            # subset_list: ['computer science', 'math', 'chemistry']  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)