MMLU-Pro#
Overview#
MMLU-Pro is an enhanced version of MMLU with increased difficulty and reasoning requirements. It features 10 answer choices instead of 4 and includes more challenging questions requiring deeper reasoning across 14 diverse domains.
Task Description#
Task Type: Multiple-Choice Question Answering (10 options)
Input: Question with up to 10 answer choices
Output: Single correct answer letter (A-J)
Domains: 14 subjects including science, law, engineering, psychology
Key Features#
10 answer choices per question (vs 4 in original MMLU)
More challenging questions requiring reasoning
Includes Chain-of-Thought (CoT) explanations
14 diverse subjects covered
Reduced impact of random guessing (10% vs 25%)
Evaluation Notes#
Default configuration uses 5-shot examples
Answers should follow “ANSWER: [LETTER]” format
Uses subject-specific few-shot examples
Step-by-step reasoning encouraged
Evaluates on test split with validation for few-shot
Properties#
Property |
Value |
|---|---|
Benchmark Name |
|
Dataset ID |
|
Paper |
N/A |
Tags |
|
Metrics |
|
Default Shots |
5-shot |
Evaluation Split |
|
Train Split |
|
Data Statistics#
Metric |
Value |
|---|---|
Total Samples |
12,032 |
Prompt Length (Mean) |
4959.66 chars |
Prompt Length (Min/Max) |
3048 / 12100 chars |
Per-Subset Statistics:
Subset |
Samples |
Prompt Mean |
Prompt Min |
Prompt Max |
|---|---|---|---|---|
|
410 |
5184.05 |
4696 |
7303 |
|
1,351 |
4853.54 |
4574 |
7519 |
|
1,132 |
4567.2 |
4182 |
5975 |
|
969 |
3595.66 |
3048 |
5775 |
|
1,101 |
7001.35 |
5581 |
9200 |
|
717 |
5881.08 |
5179 |
7850 |
|
818 |
4624.44 |
4134 |
9310 |
|
1,299 |
4389.59 |
4004 |
6511 |
|
789 |
4646.01 |
4316 |
6915 |
|
499 |
3760.1 |
3360 |
5289 |
|
844 |
4680.21 |
4118 |
6907 |
|
924 |
3730.86 |
3316 |
6108 |
|
798 |
5274.19 |
4656 |
6942 |
|
381 |
9919.94 |
8711 |
12100 |
Sample Example#
Subset: computer science
{
"input": [
{
"id": "79336f87",
"content": "The following are multiple choice questions (with answers) about computer science. Think step by step and then finish your answer with 'ANSWER: [LETTER]' (without quotes) where [LETTER] is the correct letter choice.\n\nQuestion:\nA certain pipel ... [TRUNCATED] ... random index of a larger value.\nG) The method should be written to return the largest index among larger values.\nH) The method should be written on the assumption that there is only one value in the array that is larger than the given item.\n"
}
],
"choices": [
"The method should return an error if more than one larger value is found.",
"The specification should be modified to indicate what should be done if there is more than one index of larger values.",
"The method should be written to output a message if more than one larger value is found.",
"The method should be written so as to return the index of every occurrence of a larger value.",
"The method should be written to return the last occurrence of a larger value.",
"The method should return a random index of a larger value.",
"The method should be written to return the largest index among larger values.",
"The method should be written on the assumption that there is only one value in the array that is larger than the given item."
],
"target": "B",
"id": 0,
"group_id": 0,
"subset_key": "computer science",
"metadata": {
"cot_content": "",
"subject": "computer science",
"question_id": 10356
}
}
Note: Some content was truncated for display.
Prompt Template#
Prompt Template:
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.
Question:
{question}
Options:
{choices}
Few-shot Template
The following are multiple choice questions (with answers) about {subject}. Think step by step and then finish your answer with 'ANSWER: [LETTER]' (without quotes) where [LETTER] is the correct letter choice.
{examples}
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.
Question:
{question}
Options:
{choices}
Usage#
Using CLI#
evalscope eval \
--model YOUR_MODEL \
--api-url OPENAI_API_COMPAT_URL \
--api-key EMPTY_TOKEN \
--datasets mmlu_pro \
--limit 10 # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
model='YOUR_MODEL',
api_url='OPENAI_API_COMPAT_URL',
api_key='EMPTY_TOKEN',
datasets=['mmlu_pro'],
dataset_args={
'mmlu_pro': {
# subset_list: ['computer science', 'math', 'chemistry'] # optional, evaluate specific subsets
}
},
limit=10, # Remove this line for formal evaluation
)
run_task(task_cfg=task_cfg)