MMLU-Redux#
Overview#
MMLU-Redux is an improved version of the MMLU benchmark with corrected answers. It addresses known errors in the original MMLU dataset by fixing incorrect ground truth labels, missing correct options, and ambiguous questions.
Task Description#
Task Type: Multiple-Choice Knowledge Assessment
Input: Question with four answer choices
Output: Correct answer letter (A/B/C/D)
Domains: 57 subjects across STEM, Humanities, Social Sciences, and Other
Key Features#
Corrects errors in original MMLU benchmark
Error types fixed include:
no_correct_answer: Questions with missing correct options
wrong_groundtruth: Questions with incorrect ground truth
multiple_correct_answers: Questions with ambiguous answers
Same 57-subject coverage as original MMLU
Maintains compatibility with MMLU evaluation frameworks
Evaluation Notes#
Default evaluation uses the test split
Primary metric: Accuracy (with inclusion flag for multi-answer questions)
Uses Chain-of-Thought (CoT) prompting
Zero-shot evaluation only (few-shot not supported)
Results aggregated by subject and category (STEM, Humanities, Social Science, Other)
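As a rough illustration of the scoring described above, the sketch below pulls the final `ANSWER: X` letter out of a model response and computes accuracy per subject. The helper names `extract_answer` and `accuracy_by_subject` are hypothetical and are not part of the EvalScope API; the inclusion flag for multi-answer questions is not modeled, and the sample records are made up for the example.

```python
import re
from collections import defaultdict

# Matches the 'ANSWER: X' line requested by the prompt (four-choice case).
ANSWER_RE = re.compile(r"ANSWER:\s*([A-D])")


def extract_answer(response):
    """Return the letter from the last 'ANSWER: X' occurrence, or None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def accuracy_by_subject(records):
    """records: iterable of (subject, response_text, target_letter) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, response, target in records:
        total[subject] += 1
        if extract_answer(response) == target:
            correct[subject] += 1
    return {subject: correct[subject] / total[subject] for subject in total}


# Made-up responses, for illustration only.
records = [
    ("abstract_algebra", "Both statements hold.\nANSWER: A", "A"),
    ("abstract_algebra", "Only the first holds.\nANSWER: C", "A"),
    ("anatomy", "The second option fits.\nANSWER: B", "B"),
]
scores = accuracy_by_subject(records)
print(scores)  # {'abstract_algebra': 0.5, 'anatomy': 1.0}
```

Category-level scores (STEM, Humanities, and so on) could be computed the same way by mapping each subject to its category before counting.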
Properties#
| Property | Value |
|---|---|
| Benchmark Name | `mmlu_redux` |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | Accuracy |
| Default Shots | 0-shot |
| Evaluation Split | `test` |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 5,700 |
| Prompt Length (Mean) | 600.81 chars |
| Prompt Length (Min/Max) | 255 / 5082 chars |
Per-Subset Statistics:
| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|---|---|---|---|---|
| | 100 | 399.13 | 285 | 525 |
| | 100 | 454.09 | 323 | 788 |
| | 100 | 516.53 | 297 | 1087 |
| | 100 | 538.35 | 274 | 927 |
| | 100 | 453.01 | 303 | 757 |
| | 100 | 550.59 | 343 | 1081 |
| | 100 | 451.35 | 292 | 890 |
| | 100 | 631.74 | 336 | 1344 |
| | 100 | 451.77 | 255 | 709 |
| | 100 | 669.42 | 308 | 5082 |
| | 100 | 500.71 | 317 | 797 |
| | 100 | 476.01 | 291 | 1115 |
| | 100 | 372.72 | 284 | 539 |
| | 100 | 610.01 | 304 | 1036 |
| | 100 | 373.01 | 286 | 585 |
| | 100 | 403.98 | 264 | 797 |
| | 100 | 598.04 | 318 | 1275 |
| | 100 | 389.82 | 303 | 746 |
| | 100 | 574.54 | 328 | 1078 |
| | 100 | 496.56 | 271 | 1093 |
| | 100 | 649.34 | 278 | 1786 |
| | 100 | 1840.88 | 855 | 3045 |
| | 100 | 417.76 | 306 | 616 |
| | 100 | 539.42 | 384 | 864 |
| | 100 | 509.35 | 319 | 756 |
| | 100 | 409.45 | 282 | 846 |
| | 100 | 508.99 | 325 | 1070 |
| | 100 | 600.82 | 324 | 1414 |
| | 100 | 476.49 | 309 | 1055 |
| | 100 | 736.04 | 376 | 1772 |
| | 100 | 1643.04 | 782 | 2595 |
| | 100 | 1749.59 | 749 | 3834 |
| | 100 | 416.87 | 323 | 689 |
| | 100 | 457.5 | 307 | 1145 |
| | 100 | 636.61 | 332 | 986 |
| | 100 | 518.24 | 307 | 1039 |
| | 100 | 516.62 | 333 | 902 |
| | 100 | 513.09 | 315 | 806 |
| | 100 | 399.38 | 294 | 645 |
| | 100 | 474.85 | 335 | 757 |
| | 100 | 414.32 | 292 | 658 |
| | 100 | 388.56 | 277 | 1268 |
| | 100 | 534.28 | 325 | 872 |
| | 100 | 620.62 | 563 | 767 |
| | 100 | 505.6 | 318 | 926 |
| | 100 | 462.44 | 319 | 1155 |
| | 100 | 468.62 | 320 | 943 |
| | 100 | 650.46 | 341 | 1226 |
| | 100 | 1370.21 | 359 | 2928 |
| | 100 | 1003.93 | 610 | 1735 |
| | 100 | 577.98 | 317 | 1502 |
| | 100 | 472.39 | 300 | 1188 |
| | 100 | 1029.35 | 317 | 2066 |
| | 100 | 530.23 | 335 | 834 |
| | 100 | 490.28 | 305 | 754 |
| | 100 | 447.81 | 302 | 1383 |
| | 100 | 353.31 | 287 | 557 |
Sample Example#
Subset: abstract_algebra
{
  "input": [
    {
      "id": "f937b8b4",
      "content": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C,D. Think step by step before answering.\n\nStatement 1 | If T: V -> W is a linear transformation and dim(V ) < dim(W) < 1, then T must be injective. Statement 2 | Let dim(V) = n and suppose that T: V -> V is linear. If T is injective, then it is a bijection.\n\nA) True, True\nB) False, False\nC) True, False\nD) False, True"
    }
  ],
  "choices": [
    "True, True",
    "False, False",
    "True, False",
    "False, True"
  ],
  "target": [
    "A"
  ],
  "id": 0,
  "group_id": 0,
  "metadata": {
    "error_type": "bad_question_clarity",
    "correct_answer": "0",
    "potential_reason": "Statement 2 is true and well defined. \r\nHowever, statement 1 is not well defined: The dimension of a vector space is a nonnegative number, and since dim(V) < dim(W) < 1, this means dim(V) has to be negative. Taking this statement literally, the implication is vacuously true as the premise cannot be satisfed, but I doubt that was what the question is trying to test. "
  }
}
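Each record carries its annotation under `metadata.error_type`, so a quick tally by error type is one way to see how a sample of records is distributed. The snippet below is a minimal sketch over hand-written records that keep only the `id` and `metadata` fields of the layout shown above; the records themselves are hypothetical and do not reflect the real distribution in the dataset.

```python
from collections import Counter

# Hypothetical records, reduced to the fields used here.
records = [
    {"id": 0, "metadata": {"error_type": "bad_question_clarity"}},
    {"id": 1, "metadata": {"error_type": "wrong_groundtruth"}},
    {"id": 2, "metadata": {"error_type": "no_correct_answer"}},
    {"id": 3, "metadata": {"error_type": "wrong_groundtruth"}},
]

# Count how many records fall under each error type.
counts = Counter(r["metadata"]["error_type"] for r in records)

# Collect the ids of records whose original ground truth was wrong.
flagged = [r["id"] for r in records
           if r["metadata"]["error_type"] == "wrong_groundtruth"]

print(counts.most_common())
print(flagged)
```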
Prompt Template#
Prompt Template:
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.
{question}
{choices}
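To show how the placeholders might be filled, here is a small sketch that renders the template for one question. The comma-separated letters, the `A) ...` choice lines, and the blank lines between parts are inferred from the `content` field in the sample record; `build_prompt` is a hypothetical helper, not an EvalScope function, and the toy question is made up.

```python
import string

# Template text copied from this page; the blank-line separators are inferred
# from the sample record's "content" field.
PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'ANSWER: [LETTER]' (without "
    "quotes) where [LETTER] is one of {letters}. Think step by step before "
    "answering.\n\n{question}\n\n{choices}"
)


def build_prompt(question, choices):
    """Fill the template with a question and its lettered choices."""
    letters = string.ascii_uppercase[: len(choices)]
    choice_lines = "\n".join(
        f"{letter}) {choice}" for letter, choice in zip(letters, choices)
    )
    return PROMPT_TEMPLATE.format(
        letters=",".join(letters), question=question, choices=choice_lines
    )


# Toy question, for illustration only.
prompt = build_prompt("What is 2 + 2?", ["3", "4", "5", "6"])
print(prompt)
```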
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets mmlu_redux \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mmlu_redux'],
    dataset_args={
        'mmlu_redux': {
            # 'subset_list': ['abstract_algebra', 'anatomy', 'astronomy'],  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)