MMLU-Redux

Overview

MMLU-Redux is a corrected version of the MMLU benchmark. It addresses known errors in the original MMLU dataset: incorrect ground-truth labels, questions with no correct option, and ambiguous questions.

Task Description

  • Task Type: Multiple-Choice Knowledge Assessment

  • Input: Question with four answer choices

  • Output: Correct answer letter (A/B/C/D)

  • Domains: 57 subjects across STEM, Humanities, Social Sciences, and Other

Key Features

  • Corrects errors in original MMLU benchmark

  • Error types fixed include:

    • no_correct_answer: Questions where none of the listed options is correct

    • wrong_groundtruth: Questions whose labeled ground-truth answer is wrong

    • multiple_correct_answers: Questions with more than one correct option

  • Same 57-subject coverage as original MMLU

  • Maintains compatibility with MMLU evaluation frameworks

Evaluation Notes

  • Default evaluation uses the test split

  • Primary metric: Accuracy (with inclusion flag for multi-answer questions)

  • Uses Chain-of-Thought (CoT) prompting

  • Zero-shot evaluation only (few-shot not supported)

  • Results aggregated by subject and category (STEM, Humanities, Social Science, Other)
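As a rough illustration of the aggregation described in the last bullet, the sketch below computes accuracy per subject and per category from per-sample results. The `results` records and the `SUBJECT_TO_CATEGORY` mapping are hypothetical and cover only a few subjects; the framework's own mapping and result schema may differ.

```python
from collections import defaultdict

# Hypothetical, partial subject-to-category mapping (illustration only).
SUBJECT_TO_CATEGORY = {
    "abstract_algebra": "STEM",
    "astronomy": "STEM",
    "philosophy": "Humanities",
    "sociology": "Social Science",
    "marketing": "Other",
}


def aggregate_accuracy(results):
    """Return (per_subject, per_category) accuracy dicts.

    `results` is an iterable of (subject, is_correct) pairs.
    """
    by_subject = defaultdict(lambda: [0, 0])   # subject -> [correct, total]
    by_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for subject, is_correct in results:
        category = SUBJECT_TO_CATEGORY[subject]
        for bucket in (by_subject[subject], by_category[category]):
            bucket[0] += int(is_correct)
            bucket[1] += 1
    per_subject = {k: c / t for k, (c, t) in by_subject.items()}
    per_category = {k: c / t for k, (c, t) in by_category.items()}
    return per_subject, per_category


# Hypothetical per-sample outcomes.
results = [
    ("abstract_algebra", True),
    ("abstract_algebra", False),
    ("astronomy", True),
    ("philosophy", True),
]
per_subject, per_category = aggregate_accuracy(results)
print(per_subject)
print(per_category)
```

Category accuracy here pools all samples in the category. Because every subset of this benchmark has the same number of samples, pooling gives the same value as averaging the subject accuracies on a full run.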

Properties

| Property | Value |
|---|---|
| Benchmark Name | mmlu_redux |
| Dataset ID | AI-ModelScope/mmlu-redux-2.0 |
| Paper | N/A |
| Tags | Knowledge, MCQ |
| Metrics | acc |
| Default Shots | 0-shot |
| Evaluation Split | test |

Data Statistics

| Metric | Value |
|---|---|
| Total Samples | 5,700 |
| Prompt Length (Mean) | 600.81 chars |
| Prompt Length (Min/Max) | 255 / 5082 chars |

Per-Subset Statistics:

| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|---|---|---|---|---|
| abstract_algebra | 100 | 399.13 | 285 | 525 |
| anatomy | 100 | 454.09 | 323 | 788 |
| astronomy | 100 | 516.53 | 297 | 1087 |
| business_ethics | 100 | 538.35 | 274 | 927 |
| clinical_knowledge | 100 | 453.01 | 303 | 757 |
| college_biology | 100 | 550.59 | 343 | 1081 |
| college_chemistry | 100 | 451.35 | 292 | 890 |
| college_computer_science | 100 | 631.74 | 336 | 1344 |
| college_mathematics | 100 | 451.77 | 255 | 709 |
| college_medicine | 100 | 669.42 | 308 | 5082 |
| college_physics | 100 | 500.71 | 317 | 797 |
| computer_security | 100 | 476.01 | 291 | 1115 |
| conceptual_physics | 100 | 372.72 | 284 | 539 |
| econometrics | 100 | 610.01 | 304 | 1036 |
| electrical_engineering | 100 | 373.01 | 286 | 585 |
| elementary_mathematics | 100 | 403.98 | 264 | 797 |
| formal_logic | 100 | 598.04 | 318 | 1275 |
| global_facts | 100 | 389.82 | 303 | 746 |
| high_school_biology | 100 | 574.54 | 328 | 1078 |
| high_school_chemistry | 100 | 496.56 | 271 | 1093 |
| high_school_computer_science | 100 | 649.34 | 278 | 1786 |
| high_school_european_history | 100 | 1840.88 | 855 | 3045 |
| high_school_geography | 100 | 417.76 | 306 | 616 |
| high_school_government_and_politics | 100 | 539.42 | 384 | 864 |
| high_school_macroeconomics | 100 | 509.35 | 319 | 756 |
| high_school_mathematics | 100 | 409.45 | 282 | 846 |
| high_school_microeconomics | 100 | 508.99 | 325 | 1070 |
| high_school_physics | 100 | 600.82 | 324 | 1414 |
| high_school_psychology | 100 | 476.49 | 309 | 1055 |
| high_school_statistics | 100 | 736.04 | 376 | 1772 |
| high_school_us_history | 100 | 1643.04 | 782 | 2595 |
| high_school_world_history | 100 | 1749.59 | 749 | 3834 |
| human_aging | 100 | 416.87 | 323 | 689 |
| human_sexuality | 100 | 457.5 | 307 | 1145 |
| international_law | 100 | 636.61 | 332 | 986 |
| jurisprudence | 100 | 518.24 | 307 | 1039 |
| logical_fallacies | 100 | 516.62 | 333 | 902 |
| machine_learning | 100 | 513.09 | 315 | 806 |
| management | 100 | 399.38 | 294 | 645 |
| marketing | 100 | 474.85 | 335 | 757 |
| medical_genetics | 100 | 414.32 | 292 | 658 |
| miscellaneous | 100 | 388.56 | 277 | 1268 |
| moral_disputes | 100 | 534.28 | 325 | 872 |
| moral_scenarios | 100 | 620.62 | 563 | 767 |
| nutrition | 100 | 505.6 | 318 | 926 |
| philosophy | 100 | 462.44 | 319 | 1155 |
| prehistory | 100 | 468.62 | 320 | 943 |
| professional_accounting | 100 | 650.46 | 341 | 1226 |
| professional_law | 100 | 1370.21 | 359 | 2928 |
| professional_medicine | 100 | 1003.93 | 610 | 1735 |
| professional_psychology | 100 | 577.98 | 317 | 1502 |
| public_relations | 100 | 472.39 | 300 | 1188 |
| security_studies | 100 | 1029.35 | 317 | 2066 |
| sociology | 100 | 530.23 | 335 | 834 |
| us_foreign_policy | 100 | 490.28 | 305 | 754 |
| virology | 100 | 447.81 | 302 | 1383 |
| world_religions | 100 | 353.31 | 287 | 557 |
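The prompt-length figures above are character counts. A minimal sketch of how such statistics could be computed follows; the `prompt_length_stats` helper and the placeholder prompts are hypothetical and are not taken from the dataset.

```python
from statistics import mean


def prompt_length_stats(prompts):
    """Return mean, min, and max prompt length in characters."""
    lengths = [len(p) for p in prompts]
    return {
        "mean": round(mean(lengths), 2),
        "min": min(lengths),
        "max": max(lengths),
    }


# Placeholder prompts of 10, 20, and 33 characters (not real data).
prompts = ["a" * 10, "b" * 20, "c" * 33]
stats = prompt_length_stats(prompts)
print(stats)
```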

Sample Example

Subset: abstract_algebra

{
  "input": [
    {
      "id": "f937b8b4",
      "content": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C,D. Think step by step before answering.\n\nStatement 1 | If T: V -> W is a linear transformation and dim(V ) < dim(W) < 1, then T must be injective. Statement 2 | Let dim(V) = n and suppose that T: V -> V is linear. If T is injective, then it is a bijection.\n\nA) True, True\nB) False, False\nC) True, False\nD) False, True"
    }
  ],
  "choices": [
    "True, True",
    "False, False",
    "True, False",
    "False, True"
  ],
  "target": [
    "A"
  ],
  "id": 0,
  "group_id": 0,
  "metadata": {
    "error_type": "bad_question_clarity",
    "correct_answer": "0",
    "potential_reason": "Statement 2 is true and well defined. \r\nHowever, statement 1 is not well defined: The dimension of a vector space is a nonnegative number, and since dim(V) < dim(W) < 1, this means dim(V) has to be negative. Taking this statement literally, the implication is vacuously true as the premise cannot be satisfed, but I doubt that was what the question is trying to test. "
  }
}
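A minimal sketch of reading a record shaped like the one above: it maps the target letter back to its choice text and surfaces the annotation stored in `metadata`. The record is abbreviated, the `letter_to_choice` helper is hypothetical, and treating `correct_answer` as a zero-based index into `choices` is an assumption based on this single example.

```python
import json

# Abbreviated copy of the sample record (only the fields used here).
record = json.loads("""
{
  "choices": ["True, True", "False, False", "True, False", "False, True"],
  "target": ["A"],
  "metadata": {"error_type": "bad_question_clarity", "correct_answer": "0"}
}
""")


def letter_to_choice(letter, choices):
    """Map 'A', 'B', ... to the corresponding choice text."""
    return choices[ord(letter) - ord("A")]


target_text = [letter_to_choice(t, record["choices"]) for t in record["target"]]

# Assumption: correct_answer is a zero-based index stored as a string.
annotated_index = int(record["metadata"]["correct_answer"])
annotated_text = record["choices"][annotated_index]

print(target_text)
print(record["metadata"]["error_type"], annotated_text)
```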

Prompt Template

Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}
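The sketch below shows one way the template could be filled in and the final answer letter recovered from a model response. The `build_prompt` and `extract_answer` helpers and the regular expression are assumptions for illustration; the framework's own prompt construction and answer parsing may differ. The `A) ...` choice layout follows the sample example.

```python
import re

PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'ANSWER: [LETTER]' (without "
    "quotes) where [LETTER] is one of {letters}. Think step by step before "
    "answering.\n\n{question}\n\n{choices}"
)


def build_prompt(question, choices):
    """Fill the template with lettered choices (A, B, C, ...)."""
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    choice_block = "\n".join(f"{l}) {c}" for l, c in zip(letters, choices))
    return PROMPT_TEMPLATE.format(
        letters=",".join(letters), question=question, choices=choice_block
    )


def extract_answer(response, n_choices=4):
    """Return the letter of the last 'ANSWER: X' match, or None if absent."""
    last_letter = chr(ord("A") + n_choices - 1)
    matches = re.findall(rf"ANSWER:\s*([A-{last_letter}])", response)
    return matches[-1] if matches else None


prompt = build_prompt(
    "Which statement is true?",  # hypothetical question
    ["True, True", "False, False", "True, False", "False, True"],
)
answer = extract_answer("Step 1: ...\nStep 2: ...\nANSWER: B")
print(prompt)
print(answer)
```

Taking the last match rather than the first keeps the parse robust when the reasoning text itself mentions an answer format before the final line.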

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets mmlu_redux \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mmlu_redux'],
    dataset_args={
        'mmlu_redux': {
            # Optional: uncomment to evaluate specific subsets only
            # 'subset_list': ['abstract_algebra', 'anatomy', 'astronomy'],
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
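To restrict a run to a few subsets, the `dataset_args` dictionary can be built programmatically. The sketch below is a small hypothetical helper, assuming the `subset_list` key named in the commented line of the example above; it only constructs the dictionary and does not call evalscope.

```python
def build_dataset_args(subsets=None):
    """Build a dataset_args dict for mmlu_redux.

    With no subsets given, the inner dict stays empty, so every subset
    is evaluated by default.
    """
    args = {"mmlu_redux": {}}
    if subsets:
        # Assumption: 'subset_list' is the key read by the benchmark adapter.
        args["mmlu_redux"]["subset_list"] = list(subsets)
    return args


all_subsets_args = build_dataset_args()
stem_args = build_dataset_args(["abstract_algebra", "astronomy"])
print(all_subsets_args)
print(stem_args)
```

The returned dictionary would be passed as `dataset_args=` when constructing `TaskConfig`.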