MMLU#

Overview#

MMLU (Massive Multitask Language Understanding) is a comprehensive evaluation benchmark designed to measure knowledge acquired during pretraining. It covers 57 subjects across STEM, humanities, social sciences, and other domains, ranging from elementary to professional difficulty levels.

Task Description#

Task Type: Multiple-Choice Question Answering
Input: Question with four answer choices (A, B, C, D)
Output: Single correct answer letter
Subjects: 57 subjects organized into 4 categories (STEM, Humanities, Social Sciences, Other)

Key Features#

Covers diverse knowledge domains from elementary to advanced professional levels
Tests both factual knowledge and reasoning abilities
Includes subjects like abstract algebra, anatomy, astronomy, business ethics, and more
Standard benchmark for measuring LLM knowledge breadth

Evaluation Notes#

Default configuration uses 5-shot examples from the dev split
Supports Chain-of-Thought (CoT) prompting for improved reasoning
Results can be aggregated by subject or category (STEM, Humanities, Social Sciences, Other)
Use subset_list parameter to evaluate specific subjects

Properties#

Property	Value
Benchmark Name	`mmlu`
Dataset ID	cais/mmlu
Paper	N/A
Tags	`Knowledge`, `MCQ`
Metrics	`acc`
Default Shots	5-shot
Evaluation Split	`test`
Train Split	`dev`

Data Statistics#

Metric	Value
Total Samples	14,042
Prompt Length (Mean)	3212.2 chars
Prompt Length (Min/Max)	985 / 14626 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean	Prompt Min	Prompt Max
`abstract_algebra`	100	1256.98	1143	1383
`anatomy`	135	1446	1306	1783
`astronomy`	152	2616.64	2401	3191
`business_ethics`	100	2756.36	2492	3145
`clinical_knowledge`	265	1680.76	1510	1995
`college_biology`	144	2104.53	1874	2641
`college_chemistry`	100	1800.19	1641	2239
`college_computer_science`	100	3424.74	3129	4137
`college_mathematics`	100	1972.77	1776	2230
`college_medicine`	173	2377.62	2005	6779
`college_physics`	102	1941.26	1757	2237
`computer_security`	100	1598.89	1414	2238
`conceptual_physics`	235	1340.77	1246	1572
`econometrics`	114	2286.2	1976	2708
`electrical_engineering`	145	1372.1	1279	1578
`elementary_mathematics`	378	1856.81	1724	2322
`formal_logic`	126	2338.07	2065	3022
`global_facts`	100	1644.77	1558	2001
`high_school_biology`	310	2218.56	1971	2737
`high_school_chemistry`	203	1740.83	1519	2341
`high_school_computer_science`	100	3595.34	3224	4732
`high_school_european_history`	165	13421.12	12392	14626
`high_school_geography`	198	1849.07	1726	2204
`high_school_government_and_politics`	193	2355.29	2171	2922
`high_school_macroeconomics`	390	1861.47	1649	2206
`high_school_mathematics`	270	1731.31	1591	2224
`high_school_microeconomics`	238	1849.97	1651	2396
`high_school_physics`	151	2110.63	1836	2926
`high_school_psychology`	545	2431.39	2223	3498
`high_school_statistics`	216	3273.79	2932	4328
`high_school_us_history`	204	10530.61	9656	11469
`high_school_world_history`	237	6700.7	5650	8814
`human_aging`	223	1448.65	1335	1725
`human_sexuality`	131	1556.02	1412	2250
`international_law`	121	3094.31	2778	3432
`jurisprudence`	108	1851.24	1638	2370
`logical_fallacies`	163	2114.39	1932	2503
`machine_learning`	112	2852.01	2659	3150
`management`	103	1326.07	1220	1571
`marketing`	234	1984.28	1832	2266
`medical_genetics`	100	1531.32	1409	1775
`miscellaneous`	783	1121.57	1004	2096
`moral_disputes`	346	2300.58	2101	2711
`moral_scenarios`	895	2709.89	2644	2853
`nutrition`	306	2620.9	2408	3111
`philosophy`	311	1472.67	1319	2292
`prehistory`	324	2388.29	2201	2899
`professional_accounting`	282	2820.72	2515	3400
`professional_law`	1,534	8077.0	6997	10539
`professional_medicine`	272	4832.72	4380	5802
`professional_psychology`	612	2860.73	2594	3789
`public_relations`	110	1991.35	1824	2712
`security_studies`	245	6405.04	5680	7818
`sociology`	201	2176.49	1976	2530
`us_foreign_policy`	100	2129.28	1944	2393
`virology`	166	1563.11	1426	2507
`world_religions`	171	1051.68	985	1255

Sample Example#

Subset: abstract_algebra

{
  "input": [
    {
      "id": "c7cbfbb9",
      "content": "Here are some examples of how to answer similar questions:\n\nFind all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.\nA) 0\nB) 1\nC) 2\nD) 3\nANSWER: B\n\nStatement 1 | If aH is an element of a factor group, then |aH| divides |a|. Statement 2 | If H ... [TRUNCATED] ... d be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C,D. Think step by step before answering.\n\nFind the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.\n\nA) 0\nB) 4\nC) 2\nD) 6"
    }
  ],
  "choices": [
    "0",
    "4",
    "2",
    "6"
  ],
  "target": "B",
  "id": 0,
  "group_id": 0,
  "subset_key": "abstract_algebra",
  "metadata": {
    "subject": "abstract_algebra"
  }
}

Note: Some content was truncated for display.

Prompt Template#

Prompt Template:

Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets mmlu \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mmlu'],
    dataset_args={
        'mmlu': {
            # subset_list: ['abstract_algebra', 'anatomy', 'astronomy']  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)