MMLU
Overview
MMLU (Massive Multitask Language Understanding) is a comprehensive evaluation benchmark designed to measure knowledge acquired during pretraining. It covers 57 subjects across STEM, humanities, social sciences, and other domains, ranging from elementary to professional difficulty levels.
Task Description
Task Type: Multiple-Choice Question Answering
Input: Question with four answer choices (A, B, C, D)
Output: Single correct answer letter
Subjects: 57 subjects organized into 4 categories (STEM, Humanities, Social Sciences, Other)
Key Features
Covers diverse knowledge domains from elementary to advanced professional levels
Tests both factual knowledge and reasoning abilities
Includes subjects like abstract algebra, anatomy, astronomy, business ethics, and more
Standard benchmark for measuring LLM knowledge breadth
Evaluation Notes
Default configuration uses 5-shot examples from the dev split
Supports Chain-of-Thought (CoT) prompting for improved reasoning
Results can be aggregated by subject or category (STEM, Humanities, Social Sciences, Other)
Use the `subset_list` parameter to evaluate specific subjects
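The aggregation by subject or category mentioned above can be sketched as follows. This is a minimal illustration, not evalscope's internal implementation: the `SUBJECT_TO_CATEGORY` mapping is a hypothetical, partial mapping covering only three subjects, the `aggregate` helper is an invented name, and the per-subject counts are placeholder values rather than measured results. The sketch uses a sample-weighted (micro) average; whether the framework weights by sample count or averages subjects equally is not stated here.

```python
from collections import defaultdict

# Hypothetical, partial subject-to-category mapping (assumption for illustration).
SUBJECT_TO_CATEGORY = {
    "abstract_algebra": "STEM",
    "astronomy": "STEM",
    "business_ethics": "Other",
}


def aggregate(per_subject):
    """Aggregate per-subject counts into category and overall accuracy.

    per_subject maps subject name -> (num_correct, num_total).
    Returns (accuracy per category, overall sample-weighted accuracy).
    """
    cat_correct = defaultdict(int)
    cat_total = defaultdict(int)
    for subject, (correct, total) in per_subject.items():
        category = SUBJECT_TO_CATEGORY[subject]
        cat_correct[category] += correct
        cat_total[category] += total
    by_category = {c: cat_correct[c] / cat_total[c] for c in cat_total}
    overall = sum(cat_correct.values()) / sum(cat_total.values())
    return by_category, overall


# Placeholder counts, not real evaluation results.
scores = {
    "abstract_algebra": (50, 100),
    "astronomy": (75, 100),
    "business_ethics": (80, 100),
}
by_category, overall = aggregate(scores)
print(by_category, overall)
```

Because the subsets differ widely in size, a sample-weighted average and an unweighted mean over subjects can give noticeably different overall scores, so it is worth checking which one a reported number uses.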
Properties
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 5-shot |
| Evaluation Split | |
| Train Split | |
Data Statistics
| Metric | Value |
|---|---|
| Total Samples | 14,042 |
| Prompt Length (Mean) | 3212.2 chars |
| Prompt Length (Min/Max) | 985 / 14626 chars |
Per-Subset Statistics:
| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
|---|---|---|---|---|
| | 100 | 1256.98 | 1143 | 1383 |
| | 135 | 1446 | 1306 | 1783 |
| | 152 | 2616.64 | 2401 | 3191 |
| | 100 | 2756.36 | 2492 | 3145 |
| | 265 | 1680.76 | 1510 | 1995 |
| | 144 | 2104.53 | 1874 | 2641 |
| | 100 | 1800.19 | 1641 | 2239 |
| | 100 | 3424.74 | 3129 | 4137 |
| | 100 | 1972.77 | 1776 | 2230 |
| | 173 | 2377.62 | 2005 | 6779 |
| | 102 | 1941.26 | 1757 | 2237 |
| | 100 | 1598.89 | 1414 | 2238 |
| | 235 | 1340.77 | 1246 | 1572 |
| | 114 | 2286.2 | 1976 | 2708 |
| | 145 | 1372.1 | 1279 | 1578 |
| | 378 | 1856.81 | 1724 | 2322 |
| | 126 | 2338.07 | 2065 | 3022 |
| | 100 | 1644.77 | 1558 | 2001 |
| | 310 | 2218.56 | 1971 | 2737 |
| | 203 | 1740.83 | 1519 | 2341 |
| | 100 | 3595.34 | 3224 | 4732 |
| | 165 | 13421.12 | 12392 | 14626 |
| | 198 | 1849.07 | 1726 | 2204 |
| | 193 | 2355.29 | 2171 | 2922 |
| | 390 | 1861.47 | 1649 | 2206 |
| | 270 | 1731.31 | 1591 | 2224 |
| | 238 | 1849.97 | 1651 | 2396 |
| | 151 | 2110.63 | 1836 | 2926 |
| | 545 | 2431.39 | 2223 | 3498 |
| | 216 | 3273.79 | 2932 | 4328 |
| | 204 | 10530.61 | 9656 | 11469 |
| | 237 | 6700.7 | 5650 | 8814 |
| | 223 | 1448.65 | 1335 | 1725 |
| | 131 | 1556.02 | 1412 | 2250 |
| | 121 | 3094.31 | 2778 | 3432 |
| | 108 | 1851.24 | 1638 | 2370 |
| | 163 | 2114.39 | 1932 | 2503 |
| | 112 | 2852.01 | 2659 | 3150 |
| | 103 | 1326.07 | 1220 | 1571 |
| | 234 | 1984.28 | 1832 | 2266 |
| | 100 | 1531.32 | 1409 | 1775 |
| | 783 | 1121.57 | 1004 | 2096 |
| | 346 | 2300.58 | 2101 | 2711 |
| | 895 | 2709.89 | 2644 | 2853 |
| | 306 | 2620.9 | 2408 | 3111 |
| | 311 | 1472.67 | 1319 | 2292 |
| | 324 | 2388.29 | 2201 | 2899 |
| | 282 | 2820.72 | 2515 | 3400 |
| | 1,534 | 8077.0 | 6997 | 10539 |
| | 272 | 4832.72 | 4380 | 5802 |
| | 612 | 2860.73 | 2594 | 3789 |
| | 110 | 1991.35 | 1824 | 2712 |
| | 245 | 6405.04 | 5680 | 7818 |
| | 201 | 2176.49 | 1976 | 2530 |
| | 100 | 2129.28 | 1944 | 2393 |
| | 166 | 1563.11 | 1426 | 2507 |
| | 171 | 1051.68 | 985 | 1255 |
Sample Example
Subset: abstract_algebra
{
  "input": [
    {
      "id": "c7cbfbb9",
      "content": "Here are some examples of how to answer similar questions:\n\nFind all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.\nA) 0\nB) 1\nC) 2\nD) 3\nANSWER: B\n\nStatement 1 | If aH is an element of a factor group, then |aH| divides |a|. Statement 2 | If H ... [TRUNCATED] ... d be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C,D. Think step by step before answering.\n\nFind the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.\n\nA) 0\nB) 4\nC) 2\nD) 6"
    }
  ],
  "choices": [
    "0",
    "4",
    "2",
    "6"
  ],
  "target": "B",
  "id": 0,
  "group_id": 0,
  "subset_key": "abstract_algebra",
  "metadata": {
    "subject": "abstract_algebra"
  }
}
Note: Some content was truncated for display.
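As a small sketch of how the fields in the record above relate to each other, the `target` letter indexes into `choices` by position. The helper name `target_text` is hypothetical and is not part of evalscope; the record below copies only the two relevant fields from the sample.

```python
def target_text(record):
    """Return the choice text that the record's target letter points to."""
    index = ord(record["target"]) - ord("A")  # "A" -> 0, "B" -> 1, ...
    return record["choices"][index]


# Fields copied from the abstract_algebra sample shown above.
record = {"choices": ["0", "4", "2", "6"], "target": "B"}
print(target_text(record))  # prints 4
```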
Prompt Template
Prompt Template:
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.
{question}
{choices}
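To make the template concrete, the following sketch fills in the three placeholders and then pulls the final answer letter out of a model response. It is an illustration under assumptions, not evalscope's own code: the function names `build_prompt` and `extract_answer` are invented here, the `A) ...` choice layout and blank-line separators are inferred from the sample record, and the regular expression is one plausible way to parse the `ANSWER: [LETTER]` line.

```python
import re

# Template text as documented; the blank lines between the parts are inferred
# from the sample record and are an assumption.
TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'ANSWER: [LETTER]' (without "
    "quotes) where [LETTER] is one of {letters}. Think step by step before "
    "answering.\n\n{question}\n\n{choices}"
)


def build_prompt(question, choices):
    """Fill the template with a question and its lettered answer choices."""
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    choice_lines = "\n".join(
        f"{letter}) {text}" for letter, text in zip(letters, choices)
    )
    return TEMPLATE.format(
        letters=",".join(letters), question=question, choices=choice_lines
    )


def extract_answer(response):
    """Return the letter from the last 'ANSWER: X' occurrence, or None."""
    matches = re.findall(r"ANSWER:\s*([A-D])\b", response)
    return matches[-1] if matches else None


prompt = build_prompt(
    "Find the degree for the given field extension "
    "Q(sqrt(2), sqrt(3), sqrt(18)) over Q.",
    ["0", "4", "2", "6"],
)
predicted = extract_answer("Adjoining sqrt(18) adds nothing new.\nANSWER: B")
```

Taking the last match rather than the first keeps the parser robust when a step-by-step response mentions the answer format before committing to a final letter.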
Usage
Using CLI
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets mmlu \
    --limit 10  # Remove this line for formal evaluation
Using Python
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mmlu'],
    dataset_args={
        'mmlu': {
            # Optional: uncomment to evaluate only specific subsets.
            # 'subset_list': ['abstract_algebra', 'anatomy', 'astronomy'],
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)