MMLU#

概述#

MMLU（Massive Multitask Language Understanding，大规模多任务语言理解）是一个综合性评估基准，旨在衡量模型在预训练阶段所获得的知识。它涵盖 STEM、人文学科、社会科学及其他领域的 57 个学科，难度从基础到专业级别不等。

任务描述#

任务类型：多项选择题问答（Multiple-Choice Question Answering）
输入：包含四个选项（A、B、C、D）的问题
输出：单个正确答案的字母
学科范围：57 个学科，分为 4 个类别（STEM、人文学科、社会科学、其他）

主要特点#

覆盖从基础到高级专业水平的多样化知识领域
同时考察事实性知识和推理能力
包含抽象代数、解剖学、天文学、商业伦理等多个学科
是衡量大语言模型知识广度的标准基准

评估说明#

默认配置使用开发集（dev split）中的 5-shot 示例
支持思维链（Chain-of-Thought, CoT）提示以提升推理能力
结果可按学科或类别（STEM、人文学科、社会科学、其他）进行聚合
使用 subset_list 参数可评估特定学科

属性#

属性	值
基准测试名称	`mmlu`
数据集 ID	cais/mmlu
论文	N/A
标签	`Knowledge`, `MCQ`
指标	`acc`
默认示例数量	5-shot
评估集	`test`
训练集	`dev`

数据统计#

指标	值
总样本数	14,042
提示词长度（平均）	3212.2 字符
提示词长度（最小/最大）	985 / 14626 字符

各子集统计数据：

子集	样本数	提示平均长度	提示最小长度	提示最大长度
`abstract_algebra`	100	1256.98	1143	1383
`anatomy`	135	1446	1306	1783
`astronomy`	152	2616.64	2401	3191
`business_ethics`	100	2756.36	2492	3145
`clinical_knowledge`	265	1680.76	1510	1995
`college_biology`	144	2104.53	1874	2641
`college_chemistry`	100	1800.19	1641	2239
`college_computer_science`	100	3424.74	3129	4137
`college_mathematics`	100	1972.77	1776	2230
`college_medicine`	173	2377.62	2005	6779
`college_physics`	102	1941.26	1757	2237
`computer_security`	100	1598.89	1414	2238
`conceptual_physics`	235	1340.77	1246	1572
`econometrics`	114	2286.2	1976	2708
`electrical_engineering`	145	1372.1	1279	1578
`elementary_mathematics`	378	1856.81	1724	2322
`formal_logic`	126	2338.07	2065	3022
`global_facts`	100	1644.77	1558	2001
`high_school_biology`	310	2218.56	1971	2737
`high_school_chemistry`	203	1740.83	1519	2341
`high_school_computer_science`	100	3595.34	3224	4732
`high_school_european_history`	165	13421.12	12392	14626
`high_school_geography`	198	1849.07	1726	2204
`high_school_government_and_politics`	193	2355.29	2171	2922
`high_school_macroeconomics`	390	1861.47	1649	2206
`high_school_mathematics`	270	1731.31	1591	2224
`high_school_microeconomics`	238	1849.97	1651	2396
`high_school_physics`	151	2110.63	1836	2926
`high_school_psychology`	545	2431.39	2223	3498
`high_school_statistics`	216	3273.79	2932	4328
`high_school_us_history`	204	10530.61	9656	11469
`high_school_world_history`	237	6700.7	5650	8814
`human_aging`	223	1448.65	1335	1725
`human_sexuality`	131	1556.02	1412	2250
`international_law`	121	3094.31	2778	3432
`jurisprudence`	108	1851.24	1638	2370
`logical_fallacies`	163	2114.39	1932	2503
`machine_learning`	112	2852.01	2659	3150
`management`	103	1326.07	1220	1571
`marketing`	234	1984.28	1832	2266
`medical_genetics`	100	1531.32	1409	1775
`miscellaneous`	783	1121.57	1004	2096
`moral_disputes`	346	2300.58	2101	2711
`moral_scenarios`	895	2709.89	2644	2853
`nutrition`	306	2620.9	2408	3111
`philosophy`	311	1472.67	1319	2292
`prehistory`	324	2388.29	2201	2899
`professional_accounting`	282	2820.72	2515	3400
`professional_law`	1,534	8077.0	6997	10539
`professional_medicine`	272	4832.72	4380	5802
`professional_psychology`	612	2860.73	2594	3789
`public_relations`	110	1991.35	1824	2712
`security_studies`	245	6405.04	5680	7818
`sociology`	201	2176.49	1976	2530
`us_foreign_policy`	100	2129.28	1944	2393
`virology`	166	1563.11	1426	2507
`world_religions`	171	1051.68	985	1255

样例示例#

子集: abstract_algebra

{
  "input": [
    {
      "id": "c7cbfbb9",
      "content": "Here are some examples of how to answer similar questions:\n\nFind all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.\nA) 0\nB) 1\nC) 2\nD) 3\nANSWER: B\n\nStatement 1 | If aH is an element of a factor group, then |aH| divides |a|. Statement 2 | If H ... [TRUNCATED] ... d be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C,D. Think step by step before answering.\n\nFind the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.\n\nA) 0\nB) 4\nC) 2\nD) 6"
    }
  ],
  "choices": [
    "0",
    "4",
    "2",
    "6"
  ],
  "target": "B",
  "id": 0,
  "group_id": 0,
  "subset_key": "abstract_algebra",
  "metadata": {
    "subject": "abstract_algebra"
  }
}

注：部分内容为显示目的已截断。

提示模板#

提示模板：

Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}

使用方法#

使用命令行（CLI）#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets mmlu \
    --limit 10  # 正式评估时请删除此行

使用 Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mmlu'],
    dataset_args={
        'mmlu': {
            # subset_list: ['abstract_algebra', 'anatomy', 'astronomy']  # 可选，用于评估特定子集
        }
    },
    limit=10,  # 正式评估时请删除此行
)

run_task(task_cfg=task_cfg)