C-MMLU

Overview

C-MMLU (Chinese Massive Multitask Language Understanding) is a comprehensive Chinese evaluation benchmark covering 67 subjects across STEM, humanities, social sciences, and China-specific topics. It evaluates models’ knowledge and reasoning in Chinese contexts.

Task Description

Task Type: Multiple-Choice Question Answering (Chinese)
Input: Chinese question with four answer choices (A, B, C, D)
Output: Single correct answer letter
Subjects: 67 subjects organized into categories including China-specific topics

Key Features

67 subjects covering diverse Chinese knowledge domains
Includes China-specific topics (Chinese history, literature, civil service exam, etc.)
Questions from elementary to professional levels
Tests both general knowledge and China-specific cultural knowledge
Standard benchmark for Chinese language model evaluation

Evaluation Notes

Default configuration uses 0-shot evaluation
Uses a Chinese Chain-of-Thought (CoT) prompt template
Results can be aggregated by subject or category
Categories: STEM, Humanities, Social Science, China-specific, Other
Evaluates on test split
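
The aggregation described in the notes above can be sketched as follows. This is a minimal illustration, not EvalScope's reporting code: `aggregate_accuracy` is a hypothetical helper, and `SUBJECT_TO_CATEGORY` is an assumed, partial mapping covering only three subjects.

```python
from collections import defaultdict

# Assumed, partial subject-to-category mapping for illustration only.
# The benchmark itself assigns all 67 subjects to a category.
SUBJECT_TO_CATEGORY = {
    "agronomy": "Other",
    "anatomy": "STEM",
    "ancient_chinese": "China-specific",
}


def aggregate_accuracy(results):
    """Aggregate (subject, is_correct) pairs into per-subject and per-category accuracy."""
    by_subject = defaultdict(list)
    by_category = defaultdict(list)
    for subject, is_correct in results:
        by_subject[subject].append(is_correct)
        category = SUBJECT_TO_CATEGORY.get(subject, "Unmapped")
        by_category[category].append(is_correct)

    def mean(flags):
        return sum(flags) / len(flags)

    subject_acc = {name: mean(flags) for name, flags in by_subject.items()}
    category_acc = {name: mean(flags) for name, flags in by_category.items()}
    return subject_acc, category_acc


results = [
    ("agronomy", True),
    ("agronomy", False),
    ("anatomy", True),
    ("ancient_chinese", False),
]
subject_acc, category_acc = aggregate_accuracy(results)
print(subject_acc)
print(category_acc)
```

Category accuracy here is computed over all questions in the category, so larger subjects weigh more; averaging the per-subject scores instead would weigh every subject equally.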

Properties

Property	Value
Benchmark Name	`cmmlu`
Dataset ID	evalscope/cmmlu
Paper	N/A
Tags	`Chinese`, `Knowledge`, `MCQ`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics

Metric	Value
Total Samples	11,582
Prompt Length (Mean)	197.87 chars
Prompt Length (Min/Max)	134 / 999 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean (chars)	Prompt Min (chars)	Prompt Max (chars)
`agronomy`	169	168.76	142	266
`anatomy`	148	157.72	141	224
`ancient_chinese`	164	178.6	144	367
`arts`	160	161.22	141	233
`astronomy`	165	191.31	143	404
`business_ethics`	209	175.15	146	291
`chinese_civil_service_exam`	160	284.88	143	554
`chinese_driving_rule`	131	181.57	151	250
`chinese_food_culture`	136	170.49	139	270
`chinese_foreign_policy`	107	254.16	150	381
`chinese_history`	323	250.97	164	387
`chinese_literature`	204	177.55	145	397
`chinese_teacher_qualification`	179	207.4	156	326
`college_actuarial_science`	106	270.99	163	558
`college_education`	107	198.09	149	355
`college_engineering_hydrology`	106	189.62	146	273
`college_law`	108	220.14	157	310
`college_mathematics`	105	343.1	174	999
`college_medical_statistics`	106	212.85	151	450
`clinical_knowledge`	237	245.74	150	393
`college_medicine`	273	187.64	141	416
`computer_science`	204	187.76	143	516
`computer_security`	171	214.01	149	399
`conceptual_physics`	147	222.4	154	337
`construction_project_management`	139	186.58	149	306
`economics`	159	184.19	149	259
`education`	163	169.17	145	225
`elementary_chinese`	252	174.84	142	368
`elementary_commonsense`	198	163.93	139	247
`elementary_information_and_technology`	238	181.63	143	275
`electrical_engineering`	172	183.77	148	358
`elementary_mathematics`	230	184.92	145	320
`ethnology`	135	176.41	145	294
`food_science`	143	165.87	141	240
`genetics`	176	187.56	146	283
`global_facts`	149	182.32	146	329
`high_school_biology`	169	267.46	177	486
`high_school_chemistry`	132	260.74	160	395
`high_school_geography`	118	207.08	142	377
`high_school_mathematics`	164	203.72	151	356
`high_school_physics`	110	223.11	152	353
`high_school_politics`	143	269.18	174	386
`human_sexuality`	126	175.63	139	261
`international_law`	185	199.09	150	385
`journalism`	172	172.25	142	234
`jurisprudence`	411	226.57	146	514
`legal_and_moral_basis`	214	205.67	154	317
`logical`	123	181.72	143	427
`machine_learning`	122	213.32	155	419
`management`	210	180.32	145	287
`marketing`	180	185.59	144	247
`marxist_theory`	189	190.72	145	273
`modern_chinese`	116	207.66	142	471
`nutrition`	145	173.48	144	267
`philosophy`	105	179.91	143	359
`professional_accounting`	175	183.38	147	281
`professional_law`	211	231.1	150	414
`professional_medicine`	376	174.87	144	319
`professional_psychology`	232	173.55	142	273
`public_relations`	174	178.06	144	263
`security_study`	135	186.07	145	302
`sociology`	226	173.89	145	384
`sports_science`	165	170.49	141	283
`traditional_chinese_medicine`	185	165.38	134	240
`virology`	169	176.32	144	266
`world_history`	161	258.64	167	388
`world_religions`	160	163.08	142	235
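
As a rough illustration of how the per-subset rows relate to the overall figures, the sketch below combines a few rows into a sample-weighted mean, an overall minimum, and an overall maximum. `combine_subset_stats` is a hypothetical helper, and the assumption that the overall statistics are derived this way is ours. Because the per-subset means in the table are already rounded, the combined mean is approximate.

```python
def combine_subset_stats(rows):
    """Combine (samples, mean, min, max) rows into overall prompt-length statistics."""
    total = sum(samples for samples, _, _, _ in rows)
    # Weight each subset mean by its sample count.
    weighted_mean = sum(samples * mean for samples, mean, _, _ in rows) / total
    return {
        "samples": total,
        "mean": round(weighted_mean, 2),
        "min": min(low for _, _, low, _ in rows),
        "max": max(high for _, _, _, high in rows),
    }


# Three rows copied from the table above.
rows = [
    (169, 168.76, 142, 266),  # agronomy
    (148, 157.72, 141, 224),  # anatomy
    (164, 178.6, 144, 367),   # ancient_chinese
]
print(combine_subset_stats(rows))
```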

Sample Example

Subset: agronomy

{
  "input": [
    {
      "id": "4e04de48",
      "content": "回答下面的单项选择题，请选出其中的正确答案。你的回答的最后一行应该是这样的格式：\"答案：[LETTER]\"（不带引号），其中 [LETTER] 是 A,B,C,D 中的一个。请在回答前进行一步步思考。\n\n问题：在农业生产中被当作极其重要的劳动对象发挥作用，最主要的不可替代的基本生产资料是\n选项：\nA) 农业生产工具\nB) 土地\nC) 劳动力\nD) 资金\n"
    }
  ],
  "choices": [
    "农业生产工具",
    "土地",
    "劳动力",
    "资金"
  ],
  "target": "B",
  "id": 0,
  "group_id": 0,
  "subset_key": "agronomy",
  "metadata": {
    "subject": "agronomy"
  }
}
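
Scoring a record like the one above amounts to pulling the final answer letter out of the model's response and comparing it with `target`. The sketch below shows one way to do that, assuming the response follows the "答案：[LETTER]" format requested in the prompt. `extract_answer` and `is_correct` are hypothetical helpers, and EvalScope's own answer parser may behave differently.

```python
import re

# Matches "答案：B" and also tolerates an ASCII colon or extra spaces.
ANSWER_PATTERN = re.compile(r"答案\s*[：:]\s*([A-D])")


def extract_answer(response):
    """Return the last answer letter found in the response, or None if absent."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def is_correct(response, record):
    """Compare the extracted letter with the record's `target` field."""
    return extract_answer(response) == record["target"]


record = {"target": "B", "subset_key": "agronomy"}
response = "土地是农业生产中不可替代的基本生产资料。\n答案：B"
print(extract_answer(response), is_correct(response, record))
```

Taking the last match rather than the first keeps the result stable when the step-by-step reasoning mentions a candidate answer before the final line.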

Prompt Template

回答下面的单项选择题，请选出其中的正确答案。你的回答的最后一行应该是这样的格式："答案：[LETTER]"（不带引号），其中 [LETTER] 是 {letters} 中的一个。请在回答前进行一步步思考。

问题：{question}
选项：
{choices}
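
To make the placeholders concrete, the sketch below fills the template with the agronomy question from the sample example. The comma-joined letters and the `A) ...` choice layout are inferred from that sample's `content` field. `render_prompt` is a hypothetical helper, not part of EvalScope.

```python
PROMPT_TEMPLATE = (
    "回答下面的单项选择题，请选出其中的正确答案。你的回答的最后一行应该是这样的格式："
    '"答案：[LETTER]"（不带引号），其中 [LETTER] 是 {letters} 中的一个。'
    "请在回答前进行一步步思考。\n\n"
    "问题：{question}\n"
    "选项：\n"
    "{choices}\n"
)


def render_prompt(question, choices):
    """Fill the template: letters are A, B, C, ... and each choice is 'X) text'."""
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    choice_lines = "\n".join(
        f"{letter}) {text}" for letter, text in zip(letters, choices)
    )
    return PROMPT_TEMPLATE.format(
        letters=",".join(letters),
        question=question,
        choices=choice_lines,
    )


prompt = render_prompt(
    "在农业生产中被当作极其重要的劳动对象发挥作用，最主要的不可替代的基本生产资料是",
    ["农业生产工具", "土地", "劳动力", "资金"],
)
print(prompt)
```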

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets cmmlu \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['cmmlu'],
    dataset_args={
        'cmmlu': {
            # 'subset_list': ['agronomy', 'anatomy', 'ancient_chinese'],  # optional: evaluate only these subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)