C-Eval#

Overview#

C-Eval is a comprehensive Chinese evaluation benchmark designed to assess the knowledge and reasoning abilities of language models in Chinese. It covers 52 subjects ranging from STEM to humanities and social sciences, with questions from middle school to professional examination levels.

Task Description#

  • Task Type: Multiple-Choice Question Answering (Chinese)

  • Input: Chinese question with four answer choices (A, B, C, D)

  • Output: Single correct answer letter

  • Subjects: 52 subjects organized into 4 categories (STEM, Social Science, Humanities, Other)

Key Features#

  • 13,948 multiple-choice questions across 52 subjects (all splits combined; the val split used for evaluation contains 1,346)

  • Questions sourced from Chinese middle school, high school, college, and professional exams

  • Covers diverse domains including mathematics, physics, law, medicine, and more

  • Includes explanations for dev split questions, which appear in the few-shot examples (val records carry an empty explanation field)

  • Standard benchmark for Chinese language model evaluation

Evaluation Notes#

  • Default configuration uses 5-shot examples from the dev split

  • Questions and prompts are in Chinese

  • The last line of each response should follow the format “答案:[LETTER]”, where [LETTER] is one of A, B, C, or D

  • Results can be aggregated by subject or category

  • Use the subset_list parameter to evaluate specific subjects
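The answer format and the acc metric described above can be illustrated with a small parser. This is a minimal sketch, not EvalScope's actual extraction logic: the names `ANSWER_PATTERN`, `extract_answer`, and `accuracy` are hypothetical, and the regex assumes the model may emit either an ASCII or a full-width colon.

```python
import re
from typing import Optional, Sequence

# Hypothetical pattern: "答案" followed by an ASCII or full-width colon and
# one letter from A-D, optionally wrapped in square brackets.
ANSWER_PATTERN = re.compile(r"答案\s*[::]\s*\[?([ABCD])\]?")


def extract_answer(response: str) -> Optional[str]:
    """Return the answer letter from the last non-empty line, or None."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    if not lines:
        return None
    match = ANSWER_PATTERN.search(lines[-1])
    return match.group(1) if match else None


def accuracy(responses: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the target."""
    if not targets:
        return 0.0
    correct = sum(extract_answer(r) == t for r, t in zip(responses, targets))
    return correct / len(targets)
```

A response that never states its answer on the last line yields `None` and is counted as incorrect in this sketch.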

Properties#

| Property | Value |
|---|---|
| Benchmark Name | ceval |
| Dataset ID | evalscope/ceval |
| Paper | N/A |
| Tags | Chinese, Knowledge, MCQ |
| Metrics | acc |
| Default Shots | 5-shot |
| Evaluation Split | val |
| Train Split | dev |

Data Statistics#

| Metric | Value |
|---|---|
| Total Samples | 1,346 |
| Prompt Length (Mean) | 1643.61 chars |
| Prompt Length (Min/Max) | 727 / 6605 chars |

Per-Subset Statistics:

| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|---|---|---|---|---|
| computer_network | 19 | 1245.42 | 1201 | 1313 |
| operating_system | 19 | 1216.16 | 1187 | 1282 |
| computer_architecture | 21 | 1654.67 | 1622 | 1732 |
| college_programming | 37 | 1745.03 | 1660 | 2189 |
| college_physics | 19 | 2071.16 | 1986 | 2184 |
| college_chemistry | 24 | 2152.96 | 2107 | 2291 |
| advanced_mathematics | 19 | 6271.68 | 6130 | 6605 |
| probability_and_statistics | 18 | 4700.39 | 4527 | 4987 |
| discrete_mathematics | 16 | 1176.75 | 1104 | 1365 |
| electrical_engineer | 37 | 1137.81 | 1093 | 1264 |
| metrology_engineer | 24 | 1187.08 | 1140 | 1312 |
| high_school_mathematics | 18 | 2749.11 | 2670 | 2894 |
| high_school_physics | 19 | 1365 | 1263 | 1538 |
| high_school_chemistry | 19 | 1625.79 | 1536 | 1739 |
| high_school_biology | 19 | 1159.95 | 1103 | 1243 |
| middle_school_mathematics | 19 | 2141.68 | 2054 | 2413 |
| middle_school_biology | 21 | 1822.52 | 1769 | 1910 |
| middle_school_physics | 19 | 1596.53 | 1544 | 1721 |
| middle_school_chemistry | 20 | 1916.65 | 1857 | 2045 |
| veterinary_medicine | 23 | 1254.04 | 1210 | 1350 |
| college_economics | 55 | 1919.85 | 1863 | 2141 |
| business_administration | 33 | 1608.52 | 1547 | 1754 |
| marxism | 19 | 1061.32 | 1033 | 1104 |
| mao_zedong_thought | 24 | 1485.62 | 1449 | 1545 |
| education_science | 29 | 1369.24 | 1338 | 1444 |
| teacher_qualification | 44 | 1516.55 | 1457 | 1638 |
| high_school_politics | 19 | 2062 | 1955 | 2189 |
| high_school_geography | 19 | 1059.53 | 1026 | 1216 |
| middle_school_politics | 21 | 1653.24 | 1595 | 1714 |
| middle_school_geography | 12 | 1063.58 | 1020 | 1143 |
| modern_chinese_history | 23 | 1355.7 | 1313 | 1447 |
| ideological_and_moral_cultivation | 19 | 760.21 | 727 | 830 |
| logic | 22 | 2436.41 | 2358 | 2572 |
| law | 24 | 1799.29 | 1729 | 1931 |
| chinese_language_and_literature | 23 | 954.83 | 937 | 983 |
| art_studies | 33 | 793.3 | 774 | 844 |
| professional_tour_guide | 29 | 924.41 | 902 | 1004 |
| legal_professional | 23 | 2856.17 | 2718 | 2978 |
| high_school_chinese | 19 | 2295.79 | 2205 | 2418 |
| high_school_history | 20 | 1221.9 | 1164 | 1300 |
| middle_school_history | 22 | 1069.73 | 1034 | 1149 |
| civil_servant | 47 | 1973 | 1849 | 2186 |
| sports_science | 19 | 1810.26 | 1789 | 1874 |
| plant_protection | 22 | 1678.09 | 1653 | 1745 |
| basic_medicine | 19 | 938.05 | 920 | 976 |
| clinical_medicine | 22 | 1119.41 | 1086 | 1209 |
| urban_and_rural_planner | 46 | 1428.8 | 1373 | 1591 |
| accountant | 49 | 1605.92 | 1511 | 1808 |
| fire_engineer | 31 | 1240.81 | 1168 | 1402 |
| environmental_impact_assessment_engineer | 31 | 1269.97 | 1209 | 1388 |
| tax_accountant | 49 | 1970.65 | 1879 | 2099 |
| physician | 49 | 1010.59 | 983 | 1065 |
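Because subset sizes differ (12 to 55 samples), a sample-weighted average and an unweighted per-subject average can give different overall scores. The sketch below shows both, plus a per-category roll-up. The sample counts come from the per-subset statistics; the correct counts are invented for illustration, the category labels are an assumption based on the usual C-Eval grouping, and this page does not say which aggregation EvalScope reports by default.

```python
from collections import defaultdict

# (subset, category, n_samples, n_correct)
# n_samples: from the per-subset statistics. n_correct: made-up values.
# Category labels are assumed, not taken from this page.
RESULTS = [
    ("computer_network", "STEM", 19, 12),
    ("operating_system", "STEM", 19, 15),
    ("college_economics", "Social Science", 55, 33),
    ("law", "Humanities", 24, 12),
]


def micro_accuracy(rows):
    """Sample-weighted accuracy: total correct / total samples."""
    total = sum(n for _, _, n, _ in rows)
    correct = sum(c for _, _, _, c in rows)
    return correct / total


def macro_accuracy(rows):
    """Unweighted mean of per-subset accuracies."""
    return sum(c / n for _, _, n, c in rows) / len(rows)


def accuracy_by_category(rows):
    """Sample-weighted accuracy within each category."""
    totals = defaultdict(lambda: [0, 0])  # category -> [samples, correct]
    for _, category, n, c in rows:
        totals[category][0] += n
        totals[category][1] += c
    return {cat: c / n for cat, (n, c) in totals.items()}


print(f"micro: {micro_accuracy(RESULTS):.4f}")
print(f"macro: {macro_accuracy(RESULTS):.4f}")
print(accuracy_by_category(RESULTS))
```

When comparing scores across reports, it helps to state which of the two averages was used, since large subsets such as college_economics dominate the sample-weighted figure.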

Sample Example#

Subset: computer_network

{
  "input": [
    {
      "id": "73073a35",
      "content": "以下是一些示例问题:\n\n问题:下列设备属于资源子网的是____。\n选项:\nA. 计算机软件\nB. 网桥\nC. 交换机\nD. 路由器\n解析:1. 首先,资源子网是指提供共享资源的网络,如打印机、文件服务器等。\r\n2. 其次,我们需要了解选项中设备的功能。网桥、交换机和路由器的主要功能是实现不同网络之间的通信和数据传输,是通信子网设备。而计算机软件可以提供共享资源的功能。\n答案:A\n\n问题:滑动窗口的作用是____。\n选项:\nA. 流量控制\nB. 拥塞控制\nC. 路由控制\nD. 差错 ... [TRUNCATED] ... Mbps,所以答案为min{80Mbps, 100Mbps}=80Mbps,选C。\n答案:C\n\n\n以下是中国关于计算机网络的单项选择题,请选出其中的正确答案。你的回答的最后一行应该是这样的格式:\"答案:[LETTER]\"(不带引号),其中 [LETTER] 是 A、B、C、D 中的一个。\n\n问题:使用位填充方法,以01111110为位首flag,数据为011011111111111111110010,求问传送时要添加几个0____\n选项:\nA. 1\nB. 2\nC. 3\nD. 4\n"
    }
  ],
  "choices": [
    "1",
    "2",
    "3",
    "4"
  ],
  "target": "C",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "id": 0,
    "explanation": "",
    "subject": "computer_network"
  }
}

Note: Some content was truncated for display.
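In the record above, `target` is a letter while `choices` holds the option texts in order, so the letter indexes into the list. A minimal sketch of that mapping follows; the helper names `target_text` and `is_correct` are hypothetical and not part of EvalScope.

```python
import string


def target_text(record: dict) -> str:
    """Map the target letter (e.g. "C") to the text of the matching choice."""
    index = string.ascii_uppercase.index(record["target"])
    return record["choices"][index]


def is_correct(record: dict, predicted_letter: str) -> bool:
    """Compare a predicted letter with the record's target letter."""
    return predicted_letter.strip().upper() == record["target"]


# Fields copied from the computer_network sample above.
record = {"choices": ["1", "2", "3", "4"], "target": "C"}
print(target_text(record))  # prints 3
```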

Prompt Template#

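Prompt Template:

以下是中国关于{subject}的单项选择题,请选出其中的正确答案。你的回答的最后一行应该是这样的格式:"答案:[LETTER]"(不带引号),其中 [LETTER] 是 A、B、C、D 中的一个。

问题:{question}
选项:
{choices}

In English, the prompt reads: "The following are single-choice questions about {subject} in China. Please select the correct answer. The last line of your response should have the format "答案:[LETTER]" (without quotes), where [LETTER] is one of A, B, C, D." It is followed by the question (问题) and the options (选项).

Few-shot Template:

以下是一些示例问题:

{fewshot}

In English, the few-shot header reads: "Here are some example questions:".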
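To show how the two templates combine, here is a minimal sketch that fills them for one question with one few-shot example. The template strings are copied from this section; the helper functions, the exact blank-line spacing between parts, and the layout of each few-shot example (question, options, optional 解析, then 答案) are assumptions inferred from the sample record, not EvalScope's implementation.

```python
PROMPT_TEMPLATE = (
    '以下是中国关于{subject}的单项选择题,请选出其中的正确答案。'
    '你的回答的最后一行应该是这样的格式:"答案:[LETTER]"(不带引号),'
    '其中 [LETTER] 是 A、B、C、D 中的一个。\n\n'
    '问题:{question}\n选项:\n{choices}\n'
)
FEWSHOT_TEMPLATE = '以下是一些示例问题:\n\n{fewshot}\n\n'


def format_choices(choices):
    """Render choices as 'A. ...' through 'D. ...', one per line."""
    return "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))


def format_example(question, choices, answer, explanation=""):
    """Render one solved few-shot example (assumed layout)."""
    parts = [f"问题:{question}", "选项:", format_choices(choices)]
    if explanation:
        parts.append(f"解析:{explanation}")
    parts.append(f"答案:{answer}")
    return "\n".join(parts)


def build_prompt(subject, question, choices, examples=()):
    """Prepend the few-shot block (if any) to the filled prompt template."""
    prompt = PROMPT_TEMPLATE.format(
        subject=subject, question=question, choices=format_choices(choices)
    )
    if not examples:
        return prompt
    fewshot = "\n\n".join(format_example(**ex) for ex in examples)
    return FEWSHOT_TEMPLATE.format(fewshot=fewshot) + prompt


# Question texts taken from the computer_network sample above.
prompt = build_prompt(
    subject="计算机网络",
    question="使用位填充方法,以01111110为位首flag,数据为011011111111111111110010,求问传送时要添加几个0____",
    choices=["1", "2", "3", "4"],
    examples=[{
        "question": "下列设备属于资源子网的是____。",
        "choices": ["计算机软件", "网桥", "交换机", "路由器"],
        "answer": "A",
    }],
)
print(prompt)
```

With the default 5-shot setting, `examples` would hold five dev split items, which is why prompt lengths in the statistics run well above the length of a single question.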

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets ceval \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['ceval'],
    dataset_args={
        'ceval': {
            # 'subset_list': ['computer_network', 'operating_system', 'computer_architecture'],  # optional: evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
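When restricting a run to specific subjects, a misspelled subset name is easy to miss. The sketch below builds the `dataset_args` value with a basic name check. The helper `ceval_dataset_args` and the `KNOWN_SUBSETS` set are hypothetical and not part of EvalScope; the set lists only a few names from the per-subset statistics and would need all 52 in practice.

```python
# Partial list of subset names, copied from the per-subset statistics.
KNOWN_SUBSETS = {
    "computer_network",
    "operating_system",
    "computer_architecture",
    "college_programming",
}


def ceval_dataset_args(subsets):
    """Build a dataset_args dict for ceval, rejecting unknown subset names."""
    unknown = sorted(set(subsets) - KNOWN_SUBSETS)
    if unknown:
        raise ValueError(f"Unknown C-Eval subsets: {unknown}")
    return {"ceval": {"subset_list": list(subsets)}}


dataset_args = ceval_dataset_args(["computer_network", "operating_system"])
print(dataset_args)
```

The resulting dictionary can be passed as the `dataset_args` argument of `TaskConfig` in the example above.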