C-Eval#
Overview#
C-Eval is a comprehensive Chinese evaluation benchmark designed to assess the knowledge and reasoning abilities of language models in Chinese. It covers 52 subjects ranging from STEM to humanities and social sciences, with questions from middle school to professional examination levels.
Task Description#
Task Type: Multiple-Choice Question Answering (Chinese)
Input: Chinese question with four answer choices (A, B, C, D)
Output: Single correct answer letter
Subjects: 52 subjects organized into 4 categories (STEM, Social Science, Humanities, Other)
Key Features#
13,948 multiple-choice questions across 52 subjects
Questions sourced from Chinese middle school, high school, college, and professional exams
Covers diverse domains including mathematics, physics, law, medicine, and more
Includes explanations for validation split questions
Standard benchmark for Chinese language model evaluation
Evaluation Notes#
Default configuration uses 5-shot examples from the dev split
Questions and prompts are in Chinese
Answers should follow the format: “答案:[LETTER]”
Results can be aggregated by subject or category
Use
subset_listparameter to evaluate specific subjects
Properties#
Property |
Value |
|---|---|
Benchmark Name |
|
Dataset ID |
|
Paper |
N/A |
Tags |
|
Metrics |
|
Default Shots |
5-shot |
Evaluation Split |
|
Train Split |
|
Data Statistics#
Metric |
Value |
|---|---|
Total Samples |
1,346 |
Prompt Length (Mean) |
1643.61 chars |
Prompt Length (Min/Max) |
727 / 6605 chars |
Per-Subset Statistics:
Subset |
Samples |
Prompt Mean |
Prompt Min |
Prompt Max |
|---|---|---|---|---|
|
19 |
1245.42 |
1201 |
1313 |
|
19 |
1216.16 |
1187 |
1282 |
|
21 |
1654.67 |
1622 |
1732 |
|
37 |
1745.03 |
1660 |
2189 |
|
19 |
2071.16 |
1986 |
2184 |
|
24 |
2152.96 |
2107 |
2291 |
|
19 |
6271.68 |
6130 |
6605 |
|
18 |
4700.39 |
4527 |
4987 |
|
16 |
1176.75 |
1104 |
1365 |
|
37 |
1137.81 |
1093 |
1264 |
|
24 |
1187.08 |
1140 |
1312 |
|
18 |
2749.11 |
2670 |
2894 |
|
19 |
1365 |
1263 |
1538 |
|
19 |
1625.79 |
1536 |
1739 |
|
19 |
1159.95 |
1103 |
1243 |
|
19 |
2141.68 |
2054 |
2413 |
|
21 |
1822.52 |
1769 |
1910 |
|
19 |
1596.53 |
1544 |
1721 |
|
20 |
1916.65 |
1857 |
2045 |
|
23 |
1254.04 |
1210 |
1350 |
|
55 |
1919.85 |
1863 |
2141 |
|
33 |
1608.52 |
1547 |
1754 |
|
19 |
1061.32 |
1033 |
1104 |
|
24 |
1485.62 |
1449 |
1545 |
|
29 |
1369.24 |
1338 |
1444 |
|
44 |
1516.55 |
1457 |
1638 |
|
19 |
2062 |
1955 |
2189 |
|
19 |
1059.53 |
1026 |
1216 |
|
21 |
1653.24 |
1595 |
1714 |
|
12 |
1063.58 |
1020 |
1143 |
|
23 |
1355.7 |
1313 |
1447 |
|
19 |
760.21 |
727 |
830 |
|
22 |
2436.41 |
2358 |
2572 |
|
24 |
1799.29 |
1729 |
1931 |
|
23 |
954.83 |
937 |
983 |
|
33 |
793.3 |
774 |
844 |
|
29 |
924.41 |
902 |
1004 |
|
23 |
2856.17 |
2718 |
2978 |
|
19 |
2295.79 |
2205 |
2418 |
|
20 |
1221.9 |
1164 |
1300 |
|
22 |
1069.73 |
1034 |
1149 |
|
47 |
1973 |
1849 |
2186 |
|
19 |
1810.26 |
1789 |
1874 |
|
22 |
1678.09 |
1653 |
1745 |
|
19 |
938.05 |
920 |
976 |
|
22 |
1119.41 |
1086 |
1209 |
|
46 |
1428.8 |
1373 |
1591 |
|
49 |
1605.92 |
1511 |
1808 |
|
31 |
1240.81 |
1168 |
1402 |
|
31 |
1269.97 |
1209 |
1388 |
|
49 |
1970.65 |
1879 |
2099 |
|
49 |
1010.59 |
983 |
1065 |
Sample Example#
Subset: computer_network
{
"input": [
{
"id": "73073a35",
"content": "以下是一些示例问题:\n\n问题:下列设备属于资源子网的是____。\n选项:\nA. 计算机软件\nB. 网桥\nC. 交换机\nD. 路由器\n解析:1. 首先,资源子网是指提供共享资源的网络,如打印机、文件服务器等。\r\n2. 其次,我们需要了解选项中设备的功能。网桥、交换机和路由器的主要功能是实现不同网络之间的通信和数据传输,是通信子网设备。而计算机软件可以提供共享资源的功能。\n答案:A\n\n问题:滑动窗口的作用是____。\n选项:\nA. 流量控制\nB. 拥塞控制\nC. 路由控制\nD. 差错 ... [TRUNCATED] ... Mbps,所以答案为min{80Mbps, 100Mbps}=80Mbps,选C。\n答案:C\n\n\n以下是中国关于计算机网络的单项选择题,请选出其中的正确答案。你的回答的最后一行应该是这样的格式:\"答案:[LETTER]\"(不带引号),其中 [LETTER] 是 A、B、C、D 中的一个。\n\n问题:使用位填充方法,以01111110为位首flag,数据为011011111111111111110010,求问传送时要添加几个0____\n选项:\nA. 1\nB. 2\nC. 3\nD. 4\n"
}
],
"choices": [
"1",
"2",
"3",
"4"
],
"target": "C",
"id": 0,
"group_id": 0,
"metadata": {
"id": 0,
"explanation": "",
"subject": "computer_network"
}
}
Note: Some content was truncated for display.
Prompt Template#
Prompt Template:
以下是中国关于{subject}的单项选择题,请选出其中的正确答案。你的回答的最后一行应该是这样的格式:"答案:[LETTER]"(不带引号),其中 [LETTER] 是 A、B、C、D 中的一个。
问题:{question}
选项:
{choices}
Few-shot Template
以下是一些示例问题:
{fewshot}
Usage#
Using CLI#
evalscope eval \
--model YOUR_MODEL \
--api-url OPENAI_API_COMPAT_URL \
--api-key EMPTY_TOKEN \
--datasets ceval \
--limit 10 # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
model='YOUR_MODEL',
api_url='OPENAI_API_COMPAT_URL',
api_key='EMPTY_TOKEN',
datasets=['ceval'],
dataset_args={
'ceval': {
# subset_list: ['computer_network', 'operating_system', 'computer_architecture'] # optional, evaluate specific subsets
}
},
limit=10, # Remove this line for formal evaluation
)
run_task(task_cfg=task_cfg)