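# SuperGPQA

## Overview

SuperGPQA is a large-scale multiple-choice question-answering dataset designed to evaluate how well models generalize across domains. It contains more than 26,000 questions from over 50 fields, and each question offers 10 answer options.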
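
## Task Description

- Task type: multiple-choice knowledge evaluation
- Input: a question with 10 options (A-J)
- Output: the letter of the correct answer
- Fields: more than 50, spanning science, engineering, medicine, economics, law, and others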
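
## Key Features

- More than 26,000 questions covering over 50 academic fields
- 10 options per question (more challenging than the standard 4-option format)
- Broad coverage of the following areas:
  - Science: mathematics, physics, chemistry, biology
  - Engineering: computer science, electrical engineering, mechanical engineering
  - Medicine: clinical medicine, basic medicine, pharmacy
  - Humanities: philosophy, history, literature
  - Social sciences: economics, law, sociology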
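
One way to see why 10 options are harder than 4 is the random-guessing baseline. The sketch below assumes a model that guesses uniformly at random; `chance_accuracy` is a hypothetical helper written for illustration and is not part of EvalScope.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


four_way = chance_accuracy(4)   # standard 4-option format
ten_way = chance_accuracy(10)   # SuperGPQA-style 10-option format
print(f"4 options: {four_way:.0%}, 10 options: {ten_way:.0%}")
```

A reported accuracy is therefore best read against a 10% floor rather than the 25% floor of a 4-option benchmark.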
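
## Evaluation Notes

- Evaluation uses the train split by default (the only split available)
- Primary metric: accuracy on multiple-choice questions
- Only 0-shot and 5-shot evaluation are supported
- Prompts use Chain-of-Thought (CoT) reasoning
- Results can be aggregated by field or by discipline category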
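
The accuracy metric and the per-field or per-discipline grouping can be sketched as follows. This is an illustration only, not EvalScope's internal implementation: the records are made up, the `prediction` key is a hypothetical name for the letter extracted from the model's response, and `field` and `discipline` are flattened here for brevity.

```python
from collections import defaultdict


def accuracy(records):
    """Fraction of records whose predicted letter equals the target letter."""
    if not records:
        return 0.0
    hits = sum(1 for rec in records if rec["prediction"] == rec["target"])
    return hits / len(records)


def accuracy_by_group(records, group_key):
    """Accuracy computed separately for each value of group_key."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec[group_key]].append(rec)
    return {group: accuracy(items) for group, items in grouped.items()}


# Hypothetical results, for illustration only.
records = [
    {"field": "Philosophy", "discipline": "Philosophy",
     "prediction": "A", "target": "A"},
    {"field": "Electronic Science and Technology", "discipline": "Engineering",
     "prediction": "I", "target": "I"},
    {"field": "Electronic Science and Technology", "discipline": "Engineering",
     "prediction": "B", "target": "I"},
]

overall = accuracy(records)
per_field = accuracy_by_group(records, "field")
per_discipline = accuracy_by_group(records, "discipline")
print(overall, per_field, per_discipline)
```

Grouping by `field` gives the fine-grained subset view, while grouping by `discipline` gives the coarser category view.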

## Properties

| Property | Value |
|---|---|
| Benchmark name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default shots | 0-shot |
| Evaluation split | |

## Data Statistics

| Metric | Value |
|---|---|
| Total samples | 26,529 |
| Prompt length (mean) | 826.04 characters |
| Prompt length (min / max) | 294 / 8444 characters |

Per-subset statistics:

| Subset | Samples | Prompt length (mean) | Prompt length (min) | Prompt length (max) |
|---|---|---|---|---|
| | 246 | 994.09 | 355 | 6149 |
| | 347 | 880.01 | 357 | 3171 |
| | 268 | 691.78 | 339 | 2282 |
| | 723 | 779.14 | 348 | 5407 |
| | 2,622 | 908.66 | 326 | 7946 |
| | 2,845 | 936.31 | 348 | 6780 |
| | 1,218 | 830.98 | 334 | 4003 |
| | 763 | 823.37 | 331 | 4409 |
| | 504 | 921.14 | 347 | 5594 |
| | 190 | 1042.88 | 354 | 6016 |
| | 150 | 843.95 | 363 | 2432 |
| | 591 | 1220.98 | 366 | 4267 |
| | 674 | 595.18 | 326 | 3707 |
| | 567 | 690.19 | 320 | 2368 |
| | 247 | 695.45 | 355 | 1728 |
| | 289 | 834.37 | 359 | 2630 |
| | 556 | 891.89 | 385 | 4025 |
| | 50 | 1214.34 | 403 | 2676 |
| | 684 | 885.2 | 345 | 3093 |
| | 205 | 709.2 | 349 | 2839 |
| | 1,120 | 786.49 | 337 | 4581 |
| | 142 | 877.2 | 359 | 2219 |
| | 440 | 778.78 | 313 | 4331 |
| | 292 | 670.73 | 348 | 3899 |
| | 65 | 700.06 | 362 | 2239 |
| | 1,769 | 823.53 | 330 | 6109 |
| | 218 | 840.64 | 354 | 2504 |
| | 410 | 1035.03 | 325 | 3839 |
| | 278 | 642.21 | 355 | 2629 |
| | 133 | 636.14 | 336 | 2514 |
| | 603 | 565.14 | 338 | 3552 |
| | 162 | 647.03 | 345 | 2684 |
| | 100 | 707.91 | 366 | 2676 |
| | 151 | 730.48 | 339 | 4242 |
| | 200 | 801.77 | 332 | 2462 |
| | 207 | 550.58 | 341 | 1620 |
| | 107 | 850.05 | 378 | 3467 |
| | 100 | 866.71 | 338 | 3090 |
| | 138 | 599.32 | 295 | 2320 |
| | 189 | 779.83 | 333 | 2629 |
| | 251 | 778.94 | 362 | 3854 |
| | 341 | 810.18 | 350 | 3293 |
| | 50 | 933.54 | 364 | 2176 |
| | 426 | 540.96 | 294 | 2909 |
| | 132 | 792.47 | 386 | 2975 |
| | 56 | 497.89 | 339 | 1457 |
| | 176 | 863.61 | 377 | 3684 |
| | 119 | 792.2 | 366 | 3178 |
| | 358 | 831.25 | 343 | 2306 |
| | 908 | 955.85 | 433 | 8444 |
| | 112 | 725.53 | 382 | 1882 |
| | 143 | 758.95 | 334 | 3241 |
| | 109 | 600.39 | 343 | 1383 |
| | 104 | 960.7 | 335 | 3178 |
| | 168 | 729.06 | 340 | 2648 |
| | 255 | 828.58 | 373 | 2421 |
| | 150 | 806.67 | 368 | 3502 |
| | 100 | 865.04 | 379 | 5785 |
| | 405 | 810.0 | 334 | 2925 |
| | 50 | 722 | 369 | 1554 |
| | 203 | 732.33 | 337 | 2455 |
| | 376 | 777.42 | 366 | 2718 |
| | 103 | 621.8 | 361 | 1913 |
| | 100 | 927.86 | 384 | 7376 |
| | 145 | 662.8 | 394 | 1694 |
| | 58 | 765.22 | 411 | 1832 |
| | 87 | 674.59 | 393 | 1973 |
| | 131 | 649.55 | 393 | 1995 |
| | 100 | 723.75 | 378 | 2124 |
| | 50 | 602.5 | 362 | 1063 |
| | 50 | 759.36 | 375 | 1820 |
| | 150 | 687.39 | 365 | 2156 |

## Sample Example

Subset: Electronic Science and Technology

    {
      "input": [
        {
          "id": "d0ceb797",
          "content": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C,D,E,F,G,H,I,J. Think step by step before answering.\n\nThe commo ... [TRUNCATED] ... oduct of A1 and A2's common-mode rejection ratios\nG) the size of A2's common-mode rejection ratio\nH) the size of A1's common-mode rejection ratio\nI) The difference in the common-mode rejection ratio of A1 and A2 themselves\nJ) input resistance"
        }
      ],
      "choices": [
        "the absolute value of the difference in the common-mode rejection ratio of A1 and A2 themselves",
        "all of the above",
        "the average of A1 and A2's common-mode rejection ratios",
        "the sum of A1 and A2's common-mode rejection ratios",
        "the product of A1 and A2's common-mode rejection ratios",
        "the square root of the product of A1 and A2's common-mode rejection ratios",
        "the size of A2's common-mode rejection ratio",
        "the size of A1's common-mode rejection ratio",
        "The difference in the common-mode rejection ratio of A1 and A2 themselves",
        "input resistance"
      ],
      "target": "I",
      "id": 0,
      "group_id": 0,
      "subset_key": "Electronic Science and Technology",
      "metadata": {
        "field": "Electronic Science and Technology",
        "discipline": "Engineering",
        "uuid": "a8390c754538493ba59055689b4482aa",
        "explanation": "The difference in the common-mode rejection ratio of A1 and A2 themselves"
      }
    }

Note: some content has been truncated for display.
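
In the sample record, `target` is a letter that indexes into `choices` (A is the first entry, J the tenth). A minimal sketch of that lookup, using the `choices` list from the sample; `choice_for_letter` is a hypothetical helper, not an EvalScope function.

```python
import string


def choice_for_letter(choices, letter):
    """Return the choice text labelled by an uppercase letter (A = first choice)."""
    letters = string.ascii_uppercase[: len(choices)]
    if len(letter) != 1 or letter not in letters:
        raise ValueError(f"{letter!r} is not one of {','.join(letters)}")
    return choices[letters.index(letter)]


# The choices list copied from the sample record.
choices = [
    "the absolute value of the difference in the common-mode rejection ratio of A1 and A2 themselves",
    "all of the above",
    "the average of A1 and A2's common-mode rejection ratios",
    "the sum of A1 and A2's common-mode rejection ratios",
    "the product of A1 and A2's common-mode rejection ratios",
    "the square root of the product of A1 and A2's common-mode rejection ratios",
    "the size of A2's common-mode rejection ratio",
    "the size of A1's common-mode rejection ratio",
    "The difference in the common-mode rejection ratio of A1 and A2 themselves",
    "input resistance",
]

answer_text = choice_for_letter(choices, "I")
print(answer_text)
```

For this record the looked-up text matches the `explanation` field in `metadata`.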

## Prompt Template

Prompt template:

    Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.
    {question}
    {choices}
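
The sketch below shows one way the template placeholders could be filled and the final `ANSWER: [LETTER]` line parsed. It is an illustration under stated assumptions, not EvalScope's actual code: `build_prompt` and `extract_answer` are hypothetical helpers, the comma-joined letters and the `A) text` choice layout follow the sample prompt, and the blank line between the question and the choices is an assumption.

```python
import re
import string

PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'ANSWER: [LETTER]' (without "
    "quotes) where [LETTER] is one of {letters}. Think step by step before "
    "answering.\n\n{question}\n\n{choices}"  # separator before choices is assumed
)


def build_prompt(question, choices):
    """Fill {letters}, {question} and {choices} for one question."""
    letters = string.ascii_uppercase[: len(choices)]
    formatted = "\n".join(f"{l}) {c}" for l, c in zip(letters, choices))
    return PROMPT_TEMPLATE.format(
        letters=",".join(letters), question=question, choices=formatted
    )


def extract_answer(response, num_choices=10):
    """Return the letter from a final 'ANSWER: X' line, or None if absent."""
    letters = string.ascii_uppercase[:num_choices]
    lines = response.strip().splitlines()
    if not lines:
        return None
    match = re.search(rf"ANSWER:\s*([{letters}])\b", lines[-1])
    return match.group(1) if match else None


prompt = build_prompt("What is 1 + 1?", ["1", "2", "3"])
print(prompt)
print(extract_answer("1 + 1 equals 2, which is option B.\nANSWER: B", num_choices=3))
```

Only the last line of the response is inspected, mirroring the instruction in the template; a response without a valid final line yields `None`, which a scorer would typically count as incorrect.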

## Usage

### Using the CLI

    evalscope eval \
        --model YOUR_MODEL \
        --api-url OPENAI_API_COMPAT_URL \
        --api-key EMPTY_TOKEN \
        --datasets super_gpqa \
        --limit 10  # Remove this line for a full evaluation

### Using Python

    from evalscope import run_task
    from evalscope.config import TaskConfig

    task_cfg = TaskConfig(
        model='YOUR_MODEL',
        api_url='OPENAI_API_COMPAT_URL',
        api_key='EMPTY_TOKEN',
        datasets=['super_gpqa'],
        dataset_args={
            'super_gpqa': {
                # Optional: uncomment to evaluate only specific subsets
                # 'subset_list': ['Electronic Science and Technology', 'Philosophy', 'Traditional Chinese Medicine'],
            }
        },
        limit=10,  # Remove this line for a full evaluation
    )

    run_task(task_cfg=task_cfg)