SuperGPQA#

Overview#

SuperGPQA is a large-scale multiple-choice question answering dataset designed to evaluate model generalization across diverse fields. It contains 26,000+ questions from 50+ fields, with each question featuring 10 answer options.

Task Description#

Task Type: Multiple-Choice Knowledge Assessment
Input: Question with 10 answer choices (A-J)
Output: Correct answer letter
Domains: 50+ fields across Science, Engineering, Medicine, Economics, Law, etc.

Key Features#

26,000+ questions across 50+ academic fields
10 options per question (more challenging than standard 4-choice)
Broad coverage including:
- Science: Mathematics, Physics, Chemistry, Biology
- Engineering: Computer Science, Electrical, Mechanical
- Medicine: Clinical, Basic Medical, Pharmacy
- Humanities: Philosophy, History, Literature
- Social Sciences: Economics, Law, Sociology

Evaluation Notes#

Default evaluation uses the train split (only available split)
Primary metric: Accuracy on multiple-choice questions
Supports 0-shot or 5-shot evaluation only
Uses Chain-of-Thought (CoT) prompting
Results can be grouped by field or discipline category

Properties#

Property	Value
Benchmark Name	`super_gpqa`
Dataset ID	m-a-p/SuperGPQA
Paper	N/A
Tags	`Knowledge`, `MCQ`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`train`

Data Statistics#

Metric	Value
Total Samples	26,529
Prompt Length (Mean)	826.04 chars
Prompt Length (Min/Max)	294 / 8444 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean	Prompt Min	Prompt Max
`Electronic Science and Technology`	246	994.09	355	6149
`Philosophy`	347	880.01	357	3171
`Traditional Chinese Medicine`	268	691.78	339	2282
`Applied Economics`	723	779.14	348	5407
`Mathematics`	2,622	908.66	326	7946
`Physics`	2,845	936.31	348	6780
`Clinical Medicine`	1,218	830.98	334	4003
`Computer Science and Technology`	763	823.37	331	4409
`Information and Communication Engineering`	504	921.14	347	5594
`Control Science and Engineering`	190	1042.88	354	6016
`Theoretical Economics`	150	843.95	363	2432
`Law`	591	1220.98	366	4267
`History`	674	595.18	326	3707
`Basic Medicine`	567	690.19	320	2368
`Education`	247	695.45	355	1728
`Materials Science and Engineering`	289	834.37	359	2630
`Electrical Engineering`	556	891.89	385	4025
`Systems Science`	50	1214.34	403	2676
`Power Engineering and Engineering Thermophysics`	684	885.2	345	3093
`Military Science`	205	709.2	349	2839
`Biology`	1,120	786.49	337	4581
`Business Administration`	142	877.2	359	2219
`Language and Literature`	440	778.78	313	4331
`Public Health and Preventive Medicine`	292	670.73	348	3899
`Political Science`	65	700.06	362	2239
`Chemistry`	1,769	823.53	330	6109
`Hydraulic Engineering`	218	840.64	354	2504
`Chemical Engineering and Technology`	410	1035.03	325	3839
`Pharmacy`	278	642.21	355	2629
`Geography`	133	636.14	336	2514
`Art Studies`	603	565.14	338	3552
`Architecture`	162	647.03	345	2684
`Forestry Engineering`	100	707.91	366	2676
`Public Administration`	151	730.48	339	4242
`Oceanography`	200	801.77	332	2462
`Journalism and Communication`	207	550.58	341	1620
`Nuclear Science and Technology`	107	850.05	378	3467
`Weapon Science and Technology`	100	866.71	338	3090
`Naval Architecture and Ocean Engineering`	138	599.32	295	2320
`Environmental Science and Engineering`	189	779.83	333	2629
`Transportation Engineering`	251	778.94	362	3854
`Geology`	341	810.18	350	3293
`Physical Oceanography`	50	933.54	364	2176
`Musicology`	426	540.96	294	2909
`Stomatology`	132	792.47	386	2975
`Aquaculture`	56	497.89	339	1457
`Mechanical Engineering`	176	863.61	377	3684
`Aeronautical and Astronautical Science and Technology`	119	792.2	366	3178
`Civil Engineering`	358	831.25	343	2306
`Mechanics`	908	955.85	433	8444
`Petroleum and Natural Gas Engineering`	112	725.53	382	1882
`Sociology`	143	758.95	334	3241
`Food Science and Engineering`	109	600.39	343	1383
`Agricultural Engineering`	104	960.7	335	3178
`Surveying and Mapping Science and Technology`	168	729.06	340	2648
`Metallurgical Engineering`	255	828.58	373	2421
`Library, Information and Archival Management`	150	806.67	368	3502
`Mining Engineering`	100	865.04	379	5785
`Astronomy`	405	810.0	334	2925
`Geological Resources and Geological Engineering`	50	722	369	1554
`Atmospheric Science`	203	732.33	337	2455
`Optical Engineering`	376	777.42	366	2718
`Animal Husbandry`	103	621.8	361	1913
`Geophysics`	100	927.86	384	7376
`Crop Science`	145	662.8	394	1694
`Management Science and Engineering`	58	765.22	411	1832
`Psychology`	87	674.59	393	1973
`Forestry`	131	649.55	393	1995
`Textile Science and Engineering`	100	723.75	378	2124
`Veterinary Medicine`	50	602.5	362	1063
`Instrument Science and Technology`	50	759.36	375	1820
`Physical Education`	150	687.39	365	2156

Sample Example#

Subset: Electronic Science and Technology

{
  "input": [
    {
      "id": "d0ceb797",
      "content": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C,D,E,F,G,H,I,J. Think step by step before answering.\n\nThe commo ... [TRUNCATED] ... oduct of A1 and A2's common-mode rejection ratios\nG) the size of A2's common-mode rejection ratio\nH) the size of A1's common-mode rejection ratio\nI) The difference in the common-mode rejection ratio of A1 and A2 themselves\nJ) input resistance"
    }
  ],
  "choices": [
    "the absolute value of the difference in the common-mode rejection ratio of A1 and A2 themselves",
    "all of the above",
    "the average of A1 and A2's common-mode rejection ratios",
    "the sum of A1 and A2's common-mode rejection ratios",
    "the product of A1 and A2's common-mode rejection ratios",
    "the square root of the product of A1 and A2's common-mode rejection ratios",
    "the size of A2's common-mode rejection ratio",
    "the size of A1's common-mode rejection ratio",
    "The difference in the common-mode rejection ratio of A1 and A2 themselves",
    "input resistance"
  ],
  "target": "I",
  "id": 0,
  "group_id": 0,
  "subset_key": "Electronic Science and Technology",
  "metadata": {
    "field": "Electronic Science and Technology",
    "discipline": "Engineering",
    "uuid": "a8390c754538493ba59055689b4482aa",
    "explanation": "The difference in the common-mode rejection ratio of A1 and A2 themselves"
  }
}

Note: Some content was truncated for display.

Prompt Template#

Prompt Template:

Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets super_gpqa \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['super_gpqa'],
    dataset_args={
        'super_gpqa': {
            # subset_list: ['Electronic Science and Technology', 'Philosophy', 'Traditional Chinese Medicine']  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)