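MMLU-Redux
Overview
MMLU-Redux is a revised version of the MMLU benchmark that corrects known errors in the original dataset: incorrect ground-truth labels, questions with no correct option, and ambiguous questions.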
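Task Description
Task type: multiple-choice knowledge evaluation
Input: a question with four options
Output: the letter of the correct answer (A/B/C/D)
Domains: 57 subjects spanning STEM, humanities, social sciences, and other fields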
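Key Features
Corrects errors in the original MMLU benchmark
Types of errors fixed include:
no_correct_answer: questions with no correct option
wrong_groundtruth: questions whose ground-truth answer is wrong
multiple_correct_answers: questions with ambiguous answers (more than one correct option)
Keeps the same 57-subject coverage as the original MMLU
Compatible with the MMLU evaluation framework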
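Evaluation Notes
Evaluation uses the test split by default
Primary metric: Accuracy (with an inclusion flag for multi-answer questions)
Uses Chain-of-Thought (CoT) prompting
Only zero-shot evaluation is supported (few-shot is not)
Results are aggregated by subject and by category (STEM, humanities, social sciences, other)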
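The aggregation described above can be sketched as follows. This is a minimal illustration, not EvalScope's implementation: the function names, the `(subject, predicted, targets)` tuple layout, and the subject-to-category mapping are all hypothetical, and the "inclusion" rule is assumed to mean that a prediction counts as correct when it is among the accepted answer letters.

```python
from collections import defaultdict


def is_correct(predicted, targets):
    """Assumed inclusion rule: correct if the letter is among the accepted targets."""
    return predicted in targets


def aggregate_accuracy(results, subject_to_category):
    """Return accuracy keyed by ("subject", name) and ("category", name).

    results: iterable of (subject, predicted_letter, target_letters) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subject, predicted, targets in results:
        category = subject_to_category.get(subject, "other")
        for key in (("subject", subject), ("category", category)):
            totals[key] += 1
            hits[key] += is_correct(predicted, targets)
    return {key: hits[key] / totals[key] for key in totals}


# Hypothetical mapping and predictions, for illustration only.
subject_to_category = {
    "abstract_algebra": "STEM",
    "astronomy": "STEM",
    "philosophy": "humanities",
}
results = [
    ("abstract_algebra", "A", ["A"]),
    ("abstract_algebra", "B", ["A"]),
    ("astronomy", "C", ["C", "D"]),  # multi-answer question: either letter is accepted
    ("philosophy", "D", ["B"]),
]
scores = aggregate_accuracy(results, subject_to_category)
```

Keying the result by `(level, name)` keeps subject-level and category-level scores in one dictionary without name collisions.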
Properties
| Property | Value |
|---|---|
| Benchmark name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default shots | 0-shot |
| Evaluation split | |
Data Statistics
| Metric | Value |
|---|---|
| Total samples | 5,700 |
| Prompt length (mean) | 600.81 characters |
| Prompt length (min / max) | 255 / 5082 characters |
Per-subset statistics:
| Subset | Samples | Mean prompt length | Min prompt length | Max prompt length |
|---|---|---|---|---|
| | 100 | 399.13 | 285 | 525 |
| | 100 | 454.09 | 323 | 788 |
| | 100 | 516.53 | 297 | 1087 |
| | 100 | 538.35 | 274 | 927 |
| | 100 | 453.01 | 303 | 757 |
| | 100 | 550.59 | 343 | 1081 |
| | 100 | 451.35 | 292 | 890 |
| | 100 | 631.74 | 336 | 1344 |
| | 100 | 451.77 | 255 | 709 |
| | 100 | 669.42 | 308 | 5082 |
| | 100 | 500.71 | 317 | 797 |
| | 100 | 476.01 | 291 | 1115 |
| | 100 | 372.72 | 284 | 539 |
| | 100 | 610.01 | 304 | 1036 |
| | 100 | 373.01 | 286 | 585 |
| | 100 | 403.98 | 264 | 797 |
| | 100 | 598.04 | 318 | 1275 |
| | 100 | 389.82 | 303 | 746 |
| | 100 | 574.54 | 328 | 1078 |
| | 100 | 496.56 | 271 | 1093 |
| | 100 | 649.34 | 278 | 1786 |
| | 100 | 1840.88 | 855 | 3045 |
| | 100 | 417.76 | 306 | 616 |
| | 100 | 539.42 | 384 | 864 |
| | 100 | 509.35 | 319 | 756 |
| | 100 | 409.45 | 282 | 846 |
| | 100 | 508.99 | 325 | 1070 |
| | 100 | 600.82 | 324 | 1414 |
| | 100 | 476.49 | 309 | 1055 |
| | 100 | 736.04 | 376 | 1772 |
| | 100 | 1643.04 | 782 | 2595 |
| | 100 | 1749.59 | 749 | 3834 |
| | 100 | 416.87 | 323 | 689 |
| | 100 | 457.5 | 307 | 1145 |
| | 100 | 636.61 | 332 | 986 |
| | 100 | 518.24 | 307 | 1039 |
| | 100 | 516.62 | 333 | 902 |
| | 100 | 513.09 | 315 | 806 |
| | 100 | 399.38 | 294 | 645 |
| | 100 | 474.85 | 335 | 757 |
| | 100 | 414.32 | 292 | 658 |
| | 100 | 388.56 | 277 | 1268 |
| | 100 | 534.28 | 325 | 872 |
| | 100 | 620.62 | 563 | 767 |
| | 100 | 505.6 | 318 | 926 |
| | 100 | 462.44 | 319 | 1155 |
| | 100 | 468.62 | 320 | 943 |
| | 100 | 650.46 | 341 | 1226 |
| | 100 | 1370.21 | 359 | 2928 |
| | 100 | 1003.93 | 610 | 1735 |
| | 100 | 577.98 | 317 | 1502 |
| | 100 | 472.39 | 300 | 1188 |
| | 100 | 1029.35 | 317 | 2066 |
| | 100 | 530.23 | 335 | 834 |
| | 100 | 490.28 | 305 | 754 |
| | 100 | 447.81 | 302 | 1383 |
| | 100 | 353.31 | 287 | 557 |
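Statistics of this kind can be reproduced from a list of rendered prompts. The sketch below is an assumption about how such numbers might be computed, not the tool's own code: `prompt_length_stats` is a hypothetical helper, and "characters" is taken to mean Python's `len()` of the prompt string.

```python
from statistics import mean


def prompt_length_stats(prompts):
    """Summarise prompt lengths, measured in characters via len()."""
    lengths = [len(p) for p in prompts]
    return {
        "count": len(lengths),
        "mean": round(mean(lengths), 2),
        "min": min(lengths),
        "max": max(lengths),
    }


# Toy prompts, for illustration only.
stats = prompt_length_stats(["abcd", "ab", "abcdef"])
```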
Sample Example
Subset: abstract_algebra
{
  "input": [
    {
      "id": "f937b8b4",
      "content": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C,D. Think step by step before answering.\n\nStatement 1 | If T: V -> W is a linear transformation and dim(V ) < dim(W) < 1, then T must be injective. Statement 2 | Let dim(V) = n and suppose that T: V -> V is linear. If T is injective, then it is a bijection.\n\nA) True, True\nB) False, False\nC) True, False\nD) False, True"
    }
  ],
  "choices": [
    "True, True",
    "False, False",
    "True, False",
    "False, True"
  ],
  "target": [
    "A"
  ],
  "id": 0,
  "group_id": 0,
  "metadata": {
    "error_type": "bad_question_clarity",
    "correct_answer": "0",
    "potential_reason": "Statement 2 is true and well defined. \r\nHowever, statement 1 is not well defined: The dimension of a vector space is a nonnegative number, and since dim(V) < dim(W) < 1, this means dim(V) has to be negative. Taking this statement literally, the implication is vacuously true as the premise cannot be satisfed, but I doubt that was what the question is trying to test. "
  }
}
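Because each record carries an `error_type` under `metadata`, as in the sample above, records can be tallied or filtered by annotation. The following is a minimal sketch under that assumption; `count_error_types` is a hypothetical helper and the three records are made up for illustration, not taken from the dataset.

```python
from collections import Counter


def count_error_types(records):
    """Tally records by metadata.error_type, using "unknown" when the field is absent."""
    return Counter(
        record.get("metadata", {}).get("error_type", "unknown") for record in records
    )


# Made-up records that mirror only the fields used here.
records = [
    {"target": ["A"], "metadata": {"error_type": "bad_question_clarity"}},
    {"target": ["C"], "metadata": {"error_type": "wrong_groundtruth"}},
    {"target": ["B"], "metadata": {"error_type": "wrong_groundtruth"}},
]
counts = count_error_types(records)
```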
Prompt Template
Prompt template:
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.
{question}
{choices}
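To make the template concrete, here is a minimal sketch of filling it in and reading the final `ANSWER:` line back out of a model response. The helper names `build_prompt` and `extract_answer` are hypothetical, the blank lines between the template parts and the `A) ...` choice layout are inferred from the sample record, and the regular expression is one plausible way to parse the answer rather than the framework's own parser.

```python
import re

PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. The last line of your response "
    "should be of the following format: 'ANSWER: [LETTER]' (without quotes) where "
    "[LETTER] is one of {letters}. Think step by step before answering.\n\n"
    "{question}\n\n{choices}"
)


def build_prompt(question, choices):
    """Fill the template, labelling the choices A), B), C), ... in order."""
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    return PROMPT_TEMPLATE.format(
        letters=",".join(letters),
        question=question,
        choices="\n".join(f"{letter}) {text}" for letter, text in zip(letters, choices)),
    )


def extract_answer(response):
    """Return the letter from the last 'ANSWER: X' occurrence, or None if absent."""
    matches = re.findall(r"ANSWER:\s*([A-D])\b", response)
    return matches[-1] if matches else None


prompt = build_prompt("Q?", ["w", "x", "y", "z"])
answer = extract_answer("Some reasoning first.\nANSWER: D")
```

Taking the last match rather than the first tolerates chain-of-thought text that mentions `ANSWER:` before the final line.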
Usage
Using the CLI
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets mmlu_redux \
    --limit 10  # remove this line for a full evaluation
Using Python
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mmlu_redux'],
    dataset_args={
        'mmlu_redux': {
            # 'subset_list': ['abstract_algebra', 'anatomy', 'astronomy'],  # optional: evaluate only specific subsets
        }
    },
    limit=10,  # remove this line for a full evaluation
)

run_task(task_cfg=task_cfg)