BBH#

概述#

BBH（BIG-Bench Hard）是从 BIG-Bench 基准测试中精选出的 23 项具有挑战性的任务子集，这些任务之所以被选中，是因为语言模型在最初尝试时表现不佳。这些任务需要复杂的推理能力，并且能从思维链（Chain-of-Thought, CoT）提示中显著受益。

任务描述#

任务类型：混合型（多项选择题和自由形式）
输入：需要推理的任务特定问题
输出：按指定格式给出的答案
子集：27 项推理任务，分为多项选择题（17 项）和自由形式（10 项）

主要特点#

来自 BIG-Bench 的 27 项具有挑战性的推理任务
多项选择题任务：时间序列、歧义消解、逻辑推理等
自由形式任务：算术、导航、布尔表达式等
每项任务均配有精心设计的思维链（CoT）示例
旨在评估高级推理能力

评估说明#

默认配置使用 3-shot 并配合 CoT 提示（推荐）
每个子集的 CoT 提示已预定义在 cot_prompts/ 目录中
答案应遵循格式："So the answer is [ANSWER]"
设置 few_shot_num=0 可禁用少样本示例
多项选择题答案将被标准化为单个字母（A、B、C 等）

属性#

属性	值
基准测试名称	`bbh`
数据集ID	evalscope/bbh
论文	N/A
标签	`Reasoning`
指标	`acc`
默认少样本数量	3-shot
评估划分	`test`

数据统计#

指标	值
总样本数	6,511
提示词长度（平均）	3307.29 字符
提示词长度（最小/最大）	1060 / 7885 字符

各子集统计数据：

子集	样本数	提示平均长度	提示最小长度	提示最大长度
`temporal_sequences`	250	3746.18	3646	3876
`disambiguation_qa`	250	4047.48	3993	4099
`date_understanding`	250	1550.66	1491	1641
`tracking_shuffled_objects_three_objects`	250	3257.42	3195	3316
`penguins_in_a_table`	146	3030.88	2922	3201
`geometric_shapes`	250	5270.24	5201	5384
`snarks`	178	3493.68	3339	3693
`ruin_names`	250	3832.01	3781	3948
`tracking_shuffled_objects_seven_objects`	250	3598.1	3506	3682
`tracking_shuffled_objects_five_objects`	250	3419.36	3338	3489
`logical_deduction_three_objects`	250	3093.32	3014	3165
`hyperbaton`	250	3433.3	3386	3486
`logical_deduction_five_objects`	250	3264.38	3118	3379
`logical_deduction_seven_objects`	250	3434.09	3217	3633
`movie_recommendation`	250	2489.85	2436	2613
`salient_translation_error_detection`	250	7401.64	7223	7885
`reasoning_about_colored_objects`	250	2818.32	2572	3102
`multistep_arithmetic_two`	250	2596.98	2594	2600
`navigate`	250	2508.7	2452	2626
`dyck_languages`	250	2723.8	2680	2874
`word_sorting`	250	2481.34	2397	2569
`sports_understanding`	250	1077.42	1060	1122
`boolean_expressions`	250	1991.7	1980	1998
`object_counting`	250	1706.66	1647	1787
`formal_fallacies`	250	5185.5	4918	5514
`causal_judgement`	187	4877.42	4194	6311
`web_of_lies`	250	3300.84	3267	3340

样例示例#

子集: temporal_sequences

{
  "input": [
    {
      "id": "7d1767c8",
      "content": "Task description: Answer questions about which times certain events could have occurred.\n\nQ: Today, Emily went to the museum. Between what times could they have gone?\nWe know that:\nEmily woke up at 1pm.\nElizabeth saw Emily reading at the libr ... [TRUNCATED] ... \nOptions:\n(A) 6pm to 9pm\n(B) 7am to 11am\n(C) 1pm to 2pm\n(D) 2pm to 6pm\nA: Let's think step by step. Put your final answer in the format of \"So the answer is [ANSWER]\" (without quotes and markdown) where [ANSWER] is the answer to the problem.\n"
    }
  ],
  "target": "A",
  "id": 0,
  "group_id": 0,
  "subset_key": "temporal_sequences",
  "metadata": {
    "task_type": "multiple_choice"
  }
}

注：部分内容因展示需要已被截断。

提示模板#

提示模板：

Q: {question}
A: Let's think step by step. Put your final answer in the format of "So the answer is [ANSWER]" (without quotes and markdown) where [ANSWER] is the answer to the problem.

少样本模板

{fewshot}

Q: {question}
A: Let's think step by step. Put your final answer in the format of "So the answer is [ANSWER]" (without quotes and markdown) where [ANSWER] is the answer to the problem.

使用方法#

使用命令行接口（CLI）#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets bbh \
    --limit 10  # 正式评估时请删除此行

使用 Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['bbh'],
    dataset_args={
        'bbh': {
            # subset_list: ['temporal_sequences', 'disambiguation_qa', 'date_understanding']  # 可选，用于评估特定子集
        }
    },
    limit=10,  # 正式评估时请删除此行
)

run_task(task_cfg=task_cfg)