BBH

Overview

BBH (BIG-Bench Hard) is a subset of 23 challenging tasks from the BIG-Bench benchmark, selected because earlier language models did not outperform the average human rater on them. The tasks require multi-step reasoning and benefit from Chain-of-Thought (CoT) prompting.

Task Description

  • Task Type: Mixed (Multiple-Choice and Free-Form)

  • Input: Task-specific questions requiring reasoning

  • Output: Answers in specified format

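  • Subsets: 27 reasoning subsets, divided into multiple-choice (17) and free-form (10). The 23 tasks become 27 subsets because logical_deduction and tracking_shuffled_objects are each split into three-, five-, and seven-object variants.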

Key Features

  • 27 challenging reasoning subsets covering 23 BIG-Bench tasks

  • Multiple-choice tasks: temporal sequences, disambiguation, logical deduction, etc.

  • Free-form tasks: arithmetic, navigation, boolean expressions, etc.

  • Each task comes with curated Chain-of-Thought examples

  • Designed to test advanced reasoning capabilities

Evaluation Notes

  • The default configuration uses 3-shot CoT prompting (recommended)

  • CoT prompts are pre-defined for each subset in the cot_prompts/ directory

  • Answers should follow the format: “So the answer is [ANSWER]”

  • Setting few_shot_num=0 disables few-shot examples

  • Multiple-choice answers are normalized to single letters (A, B, C, etc.)
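The answer-format and normalization notes above can be illustrated with a minimal sketch. The helper names `extract_answer` and `normalize_choice` are hypothetical and the regular expressions are assumptions made for illustration; this is not EvalScope's actual extraction code.

```python
import re
from typing import Optional

# Assumed pattern: capture the rest of the line after "So the answer is".
_ANSWER_PATTERN = re.compile(r"so the answer is\s*(.*)", re.IGNORECASE)


def extract_answer(response: str) -> Optional[str]:
    """Return the text after the last 'So the answer is', or None if absent."""
    matches = _ANSWER_PATTERN.findall(response)
    if not matches:
        return None
    # Use the last occurrence and drop a trailing period and whitespace.
    return matches[-1].strip().rstrip(".").strip()


def normalize_choice(answer: str) -> str:
    """Map '(a)', 'A', or '(A)' to 'A'; leave any other answer unchanged."""
    match = re.fullmatch(r"\(?([A-Za-z])\)?", answer.strip())
    return match.group(1).upper() if match else answer.strip()


if __name__ == "__main__":
    reply = "Emily was free only in the evening. So the answer is (A)."
    print(normalize_choice(extract_answer(reply)))
```

Normalization of this kind only makes sense for the multiple-choice subsets; free-form answers such as numbers or word lists would be compared after extraction alone.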

Properties

| Property | Value |
|---|---|
| Benchmark Name | bbh |
| Dataset ID | evalscope/bbh |
| Paper | N/A |
| Tags | Reasoning |
| Metrics | acc |
| Default Shots | 3-shot |
| Evaluation Split | test |

Data Statistics

| Metric | Value |
|---|---|
| Total Samples | 6,511 |
| Prompt Length (Mean) | 3307.29 chars |
| Prompt Length (Min/Max) | 1060 / 7885 chars |

Per-Subset Statistics:

| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
|---|---|---|---|---|
| temporal_sequences | 250 | 3746.18 | 3646 | 3876 |
| disambiguation_qa | 250 | 4047.48 | 3993 | 4099 |
| date_understanding | 250 | 1550.66 | 1491 | 1641 |
| tracking_shuffled_objects_three_objects | 250 | 3257.42 | 3195 | 3316 |
| penguins_in_a_table | 146 | 3030.88 | 2922 | 3201 |
| geometric_shapes | 250 | 5270.24 | 5201 | 5384 |
| snarks | 178 | 3493.68 | 3339 | 3693 |
| ruin_names | 250 | 3832.01 | 3781 | 3948 |
| tracking_shuffled_objects_seven_objects | 250 | 3598.1 | 3506 | 3682 |
| tracking_shuffled_objects_five_objects | 250 | 3419.36 | 3338 | 3489 |
| logical_deduction_three_objects | 250 | 3093.32 | 3014 | 3165 |
| hyperbaton | 250 | 3433.3 | 3386 | 3486 |
| logical_deduction_five_objects | 250 | 3264.38 | 3118 | 3379 |
| logical_deduction_seven_objects | 250 | 3434.09 | 3217 | 3633 |
| movie_recommendation | 250 | 2489.85 | 2436 | 2613 |
| salient_translation_error_detection | 250 | 7401.64 | 7223 | 7885 |
| reasoning_about_colored_objects | 250 | 2818.32 | 2572 | 3102 |
| multistep_arithmetic_two | 250 | 2596.98 | 2594 | 2600 |
| navigate | 250 | 2508.7 | 2452 | 2626 |
| dyck_languages | 250 | 2723.8 | 2680 | 2874 |
| word_sorting | 250 | 2481.34 | 2397 | 2569 |
| sports_understanding | 250 | 1077.42 | 1060 | 1122 |
| boolean_expressions | 250 | 1991.7 | 1980 | 1998 |
| object_counting | 250 | 1706.66 | 1647 | 1787 |
| formal_fallacies | 250 | 5185.5 | 4918 | 5514 |
| causal_judgement | 187 | 4877.42 | 4194 | 6311 |
| web_of_lies | 250 | 3300.84 | 3267 | 3340 |
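As a quick cross-check, the per-subset sample counts add up to the reported total of 6,511. The short sketch below redoes that arithmetic using only numbers from the statistics above; the variable names are chosen for illustration.

```python
# Subsets whose sample count differs from the usual 250.
smaller_subsets = {
    "penguins_in_a_table": 146,
    "snarks": 178,
    "causal_judgement": 187,
}

NUM_SUBSETS = 27      # rows in the per-subset statistics
DEFAULT_SAMPLES = 250  # sample count of every other subset

# 24 subsets of 250 samples plus the three smaller ones.
total_samples = (
    (NUM_SUBSETS - len(smaller_subsets)) * DEFAULT_SAMPLES
    + sum(smaller_subsets.values())
)
print(total_samples)
```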

Sample Example

Subset: temporal_sequences

{
  "input": [
    {
      "id": "7d1767c8",
      "content": "Task description: Answer questions about which times certain events could have occurred.\n\nQ: Today, Emily went to the museum. Between what times could they have gone?\nWe know that:\nEmily woke up at 1pm.\nElizabeth saw Emily reading at the libr ... [TRUNCATED] ... \nOptions:\n(A) 6pm to 9pm\n(B) 7am to 11am\n(C) 1pm to 2pm\n(D) 2pm to 6pm\nA: Let's think step by step. Put your final answer in the format of \"So the answer is [ANSWER]\" (without quotes and markdown) where [ANSWER] is the answer to the problem.\n"
    }
  ],
  "target": "A",
  "id": 0,
  "group_id": 0,
  "subset_key": "temporal_sequences",
  "metadata": {
    "task_type": "multiple_choice"
  }
}

Note: Some content was truncated for display.
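Records shaped like the sample can be scored with a plain accuracy computation grouped by `subset_key`. The sketch below is an illustration only: the `prediction` field is a hypothetical addition holding an already extracted model answer, the records are toy data rather than real results, and EvalScope's own aggregation of the acc metric may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def accuracy_by_subset(records: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Exact-match accuracy per subset; each record needs
    'subset_key', 'target', and a (hypothetical) 'prediction' field."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for record in records:
        key = record["subset_key"]
        total[key] += 1
        correct[key] += int(record["prediction"] == record["target"])
    return {key: correct[key] / total[key] for key in total}


if __name__ == "__main__":
    # Toy records for illustration, not benchmark output.
    toy_records = [
        {"subset_key": "temporal_sequences", "target": "A", "prediction": "A"},
        {"subset_key": "temporal_sequences", "target": "C", "prediction": "B"},
        {"subset_key": "navigate", "target": "Yes", "prediction": "Yes"},
    ]
    print(accuracy_by_subset(toy_records))
```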

Prompt Template

Prompt Template:

Q: {question}
A: Let's think step by step. Put your final answer in the format of "So the answer is [ANSWER]" (without quotes and markdown) where [ANSWER] is the answer to the problem.

Few-shot Template:

{fewshot}

Q: {question}
A: Let's think step by step. Put your final answer in the format of "So the answer is [ANSWER]" (without quotes and markdown) where [ANSWER] is the answer to the problem.
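To make the two templates concrete, the sketch below fills them with Python's `str.format`. The function name `build_prompt` and the way the few-shot text is passed in are assumptions for illustration; the template strings are copied from this page, but this is not EvalScope's own prompt-building code.

```python
INSTRUCTION = (
    "A: Let's think step by step. Put your final answer in the format of "
    '"So the answer is [ANSWER]" (without quotes and markdown) '
    "where [ANSWER] is the answer to the problem."
)

# Zero-shot template: just the question and the CoT instruction.
PROMPT_TEMPLATE = "Q: {question}\n" + INSTRUCTION

# Few-shot template: worked examples, a blank line, then the question.
FEWSHOT_TEMPLATE = "{fewshot}\n\n" + PROMPT_TEMPLATE


def build_prompt(question: str, fewshot: str = "") -> str:
    """Fill the few-shot template when examples are given, else the plain one."""
    if fewshot:
        return FEWSHOT_TEMPLATE.format(fewshot=fewshot, question=question)
    return PROMPT_TEMPLATE.format(question=question)


if __name__ == "__main__":
    print(build_prompt("not ( True ) and ( True ) is"))
```

Passing an empty `fewshot` string mirrors the effect of disabling few-shot examples: only the question and the instruction remain in the prompt.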

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets bbh \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['bbh'],
    dataset_args={
        'bbh': {
            # 'subset_list': ['temporal_sequences', 'disambiguation_qa', 'date_understanding'],  # optional: evaluate only specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
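The per-dataset options mentioned on this page can be collected in a plain dictionary before it is handed to `TaskConfig`. The sketch below builds only that dictionary. Placing `few_shot_num` next to `subset_list` under the `'bbh'` key is an assumption based on the evaluation notes and the commented example above, so check it against the EvalScope version in use.

```python
# Options for the bbh dataset only (keys taken from this page).
bbh_args = {
    # Evaluate two subsets instead of all 27.
    "subset_list": ["temporal_sequences", "navigate"],
    # Assumed placement: 0 disables few-shot examples (default is 3-shot).
    "few_shot_num": 0,
}

# Shape expected by the dataset_args parameter shown in the Python usage:
# one entry per dataset name.
dataset_args = {"bbh": bbh_args}

print(dataset_args)
```

The resulting dictionary would be passed as `dataset_args=dataset_args` in the `TaskConfig` call.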