BBH#
Overview#
BBH (BIG-Bench Hard) is a suite of 23 challenging tasks from the BIG-Bench benchmark, selected because earlier language models struggled with them. Two of the tasks come in three variants each, so the suite is evaluated here as 27 subsets. The tasks require multi-step reasoning that benefits from Chain-of-Thought (CoT) prompting.
Task Description#
Task Type: Mixed (Multiple-Choice and Free-Form)
Input: Task-specific questions requiring reasoning
Output: Answers in specified format
Subsets: 27 reasoning tasks divided into multiple-choice (17) and free-form (10)
Key Features#
27 challenging reasoning tasks from BIG-Bench
Multiple-choice tasks: temporal sequences, disambiguation, logical deduction, etc.
Free-form tasks: arithmetic, navigation, boolean expressions, etc.
Each task comes with curated Chain-of-Thought examples
Designed to test advanced reasoning capabilities
Evaluation Notes#
Default configuration uses 3-shot with CoT prompting (recommended)
CoT prompts are pre-defined for each subset in the cot_prompts/ directory
Answers should follow the format: “So the answer is [ANSWER]”
Setting few_shot_num=0 disables few-shot examples
Multiple-choice answers are normalized to single letters (A, B, C, etc.)
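The answer format and letter normalization described above can be illustrated with a minimal sketch. The helper names `extract_answer` and `normalize_choice` are hypothetical and the regular expressions are assumptions; this is not EvalScope's actual parser.

```python
import re
from typing import Optional

# Hypothetical pattern: capture the rest of the line after the answer phrase.
ANSWER_PATTERN = re.compile(r"So the answer is\s*(.+)", re.IGNORECASE)


def extract_answer(response: str) -> Optional[str]:
    """Return the text after the last 'So the answer is', or None if absent."""
    matches = ANSWER_PATTERN.findall(response)
    if not matches:
        return None
    # Use the last occurrence and drop surrounding whitespace and a final period.
    return matches[-1].strip().rstrip(".").strip()


def normalize_choice(answer: str) -> str:
    """Map forms such as '(A)', 'A)' or 'a' to a single uppercase letter."""
    match = re.fullmatch(r"\(?([A-Za-z])\)?", answer.strip())
    # Free-form answers (for example 'True' or '42') are returned unchanged.
    return match.group(1).upper() if match else answer.strip()


response = "Emily was free from 6pm to 9pm. So the answer is (A)."
print(normalize_choice(extract_answer(response)))
```

Taking the last match rather than the first is a design choice: a CoT response may restate the phrase while reasoning, and the final occurrence is usually the committed answer.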
Properties#
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 3-shot |
| Evaluation Split | |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 6,511 |
| Prompt Length (Mean) | 3307.29 chars |
| Prompt Length (Min/Max) | 1060 / 7885 chars |
Per-Subset Statistics:
| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|---|---|---|---|---|
| | 250 | 3746.18 | 3646 | 3876 |
| | 250 | 4047.48 | 3993 | 4099 |
| | 250 | 1550.66 | 1491 | 1641 |
| | 250 | 3257.42 | 3195 | 3316 |
| | 146 | 3030.88 | 2922 | 3201 |
| | 250 | 5270.24 | 5201 | 5384 |
| | 178 | 3493.68 | 3339 | 3693 |
| | 250 | 3832.01 | 3781 | 3948 |
| | 250 | 3598.1 | 3506 | 3682 |
| | 250 | 3419.36 | 3338 | 3489 |
| | 250 | 3093.32 | 3014 | 3165 |
| | 250 | 3433.3 | 3386 | 3486 |
| | 250 | 3264.38 | 3118 | 3379 |
| | 250 | 3434.09 | 3217 | 3633 |
| | 250 | 2489.85 | 2436 | 2613 |
| | 250 | 7401.64 | 7223 | 7885 |
| | 250 | 2818.32 | 2572 | 3102 |
| | 250 | 2596.98 | 2594 | 2600 |
| | 250 | 2508.7 | 2452 | 2626 |
| | 250 | 2723.8 | 2680 | 2874 |
| | 250 | 2481.34 | 2397 | 2569 |
| | 250 | 1077.42 | 1060 | 1122 |
| | 250 | 1991.7 | 1980 | 1998 |
| | 250 | 1706.66 | 1647 | 1787 |
| | 250 | 5185.5 | 4918 | 5514 |
| | 187 | 4877.42 | 4194 | 6311 |
| | 250 | 3300.84 | 3267 | 3340 |
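As a quick consistency check, the per-subset sample counts can be summed and compared with the reported total. The sketch below hard-codes the counts from the table above (24 subsets with 250 samples, plus three smaller ones); the variable names are illustrative only.

```python
# Sample counts copied from the per-subset table: 24 subsets of 250,
# plus three smaller subsets with 146, 178 and 187 samples.
subset_sizes = [250] * 24 + [146, 178, 187]

num_subsets = len(subset_sizes)
total_samples = sum(subset_sizes)

print(num_subsets)    # 27
print(total_samples)  # 6511
```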
Sample Example#
Subset: temporal_sequences
{
  "input": [
    {
      "id": "7d1767c8",
      "content": "Task description: Answer questions about which times certain events could have occurred.\n\nQ: Today, Emily went to the museum. Between what times could they have gone?\nWe know that:\nEmily woke up at 1pm.\nElizabeth saw Emily reading at the libr ... [TRUNCATED] ... \nOptions:\n(A) 6pm to 9pm\n(B) 7am to 11am\n(C) 1pm to 2pm\n(D) 2pm to 6pm\nA: Let's think step by step. Put your final answer in the format of \"So the answer is [ANSWER]\" (without quotes and markdown) where [ANSWER] is the answer to the problem.\n"
    }
  ],
  "target": "A",
  "id": 0,
  "group_id": 0,
  "subset_key": "temporal_sequences",
  "metadata": {
    "task_type": "multiple_choice"
  }
}
Note: Some content was truncated for display.
Prompt Template#
Prompt Template:
Q: {question}
A: Let's think step by step. Put your final answer in the format of "So the answer is [ANSWER]" (without quotes and markdown) where [ANSWER] is the answer to the problem.
Few-shot Template
{fewshot}
Q: {question}
A: Let's think step by step. Put your final answer in the format of "So the answer is [ANSWER]" (without quotes and markdown) where [ANSWER] is the answer to the problem.
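To make the two templates concrete, here is a minimal sketch of how they could be filled in. The function `build_prompt` is a hypothetical helper, and joining the few-shot block to the question with a single newline is an assumption; the actual framework may separate them differently.

```python
# Template text taken from the section above.
PROMPT_TEMPLATE = (
    "Q: {question}\n"
    "A: Let's think step by step. Put your final answer in the format of "
    "\"So the answer is [ANSWER]\" (without quotes and markdown) "
    "where [ANSWER] is the answer to the problem."
)

# Assumption: the few-shot examples are simply prepended to the prompt.
FEWSHOT_TEMPLATE = "{fewshot}\n" + PROMPT_TEMPLATE


def build_prompt(question: str, fewshot: str = "") -> str:
    """Fill the zero-shot template, or the few-shot one when examples are given."""
    if fewshot:
        return FEWSHOT_TEMPLATE.format(fewshot=fewshot, question=question)
    return PROMPT_TEMPLATE.format(question=question)


cot_examples = "Q: not ( True ) and ( True ) is\nA: So the answer is False."
print(build_prompt("True and not False is", fewshot=cot_examples))
```

Passing an empty `fewshot` string yields the zero-shot prompt, which mirrors the effect described for few_shot_num=0.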
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets bbh \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['bbh'],
    dataset_args={
        'bbh': {
            # 'subset_list': ['temporal_sequences', 'disambiguation_qa', 'date_understanding'],  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
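For a zero-shot run on selected subsets, the dataset arguments might look like the sketch below. It assumes that few_shot_num is accepted under the 'bbh' entry of dataset_args in the same way as subset_list; check the framework's configuration reference before relying on this.

```python
# Assumption: 'few_shot_num' sits next to 'subset_list' under dataset_args['bbh'].
dataset_args = {
    'bbh': {
        # Evaluate only two of the subsets named in the example above.
        'subset_list': ['temporal_sequences', 'disambiguation_qa'],
        # Disable few-shot examples, as described in the evaluation notes.
        'few_shot_num': 0,
    }
}

# This dictionary would then be passed as TaskConfig(..., dataset_args=dataset_args).
print(dataset_args['bbh'])
```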