ProcessBench#

Overview#

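ProcessBench is a benchmark for evaluating how well AI models verify mathematical reasoning processes. Given a step-by-step solution, the model must locate the earliest erroneous step or confirm that the solution is correct. Problems are drawn from four sources of increasing difficulty, from grade-school level (GSM8K) to olympiad level (OlympiadBench, OmniMath).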

Task Description#

  • Task Type: Mathematical Reasoning Error Detection

  • Input: Math problem + step-by-step solution (tagged paragraphs)

  • Output: Zero-based index of the first paragraph that contains an error (or -1 if the solution has no error)

  • Domains: Math reasoning verification, error detection

Key Features#

  • Four difficulty subsets:

    • gsm8k: Grade school math problems

    • math: Competition math problems

    • olympiadbench: Olympiad-level problems

    • omnimath: Advanced mathematical reasoning

  • Tests process supervision and verification abilities

  • Requires analyzing step-by-step reasoning for errors

Evaluation Notes#

  • Default evaluation uses the test split

  • Multiple metrics tracked:

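    • error_acc: Accuracy on samples that contain an error (the predicted index must match the earliest erroneous paragraph)

    • correct_acc: Accuracy on samples whose solution is fully correct (the expected answer is -1)

    • simple_f1_score: Harmonic mean of error_acc and correct_acc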

  • Answers should be in \boxed{} format (paragraph index or -1)

  • Aggregation method: F1 score
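The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not EvalScope's implementation: the helper names `extract_boxed_index` and `score` are hypothetical, taking the last `\boxed{}` in a response is an assumption, and the F1 is computed as the harmonic mean of the two accuracies.

```python
import re
from typing import List, Optional


def extract_boxed_index(response: str) -> Optional[int]:
    """Return the integer inside the last \\boxed{...}, or None if there is none."""
    matches = re.findall(r"\\boxed\{\s*(-?\d+)\s*\}", response)
    return int(matches[-1]) if matches else None


def score(predictions: List[Optional[int]], targets: List[int]) -> dict:
    """Compute the three metrics; a target of -1 marks a fully correct solution."""
    # Split the per-sample hits by whether the reference solution contains an error.
    err_hits = [p == t for p, t in zip(predictions, targets) if t != -1]
    cor_hits = [p == t for p, t in zip(predictions, targets) if t == -1]
    error_acc = sum(err_hits) / len(err_hits) if err_hits else 0.0
    correct_acc = sum(cor_hits) / len(cor_hits) if cor_hits else 0.0
    total = error_acc + correct_acc
    # Harmonic mean of the two accuracies (assumed definition of the F1 here).
    f1 = 2 * error_acc * correct_acc / total if total else 0.0
    return {"error_acc": error_acc, "correct_acc": correct_acc, "simple_f1_score": f1}


# Toy responses, invented for illustration only.
responses = [
    "Paragraph 1 miscounts the flamingos. \\boxed{1}",
    "The solution is correct: \\boxed{-1}",
    "I could not decide.",
    "First \\boxed{0}, on reflection \\boxed{2}",
]
targets = [1, -1, 3, -1]
predictions = [extract_boxed_index(r) for r in responses]
metrics = score(predictions, targets)
```

Because the F1 is a harmonic mean, a model that always answers -1 gets a perfect correct_acc but an error_acc of 0, and therefore an F1 of 0.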

Properties#

| Property | Value |
| --- | --- |
| Benchmark Name | process_bench |
| Dataset ID | Qwen/ProcessBench |
| Paper | N/A |
| Tags | Math, Reasoning |
| Metrics | error_acc, correct_acc, simple_f1_score |
| Default Shots | 0-shot |
| Evaluation Split | test |
| Aggregation | f1 |

Data Statistics#

| Metric | Value |
| --- | --- |
| Total Samples | 3,400 |
| Prompt Length (Mean) | 2764.83 chars |
| Prompt Length (Min/Max) | 690 / 9005 chars |

Per-Subset Statistics:

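| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
| --- | --- | --- | --- | --- |
| gsm8k | 400 | 1824.26 | 876 | 4520 |
| math | 1,000 | 2297.11 | 690 | 7565 |
| olympiadbench | 1,000 | 3166.77 | 1129 | 9005 |
| omnimath | 1,000 | 3206.82 | 832 | 8550 |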

Sample Example#

Subset: gsm8k

{
  "input": [
    {
      "id": "aca63163",
      "content": "The following is a math problem and a solution (split into paragraphs, enclosed with tags and indexed from 0):\n\n[Math Problem]\n\nSue lives in a fun neighborhood.  One weekend, the neighbors decided to play a prank on Sue.  On Friday morning, t ... [TRUNCATED] ... nce you identify an error in a paragraph, return the index of the paragraph where the earliest error occurs. Otherwise, return the index of -1 (which typically denotes \"not found\").\n\nPlease put your final answer (i.e., the index) in \\boxed{}.\n"
    }
  ],
  "target": "1",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "steps": [
      "To find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.",
      "On Saturday, they take back one third of the flamingos. Since there were 18 flamingos, \\(1/3 \\times 18 = 6\\) flamingos are taken back. So, they have \\(18 - 6 = 12\\) flamingos left in their possession. Then, they paint these 6 flamingos white and put them back out on Sue's front yard. Now, Sue has the original 12 pink flamingos plus the 6 new white ones. Thus, by the end of Saturday, Sue has \\(12 + 6 = 18\\) pink flamingos and 6 white flamingos.",
      "On Sunday, the neighbors add another 18 pink plastic flamingos to Sue's front yard. By the end of Sunday morning, Sue has \\(18 + 18 = 36\\) pink flamingos and still 6 white flamingos.",
      "To find the difference, subtract the number of white flamingos from the number of pink flamingos: \\(36 - 6 = 30\\). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. The answer is \\(\\boxed{30}\\)."
    ],
    "tagged_response": "<paragraph_0>\nTo find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.\n</paragrap ... [TRUNCATED] ...  subtract the number of white flamingos from the number of pink flamingos: \\(36 - 6 = 30\\). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. The answer is \\(\\boxed{30}\\).\n</paragraph_3>",
    "final_answer_correct": false
  }
}

Note: Some content was truncated for display.
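The relationship between `metadata.steps`, `tagged_response`, and `target` in the record above can be sketched as follows. This is an illustration under stated assumptions: `tag_paragraphs` is a hypothetical helper, the toy steps are invented, and the separator between tagged paragraphs is a guess because the sample's `tagged_response` is truncated at that point.

```python
from typing import List, Optional


def tag_paragraphs(steps: List[str], sep: str = "\n\n") -> str:
    """Wrap each step in <paragraph_i> tags, indexed from 0.

    The per-paragraph layout follows the sample record; `sep` is an
    assumption, since the sample is truncated between paragraphs.
    """
    return sep.join(
        f"<paragraph_{i}>\n{step}\n</paragraph_{i}>" for i, step in enumerate(steps)
    )


# Toy steps, invented for illustration only.
steps = ["First step.", "Second step with a mistake.", "Third step."]
target = "1"  # the sample record stores the target as a string

tagged = tag_paragraphs(steps)

# Convert the target to an index; -1 means no paragraph is flagged.
first_error = int(target)
flagged: Optional[str] = steps[first_error] if first_error != -1 else None
```

In the gsm8k sample, the target "1" points at the second entry of `metadata.steps`, the Saturday paragraph.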

Prompt Template#


The following is a math problem and a solution (split into paragraphs, enclosed with tags and indexed from 0):

[Math Problem]

{problem}

[Solution]

{tagged_response}

Your task is to review and critique the solution paragraph by paragraph. Once you identify an error in a paragraph, return the index of the paragraph where the earliest error occurs. Otherwise, return the index of -1 (which typically denotes "not found").

Please put your final answer (i.e., the index) in \boxed{{}}.
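The doubled braces on the last line of the template are format-string escapes: when the template is filled with Python's `str.format`, they collapse to a single literal pair. The sketch below shows this with a toy problem. The constant name `PROMPT_TEMPLATE`, the example inputs, and the exact blank-line layout are assumptions based on the rendered template, not code taken from EvalScope.

```python
# Template text copied from the section above; whitespace between blocks is assumed.
PROMPT_TEMPLATE = (
    "The following is a math problem and a solution (split into paragraphs, "
    "enclosed with tags and indexed from 0):\n\n"
    "[Math Problem]\n\n"
    "{problem}\n\n"
    "[Solution]\n\n"
    "{tagged_response}\n\n"
    "Your task is to review and critique the solution paragraph by paragraph. "
    "Once you identify an error in a paragraph, return the index of the paragraph "
    "where the earliest error occurs. Otherwise, return the index of -1 "
    '(which typically denotes "not found").\n\n'
    "Please put your final answer (i.e., the index) in \\boxed{{}}.\n"
)

# Toy inputs, invented for illustration only.
prompt = PROMPT_TEMPLATE.format(
    problem="What is 1 + 1?",
    tagged_response="<paragraph_0>\n1 + 1 = 3.\n</paragraph_0>",
)
```

After formatting, the prompt ends with `\boxed{}.` followed by a newline, which matches the tail of the `content` field in the sample record.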

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets process_bench \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['process_bench'],
    dataset_args={
        'process_bench': {
            # 'subset_list': ['gsm8k', 'math', 'olympiadbench'],  # optional: evaluate specific subsets only
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)