Competition-MATH#

Overview#

Competition-MATH is a benchmark of 12,500 competition mathematics problems drawn from AMC, AIME, and other mathematics competitions. It is designed to evaluate the advanced mathematical reasoning of language models.

Task Description#

  • Task Type: Competition Mathematics Problem Solving

  • Input: Mathematical problem from competitions

  • Output: Step-by-step solution with final answer in \boxed{}

  • Difficulty Levels: Level 1 (easiest) to Level 5 (hardest)

Key Features#

  • 12,500 problems from mathematics competitions

  • Five difficulty levels for comprehensive evaluation

  • Topics: Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Prealgebra, Precalculus

  • Each problem includes human-written solutions

  • Designed for evaluating advanced mathematical reasoning

Evaluation Notes#

  • Default configuration uses 4-shot examples with Chain-of-Thought prompting

  • Answers are extracted from \boxed{} format

  • Answers are compared using numeric equivalence checking

  • Results can be analyzed by level and problem type

  • Uses math_parser for robust answer extraction
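The extraction and comparison steps listed above can be sketched as follows. This is a minimal illustration and not EvalScope's `math_parser`: the helper names `extract_boxed` and `numerically_equal` are hypothetical, and the comparison only understands plain numbers and simple fractions such as `3/4`, falling back to string equality for anything else.

```python
from fractions import Fraction
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


def numerically_equal(pred: str, target: str) -> bool:
    """Compare two answers as exact rationals; fall back to string equality."""
    try:
        return Fraction(pred.strip()) == Fraction(target.strip())
    except (ValueError, ZeroDivisionError):
        return pred.strip() == target.strip()


solution = "So the total is $\\boxed{5}$."
answer = extract_boxed(solution)
print(answer, numerically_equal(answer, "5"))
```

Tracking brace depth rather than using a simple regular expression keeps nested expressions such as `\boxed{\frac{1}{2}}` intact. Answers in LaTeX form would still need normalization before a numeric comparison, which this sketch does not attempt.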

Properties#

| Property | Value |
| --- | --- |
| Benchmark Name | competition_math |
| Dataset ID | evalscope/competition_math |
| Paper | N/A |
| Tags | Math, Reasoning |
| Metrics | acc |
| Default Shots | 4-shot |
| Evaluation Split | test |
| Train Split | train |

Data Statistics#

| Metric | Value |
| --- | --- |
| Total Samples | 5,000 |
| Prompt Length (Mean) | 1019.03 chars |
| Prompt Length (Min/Max) | 675 / 3991 chars |

Per-Subset Statistics:

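| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
| --- | --- | --- | --- | --- |
| Level 1 | 437 | 947.84 | 842 | 2757 |
| Level 2 | 894 | 816.87 | 682 | 2328 |
| Level 3 | 1,131 | 822.14 | 675 | 2470 |
| Level 4 | 1,214 | 949.04 | 768 | 2871 |
| Level 5 | 1,324 | 1411.41 | 1187 | 3991 |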
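As a consistency check, the per-subset figures can be recombined into the overall statistics: the sample counts should sum to the total, and the sample-weighted average of the subset means should reproduce the overall mean prompt length. The short script below does this arithmetic with the values copied from the per-subset table; the variable names are illustrative only.

```python
# (samples, prompt mean, prompt min, prompt max), copied from the per-subset table
subsets = {
    "Level 1": (437, 947.84, 842, 2757),
    "Level 2": (894, 816.87, 682, 2328),
    "Level 3": (1131, 822.14, 675, 2470),
    "Level 4": (1214, 949.04, 768, 2871),
    "Level 5": (1324, 1411.41, 1187, 3991),
}

total = sum(n for n, _, _, _ in subsets.values())
# Weight each subset mean by its sample count to recover the overall mean.
weighted_mean = sum(n * mean for n, mean, _, _ in subsets.values()) / total
overall_min = min(lo for _, _, lo, _ in subsets.values())
overall_max = max(hi for _, _, _, hi in subsets.values())

print(total, round(weighted_mean, 2), overall_min, overall_max)
# prints: 5000 1019.03 675 3991
```

The result agrees with the overall figures reported above: 5,000 samples, a mean of 1019.03 characters, and a range of 675 to 3991 characters.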

Sample Example#

Subset: Level 1

{
  "input": [
    {
      "id": "3848cf6a",
      "content": "Here are some examples of how to solve similar problems:\n\nProblem:\nWhen Joyce counts the pennies in her bank by fives, she has one left over. When she counts them by threes, there are two left over. What is the least possible number of pennie ... [TRUNCATED] ... of 12, how many eggs will be left over if all cartons are sold?\nSolution:\n4\nProblem:\nHow many of the six integers 1 through 6 are divisors of the four-digit number 1452?\n\nPlease reason step by step, and put your final answer within \\boxed{}.\n"
    }
  ],
  "target": "5",
  "id": 0,
  "group_id": 0,
  "subset_key": "Level 1",
  "metadata": {
    "reasoning": "All numbers are divisible by $1$. The last two digits, $52$, form a multiple of 4, so the number is divisible by $4$, and thus $2$. $1+4+5+2=12$, which is a multiple of $3$, so $1452$ is divisible by $3$.  Since it is divisible by $2$ and $3$, it is divisible by $6$.  But it is not divisible by $5$ as it does not end in $5$ or $0$.  So the total is $\\boxed{5}$.",
    "type": "Number Theory"
  }
}

Note: Some content was truncated for display.
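Because each sample carries a `subset_key` (the level) and a `metadata.type` (the topic), results can be broken down along either axis. The sketch below shows one way to do that grouping. The records and the `correct` flag are hypothetical stand-ins for scored outputs, and `accuracy_by` is an illustrative helper rather than part of EvalScope.

```python
from collections import defaultdict

# Hypothetical scored records. `subset_key` and `metadata.type` mirror the sample
# above; `correct` is an assumed field added for this illustration.
records = [
    {"subset_key": "Level 1", "metadata": {"type": "Number Theory"}, "correct": True},
    {"subset_key": "Level 1", "metadata": {"type": "Algebra"}, "correct": False},
    {"subset_key": "Level 5", "metadata": {"type": "Number Theory"}, "correct": True},
    {"subset_key": "Level 5", "metadata": {"type": "Geometry"}, "correct": False},
]


def accuracy_by(records, key_fn):
    """Group records with key_fn and return the accuracy of each group."""
    totals = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for record in records:
        bucket = totals[key_fn(record)]
        bucket[0] += int(record["correct"])
        bucket[1] += 1
    return {key: correct / count for key, (correct, count) in totals.items()}


by_level = accuracy_by(records, lambda r: r["subset_key"])
by_type = accuracy_by(records, lambda r: r["metadata"]["type"])
print(by_level)
print(by_type)
```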

Prompt Template#

Prompt Template:

Problem:
{question}

Please reason step by step, and put your final answer within \boxed{{}}.

Few-shot Template:

Here are some examples of how to solve similar problems:

{fewshot}
Problem:
{question}

Please reason step by step, and put your final answer within \boxed{{}}.
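The doubled braces in `\boxed{{}}` suggest the templates are filled with Python-style string formatting, where `{{}}` renders as a literal `{}`. The sketch below assumes that `str.format` is the mechanism and shows how both templates would render. The few-shot example text is made up for illustration, and how EvalScope actually joins multiple examples into `{fewshot}` is not specified here.

```python
PROMPT_TEMPLATE = (
    "Problem:\n"
    "{question}\n"
    "\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}.\n"
)

FEWSHOT_TEMPLATE = (
    "Here are some examples of how to solve similar problems:\n"
    "\n"
    "{fewshot}\n"
    + PROMPT_TEMPLATE
)

question = (
    "How many of the six integers 1 through 6 are divisors "
    "of the four-digit number 1452?"
)
# Hypothetical, pre-formatted few-shot block (a single made-up example).
fewshot = "Problem:\nWhat is 2 + 2?\nSolution:\n4"

zero_shot_prompt = PROMPT_TEMPLATE.format(question=question)
few_shot_prompt = FEWSHOT_TEMPLATE.format(fewshot=fewshot, question=question)

print(few_shot_prompt)
```

After formatting, the instruction line ends with `\boxed{}`, matching the tail of the prompt shown in the sample example.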

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets competition_math \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['competition_math'],
    dataset_args={
        'competition_math': {
            # 'subset_list': ['Level 1', 'Level 2', 'Level 3'],  # optional: evaluate only these subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)