Competition-MATH#
Overview#
Competition-MATH is a comprehensive benchmark of 12,500 challenging competition mathematics problems collected from AMC, AIME, and other prestigious math competitions. It is designed to evaluate the advanced mathematical reasoning capabilities of language models.
Task Description#
Task Type: Competition Mathematics Problem Solving
Input: Mathematical problem from competitions
Output: Step-by-step solution with final answer in \boxed{}
Difficulty Levels: Level 1 (easiest) to Level 5 (hardest)
Key Features#
12,500 problems from mathematics competitions
Five difficulty levels for comprehensive evaluation
Topics: Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Prealgebra, Precalculus
Each problem includes human-written solutions
Designed for evaluating advanced mathematical reasoning
Evaluation Notes#
Default configuration uses 4-shot examples with Chain-of-Thought prompting
Answers are extracted from
\boxed{}formatNumeric equivalence checking for answer comparison
Results can be analyzed by level and problem type
Uses math_parser for robust answer extraction
Properties#
Property |
Value |
|---|---|
Benchmark Name |
|
Dataset ID |
|
Paper |
N/A |
Tags |
|
Metrics |
|
Default Shots |
4-shot |
Evaluation Split |
|
Train Split |
|
Data Statistics#
Metric |
Value |
|---|---|
Total Samples |
5,000 |
Prompt Length (Mean) |
1019.03 chars |
Prompt Length (Min/Max) |
675 / 3991 chars |
Per-Subset Statistics:
Subset |
Samples |
Prompt Mean |
Prompt Min |
Prompt Max |
|---|---|---|---|---|
|
437 |
947.84 |
842 |
2757 |
|
894 |
816.87 |
682 |
2328 |
|
1,131 |
822.14 |
675 |
2470 |
|
1,214 |
949.04 |
768 |
2871 |
|
1,324 |
1411.41 |
1187 |
3991 |
Sample Example#
Subset: Level 1
{
"input": [
{
"id": "3848cf6a",
"content": "Here are some examples of how to solve similar problems:\n\nProblem:\nWhen Joyce counts the pennies in her bank by fives, she has one left over. When she counts them by threes, there are two left over. What is the least possible number of pennie ... [TRUNCATED] ... of 12, how many eggs will be left over if all cartons are sold?\nSolution:\n4\nProblem:\nHow many of the six integers 1 through 6 are divisors of the four-digit number 1452?\n\nPlease reason step by step, and put your final answer within \\boxed{}.\n"
}
],
"target": "5",
"id": 0,
"group_id": 0,
"subset_key": "Level 1",
"metadata": {
"reasoning": "All numbers are divisible by $1$. The last two digits, $52$, form a multiple of 4, so the number is divisible by $4$, and thus $2$. $1+4+5+2=12$, which is a multiple of $3$, so $1452$ is divisible by $3$. Since it is divisible by $2$ and $3$, it is divisible by $6$. But it is not divisible by $5$ as it does not end in $5$ or $0$. So the total is $\\boxed{5}$.",
"type": "Number Theory"
}
}
Note: Some content was truncated for display.
Prompt Template#
Prompt Template:
Problem:
{question}
Please reason step by step, and put your final answer within \boxed{{}}.
Few-shot Template
Here are some examples of how to solve similar problems:
{fewshot}
Problem:
{question}
Please reason step by step, and put your final answer within \boxed{{}}.
Usage#
Using CLI#
evalscope eval \
--model YOUR_MODEL \
--api-url OPENAI_API_COMPAT_URL \
--api-key EMPTY_TOKEN \
--datasets competition_math \
--limit 10 # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
model='YOUR_MODEL',
api_url='OPENAI_API_COMPAT_URL',
api_key='EMPTY_TOKEN',
datasets=['competition_math'],
dataset_args={
'competition_math': {
# subset_list: ['Level 1', 'Level 2', 'Level 3'] # optional, evaluate specific subsets
}
},
limit=10, # Remove this line for formal evaluation
)
run_task(task_cfg=task_cfg)