MATH-500#
Overview#
MATH-500 is a curated subset of 500 problems from the MATH benchmark, designed to evaluate the mathematical reasoning capabilities of language models. It spans five difficulty levels and topics including algebra, geometry, number theory, probability, and precalculus.
Task Description#
Task Type: Mathematical Problem Solving
Input: Mathematical problem statement
Output: Step-by-step solution ending in a final answer (a number or a closed-form expression such as \sqrt{51})
Difficulty Levels: Level 1 (easiest) to Level 5 (hardest)
Key Features#
500 carefully selected problems from the full MATH dataset
Five difficulty levels for fine-grained evaluation
Problems cover algebra, geometry, number theory, probability, and more
Each problem includes a reference solution
Designed for efficient yet comprehensive math evaluation
Evaluation Notes#
Default configuration uses 0-shot evaluation
Answers should be formatted within \boxed{} for proper extraction
Numeric equivalence checking is used for answer comparison
Results can be broken down by difficulty level
Commonly used for math reasoning benchmarking due to manageable size
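The extraction and comparison steps noted above can be sketched as follows. This is a minimal illustration, not evalscope's actual implementation: `extract_boxed` and `naive_equiv` are hypothetical helper names, and the equivalence check is deliberately simple (float comparison when both sides parse as numbers, otherwise a whitespace-insensitive string match).

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # braces never balanced


def naive_equiv(pred: str, target: str) -> bool:
    """Hypothetical check: numeric comparison if possible, else string match."""
    try:
        return abs(float(pred) - float(target)) < 1e-9
    except ValueError:
        return pred.replace(" ", "") == target.replace(" ", "")
```

Brace matching is used instead of a simple regular expression because answers such as `\sqrt{51}` contain nested braces. A production checker would likely also normalize LaTeX (for example `\frac{1}{2}` versus `0.5`), which this sketch does not attempt.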
Properties#
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 500 |
| Prompt Length (Mean) | 266.89 chars |
| Prompt Length (Min/Max) | 91 / 1804 chars |
Per-Subset Statistics:
| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
|---|---|---|---|---|
| | 43 | 193.19 | 100 | 571 |
| | 90 | 218.82 | 91 | 802 |
| | 105 | 236 | 93 | 688 |
| | 128 | 277.1 | 93 | 1771 |
| | 134 | 337.28 | 118 | 1804 |
Sample Example#
Subset: Level 1
{
  "input": [
    {
      "id": "b5d90091",
      "content": "Suppose $\\sin D = 0.7$ in the diagram below. What is $DE$? [asy]\npair D,E,F;\nF = (0,0);\nD = (sqrt(51),7);\nE = (0,7);\ndraw(D--E--F--D);\ndraw(rightanglemark(D,E,F,15));\nlabel(\"$D$\",D,NE);\nlabel(\"$E$\",E,NW);\nlabel(\"$F$\",F,SW);\nlabel(\"$7$\",(E+F)/2,W);\n[/asy]\nPlease reason step by step, and put your final answer within \\boxed{}."
    }
  ],
  "target": "\\sqrt{51}",
  "id": 0,
  "group_id": 0,
  "subset_key": "Level 1",
  "metadata": {
    "question_id": "test/precalculus/1303.json",
    "solution": "The triangle is a right triangle, so $\\sin D = \\frac{EF}{DF}$. Then we have that $\\sin D = 0.7 = \\frac{7}{DF}$, so $DF = 10$.\n\nUsing the Pythagorean Theorem, we find that the length of $DE$ is $\\sqrt{DF^2 - EF^2},$ or $\\sqrt{100 - 49} = \\boxed{\\sqrt{51}}$."
  }
}
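Because each record carries a `subset_key`, results can be grouped by difficulty level. The sketch below shows one way to do that, assuming scored results are available as `(subset_key, is_correct)` pairs; that input shape and the function name `accuracy_by_subset` are hypothetical and do not reflect evalscope's own reporting code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_subset(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group (subset_key, is_correct) pairs and return accuracy per subset."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subset_key, is_correct in results:
        totals[subset_key] += 1
        correct[subset_key] += int(is_correct)
    # Sort keys so the levels are reported in a stable order.
    return {key: correct[key] / totals[key] for key in sorted(totals)}


if __name__ == "__main__":
    demo = [("Level 1", True), ("Level 1", False), ("Level 2", True)]
    print(accuracy_by_subset(demo))
```

The demo data is made up purely to exercise the function; it is not a real evaluation result.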
Prompt Template#
Prompt Template:
{question}
Please reason step by step, and put your final answer within \boxed{{}}.
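The doubled braces in the template are consistent with Python `str.format` escaping, where `{{}}` renders as a literal `{}`. The sketch below illustrates that rendering under this assumption; `PROMPT_TEMPLATE` and `build_prompt` are illustrative names, not evalscope identifiers.

```python
# Template text copied from above; the constant name is an assumption.
PROMPT_TEMPLATE = (
    "{question}\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}."
)


def build_prompt(question: str) -> str:
    """Fill the template; '{{}}' becomes a literal '{}' after formatting."""
    return PROMPT_TEMPLATE.format(question=question)


if __name__ == "__main__":
    print(build_prompt("What is $\\frac{1}{2} + \\frac{1}{2}$?"))
```

Only the template is parsed for replacement fields, so braces inside the question itself (common in LaTeX) are inserted unchanged. The rendered suffix matches the end of the `content` field in the sample above.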
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets math_500 \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['math_500'],
    dataset_args={
        'math_500': {
            # Optional: uncomment to evaluate specific subsets only
            # 'subset_list': ['Level 1', 'Level 2', 'Level 3'],
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)