MATH-500#

Overview#

MATH-500 is a curated subset of 500 problems from the MATH benchmark, designed to evaluate the mathematical reasoning capabilities of language models. It spans five difficulty levels and a range of topics, including algebra, geometry, number theory, and precalculus.

Task Description#

  • Task Type: Mathematical Problem Solving

  • Input: Mathematical problem statement

  • Output: Step-by-step solution ending in a final answer (a number or a mathematical expression such as \sqrt{51})

  • Difficulty Levels: Level 1 (easiest) to Level 5 (hardest)

Key Features#

  • 500 carefully selected problems from the full MATH dataset

  • Five difficulty levels for fine-grained evaluation

  • Problems cover algebra, geometry, number theory, probability, and more

  • Each problem includes a reference solution

  • Designed for efficient yet comprehensive math evaluation

Evaluation Notes#

  • Default configuration uses 0-shot evaluation

  • Answers should be formatted within \boxed{} for proper extraction

  • Predicted and reference answers are compared using numeric equivalence checking

  • Results can be broken down by difficulty level

  • Commonly used for math reasoning benchmarking due to manageable size
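
The extraction and per-level breakdown described above can be sketched as follows. This is a minimal illustration, not EvalScope's actual implementation: the helper names `extract_boxed` and `accuracy_by_level` are hypothetical, and the sketch assumes the final answer is the last `\boxed{...}` in the model output.

```python
from collections import defaultdict


def extract_boxed(text):
    """Return the contents of the last boxed answer in text, or None.

    Hypothetical helper: braces are matched by depth so that nested
    LaTeX such as a square root inside the box is kept whole.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


def accuracy_by_level(records):
    """Compute accuracy per difficulty level.

    records: iterable of (level, is_correct) pairs, e.g. ("Level 1", True).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for level, ok in records:
        totals[level] += 1
        hits[level] += int(ok)
    return {level: hits[level] / totals[level] for level in totals}


answer = extract_boxed("so the length is \\boxed{\\sqrt{51}}.")
scores = accuracy_by_level([("Level 1", True), ("Level 1", False), ("Level 5", True)])
```

Tracking brace depth rather than using a simple regular expression keeps answers like `\sqrt{51}` or `\frac{1}{2}` from being truncated at the first closing brace.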

Properties#

| Property | Value |
| --- | --- |
| Benchmark Name | math_500 |
| Dataset ID | AI-ModelScope/MATH-500 |
| Paper | N/A |
| Tags | Math, Reasoning |
| Metrics | acc |
| Default Shots | 0-shot |
| Evaluation Split | test |

Data Statistics#

| Metric | Value |
| --- | --- |
| Total Samples | 500 |
| Prompt Length (Mean) | 266.89 chars |
| Prompt Length (Min/Max) | 91 / 1804 chars |

Per-Subset Statistics:

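| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
| --- | --- | --- | --- | --- |
| Level 1 | 43 | 193.19 | 100 | 571 |
| Level 2 | 90 | 218.82 | 91 | 802 |
| Level 3 | 105 | 236 | 93 | 688 |
| Level 4 | 128 | 277.1 | 93 | 1771 |
| Level 5 | 134 | 337.28 | 118 | 1804 |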
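
As a quick consistency check, the overall mean prompt length should equal the sample-weighted average of the per-level means. The sketch below uses only the figures listed in this section; the variable names are illustrative.

```python
# (samples, mean prompt length in chars) per level, copied from the statistics above
per_level = {
    "Level 1": (43, 193.19),
    "Level 2": (90, 218.82),
    "Level 3": (105, 236.0),
    "Level 4": (128, 277.1),
    "Level 5": (134, 337.28),
}

total = sum(n for n, _ in per_level.values())
# Weight each level's mean by its sample count, then divide by the total.
overall_mean = sum(n * mean for n, mean in per_level.values()) / total

print(total, round(overall_mean, 2))  # prints: 500 266.89
```

The harder levels have both more samples and longer prompts, which pulls the overall mean above the simple average of the five per-level means.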

Sample Example#

Subset: Level 1

{
  "input": [
    {
      "id": "b5d90091",
      "content": "Suppose $\\sin D = 0.7$ in the diagram below. What is $DE$? [asy]\npair D,E,F;\nF = (0,0);\nD = (sqrt(51),7);\nE = (0,7);\ndraw(D--E--F--D);\ndraw(rightanglemark(D,E,F,15));\nlabel(\"$D$\",D,NE);\nlabel(\"$E$\",E,NW);\nlabel(\"$F$\",F,SW);\nlabel(\"$7$\",(E+F)/2,W);\n[/asy]\nPlease reason step by step, and put your final answer within \\boxed{}."
    }
  ],
  "target": "\\sqrt{51}",
  "id": 0,
  "group_id": 0,
  "subset_key": "Level 1",
  "metadata": {
    "question_id": "test/precalculus/1303.json",
    "solution": "The triangle is a right triangle, so $\\sin D = \\frac{EF}{DF}$. Then we have that $\\sin D = 0.7 = \\frac{7}{DF}$, so $DF = 10$.\n\nUsing the Pythagorean Theorem, we find that the length of $DE$ is $\\sqrt{DF^2 - EF^2},$ or $\\sqrt{100 - 49} = \\boxed{\\sqrt{51}}$."
  }
}
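
The reference solution in this sample can be checked numerically. The snippet below is only an illustration of comparing a computed value with the target by numeric closeness; it is not the comparison routine used by the evaluator, and the variable names are made up for this example.

```python
import math

sin_d = 0.7   # given: sin D = 0.7
ef = 7.0      # side EF, labeled 7 in the diagram

df = ef / sin_d                   # sin D = EF / DF, so DF = EF / sin D
de = math.sqrt(df**2 - ef**2)     # Pythagorean theorem on the right triangle

target = math.sqrt(51)            # numeric value of the "target" field
matches = math.isclose(de, target, rel_tol=1e-9)
```

A tolerance-based comparison is used because `7.0 / 0.7` is not exactly 10 in floating point, so an exact equality test could fail even though the answer is right.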

Prompt Template#

{question}
Please reason step by step, and put your final answer within \boxed{{}}.
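
The doubled braces in `\boxed{{}}` suggest the template is filled with Python's `str.format`, where `{{}}` renders as a literal `{}`. That matches the single `\boxed{}` seen in the sample's `content` field. A minimal sketch under that assumption (the constant name `PROMPT_TEMPLATE` is illustrative):

```python
# Template text as shown above; the backslash is escaped for a Python string literal.
PROMPT_TEMPLATE = (
    "{question}\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}."
)

# str.format substitutes {question} and collapses {{}} to a literal {}.
prompt = PROMPT_TEMPLATE.format(question="What is $1 + 1$?")
print(prompt)
```

Note that any literal braces needed in the rendered prompt must be doubled in the template, while LaTeX braces inside the question text itself are passed through unchanged because they arrive as a format argument.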

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets math_500 \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['math_500'],
    dataset_args={
        'math_500': {
            # 'subset_list': ['Level 1', 'Level 2', 'Level 3'],  # optional: uncomment to evaluate only these subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)