IMO-AnswerBench#

Overview#

IMO-AnswerBench is a benchmark of 400 challenging problems sourced from the International Mathematical Olympiad (IMO) Shortlists. It covers four major mathematical domains and is designed to evaluate advanced mathematical reasoning capabilities of language models at the olympiad level.

Task Description#

Task Type: Olympiad Mathematics Problem Solving
Input: IMO Shortlist mathematical problem
Output: Step-by-step solution with final answer
Difficulty: International olympiad level

Key Features#

400 problems from IMO Shortlists (2005-2024)
Four domains: Algebra, Combinatorics, Geometry, Number Theory
Subcategories include: Operation, Inequality, Sequence, Polynomial, Functional Equation, etc.
Answers range from simple integers to complex LaTeX expressions (intervals, sets, fractions)
Represents the highest difficulty level in mathematical problem-solving benchmarks

Evaluation Notes#

Default configuration uses 0-shot evaluation
Answers should be formatted within \boxed{} for proper extraction
Uses numeric equivalence checking with LLM-as-judge for complex answers
Results can be broken down by Category (Algebra, Combinatorics, Geometry, Number Theory)
Many answers involve symbolic expressions requiring mathematical equivalence checking

Properties#

Property	Value
Benchmark Name	`imo_answerbench`
Dataset ID	evalscope/imo-answerbench
Paper	N/A
Tags	`Math`, `Reasoning`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`train`

Data Statistics#

Metric	Value
Total Samples	400
Prompt Length (Mean)	423.02 chars
Prompt Length (Min/Max)	145 / 1343 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean	Prompt Min	Prompt Max
`Algebra`	100	345.06	183	984
`Combinatorics`	100	641.53	257	1343
`Geometry`	100	375.44	196	703
`Number theory`	100	330.07	145	769

Sample Example#

Subset: Algebra

{
  "input": [
    {
      "id": "afe74111",
      "content": "Problem:\nFor a given positive integer $N$, Henry writes the quotient of $ab$ divided by $N+1$ on the board for each integer pair $(a,b)$ where $1\\le a,b\\le N$. Find all $N$ such that the sum of the $N^2$ numbers Henry wrote on the board is $\\frac{N^3-N^2+2}{4}$.\n\nPlease reason step by step, and put your final answer within \\boxed{}.\n"
    }
  ],
  "target": "3",
  "id": 0,
  "group_id": 0,
  "subset_key": "Algebra",
  "metadata": {
    "problem_id": "imo-bench-algebra-001",
    "subcategory": "Operation",
    "source": "IMO Shortlist 2021"
  }
}

Prompt Template#

Prompt Template:

Problem:
{question}

Please reason step by step, and put your final answer within \boxed{{}}.

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets imo_answerbench \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['imo_answerbench'],
    dataset_args={
        'imo_answerbench': {
            # subset_list: ['Algebra', 'Combinatorics', 'Geometry']  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)