IMO-AnswerBench#
Overview#
IMO-AnswerBench is a benchmark of 400 challenging problems sourced from the International Mathematical Olympiad (IMO) Shortlists. It covers four domains (Algebra, Combinatorics, Geometry, and Number Theory) and is designed to evaluate the olympiad-level mathematical reasoning of language models.
Task Description#
Task Type: Olympiad Mathematics Problem Solving
Input: IMO Shortlist mathematical problem
Output: Step-by-step solution with final answer
Difficulty: International olympiad level
Key Features#
400 problems from IMO Shortlists (2005-2024)
Four domains: Algebra, Combinatorics, Geometry, Number Theory
Subcategories include: Operation, Inequality, Sequence, Polynomial, Functional Equation, etc.
Answers range from simple integers to complex LaTeX expressions (intervals, sets, fractions)
Targets international olympiad difficulty, among the hardest tiers covered by mathematical problem-solving benchmarks
Evaluation Notes#
Default configuration uses 0-shot evaluation
Answers should be formatted within \boxed{} for proper extraction
Uses numeric equivalence checking with LLM-as-judge for complex answers
Results can be broken down by Category (Algebra, Combinatorics, Geometry, Number Theory)
Many answers involve symbolic expressions requiring mathematical equivalence checking
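The extraction, numeric comparison, and per-category breakdown described in these notes can be sketched as follows. This is a minimal illustration under stated assumptions, not EvalScope's actual implementation: the helper names `extract_boxed`, `numeric_equal`, and `accuracy_by_category` are hypothetical, and the LLM-as-judge fallback for symbolic answers is left out (non-numeric answers are compared as plain strings here).

```python
from collections import defaultdict
from fractions import Fraction


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # braces never balanced


def numeric_equal(pred, target):
    """Compare as exact rationals; fall back to string equality (assumption)."""
    try:
        return Fraction(pred.strip()) == Fraction(target.strip())
    except (ValueError, ZeroDivisionError):
        return pred.strip() == target.strip()


def accuracy_by_category(records):
    """records: iterable of (subset_key, model_output, target) tuples."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subset_key, output, target in records:
        totals[subset_key] += 1
        answer = extract_boxed(output)
        if answer is not None and numeric_equal(answer, target):
            correct[subset_key] += 1
    return {key: correct[key] / totals[key] for key in totals}


# Made-up model outputs, for illustration only
records = [
    ("Algebra", "So the answer is \\boxed{3}.", "3"),
    ("Algebra", "Hence \\boxed{3.0}", "3"),
    ("Geometry", "We get \\boxed{\\frac{1}{2}}", "2"),
]
print(accuracy_by_category(records))
```

Tracking brace depth rather than using a simple regular expression keeps nested LaTeX such as `\frac{1}{2}` intact inside the box.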
Properties#
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 400 |
| Prompt Length (Mean) | 423.02 chars |
| Prompt Length (Min/Max) | 145 / 1343 chars |
Per-Subset Statistics:
| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|---|---|---|---|---|
| | 100 | 345.06 | 183 | 984 |
| | 100 | 641.53 | 257 | 1343 |
| | 100 | 375.44 | 196 | 703 |
| | 100 | 330.07 | 145 | 769 |
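As a quick consistency check, the overall figures can be recomputed from the four per-subset rows. The short sketch below uses only the numbers from the tables. Because the per-subset means are already rounded to two decimals, the recomputed overall mean may differ from the reported 423.02 in the last digit.

```python
# (samples, prompt mean, prompt min, prompt max) for the four subset rows,
# in the order they appear in the per-subset table
subsets = [
    (100, 345.06, 183, 984),
    (100, 641.53, 257, 1343),
    (100, 375.44, 196, 703),
    (100, 330.07, 145, 769),
]

total = sum(n for n, _, _, _ in subsets)
# Sample-weighted mean of the per-subset means
overall_mean = sum(n * mean for n, mean, _, _ in subsets) / total
overall_min = min(lo for _, _, lo, _ in subsets)
overall_max = max(hi for _, _, _, hi in subsets)

print(total, overall_mean, overall_min, overall_max)
```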
Sample Example#
Subset: Algebra
{
  "input": [
    {
      "id": "afe74111",
      "content": "Problem:\nFor a given positive integer $N$, Henry writes the quotient of $ab$ divided by $N+1$ on the board for each integer pair $(a,b)$ where $1\\le a,b\\le N$. Find all $N$ such that the sum of the $N^2$ numbers Henry wrote on the board is $\\frac{N^3-N^2+2}{4}$.\n\nPlease reason step by step, and put your final answer within \\boxed{}.\n"
    }
  ],
  "target": "3",
  "id": 0,
  "group_id": 0,
  "subset_key": "Algebra",
  "metadata": {
    "problem_id": "imo-bench-algebra-001",
    "subcategory": "Operation",
    "source": "IMO Shortlist 2021"
  }
}
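The target of this sample can be sanity-checked by brute force. The sketch below assumes that "quotient" means the integer quotient, i.e. floor(ab / (N+1)), and searches only a finite range of N, so it is a spot check rather than a proof. The function names and the search bound of 60 are illustrative choices.

```python
def board_sum(n):
    """Sum of floor(a*b / (n+1)) over all pairs 1 <= a, b <= n."""
    return sum(
        (a * b) // (n + 1)
        for a in range(1, n + 1)
        for b in range(1, n + 1)
    )


def satisfies(n):
    # Compare 4 * sum with N^3 - N^2 + 2 to stay in integer arithmetic.
    return 4 * board_sum(n) == n**3 - n**2 + 2


solutions = [n for n in range(1, 61) if satisfies(n)]
print(solutions)  # → [3]
```

For N = 3 the nine quotients are 0, 0, 0, 0, 1, 1, 0, 1, 2, which sum to 5 = (27 - 9 + 2) / 4, in agreement with the target "3".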
Prompt Template#
Prompt Template:
Problem:
{question}
Please reason step by step, and put your final answer within \boxed{{}}.
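The doubled braces in `\boxed{{}}` suggest the template is filled in with Python-style string formatting, where `{{}}` renders as a literal `{}`. A minimal sketch, assuming `str.format` is the mechanism and reconstructing the whitespace from the sample record (the exact template text used by the benchmark may differ):

```python
# Assumed template text; the blank line and trailing newline are taken
# from the "content" field of the sample record.
PROMPT_TEMPLATE = (
    "Problem:\n{question}\n\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}.\n"
)

# {question} is substituted; {{}} collapses to a literal {}
prompt = PROMPT_TEMPLATE.format(question="Compute $1 + 1$.")
print(prompt)
```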
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets imo_answerbench \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['imo_answerbench'],
    dataset_args={
        'imo_answerbench': {
            # Optional: uncomment to evaluate specific subsets only
            # 'subset_list': ['Algebra', 'Combinatorics', 'Geometry'],
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)