OlympiadBench#

Overview#

OlympiadBench is a bilingual, multimodal benchmark of 8,476 Olympiad-level mathematics and physics problems, drawn from competitions and from the Chinese college entrance exam (CEE). It is intended to evaluate advanced scientific reasoning.

Task Description#

Task Type: Olympiad-Level Math/Physics Problem Solving
Input: Problem text with optional images (up to 9)
Output: Mathematical answer or proof
Domains: Mathematics, Physics (bilingual: English and Chinese)

Key Features#

8,476 Olympiad-level problems
Bilingual support (English and Chinese)
Covers both Mathematics and Physics
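Subset names combine the tokens below with a subject (`maths` or `physics`) and a language (`en` or `zh`), for example `OE_MM_maths_en_COMP`: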
- OE: Open-Ended problems
- TP: Theorem Proving problems
- MM: Multimodal (with images)
- TO: Text-Only
- CEE: Chinese Entrance Exam
- COMP: Comprehensive competition problems
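
The naming convention can be unpacked mechanically. The sketch below assumes every subset name has exactly five underscore-separated fields, in the order seen in the subset table (question type, modality, subject, language, source). `SubsetInfo` and `parse_subset_name` are hypothetical helper names, not part of any library.

```python
from typing import NamedTuple

# Token meanings taken from the naming convention above.
QUESTION_TYPES = {"OE": "open-ended", "TP": "theorem proving"}
MODALITIES = {"MM": "multimodal", "TO": "text-only"}
SOURCES = {"CEE": "Chinese entrance exam", "COMP": "competition"}


class SubsetInfo(NamedTuple):
    question_type: str
    modality: str
    subject: str
    language: str
    source: str


def parse_subset_name(name: str) -> SubsetInfo:
    """Split a subset name such as 'OE_MM_maths_en_COMP' into its five fields."""
    parts = name.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated fields, got {name!r}")
    qtype, modality, subject, language, source = parts
    if qtype not in QUESTION_TYPES or modality not in MODALITIES or source not in SOURCES:
        raise ValueError(f"unrecognized token in {name!r}")
    return SubsetInfo(qtype, modality, subject, language, source)


info = parse_subset_name("TP_TO_physics_en_COMP")
print(info.question_type, info.subject, info.language)  # TP physics en
```

Parsing names this way makes it easy to filter subsets by field, for instance to skip every `TP` subset.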

Evaluation Notes#

Default evaluation uses the train split
Primary metric: Accuracy with mathematical judging
Final answers should be wrapped in \boxed{}
Note: TP (Theorem Proving) subsets cannot currently be evaluated automatically
Supports numerical precision/error thresholds for approximate answers
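
To make the `\boxed{}` and error-threshold notes concrete, here is a minimal sketch of answer extraction and tolerance-based comparison. It is not the benchmark's actual judging code: `extract_boxed`, `numerically_close`, and the default tolerance are hypothetical and chosen only for illustration.

```python
import math
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1
    # Walk forward, tracking brace depth so nested LaTeX like \frac{1}{2} survives.
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    return None  # unbalanced braces


def numerically_close(pred: str, target: str, tol: float = 1e-8) -> bool:
    """Compare as floats within an absolute tolerance; fall back to string equality."""
    try:
        return math.isclose(float(pred), float(target), rel_tol=0.0, abs_tol=tol)
    except ValueError:
        return pred.strip() == target.strip()


answer = extract_boxed("Hence the maximum is \\boxed{\\frac{1}{2}}.")
print(answer)  # \frac{1}{2}
```

A real judge also has to handle symbolic equivalence (for example, `\frac{1}{2}` versus `0.5`), which this sketch does not attempt.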

Properties#

Property	Value
Benchmark Name	`olympiad_bench`
Dataset ID	AI-ModelScope/OlympiadBench
Paper	N/A
Tags	`Math`, `Reasoning`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`train`

Data Statistics#

Metric	Value
Total Samples	8,476
Prompt Length (Mean)	484.46 chars
Prompt Length (Min/Max)	129 / 5032 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean (chars)	Prompt Min (chars)	Prompt Max (chars)
`OE_MM_maths_en_COMP`	150	1118.59	575	3943
`OE_MM_maths_zh_CEE`	1,910	343.89	182	1566
`OE_MM_maths_zh_COMP`	56	379.07	201	608
`OE_MM_physics_en_COMP`	456	863.57	556	2793
`OE_MM_physics_zh_CEE`	1,483	425.68	214	901
`OE_TO_maths_en_COMP`	674	825.93	545	4924
`OE_TO_maths_zh_CEE`	1,240	297.26	171	1098
`OE_TO_maths_zh_COMP`	408	322.72	173	1583
`OE_TO_physics_en_COMP`	236	949.1	548	3112
`OE_TO_physics_zh_CEE`	115	329.56	203	503
`TP_MM_maths_en_COMP`	62	2044.19	475	4617
`TP_MM_maths_zh_CEE`	652	241.05	165	674
`TP_MM_maths_zh_COMP`	81	304.7	206	512
`TP_MM_physics_en_COMP`	19	677.79	432	1602
`TP_TO_maths_en_COMP`	503	933.44	449	5032
`TP_TO_maths_zh_CEE`	207	242.79	141	947
`TP_TO_maths_zh_COMP`	199	285.43	129	522
`TP_TO_physics_en_COMP`	25	740.48	475	1262
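
Because TP subsets cannot be auto-evaluated, it helps to know how many samples remain once they are excluded. The following sketch copies the sample counts from the table above, checks that they sum to the stated total, and collects the OE subset names. Variable names are illustrative.

```python
# Sample counts copied from the per-subset table.
SUBSET_SAMPLES = {
    "OE_MM_maths_en_COMP": 150,
    "OE_MM_maths_zh_CEE": 1910,
    "OE_MM_maths_zh_COMP": 56,
    "OE_MM_physics_en_COMP": 456,
    "OE_MM_physics_zh_CEE": 1483,
    "OE_TO_maths_en_COMP": 674,
    "OE_TO_maths_zh_CEE": 1240,
    "OE_TO_maths_zh_COMP": 408,
    "OE_TO_physics_en_COMP": 236,
    "OE_TO_physics_zh_CEE": 115,
    "TP_MM_maths_en_COMP": 62,
    "TP_MM_maths_zh_CEE": 652,
    "TP_MM_maths_zh_COMP": 81,
    "TP_MM_physics_en_COMP": 19,
    "TP_TO_maths_en_COMP": 503,
    "TP_TO_maths_zh_CEE": 207,
    "TP_TO_maths_zh_COMP": 199,
    "TP_TO_physics_en_COMP": 25,
}

total = sum(SUBSET_SAMPLES.values())

# Open-ended subsets are the ones that can be scored automatically.
oe_subsets = [name for name in SUBSET_SAMPLES if name.startswith("OE_")]
oe_total = sum(SUBSET_SAMPLES[name] for name in oe_subsets)

print(total, oe_total, total - oe_total)  # 8476 6728 1748
```

A list such as `oe_subsets` is the kind of value that could be supplied as `subset_list` when only auto-evaluable subsets are wanted.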

Image Statistics:

Metric	Value
Total Images	5,875
Images per Sample	min: 1, max: 9, mean: 1.21
Resolution Range	64x46 - 1765x1947
Formats	jpeg, png

Sample Example#

Subset: OE_MM_maths_en_COMP

{
  "input": [
    {
      "id": "d7b44a52",
      "content": [
        {
          "text": "The following is an open-ended problem from an International Math competition. The answer of The problem should be a numerical value. Please calculate the answer according to the given requirements and the information provided. Please use LaT ... [TRUNCATED] ... c_{3}, \\ldots$ with $c_{i}<C$ for all $i$, Turbo can (after studying the sequence) ensure that there is some point on the circle that it will never visit or crawl across.\n\nPlease reason step by step, and put your final answer within \\boxed{}."
        },
        {
          "image": "[BASE64_IMAGE: jpg, ~27.4KB]"
        }
      ]
    }
  ],
  "target": "$\\frac{1}{2}$",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "id": 2231,
    "subfield": "Geometry",
    "context": null,
    "solution": [
      "The largest possible $C$ is $C=\\frac{1}{2}$.\n\nFor $0<C \\leqslant \\frac{1}{2}$, Turbo can simply choose an arbitrary point $P$ (different from its starting point) to avoid. When Turbo is at an arbitrary point $A$ different from $P$, the two ar ... [TRUNCATED] ... Note: Every sequence of the form $c_{i}=x$ if $i$ is odd, and $c_{i}=y$ if $i$ is even, where $0<x, y<C$, such that $x+y \\geqslant 1$, and $x \\neq y$ satisfies the conditions with the same argument. There might be even more possible examples.",
      "To show that $C\\le \\frac12$\n\nWe consider the following related problem:\n\nWe assume instead that the snail Chet is moving left and right on the real line. Find the size $M$ of the smallest (closed) interval, that we cannot force Chet out of, u ... [TRUNCATED] ... n, 1-\\varepsilon]$. Indeed the absolute value of the final position is at least $1-\\frac{5}{6} \\varepsilon$. This contradicts the assumption, that we cannot force Chet out of $[-1+\\varepsilon, 1-\\varepsilon]$. Hence $M \\geqslant 2$ as needed."
    ],
    "final_answer": [
      "$\\frac{1}{2}$"
    ],
    "is_multiple_answer": false,
    "unit": null,
    "answer_type": "Numerical",
    "question_type": "Open-ended",
    "language": "English",
    "subject": "Math",
    "error": null
  }
}

Note: Some content was truncated for display.
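
As a rough illustration of how a record with this shape might be consumed, the sketch below separates text parts from image parts in `input`. It operates on a plain dictionary; `split_content` is a hypothetical helper, and the record used here is an abbreviated stand-in with placeholder values rather than real data.

```python
from typing import Any, Dict, List, Tuple


def split_content(record: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Collect text parts and image parts from every message in record['input']."""
    texts: List[str] = []
    images: List[str] = []
    for message in record.get("input", []):
        for part in message.get("content", []):
            if "text" in part:
                texts.append(part["text"])
            elif "image" in part:
                images.append(part["image"])
    return texts, images


# Abbreviated stand-in with the same shape as the sample above; values are placeholders.
record = {
    "input": [
        {
            "id": "d7b44a52",
            "content": [{"text": "<problem text>"}, {"image": "<base64 image>"}],
        }
    ],
    "target": "$\\frac{1}{2}$",
    "metadata": {"answer_type": "Numerical", "question_type": "Open-ended"},
}

texts, images = split_content(record)
print(len(texts), len(images), record["metadata"]["answer_type"])  # 1 1 Numerical
```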

Prompt Template#

{question}
Please reason step by step, and put your final answer within \boxed{{}}.
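
The doubled braces in `\boxed{{}}` look like Python format-string escapes. Assuming the template is filled with `str.format` (an assumption, though it matches the rendered `\boxed{}` in the sample prompt), a minimal sketch of the substitution is shown below. `build_prompt` is a hypothetical name.

```python
# Template text as given above; "{{}}" is the str.format escape for a literal "{}".
PROMPT_TEMPLATE = (
    "{question}\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}."
)


def build_prompt(question: str) -> str:
    """Fill the template; braces inside `question` itself are left untouched."""
    return PROMPT_TEMPLATE.format(question=question)


prompt = build_prompt("Compute $\\frac{1}{4} + \\frac{1}{4}$.")
print(prompt.splitlines()[-1])
# Please reason step by step, and put your final answer within \boxed{}.
```

Only the template is parsed for replacement fields, so LaTeX braces in the question text do not need escaping.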

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets olympiad_bench \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['olympiad_bench'],
    dataset_args={
        'olympiad_bench': {
            # 'subset_list': ['OE_MM_maths_en_COMP', 'OE_MM_maths_zh_CEE', 'OE_MM_maths_zh_COMP'],  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)