OlympiadBench#
Overview#
OlympiadBench is an Olympiad-level bilingual multimodal scientific benchmark featuring 8,476 problems from mathematics and physics competitions, including the Chinese college entrance exam (CEE). It provides rigorous evaluation of advanced scientific reasoning.
Task Description#
Task Type: Olympiad-Level Math/Physics Problem Solving
Input: Problem text with optional images (up to 9)
Output: Mathematical answer or proof
Domains: Mathematics, Physics (bilingual: English and Chinese)
Key Features#
8,476 Olympiad-level problems
Bilingual support (English and Chinese)
Covers both Mathematics and Physics
Subset naming convention:
OE: Open-Ended problemsTP: Theorem Proving problemsMM: Multimodal (with images)TO: Text-OnlyCEE: Chinese Entrance ExamCOMP: Comprehensive competition problems
Evaluation Notes#
Default evaluation uses the train split
Primary metric: Accuracy with mathematical judging
Answers should be in \boxed{} format
Note:
TP(Theorem Proving) subsets cannot be auto-evaluated currentlySupports numerical precision/error thresholds for approximate answers
Properties#
Property |
Value |
|---|---|
Benchmark Name |
|
Dataset ID |
|
Paper |
N/A |
Tags |
|
Metrics |
|
Default Shots |
0-shot |
Evaluation Split |
|
Data Statistics#
Metric |
Value |
|---|---|
Total Samples |
8,476 |
Prompt Length (Mean) |
484.46 chars |
Prompt Length (Min/Max) |
129 / 5032 chars |
Per-Subset Statistics:
Subset |
Samples |
Prompt Mean |
Prompt Min |
Prompt Max |
|---|---|---|---|---|
|
150 |
1118.59 |
575 |
3943 |
|
1,910 |
343.89 |
182 |
1566 |
|
56 |
379.07 |
201 |
608 |
|
456 |
863.57 |
556 |
2793 |
|
1,483 |
425.68 |
214 |
901 |
|
674 |
825.93 |
545 |
4924 |
|
1,240 |
297.26 |
171 |
1098 |
|
408 |
322.72 |
173 |
1583 |
|
236 |
949.1 |
548 |
3112 |
|
115 |
329.56 |
203 |
503 |
|
62 |
2044.19 |
475 |
4617 |
|
652 |
241.05 |
165 |
674 |
|
81 |
304.7 |
206 |
512 |
|
19 |
677.79 |
432 |
1602 |
|
503 |
933.44 |
449 |
5032 |
|
207 |
242.79 |
141 |
947 |
|
199 |
285.43 |
129 |
522 |
|
25 |
740.48 |
475 |
1262 |
Image Statistics:
Metric |
Value |
|---|---|
Total Images |
5,875 |
Images per Sample |
min: 1, max: 9, mean: 1.21 |
Resolution Range |
64x46 - 1765x1947 |
Formats |
jpeg, png |
Sample Example#
Subset: OE_MM_maths_en_COMP
{
"input": [
{
"id": "d7b44a52",
"content": [
{
"text": "The following is an open-ended problem from an International Math competition. The answer of The problem should be a numerical value. Please calculate the answer according to the given requirements and the information provided. Please use LaT ... [TRUNCATED] ... c_{3}, \\ldots$ with $c_{i}<C$ for all $i$, Turbo can (after studying the sequence) ensure that there is some point on the circle that it will never visit or crawl across.\n\nPlease reason step by step, and put your final answer within \\boxed{}."
},
{
"image": "[BASE64_IMAGE: jpg, ~27.4KB]"
}
]
}
],
"target": "$\\frac{1}{2}$",
"id": 0,
"group_id": 0,
"metadata": {
"id": 2231,
"subfield": "Geometry",
"context": null,
"solution": [
"The largest possible $C$ is $C=\\frac{1}{2}$.\n\nFor $0<C \\leqslant \\frac{1}{2}$, Turbo can simply choose an arbitrary point $P$ (different from its starting point) to avoid. When Turbo is at an arbitrary point $A$ different from $P$, the two ar ... [TRUNCATED] ... Note: Every sequence of the form $c_{i}=x$ if $i$ is odd, and $c_{i}=y$ if $i$ is even, where $0<x, y<C$, such that $x+y \\geqslant 1$, and $x \\neq y$ satisfies the conditions with the same argument. There might be even more possible examples.",
"To show that $C\\le \\frac12$\n\nWe consider the following related problem:\n\nWe assume instead that the snail Chet is moving left and right on the real line. Find the size $M$ of the smallest (closed) interval, that we cannot force Chet out of, u ... [TRUNCATED] ... n, 1-\\varepsilon]$. Indeed the absolute value of the final position is at least $1-\\frac{5}{6} \\varepsilon$. This contradicts the assumption, that we cannot force Chet out of $[-1+\\varepsilon, 1-\\varepsilon]$. Hence $M \\geqslant 2$ as needed."
],
"final_answer": [
"$\\frac{1}{2}$"
],
"is_multiple_answer": false,
"unit": null,
"answer_type": "Numerical",
"question_type": "Open-ended",
"language": "English",
"subject": "Math",
"error": null
}
}
Note: Some content was truncated for display.
Prompt Template#
Prompt Template:
{question}
Please reason step by step, and put your final answer within \boxed{{}}.
Usage#
Using CLI#
evalscope eval \
--model YOUR_MODEL \
--api-url OPENAI_API_COMPAT_URL \
--api-key EMPTY_TOKEN \
--datasets olympiad_bench \
--limit 10 # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
model='YOUR_MODEL',
api_url='OPENAI_API_COMPAT_URL',
api_key='EMPTY_TOKEN',
datasets=['olympiad_bench'],
dataset_args={
'olympiad_bench': {
# subset_list: ['OE_MM_maths_en_COMP', 'OE_MM_maths_zh_CEE', 'OE_MM_maths_zh_COMP'] # optional, evaluate specific subsets
}
},
limit=10, # Remove this line for formal evaluation
)
run_task(task_cfg=task_cfg)