MathVerse#
Overview#
MathVerse is an all-around visual math benchmark designed for equitable and in-depth evaluation of Multimodal Large Language Models (MLLMs). It contains 2,612 high-quality, multi-subject math problems with diagrams, transformed into 15K test samples across varying information modalities.
Task Description#
Task Type: Visual Mathematical Reasoning
Input: Math problem with diagram + question (multi-choice or free-form)
Output: Answer (letter for multi-choice, numerical/expression for free-form)
Domains: Multi-subject mathematics with visual diagrams
Key Features#
2,612 problems transformed into 6 versions each (15K total samples)
Tests whether MLLMs truly understand visual diagrams for math reasoning
Problem versions vary by visual information dependency:
Text Dominant: Most info in text
Text Lite: Balanced text/visual
Vision Intensive: More visual reliance
Vision Dominant: Primarily visual
Vision Only: All info in diagram
Supports both multiple-choice and free-form answers
Evaluation Notes#
Default evaluation uses the testmini split
Primary metric: Accuracy with numeric comparison
Free-form answers use \boxed{} format
Uses LLM judge for answer verification
Results reported per problem version for detailed analysis
Properties#
Property |
Value |
|---|---|
Benchmark Name |
|
Dataset ID |
|
Paper |
N/A |
Tags |
|
Metrics |
|
Default Shots |
0-shot |
Evaluation Split |
|
Data Statistics#
Metric |
Value |
|---|---|
Total Samples |
3,940 |
Prompt Length (Mean) |
274.2 chars |
Prompt Length (Min/Max) |
70 / 1535 chars |
Per-Subset Statistics:
Subset |
Samples |
Prompt Mean |
Prompt Min |
Prompt Max |
|---|---|---|---|---|
|
788 |
369.63 |
122 |
1535 |
|
788 |
294.77 |
78 |
1397 |
|
788 |
280.39 |
78 |
1350 |
|
788 |
272.11 |
78 |
1356 |
|
788 |
154.1 |
70 |
222 |
Image Statistics:
Metric |
Value |
|---|---|
Total Images |
3,940 |
Images per Sample |
min: 1, max: 1, mean: 1 |
Resolution Range |
63x70 - 6840x3549 |
Formats |
jpeg, png |
Sample Example#
Subset: Text Dominant
{
"input": [
{
"id": "a3189330",
"content": [
{
"text": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A, B, C, D. Think step by step before answering.\n\nAs shown in the figure, in triangle ABC, it is known that angle A = 80.0, angle B = 60.0, point D is on AB and point E is on AC, DE parallel BC, then the size of angle CED is ()\nChoices:\nA:40°\nB:60°\nC:120°\nD:140°"
},
{
"image": "[BASE64_IMAGE: png, ~1.6KB]"
}
]
}
],
"target": "D",
"id": 0,
"group_id": 0,
"subset_key": "Text Dominant",
"metadata": {
"sample_index": "1",
"problem_index": "1",
"problem_version": "Text Dominant",
"question_type": "multi-choice",
"query_wo": "Please directly answer the question and provide the correct option letter, e.g., A, B, C, D.\nQuestion: As shown in the figure, in triangle ABC, it is known that angle A = 80.0, angle B = 60.0, point D is on AB and point E is on AC, DE parallel BC, then the size of angle CED is ()\nChoices:\nA:40°\nB:60°\nC:120°\nD:140°",
"query_cot": "Please first conduct reasoning, and then answer the question and provide the correct option letter, e.g., A, B, C, D, at the end.\nQuestion: As shown in the figure, in triangle ABC, it is known that angle A = 80.0, angle B = 60.0, point D is on AB and point E is on AC, DE parallel BC, then the size of angle CED is ()\nChoices:\nA:40°\nB:60°\nC:120°\nD:140°",
"question_for_eval": "As shown in the figure, in triangle ABC, it is known that angle A = 80.0, angle B = 60.0, point D is on AB and point E is on AC, DE parallel BC, then the size of angle CED is ()\nChoices:\nA:40°\nB:60°\nC:120°\nD:140°"
}
}
Prompt Template#
Prompt Template:
{question}
Please reason step by step, and put your final answer within \boxed{{}}.
Usage#
Using CLI#
evalscope eval \
--model YOUR_MODEL \
--api-url OPENAI_API_COMPAT_URL \
--api-key EMPTY_TOKEN \
--datasets math_verse \
--limit 10 # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
model='YOUR_MODEL',
api_url='OPENAI_API_COMPAT_URL',
api_key='EMPTY_TOKEN',
datasets=['math_verse'],
dataset_args={
'math_verse': {
# subset_list: ['Text Dominant', 'Text Lite', 'Vision Intensive'] # optional, evaluate specific subsets
}
},
limit=10, # Remove this line for formal evaluation
)
run_task(task_cfg=task_cfg)