MathVista#
Overview#
MathVista is a comprehensive benchmark for mathematical reasoning in visual contexts. It combines newly created datasets with existing benchmarks to evaluate models on diverse visual mathematical reasoning tasks across multiple domains.
Task Description#
Task Type: Visual Mathematical Reasoning
Input: Image with mathematical question (multiple-choice or free-form)
Output: Numerical answer or answer choice
Domains: Geometry, algebra, statistics, scientific reasoning
Key Features#
6,141 examples from 31 different datasets
Includes IQTest, FunctionQA, and PaperQA (newly created)
9 MathQA and 19 VQA datasets from literature
Tests logical reasoning on puzzle figures
Tests algebraic reasoning over functional plots
Tests scientific reasoning with academic paper figures
Evaluation Notes#
Default configuration uses 0-shot evaluation on testmini split
Supports both multiple-choice and free-form questions
Answers should be in \boxed{} format without units
Uses numeric equivalence checking for answer comparison
Chain-of-Thought (CoT) prompting for multiple-choice questions
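The answer-handling notes above can be illustrated with a minimal sketch. This is not EvalScope's actual scoring code: the helper names `extract_boxed` and `answers_match` and the tolerance value are assumptions chosen for illustration. It pulls the last `\boxed{...}` span out of a model response and compares it to the target numerically when both values parse as floats.

```python
import math
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # braces never closed


def answers_match(prediction: str, target: str, tol: float = 1e-6) -> bool:
    """Compare numerically if both parse as floats, else as case-insensitive strings."""
    try:
        return math.isclose(float(prediction), float(target), rel_tol=tol, abs_tol=tol)
    except ValueError:
        return prediction.strip().lower() == target.strip().lower()


# Hypothetical model response for a free-form question with target "1.2".
response = "The spring is compressed by \\boxed{1.20}"
predicted = extract_boxed(response)
print(predicted, answers_match(predicted, "1.2"))
```

Comparing as floats means `1.20` and `1.2` count as the same answer, while the string fallback covers multiple-choice letters such as `B`.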
Properties#
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 1,000 |
| Prompt Length (Mean) | 261.48 chars |
| Prompt Length (Min/Max) | 106 / 1391 chars |
Image Statistics:
| Metric | Value |
|---|---|
| Total Images | 1,000 |
| Images per Sample | min: 1, max: 1, mean: 1 |
| Resolution Range | 187x18 - 5236x3491 |
| Formats | jpeg, mpo, png, webp |
Sample Example#
Subset: default
{
  "input": [
    {
      "id": "93ec4f16",
      "content": [
        {
          "text": "When a spring does work on an object, we cannot find the work by simply multiplying the spring force by the object's displacement. The reason is that there is no one value for the force-it changes. However, we can split the displacement up in ... [TRUNCATED] ... g of spring constant $k=750 \\mathrm{~N} / \\mathrm{m}$. When the canister is momentarily stopped by the spring, by what distance $d$ is the spring compressed?\nPlease reason step by step, and put your final answer within \\boxed{} without units."
        },
        {
          "image": "[BASE64_IMAGE: jpg, ~185.7KB]"
        }
      ]
    }
  ],
  "target": "1.2",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "precision": 1.0,
    "question_type": "free_form",
    "answer_type": "float",
    "category": "math-targeted-vqa",
    "context": "scientific figure",
    "grade": "college",
    "img_height": 720,
    "img_width": 1514,
    "language": "english",
    "skills": [
      "scientific reasoning"
    ],
    "source": "SciBench",
    "split": "testmini",
    "task": "textbook question answering"
  }
}
Note: Some content was truncated for display.
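To make the record layout concrete, here is a small sketch that walks a sample shaped like the one above and separates its text parts from its image parts. The helper name `split_content` is hypothetical, and the record is an abbreviated stand-in with placeholder strings in place of the real question and image payload.

```python
from typing import Any, Dict, List, Tuple


def split_content(sample: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Collect the text parts and image parts from every message in a sample."""
    texts: List[str] = []
    images: List[str] = []
    for message in sample["input"]:
        for part in message["content"]:
            if "text" in part:
                texts.append(part["text"])
            if "image" in part:
                images.append(part["image"])
    return texts, images


# Abbreviated stand-in for the record shown above; both strings are placeholders.
sample = {
    "input": [
        {
            "id": "93ec4f16",
            "content": [
                {"text": "<question text>"},
                {"image": "<base64 image data>"},
            ],
        }
    ],
    "target": "1.2",
    "metadata": {"question_type": "free_form", "answer_type": "float"},
}

texts, images = split_content(sample)
meta = sample["metadata"]
print(len(texts), len(images), meta["question_type"], meta["answer_type"], sample["target"])
```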
Prompt Template#
Prompt Template:
{question}
Please reason step by step, and put your final answer within \boxed{{}} without units.
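The doubled braces in the template suggest it is filled in with Python-style string formatting, where `{{}}` is an escape that renders as a literal `{}`. The sketch below assumes that mechanism; the names `PROMPT_TEMPLATE` and `build_prompt` are illustrative and not taken from EvalScope.

```python
# The template text from above, written as a Python string.
PROMPT_TEMPLATE = (
    "{question}\n"
    "Please reason step by step, and put your final answer within "
    "\\boxed{{}} without units."
)


def build_prompt(question: str) -> str:
    """Substitute the question; the escaped {{}} becomes a literal {}."""
    return PROMPT_TEMPLATE.format(question=question)


# Hypothetical question text, used only to show the rendered result.
prompt = build_prompt("What is the area of the triangle shown in the figure?")
print(prompt)
```

The rendered prompt ends with `\boxed{} without units.`, which matches the instruction at the end of the sample's text field.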
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets math_vista \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['math_vista'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)