MathVerse#

Overview#

MathVerse is an all-around visual math benchmark designed for equitable and in-depth evaluation of Multimodal Large Language Models (MLLMs). It contains 2,612 high-quality, multi-subject math problems with diagrams, transformed into 15K test samples across varying information modalities.

Task Description#

Task Type: Visual Mathematical Reasoning
Input: Math problem with diagram + question (multi-choice or free-form)
Output: Answer (letter for multi-choice, numerical/expression for free-form)
Domains: Multi-subject mathematics with visual diagrams

Key Features#

2,612 problems, each transformed into 6 versions (about 15K samples in total)
Tests whether MLLMs truly understand visual diagrams for math reasoning
Problem versions vary by how much of the information is carried by the diagram (the five evaluated here are listed; the sixth, Text Only, has no diagram):
- Text Dominant: Most information in the text
- Text Lite: Balanced text and visual information
- Vision Intensive: More reliance on the diagram
- Vision Dominant: Primarily visual
- Vision Only: All information in the diagram
Supports both multiple-choice and free-form answers

Evaluation Notes#

Default evaluation uses the testmini split (3,940 samples: 788 problems in each of 5 versions)
Primary metric: Accuracy with numeric comparison
Free-form answers are expected in \boxed{} format
Uses an LLM judge for answer verification
Results reported per problem version for detailed analysis
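
The notes above can be illustrated with a minimal sketch. It is not EvalScope's scoring code (which, per the notes, also relies on numeric comparison and an LLM judge). The helper names `extract_boxed`, `extract_letter`, and `accuracy_by_version`, and the record key `prediction`, are hypothetical and chosen only for illustration. The sketch pulls a final answer out of a model response and aggregates exact-match accuracy per problem version.

```python
import re
from collections import defaultdict


def extract_boxed(text):
    # Return the content of the last \boxed{...} in text, or None.
    # Braces are counted so nested LaTeX such as \frac{1}{2} survives.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces


def extract_letter(text):
    # Return the last 'ANSWER: X' option letter in text, or None.
    matches = re.findall(r"ANSWER:\s*([A-D])", text)
    return matches[-1] if matches else None


def accuracy_by_version(records):
    # Exact-match accuracy grouped by the 'problem_version' field.
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        version = record["problem_version"]
        totals[version] += 1
        correct[version] += int(record["prediction"] == record["target"])
    return {version: correct[version] / totals[version] for version in totals}


# Hypothetical records, for illustration only.
records = [
    {"problem_version": "Text Dominant", "prediction": "D", "target": "D"},
    {"problem_version": "Text Dominant", "prediction": "B", "target": "D"},
    {"problem_version": "Vision Only", "prediction": "140", "target": "140"},
]
print(accuracy_by_version(records))
```

Exact string matching is the simplest possible comparison; equivalent answers written differently (for example `0.5` and `\frac{1}{2}`) would need the numeric comparison or judge step mentioned in the notes.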

Properties#

Property	Value
Benchmark Name	`math_verse`
Dataset ID	evalscope/MathVerse
Paper	N/A
Tags	`MCQ`, `Math`, `MultiModal`, `Reasoning`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`testmini`

Data Statistics#

Metric	Value
Total Samples	3,940
Prompt Length (Mean)	274.2 chars
Prompt Length (Min/Max)	70 / 1535 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean (chars)	Prompt Min (chars)	Prompt Max (chars)
`Text Dominant`	788	369.63	122	1535
`Text Lite`	788	294.77	78	1397
`Vision Intensive`	788	280.39	78	1350
`Vision Dominant`	788	272.11	78	1356
`Vision Only`	788	154.1	70	222
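
As a quick consistency check, the per-subset rows can be recombined into the overall figures. The sketch below uses only the numbers from the table above; the variable names are illustrative.

```python
# (samples, mean prompt length in chars) per subset, copied from the table.
subsets = {
    "Text Dominant": (788, 369.63),
    "Text Lite": (788, 294.77),
    "Vision Intensive": (788, 280.39),
    "Vision Dominant": (788, 272.11),
    "Vision Only": (788, 154.1),
}

total_samples = sum(count for count, _ in subsets.values())

# Sample-weighted mean of the per-subset prompt means.
weighted_mean = sum(count * mean for count, mean in subsets.values()) / total_samples

print(total_samples, round(weighted_mean, 1))
```

The total and the weighted mean agree with the 3,940 samples and 274.2-character mean reported in the overall statistics.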

Image Statistics:

Metric	Value
Total Images	3,940
Images per Sample	min: 1, max: 1, mean: 1
Resolution Range	63x70 - 6840x3549
Formats	jpeg, png

Sample Example#

Subset: Text Dominant

{
  "input": [
    {
      "id": "a3189330",
      "content": [
        {
          "text": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A, B, C, D. Think step by step before answering.\n\nAs shown in the figure, in triangle ABC, it is known that angle A = 80.0, angle B = 60.0, point D is on AB and point E is on AC, DE parallel BC, then the size of angle CED is ()\nChoices:\nA:40°\nB:60°\nC:120°\nD:140°"
        },
        {
          "image": "[BASE64_IMAGE: png, ~1.6KB]"
        }
      ]
    }
  ],
  "target": "D",
  "id": 0,
  "group_id": 0,
  "subset_key": "Text Dominant",
  "metadata": {
    "sample_index": "1",
    "problem_index": "1",
    "problem_version": "Text Dominant",
    "question_type": "multi-choice",
    "query_wo": "Please directly answer the question and provide the correct option letter, e.g., A, B, C, D.\nQuestion: As shown in the figure, in triangle ABC, it is known that angle A = 80.0, angle B = 60.0, point D is on AB and point E is on AC, DE parallel BC, then the size of angle CED is ()\nChoices:\nA:40°\nB:60°\nC:120°\nD:140°",
    "query_cot": "Please first conduct reasoning, and then answer the question and provide the correct option letter, e.g., A, B, C, D, at the end.\nQuestion: As shown in the figure, in triangle ABC, it is known that angle A = 80.0, angle B = 60.0, point D is on AB and point E is on AC, DE parallel BC, then the size of angle CED is ()\nChoices:\nA:40°\nB:60°\nC:120°\nD:140°",
    "question_for_eval": "As shown in the figure, in triangle ABC, it is known that angle A = 80.0, angle B = 60.0, point D is on AB and point E is on AC, DE parallel BC, then the size of angle CED is ()\nChoices:\nA:40°\nB:60°\nC:120°\nD:140°"
  }
}
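
A short sketch of how a record with this shape might be traversed, assuming it has been loaded into a plain Python dict. The `split_parts` helper is hypothetical, and the prompt text is shortened here for readability.

```python
# Trimmed copy of the sample above; only the fields used below are kept.
sample = {
    "input": [
        {
            "id": "a3189330",
            "content": [
                {"text": "Answer the following multiple choice question. ..."},
                {"image": "[BASE64_IMAGE: png, ~1.6KB]"},
            ],
        }
    ],
    "target": "D",
    "subset_key": "Text Dominant",
    "metadata": {
        "problem_version": "Text Dominant",
        "question_type": "multi-choice",
    },
}


def split_parts(record):
    # Separate the text and image parts of every message in 'input'.
    texts, images = [], []
    for message in record["input"]:
        for part in message["content"]:
            if "text" in part:
                texts.append(part["text"])
            elif "image" in part:
                images.append(part["image"])
    return texts, images


texts, images = split_parts(sample)
print(len(texts), "text part(s),", len(images), "image part(s)")
print(sample["metadata"]["question_type"], "->", sample["target"])
```

The target D is consistent with the geometry of the problem: angle C = 180° - 80° - 60° = 40°, and because DE is parallel to BC, angle CED = 180° - 40° = 140°.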

Prompt Template#

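Prompt template for free-form questions (the multiple-choice sample above uses an 'ANSWER: [LETTER]' instruction instead):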

{question}
Please reason step by step, and put your final answer within \boxed{{}}.
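
The doubled braces in the template suggest it is filled in with Python's `str.format`, where `{{}}` renders as a literal `{}`. That is an assumption about the implementation, not something stated on this page. The sketch below shows the rendering under that assumption, with a made-up question.

```python
# Template text copied from above; '{{}}' is an escaped literal '{}'.
PROMPT_TEMPLATE = (
    "{question}\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}."
)

# Hypothetical question, for illustration only.
prompt = PROMPT_TEMPLATE.format(question="What is the area of the shaded region?")
print(prompt)
```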

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets math_verse \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['math_verse'],
    dataset_args={
        'math_verse': {
            # 'subset_list': ['Text Dominant', 'Text Lite', 'Vision Intensive'],  # optional: uncomment to evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)