MathVerse

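Overview

MathVerse is a comprehensive visual math benchmark designed for fair and in-depth evaluation of multimodal large language models (MLLMs). It contains 2,612 high-quality, multi-subject math problems with diagrams. Each problem is rewritten into versions that shift information between the text and the image, giving roughly 15K test samples in total.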

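Task Description

  • Task type: visual mathematical reasoning

  • Input: a diagram image plus the question text (multiple-choice or free-form)

  • Output: the answer (an option letter for multiple-choice, a number or expression for free-form)

  • Domain: multi-subject mathematics with visual diagrams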

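Key Features

  • 2,612 problems, each expanded into 6 versions (about 15K samples in total)

  • Tests whether an MLLM actually understands the diagrams it needs for mathematical reasoning

  • The versions listed on this page, ordered by increasing reliance on visual information:

    • Text Dominant: most of the information is in the text

    • Text Lite: information is balanced between text and image

    • Vision Intensive: relies more heavily on the image

    • Vision Dominant: relies mainly on the image

    • Vision Only: all information is in the diagram

  • Supports both multiple-choice and free-form questions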

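Evaluation Notes

  • The default evaluation uses the testmini split

  • Primary metric: accuracy (acc), computed by numeric comparison of answers

  • Free-form answers are expected in \boxed{} format

  • An LLM judge is used to verify answers

  • Results are reported separately for each problem version to support finer-grained analysis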
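The free-form scoring described in these notes can be approximated as follows. This is a minimal sketch, not EvalScope's actual scorer: the helper names `extract_boxed` and `answers_match` and the tolerance value are assumptions made for illustration. A real run would hand answers that cannot be compared numerically to the LLM judge rather than rely on the plain string comparison used here.

```python
import math
from typing import Optional


def extract_boxed(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a response, or None."""
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in response[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: everything collected is the answer.
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never balanced


def answers_match(prediction: str, target: str, rel_tol: float = 1e-6) -> bool:
    """Compare numerically when both sides parse as floats, else as strings."""
    try:
        return math.isclose(float(prediction), float(target), rel_tol=rel_tol)
    except ValueError:
        # Hypothetical fallback; the benchmark uses an LLM judge for such cases.
        return prediction.strip() == target.strip()


print(extract_boxed("So the angle is \\boxed{140}."))
print(answers_match("140.0", "140"))
```

Tracking brace depth lets the extractor handle nested LaTeX such as \boxed{\frac{1}{2}}, which a simple non-greedy regular expression would cut short.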

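Properties

| Property | Value |
| --- | --- |
| Benchmark name | math_verse |
| Dataset ID | evalscope/MathVerse |
| Paper | N/A |
| Tags | MCQ, Math, MultiModal, Reasoning |
| Metrics | acc |
| Default shots | 0-shot |
| Evaluation split | testmini |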

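Data Statistics

| Metric | Value |
| --- | --- |
| Total samples | 3,940 |
| Prompt length (mean) | 274.2 characters |
| Prompt length (min / max) | 70 / 1535 characters |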

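Per-subset statistics:

| Subset | Samples | Mean prompt length | Min prompt length | Max prompt length |
| --- | --- | --- | --- | --- |
| Text Dominant | 788 | 369.63 | 122 | 1535 |
| Text Lite | 788 | 294.77 | 78 | 1397 |
| Vision Intensive | 788 | 280.39 | 78 | 1350 |
| Vision Dominant | 788 | 272.11 | 78 | 1356 |
| Vision Only | 788 | 154.1 | 70 | 222 |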
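As a quick consistency check, the per-subset figures can be recombined into the overall totals reported under Data Statistics. The sketch below only uses numbers copied from this page; the variable names are illustrative.

```python
# (sample count, mean prompt length) per subset, copied from the statistics above
subsets = {
    "Text Dominant": (788, 369.63),
    "Text Lite": (788, 294.77),
    "Vision Intensive": (788, 280.39),
    "Vision Dominant": (788, 272.11),
    "Vision Only": (788, 154.1),
}

total_samples = sum(count for count, _ in subsets.values())

# Weight each subset mean by its sample count to recover the overall mean.
overall_mean = sum(count * mean for count, mean in subsets.values()) / total_samples

print(total_samples)           # 3940
print(round(overall_mean, 1))  # 274.2
```

Both values agree with the reported total of 3,940 samples and the mean prompt length of 274.2 characters.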

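Image statistics:

| Metric | Value |
| --- | --- |
| Total images | 3,940 |
| Images per sample | min: 1, max: 1, mean: 1 |
| Resolution range | 63x70 - 6840x3549 |
| Formats | jpeg, png |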

Sample Example

Subset: Text Dominant

{
  "input": [
    {
      "id": "a3189330",
      "content": [
        {
          "text": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A, B, C, D. Think step by step before answering.\n\nAs shown in the figure, in triangle ABC, it is known that angle A = 80.0, angle B = 60.0, point D is on AB and point E is on AC, DE parallel BC, then the size of angle CED is ()\nChoices:\nA:40°\nB:60°\nC:120°\nD:140°"
        },
        {
          "image": "[BASE64_IMAGE: png, ~1.6KB]"
        }
      ]
    }
  ],
  "target": "D",
  "id": 0,
  "group_id": 0,
  "subset_key": "Text Dominant",
  "metadata": {
    "sample_index": "1",
    "problem_index": "1",
    "problem_version": "Text Dominant",
    "question_type": "multi-choice",
    "query_wo": "Please directly answer the question and provide the correct option letter, e.g., A, B, C, D.\nQuestion: As shown in the figure, in triangle ABC, it is known that angle A = 80.0, angle B = 60.0, point D is on AB and point E is on AC, DE parallel BC, then the size of angle CED is ()\nChoices:\nA:40°\nB:60°\nC:120°\nD:140°",
    "query_cot": "Please first conduct reasoning, and then answer the question and provide the correct option letter, e.g., A, B, C, D, at the end.\nQuestion: As shown in the figure, in triangle ABC, it is known that angle A = 80.0, angle B = 60.0, point D is on AB and point E is on AC, DE parallel BC, then the size of angle CED is ()\nChoices:\nA:40°\nB:60°\nC:120°\nD:140°",
    "question_for_eval": "As shown in the figure, in triangle ABC, it is known that angle A = 80.0, angle B = 60.0, point D is on AB and point E is on AC, DE parallel BC, then the size of angle CED is ()\nChoices:\nA:40°\nB:60°\nC:120°\nD:140°"
  }
}
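Records shaped like the sample above can be scored per problem version with a few lines of Python. This is a hedged sketch rather than EvalScope's scoring code: the `response` field holding the model output is hypothetical, as are the function names, and the regular expression assumes the 'ANSWER: [LETTER]' convention with options A to D requested in the sample prompt. The `target` and `subset_key` fields are taken from the sample record.

```python
import re
from collections import defaultdict

# Matches the final-line format requested in the multiple-choice prompt.
ANSWER_RE = re.compile(r"ANSWER:\s*([A-D])")


def extract_choice(response):
    """Return the last option letter announced in the response, or None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def accuracy_by_subset(records):
    """Compute accuracy separately for each value of `subset_key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        key = rec["subset_key"]
        total[key] += 1
        if extract_choice(rec["response"]) == rec["target"]:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


# Illustrative records; `response` values are made up for the example.
records = [
    {"subset_key": "Text Dominant", "target": "D", "response": "...\nANSWER: D"},
    {"subset_key": "Text Dominant", "target": "D", "response": "...\nANSWER: C"},
    {"subset_key": "Vision Only", "target": "B", "response": "...\nANSWER: B"},
]
print(accuracy_by_subset(records))
```

Grouping on `subset_key` mirrors the per-version reporting described in the evaluation notes, so a drop from Text Dominant to Vision Only is visible directly in the output.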

Prompt Template

Prompt template:

{question}
Please reason step by step, and put your final answer within \boxed{{}}.
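The doubled braces in \boxed{{}} suggest the template is filled in with Python's `str.format`, where `{{` and `}}` are escapes for literal braces. Assuming that is the case, the following sketch shows how a question is substituted; the constant name `PROMPT_TEMPLATE` and the example question are illustrative.

```python
# Template text copied from above; {{}} is the str.format escape for a literal {}.
PROMPT_TEMPLATE = (
    "{question}\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}."
)

prompt = PROMPT_TEMPLATE.format(question="What is the area of the shaded region?")
print(prompt)
```

After formatting, the model sees a single pair of braces, so its final answer is expected in the form \boxed{...}.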

Usage

Using the CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets math_verse \
    --limit 10  # remove this line for a full evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['math_verse'],
    dataset_args={
        'math_verse': {
            # 'subset_list': ['Text Dominant', 'Text Lite', 'Vision Intensive'],  # optional: evaluate only these subsets
        }
    },
    limit=10,  # remove this line for a full evaluation
)

run_task(task_cfg=task_cfg)