GSM8K-V

Overview

GSM8K-V is a purely visual, multi-image mathematical reasoning benchmark that systematically transforms each GSM8K math word problem into a visual counterpart. Because every visual item corresponds to one original text problem, results can be compared item by item across the text and visual modalities.

Task Description

Task Type: Visual Mathematical Word Problem Solving
Input: Multiple images representing a math problem + question
Output: Numerical answer in \boxed{} format
Domains: Visual math reasoning, scene understanding, arithmetic

Key Features

Visual transformation of GSM8K text problems
Multi-image input format for complete problem representation
Enables text-vs-visual modality comparisons
Categorized by problem type and subcategory
Preserves original problem difficulty levels

Evaluation Notes

Default evaluation uses the train split
Primary metric: Accuracy with numeric comparison
Answers should be in \boxed{} format with number only
Step-by-step reasoning encouraged before final answer
Metadata includes original text question for comparison
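
The scoring idea in the notes above (take the number inside \boxed{} and compare it numerically with the target) can be sketched as follows. This is a minimal illustration, not evalscope's actual scorer: the helper names `extract_boxed`, `to_number`, and `is_correct`, the choice of the last \boxed{} occurrence, and the tolerance are all assumptions made for this example.

```python
import math
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    # Return the content of the last \boxed{...} in the text, or None.
    # Assumption: nested braces inside \boxed{} are not handled.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def to_number(value: str) -> Optional[float]:
    # Parse a numeric string, ignoring thousands separators and a dollar sign.
    cleaned = value.replace(",", "").replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def is_correct(prediction: str, target: str, tol: float = 1e-6) -> bool:
    # Numeric comparison: "18" and "18.0" count as the same answer.
    boxed = extract_boxed(prediction)
    if boxed is None:
        return False
    pred, gold = to_number(boxed), to_number(target)
    if pred is None or gold is None:
        return False
    return math.isclose(pred, gold, rel_tol=0.0, abs_tol=tol)


print(is_correct(r"She sells 9 eggs at $2 each, so \boxed{18}", "18.0"))
```

Comparing parsed numbers rather than strings matters here because the sample target is stored as "18.0" while the prompt asks the model for an integer when possible.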

Properties

Property	Value
Benchmark Name	`gsm8k_v`
Dataset ID	evalscope/GSM8K-V
Paper	N/A
Tags	`Math`, `MultiModal`, `Reasoning`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`train`

Data Statistics

Metric	Value
Total Samples	1,319
Prompt Length (Mean)	492.4 chars
Prompt Length (Min/Max)	410 / 789 chars

Image Statistics:

Metric	Value
Total Images	5,344
Images per Sample	min: 2, max: 11, mean: 4.05
Resolution	1024x1024 (all images)
Formats	jpeg

Sample Example

Subset: default

{
  "input": [
    {
      "id": "9e0072fd",
      "content": [
        {
          "text": "You are an expert at solving mathematical word problems. Please solve the following problem step by step, showing your reasoning.\n\nWhen providing your final answer:\n- If the answer can be expressed as a whole number (integer), provide it as a ... [TRUNCATED] ... s Janet (white woman, shoulder-length brown hair, wearing a light blue blouse and khaki pants) make each day at the farmers' market?\n\nPlease think step by step. After your reasoning, put your final answer within \\boxed{} with the number only."
        },
        {
          "image": "[BASE64_IMAGE: jpeg, ~153.1KB]"
        },
        {
          "image": "[BASE64_IMAGE: jpeg, ~175.1KB]"
        },
        {
          "image": "[BASE64_IMAGE: jpeg, ~124.7KB]"
        },
        {
          "image": "[BASE64_IMAGE: jpeg, ~172.8KB]"
        },
        {
          "image": "[BASE64_IMAGE: jpeg, ~117.4KB]"
        }
      ]
    }
  ],
  "target": "18.0",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "index": 1,
    "category": "other",
    "subcategory": "count",
    "original_question": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?"
  }
}

Note: Some content was truncated for display.
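
To make the record layout above concrete, the sketch below separates the text part from the image parts of one sample. It assumes only the structure shown in the example (`input` is a list of messages, each with a `content` list of `{"text": ...}` or `{"image": ...}` parts); the function name `split_content` and the abbreviated sample values are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def split_content(sample: Dict[str, Any]) -> Tuple[str, List[str]]:
    # Collect all text parts and all image parts from a sample's messages.
    texts: List[str] = []
    images: List[str] = []
    for message in sample["input"]:
        for part in message["content"]:
            if "text" in part:
                texts.append(part["text"])
            elif "image" in part:
                images.append(part["image"])
    return "\n".join(texts), images


# Abbreviated stand-in for the record shown above (placeholder values only).
sample = {
    "input": [
        {
            "id": "9e0072fd",
            "content": [
                {"text": "You are an expert at solving mathematical word problems."},
                {"image": "<base64 jpeg 1>"},
                {"image": "<base64 jpeg 2>"},
            ],
        }
    ],
    "target": "18.0",
    "metadata": {"category": "other", "subcategory": "count"},
}

prompt_text, images = split_content(sample)
print(len(images))  # prints 2
```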

Prompt Template


You are an expert at solving mathematical word problems. Please solve the following problem step by step, showing your reasoning.

When providing your final answer:
- If the answer can be expressed as a whole number (integer), provide it as an integer
Problem: {question}

Please think step by step. After your reasoning, put your final answer within \boxed{{}} with the number only.
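
The doubled braces in `\boxed{{}}` suggest the template is filled with Python's `str.format`, where `{{}}` renders as a literal `{}`. That is an inference from the template text, not something the page confirms. A minimal sketch under that assumption, reproducing only the last two lines of the template (the name `build_prompt` is hypothetical):

```python
# Only the tail of the template is reproduced here; the leading
# instructions are omitted for brevity.
PROMPT_TEMPLATE = (
    "Problem: {question}\n\n"
    "Please think step by step. After your reasoning, put your final answer "
    "within \\boxed{{}} with the number only."
)


def build_prompt(question: str) -> str:
    # str.format substitutes {question} and collapses {{}} into {}.
    return PROMPT_TEMPLATE.format(question=question)


print(build_prompt("How many dollars does Janet make each day?"))
```

The rendered prompt ends with `\boxed{}`, which matches the text visible at the end of the sample example.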

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets gsm8k_v \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['gsm8k_v'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)