SimpleVQA

Overview

SimpleVQA is the first comprehensive multimodal benchmark for evaluating the factuality of multimodal large language models (MLLMs) when they answer short natural-language questions about images. It features high-quality, challenging queries whose reference answers are static and timeless.

Task Description

Task Type: Factual Visual Question Answering
Input: Image + factual question
Output: Short factual answer
Domains: Factuality, visual reasoning, knowledge recall

Key Features

Covers multiple tasks and scenarios
High-quality, challenging questions
Static and timeless reference answers (no temporal dependencies)
Straightforward evaluation methodology
Tests genuine factual knowledge beyond pattern matching

Evaluation Notes

Default evaluation uses the test split
Primary metric: Accuracy (`acc`), graded by an LLM judge
Three-grade evaluation: each answer is labeled CORRECT, INCORRECT, or NOT_ATTEMPTED
The LLM judge applies a detailed grading rubric for semantic matching
Rich metadata includes language, source, and atomic facts
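
As a rough illustration of how the three grades could roll up into the `acc` metric, the sketch below counts judge labels and reports the share graded CORRECT. The function name `summarize_grades` is hypothetical, and treating NOT_ATTEMPTED as not correct is an assumption; the benchmark's own aggregation may differ.

```python
from collections import Counter

# The three labels named in the evaluation notes.
GRADES = ("CORRECT", "INCORRECT", "NOT_ATTEMPTED")


def summarize_grades(grades):
    """Count judge labels and compute accuracy as CORRECT / total.

    Assumption: NOT_ATTEMPTED answers count against accuracy,
    exactly like INCORRECT ones.
    """
    unknown = set(grades) - set(GRADES)
    if unknown:
        raise ValueError(f"Unexpected grade labels: {sorted(unknown)}")

    counts = Counter(grades)
    total = len(grades)
    acc = counts["CORRECT"] / total if total else 0.0
    return {"acc": acc, "counts": {g: counts[g] for g in GRADES}}
```

Keeping the per-grade counts next to `acc` makes it possible to tell a model that answers wrongly from one that declines to answer.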

Properties

Property	Value
Benchmark Name	`simple_vqa`
Dataset ID	m-a-p/SimpleVQA
Paper	N/A
Tags	`MultiModal`, `QA`, `Reasoning`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics

Metric	Value
Total Samples	2,025
Prompt Length (Mean)	56.22 chars
Prompt Length (Min/Max)	27 / 1015 chars

Image Statistics:

Metric	Value
Total Images	2,025
Images per Sample	min: 1, max: 1, mean: 1
Resolution Range	106x56 - 5119x3413
Formats	jpeg, png

Sample Example

Subset: default

{
  "input": [
    {
      "id": "4340fc24",
      "content": [
        {
          "text": "Answer the question:\n\n图中所示穴位所属的经脉是什么？"
        },
        {
          "image": "[BASE64_IMAGE: jpeg, ~26.5KB]"
        }
      ]
    }
  ],
  "target": "足阳明胃经",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "data_id": 0,
    "image_description": "",
    "language": "CN",
    "original_category": "中华文化_中医",
    "source": "https://baike.baidu.com/item/%E4%BC%8F%E5%85%94%E7%A9%B4/3503684#:~:text\\u003d%E4%BA%BA%E4%BD%93%E7%A9%B4%E4%BD%8D%E5%90%8D%E4%BC%8F%E5%85%94%E7%A9%B4F%C3%BA%20t%C3%B9%EF%BC%88ST32%EF%BC%89%E5%B1%9E%E8%B6%B3%E9%98%B3%E6%98%8E%E8%83%83%E7%BB%8 ... [TRUNCATED] ... 4%BE%A7%E7%AB%AF%E7%9A%84%E8%BF%9E%E7%BA%BF%E4%B8%8A%EF%BC%8C%E9%AB%8C%E9%AA%A8%E4%B8%8A%E7%BC%98%E4%B8%8A6%E5%AF%B8%E3%80%82%E4%BC%8F%E5%85%94%E5%88%AB%E5%90%8D%E5%A4%96%E4%B8%98%E3%80%81%E5%A4%96%E5%8B%BE%EF%BC%8C%E4%BD%8D%E4%BA%8E%E5%A4%A7",
    "atomic_question": "图中所示穴位的名称是什么？",
    "atomic_fact": "伏兔"
  }
}

Note: Some content was truncated for display.
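
To make the record layout above concrete, here is a minimal sketch that pulls the prompt text, the number of images, and the reference answer out of one sample. The key names come from the example record; the helper `split_sample` is hypothetical, and the sketch assumes every record holds a single input message as shown.

```python
def split_sample(sample):
    """Return (prompt_text, image_count, target) for one record.

    Assumes the layout of the example record: one message under
    "input", whose "content" is a list of {"text": ...} and
    {"image": ...} parts.
    """
    parts = sample["input"][0]["content"]
    prompt_text = "".join(p["text"] for p in parts if "text" in p)
    image_count = sum(1 for p in parts if "image" in p)
    return prompt_text, image_count, sample["target"]


# Abbreviated copy of the example record; the image is a placeholder string.
example = {
    "input": [
        {
            "id": "4340fc24",
            "content": [
                {"text": "Answer the question:\n\n图中所示穴位所属的经脉是什么？"},
                {"image": "[BASE64_IMAGE: jpeg, ~26.5KB]"},
            ],
        }
    ],
    "target": "足阳明胃经",
    "metadata": {"language": "CN", "atomic_fact": "伏兔"},
}

prompt_text, image_count, target = split_sample(example)
```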

Prompt Template

Prompt Template:

Answer the question:

{question}
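
As a small sketch of how the template above is filled in, the snippet below substitutes a question into the `{question}` placeholder. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical; the blank line between the instruction and the question follows the text shown in the sample record.

```python
# Template text as shown above, with a blank line before the question.
PROMPT_TEMPLATE = "Answer the question:\n\n{question}"


def build_prompt(question: str) -> str:
    """Insert the question into the template's {question} placeholder."""
    return PROMPT_TEMPLATE.format(question=question)


prompt = build_prompt("图中所示穴位所属的经脉是什么？")
```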

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets simple_vqa \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['simple_vqa'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)