WorldVQA#

Overview#

WorldVQA is a benchmark designed to evaluate the atomic visual world knowledge of Multimodal Large Language Models (MLLMs). It measures models’ ability to ground and name visual entities across a stratified taxonomy, spanning from common head-class objects to long-tail rarities.

Task Description#

Task Type: Visual Entity Recognition / Knowledge QA
Input: Image + question asking to identify a visual entity
Output: Free-form text answer (specific entity name)
Domain: Nature, architecture, culture, products, transportation, entertainment, brands, sports

Key Features#

3000 VQA pairs across 8 semantic categories
Bilingual: English (non-zh) and Chinese (zh)
Three difficulty levels: easy, medium, hard
Tests atomic visual knowledge decoupled from reasoning
Requires precise entity identification (e.g., specific breed, not generic “dog”)

Evaluation Notes#

Default configuration uses 0-shot evaluation
Evaluates on train split (the benchmark data split)
Primary metric: Accuracy via LLM-as-judge
Supports LLM judge for semantic equivalence checking
Results reported per category and overall

Properties#

Property	Value
Benchmark Name	`world_vqa`
Dataset ID	evalscope/WorldVQA
Paper	N/A
Tags	`Knowledge`, `MultiModal`, `QA`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`train`

Data Statistics#

Metric	Value
Total Samples	3,000
Prompt Length (Mean)	38.4 chars
Prompt Length (Min/Max)	5 / 182 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean	Prompt Min	Prompt Max
`Nature & Environment`	326	33.9	7	83
`Locations & Architecture`	512	42.03	9	122
`Culture, Arts & Crafts`	506	33.71	7	112
`Objects & Products`	437	55.97	9	182
`Vehicles, Craft & Transportation`	306	30.67	8	114
`Entertainment, Media & Gaming`	511	30.5	5	83
`Brands, Logos & Graphic Design`	260	39.1	6	83
`Sports, Gear & Venues`	142	41.99	9	100

Image Statistics:

Metric	Value
Total Images	3,000
Images per Sample	min: 1, max: 1, mean: 1
Resolution Range	33x65 - 3840x3840
Formats	gif, jpeg, png, webp

Sample Example#

Subset: Nature & Environment

{
  "input": [
    {
      "id": "0c08c4e2",
      "content": [
        {
          "text": "What breed of dog is in the picture?"
        },
        {
          "image": "[BASE64_IMAGE: png, ~392.9KB]"
        }
      ]
    }
  ],
  "target": "Greek Hound",
  "id": 0,
  "group_id": 0,
  "subset_key": "Nature & Environment",
  "metadata": {
    "index": 0,
    "category": "Nature & Environment",
    "language": "non-zh",
    "difficulty": "medium"
  }
}

Prompt Template#

Prompt Template:

{question}

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets world_vqa \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['world_vqa'],
    dataset_args={
        'world_vqa': {
            # subset_list: ['Nature & Environment', 'Locations & Architecture', 'Culture, Arts & Crafts']  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)