BabyVision#

Overview#

BabyVision is a visual perception benchmark that evaluates the fundamental visual abilities of multimodal large language models through tasks inspired by infant and early childhood visual development. It focuses on fine-grained discrimination, spatial perception, visual pattern recognition, and visual tracking.

Task Description#

  • Task Type: Visual Perception (Choice + Fill-in-the-blank)

  • Input: Image + question

  • Output: Choice letter or free-form short answer

  • Domains: Fine-grained discrimination, spatial perception, visual pattern recognition, visual tracking

Key Features#

  • 388 samples across 4 major visual ability categories and 22 subtypes

  • Two answer types: choice (135 samples) and blank (253 samples)

  • Subtypes include: Find the different, Find the same, Count clusters, Maze, 3D cube unfold, Pattern completion, Paper folding, Rotation patterns, etc.

  • Tests low-level visual perception rather than high-level reasoning or knowledge

  • Includes a Chain-of-Thought (CoT) reference for each sample (the coT metadata field) for analysis

Evaluation Notes#

  • Default evaluation uses the train split, which is the dataset's only split (388 samples)

  • Primary metric: Accuracy via LLM-as-judge

  • Subsets are organized by the type metadata field (4 categories)

  • The LLM judge scores choice and blank answers in the same way

  • Requires a judge_model_args configuration for the LLM judge
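To illustrate how accuracy could be broken down by the type field, the sketch below aggregates per-sample judge verdicts into per-subset and overall accuracy. The record layout (`type` and `correct` keys), the function name `accuracy_by_type`, and the demo verdicts are assumptions made for illustration; this is not evalscope's internal implementation.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Aggregate judged results into per-subset and overall accuracy.

    `records` is a hypothetical list of dicts with a `type` key (the subset
    name) and a boolean `correct` key (the judge verdict for that sample).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec["type"]] += 1
        hits[rec["type"]] += int(rec["correct"])
    per_subset = {name: hits[name] / totals[name] for name in totals}
    overall = sum(hits.values()) / sum(totals.values()) if totals else 0.0
    return per_subset, overall


# Made-up verdicts, for illustration only.
demo = [
    {"type": "Visual Tracking", "correct": True},
    {"type": "Visual Tracking", "correct": False},
    {"type": "Spatial Perception", "correct": True},
]
per_subset, overall = accuracy_by_type(demo)
print(per_subset, round(overall, 3))
```

Because the subsets differ in size (51 to 163 samples), the overall figure computed this way weights each sample equally rather than each subset equally.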

Properties#

| Property | Value |
|---|---|
| Benchmark Name | baby_vision |
| Dataset ID | evalscope/BabyVision |
| Paper | N/A |
| Tags | MultiModal, QA, Reasoning |
| Metrics | acc |
| Default Shots | 0-shot |
| Evaluation Split | train |

Data Statistics#

| Metric | Value |
|---|---|
| Total Samples | 388 |
| Prompt Length (Mean) | 167.37 chars |
| Prompt Length (Min/Max) | 33 / 450 chars |

Per-Subset Statistics:

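| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
|---|---|---|---|---|
| Fine-grained Discrimination | 163 | 152.09 | 33 | 450 |
| Spatial Perception | 91 | 157.18 | 73 | 370 |
| Visual Pattern Recognition | 51 | 178.92 | 94 | 319 |
| Visual Tracking | 83 | 201.45 | 97 | 389 |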
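As a quick consistency check on the per-subset figures, the sample counts should sum to the dataset total, and the sample-weighted average of the subset means should reproduce the overall mean prompt length. The short sketch below does that arithmetic using only the numbers listed above.

```python
# (samples, mean prompt length in chars) per subset, copied from the table above.
subsets = {
    "Fine-grained Discrimination": (163, 152.09),
    "Spatial Perception": (91, 157.18),
    "Visual Pattern Recognition": (51, 178.92),
    "Visual Tracking": (83, 201.45),
}

total_samples = sum(n for n, _ in subsets.values())
# Weight each subset mean by its sample count before averaging.
weighted_mean = sum(n * m for n, m in subsets.values()) / total_samples

print(total_samples, round(weighted_mean, 2))  # prints: 388 167.37
```

Both values match the overall statistics (388 samples, 167.37 chars).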

Image Statistics:

| Metric | Value |
|---|---|
| Total Images | 388 |
| Images per Sample | min: 1, max: 1, mean: 1 |
| Resolution Range | 174x144 - 2378x1448 |
| Formats | jpeg, png, webp |

Sample Example#

Subset: Fine-grained Discrimination

{
  "input": [
    {
      "id": "8b83904b",
      "content": [
        {
          "image": "[BASE64_IMAGE: jpeg, ~77.0KB]"
        },
        {
          "text": "The image shows a total of 49 tiger patterns arranged in 7 rows and 7 columns. One of them is different from the others. Which row and column is it in? The answer format is (x,y). (For example, the answer for the 2nd row and 3rd column is (2,3))."
        }
      ]
    }
  ],
  "target": "(4,7)",
  "id": 0,
  "group_id": 0,
  "subset_key": "Fine-grained Discrimination",
  "metadata": {
    "taskId": 445,
    "type": "Fine-grained Discrimination",
    "subtype": "Find the different",
    "ansType": "blank",
    "coT": "The image shows 49 tiger patterns arranged in 7 rows and 7 columns.\nNow, we need to find the coordinates of the one tiger pattern that is different from the other 48.\nIt can be observed that the tiger in the fourth row and seventh column has no ears (the ears are located in the upper right corner of each tiger pattern), while the other 48 tigers have ears.\nTherefore, the correct answer is (4,7)."
  }
}
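The record above nests the image and the question text as separate parts of the first input message. The sketch below shows one way to walk that structure, using an abbreviated copy of the sample. The helper names `split_content` and `parse_coordinate` are hypothetical, and the coordinate parser is only a local sanity check for "(x,y)" style targets; the benchmark itself scores answers with an LLM judge.

```python
import re

# Abbreviated copy of the sample record shown above (question text shortened).
sample = {
    "input": [
        {
            "id": "8b83904b",
            "content": [
                {"image": "[BASE64_IMAGE: jpeg, ~77.0KB]"},
                {"text": "Which row and column is it in? The answer format is (x,y)."},
            ],
        }
    ],
    "target": "(4,7)",
    "metadata": {"type": "Fine-grained Discrimination", "ansType": "blank"},
}


def split_content(record):
    """Return (text parts, image parts) from the first input message."""
    parts = record["input"][0]["content"]
    texts = [p["text"] for p in parts if "text" in p]
    images = [p["image"] for p in parts if "image" in p]
    return texts, images


def parse_coordinate(answer):
    """Parse an '(x,y)' style answer into a tuple of ints, or None if it does not match."""
    match = re.fullmatch(r"\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*", answer)
    return (int(match.group(1)), int(match.group(2))) if match else None


texts, images = split_content(sample)
print(len(texts), len(images), sample["metadata"]["ansType"])
print(parse_coordinate(sample["target"]))
```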

Prompt Template#

No prompt template defined.

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets baby_vision \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['baby_vision'],
    dataset_args={
        'baby_vision': {
            # 'subset_list': ['Fine-grained Discrimination', 'Spatial Perception', 'Visual Pattern Recognition'],  # optional: evaluate only these subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
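The Evaluation Notes state that the LLM judge requires a judge_model_args configuration, which the example above does not set. The sketch below assembles such a dictionary. The key names (`model_id`, `api_url`, `api_key`) are assumed to match evalscope's judge settings and should be checked against the installed version; the helper `build_judge_args`, the `JUDGE_API_KEY` environment variable, and the placeholder values are hypothetical.

```python
import os


def build_judge_args(model_id, api_url, api_key_env="JUDGE_API_KEY"):
    """Assemble a judge_model_args-style dict (key names are assumptions)."""
    return {
        "model_id": model_id,
        "api_url": api_url,
        # Read the key from the environment; fall back to a placeholder token.
        "api_key": os.environ.get(api_key_env, "EMPTY"),
    }


judge_model_args = build_judge_args("YOUR_JUDGE_MODEL", "OPENAI_API_COMPAT_URL")

# Assumed usage, mirroring the TaskConfig call above:
# task_cfg = TaskConfig(..., judge_model_args=judge_model_args)
print(sorted(judge_model_args))
```

Keeping the judge credentials in an environment variable avoids hard-coding them next to the placeholders for the model under test.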