ZeroBench#

Overview#

ZeroBench is a challenging visual reasoning benchmark for Large Multimodal Models (LMMs). It consists of 100 high-quality, manually curated questions covering numerous domains, reasoning types, and image types designed to be beyond current model capabilities.

Task Description#

Task Type: Advanced Visual Reasoning
Input: One or more images + challenging visual reasoning question
Output: Step-by-step reasoning with final answer in curly braces
Domains: Visual reasoning, perception, multi-step inference

Key Features#

100 manually curated high-quality questions
Designed to challenge frontier models (zero pass@1 with greedy decoding)
Covers diverse domains, reasoning types, and image types
No model achieves 5/5 reliability score
Tests limits of current visual reasoning capabilities

Evaluation Notes#

Default evaluation uses the zerobench split
Primary metric: Accuracy with LLM judge
Answers must be in format: {final answer}
Includes subquestions split for detailed analysis
Uses image compression to handle large images

Properties#

Property	Value
Benchmark Name	`zerobench`
Dataset ID	evalscope/zerobench
Paper	N/A
Tags	`Knowledge`, `MultiModal`, `QA`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`zerobench`
Train Split	`zerobench_subquestions`

Data Statistics#

Metric	Value
Total Samples	100
Prompt Length (Mean)	645.72 chars
Prompt Length (Min/Max)	139 / 1998 chars

Image Statistics:

Metric	Value
Total Images	108
Images per Sample	min: 1, max: 3, mean: 1.08
Resolution Range	512x297 - 5559x4070
Formats	jpeg, png

Sample Example#

Subset: default

{
  "input": [
    {
      "id": "f3276b25",
      "content": [
        {
          "text": "I want to purchase all the Montellier bottles from the top three shelves. How much do I save by purchasing the bottles with a loyalty card? Give your final answer in dollars.\n\n\n\nLet's think step by step and give the final answer in curly braces,\nlike this: {final answer}\"\n"
        },
        {
          "image": "[BASE64_IMAGE: png, ~462.4KB]"
        }
      ]
    }
  ],
  "target": "11.90",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "question_id": "1",
    "question_images": [
      "images/1_0.png"
    ],
    "image_attribution": "Own"
  }
}

Prompt Template#

Prompt Template:

{question}

Let's think step by step and give the final answer in curly braces,
like this: {{final answer}}"

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets zerobench \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['zerobench'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)