ZeroBench#
Overview#
ZeroBench is a challenging visual reasoning benchmark for Large Multimodal Models (LMMs). It consists of 100 high-quality, manually curated questions that span numerous domains, reasoning types, and image types, and that are designed to be beyond the capabilities of current models.
Task Description#
Task Type: Advanced Visual Reasoning
Input: One or more images plus a challenging visual reasoning question
Output: Step-by-step reasoning with final answer in curly braces
Domains: Visual reasoning, perception, multi-step inference
Key Features#
100 manually curated high-quality questions
Designed to challenge frontier models (zero pass@1 with greedy decoding)
Covers diverse domains, reasoning types, and image types
No model achieves a 5/5 reliability score
Tests limits of current visual reasoning capabilities
Evaluation Notes#
Default evaluation uses the zerobench split
Primary metric: Accuracy with LLM judge
Answers must be in format: {final answer}
Includes subquestions split for detailed analysis
Uses image compression to handle large images
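The curly-brace answer format noted above can be illustrated with a small parsing sketch. This is not EvalScope's scoring code (the primary metric relies on an LLM judge); `extract_final_answer` is a hypothetical helper, and taking the last braced span in the response is an assumption made for illustration.

```python
import re
from typing import Optional

# Hypothetical helper, not part of EvalScope.
# Matches a brace pair that contains no nested braces.
_BRACED = re.compile(r"\{([^{}]*)\}")


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text inside the last {...} span, or None if there is none."""
    matches = _BRACED.findall(response)
    if not matches:
        return None
    # Assumption: the final answer is the last braced span in the response.
    return matches[-1].strip()


if __name__ == "__main__":
    reply = "Step 1: read the price tags. Step 2: sum the savings. {11.90}"
    print(extract_final_answer(reply))
```

A helper like this is only useful for quick local sanity checks, since free-form answers may need the judge model to decide equivalence.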
Properties#
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | |
| Train Split | |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 100 |
| Prompt Length (Mean) | 645.72 chars |
| Prompt Length (Min/Max) | 139 / 1998 chars |
Image Statistics:
| Metric | Value |
|---|---|
| Total Images | 108 |
| Images per Sample | min: 1, max: 3, mean: 1.08 |
| Resolution Range | 512x297 - 5559x4070 |
| Formats | jpeg, png |
Sample Example#
Subset: default
{
  "input": [
    {
      "id": "f3276b25",
      "content": [
        {
          "text": "I want to purchase all the Montellier bottles from the top three shelves. How much do I save by purchasing the bottles with a loyalty card? Give your final answer in dollars.\n\n\n\nLet's think step by step and give the final answer in curly braces,\nlike this: {final answer}\"\n"
        },
        {
          "image": "[BASE64_IMAGE: png, ~462.4KB]"
        }
      ]
    }
  ],
  "target": "11.90",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "question_id": "1",
    "question_images": [
      "images/1_0.png"
    ],
    "image_attribution": "Own"
  }
}
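As a rough illustration of how a record shaped like the sample above could be unpacked, the sketch below separates the text parts from the image parts of each message. The field names (`input`, `content`, `text`, `image`, `target`) are taken from the sample; `split_content` is a hypothetical helper, and the assumption that every content part carries exactly one of `text` or `image` is not confirmed by the source.

```python
from typing import Any, Dict, List, Tuple


def split_content(sample: Dict[str, Any]) -> Tuple[str, List[str]]:
    """Collect all text parts (joined by newlines) and all image payloads."""
    texts: List[str] = []
    images: List[str] = []
    for message in sample["input"]:
        for part in message["content"]:
            # Assumption: each part holds either a "text" or an "image" key.
            if "text" in part:
                texts.append(part["text"])
            elif "image" in part:
                images.append(part["image"])
    return "\n".join(texts), images


# Abbreviated stand-in for the sample record shown above.
sample = {
    "input": [
        {
            "id": "f3276b25",
            "content": [
                {"text": "How much do I save by purchasing the bottles with a loyalty card?"},
                {"image": "[BASE64_IMAGE: png, ~462.4KB]"},
            ],
        }
    ],
    "target": "11.90",
}

if __name__ == "__main__":
    text, images = split_content(sample)
    print(len(images), "image(s); target =", sample["target"])
```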
Prompt Template#
Prompt Template:
{question}
Let's think step by step and give the final answer in curly braces,
like this: {{final answer}}"
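The doubled braces in the template suggest it is filled with Python-style string formatting, where `{{` and `}}` render as literal braces. The sketch below shows that behaviour under this assumption; `PROMPT_TEMPLATE` and `build_prompt` are illustrative names, the trailing double quote is reproduced as it appears in the template, and the exact whitespace between the question and the instruction is assumed.

```python
# Illustrative names; the template text follows the page, whitespace is assumed.
PROMPT_TEMPLATE = (
    "{question}\n"
    "Let's think step by step and give the final answer in curly braces,\n"
    'like this: {{final answer}}"'
)


def build_prompt(question: str) -> str:
    """Fill the template; '{{' and '}}' become literal '{' and '}'."""
    return PROMPT_TEMPLATE.format(question=question)


if __name__ == "__main__":
    print(build_prompt("How many bottles are on the top shelf?"))
```

Because only the template string is parsed by `str.format`, braces that happen to appear inside the question text are passed through unchanged.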
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets zerobench \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['zerobench'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)