SEED-Bench-2-Plus#
Overview#
SEED-Bench-2-Plus is a large-scale benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on text-rich visual understanding tasks. It contains 2.3K multiple-choice questions with precise human annotations across real-world scenarios.
Task Description#
Task Type: Text-Rich Visual Question Answering
Input: Image containing text-rich content + multiple-choice question
Output: Correct answer choice letter (A/B/C/D)
Domains: Charts, maps, web interfaces
Key Features#
Focuses on text-rich visual scenarios common in real applications
Three broad categories: Charts, Maps, and Webs
Human-annotated questions with high quality
Tests understanding of complex visual layouts with text
Multiple difficulty levels within each category
Evaluation Notes#
Default evaluation uses the test split
Available subsets:
chart,web,mapPrimary metric: Accuracy on multiple-choice questions
Uses Chain-of-Thought (CoT) prompting for reasoning
Rich metadata including data source, type, and difficulty level
Properties#
Property |
Value |
|---|---|
Benchmark Name |
|
Dataset ID |
|
Paper |
N/A |
Tags |
|
Metrics |
|
Default Shots |
0-shot |
Evaluation Split |
|
Data Statistics#
Metric |
Value |
|---|---|
Total Samples |
2,277 |
Prompt Length (Mean) |
393.01 chars |
Prompt Length (Min/Max) |
280 / 710 chars |
Per-Subset Statistics:
Subset |
Samples |
Prompt Mean |
Prompt Min |
Prompt Max |
|---|---|---|---|---|
|
810 |
390.92 |
280 |
663 |
|
660 |
395.56 |
294 |
664 |
|
807 |
393.02 |
282 |
710 |
Image Statistics:
Metric |
Value |
|---|---|
Total Images |
2,277 |
Images per Sample |
min: 1, max: 1, mean: 1 |
Resolution Range |
800x800 - 800x800 |
Formats |
png |
Sample Example#
Subset: chart
{
"input": [
{
"id": "7fdf638f",
"content": [
{
"text": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C,D. Think step by step before answering.\n\nAccording to the tree diagram, how many women who have breast cancer received a negative mammogram?\n\nA) 80\nB) 950\nC) 20\nD) 8,950"
},
{
"image": "[BASE64_IMAGE: png, ~68.1KB]"
}
]
}
],
"choices": [
"80",
"950",
"20",
"8,950"
],
"target": "D",
"id": 0,
"group_id": 0,
"subset_key": "chart",
"metadata": {
"data_id": "text_rich/1.png",
"question_id": "0",
"question_image_subtype": "tree diagram",
"data_source": "SEED-Bench v2 plus",
"data_type": "Single Image",
"level": "L1",
"subpart": "Single-Image & Text Comprehension",
"version": "v2+"
}
}
Prompt Template#
Prompt Template:
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.
{question}
{choices}
Usage#
Using CLI#
evalscope eval \
--model YOUR_MODEL \
--api-url OPENAI_API_COMPAT_URL \
--api-key EMPTY_TOKEN \
--datasets seed_bench_2_plus \
--limit 10 # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
model='YOUR_MODEL',
api_url='OPENAI_API_COMPAT_URL',
api_key='EMPTY_TOKEN',
datasets=['seed_bench_2_plus'],
dataset_args={
'seed_bench_2_plus': {
# subset_list: ['chart', 'web', 'map'] # optional, evaluate specific subsets
}
},
limit=10, # Remove this line for formal evaluation
)
run_task(task_cfg=task_cfg)