SEED-Bench-2-Plus

Overview

SEED-Bench-2-Plus is a benchmark for evaluating Multimodal Large Language Models (MLLMs) on text-rich visual understanding tasks. It contains 2,277 (about 2.3K) human-annotated multiple-choice questions drawn from real-world scenarios.

Task Description

  • Task Type: Text-Rich Visual Question Answering

  • Input: Image containing text-rich content + multiple-choice question

  • Output: Correct answer choice letter (A/B/C/D)

  • Domains: Charts, maps, web interfaces

Key Features

  • Focuses on text-rich visual scenarios common in real applications

  • Three broad categories: Charts, Maps, and Webs

  • High-quality, human-annotated questions

  • Tests understanding of complex visual layouts with text

  • Multiple difficulty levels within each category

Evaluation Notes

  • Default evaluation uses the test split

  • Available subsets: chart, web, map

  • Primary metric: Accuracy on multiple-choice questions

  • Uses Chain-of-Thought (CoT) prompting for reasoning

  • Each sample carries metadata including data source, data type, and difficulty level
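The accuracy metric over the three subsets can be illustrated with a minimal sketch. The function name `accuracy_by_subset`, the record layout, and the choice of a micro-averaged overall score are assumptions made for illustration; they are not taken from the EvalScope implementation, and the records below are made-up toy data.

```python
from collections import defaultdict


def accuracy_by_subset(records):
    """Compute per-subset and overall accuracy.

    records: iterable of (subset, predicted_letter, target_letter) tuples.
    A missing prediction (None) simply counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subset, predicted, target in records:
        total[subset] += 1
        correct[subset] += int(predicted == target)
    per_subset = {name: correct[name] / total[name] for name in total}
    # Overall score here is micro-averaged: every sample has equal weight.
    overall = sum(correct.values()) / sum(total.values())
    return per_subset, overall


# Hypothetical toy predictions, not real benchmark results.
toy_records = [
    ("chart", "A", "A"),
    ("chart", "B", "C"),
    ("web", "D", "D"),
    ("map", None, "B"),
]
per_subset, overall = accuracy_by_subset(toy_records)
print(per_subset, overall)
```

Because the subsets differ in size (810, 660, and 807 samples), a micro-averaged score and a plain mean of the three per-subset accuracies will generally differ slightly.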

Properties

| Property | Value |
| --- | --- |
| Benchmark Name | seed_bench_2_plus |
| Dataset ID | evalscope/SEED-Bench-2-Plus |
| Paper | N/A |
| Tags | Knowledge, MCQ, MultiModal, Reasoning |
| Metrics | acc |
| Default Shots | 0-shot |
| Evaluation Split | test |

Data Statistics

| Metric | Value |
| --- | --- |
| Total Samples | 2,277 |
| Prompt Length (Mean) | 393.01 chars |
| Prompt Length (Min/Max) | 280 / 710 chars |

Per-Subset Statistics:

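| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
| --- | --- | --- | --- | --- |
| chart | 810 | 390.92 | 280 | 663 |
| web | 660 | 395.56 | 294 | 664 |
| map | 807 | 393.02 | 282 | 710 |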
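As a quick consistency check, the overall figures can be recovered from the per-subset rows: the sample counts sum to the total, and the sample-weighted average of the subset means gives the overall mean prompt length. The short sketch below only redoes that arithmetic with the numbers listed above; the variable names are illustrative.

```python
# (samples, mean prompt length in chars) per subset, copied from the statistics above.
subsets = {
    "chart": (810, 390.92),
    "web": (660, 395.56),
    "map": (807, 393.02),
}

total_samples = sum(n for n, _ in subsets.values())
# Weight each subset mean by its sample count before averaging.
weighted_mean = sum(n * mean for n, mean in subsets.values()) / total_samples

print(total_samples, round(weighted_mean, 2))  # prints: 2277 393.01
```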

Image Statistics:

| Metric | Value |
| --- | --- |
| Total Images | 2,277 |
| Images per Sample | min: 1, max: 1, mean: 1 |
| Resolution Range | 800x800 - 800x800 |
| Formats | png |

Sample Example

Subset: chart

{
  "input": [
    {
      "id": "7fdf638f",
      "content": [
        {
          "text": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C,D. Think step by step before answering.\n\nAccording to the tree diagram, how many women who have breast cancer received a negative mammogram?\n\nA) 80\nB) 950\nC) 20\nD) 8,950"
        },
        {
          "image": "[BASE64_IMAGE: png, ~68.1KB]"
        }
      ]
    }
  ],
  "choices": [
    "80",
    "950",
    "20",
    "8,950"
  ],
  "target": "D",
  "id": 0,
  "group_id": 0,
  "subset_key": "chart",
  "metadata": {
    "data_id": "text_rich/1.png",
    "question_id": "0",
    "question_image_subtype": "tree diagram",
    "data_source": "SEED-Bench v2 plus",
    "data_type": "Single Image",
    "level": "L1",
    "subpart": "Single-Image & Text Comprehension",
    "version": "v2+"
  }
}
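To show how the fields of such a record relate to one another, here is a small sketch that loads a trimmed copy of the sample and maps the `target` letter back to its entry in `choices`. The helper `target_text` is hypothetical, and the trimmed record omits the `input` messages and the base64 image payload.

```python
import json

# Trimmed copy of the sample record; "input" and the image are left out.
record = json.loads("""
{
  "choices": ["80", "950", "20", "8,950"],
  "target": "D",
  "subset_key": "chart",
  "metadata": {"question_image_subtype": "tree diagram", "level": "L1"}
}
""")


def target_text(rec):
    """Map the target letter to its choice text (A -> index 0, B -> 1, ...)."""
    index = ord(rec["target"]) - ord("A")
    return rec["choices"][index]


print(record["subset_key"], record["metadata"]["level"], target_text(record))
# prints: chart L1 8,950
```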

Prompt Template


Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}
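The template can be filled and its `ANSWER: [LETTER]` convention parsed roughly as follows. This is a minimal sketch: `build_prompt` and `extract_answer` are hypothetical helpers, and the lettering format (`A) ...`) and the comma-joined letter list are inferred from the sample prompt shown earlier, not from the EvalScope source.

```python
import re

# Template text copied from the section above.
PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'ANSWER: [LETTER]' (without "
    "quotes) where [LETTER] is one of {letters}. Think step by step before "
    "answering.\n\n{question}\n\n{choices}"
)


def build_prompt(question, choices):
    """Fill the template, labelling the choices A), B), C), ... in order."""
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    lettered = "\n".join(f"{letter}) {text}" for letter, text in zip(letters, choices))
    return PROMPT_TEMPLATE.format(
        letters=",".join(letters), question=question, choices=lettered
    )


def extract_answer(response, letters="ABCD"):
    """Return the letter from the last 'ANSWER: X' in the response, or None."""
    matches = re.findall(rf"ANSWER:\s*([{letters}])\b", response)
    return matches[-1] if matches else None


prompt = build_prompt(
    "According to the tree diagram, how many women who have breast cancer "
    "received a negative mammogram?",
    ["80", "950", "20", "8,950"],
)
print(prompt)
print(extract_answer("The diagram shows ...\nANSWER: D"))  # prints: D
```

Taking the last match rather than the first is a design choice for Chain-of-Thought outputs, where a model may mention a tentative answer before stating its final one.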

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets seed_bench_2_plus \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['seed_bench_2_plus'],
    dataset_args={
        'seed_bench_2_plus': {
            # 'subset_list': ['chart', 'web', 'map'],  # optional: evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)