POPE#

Overview#

POPE (Polling-based Object Probing Evaluation) is a benchmark specifically designed to evaluate object hallucination in Large Vision-Language Models (LVLMs). It tests models’ ability to accurately identify objects present in images through yes/no questions.

Task Description#

  • Task Type: Object Hallucination Detection (Yes/No Q&A)

  • Input: Image with question “Is there a [object] in the image?”

  • Output: YES or NO answer

  • Focus: Measuring accuracy vs. hallucination rate

Key Features#

  • Three sampling strategies: random, popular, adversarial

  • Tests for false positive object claims (hallucination)

  • Based on MSCOCO images

  • Simple yes/no question format for objective evaluation

  • Measures alignment between model responses and visual content

Evaluation Notes#

  • Default configuration uses 0-shot evaluation

  • Five metrics: accuracy, precision, recall, F1 score, yes_ratio

  • F1 score is the primary aggregation metric

  • Three subsets: popular, adversarial, random

  • “Popular” and “adversarial” subsets are more challenging

  • yes_ratio indicates model’s tendency to answer “yes”

Properties#

Property

Value

Benchmark Name

pope

Dataset ID

lmms-lab/POPE

Paper

N/A

Tags

Hallucination, MultiModal, Yes/No

Metrics

accuracy, precision, recall, f1_score, yes_ratio

Default Shots

0-shot

Evaluation Split

N/A

Aggregation

f1

Data Statistics#

Metric

Value

Total Samples

9,000

Prompt Length (Mean)

79.4 chars

Prompt Length (Min/Max)

75 / 87 chars

Per-Subset Statistics:

Subset

Samples

Prompt Mean

Prompt Min

Prompt Max

popular

3,000

79.27

75

87

adversarial

3,000

79.36

75

87

random

3,000

79.59

75

87

Image Statistics:

Metric

Value

Total Images

9,000

Images per Sample

min: 1, max: 1, mean: 1

Resolution Range

500x243 - 640x640

Formats

jpeg

Sample Example#

Subset: popular

{
  "input": [
    {
      "id": "8847a5a3",
      "content": [
        {
          "text": "Is there a snowboard in the image?\nPlease answer YES or NO without an explanation."
        },
        {
          "image": "[BASE64_IMAGE: png, ~87.2KB]"
        }
      ]
    }
  ],
  "target": "YES",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "id": "3000",
    "answer": "YES",
    "category": "popular",
    "question_id": "1"
  }
}

Prompt Template#

Prompt Template:

{question}
Please answer YES or NO without an explanation.

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets pope \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['pope'],
    dataset_args={
        'pope': {
            # subset_list: ['popular', 'adversarial', 'random']  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)