HallusionBench#

Overview#

HallusionBench is an advanced diagnostic benchmark designed to evaluate image-context reasoning and detect hallucination tendencies in Large Vision-Language Models (LVLMs). It specifically tests models’ susceptibility to language hallucination and visual illusion.

Task Description#

Task Type: Hallucination Detection and Visual Reasoning
Input: Image + yes/no question about image content
Output: YES or NO answer
Domains: Hallucination detection, visual reasoning, factual accuracy

Key Features#

Specifically designed to probe hallucination behaviors
Tests both language hallucination and visual illusion
Organized by categories and subcategories for detailed analysis
Uses grouped accuracy metrics for robust evaluation
Questions require precise image-context reasoning

Evaluation Notes#

Default evaluation uses the image split
Multiple accuracy metrics:
- aAcc: Answer-level accuracy (per-question)
- fAcc: Figure-level accuracy (all questions per figure correct)
- qAcc: Question-level accuracy (grouped by question type)
Requires simple YES/NO answers without explanation
Aggregation at subcategory, category, and overall levels

Properties#

Property	Value
Benchmark Name	`hallusion_bench`
Dataset ID	lmms-lab/HallusionBench
Paper	N/A
Tags	`Hallucination`, `MultiModal`, `Yes/No`
Metrics	`aAcc`, `qAcc`, `fAcc`
Default Shots	0-shot
Evaluation Split	`image`
Aggregation	`f1`

Data Statistics#

Metric	Value
Total Samples	951
Prompt Length (Mean)	136.78 chars
Prompt Length (Min/Max)	76 / 292 chars

Image Statistics:

Metric	Value
Total Images	951
Images per Sample	min: 1, max: 1, mean: 1
Resolution Range	388x56 - 5291x4536
Formats	png

Sample Example#

Subset: default

{
  "input": [
    {
      "id": "ba75d669",
      "content": [
        {
          "text": "Is China, Hongkong SAR, the leading importing country of gold, silverware, and jewelry with the highest import value in 2018?\nPlease answer YES or NO without an explanation."
        },
        {
          "image": "[BASE64_IMAGE: png, ~143.0KB]"
        }
      ]
    }
  ],
  "target": "NO",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "category": "VS",
    "subcategory": "chart",
    "visual_input": "1",
    "set_id": "0",
    "figure_id": "1",
    "question_id": "0",
    "gt_answer": "0",
    "gt_answer_details": "Switzerland is the leading importing country of gold, silverware, and jewelry with the highest import value in 2018?"
  }
}

Prompt Template#

Prompt Template:

{question}
Please answer YES or NO without an explanation.

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets hallusion_bench \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['hallusion_bench'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)