HallusionBench#
Overview#
HallusionBench is an advanced diagnostic benchmark designed to evaluate image-context reasoning and detect hallucination tendencies in Large Vision-Language Models (LVLMs). It specifically tests models’ susceptibility to language hallucination and visual illusion.
Task Description#
Task Type: Hallucination Detection and Visual Reasoning
Input: Image + yes/no question about image content
Output: YES or NO answer
Domains: Hallucination detection, visual reasoning, factual accuracy
Key Features#
Specifically designed to probe hallucination behaviors
Tests both language hallucination and visual illusion
Organized by categories and subcategories for detailed analysis
Uses grouped accuracy metrics for robust evaluation
Questions require precise image-context reasoning
Evaluation Notes#
Default evaluation uses the image split
Multiple accuracy metrics:
aAcc: Answer-level accuracy (per-question)
fAcc: Figure-level accuracy (all questions per figure correct)
qAcc: Question-level accuracy (grouped by question type)
Requires simple YES/NO answers without explanation
Aggregation at subcategory, category, and overall levels
Properties#
Property |
Value |
|---|---|
Benchmark Name |
|
Dataset ID |
|
Paper |
N/A |
Tags |
|
Metrics |
|
Default Shots |
0-shot |
Evaluation Split |
|
Aggregation |
|
Data Statistics#
Metric |
Value |
|---|---|
Total Samples |
951 |
Prompt Length (Mean) |
136.78 chars |
Prompt Length (Min/Max) |
76 / 292 chars |
Image Statistics:
Metric |
Value |
|---|---|
Total Images |
951 |
Images per Sample |
min: 1, max: 1, mean: 1 |
Resolution Range |
388x56 - 5291x4536 |
Formats |
png |
Sample Example#
Subset: default
{
"input": [
{
"id": "ba75d669",
"content": [
{
"text": "Is China, Hongkong SAR, the leading importing country of gold, silverware, and jewelry with the highest import value in 2018?\nPlease answer YES or NO without an explanation."
},
{
"image": "[BASE64_IMAGE: png, ~143.0KB]"
}
]
}
],
"target": "NO",
"id": 0,
"group_id": 0,
"metadata": {
"category": "VS",
"subcategory": "chart",
"visual_input": "1",
"set_id": "0",
"figure_id": "1",
"question_id": "0",
"gt_answer": "0",
"gt_answer_details": "Switzerland is the leading importing country of gold, silverware, and jewelry with the highest import value in 2018?"
}
}
Prompt Template#
Prompt Template:
{question}
Please answer YES or NO without an explanation.
Usage#
Using CLI#
evalscope eval \
--model YOUR_MODEL \
--api-url OPENAI_API_COMPAT_URL \
--api-key EMPTY_TOKEN \
--datasets hallusion_bench \
--limit 10 # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
model='YOUR_MODEL',
api_url='OPENAI_API_COMPAT_URL',
api_key='EMPTY_TOKEN',
datasets=['hallusion_bench'],
limit=10, # Remove this line for formal evaluation
)
run_task(task_cfg=task_cfg)