HaluEval#

Overview#

HaluEval is a large collection of generated and human-annotated hallucinated samples for evaluating how well LLMs recognize hallucinations. Each sample pairs a context (dialogue history, knowledge, or source document) with a response, and the model must judge whether that response contains hallucinated content.

Task Description#

  • Task Type: Hallucination Detection

  • Input: Context/knowledge + response to judge

  • Output: YES (hallucination) or NO (factual)

  • Domains: Dialogue, QA, Summarization
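Because the expected output is a bare YES or NO, a free-form model reply usually has to be normalized before scoring. The following is a minimal sketch of such a normalizer. `parse_judgement` is a hypothetical helper written for illustration, not the parsing logic the evaluation framework actually uses, and treating replies that contain both or neither keyword as unparseable is an assumption.

```python
import re
from typing import Optional


def parse_judgement(raw: str) -> Optional[str]:
    """Map a free-form reply to 'YES', 'NO', or None when it is ambiguous.

    Hypothetical helper: the real benchmark adapter may parse differently.
    """
    # Split into lowercase alphabetic words so "Yes." and " no " both match,
    # while words such as "not" or "cannot" do not count as "no".
    tokens = re.findall(r"[a-z]+", raw.lower())
    has_yes = "yes" in tokens
    has_no = "no" in tokens
    if has_yes == has_no:
        # Neither keyword or both keywords present: treat as unparseable.
        return None
    return "YES" if has_yes else "NO"
```

For example, `parse_judgement("Yes, the response is hallucinated.")` gives `"YES"`, while a reply with no clear verdict gives `None` and can be counted separately.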

Key Features#

  • Three evaluation categories:

    • dialogue_samples: Hallucination in conversational responses

    • qa_samples: Hallucination in question answering

    • summarization_samples: Hallucination in document summaries

  • Both generated and human-annotated samples

  • Tests a model’s ability to detect factual inconsistencies

  • Requires reasoning about knowledge-response alignment

Evaluation Notes#

  • Default evaluation is zero-shot (no few-shot examples)

  • Multiple metrics computed:

    • Accuracy: Overall correct judgments

    • Precision: True positives among positive predictions

    • Recall: True positives among actual positives

    • F1 Score: Harmonic mean of precision and recall

    • Yes Ratio: Proportion of YES predictions

  • Binary YES/NO judgment format
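The metrics listed above can be sketched as follows. This is an illustrative implementation, not the framework's own code: it assumes YES (hallucination) is the positive class, that every prediction has already been normalized to "YES" or "NO", and that undefined ratios are reported as 0.0. The function name `halueval_metrics` is hypothetical.

```python
def halueval_metrics(predictions, targets):
    """Compute the five metrics from parallel lists of 'YES'/'NO' labels.

    Assumption: 'YES' is the positive class; empty denominators yield 0.0.
    """
    pairs = list(zip(predictions, targets))
    tp = sum(p == "YES" and t == "YES" for p, t in pairs)  # hallucination caught
    fp = sum(p == "YES" and t == "NO" for p, t in pairs)   # false alarm
    fn = sum(p == "NO" and t == "YES" for p, t in pairs)   # hallucination missed
    tn = sum(p == "NO" and t == "NO" for p, t in pairs)    # factual, accepted
    n = len(pairs)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "accuracy": (tp + tn) / n if n else 0.0,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "yes_ratio": (tp + fp) / n if n else 0.0,
    }


scores = halueval_metrics(
    predictions=["YES", "YES", "YES", "NO"],
    targets=["YES", "NO", "YES", "NO"],
)
```

Reading the Yes Ratio next to accuracy is useful because a model that answers YES to everything reaches perfect recall while its Yes Ratio of 1.0 exposes the bias.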

Properties#

| Property | Value |
|---|---|
| Benchmark Name | halueval |
| Dataset ID | evalscope/HaluEval |
| Paper | N/A |
| Tags | Hallucination, Knowledge, Yes/No |
| Metrics | accuracy, precision, recall, f1_score, yes_ratio |
| Default Shots | 0-shot |
| Evaluation Split | data |

Data Statistics#

| Metric | Value |
|---|---|
| Total Samples | 30,000 |
| Prompt Length (Mean) | 4832.18 chars |
| Prompt Length (Min/Max) | 2463 / 16078 chars |

Per-Subset Statistics:

| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
|---|---|---|---|---|
| dialogue_samples | 10,000 | 3563.69 | 3169 | 4200 |
| qa_samples | 10,000 | 2811.83 | 2463 | 4004 |
| summarization_samples | 10,000 | 8121.02 | 4932 | 16078 |
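The overall figures follow directly from the per-subset rows. As a quick consistency check, the sketch below recomputes the overall mean as a sample-weighted average of the subset means and takes the overall min/max across subsets. The numbers are copied from the statistics above; the variable names are illustrative only.

```python
# Per-subset statistics: (samples, mean, min, max), prompt lengths in characters.
subsets = {
    "dialogue_samples": (10_000, 3563.69, 3169, 4200),
    "qa_samples": (10_000, 2811.83, 2463, 4004),
    "summarization_samples": (10_000, 8121.02, 4932, 16078),
}

total_samples = sum(n for n, _, _, _ in subsets.values())

# Weighted average of subset means; with equal subset sizes this is the plain average.
overall_mean = sum(n * mean for n, mean, _, _ in subsets.values()) / total_samples

overall_min = min(lo for _, _, lo, _ in subsets.values())
overall_max = max(hi for _, _, _, hi in subsets.values())

print(total_samples, round(overall_mean, 2), overall_min, overall_max)
```

The summarization subset dominates prompt length, so it largely determines how much context window an evaluated model needs.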

Sample Example#

Subset: dialogue_samples

{
  "input": [
    {
      "id": "a99406f3",
      "content": [
        {
          "text": "I want you act as a response judge. Given a dialogue history and a response, your objective is to determine if the provided response contains non-factual or hallucinated information. You SHOULD give your judgement based on the following hallu ... [TRUNCATED] ...  do! Robert Downey Jr. is a favorite. [Human]: Yes i like him too did you know he also was in Zodiac a crime fiction film. \n#Response#: I'm not a fan of crime movies, but I did know that RDJ starred in Zodiac with Tom Hanks.\n#Your Judgement#:"
        }
      ]
    }
  ],
  "target": "YES",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "answer": "yes"
  }
}

Note: Some content was truncated for display.
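To make the record layout concrete, here is a small sketch that pulls the prompt text and the gold label out of a sample shaped like the one above. The prompt string is a shortened placeholder (the original text is truncated in the example and is not reconstructed here), and `prompt_text` is a hypothetical helper, not part of any library.

```python
# A record with the same shape as the example above.
# The "text" value is a placeholder standing in for the truncated prompt.
sample = {
    "input": [
        {
            "id": "a99406f3",
            "content": [
                {"text": "<instructions and dialogue elided>\n#Your Judgement#:"}
            ],
        }
    ],
    "target": "YES",
    "id": 0,
    "group_id": 0,
    "metadata": {"answer": "yes"},
}


def prompt_text(record):
    """Concatenate every text part of every input message (hypothetical helper)."""
    return "".join(
        part["text"] for message in record["input"] for part in message["content"]
    )


prompt = prompt_text(sample)
gold = sample["target"]  # "YES" means the response is hallucinated

# In this example, the uppercase target mirrors the lowercase metadata answer.
labels_agree = gold == sample["metadata"]["answer"].upper()
```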

Prompt Template#

{question}

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets halueval \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['halueval'],
    dataset_args={
        'halueval': {
            # 'subset_list': ['dialogue_samples', 'qa_samples', 'summarization_samples'],  # optional: uncomment to evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)