Humanity’s Last Exam#

Overview#

Humanity’s Last Exam (HLE) is a benchmark of 2,500 expert-level questions spanning a broad range of subjects, about 14% of which include an image. It was created jointly by the Center for AI Safety and Scale AI and is intended to be one of the most challenging academic benchmarks available.

Task Description#

  • Task Type: Expert-Level Question Answering

  • Input: Question with optional image (14% multimodal)

  • Output: Answer with explanation and confidence score

  • Domains: Mathematics (41%), Biology/Medicine (11%), Computer Science/AI (10%), Physics (9%), Humanities/Social Science (9%), Chemistry (7%), Engineering (4%), Other (9%)

Key Features#

  • 2,500 expert-level questions across multiple disciplines

  • 14% of questions require multimodal understanding

  • 24% multiple-choice, 76% short-answer exact-match

  • Questions from various academic and professional domains

  • Includes confidence scoring in response format

Evaluation Notes#

  • Default evaluation uses the test split

  • Primary metric: accuracy (acc), graded by an LLM judge

  • Response format includes: Explanation, Answer, and Confidence (0-100%)

  • Note: Set extra_params["include_multi_modal"] to False when evaluating text-only models (the default is True)

  • The LLM judge reports its verdict in GRADE: C/I format (C = correct, I = incorrect)
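The response format and judge verdict described in the notes above can be parsed roughly as follows. This is a minimal sketch, not EvalScope's own parser: `parse_response`, `is_correct`, and both regular expressions are hypothetical names and patterns inferred only from the format stated in this section.

```python
import re
from typing import Optional

# Assumed patterns, derived from the "Explanation / Answer / Confidence" format
# and the "GRADE: C/I" verdict described above (not taken from EvalScope source).
RESPONSE_RE = re.compile(
    r"Explanation:\s*(?P<explanation>.*?)\s*"
    r"Answer:\s*(?P<answer>.*?)\s*"
    r"Confidence:\s*(?P<confidence>\d{1,3})\s*%?",
    re.DOTALL,
)
GRADE_RE = re.compile(r"GRADE:\s*([CI])\b")


def parse_response(text: str) -> Optional[dict]:
    """Split a model response into its three fields, or return None if malformed."""
    match = RESPONSE_RE.search(text)
    if match is None:
        return None
    return {
        "explanation": match.group("explanation"),
        "answer": match.group("answer"),
        # Clamp to the documented 0-100 range.
        "confidence": min(int(match.group("confidence")), 100),
    }


def is_correct(judge_output: str) -> bool:
    """True only when the judge output contains the single-letter verdict 'GRADE: C'."""
    match = GRADE_RE.search(judge_output)
    return match is not None and match.group(1) == "C"


reply = "Explanation: pi ignores missing sites.\nAnswer: B\nConfidence: 85%"
parsed = parse_response(reply)
verdicts = ["GRADE: C", "GRADE: I", "GRADE: C", "no verdict given"]
accuracy = sum(is_correct(v) for v in verdicts) / len(verdicts)
```

Treating a missing or unparseable verdict as incorrect is a design choice of this sketch; a real pipeline may prefer to flag such cases separately.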

Properties#

| Property | Value |
| --- | --- |
| Benchmark Name | hle |
| Dataset ID | cais/hle |
| Paper | N/A |
| Tags | Knowledge, QA |
| Metrics | acc |
| Default Shots | 0-shot |
| Evaluation Split | test |

Data Statistics#

| Metric | Value |
| --- | --- |
| Total Samples | 2,500 |
| Prompt Length (Mean) | 1029.85 chars |
| Prompt Length (Min/Max) | 234 / 21341 chars |

Per-Subset Statistics:

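| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
| --- | --- | --- | --- | --- |
| Biology/Medicine | 280 | 1259.39 | 246 | 13702 |
| Chemistry | 165 | 812.72 | 236 | 6942 |
| Computer Science/AI | 241 | 1581.02 | 263 | 11529 |
| Engineering | 111 | 1620.26 | 250 | 21341 |
| Humanities/Social Science | 219 | 1069.39 | 256 | 7028 |
| Math | 1,021 | 862.46 | 262 | 8952 |
| Physics | 230 | 1027.63 | 257 | 17139 |
| Other | 233 | 754.94 | 234 | 13655 |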
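As a quick consistency check, the domain percentages listed under Task Description can be recomputed from the per-subset sample counts. The short sketch below uses only the counts from the per-subset statistics; the variable names are illustrative.

```python
# Sample counts copied from the per-subset statistics above.
subset_counts = {
    "Biology/Medicine": 280,
    "Chemistry": 165,
    "Computer Science/AI": 241,
    "Engineering": 111,
    "Humanities/Social Science": 219,
    "Math": 1021,
    "Physics": 230,
    "Other": 233,
}

total = sum(subset_counts.values())

# Share of each subset as a whole-number percentage of the full benchmark.
shares = {name: round(100 * count / total) for name, count in subset_counts.items()}

for name, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {pct}%")
```

The counts sum to the stated total of 2,500, and the rounded shares agree with the percentages given in the task description.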

Image Statistics:

| Metric | Value |
| --- | --- |
| Total Images | 342 |
| Images per Sample | min: 1, max: 1, mean: 1 |
| Resolution Range | 329x12 - 14950x2780 |
| Formats | gif, jpeg, png, webp |

Sample Example#

Subset: Biology/Medicine

{
  "input": [
    {
      "id": "906a518f",
      "content": "Your response should be in the following format:\nExplanation: {your explanation for your answer choice}\nAnswer: {your chosen answer}\nConfidence: {your confidence score between 0% and 100% for your answer}"
    },
    {
      "id": "d03d8d4e",
      "content": [
        {
          "text": "In a bioinformatics lab, Watterson's estimator (theta) and pi (nucleotide diversity) will be calculated from variant call files which contain human phased samples with only single nucleotide variants present, and there are no completely missi ... [TRUNCATED] ... y pi (nucleotide diversity) is biased.\nC. Both Watterson's estimator (theta) and pi (nucleotide diversity) are biased.\nD. Neither Watterson's estimator (theta) nor pi (nucleotide diversity) are biased.\nE. None of the other answers are correct"
        }
      ]
    }
  ],
  "target": "B",
  "id": 0,
  "group_id": 0,
  "subset_key": "Biology/Medicine",
  "metadata": {
    "uid": "66e88728ba7d8bc0d5806f3a",
    "author_name": "Scott S",
    "rationale": "First, we recognize that all single nucleotide variants are included somewhere in the sample. It is given that, across “all samples,” there are no “missing single nucleotide variants.” Further, since “[t]he number of samples is arbitrarily la ... [TRUNCATED] ... fferent genotypes that that position, the analysis would consider these two genomes to have the same nucleotide at the position. This reduces the estimated nucleotide diversity, pi. Therefore, pi would be biased in the circumstance described.",
    "raw_subject": "Bioinformatics",
    "category": "Biology/Medicine",
    "has_image": false
  }
}

Note: Some content was truncated for display.
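In the record above, the first message's `content` is a plain string (the response-format instructions) while the second is a list of parts carrying a `text` field. The sketch below shows one way to read both shapes uniformly. `message_text` is a hypothetical helper, the record is abbreviated, and how image parts are represented is not shown in this document, so non-text parts are simply skipped.

```python
def message_text(message: dict) -> str:
    """Return a message's text whether `content` is a string or a list of parts."""
    content = message["content"]
    if isinstance(content, str):
        return content
    # List form: keep text parts only; any other part type is skipped (assumption).
    return "\n".join(part["text"] for part in content if "text" in part)


# Abbreviated copy of the record shown above; "..." marks text shortened here.
sample = {
    "input": [
        {"id": "906a518f", "content": "Your response should be in the following format: ..."},
        {"id": "d03d8d4e", "content": [{"text": "In a bioinformatics lab, ..."}]},
    ],
    "target": "B",
    "subset_key": "Biology/Medicine",
    "metadata": {"raw_subject": "Bioinformatics", "has_image": False},
}

format_instructions, question = (message_text(m) for m in sample["input"])
print(sample["subset_key"], sample["metadata"]["has_image"], sample["target"])
```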

Prompt Template#

{question}

Extra Parameters#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| include_multi_modal | bool | True | Include multi-modal (image) questions during evaluation. |
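Conceptually, turning `include_multi_modal` off restricts evaluation to questions without an image. The sketch below illustrates that effect on records shaped like the sample example. It assumes the filter keys on `metadata.has_image`, which this document does not confirm, and `select_samples` is a hypothetical helper rather than an EvalScope function.

```python
def select_samples(samples, include_multi_modal=True):
    """Keep every sample, or only text-only ones when include_multi_modal is False."""
    if include_multi_modal:
        return list(samples)
    # Assumption: image questions are identified by metadata["has_image"].
    return [s for s in samples if not s["metadata"]["has_image"]]


# Toy records for illustration only; real records carry many more fields.
toy_samples = [
    {"id": 0, "metadata": {"has_image": False}},
    {"id": 1, "metadata": {"has_image": True}},
    {"id": 2, "metadata": {"has_image": False}},
]

text_only = select_samples(toy_samples, include_multi_modal=False)
everything = select_samples(toy_samples)
```

If each of the 342 images belongs to a distinct sample, as the image statistics suggest, a text-only run would cover about 2,158 of the 2,500 questions.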

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets hle \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['hle'],
    dataset_args={
        'hle': {
            # 'subset_list': ['Biology/Medicine', 'Chemistry', 'Computer Science/AI'],  # optional: evaluate specific subsets only
            # 'extra_params': {'include_multi_modal': False},  # optional: skip image questions for text-only models
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
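For a text-only model or a subset-restricted run, the `dataset_args` entry can be assembled as below. This is a sketch: `hle_dataset_args` is a hypothetical helper, and it relies only on the `subset_list` and `extra_params` keys and the `include_multi_modal` parameter mentioned in this document.

```python
def hle_dataset_args(subsets=None, include_multi_modal=True):
    """Build the `dataset_args` mapping for the 'hle' benchmark."""
    args = {"extra_params": {"include_multi_modal": include_multi_modal}}
    if subsets:
        # Restrict evaluation to the named subsets (names as in the per-subset table).
        args["subset_list"] = list(subsets)
    return {"hle": args}


dataset_args = hle_dataset_args(
    subsets=["Biology/Medicine", "Chemistry"],
    include_multi_modal=False,  # text-only model: skip image questions
)

# Pass the result to the configuration from the example above, e.g.
# TaskConfig(model='YOUR_MODEL', datasets=['hle'], dataset_args=dataset_args, ...)
print(dataset_args)
```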