Humanity’s-Last-Exam#

Overview#

Humanity’s Last Exam (HLE) is a comprehensive language model benchmark consisting of 2,500 questions across a broad range of subjects. Created jointly by the Center for AI Safety and Scale AI, it represents one of the most challenging academic benchmarks available.

Task Description#

Task Type: Expert-Level Question Answering
Input: Question with optional image (14% multimodal)
Output: Answer with explanation and confidence score
Domains: Mathematics (41%), Physics (9%), Biology/Medicine (11%), Computer Science/AI (10%), Humanities (9%), Engineering (4%), Chemistry (7%), Other (9%)

Key Features#

2,500 expert-level questions across multiple disciplines
14% of questions require multimodal understanding
24% multiple-choice, 76% short-answer exact-match
Questions from various academic and professional domains
Includes confidence scoring in response format

Evaluation Notes#

Default evaluation uses the test split
Primary metric: Accuracy with LLM judge
Response format includes: Explanation, Answer, and Confidence (0-100%)
Note: Set extra_params["include_multi_modal"] to False for text-only models
Uses GRADE: C/I format for LLM judge scoring

Properties#

Property	Value
Benchmark Name	`hle`
Dataset ID	cais/hle
Paper	N/A
Tags	`Knowledge`, `QA`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics#

Metric	Value
Total Samples	2,500
Prompt Length (Mean)	1029.85 chars
Prompt Length (Min/Max)	234 / 21341 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean	Prompt Min	Prompt Max
`Biology/Medicine`	280	1259.39	246	13702
`Chemistry`	165	812.72	236	6942
`Computer Science/AI`	241	1581.02	263	11529
`Engineering`	111	1620.26	250	21341
`Humanities/Social Science`	219	1069.39	256	7028
`Math`	1,021	862.46	262	8952
`Physics`	230	1027.63	257	17139
`Other`	233	754.94	234	13655

Image Statistics:

Metric	Value
Total Images	342
Images per Sample	min: 1, max: 1, mean: 1
Resolution Range	329x12 - 14950x2780
Formats	gif, jpeg, png, webp

Sample Example#

Subset: Biology/Medicine

{
  "input": [
    {
      "id": "906a518f",
      "content": "Your response should be in the following format:\nExplanation: {your explanation for your answer choice}\nAnswer: {your chosen answer}\nConfidence: {your confidence score between 0% and 100% for your answer}"
    },
    {
      "id": "d03d8d4e",
      "content": [
        {
          "text": "In a bioinformatics lab, Watterson's estimator (theta) and pi (nucleotide diversity) will be calculated from variant call files which contain human phased samples with only single nucleotide variants present, and there are no completely missi ... [TRUNCATED] ... y pi (nucleotide diversity) is biased.\nC. Both Watterson's estimator (theta) and pi (nucleotide diversity) are biased.\nD. Neither Watterson's estimator (theta) nor pi (nucleotide diversity) are biased.\nE. None of the other answers are correct"
        }
      ]
    }
  ],
  "target": "B",
  "id": 0,
  "group_id": 0,
  "subset_key": "Biology/Medicine",
  "metadata": {
    "uid": "66e88728ba7d8bc0d5806f3a",
    "author_name": "Scott S",
    "rationale": "First, we recognize that all single nucleotide variants are included somewhere in the sample. It is given that, across “all samples,” there are no “missing single nucleotide variants.” Further, since “[t]he number of samples is arbitrarily la ... [TRUNCATED] ... fferent genotypes that that position, the analysis would consider these two genomes to have the same nucleotide at the position. This reduces the estimated nucleotide diversity, pi. Therefore, pi would be biased in the circumstance described.",
    "raw_subject": "Bioinformatics",
    "category": "Biology/Medicine",
    "has_image": false
  }
}

Note: Some content was truncated for display.

Prompt Template#

Prompt Template:

{question}

Extra Parameters#

Parameter	Type	Default	Description
`include_multi_modal`	`bool`	`True`	Include multi-modal (image) questions during evaluation.

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets hle \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['hle'],
    dataset_args={
        'hle': {
            # subset_list: ['Biology/Medicine', 'Chemistry', 'Computer Science/AI']  # optional, evaluate specific subsets
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)