Humanity’s-Last-Exam#
Overview#
Humanity’s Last Exam (HLE) is a comprehensive language model benchmark consisting of 2,500 questions across a broad range of subjects. Created jointly by the Center for AI Safety and Scale AI, it represents one of the most challenging academic benchmarks available.
Task Description#
Task Type: Expert-Level Question Answering
Input: Question with optional image (14% multimodal)
Output: Answer with explanation and confidence score
Domains: Mathematics (41%), Physics (9%), Biology/Medicine (11%), Computer Science/AI (10%), Humanities (9%), Engineering (4%), Chemistry (7%), Other (9%)
Key Features#
2,500 expert-level questions across multiple disciplines
14% of questions require multimodal understanding
24% multiple-choice, 76% short-answer exact-match
Questions from various academic and professional domains
Includes confidence scoring in response format
Evaluation Notes#
Default evaluation uses the test split
Primary metric: Accuracy with LLM judge
Response format includes: Explanation, Answer, and Confidence (0-100%)
Note: Set `extra_params["include_multi_modal"]` to `False` for text-only models
LLM judge scoring uses the GRADE: C/I format
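The response format and judge verdict described above can be post-processed with a few regular expressions. The following is a minimal sketch, not evalscope's own parser: the function names `parse_response`, `parse_grade`, and `accuracy` are hypothetical, the regular expressions are illustrative, and reading C/I as correct/incorrect is an assumption.

```python
import re
from typing import List, Optional


def parse_response(text: str) -> dict:
    """Split a model reply into explanation, answer, and confidence (0-100)."""
    explanation = re.search(r"Explanation:\s*(.*?)(?=\n\s*Answer:|\Z)", text, re.S)
    answer = re.search(r"Answer:\s*(.*?)(?=\n\s*Confidence:|\Z)", text, re.S)
    confidence = re.search(r"Confidence:\s*(\d+(?:\.\d+)?)\s*%?", text)
    return {
        "explanation": explanation.group(1).strip() if explanation else None,
        "answer": answer.group(1).strip() if answer else None,
        # Clamp to the documented 0-100 range; None if the field is missing.
        "confidence": min(100.0, max(0.0, float(confidence.group(1)))) if confidence else None,
    }


def parse_grade(judge_output: str) -> Optional[bool]:
    """Return True for 'GRADE: C', False for 'GRADE: I', None if no verdict is found.

    Assumption: C means correct and I means incorrect.
    """
    match = re.search(r"GRADE:\s*([CI])\b", judge_output)
    if match is None:
        return None
    return match.group(1) == "C"


def accuracy(grades: List[Optional[bool]]) -> float:
    """Fraction of samples graded correct; a missing verdict counts as incorrect."""
    if not grades:
        return 0.0
    return sum(1 for g in grades if g) / len(grades)


reply = "Explanation: Pi is biased.\nAnswer: B\nConfidence: 85%"
print(parse_response(reply))
print(accuracy([parse_grade("GRADE: C"), parse_grade("GRADE: I")]))
```

Counting an unparseable verdict as incorrect is a conservative choice made for this sketch; a real pipeline might prefer to retry the judge instead.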
Properties#
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 2,500 |
| Prompt Length (Mean) | 1029.85 chars |
| Prompt Length (Min/Max) | 234 / 21341 chars |
Per-Subset Statistics:
| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|---|---|---|---|---|
| | 280 | 1259.39 | 246 | 13702 |
| | 165 | 812.72 | 236 | 6942 |
| | 241 | 1581.02 | 263 | 11529 |
| | 111 | 1620.26 | 250 | 21341 |
| | 219 | 1069.39 | 256 | 7028 |
| | 1,021 | 862.46 | 262 | 8952 |
| | 230 | 1027.63 | 257 | 17139 |
| | 233 | 754.94 | 234 | 13655 |
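The per-subset figures can be cross-checked against the overall statistics. The sketch below copies the sample counts and mean prompt lengths from the rows above, in the same order, and recomputes the total sample count and the sample-weighted mean prompt length; the variable names are chosen only for illustration.

```python
# (samples, mean prompt length in chars) for each subset, in table row order.
subset_stats = [
    (280, 1259.39),
    (165, 812.72),
    (241, 1581.02),
    (111, 1620.26),
    (219, 1069.39),
    (1021, 862.46),
    (230, 1027.63),
    (233, 754.94),
]

# Total number of samples across all subsets.
total_samples = sum(n for n, _ in subset_stats)

# Overall mean is the per-subset mean weighted by subset size.
weighted_mean = sum(n * mean for n, mean in subset_stats) / total_samples

print(total_samples)            # 2500
print(round(weighted_mean, 2))  # 1029.85
```

Both values agree with the Total Samples and Prompt Length (Mean) entries reported in the overall statistics.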
Image Statistics:
| Metric | Value |
|---|---|
| Total Images | 342 |
| Images per Sample | min: 1, max: 1, mean: 1 |
| Resolution Range | 329x12 - 14950x2780 |
| Formats | gif, jpeg, png, webp |
Sample Example#
Subset: Biology/Medicine
{
  "input": [
    {
      "id": "906a518f",
      "content": "Your response should be in the following format:\nExplanation: {your explanation for your answer choice}\nAnswer: {your chosen answer}\nConfidence: {your confidence score between 0% and 100% for your answer}"
    },
    {
      "id": "d03d8d4e",
      "content": [
        {
          "text": "In a bioinformatics lab, Watterson's estimator (theta) and pi (nucleotide diversity) will be calculated from variant call files which contain human phased samples with only single nucleotide variants present, and there are no completely missi ... [TRUNCATED] ... y pi (nucleotide diversity) is biased.\nC. Both Watterson's estimator (theta) and pi (nucleotide diversity) are biased.\nD. Neither Watterson's estimator (theta) nor pi (nucleotide diversity) are biased.\nE. None of the other answers are correct"
        }
      ]
    }
  ],
  "target": "B",
  "id": 0,
  "group_id": 0,
  "subset_key": "Biology/Medicine",
  "metadata": {
    "uid": "66e88728ba7d8bc0d5806f3a",
    "author_name": "Scott S",
    "rationale": "First, we recognize that all single nucleotide variants are included somewhere in the sample. It is given that, across “all samples,” there are no “missing single nucleotide variants.” Further, since “[t]he number of samples is arbitrarily la ... [TRUNCATED] ... fferent genotypes that that position, the analysis would consider these two genomes to have the same nucleotide at the position. This reduces the estimated nucleotide diversity, pi. Therefore, pi would be biased in the circumstance described.",
    "raw_subject": "Bioinformatics",
    "category": "Biology/Medicine",
    "has_image": false
  }
}
Note: Some content was truncated for display.
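In the sample above, a message's `content` appears either as a plain string or as a list of parts that carry a `text` key. A minimal sketch for flattening either shape into one string follows. The helper name `message_text` is hypothetical, the sample record is heavily abbreviated, and the assumption that non-text parts (such as images) lack a `text` key is not confirmed by the source.

```python
def message_text(message: dict) -> str:
    """Return the textual content of one input message.

    Handles both shapes seen in the sample: a plain string, or a list of
    parts where text parts carry a "text" key. Parts without "text"
    (assumed here to be images or other media) are skipped.
    """
    content = message["content"]
    if isinstance(content, str):
        return content
    return "\n".join(part["text"] for part in content if "text" in part)


# Abbreviated stand-in for the record shown above (texts shortened by hand).
sample = {
    "input": [
        {"id": "906a518f", "content": "Your response should be in the following format: ..."},
        {"id": "d03d8d4e", "content": [{"text": "In a bioinformatics lab, ..."}]},
    ],
    "target": "B",
    "subset_key": "Biology/Medicine",
}

for message in sample["input"]:
    print(message["id"], "->", message_text(message))
print("target:", sample["target"])
```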
Prompt Template#
Prompt Template:
{question}
Extra Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | | Include multi-modal (image) questions during evaluation. |
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets hle \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['hle'],
    dataset_args={
        'hle': {
            # 'subset_list': ['Biology/Medicine', 'Chemistry', 'Computer Science/AI'],  # optional, evaluate specific subsets
            # 'extra_params': {},  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
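For a text-only model, the Evaluation Notes recommend setting `extra_params["include_multi_modal"]` to `False`. The sketch below builds the corresponding `dataset_args` value as a plain dictionary, so it runs without evalscope installed. The helper name `hle_dataset_args` is hypothetical; the keys `subset_list` and `extra_params` are taken from the commented lines in the Python example above.

```python
from typing import Iterable, Optional


def hle_dataset_args(text_only: bool = False,
                     subsets: Optional[Iterable[str]] = None) -> dict:
    """Build a dataset_args mapping for the 'hle' benchmark.

    text_only: if True, disable image questions via include_multi_modal=False.
    subsets:   optional subset names to restrict the evaluation to.
    """
    args: dict = {}
    if subsets:
        args["subset_list"] = list(subsets)
    if text_only:
        args["extra_params"] = {"include_multi_modal": False}
    return {"hle": args}


dataset_args = hle_dataset_args(text_only=True, subsets=["Biology/Medicine", "Chemistry"])
print(dataset_args)

# The result is meant to be passed on as TaskConfig(..., dataset_args=dataset_args).
```

Leaving `text_only` at its default omits `extra_params` entirely, so the benchmark's default extra parameters apply.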