Needle-in-a-Haystack#
Overview#
Needle in a Haystack is a benchmark focused on evaluating information retrieval capabilities in long-context scenarios. It tests a model’s ability to find specific information (needles) within large documents (haystacks).
Task Description#
Task Type: Long-Context Information Retrieval
Input: Long document with embedded target information + retrieval question
Output: Extracted target information (needle)
Domains: Long-context understanding, information retrieval
Key Features#
Tests retrieval across varying context lengths (1K-32K+ tokens)
Tests retrieval at different document depths (0%-100%)
Supports both English and Chinese corpora
Generates synthetic samples with configurable parameters
Produces heatmap visualizations of performance
Evaluation Notes#
Default context lengths: 1,000 to 32,000 tokens (configurable)
Default depth percentages: 0% to 100% (configurable)
Primary metric: Accuracy on retrieval
Uses LLM judge for flexible answer matching
Configurable via extra_params: needles, context lengths, depth intervals, tokenizer
Properties#
Property |
Value |
|---|---|
Benchmark Name |
|
Dataset ID |
|
Paper |
N/A |
Tags |
|
Metrics |
|
Default Shots |
0-shot |
Evaluation Split |
|
Data Statistics#
Metric |
Value |
|---|---|
Total Samples |
200 |
Prompt Length (Mean) |
45063.55 chars |
Prompt Length (Min/Max) |
1361 / 137407 chars |
Per-Subset Statistics:
Subset |
Samples |
Prompt Mean |
Prompt Min |
Prompt Max |
|---|---|---|---|---|
|
100 |
70387.8 |
3898 |
137407 |
|
100 |
19739.3 |
1361 |
37893 |
Sample Example#
Subset: english
{
"input": [
{
"id": "f617334d",
"content": "You are a helpful AI bot that answers questions for a user. Keep your response short and direct"
},
{
"id": "a832dc68",
"content": "Please read the following text and answer the question below.\n\n<text>\n\nThe best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.\n\n\nWant to start a startup? Get funded by\nY Combinator.\n\n\n\n\nJuly 2004(This ... [TRUNCATED] ... ow do you\nget them to come and work for you? And then of course there's the\nquestion, how do\n</text>\n\n<question>\nWhat is the best thing to do in San Francisco?\n</question>\n\nDon't give information outside the document or repeat your findings."
}
],
"target": "\nThe best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.\n",
"id": 0,
"group_id": 0,
"metadata": {
"context": "\nThe best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.\n\n\nWant to start a startup? Get funded by\nY Combinator.\n\n\n\n\nJuly 2004(This essay is derived from a talk at Oscon 2004.)\nA few months ago I finish ... [TRUNCATED] ... m, we need to understand these\nespecially productive people. What motivates them? What do they\nneed to do their jobs? How do you recognize them? How do you\nget them to come and work for you? And then of course there's the\nquestion, how do",
"context_length": 1000,
"depth_percent": 0
}
}
Note: Some content was truncated for display.
Prompt Template#
System Prompt:
You are a helpful AI bot that answers questions for a user. Keep your response short and direct
Prompt Template:
Please read the following text and answer the question below.
<text>
{context}
</text>
<question>
{question}
</question>
Don't give information outside the document or repeat your findings.
Extra Parameters#
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
|
Question used for retrieval evaluation. |
|
|
|
List of factual needle strings inserted into the context. |
|
|
|
Minimum context length (tokens) to generate synthetic samples. |
|
|
|
Maximum context length (tokens) to generate synthetic samples. |
|
|
|
Number of intervals between min and max context lengths. |
|
|
|
Minimum insertion depth percentage for needles. |
|
|
|
Maximum insertion depth percentage for needles. |
|
|
|
Number of intervals between min and max depth percentages. |
|
|
|
Tokenizer checkpoint path used for tokenization. |
|
|
|
Render numerical scores on heatmap output images. |
Usage#
Using CLI#
evalscope eval \
--model YOUR_MODEL \
--api-url OPENAI_API_COMPAT_URL \
--api-key EMPTY_TOKEN \
--datasets needle_haystack \
--limit 10 # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
model='YOUR_MODEL',
api_url='OPENAI_API_COMPAT_URL',
api_key='EMPTY_TOKEN',
datasets=['needle_haystack'],
dataset_args={
'needle_haystack': {
# subset_list: ['english', 'chinese'] # optional, evaluate specific subsets
# extra_params: {} # uses default extra parameters
}
},
limit=10, # Remove this line for formal evaluation
)
run_task(task_cfg=task_cfg)