Needle-in-a-Haystack#

Overview#

Needle in a Haystack is a benchmark focused on evaluating information retrieval capabilities in long-context scenarios. It tests a model’s ability to find specific information (needles) within large documents (haystacks).

Task Description#

  • Task Type: Long-Context Information Retrieval

  • Input: Long document with embedded target information + retrieval question

  • Output: Extracted target information (needle)

  • Domains: Long-context understanding, information retrieval

Key Features#

  • Tests retrieval across varying context lengths (1K-32K+ tokens)

  • Tests retrieval at different document depths (0%-100%)

  • Supports both English and Chinese corpora

  • Generates synthetic samples with configurable parameters

  • Produces heatmap visualizations of performance

Evaluation Notes#

  • Default context lengths: 1,000 to 32,000 tokens (configurable)

  • Default depth percentages: 0% to 100% (configurable)

  • Primary metric: Accuracy on retrieval

  • Uses LLM judge for flexible answer matching

  • Configurable via extra_params: needles, context lengths, depth intervals, tokenizer

  • Usage Example

Properties#

Property

Value

Benchmark Name

needle_haystack

Dataset ID

AI-ModelScope/Needle-in-a-Haystack-Corpus

Paper

N/A

Tags

LongContext, Retrieval

Metrics

acc

Default Shots

0-shot

Evaluation Split

test

Data Statistics#

Metric

Value

Total Samples

200

Prompt Length (Mean)

45063.55 chars

Prompt Length (Min/Max)

1361 / 137407 chars

Per-Subset Statistics:

Subset

Samples

Prompt Mean

Prompt Min

Prompt Max

english

100

70387.8

3898

137407

chinese

100

19739.3

1361

37893

Sample Example#

Subset: english

{
  "input": [
    {
      "id": "f617334d",
      "content": "You are a helpful AI bot that answers questions for a user. Keep your response short and direct"
    },
    {
      "id": "a832dc68",
      "content": "Please read the following text and answer the question below.\n\n<text>\n\nThe best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.\n\n\nWant to start a startup?  Get funded by\nY Combinator.\n\n\n\n\nJuly 2004(This  ... [TRUNCATED] ... ow do you\nget them to come and work for you?  And then of course there's the\nquestion, how do\n</text>\n\n<question>\nWhat is the best thing to do in San Francisco?\n</question>\n\nDon't give information outside the document or repeat your findings."
    }
  ],
  "target": "\nThe best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.\n",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "context": "\nThe best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.\n\n\nWant to start a startup?  Get funded by\nY Combinator.\n\n\n\n\nJuly 2004(This essay is derived from a talk at Oscon 2004.)\nA few months ago I finish ... [TRUNCATED] ... m, we need to understand these\nespecially productive people.  What motivates them?  What do they\nneed to do their jobs?  How do you recognize them? How do you\nget them to come and work for you?  And then of course there's the\nquestion, how do",
    "context_length": 1000,
    "depth_percent": 0
  }
}

Note: Some content was truncated for display.

Prompt Template#

System Prompt:

You are a helpful AI bot that answers questions for a user. Keep your response short and direct

Prompt Template:

Please read the following text and answer the question below.

<text>
{context}
</text>

<question>
{question}
</question>

Don't give information outside the document or repeat your findings.

Extra Parameters#

Parameter

Type

Default

Description

retrieval_question

str

What is the best thing to do in San Francisco?

Question used for retrieval evaluation.

needles

list[str]

['\nThe best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.\n']

List of factual needle strings inserted into the context.

context_lengths_min

int

1000

Minimum context length (tokens) to generate synthetic samples.

context_lengths_max

int

32000

Maximum context length (tokens) to generate synthetic samples.

context_lengths_num_intervals

int

10

Number of intervals between min and max context lengths.

document_depth_percent_min

int

0

Minimum insertion depth percentage for needles.

document_depth_percent_max

int

100

Maximum insertion depth percentage for needles.

document_depth_percent_intervals

int

10

Number of intervals between min and max depth percentages.

tokenizer_path

str

Qwen/Qwen3-0.6B

Tokenizer checkpoint path used for tokenization.

show_score

bool

False

Render numerical scores on heatmap output images.

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets needle_haystack \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['needle_haystack'],
    dataset_args={
        'needle_haystack': {
            # subset_list: ['english', 'chinese']  # optional, evaluate specific subsets
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)