AA-LCR#

Overview#

AA-LCR (Artificial Analysis Long Context Reasoning) is a benchmark for evaluating the long-context retrieval and reasoning capabilities of language models. Each question requires the model to find and synthesize information spread across multiple documents.

Task Description#

Task Type: Long-Context Question Answering
Input: Multiple documents + question requiring cross-document reasoning
Output: Answer synthesized from document information
Context: Very long (multiple documents concatenated; prompts average roughly 415,000 characters)

Key Features#

Tests long-context retrieval abilities
Multiple document understanding
Cross-document reasoning required
LLM-based judging for answer correctness
Auto-download of document corpus

Evaluation Notes#

Default configuration uses 0-shot evaluation
Primary metric: Accuracy (via LLM judge)
Evaluates on test split
Documents are downloaded automatically if `text_dir` is not specified
Judge prompt compares candidate answer against reference
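
As a rough illustration of how per-sample judge verdicts become the `acc` metric, the sketch below averages boolean verdicts over (candidate, reference) pairs. The function names and the string-matching stand-in judge are hypothetical and exist only to keep the example self-contained; the benchmark itself uses an LLM judge, whose prompt is not reproduced here.

```python
from typing import Callable, Iterable, Tuple


def judged_accuracy(
    pairs: Iterable[Tuple[str, str]],
    judge: Callable[[str, str], bool],
) -> float:
    """Return the fraction of (candidate, reference) pairs the judge accepts."""
    verdicts = [bool(judge(candidate, reference)) for candidate, reference in pairs]
    # Avoid division by zero when no samples were evaluated.
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def stand_in_judge(candidate: str, reference: str) -> bool:
    """Hypothetical placeholder: case-insensitive exact match, NOT the real LLM judge."""
    return candidate.strip().lower() == reference.strip().lower()


example_pairs = [
    ("Airline Industry", "airline industry"),  # accepted by the stand-in judge
    ("Banking Industry", "Accommodation Industry"),  # rejected
]
accuracy = judged_accuracy(example_pairs, stand_in_judge)
```

In practice the `judge` callable would wrap a call to the judge model and parse its verdict; only the averaging step is shown here.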

Properties#

Property	Value
Benchmark Name	`aa_lcr`
Dataset ID	evalscope/AA-LCR
Paper	N/A
Tags	`Knowledge`, `LongContext`, `Reasoning`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics#

Metric	Value
Total Samples	100
Prompt Length (Mean)	414674.06 chars
Prompt Length (Min/Max)	240709 / 548771 chars
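
Because prompts are this long, it can help to check whether a model's context window is likely to fit them before running the benchmark. The sketch below converts the character counts from the table above into a rough token estimate. The characters-per-token ratio and the output reserve are assumed heuristics, not measured values; real token counts depend on the tokenizer.

```python
import math

# Character counts taken from the Data Statistics table.
MEAN_CHARS = 414674.06
MAX_CHARS = 548771


def estimate_tokens(num_chars: float, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; chars_per_token is an assumed heuristic."""
    return math.ceil(num_chars / chars_per_token)


def fits_context(
    num_chars: float,
    context_window: int,
    reserve_for_output: int = 1024,  # assumed budget for the generated answer
    chars_per_token: float = 4.0,
) -> bool:
    """Check whether the estimated prompt tokens plus the output reserve fit the window."""
    needed = estimate_tokens(num_chars, chars_per_token) + reserve_for_output
    return needed <= context_window


longest_prompt_tokens = estimate_tokens(MAX_CHARS)
```

Treat the result as a planning aid only, and count tokens with the target model's own tokenizer for an exact figure.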

Sample Example#

Subset: default

{
  "input": [
    {
      "id": "0b26b81d",
      "content": "\nBEGIN INPUT DOCUMENTS\n\nBEGIN DOCUMENT 1:\n[Contents lists available at ScienceDirect](https://www.elsevier.com/locate/techfore)\n\n# Technological Forecasting & Social Change\n\n[journal homepage: www.elsevier.com/locate/techfore](http://www.else ... [TRUNCATED] ...  and undertakings issued by the ACCC. Identify and rank the industries explicitly mentioned in the paragraphs, according to the number of infringements over the past three decades. Exclude Broadcasting Industry from the answer.\n\nEND QUESTION\n"
    }
  ],
  "target": "1. Airline Industry (12)\\n2. Accommodation Industry (4)",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "question": "Based on the provided documents, there appears to be a correlation between industry concentration and the frequency of consumer-related infringements and undertakings issued by the ACCC. Identify and rank the industries explicitly mentioned in the paragraphs, according to the number of infringements over the past three decades. Exclude Broadcasting Industry from the answer.",
    "data_source_urls": "https://competition-policy.ec.europa.eu/system/files/2024-06/A_taxonomy_of_industry_competition_launch.pdf;https://www.industry.gov.au/sites/default/files/2023-11/barriers-to-collaboration-and-commercialisation-iisa.pdf;https://e61.in/wp-content/uploads/2023/08/The-State-of-Competition.pdf;https://uu.diva-portal.org/smash/get/diva2:1798138/FULLTEXT01.pdf;https://one.oecd.org/document/DAF/COMP(2023)14/en/pdf",
    "input_tokens": 94494
  }
}

Note: Some content was truncated for display.

Prompt Template#

Prompt Template:

BEGIN INPUT DOCUMENTS

{documents_text}

END INPUT DOCUMENTS

Answer the following question using the input documents provided above.

START QUESTION

{question}

END QUESTION
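
A minimal sketch of filling in this template in Python is shown below. The template text is copied from above; the helper name `build_prompt` and the placeholder document and question are hypothetical, and how the benchmark assembles `documents_text` from the corpus is not shown.

```python
PROMPT_TEMPLATE = """BEGIN INPUT DOCUMENTS

{documents_text}

END INPUT DOCUMENTS

Answer the following question using the input documents provided above.

START QUESTION

{question}

END QUESTION"""


def build_prompt(documents_text: str, question: str) -> str:
    """Substitute the concatenated documents and the question into the template."""
    return PROMPT_TEMPLATE.format(documents_text=documents_text, question=question)


# Placeholder inputs for illustration only.
prompt = build_prompt(
    documents_text="BEGIN DOCUMENT 1:\n(placeholder document text)",
    question="What is discussed in the document?",
)
```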

Extra Parameters#

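Parameter	Type	Default	Description
`text_dir`	`str | null`	`None`	Local directory containing the document corpus. If not specified, the documents are downloaded automatically.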
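
To point the benchmark at a local copy of the documents instead of relying on the automatic download, `text_dir` can be supplied through the dataset arguments. The sketch below builds that dictionary. It assumes the `extra_params` nesting hinted at by the commented placeholder in the Python usage example; the path and the helper name are hypothetical.

```python
from typing import Optional

# Hypothetical local path; replace with the directory that holds the documents.
LOCAL_TEXT_DIR = "/path/to/aa_lcr_documents"


def make_dataset_args(text_dir: Optional[str] = None) -> dict:
    """Build the dataset_args entry for aa_lcr, setting text_dir only when given."""
    extra_params = {}
    if text_dir is not None:
        extra_params["text_dir"] = text_dir
    # Assumed structure: extra parameters live under the 'extra_params' key.
    return {"aa_lcr": {"extra_params": extra_params}}


dataset_args = make_dataset_args(LOCAL_TEXT_DIR)
```

The resulting dictionary would be passed as `dataset_args=dataset_args` when constructing `TaskConfig`. Leaving `text_dir` unset keeps the default behavior of downloading the documents.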

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets aa_lcr \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['aa_lcr'],
    dataset_args={
        'aa_lcr': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)