LongMemEval#

Overview#

LongMemEval evaluates long-term interactive memory in chat assistants. Each question is answered from a timestamped multi-session user-assistant history.

Task Description#

  • Task Type: Long-context / retrieval-log question answering

  • Input: Multiple dated chat sessions + current question date + question

  • Output: Free-form answer grounded in the history

  • Subsets: s (~115K-token histories), m (~500 sessions per history, large), and oracle (evidence sessions only)

Key Features#

  • Covers single-session, multi-session, temporal reasoning, knowledge update, preference, and abstention questions

  • Supports full-history long-context prompts and official retrieval-log prompts

  • Uses LongMemEval’s LLM judge prompts for semantic answer correctness

  • Downloads only the selected JSON file from ModelScope

Evaluation Notes#

  • Default subset is s with eval_mode=long_context for standard long-context evaluation

  • Use subset_list=['oracle'] and extra_params.eval_mode='oracle_context' for evidence-only upper-bound evaluation

  • The m subset is large and must be requested explicitly

  • retrieval_log mode consumes official LongMemEval retrieval logs; it does not run embedding retrieval itself
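The subset and mode pairings in these notes can be written down as `dataset_args` dictionaries. The following is a minimal sketch: `dataset_args_for` is a hypothetical helper (not part of evalscope), the log path passed in is a placeholder, and pairing `retrieval_log` mode with the `s` subset is an assumption rather than something the notes state.

```python
def dataset_args_for(mode, retrieval_log_path=None):
    """Build a dataset_args dict for one documented eval_mode (hypothetical helper)."""
    if mode == 'long_context':
        # Documented default: subset s with full-history prompts.
        cfg = {'subset_list': ['s'],
               'extra_params': {'eval_mode': 'long_context'}}
    elif mode == 'oracle_context':
        # Evidence-only upper bound: oracle subset with oracle_context mode.
        cfg = {'subset_list': ['oracle'],
               'extra_params': {'eval_mode': 'oracle_context'}}
    elif mode == 'retrieval_log':
        # This mode reads an existing retrieval log; it does not run retrieval.
        if not retrieval_log_path:
            raise ValueError('retrieval_log mode needs a retrieval_log_path')
        cfg = {'subset_list': ['s'],  # assumption: logs produced for subset s
               'extra_params': {'eval_mode': 'retrieval_log',
                                'retrieval_log_path': retrieval_log_path}}
    else:
        raise ValueError(f'unknown eval_mode: {mode}')
    return {'longmemeval': cfg}


oracle_args = dataset_args_for('oracle_context')
```

The returned dictionary is meant to be passed as `dataset_args` to `TaskConfig`, in the same position as in the Python usage example. The `m` subset would be requested the same way, with `'subset_list': ['m']`.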

Properties#

| Property | Value |
| --- | --- |
| Benchmark Name | longmemeval |
| Dataset ID | evalscope/longmemeval-cleaned |
| Paper | N/A |
| Tags | LongContext, MultiTurn, QA, Retrieval |
| Metrics | acc |
| Default Shots | 0-shot |
| Evaluation Split | test |

Data Statistics#

| Metric | Value |
| --- | --- |
| Total Samples | 500 |
| Prompt Length (Mean) | 515877.18 chars |
| Prompt Length (Min/Max) | 480743 / 540467 chars |
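The character counts can be turned into a rough token budget when checking whether a model's context window is large enough. The sketch below assumes these statistics describe the s subset and derives a characters-per-token ratio from its "~115K-token" figure. That ratio is only an approximation, since real token counts depend on the model's tokenizer.

```python
MEAN_PROMPT_CHARS = 515877.18   # from the statistics above
MAX_PROMPT_CHARS = 540467       # from the statistics above
APPROX_MEAN_TOKENS = 115_000    # "~115K-token histories" for subset s

# Assumption: the mean prompt corresponds to roughly 115K tokens.
chars_per_token = MEAN_PROMPT_CHARS / APPROX_MEAN_TOKENS


def estimate_tokens(num_chars, ratio=chars_per_token):
    """Rough token estimate from a character count (tokenizer-dependent)."""
    return round(num_chars / ratio)


longest_prompt_tokens = estimate_tokens(MAX_PROMPT_CHARS)
```

Under this assumption the ratio is about 4.5 characters per token, and the longest prompt comes to roughly 120K tokens, before any room is reserved for the model's answer.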

Sample Example#

Subset: s

{
  "input": [
    {
      "id": "99d5aff3",
      "content": "I will give you several history chats between you and a user. Please answer the question based on the relevant chat history. Answer the question step by step: first extract all the relevant information, and then reason over the information to ... [TRUNCATED 513848 chars] ... you'll be able to optimize your pantry storage space, reduce clutter, and make meal prep and cooking more efficient. Happy organizing!\"}]\n\n\nCurrent Date: 2023/05/30 (Tue) 23:40\nQuestion: What degree did I graduate with?\nAnswer (step by step):"
    }
  ],
  "target": "Business Administration",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "question_id": "e47becba",
    "question_type": "single-session-user",
    "question": "What degree did I graduate with?",
    "answer": "Business Administration",
    "question_date": "2023/05/30 (Tue) 23:40",
    "answer_session_ids": [
      "answer_280352e9"
    ],
    "is_abstention": false,
    "eval_mode": "long_context",
    "retrieved_ids": [
      "sharegpt_yywfIrx_0",
      "85a1be56_1",
      "sharegpt_Jcy1CVN_0",
      "sharegpt_Cr2tc1f_0",
      "sharegpt_DGTCD7D_0",
      "f6859b48_2",
      "52c34859_1",
      "ultrachat_231069",
      "sharegpt_qRdLQvN_7",
      "ultrachat_359984",
      "... [TRUNCATED 43 more items] ..."
    ]
  }
}
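The `answer_session_ids` and `retrieved_ids` metadata fields make it possible to check whether the evidence sessions appear among the retrieved items. The sketch below is illustrative only: `evidence_recall_at_k` is a hypothetical helper, and it assumes that `retrieved_ids` is in rank order and that its ids can be compared to session ids by exact match (which may not hold for turn-level retrieval). Only the ten ids visible in the example are used; the truncated remainder is not reconstructed.

```python
def evidence_recall_at_k(answer_session_ids, retrieved_ids, k):
    """Fraction of evidence sessions found among the first k retrieved ids."""
    if not answer_session_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for sid in answer_session_ids if sid in top_k)
    return hits / len(answer_session_ids)


# The ten retrieved ids that are visible in the sample above.
visible_retrieved = [
    'sharegpt_yywfIrx_0', '85a1be56_1', 'sharegpt_Jcy1CVN_0',
    'sharegpt_Cr2tc1f_0', 'sharegpt_DGTCD7D_0', 'f6859b48_2',
    '52c34859_1', 'ultrachat_231069', 'sharegpt_qRdLQvN_7',
    'ultrachat_359984',
]

recall_at_10 = evidence_recall_at_k(['answer_280352e9'], visible_retrieved, k=10)
```

For this sample the evidence session is not among the ten visible ids, so `recall_at_10` is 0.0. This says nothing about the 43 truncated items.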

Prompt Template#

No prompt template defined.

Extra Parameters#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| eval_mode | str | long_context | Evaluation mode. Choices: ['oracle_context', 'long_context', 'retrieval_log'] |
| retrieval_log_path | str or null | None | Path to the retrieval log file used in retrieval_log mode. |
| retriever_type | str | flat-session | Retrieval prompt shape for retrieval_log mode. Choices: ['flat-session', 'flat-turn'] |
| history_format | str | json | History rendering format. Choices: ['json', 'nl'] |
| user_only | bool | False | Whether to keep only user turns in history. |
| reading_method | str | con | Prompt reading method. con asks the model to extract and reason before answering. Choices: ['direct', 'con'] |
| topk_context | int | 1000 | Maximum number of history sessions or retrieved chunks included in the prompt. |
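The defaults and allowed choices listed above can be checked before launching a long run. This is a minimal sketch, not evalscope's own validation: `resolve_extra_params` is a hypothetical helper that merges user overrides into the documented defaults and rejects unknown keys or values outside the documented choices.

```python
# Defaults and choices copied from the Extra Parameters listing above.
DEFAULTS = {
    'eval_mode': 'long_context',
    'retrieval_log_path': None,
    'retriever_type': 'flat-session',
    'history_format': 'json',
    'user_only': False,
    'reading_method': 'con',
    'topk_context': 1000,
}
CHOICES = {
    'eval_mode': {'oracle_context', 'long_context', 'retrieval_log'},
    'retriever_type': {'flat-session', 'flat-turn'},
    'history_format': {'json', 'nl'},
    'reading_method': {'direct', 'con'},
}


def resolve_extra_params(overrides=None):
    """Merge overrides into the documented defaults (hypothetical helper)."""
    params = dict(DEFAULTS)
    for key, value in (overrides or {}).items():
        if key not in DEFAULTS:
            raise KeyError(f'unknown extra parameter: {key}')
        if key in CHOICES and value not in CHOICES[key]:
            raise ValueError(f'{key}={value!r} is not one of {sorted(CHOICES[key])}')
        params[key] = value
    return params


extra_params = resolve_extra_params({'history_format': 'nl', 'user_only': True})
```

The resulting dictionary can be supplied under the `extra_params` key of the benchmark's `dataset_args` entry.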

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets longmemeval \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['longmemeval'],
    dataset_args={
        'longmemeval': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)