LoCoMo#

Overview#

LoCoMo evaluates very long-term conversational memory in two-person multi-session dialogues. This adapter supports the official question-answering task from locomo10.json.

Task Description#

Task Type: Long-context question answering
Input: Multi-session conversation history with session dates and a question
Output: Short free-form answer
Subsets: qa

Key Features#

Uses the official LoCoMo QA data file hosted on ModelScope
Supports full-history long-context prompts and evidence-only oracle prompts
Includes image captions from the released data when present, but does not download image files
Uses LoCoMo’s rule-based F1 / adversarial refusal scoring instead of an LLM judge

Evaluation Notes#

Default subset is qa with eval_mode=long_context
Use extra_params.eval_mode='oracle_context' for evidence-only upper-bound evaluation
Event summarization, multimodal dialog generation, and RAG retrieval are not included in this QA adapter

Properties#

Property	Value
Benchmark Name	`locomo`
Dataset ID	evalscope/locomo
Paper	N/A
Tags	`LongContext`, `MultiTurn`, `QA`
Metrics	`f1`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics#

Metric	Value
Total Samples	1,986
Prompt Length (Mean)	94097.72 chars
Prompt Length (Min/Max)	55178 / 111026 chars

Sample Example#

Subset: qa

{
  "input": [
    {
      "id": "b585f7b0",
      "content": "Below is a conversation between two people: Caroline and Melanie. The conversation takes place over multiple days and the date of each conversation is wriiten at the beginning of the conversation.\n\nDATE: 1:56 pm on 8 May, 2023\nCONVERSATION:\nC ... [TRUNCATED 74397 chars] ... m of a short phrase for the following question. Answer with exact words from the context whenever possible.\n\nQuestion: When did Caroline go to the LGBTQ support group? Use DATE of CONVERSATION to answer with an approximate date. Short answer:"
    }
  ],
  "target": "7 May 2023",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "sample_id": "conv-26",
    "qa_index": 0,
    "category": 2,
    "category_name": "temporal",
    "raw_question": "When did Caroline go to the LGBTQ support group?",
    "answer": "7 May 2023",
    "adversarial_answer": null,
    "eval_mode": "long_context",
    "question": "When did Caroline go to the LGBTQ support group? Use DATE of CONVERSATION to answer with an approximate date.",
    "evidence": [
      "D1:3"
    ]
  }
}

Prompt Template#

No prompt template defined.

Extra Parameters#

Parameter	Type	Default	Description
`eval_mode`	`str`	`long_context`	Evaluation mode: long_context or oracle_context. Choices: [‘long_context’, ‘oracle_context’]

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets locomo \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['locomo'],
    dataset_args={
        'locomo': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)