ERQA

Overview

ERQA (Embodied Reasoning QA) is a benchmark for evaluating spatial reasoning and embodied understanding capabilities of multimodal large language models. It tests models’ ability to reason about trajectories, actions, spatial relationships, and task planning in egocentric robotic scenarios.

Task Description

Task Type: Embodied Spatial Reasoning (Multiple Choice)
Input: Egocentric image(s) + multiple-choice question (A/B/C/D)
Output: Single answer letter (A/B/C/D)
Domain: Robotics, spatial reasoning, embodied AI
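
Because the expected output is a single letter, a scorer needs some way to map a free-form model response onto A/B/C/D. The following is a minimal sketch of one such mapping, not EvalScope's actual answer parser; the function name `extract_choice` and both patterns are assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical patterns: a letter at the very start of the response,
# or a letter following "answer is" / "answer:".
_PREFIX = re.compile(r"^\(?([ABCD])\)?(?:[.:,)\s]|$)")
_FALLBACK = re.compile(r"(?i:answer)\s*(?:is|:)\s*\(?([ABCD])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the answer letter found in a model response, or None."""
    text = response.strip()
    match = _PREFIX.match(text) or _FALLBACK.search(text)
    return match.group(1) if match else None


print(extract_choice("D. Robot picks up the soda can"))
print(extract_choice("The answer is B."))
```

A response that merely begins with the article "A" would be read as choice A, so a stricter parser may be preferable for models that ignore the "answer with only the letter" instruction.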

Key Features

400 questions across 8 reasoning categories
Multi-image support (some questions require reasoning across multiple views)
Categories: Trajectory Reasoning, Action Reasoning, Pointing, State Estimation, Spatial Reasoning, Multi-view Reasoning, Task Reasoning, Other
Egocentric perspective from robotic manipulation scenarios

Evaluation Notes

Default configuration uses 0-shot evaluation
Evaluates on test split
Primary metric: Accuracy on multiple-choice questions
Answers are single letters (A/B/C/D)
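
Accuracy on single-letter answers reduces to the fraction of exact matches between predicted and target letters. A minimal sketch follows, assuming predictions have already been reduced to a letter (or `None` when no letter could be extracted); the helper name `accuracy` is illustrative, and EvalScope's own `acc` implementation may normalize answers differently.

```python
from typing import Optional, Sequence


def accuracy(predictions: Sequence[Optional[str]], targets: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their target letter."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    # An unparsed prediction (None) never equals a target, so it counts as wrong.
    correct = sum(pred == target for pred, target in zip(predictions, targets))
    return correct / len(targets)


print(accuracy(["A", "B", None, "D"], ["A", "C", "C", "D"]))  # 2 of 4 correct
```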

Properties

Property	Value
Benchmark Name	`erqa`
Dataset ID	evalscope/ERQA
Paper	N/A
Tags	`MCQ`, `MultiModal`, `Reasoning`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics

Metric	Value
Total Samples	400
Prompt Length (Mean)	289.05 chars
Prompt Length (Min/Max)	147 / 834 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean	Prompt Min	Prompt Max
`Trajectory Reasoning`	66	307.3	161	834
`Action Reasoning`	72	271.97	175	503
`Pointing`	34	300.82	178	676
`State Estimation`	55	314.84	147	735
`Spatial Reasoning`	84	286.21	152	819
`Multi-view Reasoning`	37	277.78	196	475
`Task Reasoning`	38	280.18	152	737
`Other`	14	231.79	188	409
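
The per-subset rows can be cross-checked against the overall figures: the sample counts should sum to 400, and the sample-weighted mean of the subset prompt lengths should reproduce the overall mean. The sketch below simply copies the values from the table above; the variable names are illustrative.

```python
# (subset, samples, prompt mean, prompt min, prompt max), copied from the table above
SUBSETS = [
    ("Trajectory Reasoning", 66, 307.3, 161, 834),
    ("Action Reasoning", 72, 271.97, 175, 503),
    ("Pointing", 34, 300.82, 178, 676),
    ("State Estimation", 55, 314.84, 147, 735),
    ("Spatial Reasoning", 84, 286.21, 152, 819),
    ("Multi-view Reasoning", 37, 277.78, 196, 475),
    ("Task Reasoning", 38, 280.18, 152, 737),
    ("Other", 14, 231.79, 188, 409),
]

total_samples = sum(n for _, n, _, _, _ in SUBSETS)
# Weight each subset mean by its sample count to recover the overall mean.
weighted_mean = sum(n * mean for _, n, mean, _, _ in SUBSETS) / total_samples
overall_min = min(lo for _, _, _, lo, _ in SUBSETS)
overall_max = max(hi for _, _, _, _, hi in SUBSETS)

print(total_samples, round(weighted_mean, 2), overall_min, overall_max)
```

The recomputed values agree with the overall statistics listed earlier (400 samples, mean 289.05, min 147, max 834).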

Image Statistics:

Metric	Value
Total Images	630
Images per Sample	min: 1, max: 16, mean: 1.57
Resolution Range	200x150 - 3072x4080
Formats	jpeg

Sample Example

Subset: Trajectory Reasoning

{
  "input": [
    {
      "id": "eec4c5cf",
      "content": [
        {
          "text": "If the yellow robot gripper follows the yellow trajectory, what will happen? Choices: A. Robot puts the soda on the wooden steps. B. Robot moves the soda in front of the wooden steps. C. Robot moves the soda to the very top of the wooden steps. D. Robot picks up the soda can and moves it up. Please answer directly with only the letter of the correct option and nothing else."
        },
        {
          "image": "[BASE64_IMAGE: jpeg, ~16.0KB]"
        }
      ]
    }
  ],
  "choices": [
    "A",
    "B",
    "C",
    "D"
  ],
  "target": "A",
  "id": 0,
  "group_id": 0,
  "subset_key": "Trajectory Reasoning",
  "metadata": {
    "question_id": "ERQA_1",
    "question_type": "Trajectory Reasoning"
  }
}
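
In this sample the question stem, the lettered options, and the answer instruction are packed into a single `text` field. The sketch below shows one way to split them apart for inspection. It assumes every prompt follows the same "Choices: A. ... B. ... Please answer ..." layout as this one sample, which the source does not guarantee; `split_prompt` is a hypothetical helper, not part of EvalScope.

```python
import re
from typing import Dict, Tuple

# Assumed layout: "<stem> Choices: A. <text> B. <text> ... Please answer ..."
_OPTION = re.compile(r"\b([A-D])\. (.*?)(?= [A-D]\. | Please answer|$)", re.DOTALL)


def split_prompt(text: str) -> Tuple[str, Dict[str, str]]:
    """Split a prompt into its question stem and a letter -> option mapping."""
    stem, sep, rest = text.partition(" Choices: ")
    if not sep:
        return text, {}  # layout not recognized; leave the prompt untouched
    options = {letter: body.strip() for letter, body in _OPTION.findall(rest)}
    return stem, options


PROMPT = (
    "If the yellow robot gripper follows the yellow trajectory, what will happen? "
    "Choices: A. Robot puts the soda on the wooden steps. "
    "B. Robot moves the soda in front of the wooden steps. "
    "C. Robot moves the soda to the very top of the wooden steps. "
    "D. Robot picks up the soda can and moves it up. "
    "Please answer directly with only the letter of the correct option and nothing else."
)

stem, options = split_prompt(PROMPT)
print(stem)
for letter, body in options.items():
    print(letter, "->", body)
```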

Prompt Template

Prompt Template:

{question}

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets erqa \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['erqa'],
    dataset_args={
        'erqa': {
            # 'subset_list': ['Trajectory Reasoning', 'Action Reasoning', 'Pointing'],  # optional: uncomment to evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
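
When restricting evaluation to specific subsets, a misspelled subset name is easy to miss. The sketch below builds the `dataset_args` dictionary after checking the requested names against the eight categories listed under Key Features. The `subset_list` key is taken from the commented-out option in the example above; `build_erqa_args` is a hypothetical helper and not part of EvalScope.

```python
from typing import Dict, List, Sequence

# The eight ERQA categories, as listed under Key Features.
ERQA_SUBSETS = (
    "Trajectory Reasoning",
    "Action Reasoning",
    "Pointing",
    "State Estimation",
    "Spatial Reasoning",
    "Multi-view Reasoning",
    "Task Reasoning",
    "Other",
)


def build_erqa_args(subsets: Sequence[str]) -> Dict[str, Dict[str, List[str]]]:
    """Return a dataset_args dict for 'erqa', rejecting unknown subset names."""
    unknown = [name for name in subsets if name not in ERQA_SUBSETS]
    if unknown:
        raise ValueError(f"Unknown ERQA subsets: {unknown}")
    return {"erqa": {"subset_list": list(subsets)}}


dataset_args = build_erqa_args(["Trajectory Reasoning", "Pointing"])
print(dataset_args)
```

The resulting dictionary could then be passed as `dataset_args=dataset_args` when constructing `TaskConfig`.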