VisuLogic#

Overview#

VisuLogic is a benchmark for evaluating visual reasoning capabilities of Multimodal Large Language Models (MLLMs), independent of textual reasoning. It features carefully constructed visual reasoning tasks that are inherently difficult to articulate using language alone.

Task Description#

Task Type: Visual Reasoning (Multiple-Choice)
Input: Image + visual reasoning question with 4 choices
Output: Answer letter (A/B/C/D)
Domains: Pure visual reasoning without text-based shortcuts

Key Features#

Six reasoning skill categories:
- Quantitative Reasoning: Understanding quantity changes in images
- Positional Reasoning: Understanding spatial positions
- Spatial Reasoning: Understanding 3D spatial relationships
- Attribute Reasoning: Understanding visual attributes
- Stylistic Reasoning: Understanding visual styles
- Other: Miscellaneous visual reasoning tasks
Tests genuine visual understanding beyond language shortcuts

Evaluation Notes#

Default evaluation uses the test split
Primary metric: Accuracy on multiple-choice questions
Uses zero-shot Chain-of-Thought (CoT) prompting; the last line of the response must read `ANSWER: [LETTER]`, where the letter is one of A, B, C, D
Results grouped by reasoning skill category
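
The scoring described above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual scorer: `extract_answer`, `accuracy_by_category`, and the toy `records` are hypothetical names and data, and it assumes the last `ANSWER: [LETTER]` occurrence in a response is the one that counts.

```python
import re
from collections import defaultdict

# Assumed pattern for the 'ANSWER: [LETTER]' line; only A-D are accepted.
ANSWER_RE = re.compile(r"ANSWER:\s*([ABCD])\b")


def extract_answer(response):
    """Return the letter of the last 'ANSWER: X' match, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def accuracy_by_category(records):
    """Compute accuracy per category.

    records: iterable of (category, response_text, target_letter) tuples.
    A response with no parseable answer counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, response, target in records:
        total[category] += 1
        if extract_answer(response) == target:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy data, invented for illustration only.
records = [
    ("Quantitative Reasoning", "The count grows by one.\nANSWER: A", "A"),
    ("Quantitative Reasoning", "I think it is B.\nANSWER: B", "A"),
    ("Spatial Reasoning", "Folding the net gives C.\nANSWER: C", "C"),
    ("Spatial Reasoning", "No clear pattern found.", "D"),
]
scores = accuracy_by_category(records)
print(scores)
```

Treating an unparseable response as wrong, rather than skipping it, keeps the denominator equal to the number of samples in each category.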

Properties#

Property	Value
Benchmark Name	`visulogic`
Dataset ID	evalscope/VisuLogic
Paper	N/A
Tags	`MCQ`, `Math`, `MultiModal`, `Reasoning`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics#

Metric	Value
Total Samples	1,000
Prompt Length (Mean)	394.16 chars
Prompt Length (Min/Max)	285 / 697 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean	Prompt Min	Prompt Max
`Quantitative Reasoning`	353	399.15	308	697
`Other`	108	399.08	285	560
`Positional Reasoning`	136	372.37	295	448
`Stylistic Reasoning`	90	375.32	303	483
`Spatial Reasoning`	231	401.91	314	537
`Attribute Reasoning`	82	401.15	295	458
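
As a quick consistency check on the table above, the per-subset sample counts should sum to the reported total, and the sample-weighted average of the per-subset prompt means should reproduce the overall mean. A small sketch using only the figures listed here (the variable names are illustrative):

```python
# (samples, mean prompt length in chars), copied from the per-subset table.
subsets = {
    "Quantitative Reasoning": (353, 399.15),
    "Other": (108, 399.08),
    "Positional Reasoning": (136, 372.37),
    "Stylistic Reasoning": (90, 375.32),
    "Spatial Reasoning": (231, 401.91),
    "Attribute Reasoning": (82, 401.15),
}

total_samples = sum(count for count, _ in subsets.values())

# Weight each subset mean by its sample count to recover the overall mean.
weighted_mean = sum(count * mean for count, mean in subsets.values()) / total_samples

print(total_samples)            # 1000
print(round(weighted_mean, 2))  # 394.16
```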

Image Statistics:

Metric	Value
Total Images	1,000
Images per Sample	min: 1, max: 1, mean: 1
Resolution Range	288x125 - 700x825
Formats	jpeg, png

Sample Example#

Subset: Quantitative Reasoning

{
  "input": [
    {
      "id": "1a6407d8",
      "content": [
        {
          "text": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A, B, C, D. Think step by step before answering.\n\nFrom the four given options, select the most suitable one to fill in the question mark, so that a certain regularity is presented:\n\n\n\nA: A  \nB: B  \nC: C  \nD: D"
        },
        {
          "image": "[BASE64_IMAGE: png, ~44.9KB]"
        }
      ]
    }
  ],
  "choices": [
    "A",
    "B",
    "C",
    "D"
  ],
  "target": "A",
  "id": 0,
  "group_id": 0,
  "subset_key": "Quantitative Reasoning",
  "metadata": {
    "id": "00000"
  }
}
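
To make the record layout concrete, here is a minimal sketch that walks a sample shaped like the one above and separates its text and image parts. The `split_content` helper is hypothetical, the prompt text is trimmed for brevity, and the image value is the same placeholder string shown in the example rather than real base64 data.

```python
# A trimmed copy of the sample above; the prompt text is shortened here.
sample = {
    "input": [
        {
            "id": "1a6407d8",
            "content": [
                {"text": "Answer the following multiple choice question."},
                {"image": "[BASE64_IMAGE: png, ~44.9KB]"},
            ],
        }
    ],
    "choices": ["A", "B", "C", "D"],
    "target": "A",
    "subset_key": "Quantitative Reasoning",
}


def split_content(record):
    """Collect the text parts and image parts from every input message."""
    texts, images = [], []
    for message in record["input"]:
        for part in message["content"]:
            if "text" in part:
                texts.append(part["text"])
            elif "image" in part:
                images.append(part["image"])
    return texts, images


texts, images = split_content(sample)
print(len(texts), len(images), sample["subset_key"], sample["target"])
```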

Prompt Template#

Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A, B, C, D. Think step by step before answering.

{question}
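
One way to see how the template and a question combine into the prompt shown in the sample example is the sketch below. It assumes plain `str.format` substitution of the `{question}` placeholder, which may differ from how the framework fills the template internally; `PROMPT_TEMPLATE` and `question` are illustrative names.

```python
# Template text copied from above; '{question}' is the only placeholder.
PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'ANSWER: [LETTER]' "
    "(without quotes) where [LETTER] is one of A, B, C, D. "
    "Think step by step before answering.\n\n{question}"
)

# Question text taken from the sample example, including its option lines.
question = (
    "From the four given options, select the most suitable one to fill in "
    "the question mark, so that a certain regularity is presented:"
    "\n\n\n\nA: A  \nB: B  \nC: C  \nD: D"
)

prompt = PROMPT_TEMPLATE.format(question=question)
print(prompt)
```

The square brackets in `[LETTER]` are not format fields, so they pass through `str.format` unchanged.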

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets visulogic \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['visulogic'],
    dataset_args={
        'visulogic': {
            # Optional: evaluate specific subsets only.
            # 'subset_list': ['Quantitative Reasoning', 'Other', 'Positional Reasoning'],
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)