TIR-Bench#

Overview#

TIR-Bench (Thinking-with-Images Reasoning Benchmark) is a comprehensive multimodal benchmark that evaluates agentic visual reasoning capabilities of vision-language models. It covers diverse task categories requiring spatial, compositional, and multi-step visual reasoning.

Task Description#

  • Task Type: Multi-task Visual Reasoning (MCQ, OCR, word search, spot difference, jigsaw, etc.)

  • Input: One or two images + question (most tasks use multiple-choice format)

  • Output: Answer letter (MCQ) or numeric/text response depending on task type

  • Domains: instrument, color, refcoco, rotation_game, math, word_search, visual_search, ocr, symbolic, spot_difference, contrast, jigsaw, maze

Key Features#

  • 1,215 test samples across 13 diverse visual reasoning task categories

  • Covers single-image and dual-image reasoning scenarios

  • Answers span letter choices (A-J), integers, floats, and text

  • Task-specific scoring with LLM-as-judge fallback for robust evaluation

Evaluation Notes#

  • Default evaluation uses the test split (1,215 samples)

  • Primary metric: Accuracy (acc)

  • Images are downloaded as data.zip from ModelScope and extracted automatically

  • Rule-based scoring: OCR (substring match), jigsaw (grid IoU), spot_difference (set IoU), word_search (numeric match), all other tasks (MCQ / numeric judge)

  • Recommended: set judge_strategy=JudgeStrategy.LLM_RECALL and provide judge_model_args to activate LLM-as-judge as a recall mechanism — the judge is called only when rule-based scoring gives 0, providing more accurate evaluation without unnecessary API overhead

  • Paper | GitHub

Properties#

Property

Value

Benchmark Name

tir_bench

Dataset ID

evalscope/TIR-Bench

Paper

Paper

Tags

MultiModal, QA, Reasoning

Metrics

acc

Default Shots

0-shot

Evaluation Split

test

Data Statistics#

Metric

Value

Total Samples

1,215

Prompt Length (Mean)

384.97 chars

Prompt Length (Min/Max)

19 / 4039 chars

Per-Subset Statistics:

Subset

Samples

Prompt Mean

Prompt Min

Prompt Max

instrument

80

110.96

57

196

color

100

130.26

98

241

refcoco

120

144.51

132

182

rotation_game

75

146.44

140

148

math

120

126.69

50

397

word_search

100

126.72

24

307

visual_search

120

111.64

19

501

ocr

60

35.08

29

116

symbolic

50

88.18

66

243

spot_difference

100

1114.79

93

1379

contrast

50

48.1

31

123

jigsaw

120

605

605

605

maze

120

1527.06

626

4039

Image Statistics:

Metric

Value

Total Images

1,255

Images per Sample

min: 1, max: 2, mean: 1.03

Resolution Range

60x23 - 6944x9280

Formats

jpeg, mpo, png, webp

Sample Example#

Subset: instrument

{
  "input": [
    {
      "id": "dd25f10d",
      "content": [
        {
          "image": "[BASE64_IMAGE: jpg, ~2.9MB]"
        },
        {
          "text": "According to the image, what is the thermometer reading in Fahrenheit? Answer as an integer like 1,2,3."
        }
      ]
    }
  ],
  "target": "72",
  "id": 0,
  "group_id": 0,
  "subset_key": "instrument",
  "metadata": {
    "task": "instrument",
    "meta_data": {},
    "id": 6
  }
}

Prompt Template#

No prompt template defined.

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets tir_bench \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['tir_bench'],
    dataset_args={
        'tir_bench': {
            # subset_list: ['instrument', 'color', 'refcoco']  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)