TIR-Bench#
Overview#
TIR-Bench (Thinking-with-Images Reasoning Benchmark) is a comprehensive multimodal benchmark that evaluates agentic visual reasoning capabilities of vision-language models. It covers diverse task categories requiring spatial, compositional, and multi-step visual reasoning.
Task Description#
Task Type: Multi-task Visual Reasoning (MCQ, OCR, word search, spot difference, jigsaw, etc.)
Input: One or two images + question (most tasks use multiple-choice format)
Output: Answer letter (MCQ) or numeric/text response depending on task type
Domains: instrument, color, refcoco, rotation_game, math, word_search, visual_search, ocr, symbolic, spot_difference, contrast, jigsaw, maze
Key Features#
1,215 test samples across 13 diverse visual reasoning task categories
Covers single-image and dual-image reasoning scenarios
Answers span letter choices (A-J), integers, floats, and text
Task-specific scoring with LLM-as-judge fallback for robust evaluation
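Because reference answers come in several surface forms, it can help to bucket them before choosing a comparison rule. The following is a minimal sketch; `classify_answer` is a hypothetical helper written for illustration and is not part of the benchmark or of evalscope.

```python
import re


def classify_answer(target: str) -> str:
    """Hypothetical helper: bucket a reference answer by its surface form."""
    t = target.strip()
    if re.fullmatch(r"[A-J]", t):
        return "letter"   # multiple-choice option A-J
    if re.fullmatch(r"[+-]?\d+", t):
        return "integer"
    if re.fullmatch(r"[+-]?(\d+\.\d*|\.\d+)", t):
        return "float"
    return "text"         # anything else is treated as free text


for example in ["B", "72", "3.5", "STOP"]:
    print(example, "->", classify_answer(example))
```

A single capital letter from A to J is always treated as a choice here, so a one-letter text answer would be misclassified. A real scorer would presumably use the task type rather than the answer string alone.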
Evaluation Notes#
Default evaluation uses the test split (1,215 samples)
Primary metric: Accuracy (acc)
Images are downloaded as data.zip from ModelScope and extracted automatically
Rule-based scoring: OCR (substring match), jigsaw (grid IoU), spot_difference (set IoU), word_search (numeric match), all other tasks (MCQ / numeric judge)
Recommended: set judge_strategy=JudgeStrategy.LLM_RECALL and provide judge_model_args to activate LLM-as-judge as a recall mechanism. The judge is called only when rule-based scoring gives 0, which improves accuracy without unnecessary API overhead
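The scoring rules and the recall behaviour described in the notes above can be sketched roughly as follows. This is an illustrative approximation, not the evalscope implementation: the function names, the case-insensitive comparison, and the stand-in judge are all assumptions, and the jigsaw grid IoU and numeric matching rules are omitted.

```python
from typing import Callable, Iterable, Optional


def substring_match(prediction: str, target: str) -> float:
    """OCR-style rule: 1.0 if the target occurs in the prediction.
    Case-insensitive comparison is an assumption of this sketch."""
    return 1.0 if target.strip().lower() in prediction.strip().lower() else 0.0


def set_iou(predicted: Iterable[int], expected: Iterable[int]) -> float:
    """Set IoU: size of the intersection divided by size of the union."""
    p, e = set(predicted), set(expected)
    union = p | e
    return len(p & e) / len(union) if union else 1.0


def score_with_recall(rule_score: float,
                      judge: Optional[Callable[[], float]] = None) -> float:
    """Recall strategy: consult the judge only when the rule-based score is 0."""
    if rule_score == 0 and judge is not None:
        return judge()
    return rule_score


judge_calls = []


def fake_judge() -> float:
    """Stand-in for an LLM judge; records that it was called."""
    judge_calls.append(1)
    return 1.0


print(substring_match("The sign says STOP here", "stop"))  # 1.0
print(set_iou({1, 2, 3}, {2, 3, 4}))                       # 0.5
print(score_with_recall(0.5, fake_judge), len(judge_calls))  # 0.5 0
print(score_with_recall(0.0, fake_judge), len(judge_calls))  # 1.0 1
```

The last two lines show the point of the recall strategy: a non-zero rule-based score is returned as-is without touching the judge, so judge calls are spent only on samples the rules rejected.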
Properties#
| Property | Value |
|---|---|
| Benchmark Name | tir_bench |
| Dataset ID | |
| Paper | |
| Tags | |
| Metrics | acc |
| Default Shots | 0-shot |
| Evaluation Split | test |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 1,215 |
| Prompt Length (Mean) | 384.97 chars |
| Prompt Length (Min/Max) | 19 / 4039 chars |
Per-Subset Statistics:
| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|---|---|---|---|---|
| | 80 | 110.96 | 57 | 196 |
| | 100 | 130.26 | 98 | 241 |
| | 120 | 144.51 | 132 | 182 |
| | 75 | 146.44 | 140 | 148 |
| | 120 | 126.69 | 50 | 397 |
| | 100 | 126.72 | 24 | 307 |
| | 120 | 111.64 | 19 | 501 |
| | 60 | 35.08 | 29 | 116 |
| | 50 | 88.18 | 66 | 243 |
| | 100 | 1114.79 | 93 | 1379 |
| | 50 | 48.1 | 31 | 123 |
| | 120 | 605 | 605 | 605 |
| | 120 | 1527.06 | 626 | 4039 |
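As a quick consistency check, the per-subset rows above can be recombined into the overall figures. The sketch below copies the sample counts and mean prompt lengths from the table, in table order, and recomputes the total and the sample-weighted mean.

```python
# (samples, mean prompt length in chars) per subset, in table order
rows = [
    (80, 110.96), (100, 130.26), (120, 144.51), (75, 146.44),
    (120, 126.69), (100, 126.72), (120, 111.64), (60, 35.08),
    (50, 88.18), (100, 1114.79), (50, 48.1), (120, 605), (120, 1527.06),
]

total_samples = sum(n for n, _ in rows)
# Weight each subset mean by its sample count to get the overall mean.
weighted_mean = sum(n * mean for n, mean in rows) / total_samples

print(total_samples)            # 1215
print(round(weighted_mean, 2))  # 384.97
```

Both values agree with the reported totals of 1,215 samples and a mean prompt length of 384.97 characters.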
Image Statistics:
| Metric | Value |
|---|---|
| Total Images | 1,255 |
| Images per Sample | min: 1, max: 2, mean: 1.03 |
| Resolution Range | 60x23 - 6944x9280 |
| Formats | jpeg, mpo, png, webp |
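The image counts above also imply how many samples are dual-image. A short arithmetic sketch, assuming the total counts one image per image slot in each sample:

```python
total_images = 1255
total_samples = 1215

# Every sample has 1 or 2 images, so each image beyond one per sample
# corresponds to exactly one dual-image sample.
dual_image_samples = total_images - total_samples
mean_images = total_images / total_samples

print(dual_image_samples)     # 40
print(round(mean_images, 2))  # 1.03
```

Under that assumption, 40 of the 1,215 samples use two images and the rest use one.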
Sample Example#
Subset: instrument
{
  "input": [
    {
      "id": "dd25f10d",
      "content": [
        {
          "image": "[BASE64_IMAGE: jpg, ~2.9MB]"
        },
        {
          "text": "According to the image, what is the thermometer reading in Fahrenheit? Answer as an integer like 1,2,3."
        }
      ]
    }
  ],
  "target": "72",
  "id": 0,
  "group_id": 0,
  "subset_key": "instrument",
  "metadata": {
    "task": "instrument",
    "meta_data": {},
    "id": 6
  }
}
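To make the record layout concrete, the sketch below walks a trimmed copy of the sample above and separates the question text from the image parts. `summarize` is a hypothetical helper for illustration; it relies only on the keys visible in the example.

```python
# Trimmed copy of the sample record shown above.
record = {
    "input": [
        {
            "id": "dd25f10d",
            "content": [
                {"image": "[BASE64_IMAGE: jpg, ~2.9MB]"},
                {"text": "According to the image, what is the thermometer "
                         "reading in Fahrenheit? Answer as an integer like 1,2,3."},
            ],
        }
    ],
    "target": "72",
    "subset_key": "instrument",
}


def summarize(rec: dict) -> tuple:
    """Hypothetical helper: return (question text, number of image parts)."""
    texts, n_images = [], 0
    for message in rec["input"]:
        for part in message["content"]:
            if "text" in part:
                texts.append(part["text"])
            elif "image" in part:
                n_images += 1
    return " ".join(texts), n_images


question, n_images = summarize(record)
print(record["subset_key"], n_images, record["target"])
print(question)
```

Each message carries a `content` list whose parts hold either an `image` or a `text` key, so a dual-image sample would presumably contribute two image parts to the same list.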
Prompt Template#
No prompt template defined.
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets tir_bench \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['tir_bench'],
    dataset_args={
        'tir_bench': {
            # 'subset_list': ['instrument', 'color', 'refcoco'],  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)