OCRBench-v2#

Overview#

OCRBench v2 is a large-scale, bilingual, text-centric benchmark. It is presented as having the most comprehensive set of OCR tasks to date (four times as many as OCRBench v1) and covers 31 diverse scenarios, including street scenes, receipts, formulas, and diagrams.

Task Description#

Task Type: Optical Character Recognition and Document Understanding
Input: Image + OCR/document question
Output: Text recognition, extraction, or analysis result
Languages: English and Chinese (bilingual)

Key Features#

10,000 human-verified question-answering pairs
31 diverse scenarios (street scene, receipt, formula, diagram, etc.)
High proportion of difficult samples
Comprehensive OCR task coverage
Bilingual (English and Chinese) evaluation

Evaluation Notes#

Default configuration uses 0-shot evaluation
Evaluates on the `test` split
Requires the Python packages apted, distance, Levenshtein, lxml, Polygon3, and zss
Reports a single accuracy metric (`acc`)
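
Because scoring depends on several third-party packages, it can help to confirm they are importable before starting a long run. The following is a minimal sketch, not part of EvalScope: the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names, and the import name `Polygon` for the `Polygon3` package is an assumption to verify in your environment.

```python
from importlib.util import find_spec

# Assumed mapping from pip package name to importable module name.
# "Polygon3" is assumed to install a module named "Polygon"; verify locally.
REQUIRED = {
    "apted": "apted",
    "distance": "distance",
    "Levenshtein": "Levenshtein",
    "lxml": "lxml",
    "Polygon3": "Polygon",
    "zss": "zss",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        # Suggest a single install command for everything that is absent.
        print("Missing packages; try: pip install " + " ".join(missing))
    else:
        print("All listed scoring dependencies were found.")
```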

Properties#

Property	Value
Benchmark Name	`ocr_bench_v2`
Dataset ID	evalscope/OCRBench_v2
Paper	N/A
Tags	`Knowledge`, `MultiModal`, `QA`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics#

Metric	Value
Total Samples	10,000
Prompt Length (Mean)	155.62 chars
Prompt Length (Min/Max)	6 / 1863 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean (chars)	Prompt Min (chars)	Prompt Max (chars)
`APP agent en`	300	36.9	16	84
`ASCII art classification en`	200	174.62	162	193
`key information extraction cn`	400	32.45	9	162
`key information extraction en`	400	506.23	327	1261
`key information mapping en`	300	664.02	429	1647
`VQA with position en`	300	557.69	539	612
`chart parsing en`	400	67	67	67
`cognition VQA cn`	200	15.91	6	48
`cognition VQA en`	800	44.73	14	179
`diagram QA en`	300	118.23	54	356
`document classification en`	200	314	314	314
`document parsing cn`	300	43	43	43
`document parsing en`	400	51	51	51
`formula recognition cn`	200	19	19	19
`formula recognition en`	400	58.39	53	60
`handwritten answer extraction cn`	200	39.75	22	60
`math QA en`	300	119	119	119
`full-page OCR cn`	200	31	31	31
`full-page OCR en`	200	91	91	91
`reasoning VQA en`	600	80.45	26	256
`reasoning VQA cn`	400	163.96	62	633
`fine-grained text recognition en`	200	155.63	152	156
`science QA en`	300	387.54	159	1863
`table parsing cn`	300	139.4	79	211
`table parsing en`	400	65.39	57	134
`text counting en`	200	113.87	101	127
`text grounding en`	200	361.5	357	379
`text recognition en`	800	37.26	29	104
`text spotting en`	200	446	446	446
`text translation cn`	400	230.88	96	291
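
For readers who want to reproduce figures like these on their own data, the sketch below computes the same four columns from a list of prompt strings. It assumes lengths are counted in characters with `len`, as the "chars" unit suggests; the `prompt_length_stats` helper and the sample prompts are hypothetical and not taken from the dataset.

```python
from statistics import mean


def prompt_length_stats(prompts):
    """Summarize character lengths for a non-empty list of prompt strings."""
    lengths = [len(p) for p in prompts]
    return {
        "samples": len(lengths),
        "mean": round(mean(lengths), 2),
        "min": min(lengths),
        "max": max(lengths),
    }


if __name__ == "__main__":
    # Hypothetical prompts, used only to exercise the helper.
    demo = ["Read the text.", "What is written on the sign?"]
    print(prompt_length_stats(demo))
```

When a subset reports identical mean, minimum, and maximum values (for example `chart parsing en` at 67), every prompt in that subset has the same length, which likely indicates a fixed instruction string.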

Image Statistics:

Metric	Value
Total Images	10,000
Images per Sample	min: 1, max: 1, mean: 1
Resolution Range	19x10 - 3912x21253
Formats	jpeg

Sample Example#

Subset: APP agent en

{
  "input": [
    {
      "id": "51292e32",
      "content": [
        {
          "text": "What is the wrong answer 2?"
        },
        {
          "image": "[BASE64_IMAGE: jpeg, ~121.7KB]"
        }
      ]
    }
  ],
  "target": "[\"enabled\", \"on\"]",
  "id": 0,
  "group_id": 0,
  "subset_key": "APP agent en",
  "metadata": {
    "question": "What is the wrong answer 2?",
    "answers": [
      "enabled",
      "on"
    ],
    "eval": "None",
    "dataset_name": "rico",
    "type": "APP agent en",
    "bbox": null,
    "bbox_list": null,
    "content": null
  }
}
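
Note that `target` in the record above is a JSON-encoded string, while `metadata.answers` holds the same values as a list. The sketch below decodes `target` and applies a naive containment check. It assumes `target` always parses as JSON; `accepted_answers` and `naive_match` are hypothetical helpers for inspection only and are not the benchmark's scoring logic.

```python
import json

# Trimmed copy of the record shown above (image content omitted).
sample = {
    "target": "[\"enabled\", \"on\"]",
    "metadata": {"answers": ["enabled", "on"]},
}


def accepted_answers(record):
    """Decode the JSON-encoded `target` string into a list of answer strings."""
    decoded = json.loads(record["target"])
    return decoded if isinstance(decoded, list) else [str(decoded)]


def naive_match(prediction, record):
    """Hypothetical check: True if any accepted answer occurs in the prediction.

    Case-insensitive substring matching; NOT the official scorer.
    """
    text = prediction.lower()
    return any(answer.lower() in text for answer in accepted_answers(record))


if __name__ == "__main__":
    print(accepted_answers(sample))
    print(naive_match("The toggle is ON.", sample))
```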

Prompt Template#

Prompt Template (the question text is passed through unchanged):

{question}
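
As an illustration of how this template relates to the sample record, the sketch below fills it in and pairs the resulting text with an image entry, mirroring the `content` list in the example. The `build_content` helper and the image placeholder are hypothetical; how EvalScope assembles messages internally is not shown in this document.

```python
PROMPT_TEMPLATE = "{question}"


def build_content(question, image_ref):
    """Fill the template and return a text part followed by an image part."""
    prompt = PROMPT_TEMPLATE.format(question=question)
    return [{"text": prompt}, {"image": image_ref}]


if __name__ == "__main__":
    # Placeholder image reference; a real run would carry encoded image data.
    parts = build_content("What is the wrong answer 2?", "[BASE64_IMAGE]")
    print(parts[0]["text"])
```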

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets ocr_bench_v2 \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['ocr_bench_v2'],
    dataset_args={
        'ocr_bench_v2': {
            # 'subset_list': ['APP agent en', 'ASCII art classification en', 'key information extraction cn'],  # optional: evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
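
To restrict a run to particular subsets, the commented-out `subset_list` entry in the configuration above suggests passing the subset names under `dataset_args`. The sketch below builds that dictionary and rejects names that do not appear in the Per-Subset Statistics table. `KNOWN_SUBSETS` and `subset_args` are hypothetical helpers, and the exact shape accepted by `TaskConfig` is assumed from that comment rather than confirmed here.

```python
# Subset names copied from the Per-Subset Statistics table.
KNOWN_SUBSETS = {
    "APP agent en", "ASCII art classification en",
    "key information extraction cn", "key information extraction en",
    "key information mapping en", "VQA with position en",
    "chart parsing en", "cognition VQA cn", "cognition VQA en",
    "diagram QA en", "document classification en",
    "document parsing cn", "document parsing en",
    "formula recognition cn", "formula recognition en",
    "handwritten answer extraction cn", "math QA en",
    "full-page OCR cn", "full-page OCR en",
    "reasoning VQA en", "reasoning VQA cn",
    "fine-grained text recognition en", "science QA en",
    "table parsing cn", "table parsing en", "text counting en",
    "text grounding en", "text recognition en", "text spotting en",
    "text translation cn",
}


def subset_args(subsets, known=KNOWN_SUBSETS):
    """Build a dataset_args dict for ocr_bench_v2, rejecting unknown names."""
    unknown = [name for name in subsets if name not in known]
    if unknown:
        raise ValueError(f"Unknown subset name(s): {unknown}")
    return {"ocr_bench_v2": {"subset_list": list(subsets)}}


if __name__ == "__main__":
    print(subset_args(["formula recognition en", "table parsing en"]))
```

The returned dictionary can then be supplied as `dataset_args=` when constructing `TaskConfig`. Subset names contain spaces and mixed case, so validating them up front catches typos before any model calls are made.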