OCRBench#

Overview#

OCRBench is a comprehensive evaluation benchmark designed to assess the OCR (Optical Character Recognition) capabilities of Large Multimodal Models. It covers five key OCR-related tasks with 1,000 manually verified question-answer pairs.

Task Description#

Task Type: OCR and Document Understanding
Input: Image with OCR-related question
Output: Text recognition or extraction result
Components: Text Recognition, VQA, Document VQA, Key Info Extraction, Math Expression Recognition

Key Features#

1,000 question-answer pairs across 10 categories
Manually verified and corrected answers
Categories include: Regular/Irregular/Artistic/Handwriting Text Recognition
Scene Text-centric VQA and Document-oriented VQA
Key Information Extraction
Handwritten Mathematical Expression Recognition (HME100k)

Evaluation Notes#

Default configuration uses 0-shot evaluation
Simple accuracy metric (inclusion-based matching)
Results broken down by question type/category
Different matching rules for HME100k (space-insensitive)
Comprehensive test of OCR capabilities in multimodal models

Properties#

Property	Value
Benchmark Name	`ocr_bench`
Dataset ID	evalscope/OCRBench
Paper	N/A
Tags	`Knowledge`, `MultiModal`, `QA`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics#

Metric	Value
Total Samples	1,000
Prompt Length (Mean)	55.78 chars
Prompt Length (Min/Max)	14 / 149 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean	Prompt Min	Prompt Max
`Regular Text Recognition`	50	29	29	29
`Irregular Text Recognition`	50	29	29	29
`Artistic Text Recognition`	50	29	29	29
`Handwriting Recognition`	50	29	29	29
`Digit String Recognition`	50	32	32	32
`Non-Semantic Text Recognition`	50	29	29	29
`Scene Text-centric VQA`	200	34.6	14	101
`Doc-oriented VQA`	200	59	21	136
`Key Information Extraction`	200	101.58	86	149
`Handwritten Mathematical Expression Recognition`	100	79	79	79

Image Statistics:

Metric	Value
Total Images	1,000
Images per Sample	min: 1, max: 1, mean: 1
Resolution Range	25x16 - 4961x7016
Formats	jpeg

Sample Example#

Subset: Regular Text Recognition

{
  "input": [
    {
      "id": "ff23a835",
      "content": [
        {
          "text": "what is written in the image?"
        },
        {
          "image": "[BASE64_IMAGE: jpeg, ~1.2KB]"
        }
      ]
    }
  ],
  "target": "[\"CENTRE\"]",
  "id": 0,
  "group_id": 0,
  "subset_key": "Regular Text Recognition",
  "metadata": {
    "dataset": "IIIT5K",
    "question_type": "Regular Text Recognition"
  }
}

Prompt Template#

Prompt Template:

{question}

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets ocr_bench \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['ocr_bench'],
    dataset_args={
        'ocr_bench': {
            # subset_list: ['Regular Text Recognition', 'Irregular Text Recognition', 'Artistic Text Recognition']  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)