OCRBench#
Overview#
OCRBench is a comprehensive evaluation benchmark designed to assess the OCR (Optical Character Recognition) capabilities of Large Multimodal Models. It covers five key OCR-related tasks with 1,000 manually verified question-answer pairs.
Task Description#
Task Type: OCR and Document Understanding
Input: Image with OCR-related question
Output: Text recognition or extraction result
Components: Text Recognition, Scene Text-centric VQA, Document-oriented VQA, Key Information Extraction, Handwritten Mathematical Expression Recognition
Key Features#
1,000 question-answer pairs across 10 categories
Manually verified and corrected answers
Categories include: Regular/Irregular/Artistic/Handwriting Text Recognition
Scene Text-centric VQA and Document-oriented VQA
Key Information Extraction
Handwritten Mathematical Expression Recognition (HME100k)
Evaluation Notes#
Default configuration uses 0-shot evaluation
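Accuracy metric with inclusion-based matching: a prediction counts as correct when it contains a reference answer
Results are broken down by question type/category
HME100k samples use a different matching rule: spaces are ignored when comparing prediction and reference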
Comprehensive test of OCR capabilities in multimodal models
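The matching rules above can be sketched roughly as follows. This is a minimal illustration and not EvalScope's actual implementation: the names `is_correct` and `accuracy_by_category` are hypothetical, and the case-insensitive comparison for non-HME100k samples is an assumption modeled on the original OCRBench scoring script.

```python
from collections import defaultdict


def is_correct(prediction, answers, dataset=""):
    """Inclusion-based match: correct if any reference answer occurs in the prediction."""
    if dataset == "HME100k":
        # Space-insensitive rule: drop all whitespace before comparing.
        pred = "".join(prediction.split())
        refs = ["".join(answer.split()) for answer in answers]
    else:
        # Assumption: other categories are compared case-insensitively.
        pred = prediction.lower().strip().replace("\n", " ")
        refs = [answer.lower().strip().replace("\n", " ") for answer in answers]
    return any(ref in pred for ref in refs)


def accuracy_by_category(records):
    """Aggregate per-sample results into one accuracy value per question type.

    `records` is a list of dicts with hypothetical keys
    'question_type' (str) and 'correct' (bool).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        totals[record["question_type"]] += 1
        hits[record["question_type"]] += int(record["correct"])
    return {category: hits[category] / totals[category] for category in totals}
```

Under these assumptions, a verbose model reply such as "The word is Centre." would still match the reference "CENTRE", which is why inclusion-based accuracy is more lenient than exact match.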
Properties#
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 1,000 |
| Prompt Length (Mean) | 55.78 chars |
| Prompt Length (Min/Max) | 14 / 149 chars |
Per-Subset Statistics:
| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|---|---|---|---|---|
| | 50 | 29 | 29 | 29 |
| | 50 | 29 | 29 | 29 |
| | 50 | 29 | 29 | 29 |
| | 50 | 29 | 29 | 29 |
| | 50 | 32 | 32 | 32 |
| | 50 | 29 | 29 | 29 |
| | 200 | 34.6 | 14 | 101 |
| | 200 | 59 | 21 | 136 |
| | 200 | 101.58 | 86 | 149 |
| | 100 | 79 | 79 | 79 |
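As a quick sanity check, the per-subset rows can be reconciled with the overall statistics. The sketch below copies the sample counts and mean prompt lengths from the table above. The subset names are omitted because they are not shown there, and the variable names are illustrative only.

```python
# (samples, mean prompt length in chars) for each of the 10 subsets, in table order
subsets = [
    (50, 29), (50, 29), (50, 29), (50, 29), (50, 32), (50, 29),
    (200, 34.6), (200, 59), (200, 101.58), (100, 79),
]

# The per-subset sample counts should add up to the reported total.
total_samples = sum(count for count, _ in subsets)

# Sample-weighted average of the per-subset means.
weighted_mean = sum(count * mean for count, mean in subsets) / total_samples

print(total_samples)   # 1000
print(weighted_mean)   # close to the reported overall mean of 55.78 chars
```

The counts sum to exactly 1,000. The weighted mean lands within about 0.01 of the reported 55.78, a gap that is plausibly due to rounding in the per-subset means.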
Image Statistics:
| Metric | Value |
|---|---|
| Total Images | 1,000 |
| Images per Sample | min: 1, max: 1, mean: 1 |
| Resolution Range | 25x16 - 4961x7016 |
| Formats | jpeg |
Sample Example#
Subset: Regular Text Recognition
{
  "input": [
    {
      "id": "ff23a835",
      "content": [
        {
          "text": "what is written in the image?"
        },
        {
          "image": "[BASE64_IMAGE: jpeg, ~1.2KB]"
        }
      ]
    }
  ],
  "target": "[\"CENTRE\"]",
  "id": 0,
  "group_id": 0,
  "subset_key": "Regular Text Recognition",
  "metadata": {
    "dataset": "IIIT5K",
    "question_type": "Regular Text Recognition"
  }
}
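Note that `target` in the example is a JSON-encoded string holding a list of acceptable answers, not a list itself. A minimal sketch of decoding it is shown below. The helper name `reference_answers` is hypothetical, and the fallback for plain-string targets is an assumption, since only one sample is shown here.

```python
import json


def reference_answers(sample):
    """Return the acceptable answers of a sample as a list of strings."""
    raw = sample["target"]
    try:
        decoded = json.loads(raw)
    except (TypeError, ValueError):
        # Assumption: a target that is not valid JSON is a single plain answer.
        return [str(raw)]
    # A JSON list is used as-is; any other JSON value becomes a one-item list.
    if isinstance(decoded, list):
        return [str(answer) for answer in decoded]
    return [str(decoded)]


sample = {
    "target": "[\"CENTRE\"]",
    "metadata": {"dataset": "IIIT5K", "question_type": "Regular Text Recognition"},
}
print(reference_answers(sample))  # ['CENTRE']
```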
Prompt Template#
Prompt Template:
{question}
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets ocr_bench \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['ocr_bench'],
    dataset_args={
        'ocr_bench': {
            # Optional: uncomment to evaluate specific subsets only
            # 'subset_list': ['Regular Text Recognition', 'Irregular Text Recognition', 'Artistic Text Recognition'],
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)