# Maritime-OCR-Bench
## Overview
Maritime-OCR-Bench is an evaluation benchmark that measures how well multimodal large models handle OCR-related tasks. The current release contains 1,888 manually curated samples across five task types.
## Task Types
- VQA: Visual question answering on document/scene images
- IE: Information extraction requiring strict JSON output
- parsing: Text recognition and parsing from images
- json1: Text spotting with JSON v1 structured output
- json2: Text spotting with JSON v2 structured output
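
The IE task expects the model's reply to be JSON and nothing else. What exactly counts as "strict" is not spelled out here, so the following is only a sketch under one plausible assumption: the whole reply must parse as a single JSON object, with no surrounding prose or code fences. The helper name `parse_strict_json` is hypothetical and not part of the benchmark.

```python
import json


def parse_strict_json(output):
    """Return the parsed dict if `output` is exactly one JSON object, else None.

    Assumption: "strict" means the entire reply must be valid JSON. Replies
    wrapped in explanations or Markdown fences are rejected.
    """
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        # Not valid JSON (or not a string at all).
        return None
    # Only a top-level object counts; bare lists, numbers or strings do not.
    return parsed if isinstance(parsed, dict) else None


print(parse_strict_json('{"Revision": "REV 1"}'))
print(parse_strict_json('Here is the JSON: {"Revision": "REV 1"}'))
```

The second call returns `None` because `json.loads` refuses input with text before or after the JSON value.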
## Evaluation Metrics
Each task type uses a specialized scoring method:
- VQA/parsing: Multi-dimensional text similarity (edit distance, char F1, LCS F1, table-aware similarity)
- IE: Text coverage + JSON strictness (`0.5 * coverage + 0.5 * json_strict`)
- json1/json2: DIoU layout score + text score (`0.7 * diou + 0.3 * text`)
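
To make these formulas concrete, here is a minimal sketch of some of the building blocks. It assumes textbook definitions (Levenshtein distance, multiset character overlap, Distance-IoU on axis-aligned `(x1, y1, x2, y2)` boxes). The function names are hypothetical and the benchmark's own implementation may differ in details such as normalization. How the VQA/parsing dimensions are aggregated is not stated above, so they are left uncombined.

```python
from collections import Counter


def edit_similarity(pred, ref):
    """1 minus the length-normalized Levenshtein distance, in [0, 1]."""
    if not pred and not ref:
        return 1.0
    prev = list(range(len(ref) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if p == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(pred), len(ref))


def char_f1(pred, ref):
    """F1 over the multiset of characters shared by prediction and reference."""
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def diou(box_a, box_b):
    """Distance-IoU of two (x1, y1, x2, y2) boxes; the raw value lies in [-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0
    # Squared distance between the two box centers.
    center_sq = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both.
    diag_sq = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
        + (max(ay2, by2) - min(ay1, by1)) ** 2
    return iou - center_sq / diag_sq if diag_sq > 0 else iou


def ie_score(coverage, json_strict):
    """IE weighting from the list above."""
    return 0.5 * coverage + 0.5 * json_strict


def spotting_score(diou_score, text_score):
    """json1/json2 weighting from the list above."""
    return 0.7 * diou_score + 0.3 * text_score
```

For instance, an IE answer with full text coverage that fails the JSON strictness check (`json_strict = 0`) would score `ie_score(1.0, 0.0) = 0.5` under this weighting.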
## Properties
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | |
## Data Statistics
| Metric | Value |
|---|---|
| Total Samples | 1,888 |
| Prompt Length (Mean) | 102.91 chars |
| Prompt Length (Min/Max) | 23 / 288 chars |
Per-Subset Statistics:

| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|---|---|---|---|---|
| | 471 | 23 | 23 | 23 |
| | 471 | 39.65 | 28 | 141 |
| | 472 | 80 | 80 | 80 |
| | 237 | 256.1 | 248 | 288 |
| | 237 | 279.9 | 248 | 288 |
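
As a consistency check, the overall figures in the Data Statistics table follow from the per-subset rows: the sample counts add up to the total, and the overall mean prompt length is the sample-weighted average of the per-subset means. The short sketch below redoes that arithmetic with the rows in table order. Because the per-subset means are themselves rounded, agreement is only expected to about two decimals.

```python
# (samples, mean prompt length in chars) for each subset row, in table order.
subset_rows = [
    (471, 23),
    (471, 39.65),
    (472, 80),
    (237, 256.1),
    (237, 279.9),
]

total_samples = sum(count for count, _ in subset_rows)
# The weighted average recovers the overall mean from the per-subset means.
overall_mean = sum(count * mean for count, mean in subset_rows) / total_samples

print(total_samples)           # 1888
print(round(overall_mean, 2))  # 102.91
```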
Image Statistics:

| Metric | Value |
|---|---|
| Total Images | 1,888 |
| Images per Sample | min: 1, max: 1, mean: 1 |
| Resolution Range | 108x50 - 4030x4075 |
| Formats | jpeg, png |
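## Sample Example
Subset: IE (the prompt in this sample translates to "Please extract all key information and return it in JSON format.")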

    {
      "input": [
        {
          "id": "1d6ce119",
          "content": [
            {
              "text": "请提取所有关键信息,并以 JSON 格式返回。"
            },
            {
              "image": "[BASE64_IMAGE: png, ~550.9KB]"
            }
          ]
        }
      ],
      "target": "{\n \"Document ID\": \"WiCE Error Message Description\",\n \"Revision\": \"REV 1\",\n \"Date\": \"2024-09-12\",\n \"Company_Logo_Text\":\"WIN GD\",\n \"Error Messages\": [\n {\n \"ID Number\": \"COFD-52\",\n \"Designation\": \"Fuel Pump Control Signal #2 Fa ... [TRUNCATED 1668 chars] ... nt must be within 4 ~ 20mA.\\n• If necessary, the sensor can be replaced with new one (Caution: before dismantling, do depressurize rail)\"\n }\n ],\n \"Footer\": \"T_PC-Drawing_Portrait | Release: 3.10 (2024-05-15)\",\n \"Page\": \"Page 20 of 83\"\n}",
      "id": 0,
      "group_id": 0,
      "subset_key": "IE",
      "metadata": {
        "task_type": "IE",
        "prompt": "<image>请提取所有关键信息,并以 JSON 格式返回。",
        "images": [
          "images/580968569085497344_f33f02a81a.png"
        ]
      }
    }
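
Note that `target` is itself a JSON-encoded string, so it has to be decoded a second time before the reference fields can be compared. The sketch below shows one way to pull a record apart. The `sample` dict is an abbreviated, hand-written stand-in that mirrors the structure above (its `target` keeps only three of the fields shown), and `split_content` is a hypothetical helper, not a benchmark API.

```python
import json

# Abbreviated stand-in for the record above; only the structure matters here.
sample = {
    "input": [
        {
            "id": "1d6ce119",
            "content": [
                {"text": "请提取所有关键信息,并以 JSON 格式返回。"},
                {"image": "[BASE64_IMAGE: png, ~550.9KB]"},
            ],
        }
    ],
    "target": "{\n \"Revision\": \"REV 1\",\n \"Date\": \"2024-09-12\",\n \"Page\": \"Page 20 of 83\"\n}",
    "subset_key": "IE",
}


def split_content(record):
    """Collect the text parts and image parts of every input message."""
    texts, images = [], []
    for message in record["input"]:
        for part in message["content"]:
            if "text" in part:
                texts.append(part["text"])
            elif "image" in part:
                images.append(part["image"])
    return texts, images


texts, images = split_content(sample)
# Second decoding step: the target string holds the reference JSON object.
reference = json.loads(sample["target"])

print(len(texts), len(images))
print(reference["Date"])
```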
## Prompt Template
Prompt Template:

    {question}
## Usage
### Using CLI

    evalscope eval \
        --model YOUR_MODEL \
        --api-url OPENAI_API_COMPAT_URL \
        --api-key EMPTY_TOKEN \
        --datasets maritime_ocr_bench \
        --limit 10  # Remove this line for formal evaluation
### Using Python

    from evalscope import run_task
    from evalscope.config import TaskConfig

    task_cfg = TaskConfig(
        model='YOUR_MODEL',
        api_url='OPENAI_API_COMPAT_URL',
        api_key='EMPTY_TOKEN',
        datasets=['maritime_ocr_bench'],
        dataset_args={
            'maritime_ocr_bench': {
                # 'subset_list': ['IE', 'VQA', 'parsing'],  # optional: evaluate specific subsets
            }
        },
        limit=10,  # Remove this line for formal evaluation
    )

    run_task(task_cfg=task_cfg)
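
To evaluate only some subsets, the optional `subset_list` entry can be filled in. The sketch below assembles the same arguments with a small helper. `build_task_kwargs` is a hypothetical convenience function, not part of evalscope, and the placeholder model, URL and key values are the ones used above. The evalscope imports are kept inside `main()` so the helper can be inspected without the package installed.

```python
def build_task_kwargs(subsets=None, limit=None):
    """Assemble keyword arguments for TaskConfig (hypothetical helper)."""
    kwargs = {
        'model': 'YOUR_MODEL',
        'api_url': 'OPENAI_API_COMPAT_URL',
        'api_key': 'EMPTY_TOKEN',
        'datasets': ['maritime_ocr_bench'],
        'dataset_args': {'maritime_ocr_bench': {}},
    }
    if subsets:
        # Restrict the run to the named subsets, e.g. ['IE', 'VQA'].
        kwargs['dataset_args']['maritime_ocr_bench']['subset_list'] = list(subsets)
    if limit is not None:
        # Cap the number of samples for a quick smoke test.
        kwargs['limit'] = limit
    return kwargs


def main():
    # Imported here so the helper above works without evalscope installed.
    from evalscope import run_task
    from evalscope.config import TaskConfig

    task_cfg = TaskConfig(**build_task_kwargs(subsets=['IE', 'VQA'], limit=10))
    run_task(task_cfg=task_cfg)
```

Calling `main()` starts the run. Leaving out `limit` evaluates the selected subsets in full.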