Maritime-OCR-Bench#

Overview#

Maritime-OCR-Bench is a comprehensive evaluation benchmark for assessing multimodal large model capabilities on OCR-related tasks. The current released set contains 1,888 manually curated samples across five task types.

Task Types#

VQA: Visual question answering on document/scene images
IE: Information extraction requiring strict JSON output
parsing: Text recognition and parsing from images
json1: Text spotting with JSON v1 structured output
json2: Text spotting with JSON v2 structured output

Evaluation Metrics#

Each task type uses a specialized scoring method:

VQA/parsing: Multi-dimensional text similarity (edit distance, char F1, LCS F1, table-aware similarity)
IE: Text coverage + JSON strictness (0.5 * coverage + 0.5 * json_strict)
json1/json2: DIoU layout score + text score (0.7 * diou + 0.3 * text)

Properties#

Property	Value
Benchmark Name	`maritime_ocr_bench`
Dataset ID	HiDolphin/MaritimeOCRBench
Paper	N/A
Tags	`MultiModal`, `QA`
Metrics	`score`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics#

Metric	Value
Total Samples	1,888
Prompt Length (Mean)	102.91 chars
Prompt Length (Min/Max)	23 / 288 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean	Prompt Min	Prompt Max
`IE`	471	23	23	23
`VQA`	471	39.65	28	141
`parsing`	472	80	80	80
`json1`	237	256.1	248	288
`json2`	237	279.9	248	288

Image Statistics:

Metric	Value
Total Images	1,888
Images per Sample	min: 1, max: 1, mean: 1
Resolution Range	108x50 - 4030x4075
Formats	jpeg, png

Sample Example#

Subset: IE

{
  "input": [
    {
      "id": "1d6ce119",
      "content": [
        {
          "text": "请提取所有关键信息，并以 JSON 格式返回。"
        },
        {
          "image": "[BASE64_IMAGE: png, ~550.9KB]"
        }
      ]
    }
  ],
  "target": "{\n  \"Document ID\": \"WiCE Error Message Description\",\n  \"Revision\": \"REV 1\",\n  \"Date\": \"2024-09-12\",\n  \"Company_Logo_Text\":\"WIN GD\",\n  \"Error Messages\": [\n    {\n      \"ID Number\": \"COFD-52\",\n      \"Designation\": \"Fuel Pump Control Signal #2 Fa ... [TRUNCATED 1668 chars] ... nt must be within 4 ~ 20mA.\\n• If necessary, the sensor can be replaced with new one (Caution: before dismantling, do depressurize rail)\"\n    }\n  ],\n  \"Footer\": \"T_PC-Drawing_Portrait | Release: 3.10 (2024-05-15)\",\n  \"Page\": \"Page 20 of 83\"\n}",
  "id": 0,
  "group_id": 0,
  "subset_key": "IE",
  "metadata": {
    "task_type": "IE",
    "prompt": "<image>请提取所有关键信息，并以 JSON 格式返回。",
    "images": [
      "images/580968569085497344_f33f02a81a.png"
    ]
  }
}

Prompt Template#

Prompt Template:

{question}

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets maritime_ocr_bench \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['maritime_ocr_bench'],
    dataset_args={
        'maritime_ocr_bench': {
            # subset_list: ['IE', 'VQA', 'parsing']  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)