OmniDocBench

Overview

OmniDocBench is an evaluation dataset for diverse document parsing in real-world scenarios, covering 9 document types, 4 layout types, and 3 language types with 1,355 PDF pages.

Task Description

  • Task Type: Document Parsing and Understanding

  • Input: PDF page image

  • Output: Parsed document structure in Markdown format

  • Domain: Document understanding, OCR, layout analysis

Key Features

  • 1,355 PDF pages across 9 document types

  • Rich annotations: 15 block-level and 4 span-level element types

  • Over 20k block-level and 80k span-level annotations

  • Reading order annotations

  • Coverage includes academic papers, financial reports, newspapers, textbooks, and handwritten notes

Evaluation Notes

  • Implements the end2end and quick_match methods from the official OmniDocBench-v1.5

  • Metrics: Edit_dist, BLEU, METEOR (text), TEDS (tables)

  • Requires: apted, distance, lxml, Polygon3, zss, rapidfuzz packages

  • Output format: Markdown with LaTeX formulas and HTML tables
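
As a rough illustration of the text metric, a normalized edit distance can be computed as the Levenshtein distance divided by the length of the longer string. The sketch below is a plain-Python assumption about what Edit_dist measures, not the official OmniDocBench implementation; the function names `levenshtein` and `normalized_edit_distance` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(pred: str, gt: str) -> float:
    """Return a value in [0, 1]; 0 means the strings are identical."""
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 0.0
    return levenshtein(pred, gt) / longest


print(normalized_edit_distance("kitten", "sitting"))
```

Lower is better for this kind of score, which is worth keeping in mind when comparing it with similarity metrics such as BLEU or TEDS, where higher is better.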

Properties

| Property | Value |
| --- | --- |
| Benchmark Name | omni_doc_bench |
| Dataset ID | evalscope/OmniDocBench_tsv |
| Paper | N/A |
| Tags | Knowledge, MultiModal, QA |
| Metrics | text_block, display_formula, table, reading_order |
| Default Shots | 0-shot |
| Evaluation Split | train |

Data Statistics

| Metric | Value |
| --- | --- |
| Total Samples | 981 |
| Prompt Length (Mean) | 1408 chars |
| Prompt Length (Min/Max) | 1408 / 1408 chars |

Image Statistics:

| Metric | Value |
| --- | --- |
| Total Images | 981 |
| Images per Sample | min: 1, max: 1, mean: 1 |
| Resolution Range | 516x729 - 10142x14342 |
| Formats | jpeg |

Sample Example

Subset: default

{
  "input": [
    {
      "id": "fa523475",
      "content": [
        {
          "text": " You are an AI assistant specialized in converting PDF images to Markdown format. Please follow these instructions for the conversion:\n\n    1. Text Processing:\n    - Accurately recognize all text content in the PDF image without guessing or i ... [TRUNCATED] ... sible.\n\n    Please strictly follow these guidelines to ensure accuracy and consistency in the conversion. Your task is to accurately convert the content of the PDF image into Markdown format without adding any extra explanations or comments.\n"
        },
        {
          "image": "[BASE64_IMAGE: png, ~321.8KB]"
        }
      ]
    }
  ],
  "target": "",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "layout_dets": [
      {
        "category_type": "title",
        "poly": [
          102.5999912116609,
          120.87255879760278,
          719.3118659856144,
          120.87255879760278,
          719.3118659856144,
          194.14083813380114,
          102.5999912116609,
          194.14083813380114
        ],
        "ignore": false,
        "order": 1,
        "anno_id": 6,
        "text": "国资背景基金情况",
        "line_with_spans": [
          {
            "category_type": "text_span",
            "poly": [
              109.3333333333331,
              121.73651418039208,
              722.1022134807848,
              121.73651418039208,
              722.1022134807848,
              195.75809149176507,
              109.3333333333331,
              195.75809149176507
            ],
            "text": "国资背景基金情况"
          }
        ],
        "attribute": {
          "text_language": "text_simplified_chinese",
          "text_background": "white",
          "text_rotate": "normal"
        }
      },
      {
        "category_type": "text_block",
        "poly": [
          97.71487020898245,
          226.92028692633914,
          1271.9932332148471,
          226.92028692633914,
          1271.9932332148471,
          264.88925750697814,
          97.71487020898245,
          264.88925750697814
        ],
        "ignore": false,
        "order": 2,
        "anno_id": 4,
        "text": "2022年备案基金规模小幅回升,但仍未恢复至资管新规出台前的水平",
        "line_with_spans": [
          {
            "category_type": "text_span",
            "poly": [
              99.66504579139392,
              227.6650457913944,
              1269.333333333333,
              227.6650457913944,
              1269.333333333333,
              271.3365750838786,
              99.66504579139392,
              271.3365750838786
            ],
            "text": "2022年备案基金规模小幅回升,但仍未恢复至资管新规出台前的水平"
          }
        ],
        "attribute": {
          "text_language": "text_simplified_chinese",
          "text_background": "white",
          "text_rotate": "normal"
        }
      },
      {
        "category_type": "figure_caption",
        "poly": [
          246.96994018554688,
          318.7444152832031,
          1088.26025390625,
          318.7444152832031,
          1088.26025390625,
          369.0964660644531,
          246.96994018554688,
          369.0964660644531
        ],
        "ignore": false,
        "order": 3,
        "anno_id": 3,
        "text": "2014年-2023Q3国资背景基金的备案数量及规模",
        "line_with_spans": [
          {
            "category_type": "text_span",
            "poly": [
              253.94664201855937,
              321.21295194692755,
              1076.1203813864063,
              321.21295194692755,
              1076.1203813864063,
              364.93470762745034,
              253.94664201855937,
              364.93470762745034
            ],
            "text": "2014年-2023Q3国资背景基金的备案数量及规模"
          }
        ],
        "attribute": {
          "text_language": "text_simplified_chinese",
          "text_background": "white",
          "text_rotate": "normal"
        }
      },
      {
        "category_type": "figure",
        "poly": [
          118.08102792118407,
          379.29373168945347,
          1299.4279383691976,
          379.29373168945347,
          1299.4279383691976,
          1028.2773128579047,
          118.08102792118407,
          1028.2773128579047
        ],
        "ignore": false,
        "order": 4,
        "anno_id": 2
      },
      {
        "category_type": "figure_caption",
        "poly": [
          1497.726318359375,
          318.7418518066406,
          2301.80224609375,
          318.7418518066406,
          2301.80224609375,
          367.1272888183594,
          1497.726318359375,
          367.1272888183594
        ],
        "ignore": false,
        "order": 5,
        "anno_id": 8,
        "text": "2014年-2023Q3国资背景基金数量TOP10地区",
        "line_with_spans": [
          {
            "category_type": "text_span",
            "poly": [
              1509.6758069519938,
              324.34247361866034,
              2292.4771492866826,
              324.34247361866034,
              2292.4771492866826,
              364.8196229053426,
              1509.6758069519938,
              364.8196229053426
            ],
            "text": "2014年-2023Q3国资背景基金数量TOP10地区"
          }
        ],
        "attribute": {
          "text_language": "text_simplified_chinese",
          "text_background": "white",
          "text_rotate": "normal"
        }
      },
      {
        "category_type": "figure",
        "poly": [
          1370.0374839590943,
          424.35013794251097,
          2552.3561471143494,
          424.35013794251097,
          2552.3561471143494,
          1026.8955618700252,
          1370.0374839590943,
          1026.8955618700252
        ],
        "ignore": false,
        "order": 6,
        "anno_id": 5
      },
      {
        "category_type": "title",
        "poly": [
          170.92340081387997,
          1069.7956822171332,
          326.21460986860313,
          1069.7956822171332,
          326.21460986860313,
          1111.7494049722532,
          170.92340081387997,
          1111.7494049722532
        ],
        "ignore": false,
        "order": 7,
        "anno_id": 9,
        "text": "核心发现",
        "line_with_spans": [
          {
            "category_type": "text_span",
            "poly": [
              169.67751098302242,
              1071.225836994341,
              328.08580770628134,
              1071.225836994341,
              328.08580770628134,
              1111.655822350311,
              169.67751098302242,
              1111.655822350311
            ],
            "text": "核心发现"
          }
        ],
        "attribute": {
          "text_language": "text_simplified_chinese",
          "text_background": "white",
          "text_rotate": "normal"
        }
      },
      {
        "category_type": "text_block",
        "poly": [
          172.66793877059249,
          1155.2640660519091,
          2514.2408071863138,
          1155.2640660519091,
          2514.2408071863138,
          1241.6284871157177,
          172.66793877059249,
          1241.6284871157177
        ],
        "ignore": false,
        "order": 8,
        "anno_id": 7,
        "text": "- 2018年4月资管新规出台后,国资背景基金备案数量增速放缓且规模骤减,受新冠疫情影响,2021年新增基金规模再次下降,虽然 2022年基金规模回升至1.25万亿元,但仍未恢复至资管新规出台前的水平,2023前三季度新增规模略低于2022年同期。",
        "line_with_spans": [
          {
            "category_type": "text_span",
            "poly": [
              165.603649650326,
              1150.009124125815,
              2509.333333333333,
              1150.009124125815,
              2509.333333333333,
              1198.666666666666,
              165.603649650326,
              1198.666666666666
            ],
            "text": "- 2018年4月资管新规出台后,国资背景基金备案数量增速放缓且规模骤减,受新冠疫情影响,2021年新增基金规模再次下降,虽然"
          },
          {
            "category_type": "text_span",
            "poly": [
              219.22996126565647,
              1201.1457902508969,
              2250.770752144285,
              1201.1457902508969,
              2250.770752144285,
              1243.9433217869077,
              219.22996126565647,
              1243.9433217869077
            ],
            "text": "2022年基金规模回升至1.25万亿元,但仍未恢复至资管新规出台前的水平,2023前三季度新增规模略低于2022年同期。"
          }
        ],
        "attribute": {
          "text_language": "text_simplified_chinese",
          "text_background": "white",
          "text_rotate": "normal"
        }
      },
      {
        "category_type": "text_block",
        "poly": [
          171.69999831539863,
          1278.820932742719,
          2512.084408886781,
          1278.820932742719,
          2512.084408886781,
          1365.690053585406,
          171.69999831539863,
          1365.690053585406
        ],
        "ignore": false,
        "order": 9,
        "anno_id": 1,
        "text": "- 截至2023Q3全国国资背景基金备案数量累计9196只,基金规模累计8.91万亿元。基金注册区域集中于广东省、浙江省和江苏省,广东省国资背景基金总规模遥遥领先。备案基金数量前10的省份基金数量占全国总量的 73% ,规模占全国总量的 68%。",
        "line_with_spans": [
          {
            "category_type": "text_span",
            "poly": [
              161.7899369148969,
              1278.308761376868,
              2508,
              1278.308761376868,
              2508,
              1317.333333333333,
              161.7899369148969,
              1317.333333333333
            ],
            "text": "- 截至2023Q3全国国资背景基金备案数量累计9196只,基金规模累计8.91万亿元。基金注册区域集中于广东省、浙江省和江苏省,广东"
          },
          {
            "category_type": "text_span",
            "poly": [
              222.66666666666688,
              1325.3333333333335,
              1623.8331583485456,
              1325.3333333333335,
              1623.8331583485456,
              1365.333333333333,
              222.66666666666688,
              1365.333333333333
            ],
            "text": "省国资背景基金总规模遥遥领先。备案基金数量前10的省份基金数量占全国总量的"
          },
          {
            "category_type": "equation_ignore",
            "poly": [
              1624.4165959289367,
              1327.0154193159506,
              1703.7259660435407,
              1327.0154193159506,
              1703.7259660435407,
              1363.1237504250385,
              1624.4165959289367,
              1363.1237504250385
            ],
            "text": "73%"
          },
          {
            "category_type": "text_span",
            "poly": [
              1704.6905743174548,
              1322.6134268787764,
              2053.985160092844,
              1322.6134268787764,
              2053.985160092844,
              1370.6736155849724,
              1704.6905743174548,
              1370.6736155849724
            ],
            "text": ",规模占全国总量的"
          },
          {
            "category_type": "equation_ignore",
            "poly": [
              2055.1374027302004,
              1326.3706276890023,
              2149.276980264608,
              1326.3706276890023,
              2149.276980264608,
              1365.7029169328305,
              2055.1374027302004,
              1365.7029169328305
            ],
            "text": "68%。"
          }
        ],
        "attribute": {
          "text_language": "text_simplified_chinese",
          "text_background": "white",
          "text_rotate": "normal"
        }
      },
      {
        "category_type": "abandon",
        "poly": [
          114.12910090860571,
          1403.1676953230935,
          175.21358196554792,
          1403.1676953230935,
          175.21358196554792,
          1462.6586681785502,
          114.12910090860571,
          1462.6586681785502
        ],
        "ignore": false,
        "order": null,
        "anno_id": 10
      },
      {
        "category_type": "footer",
        "poly": [
          180.18207532211585,
          1404.2778174322868,
          289.9793827860912,
          1404.2778174322868,
          289.9793827860912,
          1462.652231000048,
          180.18207532211585,
          1462.652231000048
        ],
        "ignore": false,
        "order": null,
        "anno_id": 0,
        "text": "CVINFO 投中信息",
        "line_with_spans": [
          {
            "category_type": "text_span",
            "poly": [
              178.18192276049803,
              1409.8767302579377,
              288.0868232114207,
              1409.8767302579377,
              288.0868232114207,
              1467.2607048296584,
              178.18192276049803,
              1467.2607048296584
            ],
            "text": "CVINFO 投中信息"
          }
        ],
        "attribute": {
          "text_language": "text_en_ch_mixed",
          "text_background": "white",
          "text_rotate": "normal"
        }
      }
    ],
    "extra": {
      "relation": [
        {
          "source_anno_id": 2,
          "target_anno_id": 3,
          "relation_type": "parent_son"
        },
        {
          "source_anno_id": 5,
          "target_anno_id": 8,
          "relation_type": "parent_son"
        }
      ]
    },
    "page_info": {
      "page_attribute": {
        "data_source": "PPT2PDF",
        "language": "simplified_chinese",
        "layout": "1andmore_column",
        "special_issue": [
          "watermark"
        ]
      },
      "page_no": 11,
      "height": 1500,
      "width": 2667,
      "image_path": "eastmoney_59cde7e939acc3124df9d3f2c85b5a0ec41b9da1157d5be38e098672022b47cb.pdf_11.jpg"
    }
  }
}

Note: Some content was truncated for display.
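
The annotation layout in the sample can be traversed with a few lines of Python. The sketch below assumes a `metadata` dictionary shaped like the example above, reduced here to a handful of blocks with rounded coordinates and shortened text. The helper names `poly_to_bbox` and `reading_order_text` are hypothetical and not part of evalscope.

```python
from typing import Any


def poly_to_bbox(poly: list[float]) -> tuple[float, float, float, float]:
    """Collapse an 8-value polygon (x1, y1, ..., x4, y4) into (x_min, y_min, x_max, y_max)."""
    xs, ys = poly[0::2], poly[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def reading_order_text(metadata: dict[str, Any]) -> list[str]:
    """Return block texts sorted by `order`, skipping blocks without an order or text."""
    blocks = [
        det for det in metadata["layout_dets"]
        if det.get("order") is not None and det.get("text") and not det.get("ignore", False)
    ]
    return [det["text"] for det in sorted(blocks, key=lambda det: det["order"])]


# Trimmed-down stand-in for the sample's metadata (illustrative values only).
metadata = {
    "layout_dets": [
        {"category_type": "text_block", "poly": [97.7, 226.9, 1272.0, 226.9, 1272.0, 264.9, 97.7, 264.9],
         "ignore": False, "order": 2, "text": "2022年备案基金规模小幅回升"},
        {"category_type": "title", "poly": [102.6, 120.9, 719.3, 120.9, 719.3, 194.1, 102.6, 194.1],
         "ignore": False, "order": 1, "text": "国资背景基金情况"},
        {"category_type": "figure", "poly": [118.1, 379.3, 1299.4, 379.3, 1299.4, 1028.3, 118.1, 1028.3],
         "ignore": False, "order": 4},
        {"category_type": "footer", "poly": [180.2, 1404.3, 290.0, 1404.3, 290.0, 1462.7, 180.2, 1462.7],
         "ignore": False, "order": None, "text": "CVINFO 投中信息"},
    ]
}

print(reading_order_text(metadata))
print(poly_to_bbox(metadata["layout_dets"][1]["poly"]))
```

Blocks whose `order` is null (the footer and the `abandon` block in the sample) drop out of the reading-order sequence, and figures contribute no text because they carry no `text` field.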

Prompt Template


 You are an AI assistant specialized in converting PDF images to Markdown format. Please follow these instructions for the conversion:

    1. Text Processing:
    - Accurately recognize all text content in the PDF image without guessing or inferring.
    - Convert the recognized text into Markdown format.
    - Maintain the original document structure, including headings, paragraphs, lists, etc.

    2. Mathematical Formula Processing:
    - Convert all mathematical formulas to LaTeX format.
    - Enclose inline formulas with \( \). For example: This is an inline formula \( E = mc^2 \)
    - Enclose block formulas with \\[ \\]. For example: \[ \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]

    3. Table Processing:
    - Convert tables to HTML format.
    - Wrap the entire table with <table> and </table>.

    4. Figure Handling:
    - Ignore figures content in the PDF image. Do not attempt to describe or convert images.

    5. Output Format:
    - Ensure the output Markdown document has a clear structure with appropriate line breaks between elements.
    - For complex layouts, try to maintain the original document's structure and format as closely as possible.

    Please strictly follow these guidelines to ensure accuracy and consistency in the conversion. Your task is to accurately convert the content of the PDF image into Markdown format without adding any extra explanations or comments.
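
Because the prompt asks for HTML tables and `\[ \]`-delimited block formulas inside Markdown, a scorer first has to separate those elements from plain text. The following is a minimal regex-based sketch of that separation, written as an assumption about the general idea rather than the official OmniDocBench extraction logic. `split_markdown_elements` is a hypothetical helper, and its output keys simply mirror the metric names listed under Properties.

```python
import re

# Non-greedy matches so that several tables or formulas on one page stay separate.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)
DISPLAY_FORMULA_RE = re.compile(r"\\\[(.*?)\\\]", re.DOTALL)


def split_markdown_elements(markdown: str) -> dict[str, list[str]]:
    """Separate HTML tables, display formulas, and the remaining text blocks."""
    tables = TABLE_RE.findall(markdown)
    without_tables = TABLE_RE.sub("\n\n", markdown)
    formulas = [body.strip() for body in DISPLAY_FORMULA_RE.findall(without_tables)]
    remaining = DISPLAY_FORMULA_RE.sub("\n\n", without_tables)
    # Blank lines delimit text blocks in the remaining Markdown.
    text_blocks = [block.strip() for block in re.split(r"\n\s*\n", remaining) if block.strip()]
    return {"table": tables, "display_formula": formulas, "text_block": text_blocks}


sample = "# Title\n\nSome text.\n\n\\[ E = mc^2 \\]\n\n<table><tr><td>1</td></tr></table>\n"
parts = split_markdown_elements(sample)
print(parts)
```

Tables are removed before formulas are searched so that any LaTeX inside a table cell stays with its table instead of being counted as a separate display formula.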

Extra Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| match_method | str | quick_match | Scoring match method used for evaluation. Choices: ['quick_match', 'simple_match', 'no_split'] |

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets omni_doc_bench \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['omni_doc_bench'],
    dataset_args={
        'omni_doc_bench': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
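
To score with a non-default match method, the commented-out `extra_params` entry in the configuration above can be filled in. The helper below is a hypothetical convenience wrapper (not part of evalscope) that validates the choice against the values listed under Extra Parameters and builds the `dataset_args` dictionary. Passing the result as `TaskConfig(dataset_args=...)` is assumed to work the same way as in the Python example.

```python
ALLOWED_MATCH_METHODS = ("quick_match", "simple_match", "no_split")


def build_dataset_args(match_method: str = "quick_match") -> dict:
    """Build dataset_args for omni_doc_bench with a validated match_method."""
    if match_method not in ALLOWED_MATCH_METHODS:
        raise ValueError(
            f"match_method must be one of {ALLOWED_MATCH_METHODS}, got {match_method!r}"
        )
    return {"omni_doc_bench": {"extra_params": {"match_method": match_method}}}


dataset_args = build_dataset_args("simple_match")
print(dataset_args)
```

Validating the value up front surfaces a typo before any model calls are made, rather than after the run has already started.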