OmniDocBench#
Overview#
OmniDocBench is an evaluation dataset for diverse document parsing in real-world scenarios, covering 9 document types, 4 layout types, and 3 language types with 1,355 PDF pages.
Task Description#
Task Type: Document Parsing and Understanding
Input: PDF page image
Output: Parsed document structure in Markdown format
Domain: Document understanding, OCR, layout analysis
Key Features#
1,355 PDF pages across 9 document types
Rich annotations: 15 block-level and 4 span-level element types
Over 20k block-level and 80k span-level annotations
Reading order annotations
Coverage includes academic papers, financial reports, newspapers, textbooks, and handwritten notes
Evaluation Notes#
Implements the end2end and quick_match methods from the official OmniDocBench-v1.5
Metrics: Edit_dist, BLEU, METEOR (text), TEDS (tables)
Requires: apted, distance, lxml, Polygon3, zss, rapidfuzz packages
Output format: Markdown with LaTeX formulas and HTML tables
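To make the text metric concrete, the following is a minimal sketch of a normalized edit distance in pure Python. It is an illustration only, not the official scorer: the function names are hypothetical, and dividing by the longer string's length is an assumed normalization that may differ from what OmniDocBench-v1.5 applies after its own preprocessing and matching steps.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance divided by the longer length; 0.0 means identical strings."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are treated as a perfect match
    return levenshtein(prediction, reference) / longest
```

Lower values are better under this definition; a score of 1.0 means nothing in the prediction lines up with the reference.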
Properties#
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 981 |
| Prompt Length (Mean) | 1408 chars |
| Prompt Length (Min/Max) | 1408 / 1408 chars |
Image Statistics:
| Metric | Value |
|---|---|
| Total Images | 981 |
| Images per Sample | min: 1, max: 1, mean: 1 |
| Resolution Range | 516x729 - 10142x14342 |
| Formats | jpeg |
Sample Example#
Subset: default
{
"input": [
{
"id": "fa523475",
"content": [
{
"text": " You are an AI assistant specialized in converting PDF images to Markdown format. Please follow these instructions for the conversion:\n\n 1. Text Processing:\n - Accurately recognize all text content in the PDF image without guessing or i ... [TRUNCATED] ... sible.\n\n Please strictly follow these guidelines to ensure accuracy and consistency in the conversion. Your task is to accurately convert the content of the PDF image into Markdown format without adding any extra explanations or comments.\n"
},
{
"image": "[BASE64_IMAGE: png, ~321.8KB]"
}
]
}
],
"target": "",
"id": 0,
"group_id": 0,
"metadata": {
"layout_dets": [
{
"category_type": "title",
"poly": [
102.5999912116609,
120.87255879760278,
719.3118659856144,
120.87255879760278,
719.3118659856144,
194.14083813380114,
102.5999912116609,
194.14083813380114
],
"ignore": false,
"order": 1,
"anno_id": 6,
"text": "国资背景基金情况",
"line_with_spans": [
{
"category_type": "text_span",
"poly": [
109.3333333333331,
121.73651418039208,
722.1022134807848,
121.73651418039208,
722.1022134807848,
195.75809149176507,
109.3333333333331,
195.75809149176507
],
"text": "国资背景基金情况"
}
],
"attribute": {
"text_language": "text_simplified_chinese",
"text_background": "white",
"text_rotate": "normal"
}
},
{
"category_type": "text_block",
"poly": [
97.71487020898245,
226.92028692633914,
1271.9932332148471,
226.92028692633914,
1271.9932332148471,
264.88925750697814,
97.71487020898245,
264.88925750697814
],
"ignore": false,
"order": 2,
"anno_id": 4,
"text": "2022年备案基金规模小幅回升,但仍未恢复至资管新规出台前的水平",
"line_with_spans": [
{
"category_type": "text_span",
"poly": [
99.66504579139392,
227.6650457913944,
1269.333333333333,
227.6650457913944,
1269.333333333333,
271.3365750838786,
99.66504579139392,
271.3365750838786
],
"text": "2022年备案基金规模小幅回升,但仍未恢复至资管新规出台前的水平"
}
],
"attribute": {
"text_language": "text_simplified_chinese",
"text_background": "white",
"text_rotate": "normal"
}
},
{
"category_type": "figure_caption",
"poly": [
246.96994018554688,
318.7444152832031,
1088.26025390625,
318.7444152832031,
1088.26025390625,
369.0964660644531,
246.96994018554688,
369.0964660644531
],
"ignore": false,
"order": 3,
"anno_id": 3,
"text": "2014年-2023Q3国资背景基金的备案数量及规模",
"line_with_spans": [
{
"category_type": "text_span",
"poly": [
253.94664201855937,
321.21295194692755,
1076.1203813864063,
321.21295194692755,
1076.1203813864063,
364.93470762745034,
253.94664201855937,
364.93470762745034
],
"text": "2014年-2023Q3国资背景基金的备案数量及规模"
}
],
"attribute": {
"text_language": "text_simplified_chinese",
"text_background": "white",
"text_rotate": "normal"
}
},
{
"category_type": "figure",
"poly": [
118.08102792118407,
379.29373168945347,
1299.4279383691976,
379.29373168945347,
1299.4279383691976,
1028.2773128579047,
118.08102792118407,
1028.2773128579047
],
"ignore": false,
"order": 4,
"anno_id": 2
},
{
"category_type": "figure_caption",
"poly": [
1497.726318359375,
318.7418518066406,
2301.80224609375,
318.7418518066406,
2301.80224609375,
367.1272888183594,
1497.726318359375,
367.1272888183594
],
"ignore": false,
"order": 5,
"anno_id": 8,
"text": "2014年-2023Q3国资背景基金数量TOP10地区",
"line_with_spans": [
{
"category_type": "text_span",
"poly": [
1509.6758069519938,
324.34247361866034,
2292.4771492866826,
324.34247361866034,
2292.4771492866826,
364.8196229053426,
1509.6758069519938,
364.8196229053426
],
"text": "2014年-2023Q3国资背景基金数量TOP10地区"
}
],
"attribute": {
"text_language": "text_simplified_chinese",
"text_background": "white",
"text_rotate": "normal"
}
},
{
"category_type": "figure",
"poly": [
1370.0374839590943,
424.35013794251097,
2552.3561471143494,
424.35013794251097,
2552.3561471143494,
1026.8955618700252,
1370.0374839590943,
1026.8955618700252
],
"ignore": false,
"order": 6,
"anno_id": 5
},
{
"category_type": "title",
"poly": [
170.92340081387997,
1069.7956822171332,
326.21460986860313,
1069.7956822171332,
326.21460986860313,
1111.7494049722532,
170.92340081387997,
1111.7494049722532
],
"ignore": false,
"order": 7,
"anno_id": 9,
"text": "核心发现",
"line_with_spans": [
{
"category_type": "text_span",
"poly": [
169.67751098302242,
1071.225836994341,
328.08580770628134,
1071.225836994341,
328.08580770628134,
1111.655822350311,
169.67751098302242,
1111.655822350311
],
"text": "核心发现"
}
],
"attribute": {
"text_language": "text_simplified_chinese",
"text_background": "white",
"text_rotate": "normal"
}
},
{
"category_type": "text_block",
"poly": [
172.66793877059249,
1155.2640660519091,
2514.2408071863138,
1155.2640660519091,
2514.2408071863138,
1241.6284871157177,
172.66793877059249,
1241.6284871157177
],
"ignore": false,
"order": 8,
"anno_id": 7,
"text": "- 2018年4月资管新规出台后,国资背景基金备案数量增速放缓且规模骤减,受新冠疫情影响,2021年新增基金规模再次下降,虽然 2022年基金规模回升至1.25万亿元,但仍未恢复至资管新规出台前的水平,2023前三季度新增规模略低于2022年同期。",
"line_with_spans": [
{
"category_type": "text_span",
"poly": [
165.603649650326,
1150.009124125815,
2509.333333333333,
1150.009124125815,
2509.333333333333,
1198.666666666666,
165.603649650326,
1198.666666666666
],
"text": "- 2018年4月资管新规出台后,国资背景基金备案数量增速放缓且规模骤减,受新冠疫情影响,2021年新增基金规模再次下降,虽然"
},
{
"category_type": "text_span",
"poly": [
219.22996126565647,
1201.1457902508969,
2250.770752144285,
1201.1457902508969,
2250.770752144285,
1243.9433217869077,
219.22996126565647,
1243.9433217869077
],
"text": "2022年基金规模回升至1.25万亿元,但仍未恢复至资管新规出台前的水平,2023前三季度新增规模略低于2022年同期。"
}
],
"attribute": {
"text_language": "text_simplified_chinese",
"text_background": "white",
"text_rotate": "normal"
}
},
{
"category_type": "text_block",
"poly": [
171.69999831539863,
1278.820932742719,
2512.084408886781,
1278.820932742719,
2512.084408886781,
1365.690053585406,
171.69999831539863,
1365.690053585406
],
"ignore": false,
"order": 9,
"anno_id": 1,
"text": "- 截至2023Q3全国国资背景基金备案数量累计9196只,基金规模累计8.91万亿元。基金注册区域集中于广东省、浙江省和江苏省,广东省国资背景基金总规模遥遥领先。备案基金数量前10的省份基金数量占全国总量的 73% ,规模占全国总量的 68%。",
"line_with_spans": [
{
"category_type": "text_span",
"poly": [
161.7899369148969,
1278.308761376868,
2508,
1278.308761376868,
2508,
1317.333333333333,
161.7899369148969,
1317.333333333333
],
"text": "- 截至2023Q3全国国资背景基金备案数量累计9196只,基金规模累计8.91万亿元。基金注册区域集中于广东省、浙江省和江苏省,广东"
},
{
"category_type": "text_span",
"poly": [
222.66666666666688,
1325.3333333333335,
1623.8331583485456,
1325.3333333333335,
1623.8331583485456,
1365.333333333333,
222.66666666666688,
1365.333333333333
],
"text": "省国资背景基金总规模遥遥领先。备案基金数量前10的省份基金数量占全国总量的"
},
{
"category_type": "equation_ignore",
"poly": [
1624.4165959289367,
1327.0154193159506,
1703.7259660435407,
1327.0154193159506,
1703.7259660435407,
1363.1237504250385,
1624.4165959289367,
1363.1237504250385
],
"text": "73%"
},
{
"category_type": "text_span",
"poly": [
1704.6905743174548,
1322.6134268787764,
2053.985160092844,
1322.6134268787764,
2053.985160092844,
1370.6736155849724,
1704.6905743174548,
1370.6736155849724
],
"text": ",规模占全国总量的"
},
{
"category_type": "equation_ignore",
"poly": [
2055.1374027302004,
1326.3706276890023,
2149.276980264608,
1326.3706276890023,
2149.276980264608,
1365.7029169328305,
2055.1374027302004,
1365.7029169328305
],
"text": "68%。"
}
],
"attribute": {
"text_language": "text_simplified_chinese",
"text_background": "white",
"text_rotate": "normal"
}
},
{
"category_type": "abandon",
"poly": [
114.12910090860571,
1403.1676953230935,
175.21358196554792,
1403.1676953230935,
175.21358196554792,
1462.6586681785502,
114.12910090860571,
1462.6586681785502
],
"ignore": false,
"order": null,
"anno_id": 10
},
{
"category_type": "footer",
"poly": [
180.18207532211585,
1404.2778174322868,
289.9793827860912,
1404.2778174322868,
289.9793827860912,
1462.652231000048,
180.18207532211585,
1462.652231000048
],
"ignore": false,
"order": null,
"anno_id": 0,
"text": "CVINFO 投中信息",
"line_with_spans": [
{
"category_type": "text_span",
"poly": [
178.18192276049803,
1409.8767302579377,
288.0868232114207,
1409.8767302579377,
288.0868232114207,
1467.2607048296584,
178.18192276049803,
1467.2607048296584
],
"text": "CVINFO 投中信息"
}
],
"attribute": {
"text_language": "text_en_ch_mixed",
"text_background": "white",
"text_rotate": "normal"
}
}
],
"extra": {
"relation": [
{
"source_anno_id": 2,
"target_anno_id": 3,
"relation_type": "parent_son"
},
{
"source_anno_id": 5,
"target_anno_id": 8,
"relation_type": "parent_son"
}
]
},
"page_info": {
"page_attribute": {
"data_source": "PPT2PDF",
"language": "simplified_chinese",
"layout": "1andmore_column",
"special_issue": [
"watermark"
]
},
"page_no": 11,
"height": 1500,
"width": 2667,
"image_path": "eastmoney_59cde7e939acc3124df9d3f2c85b5a0ec41b9da1157d5be38e098672022b47cb.pdf_11.jpg"
}
}
}
Note: Some content was truncated for display.
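The annotation structure in the sample can be read with a few small helpers. The sketch below assumes only the fields visible in the sample (`layout_dets`, `poly`, `order`, `ignore`, `text`, `extra.relation`); the helper names are hypothetical and are not part of evalscope or the official OmniDocBench tooling. It also assumes that each `poly` lists four corner points as eight values, as the sample suggests.

```python
from typing import Any, Dict, List, Tuple


def poly_to_bbox(poly: List[float]) -> Tuple[float, float, float, float]:
    """Collapse an 8-value polygon (x1, y1, ..., x4, y4) into (x_min, y_min, x_max, y_max)."""
    xs, ys = poly[0::2], poly[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def reading_order_text(metadata: Dict[str, Any]) -> List[str]:
    """Return block texts sorted by `order`, skipping ignored, unordered, or text-less blocks."""
    blocks = [
        det for det in metadata["layout_dets"]
        if not det.get("ignore", False)
        and det.get("order") is not None
        and det.get("text")
    ]
    return [det["text"] for det in sorted(blocks, key=lambda det: det["order"])]


def parent_son_pairs(metadata: Dict[str, Any]) -> List[Tuple[int, int]]:
    """List (source_anno_id, target_anno_id) pairs for `parent_son` relations."""
    relations = metadata.get("extra", {}).get("relation", [])
    return [
        (rel["source_anno_id"], rel["target_anno_id"])
        for rel in relations
        if rel.get("relation_type") == "parent_son"
    ]


# A trimmed-down stand-in for the sample above (coordinates rounded for brevity).
sample_metadata = {
    "layout_dets": [
        {"category_type": "figure", "poly": [118, 379, 1299, 379, 1299, 1028, 118, 1028],
         "ignore": False, "order": 4, "anno_id": 2},
        {"category_type": "title", "poly": [102, 120, 719, 120, 719, 194, 102, 194],
         "ignore": False, "order": 1, "anno_id": 6, "text": "国资背景基金情况"},
        {"category_type": "footer", "poly": [180, 1404, 289, 1404, 289, 1462, 180, 1462],
         "ignore": False, "order": None, "anno_id": 0, "text": "CVINFO 投中信息"},
    ],
    "extra": {"relation": [
        {"source_anno_id": 2, "target_anno_id": 3, "relation_type": "parent_son"},
    ]},
}

print(reading_order_text(sample_metadata))
print(parent_son_pairs(sample_metadata))
print(poly_to_bbox(sample_metadata["layout_dets"][1]["poly"]))
```

Blocks such as the footer carry `"order": null` in the sample, so this sketch leaves them out of the reading order; in the sample, the `parent_son` relations link each figure (`anno_id` 2 and 5) to its caption (`anno_id` 3 and 8).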
Prompt Template#
Prompt Template:
You are an AI assistant specialized in converting PDF images to Markdown format. Please follow these instructions for the conversion:
1. Text Processing:
- Accurately recognize all text content in the PDF image without guessing or inferring.
- Convert the recognized text into Markdown format.
- Maintain the original document structure, including headings, paragraphs, lists, etc.
2. Mathematical Formula Processing:
- Convert all mathematical formulas to LaTeX format.
- Enclose inline formulas with \( \). For example: This is an inline formula \( E = mc^2 \)
- Enclose block formulas with \\[ \\]. For example: \[ \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]
3. Table Processing:
- Convert tables to HTML format.
- Wrap the entire table with <table> and </table>.
4. Figure Handling:
- Ignore figures content in the PDF image. Do not attempt to describe or convert images.
5. Output Format:
- Ensure the output Markdown document has a clear structure with appropriate line breaks between elements.
- For complex layouts, try to maintain the original document's structure and format as closely as possible.
Please strictly follow these guidelines to ensure accuracy and consistency in the conversion. Your task is to accurately convert the content of the PDF image into Markdown format without adding any extra explanations or comments.
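Because the prompt asks for HTML tables and for formulas wrapped in `\( \)` or `\[ \]`, a model response can be split into element types with simple patterns. The following is a minimal sketch under that assumption; `split_markdown_output` is a hypothetical helper, and the official OmniDocBench scorer uses its own, more thorough extraction logic.

```python
import re
from typing import Dict, List

# Patterns for the output conventions requested by the prompt.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)
BLOCK_FORMULA_RE = re.compile(r"\\\[(.*?)\\\]", re.DOTALL)   # \[ ... \]
INLINE_FORMULA_RE = re.compile(r"\\\((.*?)\\\)", re.DOTALL)  # \( ... \)


def split_markdown_output(markdown: str) -> Dict[str, List[str]]:
    """Separate HTML tables and LaTeX formulas from a model's Markdown response."""
    tables = TABLE_RE.findall(markdown)
    # Drop tables first so formulas inside table cells are not counted twice.
    without_tables = TABLE_RE.sub("", markdown)
    return {
        "tables": tables,
        "block_formulas": [m.strip() for m in BLOCK_FORMULA_RE.findall(without_tables)],
        "inline_formulas": [m.strip() for m in INLINE_FORMULA_RE.findall(without_tables)],
    }


example_output = (
    "# Title\n\n"
    "Energy is \\( E = mc^2 \\).\n\n"
    "\\[ a^2 + b^2 = c^2 \\]\n\n"
    "<table><tr><td>1</td></tr></table>\n"
)
parts = split_markdown_output(example_output)
print(parts)
```

A split like this is useful for spot-checking whether a model follows the requested format before running a full evaluation, since tables and text are scored with different metrics.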
Extra Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | | Scoring match method used for evaluation. Choices: ['quick_match', 'simple_match', 'no_split'] |
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets omni_doc_bench \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['omni_doc_bench'],
    dataset_args={
        'omni_doc_bench': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)