DocVQA#

Overview#

DocVQA (Document Visual Question Answering) is a benchmark designed to evaluate AI systems’ ability to answer questions based on document images such as scanned pages, forms, invoices, and reports. It requires understanding complex document layouts, structure, and visual elements beyond simple text extraction.

Task Description#

Task Type: Document Visual Question Answering
Input: Document image + natural language question
Output: Single word or phrase answer extracted from document
Domains: Document understanding, OCR, layout comprehension

Key Features#

Covers diverse document types (forms, invoices, letters, reports)
Requires understanding document layout and structure
Tests both text extraction and contextual reasoning
Questions require locating and interpreting specific information
Combines OCR capabilities with visual understanding

Evaluation Notes#

Default evaluation uses the validation split
Primary metric: ANLS (Average Normalized Levenshtein Similarity)
Answers should be in format “ANSWER: [ANSWER]”
ANLS metric accounts for minor OCR/spelling variations
Multiple valid answers may be accepted for each question

Properties#

Property	Value
Benchmark Name	`docvqa`
Dataset ID	lmms-lab/DocVQA
Paper	N/A
Tags	`Knowledge`, `MultiModal`, `QA`
Metrics	`anls`
Default Shots	0-shot
Evaluation Split	`validation`

Data Statistics#

Metric	Value
Total Samples	5,349
Prompt Length (Mean)	254.82 chars
Prompt Length (Min/Max)	220 / 354 chars

Image Statistics:

Metric	Value
Total Images	5,000
Images per Sample	min: 1, max: 1, mean: 1
Resolution Range	593x294 - 5367x7184
Formats	png

Sample Example#

Subset: DocVQA

{
  "input": [
    {
      "id": "002390bd",
      "content": [
        {
          "text": "Answer the question according to the image using a single word or phrase.\nWhat is the ‘actual’ value per 1000, during the year 1975?\nThe last line of your response should be of the form \"ANSWER: [ANSWER]\" (without quotes) where [ANSWER] is the answer to the question."
        },
        {
          "image": "[BASE64_IMAGE: png, ~1.2MB]"
        }
      ]
    }
  ],
  "target": "[\"0.28\"]",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "questionId": "49153",
    "question_types": [
      "figure/diagram"
    ],
    "docId": 14465,
    "ucsf_document_id": "pybv0228",
    "ucsf_document_page_no": "81"
  }
}

Prompt Template#

Prompt Template:

Answer the question according to the image using a single word or phrase.
{question}
The last line of your response should be of the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer to the question.

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets docvqa \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['docvqa'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)