DocMath#

Overview#

DocMath-Eval is a comprehensive benchmark focused on numerical reasoning within specialized domains. It requires models to comprehend long and specialized documents and perform numerical reasoning to answer questions.

Task Description#

  • Task Type: Document-based Mathematical Reasoning

  • Input: Long document context + numerical reasoning question

  • Output: Numerical answer with reasoning

  • Focus: Long-context comprehension and quantitative reasoning

Key Features#

  • Long specialized documents requiring comprehension

  • Numerical reasoning within document context

  • Multiple complexity levels (comp/simp, long/short)

  • Tests real-world document understanding

  • Requires both reading comprehension and math skills

Evaluation Notes#

  • Default configuration uses 0-shot evaluation

  • Uses LLM-as-judge for answer evaluation

  • Subsets: complong_testmini, compshort_testmini, simplong_testmini, simpshort_testmini

  • Answer format: “Therefore, the answer is (answer)”

Properties#

Property

Value

Benchmark Name

docmath

Dataset ID

yale-nlp/DocMath-Eval

Paper

N/A

Tags

LongContext, Math, Reasoning

Metrics

acc

Default Shots

0-shot

Evaluation Split

test

Data Statistics#

Metric

Value

Total Samples

800

Prompt Length (Mean)

68791.03 chars

Prompt Length (Min/Max)

505 / 1009038 chars

Per-Subset Statistics:

Subset

Samples

Prompt Mean

Prompt Min

Prompt Max

complong_testmini

300

175355.17

18687

1009038

compshort_testmini

200

1990.74

505

9460

simplong_testmini

100

13972.84

6870

24001

simpshort_testmini

200

3154.2

560

9600

Sample Example#

Subset: complong_testmini

{
  "input": [
    {
      "id": "a07cbfcf",
      "content": "Please read the following text and answer the question below.\n\n<text>\nDELTA AIR LINES, INC.\nConsolidated Balance Sheets\n| (in millions, except share data) | March 31, 2018 | December 31, 2017 |\n| ASSETS |\n| Current Assets: |\n| Cash and cash e ... [TRUNCATED] ... comprehensive income for foreign currency exchange contracts in 2017 and 2018, and the changes in value for derivative contracts and other in 2018, in million?\n\nFormat your response as follows: \"Therefore, the answer is (insert answer here)\"."
    }
  ],
  "target": "-31.0",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "question_id": "complong-testmini-0",
    "answer_type": "float"
  }
}

Note: Some content was truncated for display.

Prompt Template#

Prompt Template:

Please read the following text and answer the question below.

<text>
{context}
</text>

{question}

Format your response as follows: "Therefore, the answer is (insert answer here)".

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets docmath \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['docmath'],
    dataset_args={
        'docmath': {
            # subset_list: ['complong_testmini', 'compshort_testmini', 'simplong_testmini']  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)