DROP#

概述#

DROP(Discrete Reasoning Over Paragraphs,段落离散推理)是一个具有挑战性的阅读理解基准测试,要求模型在文本段落上执行离散推理操作。与简单的抽取式问答不同,DROP 的问题需要进行数值推理、计数和比较等操作。

任务描述#

  • 任务类型:带离散推理的阅读理解

  • 输入:需要推理的段落和问题

  • 输出:数值答案、文本片段(span)或日期

  • 推理类型:加法、减法、计数、比较、排序

主要特点#

  • 包含 96,567 个需要对文本进行离散推理的问题

  • 问题基于 NFL 比赛摘要、维基百科文章等

  • 需要多步推理和算术运算

  • 支持多种有效答案格式(数字、文本片段、日期)

  • 测试模型的组合推理能力

评估说明#

  • 默认配置使用 3-shot 示例

  • 评估指标包括精确匹配(Exact Match, EM)和 token 级别的 F1 分数

  • 答案应遵循格式:"Answer: [ANSWER]"

  • F1 分数是主要的比较指标

  • 答案会与多个参考答案进行比对验证

属性#

属性

基准测试名称

drop

数据集ID

AI-ModelScope/DROP

论文

N/A

标签

Reasoning

指标

em, f1

默认示例数量

3-shot

评估划分

validation

数据统计#

指标

总样本数

9,536

提示词长度(平均)

5454.05 字符

提示词长度(最小/最大)

4638 / 9893 字符

样例示例#

子集: default

{
  "input": [
    {
      "id": "d4ab7ff6",
      "content": "You will be asked to read a passage and answer a question. Some examples of passages and Q&A are provided below.\n\n# Examples\n---\nPassage: Trunajaya rebellion  or Trunajaya War was the ultimately unsuccessful rebellion waged by the Madurese pr ... [TRUNCATED] ... iled a 40-yard field goal, yet the Raiders' defense would shut down any possible attempt.\nQuestion: Who scored the first touchdown of the game?\n\nThink step by step, then write a line of the form \"Answer: [ANSWER]\" at the end of your response."
    }
  ],
  "target": "[('Chaz Schilens',), ('JaMarcus Russell',)]",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "passage": " Hoping to rebound from their loss to the Patriots, the Raiders stayed at home for a Week 16 duel with the Houston Texans.  Oakland would get the early lead in the first quarter as quarterback JaMarcus Russell completed a 20-yard touchdown pa ... [TRUNCATED] ...  29-yard touchdown pass from Russell, followed up by an 80-yard punt return for a touchdown.  The Texans tried to rally in the fourth quarter as Brown nailed a 40-yard field goal, yet the Raiders' defense would shut down any possible attempt.",
    "answer": {
      "number": "",
      "date": {
        "day": "",
        "month": "",
        "year": ""
      },
      "spans": [
        "Chaz Schilens"
      ],
      "worker_id": "",
      "hit_id": ""
    },
    "validated_answers": {
      "number": [
        "",
        ""
      ],
      "date": [
        {
          "day": "",
          "month": "",
          "year": ""
        },
        {
          "day": "",
          "month": "",
          "year": ""
        }
      ],
      "spans": [
        [
          "Chaz Schilens"
        ],
        [
          "JaMarcus Russell"
        ]
      ],
      "worker_id": [
        "",
        ""
      ],
      "hit_id": [
        "",
        ""
      ]
    }
  }
}

注:部分内容因展示需要已被截断。

提示模板#

提示模板:

You will be asked to read a passage and answer a question. {drop_examples}
# Your Task

---
{query}

Think step by step, then write a line of the form "Answer: [ANSWER]" at the end of your response.

使用方法#

使用 CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets drop \
    --limit 10  # 正式评估时请删除此行

使用 Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['drop'],
    limit=10,  # 正式评估时请删除此行
)

run_task(task_cfg=task_cfg)