IFEval

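Overview

IFEval (Instruction-Following Eval) is a benchmark for evaluating how well language models follow explicit, verifiable instructions. It consists of prompts that carry specific format, content, or structure requirements, each of which can be checked objectively.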

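Task Description

Task type: instruction-following evaluation
Input: a prompt containing explicit, verifiable constraints
Output: a response that follows every specified instruction
Constraint types: format, length, keywords, structure, and so on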

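Key Features

About 500 prompts (541 in this dataset) covering 25 types of verifiable instructions
Instructions can be checked objectively, without subjective judgment
Examples: "write exactly 3 paragraphs", "include the word X", "use a bulleted list"
Tests a model's ability to understand and comply with instructions
Evaluation criteria are unambiguous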
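
The three example instructions above can each be verified with a few lines of code. The following is a minimal sketch of that idea, not the benchmark's actual checkers: the function names and the exact rules (blank-line paragraph separation, whole-word matching, `*`/`-` bullets) are assumptions made for illustration.

```python
import re


def has_exact_paragraphs(response: str, n: int) -> bool:
    """True if the response has exactly n non-empty, blank-line-separated paragraphs."""
    paragraphs = [p for p in re.split(r"\n\s*\n", response.strip()) if p.strip()]
    return len(paragraphs) == n


def contains_word(response: str, word: str) -> bool:
    """True if `word` occurs as a whole word, ignoring case."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, response, flags=re.IGNORECASE) is not None


def uses_bullet_list(response: str) -> bool:
    """True if at least one line starts with a '*' or '-' bullet marker."""
    return any(re.match(r"^\s*[*-]\s+\S", line) for line in response.splitlines())


if __name__ == "__main__":
    response = "First paragraph.\n\nSecond one mentions Python.\n\n- a bullet"
    print(has_exact_paragraphs(response, 3))
    print(contains_word(response, "python"))
    print(uses_bullet_list(response))
```

Because each check returns a plain boolean, the result for a response does not depend on a human or model judge.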

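Evaluation Notes

The default configuration uses 0-shot evaluation
Four metrics are provided:
- prompt_level_strict: every instruction in the prompt must be followed strictly
- prompt_level_loose: same as above, but minor deviations are tolerated
- inst_level_strict: accuracy computed per instruction (strict)
- inst_level_loose: accuracy computed per instruction (loose)
The primary metric is prompt_level_strict
Instruction compliance is verified automatically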
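
To make the difference between the prompt-level and instruction-level metrics concrete, here is a minimal sketch of how they could be aggregated from per-instruction pass/fail results. The helper names are hypothetical. The `loose_variants` function assumes the relaxation used in the original IFEval release (re-checking the response with the first line, the last line, and markdown asterisks removed); whether this implementation relaxes responses in exactly the same way is an assumption.

```python
from typing import Callable, Dict, List


def aggregate(results: List[List[bool]]) -> Dict[str, float]:
    """results[i][j] is True if instruction j of prompt i was followed."""
    n_prompts = len(results)
    n_instructions = sum(len(r) for r in results)
    return {
        # A prompt counts only if all of its instructions pass.
        "prompt_level": sum(all(r) for r in results) / n_prompts,
        # Every instruction counts on its own.
        "inst_level": sum(sum(r) for r in results) / n_instructions,
    }


def loose_variants(response: str) -> List[str]:
    """Assumed relaxation: drop first/last line and strip '*' markers."""
    lines = response.split("\n")
    base = [
        response,
        "\n".join(lines[1:]).strip(),    # without the first line
        "\n".join(lines[:-1]).strip(),   # without the last line
        "\n".join(lines[1:-1]).strip(),  # without both
    ]
    return base + [v.replace("*", "") for v in base]


def follows_loose(response: str, check: Callable[[str], bool]) -> bool:
    """Loose pass: the check succeeds on at least one variant."""
    return any(check(v) for v in loose_variants(response))


if __name__ == "__main__":
    # Two prompts: the first follows 3/3 instructions, the second 1/2.
    print(aggregate([[True, True, True], [True, False]]))
```

With these inputs only one of the two prompts passes completely, while four of the five individual instructions pass, so the instruction-level score is higher than the prompt-level score.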

Properties

Property	Value
Benchmark name	`ifeval`
Dataset ID	opencompass/ifeval
Paper	N/A
Tags	`InstructionFollowing`
Metrics	`prompt_level_strict`, `inst_level_strict`, `prompt_level_loose`, `inst_level_loose`
Default shots	0-shot
Evaluation split	`train`

Data Statistics

Metric	Value
Total samples	541
Prompt length (mean)	210.75 characters
Prompt length (min / max)	53 / 1858 characters

Sample Example

Subset: default

{
  "input": [
    {
      "id": "cb71907f",
      "content": "Write a 300+ word summary of the wikipedia page \"https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli\". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*."
    }
  ],
  "target": "",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "key": 1000,
    "prompt": "Write a 300+ word summary of the wikipedia page \"https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli\". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*.",
    "instruction_id_list": [
      "punctuation:no_comma",
      "detectable_format:number_highlighted_sections",
      "length_constraints:number_words"
    ],
    "kwargs": [
      {
        "num_highlights": null,
        "relation": null,
        "num_words": null,
        "num_placeholders": null,
        "prompt_to_repeat": null,
        "num_bullets": null,
        "section_spliter": null,
        "num_sections": null,
        "capital_relation": null,
        "capital_frequency": null,
        "keywords": null,
        "num_paragraphs": null,
        "language": null,
        "let_relation": null,
        "letter": null,
        "let_frequency": null,
        "end_phrase": null,
        "forbidden_words": null,
        "keyword": null,
        "frequency": null,
        "num_sentences": null,
        "postscript_marker": null,
        "first_word": null,
        "nth_paragraph": null
      },
      {
        "num_highlights": 3,
        "relation": null,
        "num_words": null,
        "num_placeholders": null,
        "prompt_to_repeat": null,
        "num_bullets": null,
        "section_spliter": null,
        "num_sections": null,
        "capital_relation": null,
        "capital_frequency": null,
        "keywords": null,
        "num_paragraphs": null,
        "language": null,
        "let_relation": null,
        "letter": null,
        "let_frequency": null,
        "end_phrase": null,
        "forbidden_words": null,
        "keyword": null,
        "frequency": null,
        "num_sentences": null,
        "postscript_marker": null,
        "first_word": null,
        "nth_paragraph": null
      },
      {
        "num_highlights": null,
        "relation": "at least",
        "num_words": 300,
        "num_placeholders": null,
        "prompt_to_repeat": null,
        "num_bullets": null,
        "section_spliter": null,
        "num_sections": null,
        "capital_relation": null,
        "capital_frequency": null,
        "keywords": null,
        "num_paragraphs": null,
        "language": null,
        "let_relation": null,
        "letter": null,
        "let_frequency": null,
        "end_phrase": null,
        "forbidden_words": null,
        "keyword": null,
        "frequency": null,
        "num_sentences": null,
        "postscript_marker": null,
        "first_word": null,
        "nth_paragraph": null
      }
    ]
  }
}
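
In the sample above, `instruction_id_list` and `kwargs` are parallel lists, and each `kwargs` entry carries every possible parameter with unused ones set to `null`. The sketch below shows one way to pair each instruction with only the parameters that are actually set. The helper name `active_instructions` is hypothetical, and the `kwargs` dictionaries are abbreviated to three keys for readability.

```python
from typing import Any, Dict, List, Tuple


def active_instructions(metadata: Dict[str, Any]) -> List[Tuple[str, Dict[str, Any]]]:
    """Pair each instruction ID with its non-null keyword arguments."""
    pairs = []
    for inst_id, kwargs in zip(metadata["instruction_id_list"], metadata["kwargs"]):
        # Drop parameters that are null (None after JSON parsing).
        pairs.append((inst_id, {k: v for k, v in kwargs.items() if v is not None}))
    return pairs


# Abbreviated copy of the sample's metadata (only three of the kwargs keys).
metadata = {
    "instruction_id_list": [
        "punctuation:no_comma",
        "detectable_format:number_highlighted_sections",
        "length_constraints:number_words",
    ],
    "kwargs": [
        {"num_highlights": None, "relation": None, "num_words": None},
        {"num_highlights": 3, "relation": None, "num_words": None},
        {"num_highlights": None, "relation": "at least", "num_words": 300},
    ],
}

if __name__ == "__main__":
    for inst_id, kwargs in active_instructions(metadata):
        print(inst_id, kwargs)
```

Read this way, the sample asks for no commas, at least 3 highlighted sections, and at least 300 words, which matches the wording of the prompt.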

Prompt Template

No prompt template is defined.

Usage

Using the CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets ifeval \
    --limit 10  # Remove this line for a full evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['ifeval'],
    limit=10,  # Remove this line for a full evaluation
)

run_task(task_cfg=task_cfg)