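IFEval
Overview
IFEval (Instruction-Following Eval) is a benchmark for evaluating how well language models follow explicit, verifiable instructions. It consists of prompts that carry specific format, content, or structure requirements, each of which can be checked objectively.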
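Task Description
Task type: instruction-following evaluation
Input: a prompt containing explicit, verifiable constraints
Output: a response that satisfies every specified instruction
Constraint types: format, length, keywords, structure, and so on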
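Key Features
Roughly 500 prompts (541 samples in this dataset) covering 25 types of verifiable instructions
Instructions can be checked objectively, with no subjective judgment involved
Examples: "write exactly 3 paragraphs", "include the word X", "use a bulleted list"
Tests how well a model understands and complies with instructions
Evaluation criteria are unambiguous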
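To make "objectively checkable" concrete, here is a minimal sketch of three rule-based checkers in the spirit of the instruction types listed above. The function names and the exact matching rules are illustrative assumptions, not the benchmark's actual verifier code.

```python
import re


def check_no_comma(response: str) -> bool:
    """Pass if the response contains no comma characters."""
    return "," not in response


def check_min_words(response: str, num_words: int) -> bool:
    """Pass if the response has at least `num_words` whitespace-separated words."""
    return len(response.split()) >= num_words


def check_highlighted_sections(response: str, num_highlights: int) -> bool:
    """Pass if at least `num_highlights` non-empty *highlighted* spans appear."""
    # Hypothetical rule: a highlight is text wrapped in single asterisks on one line.
    spans = re.findall(r"\*[^\n\*]+\*", response)
    return len(spans) >= num_highlights


response = "*Early life* was calm. *Reign* was long. *Legacy* endures."
print(check_no_comma(response))                 # True: no commas present
print(check_min_words(response, 300))           # False: only 9 words
print(check_highlighted_sections(response, 3))  # True: three highlighted spans
```

Each checker returns a plain boolean, which is what allows the benchmark to score responses without a human or model judge.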
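Evaluation Notes
The default configuration uses 0-shot evaluation
Four evaluation metrics are provided:
prompt_level_strict: every instruction in the prompt must be followed strictly
prompt_level_loose: prompt-level accuracy with some tolerance for minor deviations
inst_level_strict: accuracy computed per instruction (strict)
inst_level_loose: accuracy computed per instruction (loose)
The primary metric is prompt_level_strict
Instruction compliance is verified automatically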
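The difference between the prompt-level and instruction-level metrics, and between strict and loose checking, can be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, and the loose relaxations (dropping the first or last line, removing markdown asterisks) are modeled on those described for the original IFEval benchmark; whether EvalScope applies exactly these is not confirmed here.

```python
from typing import Callable, Dict, List


def loose_variants(response: str) -> List[str]:
    """Build relaxed versions of a response (assumed set of relaxations)."""
    lines = response.split("\n")
    candidates = [
        response,
        "\n".join(lines[1:]),    # drop the first line (e.g. a preamble)
        "\n".join(lines[:-1]),   # drop the last line (e.g. a sign-off)
        "\n".join(lines[1:-1]),  # drop both
    ]
    # Also try every candidate with markdown asterisks removed.
    return candidates + [c.replace("*", "") for c in candidates]


def follows(check: Callable[[str], bool], response: str, loose: bool = False) -> bool:
    """Strict: check the raw response. Loose: pass if any relaxed variant passes."""
    variants = loose_variants(response) if loose else [response]
    return any(check(v) for v in variants)


def aggregate(results: List[List[bool]]) -> Dict[str, float]:
    """`results[i]` holds one pass/fail flag per instruction of prompt i."""
    # Prompt level: a prompt counts only if all of its instructions pass.
    prompt_level = sum(all(r) for r in results) / len(results)
    # Instruction level: every instruction counts on its own.
    flat = [flag for r in results for flag in r]
    inst_level = sum(flat) / len(flat)
    return {"prompt_level": prompt_level, "inst_level": inst_level}


def starts_with_dear(text: str) -> bool:
    return text.startswith("Dear")


reply = "Sure, here is the letter:\nDear team, thank you."
print(follows(starts_with_dear, reply))              # False
print(follows(starts_with_dear, reply, loose=True))  # True

scores = aggregate([[True, True, False], [True], [False, True]])
print(scores)
```

In the toy data, only one of three prompts passes every instruction, while four of six individual instructions pass, so the prompt-level score is lower than the instruction-level score.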
Properties
| Property | Value |
|---|---|
| Benchmark name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default shots | 0-shot |
| Evaluation split | |
Data Statistics
| Metric | Value |
|---|---|
| Total samples | 541 |
| Prompt length (mean) | 210.75 characters |
| Prompt length (min / max) | 53 / 1858 characters |
Sample Example
Subset: default
{
"input": [
{
"id": "cb71907f",
"content": "Write a 300+ word summary of the wikipedia page \"https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli\". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*."
}
],
"target": "",
"id": 0,
"group_id": 0,
"metadata": {
"key": 1000,
"prompt": "Write a 300+ word summary of the wikipedia page \"https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli\". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*.",
"instruction_id_list": [
"punctuation:no_comma",
"detectable_format:number_highlighted_sections",
"length_constraints:number_words"
],
"kwargs": [
{
"num_highlights": null,
"relation": null,
"num_words": null,
"num_placeholders": null,
"prompt_to_repeat": null,
"num_bullets": null,
"section_spliter": null,
"num_sections": null,
"capital_relation": null,
"capital_frequency": null,
"keywords": null,
"num_paragraphs": null,
"language": null,
"let_relation": null,
"letter": null,
"let_frequency": null,
"end_phrase": null,
"forbidden_words": null,
"keyword": null,
"frequency": null,
"num_sentences": null,
"postscript_marker": null,
"first_word": null,
"nth_paragraph": null
},
{
"num_highlights": 3,
"relation": null,
"num_words": null,
"num_placeholders": null,
"prompt_to_repeat": null,
"num_bullets": null,
"section_spliter": null,
"num_sections": null,
"capital_relation": null,
"capital_frequency": null,
"keywords": null,
"num_paragraphs": null,
"language": null,
"let_relation": null,
"letter": null,
"let_frequency": null,
"end_phrase": null,
"forbidden_words": null,
"keyword": null,
"frequency": null,
"num_sentences": null,
"postscript_marker": null,
"first_word": null,
"nth_paragraph": null
},
{
"num_highlights": null,
"relation": "at least",
"num_words": 300,
"num_placeholders": null,
"prompt_to_repeat": null,
"num_bullets": null,
"section_spliter": null,
"num_sections": null,
"capital_relation": null,
"capital_frequency": null,
"keywords": null,
"num_paragraphs": null,
"language": null,
"let_relation": null,
"letter": null,
"let_frequency": null,
"end_phrase": null,
"forbidden_words": null,
"keyword": null,
"frequency": null,
"num_sentences": null,
"postscript_marker": null,
"first_word": null,
"nth_paragraph": null
}
]
}
}
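In the sample above, each entry of `instruction_id_list` lines up by position with one entry of `kwargs`, and most keyword arguments are `null`. A minimal sketch of pairing them and keeping only the arguments that are actually set could look like this; `active_instructions` is a hypothetical helper, and the `metadata` dictionary is a trimmed copy of the sample with most null fields omitted for brevity.

```python
from typing import Any, Dict, List, Tuple


def active_instructions(metadata: Dict[str, Any]) -> List[Tuple[str, Dict[str, Any]]]:
    """Pair each instruction ID with its non-null keyword arguments."""
    pairs = []
    for inst_id, kwargs in zip(metadata["instruction_id_list"], metadata["kwargs"]):
        used = {key: value for key, value in kwargs.items() if value is not None}
        pairs.append((inst_id, used))
    return pairs


# Trimmed copy of the sample's metadata (JSON null becomes Python None).
metadata = {
    "instruction_id_list": [
        "punctuation:no_comma",
        "detectable_format:number_highlighted_sections",
        "length_constraints:number_words",
    ],
    "kwargs": [
        {"num_highlights": None, "relation": None, "num_words": None},
        {"num_highlights": 3, "relation": None, "num_words": None},
        {"num_highlights": None, "relation": "at least", "num_words": 300},
    ],
}

for inst_id, used in active_instructions(metadata):
    print(inst_id, used)
```

For this record the no-comma instruction takes no arguments, the highlight instruction requires 3 highlights, and the word-count instruction requires at least 300 words, matching the wording of the prompt.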
Prompt Template
No prompt template is defined.
Usage
Using the CLI
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets ifeval \
    --limit 10  # Remove this line for a full evaluation
Using Python
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['ifeval'],
    limit=10,  # Remove this line for a full evaluation
)

run_task(task_cfg=task_cfg)