IFEval#
Overview#
IFEval (Instruction-Following Eval) is a benchmark for evaluating how well language models follow explicit, verifiable instructions. It contains prompts with specific formatting, content, or structural requirements that can be objectively verified.
Task Description#
Task Type: Instruction Following Evaluation
Input: Prompts with explicit, verifiable constraints
Output: Response that follows all specified instructions
Constraint Types: Format, length, keywords, structure, etc.
Key Features#
~500 prompts with 25 types of verifiable instructions
Instructions are objectively checkable (not subjective)
Examples: “write exactly 3 paragraphs”, “include the word X”, “use bullet points”
Tests instruction comprehension and compliance
No ambiguity in evaluation criteria
Evaluation Notes#
Default configuration uses 0-shot evaluation
Four metrics available:
prompt_level_strict: All instructions in prompt must be followedprompt_level_loose: Some tolerance for minor deviationsinst_level_strict: Per-instruction accuracy (strict)inst_level_loose: Per-instruction accuracy (loose)
prompt_level_strictis the primary metricAutomatic verification of instruction compliance
Properties#
Property |
Value |
|---|---|
Benchmark Name |
|
Dataset ID |
|
Paper |
N/A |
Tags |
|
Metrics |
|
Default Shots |
0-shot |
Evaluation Split |
|
Data Statistics#
Metric |
Value |
|---|---|
Total Samples |
541 |
Prompt Length (Mean) |
210.75 chars |
Prompt Length (Min/Max) |
53 / 1858 chars |
Sample Example#
Subset: default
{
"input": [
{
"id": "cb71907f",
"content": "Write a 300+ word summary of the wikipedia page \"https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli\". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*."
}
],
"target": "",
"id": 0,
"group_id": 0,
"metadata": {
"key": 1000,
"prompt": "Write a 300+ word summary of the wikipedia page \"https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli\". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*.",
"instruction_id_list": [
"punctuation:no_comma",
"detectable_format:number_highlighted_sections",
"length_constraints:number_words"
],
"kwargs": [
{
"num_highlights": null,
"relation": null,
"num_words": null,
"num_placeholders": null,
"prompt_to_repeat": null,
"num_bullets": null,
"section_spliter": null,
"num_sections": null,
"capital_relation": null,
"capital_frequency": null,
"keywords": null,
"num_paragraphs": null,
"language": null,
"let_relation": null,
"letter": null,
"let_frequency": null,
"end_phrase": null,
"forbidden_words": null,
"keyword": null,
"frequency": null,
"num_sentences": null,
"postscript_marker": null,
"first_word": null,
"nth_paragraph": null
},
{
"num_highlights": 3,
"relation": null,
"num_words": null,
"num_placeholders": null,
"prompt_to_repeat": null,
"num_bullets": null,
"section_spliter": null,
"num_sections": null,
"capital_relation": null,
"capital_frequency": null,
"keywords": null,
"num_paragraphs": null,
"language": null,
"let_relation": null,
"letter": null,
"let_frequency": null,
"end_phrase": null,
"forbidden_words": null,
"keyword": null,
"frequency": null,
"num_sentences": null,
"postscript_marker": null,
"first_word": null,
"nth_paragraph": null
},
{
"num_highlights": null,
"relation": "at least",
"num_words": 300,
"num_placeholders": null,
"prompt_to_repeat": null,
"num_bullets": null,
"section_spliter": null,
"num_sections": null,
"capital_relation": null,
"capital_frequency": null,
"keywords": null,
"num_paragraphs": null,
"language": null,
"let_relation": null,
"letter": null,
"let_frequency": null,
"end_phrase": null,
"forbidden_words": null,
"keyword": null,
"frequency": null,
"num_sentences": null,
"postscript_marker": null,
"first_word": null,
"nth_paragraph": null
}
]
}
}
Prompt Template#
No prompt template defined.
Usage#
Using CLI#
evalscope eval \
--model YOUR_MODEL \
--api-url OPENAI_API_COMPAT_URL \
--api-key EMPTY_TOKEN \
--datasets ifeval \
--limit 10 # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
model='YOUR_MODEL',
api_url='OPENAI_API_COMPAT_URL',
api_key='EMPTY_TOKEN',
datasets=['ifeval'],
limit=10, # Remove this line for formal evaluation
)
run_task(task_cfg=task_cfg)