IFEval

Overview

IFEval (Instruction-Following Eval) is a benchmark for evaluating how well language models follow explicit, verifiable instructions. It contains prompts with specific formatting, content, or structural requirements that can be objectively verified.

Task Description

  • Task Type: Instruction Following Evaluation

  • Input: Prompts with explicit, verifiable constraints

  • Output: Response that follows all specified instructions

  • Constraint Types: Format, length, keywords, structure, etc.

Key Features

  • 541 prompts covering 25 types of verifiable instructions; a prompt can carry more than one instruction

  • Instructions are objectively checkable (not subjective)

  • Examples: “write exactly 3 paragraphs”, “include the word X”, “use bullet points”

  • Tests instruction comprehension and compliance

  • No ambiguity in evaluation criteria
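
To make "objectively checkable" concrete, the following is a minimal sketch of three rule-based checkers, one for each instruction type that appears in the sample record on this page. The function names and the exact matching rules (whitespace word splitting, a single-asterisk regex for highlights) are assumptions made for illustration, not the checkers the benchmark itself ships.

```python
import re


def check_no_comma(response: str) -> bool:
    """True when the response contains no comma at all."""
    return "," not in response


def check_min_words(response: str, num_words: int) -> bool:
    """True when the response has at least `num_words` whitespace-separated words."""
    return len(response.split()) >= num_words


def check_highlighted_sections(response: str, num_highlights: int) -> bool:
    """True when at least `num_highlights` non-empty *highlighted* spans appear."""
    # Assumed rule: a highlight is text wrapped in single asterisks on one line.
    spans = re.findall(r"\*[^\n*]+\*", response)
    return len(spans) >= num_highlights


if __name__ == "__main__":
    draft = "*Early life* was quiet. *Regency* came next. *Legacy* endures."
    print(check_no_comma(draft))
    print(check_highlighted_sections(draft, 3))
    print(check_min_words(draft, 300))
```

Each checker returns a plain boolean, so a response either satisfies an instruction or it does not; no judge model or human rating is involved.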

Evaluation Notes

  • Default configuration uses 0-shot evaluation

  • Four metrics available:

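    • prompt_level_strict: Fraction of prompts whose response satisfies every instruction in the prompt

    • prompt_level_loose: Same as prompt_level_strict, but an instruction also counts as followed if a lightly normalized variant of the response passes (for example with markdown asterisks, the first line, or the last line removed)

    • inst_level_strict: Fraction of individual instructions satisfied, counted across all prompts

    • inst_level_loose: Per-instruction accuracy under the same loose normalization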

  • prompt_level_strict is the primary metric

  • Automatic verification of instruction compliance
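
The relationship between the prompt-level and instruction-level scores, and between strict and loose checking, can be sketched as below. This is an illustrative outline under stated assumptions: `loose_variants`, `follows_loose`, and `aggregate` are hypothetical helper names, and the set of normalizations mirrors the commonly used IFEval approach (drop the first line, the last line, or both, each with and without asterisks) rather than a confirmed description of this framework's internals.

```python
from typing import Callable, Dict, List


def loose_variants(response: str) -> List[str]:
    """Return the response plus lightly normalized variants of it."""
    lines = response.split("\n")
    candidates = [
        response,
        "\n".join(lines[1:]).strip(),    # first line dropped
        "\n".join(lines[:-1]).strip(),   # last line dropped
        "\n".join(lines[1:-1]).strip(),  # first and last lines dropped
    ]
    # Also try every candidate with markdown asterisks removed.
    return candidates + [c.replace("*", "") for c in candidates]


def follows_loose(response: str, check: Callable[[str], bool]) -> bool:
    """Loose check: pass if any non-empty variant satisfies the checker."""
    return any(bool(v.strip()) and check(v) for v in loose_variants(response))


def aggregate(results: List[List[bool]]) -> Dict[str, float]:
    """results[i][j] is True when instruction j of prompt i was followed."""
    prompt_hits = sum(all(row) for row in results)
    inst_hits = sum(sum(row) for row in results)
    inst_total = sum(len(row) for row in results)
    return {
        "prompt_level": prompt_hits / len(results),
        "inst_level": inst_hits / inst_total,
    }


if __name__ == "__main__":
    # Two prompts: the first has three instructions (one missed), the second has one.
    print(aggregate([[True, True, False], [True]]))
    # {'prompt_level': 0.5, 'inst_level': 0.75}
```

Running `aggregate` once on strict booleans and once on loose booleans yields the four reported numbers. The prompt-level score can never exceed the instruction-level score, because a single missed instruction fails the whole prompt.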

Properties

| Property | Value |
| --- | --- |
| Benchmark Name | ifeval |
| Dataset ID | opencompass/ifeval |
| Paper | N/A |
| Tags | InstructionFollowing |
| Metrics | prompt_level_strict, inst_level_strict, prompt_level_loose, inst_level_loose |
| Default Shots | 0-shot |
| Evaluation Split | train |

Data Statistics

| Metric | Value |
| --- | --- |
| Total Samples | 541 |
| Prompt Length (Mean) | 210.75 chars |
| Prompt Length (Min/Max) | 53 / 1858 chars |

Sample Example

Subset: default

{
  "input": [
    {
      "id": "cb71907f",
      "content": "Write a 300+ word summary of the wikipedia page \"https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli\". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*."
    }
  ],
  "target": "",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "key": 1000,
    "prompt": "Write a 300+ word summary of the wikipedia page \"https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli\". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*.",
    "instruction_id_list": [
      "punctuation:no_comma",
      "detectable_format:number_highlighted_sections",
      "length_constraints:number_words"
    ],
    "kwargs": [
      {
        "num_highlights": null,
        "relation": null,
        "num_words": null,
        "num_placeholders": null,
        "prompt_to_repeat": null,
        "num_bullets": null,
        "section_spliter": null,
        "num_sections": null,
        "capital_relation": null,
        "capital_frequency": null,
        "keywords": null,
        "num_paragraphs": null,
        "language": null,
        "let_relation": null,
        "letter": null,
        "let_frequency": null,
        "end_phrase": null,
        "forbidden_words": null,
        "keyword": null,
        "frequency": null,
        "num_sentences": null,
        "postscript_marker": null,
        "first_word": null,
        "nth_paragraph": null
      },
      {
        "num_highlights": 3,
        "relation": null,
        "num_words": null,
        "num_placeholders": null,
        "prompt_to_repeat": null,
        "num_bullets": null,
        "section_spliter": null,
        "num_sections": null,
        "capital_relation": null,
        "capital_frequency": null,
        "keywords": null,
        "num_paragraphs": null,
        "language": null,
        "let_relation": null,
        "letter": null,
        "let_frequency": null,
        "end_phrase": null,
        "forbidden_words": null,
        "keyword": null,
        "frequency": null,
        "num_sentences": null,
        "postscript_marker": null,
        "first_word": null,
        "nth_paragraph": null
      },
      {
        "num_highlights": null,
        "relation": "at least",
        "num_words": 300,
        "num_placeholders": null,
        "prompt_to_repeat": null,
        "num_bullets": null,
        "section_spliter": null,
        "num_sections": null,
        "capital_relation": null,
        "capital_frequency": null,
        "keywords": null,
        "num_paragraphs": null,
        "language": null,
        "let_relation": null,
        "letter": null,
        "let_frequency": null,
        "end_phrase": null,
        "forbidden_words": null,
        "keyword": null,
        "frequency": null,
        "num_sentences": null,
        "postscript_marker": null,
        "first_word": null,
        "nth_paragraph": null
      }
    ]
  }
}
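
In the record above, `instruction_id_list` and `kwargs` are parallel lists: the entry at position i in `kwargs` holds the parameters for the instruction at position i, with every unused parameter set to null. A small sketch of how a consumer might pair them and drop the null padding follows; `active_instructions` is a hypothetical helper name, and the metadata literal is an abridged copy of the sample.

```python
from typing import Any, Dict, List, Tuple


def active_instructions(metadata: Dict[str, Any]) -> List[Tuple[str, Dict[str, Any]]]:
    """Pair each instruction id with only the kwargs that are actually set."""
    pairs = []
    for inst_id, kwargs in zip(metadata["instruction_id_list"], metadata["kwargs"]):
        # Null (None) values are padding shared by all instruction types.
        pairs.append((inst_id, {k: v for k, v in kwargs.items() if v is not None}))
    return pairs


# Abridged version of the sample record's metadata (most null keys omitted).
sample_metadata = {
    "instruction_id_list": [
        "punctuation:no_comma",
        "detectable_format:number_highlighted_sections",
        "length_constraints:number_words",
    ],
    "kwargs": [
        {"num_highlights": None, "relation": None, "num_words": None},
        {"num_highlights": 3, "relation": None, "num_words": None},
        {"num_highlights": None, "relation": "at least", "num_words": 300},
    ],
}

if __name__ == "__main__":
    for inst_id, params in active_instructions(sample_metadata):
        print(inst_id, params)
```

For this sample the no-comma instruction takes no parameters, the highlight instruction takes `num_highlights`, and the word-count instruction takes `relation` and `num_words`.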

Prompt Template

No prompt template defined.

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets ifeval \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['ifeval'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)