IFBench

Overview

IFBench is a benchmark that evaluates how reliably AI models follow novel, challenging, and diverse verifiable instructions, with a strong focus on out-of-domain generalization. Developed by AllenAI, it targets the overfitting and data contamination problems found in existing instruction-following benchmarks.

Task Description

  • Task Type: Instruction Following Evaluation

  • Input: Prompts with verifiable constraints

  • Output: Responses that must satisfy specific constraints

  • Focus: Precise instruction-following capabilities

Key Features

  • 58 manually curated verifiable constraints

  • Categories: counting, formatting, word usage, etc.

  • Focus on out-of-domain generalization

  • Programmatic verification of constraint satisfaction

  • Addresses data contamination concerns
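
To make "programmatic verification" concrete, here is a minimal sketch of what a single verifiable constraint could look like as code. The function names and the word-counting rule are illustrative assumptions, not IFBench's actual checker; only the parameter names min_words and max_words are borrowed from the kwargs shown in the sample record on this page.

```python
import re


def count_words(text: str) -> int:
    """Count word tokens using a simple regex (a simplifying assumption)."""
    return len(re.findall(r"\b\w+\b", text))


def check_word_range(response: str, min_words: int, max_words: int) -> bool:
    """Hypothetical checker: True when the word count lies in [min_words, max_words]."""
    return min_words <= count_words(response) <= max_words


response = "Verifiable constraints can be checked by a short deterministic function."
print(check_word_range(response, min_words=5, max_words=20))
```

Because the check is deterministic, the same response always receives the same verdict, which is what allows constraint satisfaction to be scored without a judge model.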

Evaluation Notes

  • Default configuration uses 0-shot evaluation

  • Metrics: prompt_level_strict, inst_level_strict, prompt_level_loose, inst_level_loose

  • Requires the emoji and syllapy Python packages

  • Evaluates both strict and loose constraint satisfaction
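
The four metrics combine two axes: prompt-level versus instruction-level aggregation, and strict versus loose checking. The sketch below illustrates both axes under the assumption that IFBench follows the IFEval convention, where loose checking retries the constraint on lightly cleaned variants of the response. All function names here are hypothetical and are not part of the evalscope API.

```python
from typing import Callable, Dict, List


def loose_variants(response: str) -> List[str]:
    """Build relaxed variants: drop the first and/or last line, with and without '*'."""
    lines = response.split("\n")
    base = [
        response,
        "\n".join(lines[1:]).strip(),    # without the first line
        "\n".join(lines[:-1]).strip(),   # without the last line
        "\n".join(lines[1:-1]).strip(),  # without both
    ]
    return base + [variant.replace("*", "") for variant in base]


def passes(check: Callable[[str], bool], response: str, loose: bool) -> bool:
    """Strict: test the raw response. Loose: pass if any non-empty variant passes."""
    candidates = loose_variants(response) if loose else [response]
    return any(bool(c.strip()) and check(c) for c in candidates)


def aggregate(per_prompt: List[List[bool]]) -> Dict[str, float]:
    """per_prompt[i] holds one pass/fail flag per instruction attached to prompt i."""
    flat = [ok for results in per_prompt for ok in results]
    return {
        # A prompt counts only if every one of its instructions passed.
        "prompt_level": sum(all(r) for r in per_prompt) / len(per_prompt),
        # Each instruction counts on its own.
        "inst_level": sum(flat) / len(flat),
    }


# Illustrative constraint: the response must start with "Answer:".
starts_with_answer = lambda r: r.startswith("Answer:")
reply = "Sure, here you go:\nAnswer: 42"
print(passes(starts_with_answer, reply, loose=False))  # fails on the raw text
print(passes(starts_with_answer, reply, loose=True))   # passes once the first line is dropped

scores = aggregate([[True, True], [True, False], [False]])
print(scores)
```

In this toy input only one of three prompts has all instructions satisfied, while three of five individual instructions pass, so the instruction-level score is higher than the prompt-level score.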

Properties

| Property | Value |
| --- | --- |
| Benchmark Name | ifbench |
| Dataset ID | allenai/IFBench_test |
| Paper | N/A |
| Tags | InstructionFollowing |
| Metrics | prompt_level_strict, inst_level_strict, prompt_level_loose, inst_level_loose |
| Default Shots | 0-shot |
| Evaluation Split | train |

Data Statistics

| Metric | Value |
| --- | --- |
| Total Samples | 300 |
| Prompt Length (Mean) | 343.41 chars |
| Prompt Length (Min/Max) | 50 / 904 chars |
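
Statistics of this kind can be recomputed from any list of prompt strings. The helper below is a sketch that assumes length is measured in Python characters, which may not match how the figures on this page were produced. The prompts used here are placeholders, not dataset content, and loading the dataset itself is left out.

```python
from statistics import mean
from typing import Dict, List


def prompt_length_stats(prompts: List[str]) -> Dict[str, float]:
    """Summarize prompt lengths in characters (hypothetical helper)."""
    lengths = [len(p) for p in prompts]
    return {
        "total": len(lengths),
        "mean": round(mean(lengths), 2),
        "min": min(lengths),
        "max": max(lengths),
    }


# Placeholder prompts for illustration only.
stats = prompt_length_stats(["abc", "abcde", "a"])
print(stats)
```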

Sample Example

Subset: default

{
  "input": [
    {
      "id": "9e0a5835",
      "content": "What should the world's smartest man, surrounded by corruption, greed, inequity, madness, inequality, an establishment who preached conspiracy theories and wild speculations over truth and an equally evil resistance funded by the mega rich, a ... [TRUNCATED] ... ad here. Include keyword kaleidoscope once in your response, keyword nebula twice in your response, keyword whisper three times in your response, keyword labyrinth five times in your response, and keyword paradox seven times in your response."
    }
  ],
  "target": "",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "key": "0",
    "prompt": "What should the world's smartest man, surrounded by corruption, greed, inequity, madness, inequality, an establishment who preached conspiracy theories and wild speculations over truth and an equally evil resistance funded by the mega rich, a ... [TRUNCATED] ... ad here. Include keyword kaleidoscope once in your response, keyword nebula twice in your response, keyword whisper three times in your response, keyword labyrinth five times in your response, and keyword paradox seven times in your response.",
    "instruction_id_list": [
      "count:keywords_multiple"
    ],
    "kwargs": [
      {
        "N": null,
        "capital_frequency": null,
        "capital_relation": null,
        "end_phrase": null,
        "first_word": null,
        "forbidden_words": null,
        "frequency": null,
        "keyword": null,
        "keyword1": "kaleidoscope",
        "keyword2": "nebula",
        "keyword3": "whisper",
        "keyword4": "labyrinth",
        "keyword5": "paradox",
        "keywords": null,
        "language": null,
        "let_frequency": null,
        "let_relation": null,
        "letter": null,
        "m": null,
        "max_words": null,
        "min_words": null,
        "n": null,
        "n_end": null,
        "n_start": null,
        "nth_paragraph": null,
        "num_bullets": null,
        "num_highlights": null,
        "num_paragraphs": null,
        "num_placeholders": null,
        "num_sections": null,
        "num_sentences": null,
        "num_words": null,
        "options": null,
        "percentage": null,
        "postscript_marker": null,
        "prompt_to_repeat": null,
        "reference_text": null,
        "relation": null,
        "section_spliter": null,
        "sep": null,
        "small_n": null,
        "word": null
      }
    ]
  }
}

Note: Some content was truncated for display.
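
In the record above, most kwargs entries are null because every instruction shares one wide schema; only the fields relevant to count:keywords_multiple are populated. The sketch below drops the null fields and checks the keyword counts stated in the prompt (once, twice, three, five, and seven times). The matching rules (case-insensitive, whole word, exact count) and all function names are assumptions made for illustration, not the benchmark's actual checker.

```python
import re
from typing import Dict, Optional

# Required occurrences for keyword1..keyword5, read off the prompt text.
EXPECTED_COUNTS = [1, 2, 3, 5, 7]


def active_kwargs(kwargs: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Keep only the arguments that are actually set for this instruction."""
    return {k: v for k, v in kwargs.items() if v is not None}


def count_keyword(response: str, keyword: str) -> int:
    """Count whole-word, case-insensitive occurrences (an assumed matching rule)."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE))


def check_keywords_multiple(response: str, kwargs: Dict[str, Optional[str]]) -> bool:
    """Hypothetical checker: each keyword must appear exactly the expected number of times."""
    active = active_kwargs(kwargs)
    keywords = [active[f"keyword{i}"] for i in range(1, 6)]
    return all(count_keyword(response, kw) == n for kw, n in zip(keywords, EXPECTED_COUNTS))


# Trimmed copy of the kwargs entry from the sample record.
sample_kwargs = {
    "N": None,
    "keyword": None,
    "keyword1": "kaleidoscope",
    "keyword2": "nebula",
    "keyword3": "whisper",
    "keyword4": "labyrinth",
    "keyword5": "paradox",
    "keywords": None,
}

# Synthetic response that repeats each keyword the required number of times.
good = " ".join(
    " ".join([sample_kwargs[f"keyword{i}"]] * n)
    for i, n in enumerate(EXPECTED_COUNTS, start=1)
)
print(check_keywords_multiple(good, sample_kwargs))
```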

Prompt Template

No prompt template defined.

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets ifbench \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['ifbench'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
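
Since the evaluation notes list emoji and syllapy as requirements, a quick pre-flight check before launching a run can help surface missing dependencies early. This is a sketch using only the standard library; the helper name is hypothetical, and it assumes the import names match the package names.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Packages named in the evaluation notes.
REQUIRED = ("emoji", "syllapy")


def missing_packages(names: Iterable[str] = REQUIRED) -> List[str]:
    """Return the top-level packages that cannot be located in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Install before evaluating: pip install " + " ".join(missing))
else:
    print("All required IFBench packages are importable.")
```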