AlpacaEval 2.0

Overview

AlpacaEval 2.0 is an evaluation framework for instruction-following language models. An LLM judge compares each model output against the output of a strong baseline (GPT-4 Turbo), and the framework reports the resulting win rate as a proxy for human preference.

Task Description

Task Type: Instruction-Following Evaluation (Pairwise Comparison)
Input: User instruction/question
Output: Model response compared against GPT-4 Turbo baseline
Metric: Win rate against baseline model
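
As a rough illustration of the metric, the sketch below aggregates per-instruction judge verdicts into a win rate. The label names (`model`, `baseline`, `tie`) and the choice to count a tie as half a win are assumptions made for this example. They are not taken from the benchmark implementation, which may aggregate differently.

```python
from collections import Counter


def win_rate(judgments):
    """Return the percentage of pairwise comparisons won by the evaluated model.

    `judgments` holds one label per instruction: "model" when the judge
    preferred the evaluated model, "baseline" when it preferred the baseline,
    and "tie" when it expressed no preference.
    Counting a tie as half a win is an assumption of this sketch.
    """
    if not judgments:
        raise ValueError("no judgments to aggregate")
    counts = Counter(judgments)
    wins = counts["model"] + 0.5 * counts["tie"]
    return 100.0 * wins / len(judgments)


example = ["model", "baseline", "model", "tie"]
print(win_rate(example))  # 62.5
```

A win rate of 50 would mean the judge prefers the evaluated model as often as the baseline.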

Key Features

Auto-annotator for scalable evaluation
Compares against GPT-4 Turbo baseline outputs
High correlation with human preferences
Cost-effective evaluation method
Tests general instruction-following capabilities

Evaluation Notes

Default configuration uses 0-shot evaluation
Uses LLM judge (default: gpt-4-1106-preview)
Baseline model: gpt-4-turbo outputs
Reports win rate metric
Note: Length-controlled win rate is not currently supported

Properties

Property	Value
Benchmark Name	`alpaca_eval`
Dataset ID	AI-ModelScope/alpaca_eval
Paper	N/A
Tags	`Arena`, `InstructionFollowing`
Metrics	`winrate`
Default Shots	0-shot
Evaluation Split	`eval`

Data Statistics

Metric	Value
Total Samples	805
Prompt Length (Mean)	164.92 chars
Prompt Length (Min/Max)	12 / 1917 chars
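
Statistics of this kind can be reproduced with a few lines of Python. The sketch below assumes the prompts are already available as a list of strings and measures length in characters. The helper name `prompt_length_stats` and the toy prompts are hypothetical and are not part of the dataset.

```python
from statistics import mean


def prompt_length_stats(prompts):
    """Summarize prompt lengths, measured in characters."""
    lengths = [len(p) for p in prompts]
    return {
        "total": len(lengths),
        "mean": round(mean(lengths), 2),
        "min": min(lengths),
        "max": max(lengths),
    }


# Toy prompts for illustration only; the real split has 805 samples.
toy_prompts = ["Hi there!", "Explain recursion.", "What is 2 + 2?"]
print(prompt_length_stats(toy_prompts))
```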

Sample Example

Subset: alpaca_eval_gpt4_baseline

{
  "input": [
    {
      "id": "95236545",
      "content": "What are the names of some famous actors that started their careers on Broadway?"
    }
  ],
  "target": "Several famous actors started their careers on Broadway before making it big in film and television. Here are a few notable examples:\n\n1. Sarah Jessica Parker - Before she was Carrie Bradshaw on \"Sex and the City,\" Sarah Jessica Parker was a  ... [TRUNCATED] ... f the many performers who have transitioned from the Broadway stage to broader fame in the entertainment industry. Broadway often serves as a proving ground for talent, and many actors continue to return to the stage throughout their careers.",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "generator": "gpt4_1106_preview",
    "dataset": "helpful_base"
  }
}

Note: Some content was truncated for display.
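
To make the record layout concrete, the sketch below pulls the instruction and the baseline answer out of a record shaped like the sample above. The `target` string is an abbreviated placeholder, and the helper name `to_pair` and its output keys are hypothetical. Only the field names read from the record come from the sample.

```python
import json

# Record shaped like the sample above; "target" is an abbreviated placeholder.
record_json = """
{
  "input": [
    {
      "id": "95236545",
      "content": "What are the names of some famous actors that started their careers on Broadway?"
    }
  ],
  "target": "Several famous actors started their careers on Broadway ...",
  "id": 0,
  "group_id": 0,
  "metadata": {"generator": "gpt4_1106_preview", "dataset": "helpful_base"}
}
"""


def to_pair(record):
    """Extract the fields needed for a pairwise comparison (hypothetical helper)."""
    return {
        "instruction": record["input"][0]["content"],
        "baseline_output": record["target"],
        "baseline_generator": record["metadata"]["generator"],
    }


pair = to_pair(json.loads(record_json))
print(pair["instruction"])
print(pair["baseline_generator"])  # gpt4_1106_preview
```

In this layout, `target` holds the baseline model's answer rather than a gold label, which is why the metric is a win rate rather than an accuracy.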

Prompt Template

{question}
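
The template passes the instruction through unchanged. A minimal sketch of applying it, assuming Python's `str.format` semantics for the `{question}` placeholder (the helper name `build_prompt` is hypothetical):

```python
PROMPT_TEMPLATE = "{question}"


def build_prompt(question):
    """Fill the template; the instruction is sent to the model as-is."""
    return PROMPT_TEMPLATE.format(question=question)


print(build_prompt("What are the names of some famous actors that started their careers on Broadway?"))
```

Because only the template string is parsed for placeholders, braces inside the instruction text itself are left untouched.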

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets alpaca_eval \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['alpaca_eval'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)