AIR-Bench-Chat#

Overview#

AIR-Bench Chat is the generative half of AIR-Bench (Audio InstRuction Benchmark, ACL 2024 main conference) — the first instruction-following benchmark for large audio-language models (LALMs), covering human speech, natural sounds and music. It contains roughly 2k open-ended audio QA pairs covering speech, sound, music and mixed-audio scenes; responses are graded by a GPT-4 judge against a reference answer.

Task Description#

  • Task Type: Open-ended audio question answering.

  • Input: An audio clip plus a free-form question.

  • Output: A textual answer evaluated against the reference response.

Categories (8 tasks → 5 reported categories)#

The 8 Chat tasks are aggregated by the official cal_score.py into five categories:

  • speech: speech_QA, speech_dialogue_QA

  • sound: sound_QA, sound_generation_QA

  • music: music_QA, music_generation_analysis_QA

  • speech_and_sound: speech_and_sound_QA

  • speech_and_music: speech_and_music_QA

The paper’s Mixed-audio = mean(speech_and_sound, speech_and_music).

Dataset Access#

  • The dataset is hosted on ModelScope: evalscope/AIR-Bench-Dataset. It uses an audiofolder + JSON metadata layout. evalscope downloads it lazily via modelscope.dataset_snapshot_download on first run; the full release is ~49 GB, so it is recommended to limit which tasks are pulled via extra_params.

  • If the dataset is already on disk, pass dataset_args={'air_bench_chat': {'local_path': '/path/to/AIR-Bench-Dataset'}}; the local root should contain Chat/.

Evaluation Protocol#

  • The judge LLM (default: GPT-4) receives the question, the textual audio description (meta_info from the dataset), the reference answer (answer_gt), and the model’s response. It outputs a single line with two integer scores in [1, 10].

  • To remove position bias, every sample is judged twice with the order of reference and prediction swapped, then averaged. This mirrors cal_score.py in the official repository — disable it via extra_params={'do_swap': False} to halve judge cost.

  • Reported metric gpt_score is the model’s mean judge score; win_rate records how often the model strictly beats the reference.

Warning

The official leaderboard uses gpt-4-0125-preview as the judge model. If that exact snapshot is unavailable, use an available GPT-4-class judge; absolute scores can drift versus the published numbers because the judge model changed.

Implementation Notes#

  • The judge model is selected via --judge-model-args; ensure the model id supports long contexts (meta_info may exceed 4k tokens for dialogue tasks).

  • Set extra_params={'tasks': [...]} to evaluate only specific Chat task names — useful for partial runs.

Properties#

Property

Value

Benchmark Name

air_bench_chat

Dataset ID

evalscope/AIR-Bench

Paper

Paper

Tags

Audio, InstructionFollowing, QA

Metrics

gpt_score, win_rate

Default Shots

0-shot

Evaluation Split

test

Data Statistics#

Metric

Value

Total Samples

2,200

Prompt Length (Mean)

83.89 chars

Prompt Length (Min/Max)

17 / 423 chars

Per-Subset Statistics:

Subset

Samples

Prompt Mean

Prompt Min

Prompt Max

speech_QA

400

64.33

23

148

speech_dialogue_QA

400

77.03

29

206

sound_QA

400

73.29

17

166

sound_generation_QA

100

222.52

130

423

music_QA

400

57.54

24

202

music_generation_analysis_QA

100

267.52

148

395

speech_and_sound_QA

200

63.98

25

127

speech_and_music_QA

200

69.37

32

127

Audio Statistics:

Metric

Value

Total Audio Files

2,200

Audio per Sample

min: 1, max: 1, mean: 1

Formats

mp3, wav

Sample Example#

Subset: speech_QA

{
  "input": [
    {
      "id": "5781ee73",
      "content": [
        {
          "audio": "/root/.cache/modelscope/hub/datasets/evalscope/AIR-Bench-Dataset/Chat/speech_QA_iemocap/Ses01F_script01_1_M025.wav",
          "format": "wav"
        },
        {
          "text": "Who is the speaker addressing at the end of the speech?"
        }
      ]
    }
  ],
  "target": "The speaker is addressing Mom at the end of the speech.",
  "id": 0,
  "group_id": 0,
  "subset_key": "speech_QA",
  "metadata": {
    "uniq_id": 400,
    "task_name": "speech_QA",
    "dataset_name": "iemocap",
    "category": "speech",
    "meta_info": "{'emotion': 'neutral', 'gender': 'male', 'transcription': \"And then we'll thrash it out with father. Okay Mom? Don't avoid me.\"}",
    "question": "Who is the speaker addressing at the end of the speech?"
  }
}

Prompt Template#

Prompt Template:

{question}

Extra Parameters#

Parameter

Type

Default

Description

tasks

list

None

Optional list of Chat task names to evaluate (subset of [‘music_QA’, ‘music_generation_analysis_QA’, ‘sound_QA’, ‘sound_generation_QA’, ‘speech_QA’, ‘speech_and_music_QA’, ‘speech_and_sound_QA’, ‘speech_dialogue_QA’]). Defaults to all tasks.

do_swap

bool

True

When True (default), each sample is judged twice with the order of reference vs. prediction swapped, then scores are averaged. Disable to halve judge cost at the price of position bias.

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets air_bench_chat \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['air_bench_chat'],
    dataset_args={
        'air_bench_chat': {
            # subset_list: ['speech_QA', 'speech_dialogue_QA', 'sound_QA']  # optional, evaluate specific subsets
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)