AIR-Bench-Chat#
Overview#
AIR-Bench Chat is the generative half of AIR-Bench (Audio InstRuction Benchmark, ACL 2024 main conference) — the first instruction-following benchmark for large audio-language models (LALMs), covering human speech, natural sounds and music. It contains roughly 2k open-ended audio QA pairs covering speech, sound, music and mixed-audio scenes; responses are graded by a GPT-4 judge against a reference answer.
Task Description#
Task Type: Open-ended audio question answering.
Input: An audio clip plus a free-form question.
Output: A textual answer evaluated against the reference response.
Categories (8 tasks → 5 reported categories)#
The 8 Chat tasks are aggregated by the official cal_score.py into five categories:
speech:speech_QA,speech_dialogue_QAsound:sound_QA,sound_generation_QAmusic:music_QA,music_generation_analysis_QAspeech_and_sound:speech_and_sound_QAspeech_and_music:speech_and_music_QA
The paper’s Mixed-audio = mean(speech_and_sound, speech_and_music).
Dataset Access#
The dataset is hosted on ModelScope:
evalscope/AIR-Bench-Dataset. It uses an audiofolder + JSON metadata layout. evalscope downloads it lazily viamodelscope.dataset_snapshot_downloadon first run; the full release is ~49 GB, so it is recommended to limit which tasks are pulled viaextra_params.If the dataset is already on disk, pass
dataset_args={'air_bench_chat': {'local_path': '/path/to/AIR-Bench-Dataset'}}; the local root should containChat/.
Evaluation Protocol#
The judge LLM (default: GPT-4) receives the question, the textual audio description (
meta_infofrom the dataset), the reference answer (answer_gt), and the model’s response. It outputs a single line with two integer scores in[1, 10].To remove position bias, every sample is judged twice with the order of reference and prediction swapped, then averaged. This mirrors
cal_score.pyin the official repository — disable it viaextra_params={'do_swap': False}to halve judge cost.Reported metric
gpt_scoreis the model’s mean judge score;win_raterecords how often the model strictly beats the reference.
Warning
The official leaderboard uses gpt-4-0125-preview as the judge model. If that exact snapshot is unavailable, use an available GPT-4-class judge; absolute scores can drift versus the published numbers because the judge model changed.
Implementation Notes#
The judge model is selected via
--judge-model-args; ensure the model id supports long contexts (meta_infomay exceed 4k tokens for dialogue tasks).Set
extra_params={'tasks': [...]}to evaluate only specific Chat task names — useful for partial runs.
Properties#
Property |
Value |
|---|---|
Benchmark Name |
|
Dataset ID |
|
Paper |
|
Tags |
|
Metrics |
|
Default Shots |
0-shot |
Evaluation Split |
|
Data Statistics#
Metric |
Value |
|---|---|
Total Samples |
2,200 |
Prompt Length (Mean) |
83.89 chars |
Prompt Length (Min/Max) |
17 / 423 chars |
Per-Subset Statistics:
Subset |
Samples |
Prompt Mean |
Prompt Min |
Prompt Max |
|---|---|---|---|---|
|
400 |
64.33 |
23 |
148 |
|
400 |
77.03 |
29 |
206 |
|
400 |
73.29 |
17 |
166 |
|
100 |
222.52 |
130 |
423 |
|
400 |
57.54 |
24 |
202 |
|
100 |
267.52 |
148 |
395 |
|
200 |
63.98 |
25 |
127 |
|
200 |
69.37 |
32 |
127 |
Audio Statistics:
Metric |
Value |
|---|---|
Total Audio Files |
2,200 |
Audio per Sample |
min: 1, max: 1, mean: 1 |
Formats |
mp3, wav |
Sample Example#
Subset: speech_QA
{
"input": [
{
"id": "5781ee73",
"content": [
{
"audio": "/root/.cache/modelscope/hub/datasets/evalscope/AIR-Bench-Dataset/Chat/speech_QA_iemocap/Ses01F_script01_1_M025.wav",
"format": "wav"
},
{
"text": "Who is the speaker addressing at the end of the speech?"
}
]
}
],
"target": "The speaker is addressing Mom at the end of the speech.",
"id": 0,
"group_id": 0,
"subset_key": "speech_QA",
"metadata": {
"uniq_id": 400,
"task_name": "speech_QA",
"dataset_name": "iemocap",
"category": "speech",
"meta_info": "{'emotion': 'neutral', 'gender': 'male', 'transcription': \"And then we'll thrash it out with father. Okay Mom? Don't avoid me.\"}",
"question": "Who is the speaker addressing at the end of the speech?"
}
}
Prompt Template#
Prompt Template:
{question}
Extra Parameters#
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
|
Optional list of Chat task names to evaluate (subset of [‘music_QA’, ‘music_generation_analysis_QA’, ‘sound_QA’, ‘sound_generation_QA’, ‘speech_QA’, ‘speech_and_music_QA’, ‘speech_and_sound_QA’, ‘speech_dialogue_QA’]). Defaults to all tasks. |
|
|
|
When True (default), each sample is judged twice with the order of reference vs. prediction swapped, then scores are averaged. Disable to halve judge cost at the price of position bias. |
Usage#
Using CLI#
evalscope eval \
--model YOUR_MODEL \
--api-url OPENAI_API_COMPAT_URL \
--api-key EMPTY_TOKEN \
--datasets air_bench_chat \
--limit 10 # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
model='YOUR_MODEL',
api_url='OPENAI_API_COMPAT_URL',
api_key='EMPTY_TOKEN',
datasets=['air_bench_chat'],
dataset_args={
'air_bench_chat': {
# subset_list: ['speech_QA', 'speech_dialogue_QA', 'sound_QA'] # optional, evaluate specific subsets
# extra_params: {} # uses default extra parameters
}
},
limit=10, # Remove this line for formal evaluation
)
run_task(task_cfg=task_cfg)