AIR-Bench-Foundation#
Overview#
AIR-Bench Foundation is the discriminative half of AIR-Bench (Audio InstRuction Benchmark, ACL 2024 main conference), the first instruction-following benchmark for large audio-language models (LALMs), covering human speech, natural sounds, and music. The Foundation track contains roughly 25k single-choice questions spanning 19 logical tasks across three audio categories.
Task Description#
Task Type: Single-choice question answering grounded on an audio clip.
Input: One audio clip + a question with up to four candidate answers (A/B/C/D).
Output: A single letter chosen from the provided options.
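Because the expected output is a single option letter, a free-form model response usually has to be reduced to that letter before scoring. The sketch below shows one way to do this; `extract_choice` is a hypothetical helper written for illustration, and evalscope's own answer parsing may differ.

```python
import re
from typing import Optional

# Hypothetical helper for illustration; not evalscope's actual parser.
_LETTER = re.compile(r"\b([ABCD])\b")


def extract_choice(response: str, valid: str = "ABCD") -> Optional[str]:
    """Return the first standalone option letter found in the response, or None."""
    for match in _LETTER.finditer(response.strip()):
        letter = match.group(1)
        if letter in valid:
            return letter
    return None
```

This heuristic treats any standalone capital `A` as an answer, so a response that begins with the English article "A" would be misread; a stricter parser might require the letter at the start of the response.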
Categories (19 tasks / 25 source-dataset subsets)#
Speech (11 dirs / 9 tasks): speech grounding, language ID, gender, emotion (IEMOCAP+MELD), age, speech entity recognition, intent classification, speaker counting, synthesized-voice detection.
Sound (6 dirs / 4 tasks): audio grounding, vocal sound classification, acoustic scene classification (CochlScene+TUT2017), sound QA (avqa+clothoaqa).
Music (8 dirs / 6 tasks): instruments, genre, MIDI pitch, MIDI velocity, music QA, music emotion.
Prompt Template (matches official Inference_Foundation.py)#
Choose the most suitable answer from options A, B, C, and D to respond the question in next line, you may only choose A or B or C or D.
{question}
A. {choice_a}
B. {choice_b}
C. {choice_c}
D. {choice_d}
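As a minimal sketch, the template can be filled in as follows. `build_prompt` is a hypothetical name, and skipping letters that have no candidate answer (for questions with fewer than four options) is an assumption rather than confirmed behaviour of the official script.

```python
from typing import Mapping

INSTRUCTION = (
    "Choose the most suitable answer from options A, B, C, and D to respond "
    "the question in next line, you may only choose A or B or C or D."
)


def build_prompt(question: str, choices: Mapping[str, str]) -> str:
    """Join the instruction, the question and the lettered options with newlines."""
    lines = [INSTRUCTION, question]
    for letter in "ABCD":
        # Assumption: options that are absent are simply left out.
        if letter in choices:
            lines.append(f"{letter}. {choices[letter]}")
    return "\n".join(lines)
```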
Dataset Access#
The dataset is hosted on ModelScope: `evalscope/AIR-Bench-Dataset`. It uses an audiofolder + JSON metadata layout. evalscope downloads it lazily via `modelscope.dataset_snapshot_download` on first run; the full release is ~49 GB, so it is recommended to limit which subsets are pulled via `extra_params`.
If the dataset is already on disk, pass `dataset_args={'air_bench_foundation': {'local_path': '/path/to/AIR-Bench-Dataset'}}`; the local root should contain `Foundation/`.
Some Foundation samples are FLAC. For OpenAI-compatible audio input, evalscope converts them to cached WAV files, which requires either `soundfile` (`pip install "evalscope[air_bench]"`) or a working `ffmpeg` binary.
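The FLAC-to-WAV caching step can be pictured roughly as below. This is a sketch under stated assumptions, not evalscope's implementation: the function names, the cache file naming scheme, and the order in which `soundfile` and `ffmpeg` are tried are all hypothetical.

```python
import hashlib
import shutil
import subprocess
from pathlib import Path


def cached_wav_path(src: Path, cache_dir: Path) -> Path:
    """Derive a stable cache filename from the source path (hypothetical scheme)."""
    digest = hashlib.sha1(str(src).encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{src.stem}-{digest}.wav"


def flac_to_wav(src: Path, cache_dir: Path) -> Path:
    """Convert src to WAV once and reuse the cached file on later calls."""
    dst = cached_wav_path(src, cache_dir)
    if dst.exists():
        return dst  # cache hit: no conversion needed
    cache_dir.mkdir(parents=True, exist_ok=True)
    try:
        import soundfile as sf  # optional dependency

        data, sample_rate = sf.read(str(src))
        sf.write(str(dst), data, sample_rate)
    except ImportError:
        # Fall back to the ffmpeg binary when soundfile is not installed.
        if shutil.which("ffmpeg") is None:
            raise RuntimeError("need either soundfile or an ffmpeg binary")
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(src), str(dst)],
            check=True,
            capture_output=True,
        )
    return dst
```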
Evaluation Notes#
Metric: accuracy (per source-dataset subset, plus per-category aggregation).
Default prompt follows the official `Inference_Foundation.py` formatting so existing AIR-Bench leaderboard numbers can be reproduced.
Set `extra_params={'subsets': [...]}` to limit evaluation to a subset of the 25 source directories, which is useful for partial downloads.
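The per-subset and per-category accuracy described above can be sketched as follows. The record layout and the `accuracy_report` name are hypothetical, and the category score here is pooled over all samples in the category (micro-average); whether evalscope instead averages the subset scores is not stated in this page.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (category, subset, prediction, target)
Record = Tuple[str, str, str, str]


def accuracy_report(records: Iterable[Record]) -> Dict[str, Dict[str, float]]:
    """Compute accuracy per subset and per category from scored records."""
    subset_hits = defaultdict(lambda: [0, 0])    # subset -> [correct, total]
    category_hits = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for category, subset, prediction, target in records:
        hit = int(prediction == target)
        for bucket in (subset_hits[subset], category_hits[category]):
            bucket[0] += hit
            bucket[1] += 1
    return {
        "subset": {k: c / n for k, (c, n) in subset_hits.items()},
        # Assumption: category accuracy pools all samples of that category.
        "category": {k: c / n for k, (c, n) in category_hits.items()},
    }
```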
Properties#
| Property | Value |
|---|---|
| Benchmark Name | `air_bench_foundation` |
| Dataset ID | `evalscope/AIR-Bench-Dataset` |
| Paper | |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 21,426 |
| Prompt Length (Mean) | 236.68 chars |
| Prompt Length (Min/Max) | 179 / 321 chars |
Per-Subset Statistics:
| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|---|---|---|---|---|
| | 1,000 | 291.98 | 278 | 305 |
| | 1,000 | 227.13 | 218 | 238 |
| | 1,000 | 233.21 | 218 | 250 |
| | 780 | 226.19 | 213 | 241 |
| | 1,000 | 226.42 | 213 | 241 |
| | 662 | 268.52 | 232 | 295 |
| | 314 | 208.24 | 194 | 221 |
| | 1,000 | 253.14 | 226 | 316 |
| | 981 | 253.92 | 230 | 282 |
| | 495 | 207.12 | 191 | 232 |
| | 1,000 | 224.79 | 203 | 242 |
| | 1,000 | 240.97 | 213 | 278 |
| | 1,000 | 241.72 | 207 | 284 |
| | 896 | 273.74 | 249 | 321 |
| | 1,000 | 227.21 | 193 | 298 |
| | 1,000 | 199.89 | 179 | 262 |
| | 985 | 232.62 | 210 | 253 |
| | 814 | 208.7 | 185 | 238 |
| | 1,000 | 223.84 | 200 | 248 |
| | 1,000 | 224.59 | 201 | 250 |
| | 1,000 | 236.52 | 218 | 262 |
| | 996 | 228.12 | 216 | 247 |
| | 493 | 253.88 | 243 | 264 |
| | 484 | 270.6 | 259 | 279 |
| | 526 | 229.67 | 210 | 248 |
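As a quick consistency check, the per-subset sample counts listed above can be summed and compared with the reported total. The list below is copied from the Samples column in row order.

```python
# Sample counts copied from the per-subset statistics, in row order.
subset_counts = [
    1000, 1000, 1000, 780, 1000, 662, 314, 1000, 981, 495,
    1000, 1000, 1000, 896, 1000, 1000, 985, 814, 1000, 1000,
    1000, 996, 493, 484, 526,
]

num_subsets = len(subset_counts)
total_samples = sum(subset_counts)
print(num_subsets, total_samples)  # prints "25 21426"
```

The 25 rows correspond to the 25 source-dataset subsets, and their sum agrees with the Total Samples figure of 21,426.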
Audio Statistics:
| Metric | Value |
|---|---|
| Total Audio Files | 21,426 |
| Audio per Sample | min: 1, max: 1, mean: 1 |
| Formats | mp3, wav |
Sample Example#
Subset: Speaker_Age_Prediction_common_voice_13.0_en
{
  "input": [
    {
      "id": "544443c0",
      "content": [
        {
          "audio": "[BASE64_AUDIO: mp3, ~25.9KB]",
          "format": "mp3"
        },
        {
          "text": "Choose the most suitable answer from options A, B, C, and D to respond the question in next line, you may only choose A or B or C or D.\nWhich age range do you believe best matches the speaker's voice?\nA. teens to twenties\nB. thirties to fourties\nC. fifties to sixties\nD. seventies to eighties"
        }
      ]
    }
  ],
  "target": "B",
  "id": 0,
  "group_id": 0,
  "subset_key": "Speaker_Age_Prediction_common_voice_13.0_en",
  "metadata": {
    "uniq_id": 5973,
    "task_name": "Speaker_Age_Prediction",
    "dataset_name": "common_voice_13.0_en",
    "category": "speech",
    "answer_gt_text": "thirties to fourties",
    "choices": {
      "A": "teens to twenties",
      "B": "thirties to fourties",
      "C": "fifties to sixties",
      "D": "seventies to eighties"
    }
  }
}
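The fields in this record relate to each other in a simple way, which the sketch below checks using a trimmed copy of the sample. That `subset_key` equals `task_name` and `dataset_name` joined by an underscore is an observation from this one sample, not a documented guarantee.

```python
import json

# Trimmed copy of the sample record above (audio and prompt text omitted).
sample_json = """
{
  "target": "B",
  "subset_key": "Speaker_Age_Prediction_common_voice_13.0_en",
  "metadata": {
    "task_name": "Speaker_Age_Prediction",
    "dataset_name": "common_voice_13.0_en",
    "category": "speech",
    "answer_gt_text": "thirties to fourties",
    "choices": {
      "A": "teens to twenties",
      "B": "thirties to fourties",
      "C": "fifties to sixties",
      "D": "seventies to eighties"
    }
  }
}
"""

sample = json.loads(sample_json)
meta = sample["metadata"]

# The target letter indexes into the choices to give the ground-truth text.
answer_text = meta["choices"][sample["target"]]

# Observed pattern in this sample: subset key = task name + "_" + dataset name.
derived_key = f'{meta["task_name"]}_{meta["dataset_name"]}'
```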
Prompt Template#
Prompt Template:
Choose the most suitable answer from options A, B, C, and D to respond the question in next line, you may only choose A or B or C or D.
Extra Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| `subsets` | | | Optional list of Foundation source-dataset directories to evaluate. Defaults to all 25 directories. Useful when only a subset has been downloaded locally. |
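A `dataset_args` fragment that restricts evaluation to one source directory might look like the sketch below. It assumes that `local_path` and `extra_params` can be combined in the same entry and that directory names match the subset names shown in this page; neither is confirmed here.

```python
# Sketch of a dataset_args entry; pass it to TaskConfig(dataset_args=...).
dataset_args = {
    "air_bench_foundation": {
        # Optional: reuse a local copy whose root contains Foundation/.
        "local_path": "/path/to/AIR-Bench-Dataset",
        "extra_params": {
            # Assumption: directory names equal the subset names.
            "subsets": ["Speaker_Age_Prediction_common_voice_13.0_en"],
        },
    }
}
```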
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets air_bench_foundation \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['air_bench_foundation'],
    dataset_args={
        'air_bench_foundation': {
            # Optional: evaluate specific subsets only.
            # 'subset_list': ['Speaker_Age_Prediction_common_voice_13.0_en', 'Speaker_Emotion_Recontion_iemocap', 'Speaker_Emotion_Recontion_meld'],
            # Optional: leave unset to use the default extra parameters.
            # 'extra_params': {},
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)