AIR-Bench-Foundation#

Overview#

AIR-Bench Foundation is the discriminative half of AIR-Bench (Audio InstRuction Benchmark, ACL 2024 main conference), the first instruction-following benchmark for large audio-language models (LALMs), covering human speech, natural sounds, and music. The Foundation track spans 19 logical tasks across three audio categories; the release evaluated here contains roughly 21k single-choice questions (21,426 samples).

Task Description#

  • Task Type: Single-choice question answering grounded in an audio clip.

  • Input: One audio clip + a question with up to four candidate answers (A/B/C/D).

  • Output: A single letter chosen from the provided options.
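Models do not always reply with a bare letter, so a scorer usually has to normalize the reply first. The following is a minimal sketch of one way to do that; `extract_choice` is a hypothetical helper written for illustration and is not evalscope's own answer parser.

```python
import re
from typing import Optional


def extract_choice(reply: str, valid: str = "ABCD") -> Optional[str]:
    """Return the first standalone option letter found in a reply, or None.

    Hypothetical helper: a simple regex heuristic, not evalscope's parser.
    """
    match = re.search(rf"\b([{valid}])\b", reply.strip())
    return match.group(1) if match else None


print(extract_choice("B"))
print(extract_choice("The answer is C."))
print(extract_choice("no idea"))
```

This heuristic is deliberately simple: a reply that begins with the English article "A" would be read as option A, so a stricter pattern may be preferable in practice.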

Categories (19 tasks / 25 source-dataset subsets)#

  • Speech (11 dirs / 9 tasks): speech grounding, language ID, gender, emotion (IEMOCAP+MELD), age, speech entity recognition, intent classification, speaker counting, synthesized-voice detection.

  • Sound (6 dirs / 4 tasks): audio grounding, vocal sound classification, acoustic scene classification (CochlScene+TUT2017), sound QA (avqa+clothoaqa).

  • Music (8 dirs / 6 tasks): instruments (MTJ-Jamendo+nsynth), genre (MTJ-Jamendo+fma), MIDI pitch, MIDI velocity, music QA, music mood.

Prompt Template (matches official Inference_Foundation.py)#

Choose the most suitable answer from options A, B, C, and D to respond the question in next line, you may only choose A or B or C or D.
{question}
A. {choice_a}
B. {choice_b}
C. {choice_c}
D. {choice_d}
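As an illustration, the template above can be filled with plain string formatting. This is a sketch only: `build_prompt` is a hypothetical name, and emitting one line per supplied choice when a question has fewer than four options is an assumption, not confirmed behavior of Inference_Foundation.py.

```python
from typing import Sequence

INSTRUCTION = (
    "Choose the most suitable answer from options A, B, C, and D to respond "
    "the question in next line, you may only choose A or B or C or D."
)


def build_prompt(question: str, choices: Sequence[str]) -> str:
    """Join the instruction, the question and the lettered options, one per line."""
    if not 1 <= len(choices) <= 4:
        raise ValueError("expected between one and four choices")
    lines = [INSTRUCTION, question]
    # Assumption: options are lettered in order; unused letters are omitted.
    lines += [f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)]
    return "\n".join(lines)


prompt = build_prompt(
    "Which age range do you believe best matches the speaker's voice?",
    ["teens to twenties", "thirties to fourties", "fifties to sixties", "seventies to eighties"],
)
print(prompt)
```

The instruction wording is kept verbatim, including its grammar, because changing it would change the prompt the model sees.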

Dataset Access#

  • The dataset is hosted on ModelScope: evalscope/AIR-Bench-Dataset. It uses an audiofolder + JSON metadata layout. evalscope downloads it lazily via modelscope.dataset_snapshot_download on first run; the full release is ~49 GB, so consider limiting which subsets are pulled via extra_params={'subsets': [...]}.

  • If the dataset is already on disk, pass dataset_args={'air_bench_foundation': {'local_path': '/path/to/AIR-Bench-Dataset'}}; the local root should contain Foundation/.

  • Some Foundation samples are FLAC. For OpenAI-compatible audio input, evalscope converts them to cached WAV files, which requires either soundfile (pip install "evalscope[air_bench]") or a working ffmpeg binary.
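When pointing evalscope at a local copy, it can help to confirm the expected layout before starting a run. The sketch below builds the `dataset_args` value described above and checks for the Foundation/ directory; `local_dataset_args` is a hypothetical helper, not part of evalscope.

```python
from pathlib import Path


def local_dataset_args(root: str) -> dict:
    """Build dataset_args for a local copy after checking that Foundation/ exists.

    Hypothetical helper for illustration; only the returned dict shape
    comes from the documentation above.
    """
    root_path = Path(root)
    if not (root_path / "Foundation").is_dir():
        raise FileNotFoundError(f"{root_path} does not contain a Foundation/ directory")
    return {"air_bench_foundation": {"local_path": str(root_path)}}
```

The returned dict can then be passed as the `dataset_args` argument of `TaskConfig`.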

Evaluation Notes#

  • Metric: accuracy (per source-dataset subset, plus per-category aggregation).

  • The default prompt follows the official Inference_Foundation.py formatting, so existing AIR-Bench leaderboard numbers can be reproduced.

  • Set extra_params={'subsets': [...]} to evaluate only some of the 25 source directories, which is useful for partial downloads.
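To make the metric concrete, here is a minimal sketch of accuracy computed per subset and per category. It assumes that category accuracy is pooled over all samples in the category (rather than averaged over subset accuracies); the aggregation evalscope actually uses is not stated here, and `accuracy_report` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (subset, category, predicted letter, target letter)
Record = Tuple[str, str, str, str]


def accuracy_report(records: Iterable[Record]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (per-subset accuracy, per-category accuracy) for scored records."""
    subset_hits, subset_total = defaultdict(int), defaultdict(int)
    category_hits, category_total = defaultdict(int), defaultdict(int)
    for subset, category, predicted, target in records:
        correct = int(predicted == target)
        subset_hits[subset] += correct
        subset_total[subset] += 1
        # Assumption: category accuracy pools samples across its subsets.
        category_hits[category] += correct
        category_total[category] += 1
    per_subset = {name: subset_hits[name] / subset_total[name] for name in subset_total}
    per_category = {name: category_hits[name] / category_total[name] for name in category_total}
    return per_subset, per_category


per_subset, per_category = accuracy_report([
    ("Sound_AQA_avqa", "sound", "A", "A"),
    ("Sound_AQA_avqa", "sound", "B", "A"),
    ("Sound_AQA_clothoaqa", "sound", "C", "C"),
    ("Music_AQA_music_avqa", "music", "D", "D"),
])
print(per_subset)
print(per_category)
```

Because subset sizes differ (314 to 1,000 samples), pooled and subset-averaged category scores will generally not coincide, so it is worth checking which one a reported number uses.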

Properties#

| Property | Value |
|---|---|
| Benchmark Name | air_bench_foundation |
| Dataset ID | evalscope/AIR-Bench |
| Paper | Paper |
| Tags | Audio, Knowledge, MCQ |
| Metrics | acc |
| Default Shots | 0-shot |
| Evaluation Split | test |

Data Statistics#

| Metric | Value |
|---|---|
| Total Samples | 21,426 |
| Prompt Length (Mean) | 236.68 chars |
| Prompt Length (Min/Max) | 179 / 321 chars |

Per-Subset Statistics:

| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|---|---|---|---|---|
| Speaker_Age_Prediction_common_voice_13.0_en | 1,000 | 291.98 | 278 | 305 |
| Speaker_Emotion_Recontion_iemocap | 1,000 | 227.13 | 218 | 238 |
| Speaker_Emotion_Recontion_meld | 1,000 | 233.21 | 218 | 250 |
| Speaker_Gender_Recognition_common_voice_13_en | 780 | 226.19 | 213 | 241 |
| Speaker_Gender_Recognition_meld | 1,000 | 226.42 | 213 | 241 |
| Speaker_Intent_Classification_slurp | 662 | 268.52 | 232 | 295 |
| Speaker_Number_Verification_voxceleb1 | 314 | 208.24 | 194 | 221 |
| Speech_Entity_Reconition_slurp | 1,000 | 253.14 | 226 | 316 |
| Speech_Grounding_librispeech | 981 | 253.92 | 230 | 282 |
| Spoken_Language_Identification_covost2 | 495 | 207.12 | 191 | 232 |
| Synthesized_Voice_Detection_fake_or_real | 1,000 | 224.79 | 203 | 242 |
| Acoustic_Scene_Classification_CochlScene | 1,000 | 240.97 | 213 | 278 |
| Acoustic_Scene_Classification_TUT2017 | 1,000 | 241.72 | 207 | 284 |
| Audio_Grounding_AudioGrounding | 896 | 273.74 | 249 | 321 |
| Sound_AQA_avqa | 1,000 | 227.21 | 193 | 298 |
| Sound_AQA_clothoaqa | 1,000 | 199.89 | 179 | 262 |
| vocal_sound_classification_VocalSound | 985 | 232.62 | 210 | 253 |
| Music_AQA_music_avqa | 814 | 208.7 | 185 | 238 |
| Music_Genre_Recognition_MTJ-Jamendo | 1,000 | 223.84 | 200 | 248 |
| Music_Genre_Recognition_fma | 1,000 | 224.59 | 201 | 250 |
| Music_Instruments_Classfication_MTJ-Jamendo | 1,000 | 236.52 | 218 | 262 |
| Music_Instruments_Classfication_nsynth | 996 | 228.12 | 216 | 247 |
| Music_Midi_Pitch_Analysis_nsynth | 493 | 253.88 | 243 | 264 |
| Music_Midi_Velocity_Analysis_nsynth | 484 | 270.6 | 259 | 279 |
| Music_Mood_Recognition_MTJ-Jamendo | 526 | 229.67 | 210 | 248 |
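The per-subset counts can be cross-checked against the reported total with a few lines of arithmetic. The grouping of subsets into speech, sound, and music below is inferred from the Categories section and the subset names; it is an assumption rather than something read from the dataset metadata.

```python
# Sample counts copied from the per-subset statistics; category grouping is inferred.
SAMPLES = {
    "speech": {
        "Speaker_Age_Prediction_common_voice_13.0_en": 1000,
        "Speaker_Emotion_Recontion_iemocap": 1000,
        "Speaker_Emotion_Recontion_meld": 1000,
        "Speaker_Gender_Recognition_common_voice_13_en": 780,
        "Speaker_Gender_Recognition_meld": 1000,
        "Speaker_Intent_Classification_slurp": 662,
        "Speaker_Number_Verification_voxceleb1": 314,
        "Speech_Entity_Reconition_slurp": 1000,
        "Speech_Grounding_librispeech": 981,
        "Spoken_Language_Identification_covost2": 495,
        "Synthesized_Voice_Detection_fake_or_real": 1000,
    },
    "sound": {
        "Acoustic_Scene_Classification_CochlScene": 1000,
        "Acoustic_Scene_Classification_TUT2017": 1000,
        "Audio_Grounding_AudioGrounding": 896,
        "Sound_AQA_avqa": 1000,
        "Sound_AQA_clothoaqa": 1000,
        "vocal_sound_classification_VocalSound": 985,
    },
    "music": {
        "Music_AQA_music_avqa": 814,
        "Music_Genre_Recognition_MTJ-Jamendo": 1000,
        "Music_Genre_Recognition_fma": 1000,
        "Music_Instruments_Classfication_MTJ-Jamendo": 1000,
        "Music_Instruments_Classfication_nsynth": 996,
        "Music_Midi_Pitch_Analysis_nsynth": 493,
        "Music_Midi_Velocity_Analysis_nsynth": 484,
        "Music_Mood_Recognition_MTJ-Jamendo": 526,
    },
}

category_totals = {category: sum(counts.values()) for category, counts in SAMPLES.items()}
total = sum(category_totals.values())

print(category_totals)  # {'speech': 9232, 'sound': 5881, 'music': 6313}
print(total)            # 21426
```

Under this grouping, speech accounts for roughly 43% of all samples, which matters when comparing pooled overall accuracy across models.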

Audio Statistics:

| Metric | Value |
|---|---|
| Total Audio Files | 21,426 |
| Audio per Sample | min: 1, max: 1, mean: 1 |
| Formats | mp3, wav |

Sample Example#

Subset: Speaker_Age_Prediction_common_voice_13.0_en

{
  "input": [
    {
      "id": "544443c0",
      "content": [
        {
          "audio": "[BASE64_AUDIO: mp3, ~25.9KB]",
          "format": "mp3"
        },
        {
          "text": "Choose the most suitable answer from options A, B, C, and D to respond the question in next line, you may only choose A or B or C or D.\nWhich age range do you believe best matches the speaker's voice?\nA. teens to twenties\nB. thirties to fourties\nC. fifties to sixties\nD. seventies to eighties"
        }
      ]
    }
  ],
  "target": "B",
  "id": 0,
  "group_id": 0,
  "subset_key": "Speaker_Age_Prediction_common_voice_13.0_en",
  "metadata": {
    "uniq_id": 5973,
    "task_name": "Speaker_Age_Prediction",
    "dataset_name": "common_voice_13.0_en",
    "category": "speech",
    "answer_gt_text": "thirties to fourties",
    "choices": {
      "A": "teens to twenties",
      "B": "thirties to fourties",
      "C": "fifties to sixties",
      "D": "seventies to eighties"
    }
  }
}
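The record above stores the gold answer twice: as a letter in `target` and as text in `metadata.answer_gt_text`. A short sketch of resolving the letter to its text follows, using a trimmed copy of the sample; `target_text` is a hypothetical helper name.

```python
# Trimmed copy of the sample record above (audio payload and prompt omitted).
sample = {
    "target": "B",
    "subset_key": "Speaker_Age_Prediction_common_voice_13.0_en",
    "metadata": {
        "task_name": "Speaker_Age_Prediction",
        "dataset_name": "common_voice_13.0_en",
        "category": "speech",
        "answer_gt_text": "thirties to fourties",
        "choices": {
            "A": "teens to twenties",
            "B": "thirties to fourties",
            "C": "fifties to sixties",
            "D": "seventies to eighties",
        },
    },
}


def target_text(record: dict) -> str:
    """Look up the answer text that the target letter refers to."""
    return record["metadata"]["choices"][record["target"]]


# The letter and the stored ground-truth text should agree.
assert target_text(sample) == sample["metadata"]["answer_gt_text"]
print(target_text(sample))  # thirties to fourties
```

In this sample, `subset_key` is the `task_name` and `dataset_name` joined by an underscore.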

Prompt Template#

Choose the most suitable answer from options A, B, C, and D to respond the question in next line, you may only choose A or B or C or D.

Extra Parameters#

| Parameter | Type | Default | Description |
|---|---|---|---|
| subsets | list | None | Optional list of Foundation source-dataset directories to evaluate. Defaults to all 25 directories. Useful when only a subset has been downloaded locally. |
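As a sketch of how `subsets` might be supplied, the dict below restricts evaluation to the two emotion-recognition directories. Nesting `extra_params` under the benchmark's `dataset_args` entry follows the commented example in the Usage section; treat the exact shape as an assumption to verify against your evalscope version.

```python
# Directory names are spelled exactly as in the per-subset statistics
# (including "Recontion"), since they must match the dataset's folders.
EMOTION_SUBSETS = [
    "Speaker_Emotion_Recontion_iemocap",
    "Speaker_Emotion_Recontion_meld",
]

dataset_args = {
    "air_bench_foundation": {
        "extra_params": {"subsets": EMOTION_SUBSETS},
    }
}

print(dataset_args)
```

This value would be passed as the `dataset_args` argument of `TaskConfig`.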

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets air_bench_foundation \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['air_bench_foundation'],
    dataset_args={
        'air_bench_foundation': {
            # 'subset_list': ['Speaker_Age_Prediction_common_voice_13.0_en', 'Speaker_Emotion_Recontion_iemocap', 'Speaker_Emotion_Recontion_meld'],  # optional, evaluate specific subsets
            # 'extra_params': {},  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)