Supported Datasets

1. Natively Supported Datasets

Tip

The framework currently supports the datasets listed below. If the dataset you need is not listed, please submit an issue. Alternatively, use the OpenCompass backend for evaluation, or the VLMEvalKit backend for multi-modal model evaluation.

| Dataset Name | Link | Status | Note |
|---|---|---|---|
| mmlu | mmlu | Active | |
| ceval | ceval | Active | |
| gsm8k | gsm8k | Active | |
| arc | arc | Active | |
| hellaswag | hellaswag | Active | |
| truthful_qa | truthful_qa | Active | |
| competition_math | competition_math | Active | |
| humaneval | humaneval | Active | |
| bbh | bbh | Active | |
| race | race | Active | |
| trivia_qa | trivia_qa | To be integrated | |
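
The fallback rule from the Tip can be sketched as a small lookup. This is only an illustration: `NATIVE_DATASETS` copies the names and statuses from the table above, while `suggest_backend` and the labels it returns are hypothetical and are not part of the framework's API.

```python
# Names and statuses copied from the native dataset table above.
NATIVE_DATASETS = {
    "mmlu": "Active",
    "ceval": "Active",
    "gsm8k": "Active",
    "arc": "Active",
    "hellaswag": "Active",
    "truthful_qa": "Active",
    "competition_math": "Active",
    "humaneval": "Active",
    "bbh": "Active",
    "race": "Active",
    "trivia_qa": "To be integrated",
}


def suggest_backend(dataset: str, multimodal: bool = False) -> str:
    """Hypothetical helper: return an illustrative backend label.

    The returned strings are plain labels for this sketch, not
    configuration values accepted by any tool.
    """
    # Only datasets marked "Active" are treated as natively usable.
    if NATIVE_DATASETS.get(dataset) == "Active":
        return "native"
    # Otherwise follow the Tip: VLMEvalKit for multi-modal models,
    # OpenCompass for everything else.
    return "VLMEvalKit" if multimodal else "OpenCompass"


if __name__ == "__main__":
    for name in ("gsm8k", "trivia_qa"):
        print(name, "->", suggest_backend(name))
```

A dataset that is listed but not yet "Active" (here `trivia_qa`) is deliberately routed to a fallback backend in this sketch; whether that backend actually provides it still has to be checked against the lists in the following sections.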

2. Datasets Supported by OpenCompass

Refer to the detailed explanation

| Category | Task | Datasets |
|---|---|---|
| Language | Word Definition | WiC, SummEdits |
| Language | Idiom Learning | CHID |
| Language | Semantic Similarity | AFQMC, BUSTM |
| Language | Coreference Resolution | CLUEWSC, WSC, WinoGrande |
| Language | Translation | Flores, IWSLT2017 |
| Language | Multi-language Question Answering | TyDi-QA, XCOPA |
| Language | Multi-language Summary | XLSum |
| Knowledge | Knowledge Question Answering | BoolQ, CommonSenseQA, NaturalQuestions, TriviaQA |
| Reasoning | Textual Entailment | CMNLI, OCNLI, OCNLI_FC, AX-b, AX-g, CB, RTE, ANLI |
| Reasoning | Commonsense Reasoning | StoryCloze, COPA, ReCoRD, HellaSwag, PIQA, SIQA |
| Reasoning | Mathematical Reasoning | MATH, GSM8K |
| Reasoning | Theorem Application | TheoremQA, StrategyQA, SciBench |
| Reasoning | Comprehensive Reasoning | BBH |
| Examination | Junior High, High School, University, Professional Examinations | C-Eval, AGIEval, MMLU, GAOKAO-Bench, CMMLU, ARC, Xiezhi |
| Examination | Medical Examinations | CMB |
| Understanding | Reading Comprehension | C3, CMRC, DRCD, MultiRC, RACE, DROP, OpenBookQA, SQuAD2.0 |
| Understanding | Content Summary | CSL, LCSTS, XSum, SummScreen |
| Understanding | Content Analysis | EPRSTMT, LAMBADA, TNEWS |
| Long Context | Long Context Understanding | LEval, LongBench, GovReports, NarrativeQA, Qasper |
| Safety | Safety | CivilComments, CrowsPairs, CValues, JigsawMultilingual, TruthfulQA |
| Safety | Robustness | AdvGLUE |
| Code | Code | HumanEval, HumanEvalX, MBPP, APPs, DS1000 |

3. Datasets Supported by VLMEvalKit

Refer to the detailed explanation

Image Understanding Datasets

Abbreviations used:

  • MCQ: Multiple Choice Questions;

  • Y/N: Yes/No Questions;

  • MTT: Multi-turn Dialogue Evaluation;

  • MTI: Multi-image Input Evaluation

| Dataset | Dataset Names | Task |
|---|---|---|
| MMBench Series: MMBench, MMBench-CN, CCBench | MMBench_DEV_[EN/CN], MMBench_TEST_[EN/CN], MMBench_DEV_[EN/CN]_V11, MMBench_TEST_[EN/CN]_V11, CCBench | MCQ |
| MMStar | MMStar | MCQ |
| MME | MME | Y/N |
| SEEDBench Series | SEEDBench_IMG, SEEDBench2, SEEDBench2_Plus | MCQ |
| MM-Vet | MMVet | VQA |
| MMMU | MMMU_[DEV_VAL/TEST] | MCQ |
| MathVista | MathVista_MINI | VQA |
| ScienceQA_IMG | ScienceQA_[VAL/TEST] | MCQ |
| COCO Caption | COCO_VAL | Caption |
| HallusionBench | HallusionBench | Y/N |
| OCRVQA* | OCRVQA_[TESTCORE/TEST] | VQA |
| TextVQA* | TextVQA_VAL | VQA |
| ChartQA* | ChartQA_TEST | VQA |
| AI2D | AI2D_[TEST/TEST_NO_MASK] | MCQ |
| LLaVABench | LLaVABench | VQA |
| DocVQA+ | DocVQA_[VAL/TEST] | VQA |
| InfoVQA+ | InfoVQA_[VAL/TEST] | VQA |
| OCRBench | OCRBench | VQA |
| RealWorldQA | RealWorldQA | MCQ |
| POPE | POPE | Y/N |
| Core-MM- | CORE_MM (MTI) | VQA |
| MMT-Bench | MMT-Bench_[VAL/ALL], MMT-Bench_[VAL/ALL]_MI | MCQ (MTI) |
| MLLMGuard- | MLLMGuard_DS | VQA |
| AesBench+ | AesBench_[VAL/TEST] | MCQ |
| VCR-wiki+ | VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100] | VQA |
| MMLongBench-Doc+ | MMLongBench_DOC | VQA (MTI) |
| BLINK | BLINK | MCQ (MTI) |
| MathVision+ | MathVision, MathVision_MINI | VQA |
| MT-VQA+ | MTVQA_TEST | VQA |
| MMDU+ | MMDU | VQA (MTT, MTI) |
| Q-Bench1+ | Q-Bench1_[VAL/TEST] | MCQ |
| A-Bench+ | A-Bench_[VAL/TEST] | MCQ |
| DUDE+ | DUDE | VQA (MTI) |
| SlideVQA+ | SLIDEVQA, SLIDEVQA_MINI | VQA (MTI) |
| TaskMeAnything ImageQA Random+ | TaskMeAnything_v1_imageqa_random | MCQ |
| MMMB and Multilingual MMBench+ | MMMB_[ar/cn/en/pt/ru/tr], MMBench_dev_[ar/cn/en/pt/ru/tr], MMMB, MTL_MMBench_DEV (MMMB and MTL_MMBench_DEV are all-in-one names covering all 6 languages) | MCQ |
| A-OKVQA+ | A-OKVQA | MCQ |
| MuirBench | MUIRBench | MCQ |
| GMAI-MMBench+ | GMAI-MMBench_VAL | MCQ |
| TableVQABench+ | TableVQABench | VQA |

Note

* Test results are provided for only some models; the remaining models cannot achieve reasonable accuracy under zero-shot conditions.

+ Test results for this evaluation set have not yet been provided.

- VLMEvalKit supports only inference for this evaluation set and cannot output a final accuracy.
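
Several entries in the Dataset Names column use a compact bracket notation such as `MMBench_DEV_[EN/CN]`. Assuming each bracketed group lists alternatives separated by `/` (an interpretation of the notation, not something VLMEvalKit provides), the concrete names can be enumerated with a short sketch like the one below. The helper name `expand_names` is hypothetical, and the generated names should still be checked against VLMEvalKit itself.

```python
import itertools
import re

# Matches one bracketed group such as "[EN/CN]" and captures its contents.
_GROUP = re.compile(r"\[([^\[\]]+)\]")


def expand_names(pattern: str) -> list[str]:
    """Expand a bracketed pattern into every concrete dataset name."""
    # Splitting on a pattern with one capture group yields literal text
    # at even indices and the captured group contents at odd indices.
    parts = _GROUP.split(pattern)
    choices = [
        part.split("/") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    # The Cartesian product picks one alternative from every group.
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    for name in expand_names("VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100]"):
        print(name)
```

Names without brackets are returned unchanged, and a `/` outside brackets is left untouched because only bracketed groups are treated as alternatives.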

Video Understanding Datasets

| Dataset | Dataset Names | Task |
|---|---|---|
| MMBench-Video | MMBench-Video | VQA |
| MVBench | MVBench/MVBench_MP4 | MCQ |
| Video-MME | Video-MME | MCQ |