Supported Datasets

1. Natively Supported Datasets

Tip

The framework natively supports the datasets listed below. If the dataset you need is not listed, submit an issue, use the OpenCompass backend for evaluation, or use the VLMEvalKit backend for multimodal model evaluation.

Dataset Name	Link	Status
`mmlu`	mmlu	Active
`ceval`	ceval	Active
`gsm8k`	gsm8k	Active
`arc`	arc	Active
`hellaswag`	hellaswag	Active
`truthful_qa`	truthful_qa	Active
`competition_math`	competition_math	Active
`humaneval`	humaneval	Active
`bbh`	bbh	Active
`race`	race	Active
`trivia_qa`	trivia_qa	To be integrated
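As a rough illustration of how the Status column might be used before launching an evaluation, the sketch below splits a list of requested dataset names into those marked Active and those that are not yet usable. The dictionary simply copies the table above. The helper name `partition_datasets` and the example name `mmlu_pro` are hypothetical and are not part of the framework.

```python
# Hypothetical helper, not part of the framework.
# Names and statuses are copied from the table above.
NATIVE_DATASETS = {
    "mmlu": "Active",
    "ceval": "Active",
    "gsm8k": "Active",
    "arc": "Active",
    "hellaswag": "Active",
    "truthful_qa": "Active",
    "competition_math": "Active",
    "humaneval": "Active",
    "bbh": "Active",
    "race": "Active",
    "trivia_qa": "To be integrated",
}


def partition_datasets(requested):
    """Split requested names into (usable, not_usable) lists.

    A name is usable only if it appears in the table with status "Active".
    Unknown names and names with any other status are reported as not usable.
    """
    usable = [name for name in requested if NATIVE_DATASETS.get(name) == "Active"]
    not_usable = [name for name in requested if NATIVE_DATASETS.get(name) != "Active"]
    return usable, not_usable


if __name__ == "__main__":
    # "mmlu_pro" is a made-up name used only to show the unknown-name case.
    usable, not_usable = partition_datasets(["gsm8k", "trivia_qa", "mmlu_pro"])
    print("usable:", usable)
    print("not usable:", not_usable)
```

Names that end up in the second list are candidates for the OpenCompass or VLMEvalKit backends described in the following sections, or for a new issue.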

2. Datasets Supported by OpenCompass

Refer to the detailed explanation of the OpenCompass backend.

Category	Subcategory	Datasets
Language	Word Definition	WiC, SummEdits
Language	Idiom Learning	CHID
Language	Semantic Similarity	AFQMC, BUSTM
Language	Coreference Resolution	CLUEWSC, WSC, WinoGrande
Language	Translation	Flores, IWSLT2017
Language	Multi-language Question Answering	TyDi-QA, XCOPA
Language	Multi-language Summary	XLSum
Knowledge	Knowledge Question Answering	BoolQ, CommonSenseQA, NaturalQuestions, TriviaQA
Reasoning	Textual Entailment	CMNLI, OCNLI, OCNLI_FC, AX-b, AX-g, CB, RTE, ANLI
Reasoning	Commonsense Reasoning	StoryCloze, COPA, ReCoRD, HellaSwag, PIQA, SIQA
Reasoning	Mathematical Reasoning	MATH, GSM8K
Reasoning	Theorem Application	TheoremQA, StrategyQA, SciBench
Reasoning	Comprehensive Reasoning	BBH
Examination	Junior High, High School, University, Professional Examinations	C-Eval, AGIEval, MMLU, GAOKAO-Bench, CMMLU, ARC, Xiezhi
Examination	Medical Examinations	CMB
Understanding	Reading Comprehension	C3, CMRC, DRCD, MultiRC, RACE, DROP, OpenBookQA, SQuAD2.0
Understanding	Content Summary	CSL, LCSTS, XSum, SummScreen
Understanding	Content Analysis	EPRSTMT, LAMBADA, TNEWS
Long Context	Long Context Understanding	LEval, LongBench, GovReports, NarrativeQA, Qasper
Safety	Safety	CivilComments, CrowsPairs, CValues, JigsawMultilingual, TruthfulQA
Safety	Robustness	AdvGLUE
Code	Code	HumanEval, HumanEvalX, MBPP, APPs, DS1000

3. Datasets Supported by VLMEvalKit

Refer to the detailed explanation of the VLMEvalKit backend.

Image Understanding Datasets

Abbreviations used:

MCQ: Multiple Choice Questions;
Y/N: Yes/No Questions;
MTT: Multi-turn Dialogue Evaluation;
MTI: Multi-image Input Evaluation.

Dataset	Dataset Names	Task
MMBench Series: MMBench, MMBench-CN, CCBench	MMBench_DEV_[EN/CN] MMBench_TEST_[EN/CN] MMBench_DEV_[EN/CN]_V11 MMBench_TEST_[EN/CN]_V11 CCBench	MCQ
MMStar	MMStar	MCQ
MME	MME	Y/N
SEEDBench Series	SEEDBench_IMG SEEDBench2 SEEDBench2_Plus	MCQ
MM-Vet	MMVet	VQA
MMMU	MMMU_[DEV_VAL/TEST]	MCQ
MathVista	MathVista_MINI	VQA
ScienceQA_IMG	ScienceQA_[VAL/TEST]	MCQ
COCO Caption	COCO_VAL	Caption
HallusionBench	HallusionBench	Y/N
OCRVQA*	OCRVQA_[TESTCORE/TEST]	VQA
TextVQA*	TextVQA_VAL	VQA
ChartQA*	ChartQA_TEST	VQA
AI2D	AI2D_[TEST/TEST_NO_MASK]	MCQ
LLaVABench	LLaVABench	VQA
DocVQA+	DocVQA_[VAL/TEST]	VQA
InfoVQA+	InfoVQA_[VAL/TEST]	VQA
OCRBench	OCRBench	VQA
RealWorldQA	RealWorldQA	MCQ
POPE	POPE	Y/N
Core-MM-	CORE_MM (MTI)	VQA
MMT-Bench	MMT-Bench_[VAL/ALL] MMT-Bench_[VAL/ALL]_MI	MCQ (MTI)
MLLMGuard-	MLLMGuard_DS	VQA
AesBench+	AesBench_[VAL/TEST]	MCQ
VCR-wiki+	VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100]	VQA
MMLongBench-Doc+	MMLongBench_DOC	VQA (MTI)
BLINK	BLINK	MCQ (MTI)
MathVision+	MathVision MathVision_MINI	VQA
MT-VQA+	MTVQA_TEST	VQA
MMDU+	MMDU	VQA (MTT, MTI)
Q-Bench1+	Q-Bench1_[VAL/TEST]	MCQ
A-Bench+	A-Bench_[VAL/TEST]	MCQ
DUDE+	DUDE	VQA (MTI)
SlideVQA+	SLIDEVQA SLIDEVQA_MINI	VQA (MTI)
TaskMeAnything ImageQA Random+	TaskMeAnything_v1_imageqa_random	MCQ
MMMB and Multilingual MMBench+	MMMB_[ar/cn/en/pt/ru/tr] MMBench_dev_[ar/cn/en/pt/ru/tr] MMMB MTL_MMBench_DEV (MMMB and MTL_MMBench_DEV are all-in-one names that cover all 6 languages)	MCQ
A-OKVQA+	A-OKVQA	MCQ
MuirBench	MUIRBench	MCQ
GMAI-MMBench+	GMAI-MMBench_VAL	MCQ
TableVQABench+	TableVQABench	VQA

Note

* Test results are provided for only some models; the remaining models cannot achieve reasonable accuracy under zero-shot conditions.

+ Test results for this evaluation set have not yet been provided.

- VLMEvalKit supports only inference for this evaluation set and cannot output a final accuracy.
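Several entries in the Dataset Names column of the table above use bracketed alternatives, for example `MMBench_DEV_[EN/CN]_V11` or `VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100]`. Assuming each `[A/B/...]` group is shorthand for one concrete name per alternative (an interpretation of the table, not something VLMEvalKit documents here), the following minimal sketch expands a pattern into the full list of names. The function `expand_names` is a hypothetical helper and is not part of VLMEvalKit.

```python
import itertools
import re

# Matches one bracketed group such as "[EN/CN]" and captures its contents.
_GROUP = re.compile(r"\[([^\[\]]+)\]")


def expand_names(pattern):
    """Expand every [A/B/...] group in `pattern` into concrete names.

    Assumption: a bracketed group lists alternatives separated by "/".
    Text outside brackets is copied unchanged, including any "/" it contains.
    """
    # With one capture group, re.split alternates between literal text
    # (even indices) and the contents of each bracketed group (odd indices).
    parts = _GROUP.split(pattern)
    choices = [
        part.split("/") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    for name in expand_names("MMBench_DEV_[EN/CN]_V11"):
        print(name)
    # A pattern with three groups of sizes 2, 2 and 3 yields 12 names.
    print(len(expand_names("VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100]")))
```

Names without brackets, such as `MMStar`, come back as a single-element list, so the helper can be applied uniformly to every entry in the column.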

Video Understanding Datasets

Dataset	Dataset Name	Task
MMBench-Video	MMBench-Video	VQA
MVBench	MVBench/MVBench_MP4	MCQ
Video-MME	Video-MME	MCQ