Supported Datasets

1. Native Supported Datasets

Tip

The framework currently supports the datasets listed below. If the dataset you need is not listed, please submit an issue, use the OpenCompass backend for evaluation, or use the VLMEvalKit backend for multi-modal model evaluation.


Name	Link	Notes
`mmlu`	mmlu
`ceval`	ceval
`gsm8k`	gsm8k
`arc`	arc
`hellaswag`	hellaswag
`truthful_qa`	truthful_qa
`competition_math`	competition_math
`humaneval`	humaneval	Requires humaneval to be installed. Because evaluation executes generated code, running it in a sandboxed environment (e.g., Docker) is recommended.
`bbh`	bbh
`race`	race
`trivia_qa`	trivia_qa

2. OpenCompass Backend

Refer to the detailed explanation

Category	Task	Datasets
Language	Word Definition	WiC, SummEdits
Language	Idiom Learning	CHID
Language	Semantic Similarity	AFQMC, BUSTM
Language	Coreference Resolution	CLUEWSC, WSC, WinoGrande
Language	Translation	Flores, IWSLT2017
Language	Multi-language Question Answering	TyDi-QA, XCOPA
Language	Multi-language Summary	XLSum
Knowledge	Knowledge Question Answering	BoolQ, CommonSenseQA, NaturalQuestions, TriviaQA
Reasoning	Textual Entailment	CMNLI, OCNLI, OCNLI_FC, AX-b, AX-g, CB, RTE, ANLI
Reasoning	Commonsense Reasoning	StoryCloze, COPA, ReCoRD, HellaSwag, PIQA, SIQA
Reasoning	Mathematical Reasoning	MATH, GSM8K
Reasoning	Theorem Application	TheoremQA, StrategyQA, SciBench
Reasoning	Comprehensive Reasoning	BBH
Examination	Junior High, High School, University, Professional Examinations	C-Eval, AGIEval, MMLU, GAOKAO-Bench, CMMLU, ARC, Xiezhi
Examination	Medical Examinations	CMB
Understanding	Reading Comprehension	C3, CMRC, DRCD, MultiRC, RACE, DROP, OpenBookQA, SQuAD2.0
Understanding	Content Summary	CSL, LCSTS, XSum, SummScreen
Understanding	Content Analysis	EPRSTMT, LAMBADA, TNEWS
Long Context	Long Context Understanding	LEval, LongBench, GovReports, NarrativeQA, Qasper
Safety	Safety	CivilComments, CrowsPairs, CValues, JigsawMultilingual, TruthfulQA
Safety	Robustness	AdvGLUE
Code	Code	HumanEval, HumanEvalX, MBPP, APPs, DS1000

3. VLMEvalKit Backend

Refer to the detailed explanation

Image Understanding Dataset

Abbreviations used:

MCQ: Multiple Choice Questions;
Y/N: Yes/No Questions;
MTT: Multi-turn Dialogue Evaluation;
MTI: Multi-image Input Evaluation.

Dataset	Dataset Names	Task
MMBench Series: MMBench, MMBench-CN, CCBench	MMBench_DEV_[EN/CN] MMBench_TEST_[EN/CN] MMBench_DEV_[EN/CN]_V11 MMBench_TEST_[EN/CN]_V11 CCBench	MCQ
MMStar	MMStar	MCQ
MME	MME	Y/N
SEEDBench Series	SEEDBench_IMG SEEDBench2 SEEDBench2_Plus	MCQ
MM-Vet	MMVet	VQA
MMMU	MMMU_[DEV_VAL/TEST]	MCQ
MathVista	MathVista_MINI	VQA
ScienceQA_IMG	ScienceQA_[VAL/TEST]	MCQ
COCO Caption	COCO_VAL	Caption
HallusionBench	HallusionBench	Y/N
OCRVQA*	OCRVQA_[TESTCORE/TEST]	VQA
TextVQA*	TextVQA_VAL	VQA
ChartQA*	ChartQA_TEST	VQA
AI2D	AI2D_[TEST/TEST_NO_MASK]	MCQ
LLaVABench	LLaVABench	VQA
DocVQA+	DocVQA_[VAL/TEST]	VQA
InfoVQA+	InfoVQA_[VAL/TEST]	VQA
OCRBench	OCRBench	VQA
RealWorldQA	RealWorldQA	MCQ
POPE	POPE	Y/N
Core-MM -	CORE_MM (MTI)	VQA
MMT-Bench	MMT-Bench_[VAL/ALL] MMT-Bench_[VAL/ALL]_MI	MCQ (MTI)
MLLMGuard -	MLLMGuard_DS	VQA
AesBench+	AesBench_[VAL/TEST]	MCQ
VCR-wiki+	VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100]	VQA
MMLongBench-Doc+	MMLongBench_DOC	VQA (MTI)
BLINK	BLINK	MCQ (MTI)
MathVision+	MathVision MathVision_MINI	VQA
MT-VQA+	MTVQA_TEST	VQA
MMDU+	MMDU	VQA (MTT, MTI)
Q-Bench1+	Q-Bench1_[VAL/TEST]	MCQ
A-Bench+	A-Bench_[VAL/TEST]	MCQ
DUDE+	DUDE	VQA (MTI)
SlideVQA+	SLIDEVQA SLIDEVQA_MINI	VQA (MTI)
TaskMeAnything ImageQA Random+	TaskMeAnything_v1_imageqa_random	MCQ
MMMB and Multilingual MMBench+	MMMB_[ar/cn/en/pt/ru/tr] MMBench_dev_[ar/cn/en/pt/ru/tr] MMMB MTL_MMBench_DEV (MMMB and MTL_MMBench_DEV are all-in-one names that cover all 6 languages)	MCQ
A-OKVQA+	A-OKVQA	MCQ
MuirBench	MUIRBench	MCQ
GMAI-MMBench+	GMAI-MMBench_VAL	MCQ
TableVQABench+	TableVQABench	VQA

Note

* Test results are provided for only some models; the remaining models cannot achieve reasonable accuracy under zero-shot conditions.

+ Test results for this evaluation set have not yet been provided.

- VLMEvalKit supports only inference for this evaluation set and cannot output a final accuracy.
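
Several entries in the Dataset Names column of the image table use bracketed shorthand such as `MMBench_DEV_[EN/CN]` or `VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100]`. The following is a minimal sketch of how such a pattern could be expanded into concrete names, assuming each `[A/B/...]` group lists alternatives separated by `/`. The helper `expand_dataset_pattern` is hypothetical and is not part of VLMEvalKit or this framework.

```python
import itertools
import re
from typing import List

# Matches one bracketed group of alternatives, e.g. "[EN/CN]".
_GROUP = re.compile(r"\[([^\[\]]+)\]")


def expand_dataset_pattern(pattern: str) -> List[str]:
    """Expand every [A/B/...] group in `pattern` into concrete names.

    Hypothetical helper: it assumes brackets denote alternatives.
    """
    # Splitting on a capturing group alternates literal text and group contents:
    # "VCR_[EN/ZH]_[EASY/HARD]" -> ["VCR_", "EN/ZH", "_", "EASY/HARD", ""]
    parts = _GROUP.split(pattern)
    choices = [
        part.split("/") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    # The Cartesian product yields one concrete name per combination.
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    for name in expand_dataset_pattern("MMBench_DEV_[EN/CN]_V11"):
        print(name)
```

Under this reading, `MMBench_DEV_[EN/CN]_V11` stands for `MMBench_DEV_EN_V11` and `MMBench_DEV_CN_V11`, and the VCR-wiki pattern stands for 12 names. Names without brackets, such as `MMStar`, are returned unchanged.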

Video Understanding Dataset

Dataset	Dataset Name	Task
MMBench-Video	MMBench-Video	VQA
MVBench	MVBench_MP4	MCQ
MLVU	MLVU	MCQ & VQA
TempCompass	TempCompass	MCQ & Y/N & Caption
LongVideoBench	LongVideoBench	MCQ
Video-MME	Video-MME	MCQ

4. RAGEval Backend

CMTEB Evaluation Dataset

Name	Hub Link	Description	Type	Category	Number of Test Samples
T2Retrieval	C-MTEB/T2Retrieval	T2Ranking: A large-scale Chinese paragraph ranking benchmark	Retrieval	s2p	24,832
MMarcoRetrieval	C-MTEB/MMarcoRetrieval	mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset	Retrieval	s2p	7,437
DuRetrieval	C-MTEB/DuRetrieval	A large-scale Chinese web search engine paragraph retrieval benchmark	Retrieval	s2p	4,000
CovidRetrieval	C-MTEB/CovidRetrieval	COVID-19 news articles	Retrieval	s2p	949
CmedqaRetrieval	C-MTEB/CmedqaRetrieval	Online medical consultation texts	Retrieval	s2p	3,999
EcomRetrieval	C-MTEB/EcomRetrieval	Paragraph retrieval dataset collected from Alibaba e-commerce search engine systems	Retrieval	s2p	1,000
MedicalRetrieval	C-MTEB/MedicalRetrieval	Paragraph retrieval dataset collected from Alibaba medical search engine systems	Retrieval	s2p	1,000
VideoRetrieval	C-MTEB/VideoRetrieval	Paragraph retrieval dataset collected from Alibaba video search engine systems	Retrieval	s2p	1,000
T2Reranking	C-MTEB/T2Reranking	T2Ranking: A large-scale Chinese paragraph ranking benchmark	Re-ranking	s2p	24,382
MMarcoReranking	C-MTEB/MMarco-reranking	mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset	Re-ranking	s2p	7,437
CMedQAv1	C-MTEB/CMedQAv1-reranking	Chinese community medical Q&A	Re-ranking	s2p	2,000
CMedQAv2	C-MTEB/CMedQAv2-reranking	Chinese community medical Q&A	Re-ranking	s2p	4,000
Ocnli	C-MTEB/OCNLI	Original Chinese natural language inference dataset	Pair Classification	s2s	3,000
Cmnli	C-MTEB/CMNLI	Chinese multi-class natural language inference	Pair Classification	s2s	139,000
CLSClusteringS2S	C-MTEB/CLSClusteringS2S	Clustering titles from the CLS dataset. Clustering based on 13 sets of main categories.	Clustering	s2s	10,000
CLSClusteringP2P	C-MTEB/CLSClusteringP2P	Clustering titles + abstracts from the CLS dataset. Clustering based on 13 sets of main categories.	Clustering	p2p	10,000
ThuNewsClusteringS2S	C-MTEB/ThuNewsClusteringS2S	Clustering titles from the THUCNews dataset	Clustering	s2s	10,000
ThuNewsClusteringP2P	C-MTEB/ThuNewsClusteringP2P	Clustering titles + abstracts from the THUCNews dataset	Clustering	p2p	10,000
ATEC	C-MTEB/ATEC	ATEC NLP Sentence Pair Similarity Competition	STS	s2s	20,000
BQ	C-MTEB/BQ	Banking Question Semantic Similarity	STS	s2s	10,000
LCQMC	C-MTEB/LCQMC	Large-scale Chinese Question Matching Corpus	STS	s2s	12,500
PAWSX	C-MTEB/PAWSX	Translated PAWS evaluation pairs	STS	s2s	2,000
STSB	C-MTEB/STSB	Translated STS-B into Chinese	STS	s2s	1,360
AFQMC	C-MTEB/AFQMC	Ant Financial Question Matching Corpus	STS	s2s	3,861
QBQTC	C-MTEB/QBQTC	QQ Browser Query Title Corpus	STS	s2s	5,000
TNews	C-MTEB/TNews-classification	News Short Text Classification	Classification	s2s	10,000
IFlyTek	C-MTEB/IFlyTek-classification	Long Text Classification of Application Descriptions	Classification	s2s	2,600
Waimai	C-MTEB/waimai-classification	Sentiment Analysis of User Reviews on Food Delivery Platforms	Classification	s2s	1,000
OnlineShopping	C-MTEB/OnlineShopping-classification	Sentiment Analysis of User Reviews on Online Shopping Websites	Classification	s2s	1,000
MultilingualSentiment	C-MTEB/MultilingualSentiment-classification	A set of multilingual sentiment datasets grouped into three categories: positive, neutral, negative	Classification	s2s	3,000
JDReview	C-MTEB/JDReview-classification	Reviews of iPhone	Classification	s2s	533

For retrieval tasks, a sample of 100,000 candidates (including the ground truth) is drawn from the entire corpus to reduce inference costs.
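
The source does not say how the 100,000 candidates are drawn. As an illustration only, the sketch below shows one plausible way to build such a pool: keep every ground-truth document and fill the remainder with randomly chosen distractors. The function name `sample_candidates` and its parameters are hypothetical and do not come from the benchmark code.

```python
import random
from typing import Iterable, List, Sequence


def sample_candidates(
    corpus_ids: Sequence[str],
    relevant_ids: Iterable[str],
    pool_size: int = 100_000,
    seed: int = 0,
) -> List[str]:
    """Return up to `pool_size` document IDs that always include `relevant_ids`.

    Assumes the IDs in `corpus_ids` are unique.
    """
    relevant = set(relevant_ids)
    missing = relevant.difference(corpus_ids)
    if missing:
        raise ValueError(f"ground-truth IDs not in corpus: {sorted(missing)}")
    if len(relevant) > pool_size:
        raise ValueError("pool_size is smaller than the number of ground-truth documents")

    # Keep every ground-truth document, then fill the rest with random distractors.
    distractors = [doc_id for doc_id in corpus_ids if doc_id not in relevant]
    n_extra = min(pool_size - len(relevant), len(distractors))
    rng = random.Random(seed)  # fixed seed keeps the pool reproducible
    pool = sorted(relevant) + rng.sample(distractors, n_extra)
    rng.shuffle(pool)  # avoid leaking relevance through position
    return pool


if __name__ == "__main__":
    corpus = [f"doc{i}" for i in range(1_000)]
    pool = sample_candidates(corpus, ["doc3", "doc999"], pool_size=100)
    print(len(pool), "doc3" in pool, "doc999" in pool)
```

Because the ground truth is always kept, metrics computed on the reduced pool remain well defined, although they are not directly comparable to scores computed against the full corpus.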

MTEB Evaluation Dataset

CLIP-Benchmark

Dataset Name	Task Type	Notes
muge	zeroshot_retrieval	Chinese Multimodal Dataset
flickr30k	zeroshot_retrieval
flickr8k	zeroshot_retrieval
mscoco_captions	zeroshot_retrieval
mscoco_captions2017	zeroshot_retrieval
imagenet1k	zeroshot_classification
imagenetv2	zeroshot_classification
imagenet_sketch	zeroshot_classification
imagenet-a	zeroshot_classification
imagenet-r	zeroshot_classification
imagenet-o	zeroshot_classification
objectnet	zeroshot_classification
fer2013	zeroshot_classification
voc2007	zeroshot_classification
voc2007_multilabel	zeroshot_classification
sun397	zeroshot_classification
cars	zeroshot_classification
fgvc_aircraft	zeroshot_classification
mnist	zeroshot_classification
stl10	zeroshot_classification
gtsrb	zeroshot_classification
country211	zeroshot_classification
renderedsst2	zeroshot_classification
vtab_caltech101	zeroshot_classification
vtab_cifar10	zeroshot_classification
vtab_cifar100	zeroshot_classification
vtab_clevr_count_all	zeroshot_classification
vtab_clevr_closest_object_distance	zeroshot_classification
vtab_diabetic_retinopathy	zeroshot_classification
vtab_dmlab	zeroshot_classification
vtab_dsprites_label_orientation	zeroshot_classification
vtab_dsprites_label_x_position	zeroshot_classification
vtab_dsprites_label_y_position	zeroshot_classification
vtab_dtd	zeroshot_classification
vtab_eurosat	zeroshot_classification
vtab_kitti_closest_vehicle_distance	zeroshot_classification
vtab_flowers	zeroshot_classification
vtab_pets	zeroshot_classification
vtab_pcam	zeroshot_classification
vtab_resisc45	zeroshot_classification
vtab_smallnorb_label_azimuth	zeroshot_classification
vtab_smallnorb_label_elevation	zeroshot_classification
vtab_svhn	zeroshot_classification