Supported Datasets

1. Native Supported Datasets

Tip

The framework currently supports the following datasets. If the dataset you need is not listed, please submit an issue, use the OpenCompass backend for evaluation, or use the VLMEvalKit backend for multi-modal model evaluation.

| Dataset Name | Link | Status | Note |
|---|---|---|---|
| mmlu | mmlu | Active | |
| ceval | ceval | Active | |
| gsm8k | gsm8k | Active | |
| arc | arc | Active | |
| hellaswag | hellaswag | Active | |
| truthful_qa | truthful_qa | Active | |
| competition_math | competition_math | Active | |
| humaneval | humaneval | Active | |
| bbh | bbh | Active | |
| race | race | Active | |
| trivia_qa | trivia_qa | To be integrated | |
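
As a small illustration of how the Status column above might be used, the sketch below checks a list of requested dataset names against the table. The `NATIVE_DATASETS` mapping and the `split_by_status` helper are hypothetical names introduced here for illustration; they only mirror the table and are not part of the framework's API.

```python
# Hypothetical mirror of the table above; not part of the framework's API.
NATIVE_DATASETS = {
    "mmlu": "Active",
    "ceval": "Active",
    "gsm8k": "Active",
    "arc": "Active",
    "hellaswag": "Active",
    "truthful_qa": "Active",
    "competition_math": "Active",
    "humaneval": "Active",
    "bbh": "Active",
    "race": "Active",
    "trivia_qa": "To be integrated",
}


def split_by_status(requested):
    """Split requested names into (active, pending, unknown) lists."""
    active, pending, unknown = [], [], []
    for name in requested:
        status = NATIVE_DATASETS.get(name)
        if status == "Active":
            active.append(name)
        elif status is None:
            # Not in the native list: a candidate for another backend.
            unknown.append(name)
        else:
            # Listed, but not yet usable (for example "To be integrated").
            pending.append(name)
    return active, pending, unknown


if __name__ == "__main__":
    active, pending, unknown = split_by_status(["gsm8k", "trivia_qa", "cmmlu"])
    print("active:", active)
    print("pending:", pending)
    print("unknown:", unknown)
```

Names that end up in the unknown list are the ones for which the Tip above suggests submitting an issue or switching to another backend.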

2. OpenCompass Backend

Refer to the detailed explanation

| Category | Subcategory | Datasets |
|---|---|---|
| Language | Word Definition | WiC, SummEdits |
| Language | Idiom Learning | CHID |
| Language | Semantic Similarity | AFQMC, BUSTM |
| Language | Coreference Resolution | CLUEWSC, WSC, WinoGrande |
| Language | Translation | Flores, IWSLT2017 |
| Language | Multi-language Question Answering | TyDi-QA, XCOPA |
| Language | Multi-language Summary | XLSum |
| Knowledge | Knowledge Question Answering | BoolQ, CommonSenseQA, NaturalQuestions, TriviaQA |
| Reasoning | Textual Entailment | CMNLI, OCNLI, OCNLI_FC, AX-b, AX-g, CB, RTE, ANLI |
| Reasoning | Commonsense Reasoning | StoryCloze, COPA, ReCoRD, HellaSwag, PIQA, SIQA |
| Reasoning | Mathematical Reasoning | MATH, GSM8K |
| Reasoning | Theorem Application | TheoremQA, StrategyQA, SciBench |
| Reasoning | Comprehensive Reasoning | BBH |
| Examination | Junior High, High School, University, Professional Examinations | C-Eval, AGIEval, MMLU, GAOKAO-Bench, CMMLU, ARC, Xiezhi |
| Examination | Medical Examinations | CMB |
| Understanding | Reading Comprehension | C3, CMRC, DRCD, MultiRC, RACE, DROP, OpenBookQA, SQuAD2.0 |
| Understanding | Content Summary | CSL, LCSTS, XSum, SummScreen |
| Understanding | Content Analysis | EPRSTMT, LAMBADA, TNEWS |
| Long Context | Long Context Understanding | LEval, LongBench, GovReports, NarrativeQA, Qasper |
| Safety | Safety | CivilComments, CrowsPairs, CValues, JigsawMultilingual, TruthfulQA |
| Safety | Robustness | AdvGLUE |
| Code | Code | HumanEval, HumanEvalX, MBPP, APPs, DS1000 |

3. VLMEvalKit Backend

Refer to the detailed explanation

Image Understanding Dataset

Abbreviations used:

  • MCQ: Multiple Choice Questions

  • Y/N: Yes/No Questions

  • MTT: Multi-turn Dialogue Evaluation

  • MTI: Multi-image Input Evaluation

| Dataset | Dataset Names | Task |
|---|---|---|
| MMBench Series: MMBench, MMBench-CN, CCBench | MMBench_DEV_[EN/CN], MMBench_TEST_[EN/CN], MMBench_DEV_[EN/CN]_V11, MMBench_TEST_[EN/CN]_V11, CCBench | MCQ |
| MMStar | MMStar | MCQ |
| MME | MME | Y/N |
| SEEDBench Series | SEEDBench_IMG, SEEDBench2, SEEDBench2_Plus | MCQ |
| MM-Vet | MMVet | VQA |
| MMMU | MMMU_[DEV_VAL/TEST] | MCQ |
| MathVista | MathVista_MINI | VQA |
| ScienceQA_IMG | ScienceQA_[VAL/TEST] | MCQ |
| COCO Caption | COCO_VAL | Caption |
| HallusionBench | HallusionBench | Y/N |
| OCRVQA* | OCRVQA_[TESTCORE/TEST] | VQA |
| TextVQA* | TextVQA_VAL | VQA |
| ChartQA* | ChartQA_TEST | VQA |
| AI2D | AI2D_[TEST/TEST_NO_MASK] | MCQ |
| LLaVABench | LLaVABench | VQA |
| DocVQA+ | DocVQA_[VAL/TEST] | VQA |
| InfoVQA+ | InfoVQA_[VAL/TEST] | VQA |
| OCRBench | OCRBench | VQA |
| RealWorldQA | RealWorldQA | MCQ |
| POPE | POPE | Y/N |
| Core-MM - | CORE_MM (MTI) | VQA |
| MMT-Bench | MMT-Bench_[VAL/ALL], MMT-Bench_[VAL/ALL]_MI | MCQ (MTI) |
| MLLMGuard - | MLLMGuard_DS | VQA |
| AesBench+ | AesBench_[VAL/TEST] | MCQ |
| VCR-wiki+ | VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100] | VQA |
| MMLongBench-Doc+ | MMLongBench_DOC | VQA (MTI) |
| BLINK | BLINK | MCQ (MTI) |
| MathVision+ | MathVision, MathVision_MINI | VQA |
| MT-VQA+ | MTVQA_TEST | VQA |
| MMDU+ | MMDU | VQA (MTT, MTI) |
| Q-Bench1+ | Q-Bench1_[VAL/TEST] | MCQ |
| A-Bench+ | A-Bench_[VAL/TEST] | MCQ |
| DUDE+ | DUDE | VQA (MTI) |
| SlideVQA+ | SLIDEVQA, SLIDEVQA_MINI | VQA (MTI) |
| TaskMeAnything ImageQA Random+ | TaskMeAnything_v1_imageqa_random | MCQ |
| MMMB and Multilingual MMBench+ | MMMB_[ar/cn/en/pt/ru/tr], MMBench_dev_[ar/cn/en/pt/ru/tr], MMMB, MTL_MMBench_DEV (PS: MMMB & MTL_MMBench_DEV are all-in-one names for 6 languages) | MCQ |
| A-OKVQA+ | A-OKVQA | MCQ |
| MuirBench | MUIRBench | MCQ |
| GMAI-MMBench+ | GMAI-MMBench_VAL | MCQ |
| TableVQABench+ | TableVQABench | VQA |

Note

* Test results are provided for only some models; the remaining models cannot achieve reasonable accuracy under zero-shot conditions.

+ Test results for this evaluation set have not yet been provided.

- VLMEvalKit supports only inference for this evaluation set and cannot output a final accuracy.
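
Several entries in the Dataset Names column use a bracket shorthand such as `MMBench_DEV_[EN/CN]`. The sketch below expands that shorthand into concrete names, assuming each `[A/B]` group lists alternatives separated by `/`. The `expand_names` helper is a hypothetical utility written for this page, not a VLMEvalKit function.

```python
import itertools
import re

# Matches one bracketed group of alternatives such as "[EN/CN]".
_GROUP = re.compile(r"\[([^\[\]]+)\]")


def expand_names(pattern):
    """Expand a shorthand like 'A_[X/Y]_V11' into ['A_X_V11', 'A_Y_V11']."""
    # re.split with one capturing group alternates literal text and group bodies.
    parts = _GROUP.split(pattern)
    literals = parts[0::2]
    choices = [body.split("/") for body in parts[1::2]]
    names = []
    for combo in itertools.product(*choices):
        pieces = [literals[0]]
        for option, literal in zip(combo, literals[1:]):
            pieces.append(option)
            pieces.append(literal)
        names.append("".join(pieces))
    return names


if __name__ == "__main__":
    for name in expand_names("MMBench_DEV_[EN/CN]_V11"):
        print(name)
```

A name without brackets, such as `MMStar`, is returned unchanged as a single-element list, so the helper can be applied to every entry in the column.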

Video Understanding Dataset

| Dataset | Dataset Name | Task |
|---|---|---|
| MMBench-Video | MMBench-Video | VQA |
| MVBench | MVBench_MP4 | MCQ |
| MLVU | MLVU | MCQ & VQA |
| TempCompass | TempCompass | MCQ & Y/N & Caption |
| LongVideoBench | LongVideoBench | MCQ |
| Video-MME | Video-MME | MCQ |

4. RAGEval Backend

CMTEB Evaluation Dataset

| Name | Hub Link | Description | Type | Category | Number of Test Samples |
|---|---|---|---|---|---|
| T2Retrieval | C-MTEB/T2Retrieval | T2Ranking: A large-scale Chinese paragraph ranking benchmark | Retrieval | s2p | 24,832 |
| MMarcoRetrieval | C-MTEB/MMarcoRetrieval | mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset | Retrieval | s2p | 7,437 |
| DuRetrieval | C-MTEB/DuRetrieval | A large-scale Chinese web search engine paragraph retrieval benchmark | Retrieval | s2p | 4,000 |
| CovidRetrieval | C-MTEB/CovidRetrieval | COVID-19 news articles | Retrieval | s2p | 949 |
| CmedqaRetrieval | C-MTEB/CmedqaRetrieval | Online medical consultation texts | Retrieval | s2p | 3,999 |
| EcomRetrieval | C-MTEB/EcomRetrieval | Paragraph retrieval dataset collected from Alibaba e-commerce search engine systems | Retrieval | s2p | 1,000 |
| MedicalRetrieval | C-MTEB/MedicalRetrieval | Paragraph retrieval dataset collected from Alibaba medical search engine systems | Retrieval | s2p | 1,000 |
| VideoRetrieval | C-MTEB/VideoRetrieval | Paragraph retrieval dataset collected from Alibaba video search engine systems | Retrieval | s2p | 1,000 |
| T2Reranking | C-MTEB/T2Reranking | T2Ranking: A large-scale Chinese paragraph ranking benchmark | Re-ranking | s2p | 24,382 |
| MMarcoReranking | C-MTEB/MMarco-reranking | mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset | Re-ranking | s2p | 7,437 |
| CMedQAv1 | C-MTEB/CMedQAv1-reranking | Chinese community medical Q&A | Re-ranking | s2p | 2,000 |
| CMedQAv2 | C-MTEB/CMedQAv2-reranking | Chinese community medical Q&A | Re-ranking | s2p | 4,000 |
| Ocnli | C-MTEB/OCNLI | Original Chinese natural language inference dataset | Pair Classification | s2s | 3,000 |
| Cmnli | C-MTEB/CMNLI | Chinese multi-class natural language inference | Pair Classification | s2s | 139,000 |
| CLSClusteringS2S | C-MTEB/CLSClusteringS2S | Clustering titles from the CLS dataset. Clustering based on 13 sets of main categories. | Clustering | s2s | 10,000 |
| CLSClusteringP2P | C-MTEB/CLSClusteringP2P | Clustering titles + abstracts from the CLS dataset. Clustering based on 13 sets of main categories. | Clustering | p2p | 10,000 |
| ThuNewsClusteringS2S | C-MTEB/ThuNewsClusteringS2S | Clustering titles from the THUCNews dataset | Clustering | s2s | 10,000 |
| ThuNewsClusteringP2P | C-MTEB/ThuNewsClusteringP2P | Clustering titles + abstracts from the THUCNews dataset | Clustering | p2p | 10,000 |
| ATEC | C-MTEB/ATEC | ATEC NLP Sentence Pair Similarity Competition | STS | s2s | 20,000 |
| BQ | C-MTEB/BQ | Banking Question Semantic Similarity | STS | s2s | 10,000 |
| LCQMC | C-MTEB/LCQMC | Large-scale Chinese Question Matching Corpus | STS | s2s | 12,500 |
| PAWSX | C-MTEB/PAWSX | Translated PAWS evaluation pairs | STS | s2s | 2,000 |
| STSB | C-MTEB/STSB | Translated STS-B into Chinese | STS | s2s | 1,360 |
| AFQMC | C-MTEB/AFQMC | Ant Financial Question Matching Corpus | STS | s2s | 3,861 |
| QBQTC | C-MTEB/QBQTC | QQ Browser Query Title Corpus | STS | s2s | 5,000 |
| TNews | C-MTEB/TNews-classification | News Short Text Classification | Classification | s2s | 10,000 |
| IFlyTek | C-MTEB/IFlyTek-classification | Long Text Classification of Application Descriptions | Classification | s2s | 2,600 |
| Waimai | C-MTEB/waimai-classification | Sentiment Analysis of User Reviews on Food Delivery Platforms | Classification | s2s | 1,000 |
| OnlineShopping | C-MTEB/OnlineShopping-classification | Sentiment Analysis of User Reviews on Online Shopping Websites | Classification | s2s | 1,000 |
| MultilingualSentiment | C-MTEB/MultilingualSentiment-classification | A set of multilingual sentiment datasets grouped into three categories: positive, neutral, negative | Classification | s2s | 3,000 |
| JDReview | C-MTEB/JDReview-classification | Reviews of iPhone | Classification | s2s | 533 |

For retrieval tasks, a sample of 100,000 candidates (including the ground truth) is drawn from the entire corpus to reduce inference costs.
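
The candidate sampling described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual procedure: uniform random sampling, the fixed seed, and the `sample_candidates` helper name are all assumptions. The only behavior taken from the text is that the sample is capped (at 100,000 by default) and always contains the ground truth.

```python
import random


def sample_candidates(corpus_ids, ground_truth_ids, k=100_000, seed=0):
    """Return up to k candidate ids that always include the ground truth."""
    truth = list(dict.fromkeys(ground_truth_ids))  # de-duplicate, keep order
    if len(truth) > k:
        raise ValueError("k is smaller than the number of ground-truth ids")
    truth_set = set(truth)
    # Everything in the corpus that is not a ground-truth document.
    distractors = [doc_id for doc_id in corpus_ids if doc_id not in truth_set]
    rng = random.Random(seed)  # assumed: seeded uniform sampling
    n_extra = min(k - len(truth), len(distractors))
    return truth + rng.sample(distractors, n_extra)


if __name__ == "__main__":
    corpus = [f"doc{i}" for i in range(1000)]
    picked = sample_candidates(corpus, ["doc7", "doc42"], k=100)
    print(len(picked), "doc7" in picked, "doc42" in picked)
```

When the corpus is smaller than `k`, the helper simply returns the whole corpus with the ground-truth ids first, so small datasets are unaffected by the cap.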

MTEB Evaluation Dataset

See also: MTEB Related Tasks

CLIP-Benchmark

| Dataset Name | Task Type | Notes |
|---|---|---|
| muge | zeroshot_retrieval | Chinese Multimodal Dataset |
| flickr30k | zeroshot_retrieval | |
| flickr8k | zeroshot_retrieval | |
| mscoco_captions | zeroshot_retrieval | |
| mscoco_captions2017 | zeroshot_retrieval | |
| imagenet1k | zeroshot_classification | |
| imagenetv2 | zeroshot_classification | |
| imagenet_sketch | zeroshot_classification | |
| imagenet-a | zeroshot_classification | |
| imagenet-r | zeroshot_classification | |
| imagenet-o | zeroshot_classification | |
| objectnet | zeroshot_classification | |
| fer2013 | zeroshot_classification | |
| voc2007 | zeroshot_classification | |
| voc2007_multilabel | zeroshot_classification | |
| sun397 | zeroshot_classification | |
| cars | zeroshot_classification | |
| fgvc_aircraft | zeroshot_classification | |
| mnist | zeroshot_classification | |
| stl10 | zeroshot_classification | |
| gtsrb | zeroshot_classification | |
| country211 | zeroshot_classification | |
| renderedsst2 | zeroshot_classification | |
| vtab_caltech101 | zeroshot_classification | |
| vtab_cifar10 | zeroshot_classification | |
| vtab_cifar100 | zeroshot_classification | |
| vtab_clevr_count_all | zeroshot_classification | |
| vtab_clevr_closest_object_distance | zeroshot_classification | |
| vtab_diabetic_retinopathy | zeroshot_classification | |
| vtab_dmlab | zeroshot_classification | |
| vtab_dsprites_label_orientation | zeroshot_classification | |
| vtab_dsprites_label_x_position | zeroshot_classification | |
| vtab_dsprites_label_y_position | zeroshot_classification | |
| vtab_dtd | zeroshot_classification | |
| vtab_eurosat | zeroshot_classification | |
| vtab_kitti_closest_vehicle_distance | zeroshot_classification | |
| vtab_flowers | zeroshot_classification | |
| vtab_pets | zeroshot_classification | |
| vtab_pcam | zeroshot_classification | |
| vtab_resisc45 | zeroshot_classification | |
| vtab_smallnorb_label_azimuth | zeroshot_classification | |
| vtab_smallnorb_label_elevation | zeroshot_classification | |
| vtab_svhn | zeroshot_classification | |