# Supported Datasets

## 1. Natively Supported Datasets

> **Tip**
>
> The framework natively supports the datasets listed below. If the dataset you need is not on the list, you may submit an issue and we will support it as soon as possible. Alternatively, you can follow the Benchmark Addition Guide to add the dataset yourself and submit a PR; contributions are welcome.
>
> You can also evaluate with the other tools this framework supports, such as OpenCompass for language model evaluation or VLMEvalKit for multimodal model evaluation.
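For orientation, here is a minimal sketch of launching an evaluation on the native datasets, assuming evalscope's Python entry points (`TaskConfig`/`run_task`); the exact names and fields may differ across versions, so treat this as illustrative rather than canonical:

```python
# Minimal sketch, assuming evalscope exposes TaskConfig/run_task at the
# top level (verify against your installed version).
from evalscope import TaskConfig, run_task

task = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',  # any ModelScope/HF model id
    datasets=['gsm8k', 'arc'],           # names from the table below
    limit=10,                            # a few samples per dataset for a smoke test
)
run_task(task)
```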

| Name | Dataset ID | Task Category | Remarks |
|---|---|---|---|
| arc | modelscope/ai2_arc | Exam | |
| bbh | modelscope/bbh | General Reasoning | |
| ceval | modelscope/ceval-exam | Chinese Comprehensive Exam | |
| cmmlu | modelscope/cmmlu | Chinese Comprehensive Exam | |
| competition_math | modelscope/competition_math | Math Competition | |
| gsm8k | modelscope/gsm8k | Math Problems | |
| hellaswag | modelscope/hellaswag | Commonsense Reasoning | |
| humaneval+ | modelscope/humaneval | Code Generation | |
| ifeval | modelscope/ifeval | Instruction Following | |
| mmlu | modelscope/mmlu | Comprehensive Exam | |
| mmlu_pro | modelscope/mmlu-pro | Comprehensive Exam | |
| race | modelscope/race | Reading Comprehension | |
| trivia_qa | modelscope/trivia_qa | Knowledge Q&A | |
| truthful_qa* | modelscope/truthful_qa | Safety | |

> **Note**
>
> \* Evaluation requires computing logits and similar model internals, so API service evaluation is not currently supported (`eval-type != server`).
>
> \+ Because evaluation involves executing generated code, running in a sandboxed environment (e.g., Docker) is recommended to avoid side effects on the local environment.
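One way to apply the sandbox recommendation is to run the whole evaluation inside a throwaway container. The sketch below is an assumption-laden example (the image tag and the `evalscope eval` CLI flags may differ in your setup):

```python
# Hedged sketch: run the code-execution benchmark inside a disposable
# Docker container so generated code cannot touch the host environment.
# Assumes Docker is installed; the image tag and CLI flags are illustrative.
import subprocess

subprocess.run(
    [
        "docker", "run", "--rm", "python:3.11-slim", "bash", "-lc",
        "pip install evalscope && "
        "evalscope eval --model Qwen/Qwen2.5-0.5B-Instruct --datasets humaneval",
    ],
    check=True,
)
```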

## 2. OpenCompass Backend

Refer to the detailed explanation
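As a rough sketch, evaluation can be routed through this backend via evalscope's `eval_backend`/`eval_config` fields; the exact keys below are assumptions based on that pattern, so verify them against the OpenCompass backend documentation:

```python
# Sketch of selecting the OpenCompass backend (field names are assumptions;
# check the backend docs for the exact eval_config schema).
from evalscope import TaskConfig, run_task

task = TaskConfig(
    eval_backend='OpenCompass',
    eval_config={
        'datasets': ['gsm8k'],                              # names from the list below
        'models': [{'path': 'Qwen/Qwen2.5-0.5B-Instruct'}], # hypothetical model entry
    },
)
run_task(task)
```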

**Language**

- Word Definition: WiC, SummEdits
- Idiom Learning: CHID
- Semantic Similarity: AFQMC, BUSTM
- Coreference Resolution: CLUEWSC, WSC, WinoGrande
- Translation: Flores, IWSLT2017
- Multi-language Question Answering: TyDi-QA, XCOPA
- Multi-language Summary: XLSum

**Knowledge**

- Knowledge Question Answering: BoolQ, CommonSenseQA, NaturalQuestions, TriviaQA

**Reasoning**

- Textual Entailment: CMNLI, OCNLI, OCNLI_FC, AX-b, AX-g, CB, RTE, ANLI
- Commonsense Reasoning: StoryCloze, COPA, ReCoRD, HellaSwag, PIQA, SIQA
- Mathematical Reasoning: MATH, GSM8K
- Theorem Application: TheoremQA, StrategyQA, SciBench
- Comprehensive Reasoning: BBH

**Examination**

- Junior High, High School, University, and Professional Examinations: C-Eval, AGIEval, MMLU, GAOKAO-Bench, CMMLU, ARC, Xiezhi
- Medical Examinations: CMB

**Understanding**

- Reading Comprehension: C3, CMRC, DRCD, MultiRC, RACE, DROP, OpenBookQA, SQuAD2.0
- Content Summary: CSL, LCSTS, XSum, SummScreen
- Content Analysis: EPRSTMT, LAMBADA, TNEWS

**Long Context**

- Long Context Understanding: LEval, LongBench, GovReports, NarrativeQA, Qasper

**Safety**

- Safety: CivilComments, CrowsPairs, CValues, JigsawMultilingual, TruthfulQA
- Robustness: AdvGLUE

**Code**

- Code: HumanEval, HumanEvalX, MBPP, APPs, DS1000

## 3. VLMEvalKit Backend

Refer to the detailed explanation
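Analogously to the OpenCompass backend, a hedged sketch of selecting this backend follows; the `eval_config` keys (`data`, `model`) are assumptions based on evalscope's backend-config pattern and should be confirmed against the VLMEvalKit backend documentation:

```python
# Sketch of selecting the VLMEvalKit backend (field names are assumptions;
# check the backend docs for the exact eval_config schema).
from evalscope import TaskConfig, run_task

task = TaskConfig(
    eval_backend='VLMEvalKit',
    eval_config={
        'data': ['MME'],                   # dataset names from the tables below
        'model': [{'name': 'qwen_chat'}],  # hypothetical model entry
    },
)
run_task(task)
```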

### Image Understanding Datasets

Abbreviations used:

- MCQ: multiple-choice questions
- Y/N: yes/no questions
- MTT: multi-turn dialogue evaluation
- MTI: multi-image input evaluation

| Dataset | Dataset Names | Task |
|---|---|---|
| MMBench Series: MMBench, MMBench-CN, CCBench | MMBench_DEV_[EN/CN]<br>MMBench_TEST_[EN/CN]<br>MMBench_DEV_[EN/CN]_V11<br>MMBench_TEST_[EN/CN]_V11<br>CCBench | MCQ |
| MMStar | MMStar | MCQ |
| MME | MME | Y/N |
| SEEDBench Series | SEEDBench_IMG<br>SEEDBench2<br>SEEDBench2_Plus | MCQ |
| MM-Vet | MMVet | VQA |
| MMMU | MMMU_[DEV_VAL/TEST] | MCQ |
| MathVista | MathVista_MINI | VQA |
| ScienceQA_IMG | ScienceQA_[VAL/TEST] | MCQ |
| COCO Caption | COCO_VAL | Caption |
| HallusionBench | HallusionBench | Y/N |
| OCRVQA* | OCRVQA_[TESTCORE/TEST] | VQA |
| TextVQA* | TextVQA_VAL | VQA |
| ChartQA* | ChartQA_TEST | VQA |
| AI2D | AI2D_[TEST/TEST_NO_MASK] | MCQ |
| LLaVABench | LLaVABench | VQA |
| DocVQA+ | DocVQA_[VAL/TEST] | VQA |
| InfoVQA+ | InfoVQA_[VAL/TEST] | VQA |
| OCRBench | OCRBench | VQA |
| RealWorldQA | RealWorldQA | MCQ |
| POPE | POPE | Y/N |
| Core-MM- | CORE_MM (MTI) | VQA |
| MMT-Bench | MMT-Bench_[VAL/ALL]<br>MMT-Bench_[VAL/ALL]_MI | MCQ (MTI) |
| MLLMGuard- | MLLMGuard_DS | VQA |
| AesBench+ | AesBench_[VAL/TEST] | MCQ |
| VCR-wiki+ | VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100] | VQA |
| MMLongBench-Doc+ | MMLongBench_DOC | VQA (MTI) |
| BLINK | BLINK | MCQ (MTI) |
| MathVision+ | MathVision<br>MathVision_MINI | VQA |
| MT-VQA+ | MTVQA_TEST | VQA |
| MMDU+ | MMDU | VQA (MTT, MTI) |
| Q-Bench1+ | Q-Bench1_[VAL/TEST] | MCQ |
| A-Bench+ | A-Bench_[VAL/TEST] | MCQ |
| DUDE+ | DUDE | VQA (MTI) |
| SlideVQA+ | SLIDEVQA<br>SLIDEVQA_MINI | VQA (MTI) |
| TaskMeAnything ImageQA Random+ | TaskMeAnything_v1_imageqa_random | MCQ |
| MMMB and Multilingual MMBench+ | MMMB_[ar/cn/en/pt/ru/tr]<br>MMBench_dev_[ar/cn/en/pt/ru/tr]<br>MMMB<br>MTL_MMBench_DEV<br>(MMMB and MTL_MMBench_DEV are all-in-one names covering all six languages) | MCQ |
| A-OKVQA+ | A-OKVQA | MCQ |
| MuirBench | MUIRBench | MCQ |
| GMAI-MMBench+ | GMAI-MMBench_VAL | MCQ |
| TableVQABench+ | TableVQABench | VQA |

> **Note**
>
> \* Test results are provided for only some models; the remaining models cannot achieve reasonable accuracy under zero-shot conditions.
>
> \+ Test results for this evaluation set have not yet been provided.
>
> \- VLMEvalKit only supports inference for this evaluation set and cannot output a final accuracy.

### Video Understanding Datasets

| Dataset | Dataset Name | Task |
|---|---|---|
| MMBench-Video | MMBench-Video | VQA |
| MVBench | MVBench_MP4 | MCQ |
| MLVU | MLVU | MCQ & VQA |
| TempCompass | TempCompass | MCQ & Y/N & Caption |
| LongVideoBench | LongVideoBench | MCQ |
| Video-MME | Video-MME | MCQ |

## 4. RAGEval Backend
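A hedged sketch of driving an embedding evaluation through this backend follows; all `eval_config` field names here (`tool`, `model`, `eval`) are assumptions modeled on evalscope's backend-config pattern, and the model id is hypothetical, so consult the RAGEval documentation for the real schema:

```python
# Sketch of selecting the RAGEval backend for a CMTEB/MTEB-style run.
# All config keys and the model id below are assumptions, not confirmed API.
from evalscope import TaskConfig, run_task

task = TaskConfig(
    eval_backend='RAGEval',
    eval_config={
        'tool': 'MTEB',
        'model': [{'model_name_or_path': 'AI-ModelScope/m3e-base'}],  # hypothetical embedding model
        'eval': {'tasks': ['T2Retrieval']},  # names from the CMTEB table below
    },
)
run_task(task)
```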

### CMTEB Evaluation Datasets

| Name | Hub Link | Description | Type | Category | Number of Test Samples |
|---|---|---|---|---|---|
| T2Retrieval | C-MTEB/T2Retrieval | T2Ranking: a large-scale Chinese passage-ranking benchmark | Retrieval | s2p | 24,832 |
| MMarcoRetrieval | C-MTEB/MMarcoRetrieval | mMARCO, the multilingual version of the MS MARCO passage-ranking dataset | Retrieval | s2p | 7,437 |
| DuRetrieval | C-MTEB/DuRetrieval | A large-scale Chinese passage-retrieval benchmark from web search | Retrieval | s2p | 4,000 |
| CovidRetrieval | C-MTEB/CovidRetrieval | COVID-19 news articles | Retrieval | s2p | 949 |
| CmedqaRetrieval | C-MTEB/CmedqaRetrieval | Online medical consultation texts | Retrieval | s2p | 3,999 |
| EcomRetrieval | C-MTEB/EcomRetrieval | Passage retrieval dataset collected from Alibaba's e-commerce search engine | Retrieval | s2p | 1,000 |
| MedicalRetrieval | C-MTEB/MedicalRetrieval | Passage retrieval dataset collected from Alibaba's medical search engine | Retrieval | s2p | 1,000 |
| VideoRetrieval | C-MTEB/VideoRetrieval | Passage retrieval dataset collected from Alibaba's video search engine | Retrieval | s2p | 1,000 |
| T2Reranking | C-MTEB/T2Reranking | T2Ranking: a large-scale Chinese passage-ranking benchmark | Re-ranking | s2p | 24,382 |
| MMarcoReranking | C-MTEB/MMarco-reranking | mMARCO, the multilingual version of the MS MARCO passage-ranking dataset | Re-ranking | s2p | 7,437 |
| CMedQAv1 | C-MTEB/CMedQAv1-reranking | Chinese community medical Q&A | Re-ranking | s2p | 2,000 |
| CMedQAv2 | C-MTEB/CMedQAv2-reranking | Chinese community medical Q&A | Re-ranking | s2p | 4,000 |
| Ocnli | C-MTEB/OCNLI | Original Chinese natural language inference dataset | Pair Classification | s2s | 3,000 |
| Cmnli | C-MTEB/CMNLI | Chinese multi-genre natural language inference | Pair Classification | s2s | 139,000 |
| CLSClusteringS2S | C-MTEB/CLSClusteringS2S | Clustering of titles from the CLS dataset, over 13 main categories | Clustering | s2s | 10,000 |
| CLSClusteringP2P | C-MTEB/CLSClusteringP2P | Clustering of titles + abstracts from the CLS dataset, over 13 main categories | Clustering | p2p | 10,000 |
| ThuNewsClusteringS2S | C-MTEB/ThuNewsClusteringS2S | Clustering of titles from the THUCNews dataset | Clustering | s2s | 10,000 |
| ThuNewsClusteringP2P | C-MTEB/ThuNewsClusteringP2P | Clustering of titles + abstracts from the THUCNews dataset | Clustering | p2p | 10,000 |
| ATEC | C-MTEB/ATEC | ATEC NLP sentence-pair similarity competition | STS | s2s | 20,000 |
| BQ | C-MTEB/BQ | Banking question semantic similarity | STS | s2s | 10,000 |
| LCQMC | C-MTEB/LCQMC | Large-scale Chinese question matching corpus | STS | s2s | 12,500 |
| PAWSX | C-MTEB/PAWSX | Translated PAWS evaluation pairs | STS | s2s | 2,000 |
| STSB | C-MTEB/STSB | STS-B translated into Chinese | STS | s2s | 1,360 |
| AFQMC | C-MTEB/AFQMC | Ant Financial question matching corpus | STS | s2s | 3,861 |
| QBQTC | C-MTEB/QBQTC | QQ Browser query-title corpus | STS | s2s | 5,000 |
| TNews | C-MTEB/TNews-classification | Short news text classification | Classification | s2s | 10,000 |
| IFlyTek | C-MTEB/IFlyTek-classification | Long-text classification of application descriptions | Classification | s2s | 2,600 |
| Waimai | C-MTEB/waimai-classification | Sentiment analysis of user reviews on food-delivery platforms | Classification | s2s | 1,000 |
| OnlineShopping | C-MTEB/OnlineShopping-classification | Sentiment analysis of user reviews on online-shopping websites | Classification | s2s | 1,000 |
| MultilingualSentiment | C-MTEB/MultilingualSentiment-classification | Multilingual sentiment datasets grouped into three classes: positive, neutral, negative | Classification | s2s | 3,000 |
| JDReview | C-MTEB/JDReview-classification | iPhone product reviews | Classification | s2s | 533 |

For retrieval tasks, 100,000 candidates (always including the ground-truth passages) are sampled from the entire corpus to reduce inference cost.
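The sketch below illustrates that subsampling protocol with a hypothetical helper (not evalscope code): draw candidates at random from the corpus while guaranteeing the ground truth stays in the pool.

```python
# Illustrative sketch of the candidate-subsampling protocol described above.
# sample_candidates is a hypothetical helper, not part of any library API.
import random

def sample_candidates(corpus_ids, positive_ids, k=100_000, seed=42):
    """Return k candidate ids: all positives plus random negatives."""
    rng = random.Random(seed)
    positives = set(positive_ids)
    pool = [i for i in corpus_ids if i not in positives]  # negatives only
    n_negatives = min(max(0, k - len(positives)), len(pool))
    return list(positives) + rng.sample(pool, n_negatives)
```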

### MTEB Evaluation Datasets

> **See also:** MTEB Related Tasks

### CLIP-Benchmark

| Dataset Name | Task Type | Notes |
|---|---|---|
| muge | zeroshot_retrieval | Chinese multimodal dataset |
| flickr30k | zeroshot_retrieval | |
| flickr8k | zeroshot_retrieval | |
| mscoco_captions | zeroshot_retrieval | |
| mscoco_captions2017 | zeroshot_retrieval | |
| imagenet1k | zeroshot_classification | |
| imagenetv2 | zeroshot_classification | |
| imagenet_sketch | zeroshot_classification | |
| imagenet-a | zeroshot_classification | |
| imagenet-r | zeroshot_classification | |
| imagenet-o | zeroshot_classification | |
| objectnet | zeroshot_classification | |
| fer2013 | zeroshot_classification | |
| voc2007 | zeroshot_classification | |
| voc2007_multilabel | zeroshot_classification | |
| sun397 | zeroshot_classification | |
| cars | zeroshot_classification | |
| fgvc_aircraft | zeroshot_classification | |
| mnist | zeroshot_classification | |
| stl10 | zeroshot_classification | |
| gtsrb | zeroshot_classification | |
| country211 | zeroshot_classification | |
| renderedsst2 | zeroshot_classification | |
| vtab_caltech101 | zeroshot_classification | |
| vtab_cifar10 | zeroshot_classification | |
| vtab_cifar100 | zeroshot_classification | |
| vtab_clevr_count_all | zeroshot_classification | |
| vtab_clevr_closest_object_distance | zeroshot_classification | |
| vtab_diabetic_retinopathy | zeroshot_classification | |
| vtab_dmlab | zeroshot_classification | |
| vtab_dsprites_label_orientation | zeroshot_classification | |
| vtab_dsprites_label_x_position | zeroshot_classification | |
| vtab_dsprites_label_y_position | zeroshot_classification | |
| vtab_dtd | zeroshot_classification | |
| vtab_eurosat | zeroshot_classification | |
| vtab_kitti_closest_vehicle_distance | zeroshot_classification | |
| vtab_flowers | zeroshot_classification | |
| vtab_pets | zeroshot_classification | |
| vtab_pcam | zeroshot_classification | |
| vtab_resisc45 | zeroshot_classification | |
| vtab_smallnorb_label_azimuth | zeroshot_classification | |
| vtab_smallnorb_label_elevation | zeroshot_classification | |
| vtab_svhn | zeroshot_classification | |