Supported Datasets

1. Natively Supported Datasets

Tip

The framework natively supports the datasets listed below. If the dataset you need is not on the list, you can submit an issue and we will add support as soon as possible. Alternatively, you can follow the Benchmark Addition Guide to add the dataset yourself and submit a PR. Contributions are welcome.

You can also use other tools supported by this framework for evaluation, such as OpenCompass for language model evaluation, or VLMEvalKit for multimodal model evaluation.

LLM Evaluation Datasets

In the table below, a bracketed number after a dataset name refers to the numbered notes at the end of this section.

| Name | Dataset ID | Task Category | Remarks |
| --- | --- | --- | --- |
| aime24 | HuggingFaceH4/aime_2024 | Math Competition | |
| aime25 | opencompass/AIME2025 | Math Competition | Parts 1 and 2 |
| alpaca_eval [3] | AI-ModelScope/alpaca_eval | Instruction Following | Note: length-controlled win rate is not currently supported; the official judge model is gpt-4-1106-preview and the baseline model is gpt-4-turbo |
| arc | modelscope/ai2_arc | Exam | |
| arena_hard [3] | AI-ModelScope/arena-hard-auto-v0.1 | Comprehensive Reasoning | Note: style control is not currently supported; the official judge model is gpt-4-1106-preview and the baseline model is gpt-4-0314 |
| bbh | modelscope/bbh | Comprehensive Reasoning | |
| ceval | modelscope/ceval-exam | Chinese Comprehensive Exam | |
| chinese_simpleqa [3] | AI-ModelScope/Chinese-SimpleQA | Chinese Knowledge Q&A | Uses the primary_category field as sub-dataset |
| cmmlu | modelscope/cmmlu | Chinese Comprehensive Exam | |
| competition_math | modelscope/competition_math | Math Competition | Uses the level field as sub-dataset |
| docmath | yale-nlp/DocMath-Eval | Document-Level Mathematical Reasoning | Uses the testmini subset |
| drop | AI-ModelScope/DROP | Reading Comprehension, Reasoning | |
| frames | iic/frames | Long Text Comprehension | Answers are matched by exact match by default; configuring a judge model can improve matching accuracy |
| gpqa | modelscope/gpqa | Expert-Level Examination | |
| gsm8k | modelscope/gsm8k | Math Problems | |
| hellaswag | modelscope/hellaswag | Common Sense Reasoning | |
| humaneval [2] | modelscope/humaneval | Code Generation | |
| ifeval [4] | modelscope/ifeval | Instruction Following | |
| iquiz | modelscope/iquiz | IQ and EQ | |
| live_code_bench [2, 4] | AI-ModelScope/code_generation_lite | Code Generation | Parameter description: sub-datasets support the release_v1, release_v5, v1, and v4_v5 version tags; dataset-args supports setting {'extra_params': {'start_date': '2024-12-01','end_date': '2025-01-01'}} to select questions from a specific time range |
| math_500 | AI-ModelScope/MATH-500 | Math Competition | Uses the level field as sub-dataset |
| maritime_bench | HiDolphin/MaritimeBench | Maritime Knowledge | |
| mmlu | modelscope/mmlu | Comprehensive Exam | |
| mmlu_pro | modelscope/mmlu-pro | Comprehensive Exam | Uses the category field as sub-dataset |
| mmlu_redux | AI-ModelScope/mmlu-redux-2.0 | Comprehensive Exam | |
| musr | AI-ModelScope/MuSR | Multi-step Soft Reasoning | |
| needle_haystack [3] | AI-ModelScope/Needle-in-a-Haystack-Corpus | Needle-in-a-Haystack Test | Generates heatmap images in outputs/reports that visualize model performance; refer to the doc |
| process_bench | Qwen/ProcessBench | Mathematical Process Reasoning | |
| race | modelscope/race | Reading Comprehension | |
| simple_qa [3] | AI-ModelScope/SimpleQA | Knowledge Q&A | |
| super_gpqa | m-a-p/SuperGPQA | Expert-Level Examination | Uses the field named "field" as sub-dataset |
| tool_bench | AI-ModelScope/ToolBench-Statich | Tool Calling | Refer to the usage doc |
| trivia_qa | modelscope/trivia_qa | Knowledge Q&A | |
| truthful_qa [1] | modelscope/truthful_qa | Safety | |
| winogrande | AI-ModelScope/winogrande_val | Reasoning | |
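The live_code_bench remark mentions selecting questions by time range through start_date and end_date. The sketch below only illustrates what such a filter does; it is not the framework's implementation. The function filter_by_date, the contest_date key, and the sample records are hypothetical, and treating both bounds as inclusive is an assumption.

```python
from datetime import date


def filter_by_date(questions, start_date=None, end_date=None):
    """Keep questions whose date lies in [start_date, end_date].

    Both bounds are treated as inclusive (an assumption) and are given
    as ISO strings such as '2024-12-01'. A missing bound is not applied.
    """
    start = date.fromisoformat(start_date) if start_date else None
    end = date.fromisoformat(end_date) if end_date else None
    kept = []
    for question in questions:
        released = date.fromisoformat(question["contest_date"])
        if start is not None and released < start:
            continue
        if end is not None and released > end:
            continue
        kept.append(question)
    return kept


# Hypothetical records; real questions carry many more fields.
questions = [
    {"id": "q1", "contest_date": "2024-11-30"},
    {"id": "q2", "contest_date": "2024-12-01"},
    {"id": "q3", "contest_date": "2024-12-15"},
    {"id": "q4", "contest_date": "2025-01-02"},
]

# Same shape as the extra_params value shown in the remark.
extra_params = {"start_date": "2024-12-01", "end_date": "2025-01-01"}
selected = filter_by_date(questions, **extra_params)
print([q["id"] for q in selected])  # ['q2', 'q3']
```

Restricting the date range this way is commonly used to evaluate a model only on questions released after its training data cutoff.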

Note

1. Evaluation requires computing logits, so evaluation through an API service is not currently supported (requires eval-type != server).

2. Because evaluation executes code, running it in a sandbox environment (e.g., Docker) is recommended to avoid affecting the local environment.

3. This dataset requires specifying a Judge Model for evaluation. Refer to Judge Parameters.

4. For better evaluation results, reasoning models should configure post-processing suited to the dataset, such as {"filters": {"remove_until": "</think>"}}.
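To make the effect of a remove_until filter concrete, here is a minimal sketch of the idea: discard the reasoning trace and keep only the text after the closing marker. The function below is a hypothetical stand-in, not the framework's code; cutting at the last occurrence of the marker and trimming surrounding whitespace are both assumptions.

```python
def remove_until(text: str, marker: str = "</think>") -> str:
    """Return the text after the last occurrence of marker.

    If the marker is absent, the text is returned unchanged.
    Using the last occurrence and stripping whitespace are assumptions
    made for this illustration.
    """
    _, found, tail = text.rpartition(marker)
    return tail.strip() if found else text


raw_output = "<think>2 + 2 is 4, so the answer is 4.</think>\nThe answer is 4."
print(remove_until(raw_output))  # The answer is 4.
```

Without this step, answer extraction may pick up intermediate values from the reasoning trace instead of the final answer.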

AIGC Evaluation Datasets

This framework also supports evaluation datasets related to text-to-image and other AIGC tasks. The specific datasets are as follows:

| Name | Dataset ID | Task Type | Remarks |
| --- | --- | --- | --- |
| general_t2i | | General Text-to-Image | Refer to the tutorial |
| evalmuse | AI-ModelScope/T2V-Eval-Prompts | Text-Image Consistency | EvalMuse subset, default metric is FGA_BLIP2Score |
| genai_bench | AI-ModelScope/T2V-Eval-Prompts | Text-Image Consistency | GenAI-Bench-1600 subset, default metric is VQAScore |
| hpdv2 | AI-ModelScope/T2V-Eval-Prompts | Text-Image Consistency | HPDv2 subset, default metric is HPSv2.1Score |
| tifa160 | AI-ModelScope/T2V-Eval-Prompts | Text-Image Consistency | TIFA160 subset, default metric is PickScore |

2. OpenCompass Backend

Refer to the detailed explanation

| Category | Subcategory | Datasets |
| --- | --- | --- |
| Language | Word Definition | WiC, SummEdits |
| Language | Idiom Learning | CHID |
| Language | Semantic Similarity | AFQMC, BUSTM |
| Language | Coreference Resolution | CLUEWSC, WSC, WinoGrande |
| Language | Translation | Flores, IWSLT2017 |
| Language | Multi-language Question Answering | TyDi-QA, XCOPA |
| Language | Multi-language Summary | XLSum |
| Knowledge | Knowledge Question Answering | BoolQ, CommonSenseQA, NaturalQuestions, TriviaQA |
| Reasoning | Textual Entailment | CMNLI, OCNLI, OCNLI_FC, AX-b, AX-g, CB, RTE, ANLI |
| Reasoning | Commonsense Reasoning | StoryCloze, COPA, ReCoRD, HellaSwag, PIQA, SIQA |
| Reasoning | Mathematical Reasoning | MATH, GSM8K |
| Reasoning | Theorem Application | TheoremQA, StrategyQA, SciBench |
| Reasoning | Comprehensive Reasoning | BBH |
| Examination | Junior High, High School, University, Professional Examinations | C-Eval, AGIEval, MMLU, GAOKAO-Bench, CMMLU, ARC, Xiezhi |
| Examination | Medical Examinations | CMB |
| Understanding | Reading Comprehension | C3, CMRC, DRCD, MultiRC, RACE, DROP, OpenBookQA, SQuAD2.0 |
| Understanding | Content Summary | CSL, LCSTS, XSum, SummScreen |
| Understanding | Content Analysis | EPRSTMT, LAMBADA, TNEWS |
| Long Context | Long Context Understanding | LEval, LongBench, GovReports, NarrativeQA, Qasper |
| Safety | Safety | CivilComments, CrowsPairs, CValues, JigsawMultilingual, TruthfulQA |
| Safety | Robustness | AdvGLUE |
| Code | Code | HumanEval, HumanEvalX, MBPP, APPs, DS1000 |

3. VLMEvalKit Backend

Note

For more comprehensive instructions and an up-to-date list of datasets, please refer to the detailed instructions.

Image Understanding Dataset

Abbreviations used:

  • MCQ: Multiple Choice Questions;

  • Y/N: Yes/No Questions;

  • MTT: Multiturn Dialogue Evaluation;

  • MTI: Multi-image Input Evaluation

| Dataset | Dataset Names | Task |
| --- | --- | --- |
| MMBench Series: MMBench, MMBench-CN, CCBench | MMBench_DEV_[EN/CN], MMBench_TEST_[EN/CN], MMBench_DEV_[EN/CN]_V11, MMBench_TEST_[EN/CN]_V11, CCBench | MCQ |
| MMStar | MMStar | MCQ |
| MME | MME | Y/N |
| SEEDBench Series | SEEDBench_IMG, SEEDBench2, SEEDBench2_Plus | MCQ |
| MM-Vet | MMVet | VQA |
| MMMU | MMMU_[DEV_VAL/TEST] | MCQ |
| MathVista | MathVista_MINI | VQA |
| ScienceQA_IMG | ScienceQA_[VAL/TEST] | MCQ |
| COCO Caption | COCO_VAL | Caption |
| HallusionBench | HallusionBench | Y/N |
| OCRVQA* | OCRVQA_[TESTCORE/TEST] | VQA |
| TextVQA* | TextVQA_VAL | VQA |
| ChartQA* | ChartQA_TEST | VQA |
| AI2D | AI2D_[TEST/TEST_NO_MASK] | MCQ |
| LLaVABench | LLaVABench | VQA |
| DocVQA+ | DocVQA_[VAL/TEST] | VQA |
| InfoVQA+ | InfoVQA_[VAL/TEST] | VQA |
| OCRBench | OCRBench | VQA |
| RealWorldQA | RealWorldQA | MCQ |
| POPE | POPE | Y/N |
| Core-MM - | CORE_MM (MTI) | VQA |
| MMT-Bench | MMT-Bench_[VAL/ALL], MMT-Bench_[VAL/ALL]_MI | MCQ (MTI) |
| MLLMGuard - | MLLMGuard_DS | VQA |
| AesBench+ | AesBench_[VAL/TEST] | MCQ |
| VCR-wiki+ | VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100] | VQA |
| MMLongBench-Doc+ | MMLongBench_DOC | VQA (MTI) |
| BLINK | BLINK | MCQ (MTI) |
| MathVision+ | MathVision, MathVision_MINI | VQA |
| MT-VQA+ | MTVQA_TEST | VQA |
| MMDU+ | MMDU | VQA (MTT, MTI) |
| Q-Bench1+ | Q-Bench1_[VAL/TEST] | MCQ |
| A-Bench+ | A-Bench_[VAL/TEST] | MCQ |
| DUDE+ | DUDE | VQA (MTI) |
| SlideVQA+ | SLIDEVQA, SLIDEVQA_MINI | VQA (MTI) |
| TaskMeAnything ImageQA Random+ | TaskMeAnything_v1_imageqa_random | MCQ |
| MMMB and Multilingual MMBench+ | MMMB_[ar/cn/en/pt/ru/tr], MMBench_dev_[ar/cn/en/pt/ru/tr], MMMB, MTL_MMBench_DEV (MMMB and MTL_MMBench_DEV are all-in-one names covering all 6 languages) | MCQ |
| A-OKVQA+ | A-OKVQA | MCQ |
| MuirBench | MUIRBench | MCQ |
| GMAI-MMBench+ | GMAI-MMBench_VAL | MCQ |
| TableVQABench+ | TableVQABench | VQA |

Note

* Test results are provided for only some models; the remaining models cannot achieve reasonable accuracy under zero-shot conditions.

+ Testing results for this evaluation set have not yet been provided.

- VLMEvalKit only supports inference for this evaluation set and cannot output final accuracy.
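Several entries in the dataset names column use a compact bracket notation such as MMBench_DEV_[EN/CN]. Reading each bracket as a set of alternatives separated by slashes is our interpretation of the table, not something the text states, and the helper expand_names below is hypothetical. Under that assumption, a short sketch for listing the concrete names a pattern stands for:

```python
import itertools
import re


def expand_names(pattern: str) -> list[str]:
    """Expand 'A_[X/Y]_B' style patterns into concrete names.

    Each bracketed group is read as slash-separated alternatives
    (an assumed reading of the table notation).
    """
    # re.split with a capture group alternates literal text (even
    # indices) and bracket contents (odd indices).
    parts = re.split(r"\[([^\]]+)\]", pattern)
    choices = [
        part.split("/") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


print(expand_names("MMBench_DEV_[EN/CN]_V11"))
# ['MMBench_DEV_EN_V11', 'MMBench_DEV_CN_V11']
print(len(expand_names("VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100]")))  # 12
```

Names without brackets, such as MMStar, pass through unchanged as a single-element list.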

Video Understanding Dataset

| Dataset | Dataset Name | Task |
| --- | --- | --- |
| MMBench-Video | MMBench-Video | VQA |
| MVBench | MVBench_MP4 | MCQ |
| MLVU | MLVU | MCQ & VQA |
| TempCompass | TempCompass | MCQ & Y/N & Caption |
| LongVideoBench | LongVideoBench | MCQ |
| Video-MME | Video-MME | MCQ |

4. RAGEval Backend

CMTEB Evaluation Dataset

| Name | Hub Link | Description | Type | Category | Number of Test Samples |
| --- | --- | --- | --- | --- | --- |
| T2Retrieval | C-MTEB/T2Retrieval | T2Ranking: A large-scale Chinese paragraph ranking benchmark | Retrieval | s2p | 24,832 |
| MMarcoRetrieval | C-MTEB/MMarcoRetrieval | mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset | Retrieval | s2p | 7,437 |
| DuRetrieval | C-MTEB/DuRetrieval | A large-scale Chinese web search engine paragraph retrieval benchmark | Retrieval | s2p | 4,000 |
| CovidRetrieval | C-MTEB/CovidRetrieval | COVID-19 news articles | Retrieval | s2p | 949 |
| CmedqaRetrieval | C-MTEB/CmedqaRetrieval | Online medical consultation texts | Retrieval | s2p | 3,999 |
| EcomRetrieval | C-MTEB/EcomRetrieval | Paragraph retrieval dataset collected from Alibaba e-commerce search engine systems | Retrieval | s2p | 1,000 |
| MedicalRetrieval | C-MTEB/MedicalRetrieval | Paragraph retrieval dataset collected from Alibaba medical search engine systems | Retrieval | s2p | 1,000 |
| VideoRetrieval | C-MTEB/VideoRetrieval | Paragraph retrieval dataset collected from Alibaba video search engine systems | Retrieval | s2p | 1,000 |
| T2Reranking | C-MTEB/T2Reranking | T2Ranking: A large-scale Chinese paragraph ranking benchmark | Re-ranking | s2p | 24,382 |
| MMarcoReranking | C-MTEB/MMarco-reranking | mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset | Re-ranking | s2p | 7,437 |
| CMedQAv1 | C-MTEB/CMedQAv1-reranking | Chinese community medical Q&A | Re-ranking | s2p | 2,000 |
| CMedQAv2 | C-MTEB/CMedQAv2-reranking | Chinese community medical Q&A | Re-ranking | s2p | 4,000 |
| Ocnli | C-MTEB/OCNLI | Original Chinese natural language inference dataset | Pair Classification | s2s | 3,000 |
| Cmnli | C-MTEB/CMNLI | Chinese multi-class natural language inference | Pair Classification | s2s | 139,000 |
| CLSClusteringS2S | C-MTEB/CLSClusteringS2S | Clustering titles from the CLS dataset. Clustering based on 13 sets of main categories. | Clustering | s2s | 10,000 |
| CLSClusteringP2P | C-MTEB/CLSClusteringP2P | Clustering titles + abstracts from the CLS dataset. Clustering based on 13 sets of main categories. | Clustering | p2p | 10,000 |
| ThuNewsClusteringS2S | C-MTEB/ThuNewsClusteringS2S | Clustering titles from the THUCNews dataset | Clustering | s2s | 10,000 |
| ThuNewsClusteringP2P | C-MTEB/ThuNewsClusteringP2P | Clustering titles + abstracts from the THUCNews dataset | Clustering | p2p | 10,000 |
| ATEC | C-MTEB/ATEC | ATEC NLP Sentence Pair Similarity Competition | STS | s2s | 20,000 |
| BQ | C-MTEB/BQ | Banking Question Semantic Similarity | STS | s2s | 10,000 |
| LCQMC | C-MTEB/LCQMC | Large-scale Chinese Question Matching Corpus | STS | s2s | 12,500 |
| PAWSX | C-MTEB/PAWSX | Translated PAWS evaluation pairs | STS | s2s | 2,000 |
| STSB | C-MTEB/STSB | Translated STS-B into Chinese | STS | s2s | 1,360 |
| AFQMC | C-MTEB/AFQMC | Ant Financial Question Matching Corpus | STS | s2s | 3,861 |
| QBQTC | C-MTEB/QBQTC | QQ Browser Query Title Corpus | STS | s2s | 5,000 |
| TNews | C-MTEB/TNews-classification | News Short Text Classification | Classification | s2s | 10,000 |
| IFlyTek | C-MTEB/IFlyTek-classification | Long Text Classification of Application Descriptions | Classification | s2s | 2,600 |
| Waimai | C-MTEB/waimai-classification | Sentiment Analysis of User Reviews on Food Delivery Platforms | Classification | s2s | 1,000 |
| OnlineShopping | C-MTEB/OnlineShopping-classification | Sentiment Analysis of User Reviews on Online Shopping Websites | Classification | s2s | 1,000 |
| MultilingualSentiment | C-MTEB/MultilingualSentiment-classification | A set of multilingual sentiment datasets grouped into three categories: positive, neutral, negative | Classification | s2s | 3,000 |
| JDReview | C-MTEB/JDReview-classification | Reviews of iPhone | Classification | s2s | 533 |

For retrieval tasks, a sample of 100,000 candidates (including the ground truth) is drawn from the entire corpus to reduce inference costs.
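The sentence above describes shrinking the retrieval corpus while keeping the ground truth among the candidates. The following is a minimal sketch of that idea using only the standard library; sample_candidates, its parameters, and the toy document ids are hypothetical, and the real sampling procedure may differ.

```python
import random


def sample_candidates(corpus_ids, relevant_ids, k, seed=0):
    """Pick k candidate ids that always contain every relevant id.

    The remaining slots are filled with documents drawn at random
    from the rest of the corpus.
    """
    relevant = set(relevant_ids)
    if len(relevant) > k:
        raise ValueError("k is smaller than the number of relevant documents")
    pool = [doc_id for doc_id in corpus_ids if doc_id not in relevant]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    extra = rng.sample(pool, min(k - len(relevant), len(pool)))
    candidates = sorted(relevant) + extra
    rng.shuffle(candidates)  # avoid placing ground truth first
    return candidates


# Toy corpus of 1,000 documents reduced to 100 candidates.
corpus = [f"doc{i}" for i in range(1000)]
candidates = sample_candidates(corpus, ["doc7", "doc42"], k=100)
print(len(candidates), "doc7" in candidates, "doc42" in candidates)
# 100 True True
```

Because the ground truth is always retained, a perfect retriever can still reach the top score, while the encoder only has to embed the sampled subset.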

MTEB Evaluation Dataset

See also: MTEB Related Tasks

CLIP-Benchmark

| Dataset Name | Task Type | Notes |
| --- | --- | --- |
| muge | zeroshot_retrieval | Chinese Multimodal Dataset |
| flickr30k | zeroshot_retrieval | |
| flickr8k | zeroshot_retrieval | |
| mscoco_captions | zeroshot_retrieval | |
| mscoco_captions2017 | zeroshot_retrieval | |
| imagenet1k | zeroshot_classification | |
| imagenetv2 | zeroshot_classification | |
| imagenet_sketch | zeroshot_classification | |
| imagenet-a | zeroshot_classification | |
| imagenet-r | zeroshot_classification | |
| imagenet-o | zeroshot_classification | |
| objectnet | zeroshot_classification | |
| fer2013 | zeroshot_classification | |
| voc2007 | zeroshot_classification | |
| voc2007_multilabel | zeroshot_classification | |
| sun397 | zeroshot_classification | |
| cars | zeroshot_classification | |
| fgvc_aircraft | zeroshot_classification | |
| mnist | zeroshot_classification | |
| stl10 | zeroshot_classification | |
| gtsrb | zeroshot_classification | |
| country211 | zeroshot_classification | |
| renderedsst2 | zeroshot_classification | |
| vtab_caltech101 | zeroshot_classification | |
| vtab_cifar10 | zeroshot_classification | |
| vtab_cifar100 | zeroshot_classification | |
| vtab_clevr_count_all | zeroshot_classification | |
| vtab_clevr_closest_object_distance | zeroshot_classification | |
| vtab_diabetic_retinopathy | zeroshot_classification | |
| vtab_dmlab | zeroshot_classification | |
| vtab_dsprites_label_orientation | zeroshot_classification | |
| vtab_dsprites_label_x_position | zeroshot_classification | |
| vtab_dsprites_label_y_position | zeroshot_classification | |
| vtab_dtd | zeroshot_classification | |
| vtab_eurosat | zeroshot_classification | |
| vtab_kitti_closest_vehicle_distance | zeroshot_classification | |
| vtab_flowers | zeroshot_classification | |
| vtab_pets | zeroshot_classification | |
| vtab_pcam | zeroshot_classification | |
| vtab_resisc45 | zeroshot_classification | |
| vtab_smallnorb_label_azimuth | zeroshot_classification | |
| vtab_smallnorb_label_elevation | zeroshot_classification | |
| vtab_svhn | zeroshot_classification | |