Supported Datasets#
1. Natively Supported Datasets#
Tip
The framework natively supports the datasets listed below. If the dataset you need is not on the list, you may submit an issue and we will add support as soon as possible. Alternatively, you can follow the Benchmark Addition Guide to add the dataset yourself and submit a PR; contributions are welcome.
You can also use other tools supported by this framework for evaluation, such as OpenCompass for language model evaluation, or VLMEvalKit for multimodal model evaluation.
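For orientation, here is a minimal sketch of how a natively supported dataset might be selected for evaluation. It assumes a Python entry point along the lines of `TaskConfig`/`run_task`; the import path, field names, model ID, and dataset names (`gsm8k`, `arc`) are illustrative assumptions, so check the framework's quick-start guide for the actual interface.

```python
# Minimal sketch (names are assumptions): evaluate a model on two natively
# supported datasets by passing their registered dataset names.
from evalscope import TaskConfig, run_task  # assumed entry point

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',  # any model ID or API-served model
    datasets=['gsm8k', 'arc'],           # dataset names as registered by the framework
    limit=10,                            # subsample for a quick smoke test
)

run_task(task_cfg=task_cfg)
```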
LLM Evaluation Datasets#
| Name | Dataset ID | Task Category | Remarks |
|---|---|---|---|
| | | Math Competition | |
| | | Math Competition | Part 1, 2 |
| | | Instruction Following | |
| | | Exam | |
| | | Comprehensive Reasoning | |
| | | Comprehensive Reasoning | |
| | | Chinese Comprehensive Exam | |
| | | Chinese Knowledge Q&A | Use |
| | | Chinese Comprehensive Exam | |
| | | Math Competition | Use |
| | | Document-Level Mathematical Reasoning | Utilizes the |
| | | Reading Comprehension, Reasoning | |
| | | Long Text Comprehension | Default is |
| | | Expert-Level Examination | |
| | | Math Problems | |
| | | Common Sense Reasoning | |
| | | Code Generation | |
| | | Instruction Following | |
| | | IQ and EQ | |
| | | Code Generation | |
| | | Math Competition | Use |
| | | Maritime Knowledge | |
| | | Comprehensive Exam | |
| | | Comprehensive Exam | Use |
| | | Comprehensive Exam | |
| | | Multi-step Soft Reasoning | |
| | | Needle-in-a-Haystack Test | Generates corresponding heatmap images in outputs/reports for easy observation of model performance; refer to the doc |
| | | Mathematical Process Reasoning | |
| | | Reading Comprehension | |
| | | Knowledge Q&A | |
| | | Expert-Level Examination | Use |
| | | Tool Calling | Refer to the usage doc |
| | | Knowledge Q&A | |
| | | Safety | |
| | | Reasoning | |
Note
1. Evaluation requires computing logits and is not currently supported for API service evaluation (i.e., `eval-type != server`).
2. Because evaluation involves executing generated code, it is recommended to run in a sandbox environment (e.g., Docker) to avoid affecting the local environment.
3. This dataset requires specifying a judge model for evaluation; refer to Judge Parameters.
4. For better results with reasoning models, it is recommended to configure post-processing appropriate to the dataset, such as `{"filters": {"remove_until": "</think>"}}`.
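Building on notes 3 and 4, the sketch below shows one way such a post-processing filter and a judge model could be wired into a run. Only the `{"filters": {"remove_until": "</think>"}}` structure comes from the note itself; the `dataset_args` and `judge_model_args` field names, the model identifiers, and the API endpoint are assumptions for illustration.

```python
# Sketch only: field names (dataset_args, judge_model_args) are assumptions;
# the filter structure is taken from note 4 above.
from evalscope import TaskConfig, run_task  # assumed entry point

task_cfg = TaskConfig(
    model='my-reasoning-model',        # hypothetical model identifier
    datasets=['gsm8k'],                # hypothetical dataset name
    dataset_args={
        'gsm8k': {
            # strip the chain-of-thought so only the final answer is scored
            'filters': {'remove_until': '</think>'},
        },
    },
    judge_model_args={                 # only needed for judge-based datasets (note 3)
        'model_id': 'qwen-plus',
        'api_url': 'https://example.com/v1',
        'api_key': 'YOUR_API_KEY',
    },
)

run_task(task_cfg=task_cfg)
```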
AIGC Evaluation Datasets#
This framework also supports evaluation datasets related to text-to-image and other AIGC tasks. The specific datasets are as follows:
| Name | Dataset ID | Task Type | Remarks |
|---|---|---|---|
| | | General Text-to-Image | Refer to the tutorial |
| | | Text-Image Consistency | EvalMuse subset, default metric is |
| | | Text-Image Consistency | GenAI-Bench-1600 subset, default metric is |
| | | Text-Image Consistency | HPDv2 subset, default metric is |
| | | Text-Image Consistency | TIFA160 subset, default metric is |
2. OpenCompass Backend#
Refer to the detailed explanation
| Language | Knowledge | Reasoning | Examination |
|---|---|---|---|
| Word Definition, Idiom Learning, Semantic Similarity, Coreference Resolution, Translation, Multi-language Question Answering, Multi-language Summary | Knowledge Question Answering | Textual Entailment, Commonsense Reasoning, Mathematical Reasoning, Theorem Application, Comprehensive Reasoning | Junior High / High School / University / Professional Examinations, Medical Examinations |

| Understanding | Long Context | Safety | Code |
|---|---|---|---|
| Reading Comprehension, Content Summary, Content Analysis | Long Context Understanding | Safety, Robustness | Code |
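Running these OpenCompass-hosted tasks through this framework typically amounts to switching the evaluation backend and passing an OpenCompass-style configuration through. The sketch below assumes an `eval_backend`/`eval_config` pair; the field names, dataset names, and model entry are placeholders rather than a confirmed schema, so refer to the detailed explanation above for the exact format.

```python
# Sketch under assumed field names (eval_backend, eval_config): delegate the
# run to the OpenCompass backend for the task categories listed above.
from evalscope import TaskConfig, run_task  # assumed entry point

task_cfg = TaskConfig(
    eval_backend='OpenCompass',
    eval_config={
        'datasets': ['gsm8k', 'ceval'],  # hypothetical OpenCompass dataset names
        'models': [
            {
                'path': 'Qwen/Qwen2.5-0.5B-Instruct',
                'openai_api_base': 'http://127.0.0.1:8000/v1/chat/completions',
                'is_chat': True,
            },
        ],
        'limit': 10,                     # subsample for a quick check
    },
)

run_task(task_cfg=task_cfg)
```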
3. VLMEvalKit Backend#
Note
For more comprehensive instructions and an up-to-date list of datasets, please refer to detailed instructions.
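Delegating to the VLMEvalKit backend would follow the same pattern. In the sketch below, the `eval_backend`/`eval_config` convention, the `CustomAPIModel` entry, and the endpoint are assumptions; only the dataset names (e.g. `MMBench_DEV_EN`, `MME`) come from the tables that follow.

```python
# Sketch under assumed field names: route a multimodal evaluation through the
# VLMEvalKit backend using dataset names from the tables below.
from evalscope import TaskConfig, run_task  # assumed entry point

task_cfg = TaskConfig(
    eval_backend='VLMEvalKit',
    eval_config={
        'data': ['MMBench_DEV_EN', 'MME'],  # VLMEvalKit dataset names
        'model': [
            {
                'name': 'CustomAPIModel',    # hypothetical API-served model entry
                'type': 'qwen-vl-chat',
                'api_base': 'http://127.0.0.1:8000/v1/chat/completions',
            },
        ],
        'limit': 20,                         # subsample for a quick check
    },
)

run_task(task_cfg=task_cfg)
```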
Image Understanding Dataset#
Abbreviations used:
* `MCQ`: Multiple Choice Questions
* `Y/N`: Yes/No Questions
* `MTT`: Multi-turn Dialogue Evaluation
* `MTI`: Multi-image Input Evaluation
| Dataset | Dataset Names | Task |
|---|---|---|
| MMBench Series: | MMBench_DEV_[EN/CN] | MCQ |
| | MMStar | MCQ |
| | MME | Y/N |
| | SEEDBench_IMG | MCQ |
| | MMVet | VQA |
| | MMMU_[DEV_VAL/TEST] | MCQ |
| | MathVista_MINI | VQA |
| | ScienceQA_[VAL/TEST] | MCQ |
| | COCO_VAL | Caption |
| | HallusionBench | Y/N |
| | OCRVQA_[TESTCORE/TEST] | VQA |
| | TextVQA_VAL | VQA |
| | ChartQA_TEST | VQA |
| | AI2D_[TEST/TEST_NO_MASK] | MCQ |
| | LLaVABench | VQA |
| | DocVQA_[VAL/TEST] | VQA |
| | InfoVQA_[VAL/TEST] | VQA |
| | OCRBench | VQA |
| | RealWorldQA | MCQ |
| | POPE | Y/N |
| | CORE_MM (MTI) | VQA |
| | MMT-Bench_[VAL/ALL] | MCQ (MTI) |
| | MLLMGuard_DS | VQA |
| | AesBench_[VAL/TEST] | MCQ |
| VCR-wiki+ | VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100] | VQA |
| | MMLongBench_DOC | VQA (MTI) |
| | BLINK | MCQ (MTI) |
| | MathVision | VQA |
| | MTVQA_TEST | VQA |
| MMDU+ | MMDU | VQA (MTT, MTI) |
| | Q-Bench1_[VAL/TEST] | MCQ |
| | A-Bench_[VAL/TEST] | MCQ |
| DUDE+ | DUDE | VQA (MTI) |
| | SLIDEVQA | VQA (MTI) |
| | TaskMeAnything_v1_imageqa_random | MCQ |
| | MMMB_[ar/cn/en/pt/ru/tr] | MCQ |
| | A-OKVQA | MCQ |
| | MUIRBench | MCQ |
| | GMAI-MMBench_VAL | MCQ |
| | TableVQABench | VQA |
Note
* `*`: Partial model testing results are provided; the remaining models cannot achieve reasonable accuracy under zero-shot conditions.
* `+`: Testing results for this evaluation set have not yet been provided.
* `-`: VLMEvalKit only supports inference for this evaluation set and cannot output final accuracy.
Video Understanding Dataset#
| Dataset | Dataset Name | Task |
|---|---|---|
| | MMBench-Video | VQA |
| | MVBench_MP4 | MCQ |
| | MLVU | MCQ & VQA |
| | TempCompass | MCQ & Y/N & Caption |
| | LongVideoBench | MCQ |
| | Video-MME | MCQ |
4. RAGEval Backend#
CMTEB Evaluation Dataset#
| Name | Hub Link | Description | Type | Category | Number of Test Samples |
|---|---|---|---|---|---|
| | | T2Ranking: a large-scale Chinese paragraph ranking benchmark | Retrieval | s2p | 24,832 |
| | | mMARCO, the multilingual version of the MS MARCO paragraph ranking dataset | Retrieval | s2p | 7,437 |
| | | A large-scale Chinese web search engine paragraph retrieval benchmark | Retrieval | s2p | 4,000 |
| | | COVID-19 news articles | Retrieval | s2p | 949 |
| | | Online medical consultation texts | Retrieval | s2p | 3,999 |
| | | Paragraph retrieval dataset collected from Alibaba e-commerce search engine systems | Retrieval | s2p | 1,000 |
| | | Paragraph retrieval dataset collected from Alibaba medical search engine systems | Retrieval | s2p | 1,000 |
| | | Paragraph retrieval dataset collected from Alibaba video search engine systems | Retrieval | s2p | 1,000 |
| | | T2Ranking: a large-scale Chinese paragraph ranking benchmark | Re-ranking | s2p | 24,382 |
| | | mMARCO, the multilingual version of the MS MARCO paragraph ranking dataset | Re-ranking | s2p | 7,437 |
| | | Chinese community medical Q&A | Re-ranking | s2p | 2,000 |
| | | Chinese community medical Q&A | Re-ranking | s2p | 4,000 |
| | | Original Chinese natural language inference dataset | Pair Classification | s2s | 3,000 |
| | | Chinese multi-class natural language inference | Pair Classification | s2s | 139,000 |
| | | Clustering of titles from the CLS dataset, based on 13 main categories | Clustering | s2s | 10,000 |
| | | Clustering of titles + abstracts from the CLS dataset, based on 13 main categories | Clustering | p2p | 10,000 |
| | | Clustering of titles from the THUCNews dataset | Clustering | s2s | 10,000 |
| | | Clustering of titles + abstracts from the THUCNews dataset | Clustering | p2p | 10,000 |
| | | ATEC NLP sentence pair similarity competition | STS | s2s | 20,000 |
| | | Banking question semantic similarity | STS | s2s | 10,000 |
| | | Large-scale Chinese question matching corpus | STS | s2s | 12,500 |
| | | Translated PAWS evaluation pairs | STS | s2s | 2,000 |
| | | STS-B translated into Chinese | STS | s2s | 1,360 |
| | | Ant Financial question matching corpus | STS | s2s | 3,861 |
| | | QQ Browser query-title corpus | STS | s2s | 5,000 |
| | | News short text classification | Classification | s2s | 10,000 |
| | | Long text classification of application descriptions | Classification | s2s | 2,600 |
| | | Sentiment analysis of user reviews on food delivery platforms | Classification | s2s | 1,000 |
| | | Sentiment analysis of user reviews on online shopping websites | Classification | s2s | 1,000 |
| | | A set of multilingual sentiment datasets grouped into three categories: positive, neutral, negative | Classification | s2s | 3,000 |
| | | Reviews of iPhone | Classification | s2s | 533 |
For retrieval tasks, a sample of 100,000 candidates (including the ground truth) is drawn from the entire corpus to reduce inference costs.
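That sampling step can be pictured as: keep every ground-truth passage for the evaluated queries, then top the candidate pool up with random negatives until it reaches 100,000 documents. The snippet below is a minimal sketch of that idea, not the framework's actual sampling code; `corpus` and `qrels` are assumed in-memory mappings.

```python
import random

def sample_candidates(corpus: dict, qrels: dict, pool_size: int = 100_000, seed: int = 42) -> dict:
    """Sketch of the retrieval candidate sampling described above:
    keep all ground-truth passages, then fill up with random negatives."""
    rng = random.Random(seed)
    # Every passage relevant to at least one query must stay in the pool.
    positives = {doc_id for rel in qrels.values() for doc_id in rel}
    remaining = [doc_id for doc_id in corpus if doc_id not in positives]
    n_negatives = max(0, pool_size - len(positives))
    sampled = rng.sample(remaining, min(n_negatives, len(remaining)))
    keep = positives.union(sampled)
    return {doc_id: corpus[doc_id] for doc_id in keep if doc_id in corpus}

# Usage: corpus maps doc_id -> text, qrels maps query_id -> set of relevant doc_ids.
# small_corpus = sample_candidates(corpus, qrels)
```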
MTEB Evaluation Dataset#
See also
See also: MTEB Related Tasks
CLIP-Benchmark#
| Dataset Name | Task Type | Notes |
|---|---|---|
| | zeroshot_retrieval | Chinese Multimodal Dataset |
| | zeroshot_retrieval | |
| | zeroshot_retrieval | |
| | zeroshot_retrieval | |
| | zeroshot_retrieval | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |