# Supported Datasets
## 1. Natively Supported Datasets
> **Tip**
> The framework natively supports the datasets listed below. If the dataset you need is not on the list, you can submit an issue and we will support it as soon as possible. Alternatively, you can follow the Benchmark Addition Guide to add the dataset yourself and submit a PR; contributions are welcome.
>
> You can also evaluate with the other tools this framework supports, such as OpenCompass for language-model evaluation or VLMEvalKit for multimodal-model evaluation.
| Name | Dataset ID | Task Category | Remarks |
|---|---|---|---|
| | | Exam | |
| | | General Reasoning | |
| | | Chinese Comprehensive Exam | |
| | | Chinese Comprehensive Exam | |
| | | Math Competition | |
| | | Math Problems | |
| | | Commonsense Reasoning | |
| | | Code Generation | |
| | | Instruction Following | |
| | | Comprehensive Exam | |
| | | Comprehensive Exam | |
| | | Reading Comprehension | |
| | | Knowledge Q&A | |
| | | Safety | |
> **Note**
> `*` Evaluation requires computing logits and therefore does not currently support API-service evaluation (`eval-type != server`).
> `+` Because evaluation involves executing generated code, it is recommended to run it in a sandboxed environment (e.g. Docker) to avoid side effects on the local environment.
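For a concrete starting point, the sketch below shows how a natively supported dataset might be evaluated through the framework's Python entry point. It is illustrative only: the model ID and dataset names are placeholders, and the exact task-config fields can vary between versions, so consult the user guide for the authoritative parameters.

```python
from evalscope.run import run_task

# Minimal sketch (field names assumed from the framework's task-config
# convention; the model ID and dataset names are placeholders).
task_cfg = {
    'model': 'Qwen/Qwen2.5-0.5B-Instruct',  # any supported model ID
    'datasets': ['gsm8k', 'arc'],           # names from the table above
    'limit': 10,                            # small subset for a quick check
}

run_task(task_cfg=task_cfg)
```

Recent versions also expose an equivalent CLI (`evalscope eval ...`); check `evalscope eval --help` for the flags supported by your installed version.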
## 2. OpenCompass Backend
Refer to the detailed explanation in the user guide.
| Language | Knowledge | Reasoning | Examination |
|---|---|---|---|
| Word Definition<br>Idiom Learning<br>Semantic Similarity<br>Coreference Resolution<br>Translation<br>Multi-language Question Answering<br>Multi-language Summary | Knowledge Question Answering | Textual Entailment<br>Commonsense Reasoning<br>Mathematical Reasoning<br>Theorem Application<br>Comprehensive Reasoning | Junior High, High School, University, and Professional Examinations<br>Medical Examinations |

| Understanding | Long Context | Safety | Code |
|---|---|---|---|
| Reading Comprehension<br>Content Summary<br>Content Analysis | Long Context Understanding | Safety<br>Robustness | Code |
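As a sketch of how evaluation is routed through this backend, the task config selects OpenCompass via an `eval_backend` field and passes backend-specific settings in `eval_config`. All dataset and model entries below are illustrative placeholders, and the exact schema is version-dependent; see the detailed explanation referenced above.

```python
from evalscope.run import run_task

# Sketch only: the eval_config layout follows the OpenCompass-backend
# convention, but the keys and values here are illustrative placeholders.
task_cfg = {
    'eval_backend': 'OpenCompass',
    'eval_config': {
        'datasets': ['gsm8k', 'cmmlu'],  # OpenCompass dataset names
        'models': [
            {
                'path': 'llama3-8b-instruct',  # served model name (placeholder)
                'openai_api_base': 'http://127.0.0.1:8000/v1/chat/completions',
                'batch_size': 8,
            },
        ],
        'work_dir': 'outputs/opencompass_eval',
    },
}

run_task(task_cfg=task_cfg)
```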
## 3. VLMEvalKit Backend
Refer to the detailed explanation in the user guide.
### Image Understanding Datasets
Abbreviations used:

- `MCQ`: Multiple-Choice Questions
- `Y/N`: Yes/No Questions
- `MTT`: Multi-Turn Dialogue Evaluation
- `MTI`: Multi-Image Input Evaluation
| Dataset | Dataset Names | Task |
|---|---|---|
| MMBench Series | MMBench_DEV_[EN/CN] | MCQ |
| | MMStar | MCQ |
| | MME | Y/N |
| | SEEDBench_IMG | MCQ |
| | MMVet | VQA |
| | MMMU_[DEV_VAL/TEST] | MCQ |
| | MathVista_MINI | VQA |
| | ScienceQA_[VAL/TEST] | MCQ |
| | COCO_VAL | Caption |
| | HallusionBench | Y/N |
| | OCRVQA_[TESTCORE/TEST] | VQA |
| | TextVQA_VAL | VQA |
| | ChartQA_TEST | VQA |
| | AI2D_[TEST/TEST_NO_MASK] | MCQ |
| | LLaVABench | VQA |
| | DocVQA_[VAL/TEST] | VQA |
| | InfoVQA_[VAL/TEST] | VQA |
| | OCRBench | VQA |
| | RealWorldQA | MCQ |
| | POPE | Y/N |
| | CORE_MM (MTI) | VQA |
| | MMT-Bench_[VAL/ALL] | MCQ (MTI) |
| | MLLMGuard_DS | VQA |
| | AesBench_[VAL/TEST] | MCQ |
| VCR-wiki+ | VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100] | VQA |
| | MMLongBench_DOC | VQA (MTI) |
| | BLINK | MCQ (MTI) |
| | MathVision | VQA |
| | MTVQA_TEST | VQA |
| MMDU+ | MMDU | VQA (MTT, MTI) |
| | Q-Bench1_[VAL/TEST] | MCQ |
| | A-Bench_[VAL/TEST] | MCQ |
| DUDE+ | DUDE | VQA (MTI) |
| | SLIDEVQA | VQA (MTI) |
| | TaskMeAnything_v1_imageqa_random | MCQ |
| | MMMB_[ar/cn/en/pt/ru/tr] | MCQ |
| | A-OKVQA | MCQ |
| | MUIRBench | MCQ |
| | GMAI-MMBench_VAL | MCQ |
| | TableVQABench | VQA |
> **Note**
> `*` Only partial model test results are provided; the remaining models cannot reach reasonable accuracy under zero-shot conditions.
> `+` Test results for this evaluation set have not yet been provided.
> `-` VLMEvalKit only supports inference for this evaluation set and cannot output a final accuracy.
### Video Understanding Datasets
| Dataset | Dataset Name | Task |
|---|---|---|
| | MMBench-Video | VQA |
| | MVBench_MP4 | MCQ |
| | MLVU | MCQ & VQA |
| | TempCompass | MCQ & Y/N & Caption |
| | LongVideoBench | MCQ |
| | Video-MME | MCQ |
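A sketch of driving these datasets through the VLMEvalKit backend follows. The dataset names come from the "Dataset Names" column above, while the model entry and the remaining `eval_config` keys are illustrative assumptions based on the backend's task-config convention; check the user guide for the exact schema.

```python
from evalscope.run import run_task

# Sketch only: dataset names are taken from the tables above; the model
# entry is an illustrative API-model placeholder.
task_cfg = {
    'eval_backend': 'VLMEvalKit',
    'eval_config': {
        'data': ['SEEDBench_IMG', 'ChartQA_TEST'],  # "Dataset Names" column
        'model': [
            {
                'name': 'CustomAPIModel',  # generic API-model wrapper (assumed)
                'type': 'qwen-vl-plus',    # served multimodal model (placeholder)
                'api_base': 'http://127.0.0.1:8000/v1/chat/completions',
                'key': 'EMPTY',
            },
        ],
        'limit': 20,
        'work_dir': 'outputs/vlmeval',
    },
}

run_task(task_cfg=task_cfg)
```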
## 4. RAGEval Backend
### CMTEB Evaluation Datasets
| Name | Hub Link | Description | Type | Category | Number of Test Samples |
|---|---|---|---|---|---|
| | | T2Ranking: a large-scale Chinese paragraph ranking benchmark | Retrieval | s2p | 24,832 |
| | | mMARCO, the multilingual version of the MS MARCO paragraph ranking dataset | Retrieval | s2p | 7,437 |
| | | A large-scale Chinese web search engine paragraph retrieval benchmark | Retrieval | s2p | 4,000 |
| | | COVID-19 news articles | Retrieval | s2p | 949 |
| | | Online medical consultation texts | Retrieval | s2p | 3,999 |
| | | Paragraph retrieval dataset collected from Alibaba e-commerce search engine systems | Retrieval | s2p | 1,000 |
| | | Paragraph retrieval dataset collected from Alibaba medical search engine systems | Retrieval | s2p | 1,000 |
| | | Paragraph retrieval dataset collected from Alibaba video search engine systems | Retrieval | s2p | 1,000 |
| | | T2Ranking: a large-scale Chinese paragraph ranking benchmark | Re-ranking | s2p | 24,382 |
| | | mMARCO, the multilingual version of the MS MARCO paragraph ranking dataset | Re-ranking | s2p | 7,437 |
| | | Chinese community medical Q&A | Re-ranking | s2p | 2,000 |
| | | Chinese community medical Q&A | Re-ranking | s2p | 4,000 |
| | | Original Chinese natural language inference dataset | Pair Classification | s2s | 3,000 |
| | | Chinese multi-class natural language inference | Pair Classification | s2s | 139,000 |
| | | Clustering of titles from the CLS dataset, based on 13 main categories | Clustering | s2s | 10,000 |
| | | Clustering of titles + abstracts from the CLS dataset, based on 13 main categories | Clustering | p2p | 10,000 |
| | | Clustering of titles from the THUCNews dataset | Clustering | s2s | 10,000 |
| | | Clustering of titles + abstracts from the THUCNews dataset | Clustering | p2p | 10,000 |
| | | ATEC NLP sentence-pair similarity competition | STS | s2s | 20,000 |
| | | Banking question semantic similarity | STS | s2s | 10,000 |
| | | Large-scale Chinese question matching corpus | STS | s2s | 12,500 |
| | | Translated PAWS evaluation pairs | STS | s2s | 2,000 |
| | | STS-B translated into Chinese | STS | s2s | 1,360 |
| | | Ant Financial question matching corpus | STS | s2s | 3,861 |
| | | QQ Browser query-title corpus | STS | s2s | 5,000 |
| | | News short-text classification | Classification | s2s | 10,000 |
| | | Long-text classification of application descriptions | Classification | s2s | 2,600 |
| | | Sentiment analysis of user reviews on food-delivery platforms | Classification | s2s | 1,000 |
| | | Sentiment analysis of user reviews on online shopping websites | Classification | s2s | 1,000 |
| | | A set of multilingual sentiment datasets grouped into three categories: positive, neutral, negative | Classification | s2s | 3,000 |
| | | Reviews of iPhone | Classification | s2s | 533 |
For retrieval tasks, a sample of 100,000 candidates (including the ground truth) is drawn from the entire corpus to reduce inference costs.
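A sketch of a CMTEB run through the RAGEval backend is shown below. The embedding-model path is a placeholder and the `eval_config` keys are assumptions based on the backend's task-config convention; the task names (e.g. `T2Retrieval`, `TNews`) follow the public CMTEB naming.

```python
from evalscope.run import run_task

# Sketch only: model path and eval_config keys are illustrative
# assumptions; consult the RAGEval/CMTEB user guide for exact fields.
task_cfg = {
    'eval_backend': 'RAGEval',
    'eval_config': {
        'tool': 'MTEB',
        'model': [
            {
                'model_name_or_path': 'AI-ModelScope/m3e-base',  # embedding model (placeholder)
                'max_seq_length': 512,
                'encode_kwargs': {'batch_size': 128},
            },
        ],
        'eval': {
            'tasks': ['T2Retrieval', 'TNews'],  # CMTEB task names
            'output_folder': 'outputs/cmteb',
            'overwrite_results': True,
        },
    },
}

run_task(task_cfg=task_cfg)
```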
### MTEB Evaluation Datasets
> **See also:** MTEB Related Tasks
### CLIP-Benchmark
| Dataset Name | Task Type | Notes |
|---|---|---|
| | zeroshot_retrieval | Chinese Multimodal Dataset |
| | zeroshot_retrieval | |
| | zeroshot_retrieval | |
| | zeroshot_retrieval | |
| | zeroshot_retrieval | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
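Finally, a sketch of a CLIP-Benchmark run through the same backend. The model and dataset identifiers are placeholders, and the `eval_config` layout is an assumption based on the backend's task-config convention; check the user guide for the exact keys.

```python
from evalscope.run import run_task

# Sketch only: model and dataset identifiers are illustrative
# placeholders; consult the CLIP-Benchmark user guide for exact fields.
task_cfg = {
    'eval_backend': 'RAGEval',
    'eval_config': {
        'tool': 'clip_benchmark',
        'eval': {
            'models': [
                {'model_name': 'AI-ModelScope/chinese-clip-vit-large-patch14-336px'},
            ],
            'dataset_name': ['muge'],  # placeholder dataset identifier
            'split': 'test',
            'batch_size': 128,
            'limit': 1000,
        },
    },
}

run_task(task_cfg=task_cfg)
```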