VLMEvalKit Backend
Note
For more comprehensive instructions and an up-to-date list of datasets, please refer to the detailed instructions.
Image Understanding Dataset
Abbreviations used:
MCQ: Multiple Choice Questions; Y/N: Yes/No Questions; MTT: Multi-turn Dialogue Evaluation; MTI: Multi-image Input Evaluation
| Dataset | Dataset Names | Task |
|---|---|---|
| MMBench Series: | MMBench_DEV_[EN/CN] | MCQ |
|  | MMStar | MCQ |
|  | MME | Y/N |
|  | SEEDBench_IMG | MCQ |
|  | MMVet | VQA |
|  | MMMU_[DEV_VAL/TEST] | MCQ |
|  | MathVista_MINI | VQA |
|  | ScienceQA_[VAL/TEST] | MCQ |
|  | COCO_VAL | Caption |
|  | HallusionBench | Y/N |
|  | OCRVQA_[TESTCORE/TEST] | VQA |
|  | TextVQA_VAL | VQA |
|  | ChartQA_TEST | VQA |
|  | AI2D_[TEST/TEST_NO_MASK] | MCQ |
|  | LLaVABench | VQA |
|  | DocVQA_[VAL/TEST] | VQA |
|  | InfoVQA_[VAL/TEST] | VQA |
|  | OCRBench | VQA |
|  | RealWorldQA | MCQ |
|  | POPE | Y/N |
|  | CORE_MM (MTI) | VQA |
|  | MMT-Bench_[VAL/ALL] | MCQ (MTI) |
|  | MLLMGuard_DS | VQA |
|  | AesBench_[VAL/TEST] | MCQ |
| VCR-wiki+ | VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100] | VQA |
|  | MMLongBench_DOC | VQA (MTI) |
|  | BLINK | MCQ (MTI) |
|  | MathVision | VQA |
|  | MTVQA_TEST | VQA |
| MMDU+ | MMDU | VQA (MTT, MTI) |
|  | Q-Bench1_[VAL/TEST] | MCQ |
|  | A-Bench_[VAL/TEST] | MCQ |
| DUDE+ | DUDE | VQA (MTI) |
|  | SLIDEVQA | VQA (MTI) |
|  | TaskMeAnything_v1_imageqa_random | MCQ |
|  | MMMB_[ar/cn/en/pt/ru/tr] | MCQ |
|  | A-OKVQA | MCQ |
|  | MUIRBench | MCQ |
|  | GMAI-MMBench_VAL | MCQ |
|  | TableVQABench | VQA |
Note
* Test results are provided for only some models; the remaining models cannot achieve reasonable accuracy under zero-shot conditions.
+ Test results for this evaluation set have not yet been provided.
- VLMEvalKit supports only inference for this evaluation set and cannot output a final accuracy.
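Several entries in the Dataset Names column, such as `MMBench_DEV_[EN/CN]`, use a bracket shorthand. The sketch below assumes that each `[A/B/...]` group lists alternatives, so one entry stands for several concrete dataset names. The helper `expand_dataset_pattern` is hypothetical and is not part of VLMEvalKit; it only illustrates how the shorthand might be expanded before passing names to an evaluation config.

```python
import itertools
import re

# Matches one bracketed group of alternatives, e.g. "[EN/CN]".
_GROUP = re.compile(r"\[([^\[\]]+)\]")


def expand_dataset_pattern(pattern: str) -> list[str]:
    """Expand every [A/B/...] group in `pattern` into concrete names.

    Assumption: brackets in the table denote alternatives. This is an
    illustrative helper, not a VLMEvalKit function.
    """
    # re.split with one capture group alternates literal text and group bodies.
    parts = _GROUP.split(pattern)
    # Even indices hold literal text, odd indices hold "A/B/..." alternatives.
    choices = [
        part.split("/") if i % 2 else [part]
        for i, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    for name in expand_dataset_pattern("MMBench_DEV_[EN/CN]"):
        print(name)
```

Names without brackets, such as `MMStar`, come back unchanged as a single-element list. Parenthesized annotations like `(MTI)` are not handled and would need to be stripped first.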
Video Understanding Dataset
| Dataset | Dataset Name | Task |
|---|---|---|
|  | MMBench-Video | VQA |
|  | MVBench_MP4 | MCQ |
|  | MLVU | MCQ & VQA |
|  | TempCompass | MCQ & Y/N & Caption |
|  | LongVideoBench | MCQ |
|  | Video-MME | MCQ |