VLMEvalKit Backend

Note

For more comprehensive instructions and an up-to-date list of datasets, please refer to the detailed instructions.

Image Understanding Dataset

Abbreviations used:

  • MCQ: Multiple Choice Questions

  • Y/N: Yes/No Questions

  • MTT: Multi-turn Dialogue Evaluation

  • MTI: Multi-image Input Evaluation
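Task cells in the tables on this page combine these abbreviations, for example `VQA (MTT, MTI)` or `MCQ & Y/N & Caption`. The following is a minimal sketch of how such a cell could be split into its task types and its modifiers. The function name `parse_task` is hypothetical and is not part of VLMEvalKit; it only illustrates how to read the notation.

```python
import re


def parse_task(cell: str) -> tuple[list[str], list[str]]:
    """Split a Task cell into (task types, modifiers).

    Hypothetical helper for reading the tables on this page;
    it is not a VLMEvalKit API.
    """
    # Text before an optional "(...)" holds the task types;
    # the parenthesized part, if present, holds modifiers such as MTT or MTI.
    match = re.fullmatch(r"\s*([^()]+?)\s*(?:\(([^()]*)\))?\s*", cell)
    if match is None:
        raise ValueError(f"unrecognized task cell: {cell!r}")
    # Several task types are joined with "&" (e.g. "MCQ & VQA").
    tasks = [t.strip() for t in match.group(1).split("&")]
    # Several modifiers are joined with "," (e.g. "MTT, MTI").
    modifiers = [m.strip() for m in (match.group(2) or "").split(",") if m.strip()]
    return tasks, modifiers


if __name__ == "__main__":
    for cell in ["MCQ", "VQA (MTT, MTI)", "MCQ & Y/N & Caption"]:
        print(cell, "->", parse_task(cell))
```

Note that `Y/N` contains a slash, so the sketch splits only on `&` and `,` and never on `/`.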

| Dataset | Dataset Names | Task |
|---|---|---|
| MMBench Series: MMBench, MMBench-CN, CCBench | MMBench_DEV_[EN/CN], MMBench_TEST_[EN/CN], MMBench_DEV_[EN/CN]_V11, MMBench_TEST_[EN/CN]_V11, CCBench | MCQ |
| MMStar | MMStar | MCQ |
| MME | MME | Y/N |
| SEEDBench Series | SEEDBench_IMG, SEEDBench2, SEEDBench2_Plus | MCQ |
| MM-Vet | MMVet | VQA |
| MMMU | MMMU_[DEV_VAL/TEST] | MCQ |
| MathVista | MathVista_MINI | VQA |
| ScienceQA_IMG | ScienceQA_[VAL/TEST] | MCQ |
| COCO Caption | COCO_VAL | Caption |
| HallusionBench | HallusionBench | Y/N |
| OCRVQA * | OCRVQA_[TESTCORE/TEST] | VQA |
| TextVQA * | TextVQA_VAL | VQA |
| ChartQA * | ChartQA_TEST | VQA |
| AI2D | AI2D_[TEST/TEST_NO_MASK] | MCQ |
| LLaVABench | LLaVABench | VQA |
| DocVQA + | DocVQA_[VAL/TEST] | VQA |
| InfoVQA + | InfoVQA_[VAL/TEST] | VQA |
| OCRBench | OCRBench | VQA |
| RealWorldQA | RealWorldQA | MCQ |
| POPE | POPE | Y/N |
| Core-MM - | CORE_MM (MTI) | VQA |
| MMT-Bench | MMT-Bench_[VAL/ALL], MMT-Bench_[VAL/ALL]_MI | MCQ (MTI) |
| MLLMGuard - | MLLMGuard_DS | VQA |
| AesBench + | AesBench_[VAL/TEST] | MCQ |
| VCR-wiki + | VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100] | VQA |
| MMLongBench-Doc + | MMLongBench_DOC | VQA (MTI) |
| BLINK | BLINK | MCQ (MTI) |
| MathVision + | MathVision, MathVision_MINI | VQA |
| MT-VQA + | MTVQA_TEST | VQA |
| MMDU + | MMDU | VQA (MTT, MTI) |
| Q-Bench1 + | Q-Bench1_[VAL/TEST] | MCQ |
| A-Bench + | A-Bench_[VAL/TEST] | MCQ |
| DUDE + | DUDE | VQA (MTI) |
| SlideVQA + | SLIDEVQA, SLIDEVQA_MINI | VQA (MTI) |
| TaskMeAnything ImageQA Random + | TaskMeAnything_v1_imageqa_random | MCQ |
| MMMB and Multilingual MMBench + | MMMB_[ar/cn/en/pt/ru/tr], MMBench_dev_[ar/cn/en/pt/ru/tr], MMMB, MTL_MMBench_DEV (MMMB and MTL_MMBench_DEV are all-in-one names covering all 6 languages) | MCQ |
| A-OKVQA + | A-OKVQA | MCQ |
| MuirBench | MUIRBench | MCQ |
| GMAI-MMBench + | GMAI-MMBench_VAL | MCQ |
| TableVQABench + | TableVQABench | VQA |

Note

* Test results are provided for only some models; the remaining models cannot reach reasonable accuracy under zero-shot conditions.

+ Test results for this evaluation set have not yet been provided.

- VLMEvalKit supports only inference for this evaluation set and cannot output a final accuracy.
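Several entries in the Dataset Names column of the image table use a bracket shorthand, such as `MMBench_DEV_[EN/CN]` or `VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100]`, where each bracket group lists alternatives separated by `/`. The sketch below shows one way to expand such a pattern into the concrete names it stands for. It assumes every bracket group is a plain list of alternatives; the function name `expand_dataset_pattern` is hypothetical and is not part of VLMEvalKit.

```python
import itertools
import re


def expand_dataset_pattern(pattern: str) -> list[str]:
    """Expand every `[A/B/...]` group in `pattern` into concrete names.

    Hypothetical helper for reading the table notation;
    it is not a VLMEvalKit API.
    """
    # With a capturing group, re.split alternates literal text and group bodies:
    # "VCR_[EN/ZH]_[EASY/HARD]" -> ["VCR_", "EN/ZH", "_", "EASY/HARD", ""]
    parts = re.split(r"\[([^\]]*)\]", pattern)
    # Odd positions are bracket bodies (split on "/"); even positions are literals.
    choices = [
        part.split("/") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    # The Cartesian product over all groups yields every concrete name.
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    print(expand_dataset_pattern("MMBench_DEV_[EN/CN]_V11"))
    print(len(expand_dataset_pattern("VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100]")))
```

A name without brackets, such as `MMStar`, is returned unchanged as a single-element list, so the helper can be applied uniformly to every entry in the column.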

Video Understanding Dataset

| Dataset | Dataset Name | Task |
|---|---|---|
| MMBench-Video | MMBench-Video | VQA |
| MVBench | MVBench_MP4 | MCQ |
| MLVU | MLVU | MCQ & VQA |
| TempCompass | TempCompass | MCQ & Y/N & Caption |
| LongVideoBench | LongVideoBench | MCQ |
| Video-MME | Video-MME | MCQ |