Supported Benchmarks

EvalScope supports a variety of datasets for evaluating different types of models, including language models, AIGC models, and others. The supported datasets are listed below, grouped by model type.

Tip

If the dataset you need is not on the list, you can submit an issue, and we will support it as soon as possible. Alternatively, you can follow the Benchmark Addition Guide to add the dataset yourself and submit a PR. Contributions are welcome.

For multimodal evaluation, the Native backend is recommended. It already supports more than 40 mainstream VLM benchmarks, including OCRBench, MMMU, MMBench, MathVista, ChartQA, and DocVQA; see VLM Benchmarks for the full list. If needed, you can also use other tools integrated with this framework, such as OpenCompass for language model evaluation or VLMEvalKit for multimodal model evaluation.