VLMEvalKit#

To facilitate the use of the VLMEvalKit evaluation backend, we have customized the VLMEvalKit source code and named it ms-vlmeval. This version encapsulates the configuration and execution of evaluation tasks, supports installation via PyPI, and allows users to launch lightweight VLMEvalKit evaluation tasks through EvalScope. In addition, evaluation against OpenAI-API-compatible interfaces is supported, so you can deploy multimodal model services with ms-swift, vLLM, LMDeploy, Ollama, etc.

1. Environment Setup#

# Install additional dependencies
pip install evalscope[vlmeval]

2. Data Preparation#

When loading a dataset, if the local dataset file does not exist, it will be automatically downloaded to the ~/LMUData/ directory.
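
The download location can be redirected through the LMUData environment variable that VLMEvalKit reads; a minimal sketch (the path below is illustrative):

import os

# Redirect the VLMEvalKit dataset cache to a custom directory (example path);
# set this before the evaluation task is started
os.environ['LMUData'] = '/data/LMUData'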

The currently supported datasets include:

| Name | Notes |
|------|-------|
| A-Bench_TEST, A-Bench_VAL | |
| AI2D_TEST, AI2D_TEST_NO_MASK | |
| AesBench_TEST, AesBench_VAL | |
| BLINK | |
| CCBench | |
| COCO_VAL | |
| ChartQA_TEST | |
| DUDE, DUDE_MINI | |
| DocVQA_TEST, DocVQA_VAL | DocVQA_TEST does not provide answers; use DocVQA_VAL for automatic evaluation |
| GMAI_mm_bench_VAL | |
| HallusionBench | |
| InfoVQA_TEST, InfoVQA_VAL | InfoVQA_TEST does not provide answers; use InfoVQA_VAL for automatic evaluation |
| LLaVABench | |
| MLLMGuard_DS | |
| MMBench-Video | |
| MMBench_DEV_CN, MMBench_DEV_CN_V11 | |
| MMBench_DEV_EN, MMBench_DEV_EN_V11 | |
| MMBench_TEST_CN, MMBench_TEST_CN_V11 | MMBench_TEST_CN does not provide answers |
| MMBench_TEST_EN, MMBench_TEST_EN_V11 | MMBench_TEST_EN does not provide answers |
| MMBench_dev_ar, MMBench_dev_cn, MMBench_dev_en, MMBench_dev_pt, MMBench_dev_ru, MMBench_dev_tr | |
| MMDU | |
| MME | |
| MMLongBench_DOC | |
| MMMB, MMMB_ar, MMMB_cn, MMMB_en, MMMB_pt, MMMB_ru, MMMB_tr | |
| MMMU_DEV_VAL, MMMU_TEST | |
| MMStar | |
| MMT-Bench_ALL, MMT-Bench_ALL_MI, MMT-Bench_VAL, MMT-Bench_VAL_MI | |
| MMVet | |
| MTL_MMBench_DEV | |
| MTVQA_TEST | |
| MVBench, MVBench_MP4 | |
| MathVision, MathVision_MINI, MathVista_MINI | |
| OCRBench | |
| OCRVQA_TEST, OCRVQA_TESTCORE | |
| POPE | |
| Q-Bench1_TEST, Q-Bench1_VAL | |
| RealWorldQA | |
| SEEDBench2, SEEDBench2_Plus, SEEDBench_IMG | |
| SLIDEVQA, SLIDEVQA_MINI | |
| ScienceQA_TEST, ScienceQA_VAL | |
| TaskMeAnything_v1_imageqa_random | |
| TextVQA_VAL | |
| VCR_EN_EASY_100, VCR_EN_EASY_500, VCR_EN_EASY_ALL | |
| VCR_EN_HARD_100, VCR_EN_HARD_500, VCR_EN_HARD_ALL | |
| VCR_ZH_EASY_100, VCR_ZH_EASY_500, VCR_ZH_EASY_ALL | |
| VCR_ZH_HARD_100, VCR_ZH_HARD_500, VCR_ZH_HARD_ALL | |
| Video-MME | |

Note

For detailed information about the datasets, refer to the VLMEvalKit Supported Multimodal Benchmark List.

You can view the dataset name list using the following code:

from evalscope.backend.vlm_eval_kit import VLMEvalKitBackendManager
print(f'** All datasets from VLMEvalKit backend: {VLMEvalKitBackendManager.list_supported_datasets()}')

3. Model Evaluation#

Model evaluation can be conducted in two ways: through deployed model services or local model inference. Details are as follows:

Method 1: Deployed Model Service Evaluation#

Model Deployment#

Here are four ways to deploy model services:

vLLM#

Refer to the vLLM Tutorial for more details; see also the List of Supported Models.

Install vLLM

pip install vllm -U

Deploy Model Service

VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-VL-3B-Instruct --port 8000 --trust-remote-code --max_model_len 4096 --served-model-name Qwen2.5-VL-3B-Instruct

Tip

If you encounter the error ValueError: At most 1 image(s) may be provided in one request, try setting the parameter --limit-mm-per-prompt "image=5"; the image count can be increased as needed.

ms-swift#

Deploy model services using ms-swift. For more details, refer to the ms-swift Deployment Guide.

Install ms-swift

pip install ms-swift -U

Deploy Model Service

CUDA_VISIBLE_DEVICES=0 swift deploy --model Qwen/Qwen2.5-VL-3B-Instruct --port 8000

LMDeploy#

Refer to the LMDeploy Tutorial for more details.

Install LMDeploy

pip install lmdeploy -U

Deploy Model Service

CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server Qwen-VL-Chat --server-port 8000

Ollama#

Run ModelScope-hosted models with Ollama in one click; refer to the documentation for details.

ollama run modelscope.cn/IAILabs/Qwen2.5-VL-7B-Instruct-GGUF
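
Once a model service is running, it can help to sanity-check the OpenAI-compatible endpoint before launching an evaluation. A minimal sketch, assuming the vLLM deployment above (port 8000, served model name Qwen2.5-VL-3B-Instruct):

import requests

# Send a trivial chat request to the OpenAI-compatible endpoint (illustrative values)
resp = requests.post(
    'http://localhost:8000/v1/chat/completions',
    headers={'Authorization': 'Bearer EMPTY'},
    json={
        'model': 'Qwen2.5-VL-3B-Instruct',
        'messages': [{'role': 'user', 'content': 'Reply with the single word OK.'}],
        'max_tokens': 16,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()['choices'][0]['message']['content'])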

Configure Model Evaluation Parameters#

Write the configuration, either as a YAML file or as the equivalent Python TaskConfig shown below:

work_dir: outputs
eval_backend: VLMEvalKit
eval_config:
  model: 
    - type: Qwen2.5-VL-3B-Instruct
      name: CustomAPIModel 
      api_base: http://localhost:8000/v1/chat/completions
      key: EMPTY
      temperature: 0.0
      img_size: -1
      max_tokens: 1024
      video_llm: false
  data:
    - SEEDBench_IMG
    - ChartQA_TEST
  mode: all
  limit: 20
  reuse: false
  nproc: 16
  judge: exact_matching
The same configuration expressed as a Python TaskConfig:

from evalscope import TaskConfig

task_cfg_dict = TaskConfig(
    work_dir='outputs',
    eval_backend='VLMEvalKit',
    eval_config={
        'data': ['SEEDBench_IMG', 'ChartQA_TEST'],
        'limit': 20,
        'mode': 'all',
        'model': [
            {
                'api_base': 'http://localhost:8000/v1/chat/completions',
                'key': 'EMPTY',
                'name': 'CustomAPIModel',
                'type': 'Qwen2.5-VL-3B-Instruct',
                'temperature': 0.0,
                'img_size': -1,
                'video_llm': False,
                'max_tokens': 1024,
            }
        ],
        'reuse': False,
        'nproc': 16,
        'judge': 'exact_matching',
    },
)

Method 2: Local Model Inference Evaluation#

For local inference, configure the model evaluation parameters directly; there is no need to start a model service.

Configure Model Evaluation Parameters#

eval_openai_api.yaml#
work_dir: outputs
eval_backend: VLMEvalKit
eval_config:
  model: 
    - name: qwen_chat
      model_path: models/Qwen-VL-Chat
  data:
    - SEEDBench_IMG
    - ChartQA_TEST
  mode: all
  limit: 20
  reuse: false
  work_dir: outputs
  nproc: 16
The same configuration expressed as a Python TaskConfig:

from evalscope import TaskConfig

task_cfg_dict = TaskConfig(
    work_dir='outputs',
    eval_backend='VLMEvalKit',
    eval_config={
        'data': ['SEEDBench_IMG', 'ChartQA_TEST'],
        'limit': 20,
        'mode': 'all',
        'model': [
            {
                'name': 'qwen_chat',
                'model_path': 'models/Qwen-VL-Chat',
                'video_llm': False,
                'max_new_tokens': 1024,
            }
        ],
        'reuse': False,
    },
)

Parameter Explanation#

  • eval_backend: Default value is VLMEvalKit, indicating the use of VLMEvalKit as the evaluation backend.

  • work_dir: String, the directory for saving evaluation results, logs, and summaries. Default value is outputs.

  • eval_config: Dictionary containing the following fields:

    • data: List, refer to the currently supported datasets

    • model: List of dictionaries, each specifying the following fields:

      • For remote API calls:

        • api_base: URL of the model service.

        • type: API request model name, e.g., Qwen2.5-VL-3B-Instruct.

        • name: Fixed value, must be CustomAPIModel.

        • key: OpenAI API key for the model API, default is EMPTY.

        • temperature: Temperature coefficient for model inference, default is 0.0.

        • max_tokens: Maximum number of tokens for model inference, default is 2048.

        • img_size: Image size for model inference, default is -1, meaning use the original size; set to other values, e.g., 224, to scale the image to 224x224.

        • video_llm: Boolean, default is False. Set to True to pass video_url parameter when evaluating video datasets.

      • For local model inference:

        • name: Model name as defined in VLMEvalKit, e.g., qwen_chat.

        • model_path: Local path or model ID of the model weights, e.g., models/Qwen-VL-Chat.

        • max_new_tokens: Maximum number of new tokens to generate during inference.

        • video_llm: Boolean, same meaning as for remote API calls.

    • mode: Options: ['all', 'infer'], all includes inference and evaluation; infer performs inference only.

    • limit: Integer, number of data items to evaluate, default is None, meaning run all examples.

    • reuse: Boolean, whether to reuse existing evaluation results; if set to False, all temporary evaluation files are deleted before the run.

      Note

      In ms-vlmeval>=0.0.11 the parameter rerun has been renamed to reuse, with a default of False. When set to True, you also need to set use_cache in task_cfg_dict to specify the cache directory.

    • nproc: Integer, number of API calls to be made in parallel.

    • nframe: Integer, number of video frames for video datasets, default is 8.

    • fps: Integer, frame rate for video datasets; default is -1, meaning nframe is used. If set to a value greater than 0, the number of frames is calculated from fps.

    • use_subtitle: Boolean, whether to use subtitles for video datasets, default is False. (A video-dataset configuration sketch is shown after this list.)
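
As an illustration of the video-related options above, here is a sketch of an API-service configuration for a video dataset; the dataset choice and values are illustrative and follow the Qwen2.5-VL example from Method 1:

from evalscope import TaskConfig

# Illustrative video-dataset configuration: video_llm makes the backend pass a
# video_url parameter, and nframe controls how many frames are sampled per video.
video_task_cfg = TaskConfig(
    work_dir='outputs',
    eval_backend='VLMEvalKit',
    eval_config={
        'data': ['MMBench-Video'],
        'mode': 'all',
        'model': [
            {
                'api_base': 'http://localhost:8000/v1/chat/completions',
                'key': 'EMPTY',
                'name': 'CustomAPIModel',
                'type': 'Qwen2.5-VL-3B-Instruct',
                'video_llm': True,
                'max_tokens': 1024,
            }
        ],
        'nframe': 8,
        'nproc': 16,
        'reuse': False,
    },
)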

(Optional) Deploy Judge Model#

Deploy a local language model to act as the judge / choice extractor; ms-swift can again be used to deploy the model service. For details, refer to: ms-swift LLM Deployment Guide.

Note

If no judge model is deployed, post-processing plus exact matching is used for judging. If a judge model is deployed, its environment variables must be configured so that the model can be called correctly.

Deploy Judge Model#

# Deploy qwen2-7b as the judge
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen2-7b-instruct --model_id_or_path models/Qwen2-7B-Instruct --port 8866

Configure Judge Model Environment Variables#

Add the following configuration to eval_config in the yaml configuration file:

eval_config:
  # ... other configurations
  OPENAI_API_KEY: EMPTY 
  OPENAI_API_BASE: http://127.0.0.1:8866/v1/chat/completions # api_base of the judge model
  LOCAL_LLM: qwen2-7b-instruct # model_id of the judge model
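
If you configure the task in Python rather than YAML, the same keys go into the eval_config dictionary. A sketch, assuming the judge deployed above and the task_cfg_dict defined earlier (where eval_config is a plain dict):

# Add the judge-model settings to eval_config before running the task
task_cfg_dict.eval_config['OPENAI_API_KEY'] = 'EMPTY'
task_cfg_dict.eval_config['OPENAI_API_BASE'] = 'http://127.0.0.1:8866/v1/chat/completions'  # api_base of the judge model
task_cfg_dict.eval_config['LOCAL_LLM'] = 'qwen2-7b-instruct'  # model_id of the judge model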

4. Execute Evaluation Task#

Caution

If you want the model to run inference again, clear the existing model prediction results in the outputs folder before running the script. Previous prediction results are not cleared automatically; if they exist, the inference phase is skipped and the results are evaluated directly.
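
A minimal sketch of clearing previous results before a fresh run; note that this removes the whole work directory, including logs and reports, and assumes the default work_dir='outputs':

import shutil

# Remove previous predictions and reports so the inference phase runs again
# (adjust the path if you changed work_dir)
shutil.rmtree('outputs', ignore_errors=True)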

After completing the configuration, run the following script:

example_eval_openai_api.py#
from evalscope.run import run_task
from evalscope.summarizer import Summarizer

def run_eval():
    # Option 1: the Python TaskConfig defined in the configuration step above
    task_cfg = task_cfg_dict
    # Option 2: the YAML configuration file
    # task_cfg = 'eval_openai_api.yaml'
    
    run_task(task_cfg=task_cfg)
    print('>> Start to get the report with summarizer ...')
    report_list = Summarizer.get_report_from_cfg(task_cfg)
    print(f'\n>> The report list: {report_list}')

run_eval()

Run the following command:

python example_eval_openai_api.py