VLMEvalKit#
To facilitate the use of the VLMEvalKit evaluation backend, we have customized the VLMEvalKit source code and named it ms-vlmeval. This version encapsulates the configuration and execution of evaluation tasks and can be installed from PyPI, allowing users to launch lightweight VLMEvalKit evaluation tasks through EvalScope. We also support evaluating model services that expose the OpenAI API format; you can deploy multi-modal model services with ms-swift, vLLM, LMDeploy, Ollama, etc.
1. Environment Setup#
# Install additional dependencies
pip install evalscope[vlmeval]
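To confirm that the backend installed correctly, you can query the version of the wrapped package. A quick check, assuming the distribution is published under the name ms-vlmeval as described above:
import importlib.metadata as md

# Assumption: the VLMEvalKit wrapper is installed as the "ms-vlmeval" distribution.
print(md.version('ms-vlmeval'))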
2. Data Preparation#
When loading a dataset, if the local dataset file does not exist, it will be automatically downloaded to the ~/LMUData/ directory.
The currently supported datasets include:
| Name | Notes |
|---|---|
| A-Bench_TEST, A-Bench_VAL | |
| AI2D_TEST, AI2D_TEST_NO_MASK | |
| AesBench_TEST, AesBench_VAL | |
| BLINK | |
| CCBench | |
| COCO_VAL | |
| ChartQA_TEST | |
| DUDE, DUDE_MINI | |
| DocVQA_TEST, DocVQA_VAL | DocVQA_TEST does not provide answers; use DocVQA_VAL for automatic evaluation |
| GMAI_mm_bench_VAL | |
| HallusionBench | |
| InfoVQA_TEST, InfoVQA_VAL | InfoVQA_TEST does not provide answers; use InfoVQA_VAL for automatic evaluation |
| LLaVABench | |
| MLLMGuard_DS | |
| MMBench-Video | |
| MMBench_DEV_CN, MMBench_DEV_CN_V11 | |
| MMBench_DEV_EN, MMBench_DEV_EN_V11 | |
| MMBench_TEST_CN, MMBench_TEST_CN_V11 | MMBench_TEST_CN does not provide answers |
| MMBench_TEST_EN, MMBench_TEST_EN_V11 | MMBench_TEST_EN does not provide answers |
| MMBench_dev_ar, MMBench_dev_cn, MMBench_dev_en, MMBench_dev_pt, MMBench_dev_ru, MMBench_dev_tr | |
| MMDU | |
| MME | |
| MMLongBench_DOC | |
| MMMB, MMMB_ar, MMMB_cn, MMMB_en, MMMB_pt, MMMB_ru, MMMB_tr | |
| MMMU_DEV_VAL, MMMU_TEST | |
| MMStar | |
| MMT-Bench_ALL, MMT-Bench_ALL_MI, MMT-Bench_VAL, MMT-Bench_VAL_MI | |
| MMVet | |
| MTL_MMBench_DEV | |
| MTVQA_TEST | |
| MVBench, MVBench_MP4 | |
| MathVision, MathVision_MINI, MathVista_MINI | |
| OCRBench | |
| OCRVQA_TEST, OCRVQA_TESTCORE | |
| POPE | |
| Q-Bench1_TEST, Q-Bench1_VAL | |
| RealWorldQA | |
| SEEDBench2, SEEDBench2_Plus, SEEDBench_IMG | |
| SLIDEVQA, SLIDEVQA_MINI | |
| ScienceQA_TEST, ScienceQA_VAL | |
| TaskMeAnything_v1_imageqa_random | |
| TextVQA_VAL | |
| VCR_EN_EASY_100, VCR_EN_EASY_500, VCR_EN_EASY_ALL | |
| VCR_EN_HARD_100, VCR_EN_HARD_500, VCR_EN_HARD_ALL | |
| VCR_ZH_EASY_100, VCR_ZH_EASY_500, VCR_ZH_EASY_ALL | |
| VCR_ZH_HARD_100, VCR_ZH_HARD_500, VCR_ZH_HARD_ALL | |
| Video-MME | |
Note
For detailed information about the datasets, refer to the VLMEvalKit Supported Multimodal Benchmark List.
You can view the dataset name list using the following code:
from evalscope.backend.vlm_eval_kit import VLMEvalKitBackendManager
print(f'** All datasets from VLMEvalKit backend: {VLMEvalKitBackendManager.list_supported_datasets()}')
3. Model Evaluation#
Model evaluation can be conducted in two ways: through deployed model services or local model inference. Details are as follows:
Method 1: Deployed Model Service Evaluation#
Model Deployment#
Here are four ways to deploy model services:
Use ms-swift to deploy model services. For more details, please refer to the ms-swift Deployment Guide.
Install ms-swift
pip install ms-swift -U
Deploy Model Service
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen-vl-chat --port 8000
Refer to the vLLM Tutorial for more details.
Install vLLM
pip install vllm -U
Deploy Model Service
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model InternVL2-8B --port 8000 --trust-remote-code --max-model-len 4096
Refer to LMDeploy Tutorial for more details.
Install LMDeploy
pip install lmdeploy -U
Deploy Model Services
CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server Qwen-VL-Chat --server-port 8000
Note
Ollama's support for the OpenAI API is currently experimental. This tutorial only provides an example; please adapt it to your actual setup.
Refer to the Ollama Tutorial.
Install Ollama
# For Linux
curl -fsSL https://ollama.com/install.sh | sh
Start Ollama
# Default port is 11434
ollama serve
Tip
If you use ollama pull to fetch a model, you can skip the following model-creation steps; if you use ollama import to import a model, you need to manually create a model configuration file (Modelfile).
Create Model Configuration File Modelfile
# Model path
FROM models/LLaVA
# Temperature coefficient
PARAMETER temperature 1
# System prompt
SYSTEM """
You are a helpful assistant.
"""
Create Model
This command will automatically convert the model to a format supported by Ollama, with support for various quantization methods.
ollama create llava -f ./Modelfile
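Whichever of the four deployment options you choose, it is worth sanity-checking that the OpenAI-compatible endpoint responds before configuring the evaluation. Below is a minimal sketch using the requests library; the port and the model name (qwen-vl-chat here) are assumptions that must match your actual deployment:
import requests

# Assumptions: the service listens on localhost:8000 and exposes a model
# named "qwen-vl-chat"; adjust both to match your deployment.
url = 'http://localhost:8000/v1/chat/completions'
payload = {
    'model': 'qwen-vl-chat',
    'messages': [{'role': 'user', 'content': 'Hello, who are you?'}],
}
resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()['choices'][0]['message']['content'])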
Configure Model Evaluation Parameters#
Create the evaluation configuration, either as a YAML file or as the equivalent Python dictionary (task_cfg_dict) shown below it:
work_dir: outputs
eval_backend: VLMEvalKit
eval_config:
  model:
    - type: qwen-vl-chat
      name: CustomAPIModel
      api_base: http://localhost:8000/v1/chat/completions
      key: EMPTY
      temperature: 0.0
      img_size: -1
  data:
    - SEEDBench_IMG
    - ChartQA_TEST
  mode: all
  limit: 20
  reuse: false
  nproc: 16
task_cfg_dict = {
'work_dir': 'outputs',
'eval_backend': 'VLMEvalKit',
'eval_config':
{'data': ['SEEDBench_IMG', 'ChartQA_TEST'],
'limit': 20,
'mode': 'all',
'model': [
{'api_base': 'http://localhost:8000/v1/chat/completions',
'key': 'EMPTY',
'name': 'CustomAPIModel',
'temperature': 0.0,
'type': 'qwen-vl-chat'}
],
'reuse': False}}
Method 2: Local Model Inference Evaluation#
This method does not involve starting a model service; instead, it directly configures model evaluation parameters for local inference.
Configure Model Evaluation Parameters#
work_dir: outputs
eval_backend: VLMEvalKit
eval_config:
  model:
    - name: qwen_chat
      model_path: models/Qwen-VL-Chat
  data:
    - SEEDBench_IMG
    - ChartQA_TEST
  mode: all
  limit: 20
  reuse: false
  work_dir: outputs
  nproc: 16
task_cfg_dict = {
'work_dir': 'outputs',
'eval_backend': 'VLMEvalKit',
'eval_config':
{'data': ['SEEDBench_IMG', 'ChartQA_TEST'],
'limit': 20,
'mode': 'all',
'model': [
{'name': 'qwen_chat',
'model_path': 'models/Qwen-VL-Chat'}
],
'reuse': False}}
Parameters#
- work_dir: A string specifying the directory where evaluation results, logs, and summaries are saved. The default value is outputs.
- eval_backend: Default value is VLMEvalKit, indicating the use of the VLMEvalKit evaluation backend.
- eval_config: A dictionary containing the following fields:
  - data: A list of dataset names drawn from the currently supported datasets listed above.
  - model: A list of dictionaries; each dictionary can specify the following fields:
    - When using remote API calls:
      - api_base: The URL of the OpenAI-compatible API, i.e., the URL of the model service.
      - type: The model name used for OpenAI API requests.
        - If deploying with ms-swift, set it to the value of --model_type;
        - If deploying with vLLM or LMDeploy, set it to the model_id;
        - If deploying with Ollama, set it to the model_name, which you can check with the ollama list command.
      - name: Fixed value; it must be CustomAPIModel.
      - key: The OpenAI API key for the model API; the default is EMPTY.
      - temperature: The temperature for model inference; the default is 0.0.
      - img_size: The image size for model inference; the default is -1, which means the original size is used; setting it to another value, such as 224, scales the image to 224x224.
      - video_llm: Boolean, default False. When evaluating video datasets, set it to True if the video_url parameter needs to be passed.
    - When using local model inference:
      - name: Model name; refer to Models Supported by VLMEvalKit.
      - model_path and other parameters: refer to VLMEvalKit Supported Model Parameters.
  - mode: Options: ['all', 'infer']; all runs both inference and evaluation; infer runs inference only.
  - limit: Integer specifying the number of evaluation samples; the default is None, meaning all examples are run.
  - reuse: Boolean indicating whether to reuse existing evaluation results; if False, all temporary evaluation files are deleted before the run.
    Note: For ms-vlmeval>=0.0.11, the parameter rerun has been renamed to reuse, with a default value of False. When set to True, you need to add use_cache in the task_cfg_dict to specify the cache directory to use (a hedged example follows this list).
  - work_dir: String specifying the directory to save evaluation results, logs, and summaries; the default value is outputs.
  - nproc: Integer indicating the number of parallel API calls.
  - nframe: Integer indicating the number of video frames sampled per video, with a default value of 8.
  - fps: Integer indicating the sampling frame rate for video datasets, with a default value of -1, which means nframe is used; if set to a value greater than 0, the number of frames is computed from fps.
  - use_subtitle: Boolean indicating whether the video dataset uses subtitles, with a default value of False.
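As an illustration of the reuse/use_cache note above, a configuration along the following lines re-runs evaluation on top of predictions cached from an earlier run. Treat it as a sketch: the cache directory is a placeholder, and the exact placement of use_cache may differ across ms-vlmeval versions, so verify it against your installation.
# Hypothetical sketch: reuse predictions cached from a previous run.
task_cfg_dict = {
    'work_dir': 'outputs',
    'eval_backend': 'VLMEvalKit',
    'use_cache': 'outputs/<previous_run_dir>',  # placeholder for an existing output directory
    'eval_config': {
        'data': ['SEEDBench_IMG', 'ChartQA_TEST'],
        'mode': 'all',
        'model': [
            {'api_base': 'http://localhost:8000/v1/chat/completions',
             'key': 'EMPTY',
             'name': 'CustomAPIModel',
             'temperature': 0.0,
             'type': 'qwen-vl-chat'}
        ],
        'reuse': True,
    },
}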
(Optional) Deploy Judge Model#
You can deploy a local language model as a judge/extractor using ms-swift. Refer to the ms-swift LLM Deployment Guide.
Note
If no judge model is deployed, exact matching is used for post-processing. If you do deploy one, you must also configure the judge model environment variables so that it is invoked correctly.
Deploy Judge Model#
# Deploy qwen2-7b as the judge
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen2-7b-instruct --model_id_or_path models/Qwen2-7B-Instruct --port 8866
Configure Judge Model Environment Variables#
Add the following configuration in the YAML configuration file:
OPENAI_API_KEY: EMPTY
OPENAI_API_BASE: http://127.0.0.1:8866/v1/chat/completions # Judge model's api_base
LOCAL_LLM: qwen2-7b-instruct # Judge model's model_id
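If you run the evaluation from a Python script with task_cfg_dict rather than a YAML file, the same three values can be exported as ordinary environment variables before run_task is called. This is a minimal sketch under the assumption that the backend reads them from the process environment:
import os

# Assumption: the judge model settings are read from the process environment;
# the values must match the judge service deployed above.
os.environ['OPENAI_API_KEY'] = 'EMPTY'
os.environ['OPENAI_API_BASE'] = 'http://127.0.0.1:8866/v1/chat/completions'
os.environ['LOCAL_LLM'] = 'qwen2-7b-instruct'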
4. Execute Evaluation Task#
Caution
If you want the model to perform inference again, you need to clear the model prediction results in the outputs folder before running the script. Previous prediction results are not automatically cleared; if they exist, the inference phase is skipped and the existing results are evaluated directly.
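One blunt but simple way to force a fresh inference pass is to delete the previous work directory before launching the task; a minimal sketch, assuming nothing in it still needs to be kept:
import shutil

# Assumption: the entire work directory is disposable; copy out anything you
# still need before running this.
shutil.rmtree('outputs', ignore_errors=True)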
After completing the configuration, run the following script:
from evalscope.run import run_task
from evalscope.summarizer import Summarizer
def run_eval():
    # Option 1: Python dictionary
    task_cfg = task_cfg_dict

    # Option 2: YAML configuration file
    # task_cfg = 'eval_openai_api.yaml'

    run_task(task_cfg=task_cfg)

    print('>> Start to get the report with summarizer ...')
    report_list = Summarizer.get_report_from_cfg(task_cfg)
    print(f'\n>> The report list: {report_list}')


run_eval()
Save the script as example_eval_openai_api.py and run the following command:
python example_eval_openai_api.py