Multimodal Large Model#
This framework supports two predefined dataset formats for multimodal evaluation: multiple-choice questions (MCQ) and question answering (QA). The usage process is as follows:
Note
Custom dataset evaluation requires the VLMEvalKit backend, which needs additional dependencies:
pip install evalscope[vlmeval]
Reference: Evaluation Backend with VLMEvalKit
Multiple-Choice Question Format (MCQ)#
1. Data Preparation#
The evaluation metric is accuracy. Define a TSV file in the following format (using \t as the separator):
index category answer question A B C D image_path
1 Animals A What animal is this? Dog Cat Tiger Elephant /root/LMUData/images/custom_mcq/dog.jpg
2 Buildings D What building is this? School Hospital Park Museum /root/LMUData/images/custom_mcq/AMNH.jpg
3 Cities B Which city's skyline is this? New York Tokyo Shanghai Paris /root/LMUData/images/custom_mcq/tokyo.jpg
4 Vehicles C What is the brand of this car? BMW Audi Tesla Mercedes /root/LMUData/images/custom_mcq/tesla.jpg
5 Activities A What is the person in the picture doing? Running Swimming Reading Singing /root/LMUData/images/custom_mcq/running.jpg
Where:
- index is the question number
- question is the question text
- answer is the answer, given as the correct option letter
- A, B, C, D are the options (at least two options are required)
- image_path is the image path (absolute paths are recommended); it can also be replaced with an image field containing the base64-encoded image
- category is the category (optional field)
Place this file in the ~/LMUData directory; the filename (without the extension) is then used as the dataset name for evaluation. For example, if the file is named custom_mcq.tsv, use custom_mcq for evaluation.
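The TSV can also be generated programmatically. Below is a minimal sketch with pandas, assuming the images already exist under ~/LMUData/images/custom_mcq (only the first row of the example table above is shown):

import os
import pandas as pd

# Minimal sketch: write the MCQ TSV with pandas (row content mirrors the example table above).
rows = [
    {'index': 1, 'category': 'Animals', 'answer': 'A',
     'question': 'What animal is this?',
     'A': 'Dog', 'B': 'Cat', 'C': 'Tiger', 'D': 'Elephant',
     'image_path': '/root/LMUData/images/custom_mcq/dog.jpg'},
]
df = pd.DataFrame(rows)
out_path = os.path.join(os.path.expanduser('~/LMUData'), 'custom_mcq.tsv')
df.to_csv(out_path, sep='\t', index=False)  # tab-separated, no extra pandas index column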
2. Configuration File#
The configuration file can be a Python dict or in yaml or json format; for example, the following config.yaml file:
eval_backend: VLMEvalKit
eval_config:
  model:
    - type: qwen-vl-chat   # Name of the deployed model
      name: CustomAPIModel # Fixed value
      api_base: http://localhost:8000/v1/chat/completions
      key: EMPTY
      temperature: 0.0
      img_size: -1
  data:
    - custom_mcq # Name of the custom dataset, placed in `~/LMUData`
  mode: all
  limit: 10
  reuse: false
  work_dir: outputs
  nproc: 1
See also
VLMEvalKit Parameter Description
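Since a Python dict is also accepted, the same settings can be written in code and handed to run_task in the next step instead of a YAML path. A minimal sketch mirroring config.yaml above:

# Same configuration as config.yaml, expressed as a Python dict.
task_cfg = {
    'eval_backend': 'VLMEvalKit',
    'eval_config': {
        'model': [{
            'type': 'qwen-vl-chat',    # Name of the deployed model
            'name': 'CustomAPIModel',  # Fixed value
            'api_base': 'http://localhost:8000/v1/chat/completions',
            'key': 'EMPTY',
            'temperature': 0.0,
            'img_size': -1,
        }],
        'data': ['custom_mcq'],  # Name of the custom dataset, placed in ~/LMUData
        'mode': 'all',
        'limit': 10,
        'reuse': False,
        'work_dir': 'outputs',
        'nproc': 1,
    },
}
# Pass task_cfg (instead of 'config.yaml') to run_task as shown below.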
3. Running Evaluation#
from evalscope.run import run_task
run_task(task_cfg='config.yaml')
The evaluation results are as follows:
---------- ----
split none
Overall 1.0
Activities 1.0
Animals 1.0
Buildings 1.0
Cities 1.0
Vehicles 1.0
---------- ----
Custom QA Question Format (VQA)#
1. Data Preparation#
Prepare a QA formatted tsv file as follows:
index answer question image_path
1 Dog What animal is this? /root/LMUData/images/custom_mcq/dog.jpg
2 Museum What building is this? /root/LMUData/images/custom_mcq/AMNH.jpg
3 Tokyo Which city's skyline is this? /root/LMUData/images/custom_mcq/tokyo.jpg
4 Tesla What is the brand of this car? /root/LMUData/images/custom_mcq/tesla.jpg
5 Running What is the person in the picture doing? /root/LMUData/images/custom_mcq/running.jpg
This file is similar to the MCQ format, where:
- index is the question number
- question is the question text
- answer is the answer
- image_path is the image path (absolute paths are recommended); it can also be replaced with an image field containing the base64-encoded image (a conversion sketch follows below)
Place this file in the ~/LMUData directory; the filename (without the extension) is then used as the dataset name for evaluation. For example, if the file is named custom_vqa.tsv, use custom_vqa for evaluation.
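If you prefer not to reference image files by path, the image_path column can be replaced with an image column holding the base64-encoded file, as noted above. A minimal sketch of the conversion, assuming a path-based TSV already exists in ~/LMUData:

import base64
import os
import pandas as pd

def encode_image(path: str) -> str:
    # Read an image file and return its contents as a base64 string.
    with open(path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')

# Convert the path-based TSV into one with an embedded base64 image column.
tsv_path = os.path.join(os.path.expanduser('~/LMUData'), 'custom_vqa.tsv')
df = pd.read_csv(tsv_path, sep='\t')
df['image'] = df['image_path'].apply(encode_image)
df = df.drop(columns=['image_path'])
df.to_csv(tsv_path, sep='\t', index=False)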
2. Custom Evaluation Script#
Below is an example of a custom evaluation script for the QA format. The script automatically loads the dataset, uses default prompts for question answering, and computes exact-match accuracy as the evaluation metric.
import os
import numpy as np
from vlmeval.dataset.image_base import ImageBaseDataset
from vlmeval.dataset.image_vqa import CustomVQADataset
from vlmeval.smp import load, dump, d2df


class CustomDataset:

    def load_data(self, dataset):
        # Load custom dataset
        data_path = os.path.join(os.path.expanduser("~/LMUData"), f'{dataset}.tsv')
        return load(data_path)

    def build_prompt(self, line):
        msgs = ImageBaseDataset.build_prompt(self, line)
        # Add prompts or custom instructions here
        msgs[-1]['value'] += '\nAnswer the question in one word or phrase.'
        return msgs

    def evaluate(self, eval_file, **judge_kwargs):
        data = load(eval_file)
        assert 'answer' in data and 'prediction' in data
        data['prediction'] = [str(x) for x in data['prediction']]
        data['answer'] = [str(x) for x in data['answer']]
        print(data)

        # ========Compute the evaluation metric as needed=========
        # Exact match
        result = np.mean(data['answer'] == data['prediction'])
        ret = {'Overall': result}
        ret = d2df(ret).round(2)
        # Save the result
        suffix = eval_file.split('.')[-1]
        result_file = eval_file.replace(f'.{suffix}', '_acc.csv')
        dump(ret, result_file)
        return ret
        # ========================================================


# Keep the following code and override the default dataset class
CustomVQADataset.load_data = CustomDataset.load_data
CustomVQADataset.build_prompt = CustomDataset.build_prompt
CustomVQADataset.evaluate = CustomDataset.evaluate
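The exact-match comparison above is strict about casing and whitespace. As a sketch of a more lenient variant (the helper below is illustrative, not part of VLMEvalKit), answers and predictions could be normalized before matching and the result swapped into evaluate in place of the exact-match line:

import numpy as np

def normalized_match(answers, predictions):
    # Case- and whitespace-insensitive exact match (illustrative helper).
    norm = lambda s: str(s).strip().lower()
    return float(np.mean([norm(a) == norm(p) for a, p in zip(answers, predictions)]))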
3. Configuration File#
The configuration file can be a Python dict or in yaml or json format; for example, the following config.yaml file:
eval_backend: VLMEvalKit
eval_config:
  model:
    - type: qwen-vl-chat
      name: CustomAPIModel
      api_base: http://localhost:8000/v1/chat/completions
      key: EMPTY
      temperature: 0.0
      img_size: -1
  data:
    - custom_vqa # Name of the custom dataset, placed in `~/LMUData`
  mode: all
  limit: 10
  reuse: false
  work_dir: outputs
  nproc: 1
4. Running Evaluation#
The complete evaluation script is as follows (save the script from step 2 as custom_dataset.py so the import below can find it):
from custom_dataset import CustomDataset # Import the custom dataset
from evalscope.run import run_task
run_task(task_cfg='config.yaml')
The evaluation results are as follows:
{'qwen-vl-chat_custom_vqa_acc': {'Overall': '1.0'}}