Multimodal Large Models#
This framework supports two custom multimodal evaluation methods:
General-VQA Format: Based on OpenAI message format, supports multi-image/audio input, system prompts, and base64 encoding, suitable for Q&A-based multimodal evaluation tasks.
General-VMCQ Format: Similar to the MMMU format; question text can contain image placeholders <image x>. Suitable for multiple-choice multimodal evaluation tasks.
General-VQA Format#
1. Data Preparation#
Prepare data files conforming to OpenAI message format, supporting JSONL or TSV formats:
JSONL Format Example (example_openai.jsonl):
{"messages": [{"role": "user", "content": [{"type": "text", "text": "What animal is this?"}, {"type": "image_url", "image_url": {"url": "custom_eval/multimodal/images/dog.jpg"}}]}], "answer": "Dog"}
{"messages": [{"role": "user", "content": [{"type": "text", "text": "What building is this?"}, {"type": "image_url", "image_url": {"url": "custom_eval/multimodal/images/AMNH.jpg"}}]}], "answer": "Museum"}
TSV Format Example (example_openai.tsv):
messages answer
[{"role": "user", "content": [{"type": "text", "text": "What animal is this?"}, {"type": "image_url", "image_url": {"url": "custom_eval/multimodal/images/dog.jpg"}}]}] Dog
[{"role": "user", "content": [{"type": "text", "text": "What building is this?"}, {"type": "image_url", "image_url": {"url": "custom_eval/multimodal/images/AMNH.jpg"}}]}] Museum
Field Descriptions:
messages: OpenAI format message array, supporting:
    Text content: {"type": "text", "text": "question text"}
    Image URL: {"type": "image_url", "image_url": {"url": "path or base64"}}
    Audio input: {"type": "input_audio", "input_audio": {"data": "path or base64", "format": "wav"}}
    System message: {"role": "system", "content": "system prompt"}
answer: Reference answer (optional, used to calculate BLEU and Rouge scores)
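Records in this format can also be generated programmatically. The following is a minimal sketch using only the Python standard library. The helper names (`text_part`, `image_part`, `make_record`, `write_jsonl`, `write_tsv`) are illustrative assumptions and not part of the framework, and the TSV writer assumes that the `messages` column holds the JSON-serialized message array, as in the TSV example above.

```python
import json


def text_part(text):
    """Build a text content part."""
    return {"type": "text", "text": text}


def image_part(url):
    """Build an image content part; url may be a local path, HTTP URL, or data URL."""
    return {"type": "image_url", "image_url": {"url": url}}


def make_record(question, image_url, answer=None, system=None):
    """Assemble one General-VQA record (hypothetical helper)."""
    messages = []
    if system is not None:
        messages.append({"role": "system", "content": system})
    messages.append(
        {"role": "user", "content": [text_part(question), image_part(image_url)]}
    )
    record = {"messages": messages}
    if answer is not None:  # the reference answer is optional
        record["answer"] = answer
    return record


def write_jsonl(records, path):
    """Write one JSON object per line; pretty-printed JSON is not valid JSONL."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def write_tsv(records, path):
    """Write a tab-separated file; assumes answers contain no tabs or newlines."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("messages\tanswer\n")
        for rec in records:
            messages = json.dumps(rec["messages"], ensure_ascii=False)
            f.write(f"{messages}\t{rec.get('answer', '')}\n")
```

For example, `write_jsonl([make_record("What animal is this?", "custom_eval/multimodal/images/dog.jpg", answer="Dog")], "example_openai.jsonl")` produces a file equivalent to the first line of the JSONL example.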
Supported Image Formats:
Local path: "url": "custom_eval/multimodal/images/dog.jpg"
HTTP URL: "url": "https://example.com/image.jpg" (requires model service support)
Base64 encoding: "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
Supported Audio Formats:
Local path: "data": "custom_eval/multimodal/audio/sample.wav"
Base64 encoding: "data": "data:audio/wav;base64,UklGRiQ..."
Audio format (format field): supports "wav" and "mp3"
Multi-image Input
Supports using multiple images in one question:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Compare these two images:"},
        {"type": "image_url", "image_url": {"url": "image1.jpg"}},
        {"type": "text", "text": "and"},
        {"type": "image_url", "image_url": {"url": "image2.jpg"}},
        {"type": "text", "text": "What are the differences?"}
      ]
    }
  ],
  "answer": "The main differences are..."
}
System Prompt
You can add system messages to set the evaluation context:
{
  "messages": [
    {"role": "system", "content": "You are a medical AI assistant."},
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Analyze this X-ray:"},
        {"type": "image_url", "image_url": {"url": "xray.jpg", "detail": "high"}}
      ]
    }
  ],
  "answer": "The X-ray shows..."
}
Base64 Images
Supports directly using base64 encoded images:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."
          }
        }
      ]
    }
  ],
  "answer": "A beautiful landscape"
}
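A base64 data URL like the one above can be produced from a local file. The following is a minimal sketch using the standard library; the function names `image_to_data_url` and `base64_image_part` are hypothetical, and the MIME type is guessed from the file extension, which is an assumption that may not hold for files with unusual names.

```python
import base64
import mimetypes


def image_to_data_url(path):
    """Return a data URL ("data:<mime>;base64,<payload>") for a local image file."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {path!r}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def base64_image_part(path):
    """Build an image_url content part that embeds the image as base64."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}
```

Embedding images this way makes the dataset file self-contained, at the cost of a much larger file than one that stores paths.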
Audio Input
Supports audio content input using OpenAI input_audio format. The data field accepts either a local file path or base64-encoded data. The format field supports "wav" and "mp3":
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe the content of this audio clip."},
        {
          "type": "input_audio",
          "input_audio": {
            "data": "custom_eval/multimodal/audio/sample.wav",
            "format": "wav"
          }
        }
      ]
    }
  ],
  "answer": "A piano music performance."
}
You can also use base64-encoded audio data:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is being said in this audio?"},
        {
          "type": "input_audio",
          "input_audio": {
            "data": "UklGRiQAAABXQVZFZm10IBAAAA...",
            "format": "wav"
          }
        }
      ]
    }
  ],
  "answer": "Hello, world."
}
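The base64 payload for an audio part can be generated from a local file. The sketch below is an illustration, not framework code: `audio_part` is a hypothetical helper, the format is inferred from the file extension, and it emits the raw base64 string shown in the example above rather than a `data:audio/...` URL.

```python
import base64
from pathlib import Path

# Formats named in this section for the "format" field.
SUPPORTED_AUDIO_FORMATS = {"wav", "mp3"}


def audio_part(path):
    """Build an input_audio content part with base64-encoded file contents."""
    fmt = Path(path).suffix.lstrip(".").lower()
    if fmt not in SUPPORTED_AUDIO_FORMATS:
        raise ValueError(f"Unsupported audio format {fmt!r}; expected wav or mp3")
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {"type": "input_audio", "input_audio": {"data": data, "format": fmt}}
```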
2. Configure Evaluation Task#
Evaluate using Python API or CLI:
Python API:
from evalscope.run import run_task
from evalscope.config import TaskConfig
from os import environ as env

task_cfg = TaskConfig(
    model='qwen-vl-plus',
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=env.get('DASHSCOPE_API_KEY'),
    eval_type='openai_api',
    datasets=['general_vqa'],
    dataset_args={
        'general_vqa': {
            'local_path': 'custom_eval/multimodal/vqa',  # Dataset directory
            'subset_list': ['example_openai'],  # Filename (without extension)
        }
    },
    limit=5,  # Optional: limit number of evaluation samples
)

result = run_task(task_cfg=task_cfg)
CLI:
evalscope eval \
--model qwen-vl-plus \
--api-url https://dashscope.aliyuncs.com/compatible-mode/v1 \
--api-key "$DASHSCOPE_API_KEY" \
--eval-type openai_api \
--datasets general_vqa \
--dataset-args '{"general_vqa": {"local_path": "custom_eval/multimodal/vqa", "subset_list": ["example_openai"]}}' \
--limit 5
Evaluation will output BLEU and Rouge metrics:
+--------------+-------------+----------------+----------------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+==============+=============+================+================+=======+=========+=========+
| qwen-vl-plus | general_vqa | mean_bleu-1 | example_openai | 5 | 0.0067 | default |
+--------------+-------------+----------------+----------------+-------+---------+---------+
| qwen-vl-plus | general_vqa | mean_bleu-2 | example_openai | 5 | 0 | default |
+--------------+-------------+----------------+----------------+-------+---------+---------+
| qwen-vl-plus | general_vqa | mean_bleu-3 | example_openai | 5 | 0 | default |
+--------------+-------------+----------------+----------------+-------+---------+---------+
| qwen-vl-plus | general_vqa | mean_bleu-4 | example_openai | 5 | 0 | default |
+--------------+-------------+----------------+----------------+-------+---------+---------+
| qwen-vl-plus | general_vqa | mean_Rouge-1-R | example_openai | 5 | 0.4 | default |
+--------------+-------------+----------------+----------------+-------+---------+---------+
| qwen-vl-plus | general_vqa | mean_Rouge-1-P | example_openai | 5 | 0.0062 | default |
+--------------+-------------+----------------+----------------+-------+---------+---------+
| qwen-vl-plus | general_vqa | mean_Rouge-1-F | example_openai | 5 | 0.0121 | default |
+--------------+-------------+----------------+----------------+-------+---------+---------+
| qwen-vl-plus | general_vqa | mean_Rouge-2-R | example_openai | 5 | 0 | default |
+--------------+-------------+----------------+----------------+-------+---------+---------+
| qwen-vl-plus | general_vqa | mean_Rouge-2-P | example_openai | 5 | 0 | default |
+--------------+-------------+----------------+----------------+-------+---------+---------+
| qwen-vl-plus | general_vqa | mean_Rouge-2-F | example_openai | 5 | 0 | default |
+--------------+-------------+----------------+----------------+-------+---------+---------+
| qwen-vl-plus | general_vqa | mean_Rouge-L-R | example_openai | 5 | 0.4 | default |
+--------------+-------------+----------------+----------------+-------+---------+---------+
| qwen-vl-plus | general_vqa | mean_Rouge-L-P | example_openai | 5 | 0.0047 | default |
+--------------+-------------+----------------+----------------+-------+---------+---------+
| qwen-vl-plus | general_vqa | mean_Rouge-L-F | example_openai | 5 | 0.0093 | default |
+--------------+-------------+----------------+----------------+-------+---------+---------+
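The pattern in this table, with recall far above precision, is what one would expect when a one-word reference such as "Dog" is compared with a full-sentence model answer: recall rewards finding the reference word, while precision penalizes every extra word. The following simplified unigram-overlap sketch illustrates the effect. It is not the framework's scoring code, and the tokenization (lowercased alphanumeric words) and the sample prediction are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(reference, prediction):
    """Return (recall, precision, f1) based on clipped unigram overlap."""
    ref = Counter(tokenize(reference))
    pred = Counter(tokenize(prediction))
    overlap = sum((ref & pred).values())  # multiset intersection
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(pred.values()) if pred else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1


recall, precision, f1 = rouge1(
    "Dog", "This image shows a golden retriever dog sitting on the grass."
)
print(f"R={recall:.3f} P={precision:.3f} F={f1:.3f}")
```

Here the single reference word is found, so recall is 1, but only 1 of the 11 predicted words matches, so precision is 1/11. For short reference answers, n-gram metrics therefore say little about correctness.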
3. Configure Judge Model#
You can configure a judge model through the judge_model_args parameter. The judge model assesses the evaluated model's answers, and the evaluation then reports accuracy metrics:
from evalscope.run import run_task
from evalscope.config import TaskConfig
from evalscope.constants import JudgeStrategy
from os import environ as env

task_cfg = TaskConfig(
    model='qwen-vl-plus',
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=env.get('DASHSCOPE_API_KEY'),
    eval_type='openai_api',
    datasets=['general_vqa'],
    dataset_args={
        'general_vqa': {
            'local_path': 'custom_eval/multimodal/vqa',
            'subset_list': ['example_openai'],
        }
    },
    limit=5,
    judge_model_args={
        'model_id': 'qwen-plus',  # Does not need to be a multimodal model
        'api_url': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
        'api_key': env.get('DASHSCOPE_API_KEY'),
        'generation_config': {
            'temperature': 0.0,
            'max_tokens': 4096
        },
    },
    eval_batch_size=5,
    judge_strategy=JudgeStrategy.LLM,
)

result = run_task(task_cfg=task_cfg)
CLI (equivalent):
evalscope eval \
--model qwen-vl-plus \
--api-url https://dashscope.aliyuncs.com/compatible-mode/v1 \
--api-key "$DASHSCOPE_API_KEY" \
--eval-type openai_api \
--datasets general_vqa \
--dataset-args '{"general_vqa": {"local_path": "custom_eval/multimodal/vqa", "subset_list": ["example_openai"]}}' \
--limit 5 \
--judge-model-args '{"model_id": "qwen-plus", "api_url": "https://dashscope.aliyuncs.com/compatible-mode/v1", "api_key": "'"$DASHSCOPE_API_KEY"'", "generation_config": {"temperature": 0.0, "max_tokens": 4096}}' \
--judge-worker-num 5 \
--judge-strategy llm
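JSON arguments that embed an environment variable are awkward to quote in a shell, because variables are not expanded inside single quotes. One way to sidestep this is to assemble the argument list in Python and serialize the JSON with `json.dumps`. The sketch below uses only the flags and values from the CLI invocation above; the function name `build_judge_eval_command` is hypothetical.

```python
import json
import os


def build_judge_eval_command(api_key):
    """Return the argv list for the judge-model CLI run shown above."""
    api_url = "https://dashscope.aliyuncs.com/compatible-mode/v1"
    dataset_args = {
        "general_vqa": {
            "local_path": "custom_eval/multimodal/vqa",
            "subset_list": ["example_openai"],
        }
    }
    judge_model_args = {
        "model_id": "qwen-plus",
        "api_url": api_url,
        "api_key": api_key,
        "generation_config": {"temperature": 0.0, "max_tokens": 4096},
    }
    return [
        "evalscope", "eval",
        "--model", "qwen-vl-plus",
        "--api-url", api_url,
        "--api-key", api_key,
        "--eval-type", "openai_api",
        "--datasets", "general_vqa",
        "--dataset-args", json.dumps(dataset_args),
        "--limit", "5",
        "--judge-model-args", json.dumps(judge_model_args),
        "--judge-worker-num", "5",
        "--judge-strategy", "llm",
    ]


cmd = build_judge_eval_command(os.environ.get("DASHSCOPE_API_KEY", ""))
# To launch the evaluation, pass the list to subprocess without a shell, e.g.:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list means no shell is involved, so no extra quoting is needed around the JSON strings.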
Evaluation will output accuracy metrics:
+--------------+-------------+----------+----------------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+==============+=============+==========+================+=======+=========+=========+
| qwen-vl-plus | general_vqa | mean_acc | example_openai | 5 | 1 | default |
+--------------+-------------+----------+----------------+-------+---------+---------+
General-VMCQ Format#
1. Data Preparation#
General-VMCQ adopts a structure similar to MMMU: the question text can contain image placeholders <image x>, and options is a list written as a string, where each option can be text or an image placeholder.
Images support two forms (both strings):
Local or remote path/URL: "custom_eval/multimodal/images/dog.jpg" or "https://.../dog.jpg"
Base64 Data URL: "data:image/jpeg;base64,/9j/4AAQSk..."
Up to 100 images are supported (image_1 to image_100). If the text contains a placeholder for an image that does not exist, parsing stops at that point (break).
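To make the placeholder mechanism concrete, here is one possible reading of it as code. This is an illustrative sketch, not the framework's parser: the function name `split_placeholders`, the choice to keep the remaining text verbatim after an unresolved placeholder, and the OpenAI-style output parts are all assumptions.

```python
import re

PLACEHOLDER = re.compile(r"<image (\d+)>")
MAX_IMAGES = 100  # image_1 to image_100, as stated above


def split_placeholders(text, record):
    """Split text into content parts, resolving <image k> through record["image_k"].

    Resolution stops at the first placeholder that is out of range or has no
    image_k field; the rest of the text is kept as plain text (assumed behavior).
    """
    parts = []
    pos = 0
    for match in PLACEHOLDER.finditer(text):
        k = int(match.group(1))
        image = record.get(f"image_{k}")
        if not 1 <= k <= MAX_IMAGES or not image:
            break  # unresolved placeholder: stop parsing here
        before = text[pos:match.start()].strip()
        if before:
            parts.append({"type": "text", "text": before})
        parts.append({"type": "image_url", "image_url": {"url": image}})
        pos = match.end()
    rest = text[pos:].strip()
    if rest:
        parts.append({"type": "text", "text": rest})
    return parts
```

For the record `{"image_1": "custom_eval/multimodal/images/AMNH.jpg"}`, the question `<image 1> What building is this?` yields an image part followed by a text part.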
JSONL Example (example.jsonl):
{"question": "Which image shows a dog?", "options": ["<image 1>", "<image 2>", "<image 3>", "<image 4>"], "image_1": "custom_eval/multimodal/images/dog.jpg", "image_2": "custom_eval/multimodal/images/AMNH.jpg", "image_3": "custom_eval/multimodal/images/tesla.jpg", "image_4": "custom_eval/multimodal/images/tokyo.jpg", "answer": "A"}
{"question": "<image 1> What building is this?", "options": ["School", "Hospital", "Park", "Museum"], "image_1": "custom_eval/multimodal/images/AMNH.jpg", "answer": "D"}
TSV Example (example.tsv):
question options answer image_1 image_2 image_3 image_4
Which image shows a dog? ["<image 1>", "<image 2>", "<image 3>", "<image 4>"] A custom_eval/multimodal/images/dog.jpg custom_eval/multimodal/images/AMNH.jpg custom_eval/multimodal/images/tesla.jpg custom_eval/multimodal/images/tokyo.jpg
<image 1> What building is this? ["School", "Hospital", "Park", "Museum"] D custom_eval/multimodal/images/AMNH.jpg
Field Descriptions:
question: Question text, can contain <image x> placeholders
options: List (JSON array), elements can be text (e.g., "School") or image placeholders (e.g., "<image 1>"), no need to add prefixes like A., B.
answer: Correct answer letter (e.g., "A", "B")
image_k: Image string (local/remote path or base64 Data URL), k ∈ [1, 100]
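Before running an evaluation, it can help to check each record against the field rules above. The following is a small validation sketch; `validate_vmcq` is a hypothetical helper, and the specific checks (answer letter must index into `options`, every placeholder must have a matching `image_k`) are inferred from the field descriptions rather than taken from the framework.

```python
import re
import string

PLACEHOLDER = re.compile(r"<image (\d+)>")


def validate_vmcq(record):
    """Return a list of problems found in one General-VMCQ record (empty if none)."""
    problems = []
    options = record.get("options")
    if not isinstance(options, list) or not options:
        problems.append("options must be a non-empty list")
        options = []
    # Valid answer letters are A, B, ... up to the number of options.
    letters = string.ascii_uppercase[: len(options)]
    answer = record.get("answer")
    if not isinstance(answer, str) or len(answer) != 1 or answer not in letters:
        problems.append(f"answer {answer!r} is not one of {list(letters)}")
    # Every placeholder in the question or options needs an image_k field.
    texts = [str(record.get("question", ""))]
    texts += [opt for opt in options if isinstance(opt, str)]
    for text in texts:
        for k in PLACEHOLDER.findall(text):
            if not 1 <= int(k) <= 100:
                problems.append(f"<image {k}> is outside the range 1 to 100")
            elif not record.get(f"image_{k}"):
                problems.append(f"<image {k}> has no matching image_{k} field")
    return problems
```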
2. Configure Evaluation Task#
Python API:
from evalscope.run import run_task
from evalscope.config import TaskConfig
from os import environ as env
task_cfg = TaskConfig(
    model='qwen-vl-plus',
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=env.get('DASHSCOPE_API_KEY'),
    eval_type='openai_api',
    datasets=['general_vmcq'],
    dataset_args={
        'general_vmcq': {
            'local_path': 'custom_eval/multimodal/mcq',
            'subset_list': ['example'],
        }
    },
    limit=10,
)

result = run_task(task_cfg=task_cfg)
print(result)
CLI:
evalscope eval \
--model qwen-vl-plus \
--api-url https://dashscope.aliyuncs.com/compatible-mode/v1 \
--api-key "$DASHSCOPE_API_KEY" \
--eval-type openai_api \
--datasets general_vmcq \
--dataset-args '{"general_vmcq": {"local_path": "custom_eval/multimodal/mcq", "subset_list": ["example"]}}' \
--limit 10
3. Evaluation Results#
Evaluation will output accuracy metrics:
+--------------+--------------+----------+----------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+==============+==============+==========+==========+=======+=========+=========+
| qwen-vl-plus | general_vmcq | mean_acc | example | 3 | 1 | default |
+--------------+--------------+----------+----------+-------+---------+---------+
Based on VLMEvalKit (Deprecated)#
Warning
The following formats are legacy. It is recommended to use the General-VQA or General-VMCQ format described above instead.
Legacy format requires additional VLMEvalKit dependencies:
pip install evalscope[vlmeval]
Reference: Evaluating with VLMEvalKit Backend
Multiple Choice Format (MCQ)#
1. Data Preparation#
The evaluation metric is accuracy, and you need to define a tsv file in the following format (separated by \t):
index category answer question A B C D image_path
1 Animals A What animal is this? Dog Cat Tiger Elephant /root/LMUData/images/custom_mcq/dog.jpg
2 Buildings D What building is this? School Hospital Park Museum /root/LMUData/images/custom_mcq/AMNH.jpg
3 Cities B Which city's skyline is this? New York Tokyo Shanghai Paris /root/LMUData/images/custom_mcq/tokyo.jpg
4 Vehicles C What is the brand of this car? BMW Audi Tesla Mercedes /root/LMUData/images/custom_mcq/tesla.jpg
5 Activities A What is the person in the picture doing? Running Swimming Reading Singing /root/LMUData/images/custom_mcq/running.jpg
Where:
index is the question number
question is the question text
answer is the letter of the correct option
A, B, C, D are the options; at least two options are required
image_path is the image path (absolute path recommended); it can be replaced with an image field containing the base64-encoded image
category is the category (optional field)
Place this file in the ~/LMUData path, and you can use the filename for evaluation. For example, if the filename is custom_mcq.tsv, use custom_mcq for evaluation.
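Since this format is deprecated, existing files may need to be migrated to General-VMCQ. The sketch below shows one way to map legacy rows to General-VMCQ records. It is an assumption-laden illustration: the function names are hypothetical, the `<image 1>` placeholder is prepended to the question to mirror the General-VMCQ example, and the `index` and `category` columns are dropped because the General-VMCQ field list above does not define them.

```python
import csv
import io
import string


def legacy_row_to_vmcq(row):
    """Map one legacy MCQ row (dict keyed by TSV header) to a General-VMCQ record."""
    options = []
    for letter in string.ascii_uppercase:  # collect A, B, C, ... until one is missing
        value = row.get(letter)
        if value is None or value == "":
            break
        options.append(value)
    return {
        "question": f"<image 1> {row['question']}",
        "options": options,
        "image_1": row["image_path"],
        "answer": row["answer"],
    }


def convert_tsv_text(tsv_text):
    """Parse tab-separated legacy MCQ text and return General-VMCQ records."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [legacy_row_to_vmcq(row) for row in reader]
```

Each returned record can then be written as one line of a JSONL file with `json.dumps`.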
2. Configuration File#
The configuration file can be in python dict, yaml, or json format. For example, the following config.yaml file:
eval_backend: VLMEvalKit
eval_config:
  model:
    - type: qwen-vl-chat   # Deployed model name
      name: CustomAPIModel # Fixed value
      api_base: http://localhost:8000/v1/chat/completions
      key: EMPTY
      temperature: 0.0
      img_size: -1
  data:
    - custom_mcq # Custom dataset name, placed in `~/LMUData` path
  mode: all
  limit: 10
  reuse: false
  work_dir: outputs
  nproc: 1
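Because the configuration can also be a Python dict, the same settings can be kept in code. The following sketch mirrors the config.yaml above key for key; the `run_task` call is left commented out because it requires the VLMEvalKit dependencies and a deployed model service.

```python
# Python-dict equivalent of the config.yaml shown above.
task_cfg = {
    "eval_backend": "VLMEvalKit",
    "eval_config": {
        "model": [
            {
                "type": "qwen-vl-chat",    # Deployed model name
                "name": "CustomAPIModel",  # Fixed value
                "api_base": "http://localhost:8000/v1/chat/completions",
                "key": "EMPTY",
                "temperature": 0.0,
                "img_size": -1,
            }
        ],
        "data": ["custom_mcq"],  # Custom dataset name, placed in ~/LMUData
        "mode": "all",
        "limit": 10,
        "reuse": False,
        "work_dir": "outputs",
        "nproc": 1,
    },
}

# from evalscope.run import run_task
# run_task(task_cfg=task_cfg)
```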
3. Run Evaluation#
Run the following code to start evaluation:
from evalscope.run import run_task
run_task(task_cfg='config.yaml')
Evaluation results:
---------- ----
split none
Overall 1.0
Activities 1.0
Animals 1.0
Buildings 1.0
Cities 1.0
Vehicles 1.0
---------- ----
Custom Question-Answer Format (VQA)#
1. Data Preparation#
Prepare a tsv file in question-answer format as follows:
index answer question image_path
1 Dog What animal is this? /root/LMUData/images/custom_mcq/dog.jpg
2 Museum What building is this? /root/LMUData/images/custom_mcq/AMNH.jpg
3 Tokyo Which city's skyline is this? /root/LMUData/images/custom_mcq/tokyo.jpg
4 Tesla What is the brand of this car? /root/LMUData/images/custom_mcq/tesla.jpg
5 Running What is the person in the picture doing? /root/LMUData/images/custom_mcq/running.jpg
This file follows the same conventions as the multiple-choice format, without the option columns, where:
index is the question number
question is the question text
answer is the answer
image_path is the image path (absolute path recommended); it can be replaced with an image field containing the base64-encoded image
Place this file in the ~/LMUData path, and you can use the filename for evaluation. For example, if the filename is custom_vqa.tsv, use custom_vqa for evaluation.
2. Custom Evaluation Script#
The following is an example of a custom dataset. This example implements a custom evaluation script for question-answer format, which automatically loads the dataset, uses default prompts for Q&A, and finally calculates accuracy as the evaluation metric.
import os
import numpy as np
from vlmeval.dataset.image_base import ImageBaseDataset
from vlmeval.dataset.image_vqa import CustomVQADataset
from vlmeval.smp import load, dump, d2df


class CustomDataset:

    def load_data(self, dataset):
        # Custom dataset loading
        data_path = os.path.join(os.path.expanduser("~/LMUData"), f'{dataset}.tsv')
        return load(data_path)

    def build_prompt(self, line):
        msgs = ImageBaseDataset.build_prompt(self, line)
        # Add prompts or custom instructions here
        msgs[-1]['value'] += '\nAnswer the question with one word or phrase.'
        return msgs

    def evaluate(self, eval_file, **judge_kwargs):
        data = load(eval_file)
        assert 'answer' in data and 'prediction' in data
        data['prediction'] = [str(x) for x in data['prediction']]
        data['answer'] = [str(x) for x in data['answer']]

        print(data)

        # ========Calculate evaluation metrics as needed=========
        # Exact match
        result = np.mean(data['answer'] == data['prediction'])
        ret = {'Overall': result}
        ret = d2df(ret).round(2)
        # Save results
        suffix = eval_file.split('.')[-1]
        result_file = eval_file.replace(f'.{suffix}', '_acc.csv')
        dump(ret, result_file)
        return ret
        # ====================================


# Need to keep the following code to override default dataset class
CustomVQADataset.load_data = CustomDataset.load_data
CustomVQADataset.build_prompt = CustomDataset.build_prompt
CustomVQADataset.evaluate = CustomDataset.evaluate
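The exact-match comparison in `evaluate` is strict: a prediction of "dog." does not match the answer "Dog". If that is too harsh for a given dataset, the strings can be normalized before comparison. The following standalone sketch shows one such normalization (lowercasing, punctuation removal, whitespace collapsing); the function names and the normalization rules are assumptions for illustration and are not part of VLMEvalKit.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    return " ".join(str(text).lower().translate(_PUNCT_TABLE).split())


def exact_match_accuracy(answers, predictions, normalized=True):
    """Fraction of predictions equal to their answers, optionally after normalizing."""
    if len(answers) != len(predictions):
        raise ValueError("answers and predictions must have the same length")
    if not answers:
        return 0.0
    prep = normalize if normalized else str
    hits = sum(prep(a) == prep(p) for a, p in zip(answers, predictions))
    return hits / len(answers)
```

Inside `evaluate`, the same idea would amount to applying `normalize` to both columns before comparing them.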
3. Configuration File#
The configuration file can be in python dict, yaml, or json format. For example, the following config.yaml file:
eval_backend: VLMEvalKit
eval_config:
  model:
    - type: qwen-vl-chat
      name: CustomAPIModel
      api_base: http://localhost:8000/v1/chat/completions
      key: EMPTY
      temperature: 0.0
      img_size: -1
  data:
    - custom_vqa # Custom dataset name, placed in `~/LMUData` path
  mode: all
  limit: 10
  reuse: false
  work_dir: outputs
  nproc: 1
4. Run Evaluation#
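The complete evaluation script is as follows. The import assumes that the custom evaluation script above is saved as custom_dataset.py in the same directory: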
from custom_dataset import CustomDataset # Import custom dataset
from evalscope.run import run_task
run_task(task_cfg='config.yaml')
Evaluation results:
{'qwen-vl-chat_custom_vqa_acc': {'Overall': '1.0'}}