Multimodal Large Models#

This framework supports two predefined custom dataset formats for multimodal models: multiple-choice questions (MCQ) and question answering (VQA). The usage process is as follows:

Note

Custom dataset evaluation relies on VLMEvalKit, which needs additional dependencies:

pip install evalscope[vlmeval]

Reference: Evaluation Backend with VLMEvalKit

Custom Multiple-Choice Question Format (MCQ)#

1. Data Preparation#

The evaluation metric is accuracy. Define a TSV file in the following format (using a tab, \t, as the separator):

index	category	answer	question	A	B	C	D	image_path
1	Animals	A	What animal is this?	Dog	Cat	Tiger	Elephant	/root/LMUData/images/custom_mcq/dog.jpg
2	Buildings	D	What building is this?	School	Hospital	Park	Museum	/root/LMUData/images/custom_mcq/AMNH.jpg
3	Cities	B	Which city's skyline is this?	New York	Tokyo	Shanghai	Paris	/root/LMUData/images/custom_mcq/tokyo.jpg
4	Vehicles	C	What is the brand of this car?	BMW	Audi	Tesla	Mercedes	/root/LMUData/images/custom_mcq/tesla.jpg
5	Activities	A	What is the person in the picture doing?	Running	Swimming	Reading	Singing	/root/LMUData/images/custom_mcq/running.jpg

Where:

  • index is the question number

  • question is the question text

  • A, B, C, D are the options (at least two options are required)

  • answer is the letter of the correct option

  • image_path is the image path (absolute paths are recommended); it can be replaced with an image field containing the base64-encoded image

  • category is the category (optional field)

Place this file in the ~/LMUData directory; the filename (without the extension) then serves as the dataset name for evaluation. For example, if the file is named custom_mcq.tsv, use custom_mcq for evaluation.
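
For convenience, such a file can also be generated programmatically. Below is a minimal sketch using pandas; the row shown and the image path are illustrative, so point them at your own files:

import os
import pandas as pd

# Example row matching the table above; adjust the image paths to your own files.
rows = [
    {'index': 1, 'category': 'Animals', 'answer': 'A',
     'question': 'What animal is this?',
     'A': 'Dog', 'B': 'Cat', 'C': 'Tiger', 'D': 'Elephant',
     'image_path': '/root/LMUData/images/custom_mcq/dog.jpg'},
]

out_dir = os.path.expanduser('~/LMUData')
os.makedirs(out_dir, exist_ok=True)
# Write a tab-separated file, as required by the format above.
pd.DataFrame(rows).to_csv(os.path.join(out_dir, 'custom_mcq.tsv'), sep='\t', index=False)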

2. Configuration File#

The configuration file can be in Python dict, YAML, or JSON format; for example, the following config.yaml file:

eval_backend: VLMEvalKit
eval_config:
  model: 
    - type: qwen-vl-chat   # Name of the deployed model
      name: CustomAPIModel # Fixed value
      api_base: http://localhost:8000/v1/chat/completions
      key: EMPTY
      temperature: 0.0
      img_size: -1
  data:
    - custom_mcq # Name of the custom dataset, placed in `~/LMUData`
  mode: all
  limit: 10
  reuse: false
  work_dir: outputs
  nproc: 1
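
Since the configuration can also be a Python dict, the same settings can be written directly in code (a sketch equivalent to the YAML above):

task_cfg = {
    'eval_backend': 'VLMEvalKit',
    'eval_config': {
        'model': [{
            'type': 'qwen-vl-chat',    # name of the deployed model
            'name': 'CustomAPIModel',  # fixed value
            'api_base': 'http://localhost:8000/v1/chat/completions',
            'key': 'EMPTY',
            'temperature': 0.0,
            'img_size': -1,
        }],
        'data': ['custom_mcq'],  # name of the custom dataset, placed in `~/LMUData`
        'mode': 'all',
        'limit': 10,
        'reuse': False,
        'work_dir': 'outputs',
        'nproc': 1,
    },
}

In the next step, this dict can be passed to run_task directly, i.e. run_task(task_cfg=task_cfg), in place of the file path.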

See also

VLMEvalKit Parameter Description

3. Running Evaluation#

from evalscope.run import run_task

run_task(task_cfg='config.yaml')

The evaluation results are as follows:

----------  ----
split       none
Overall     1.0
Activities  1.0
Animals     1.0
Buildings   1.0
Cities      1.0
Vehicles    1.0
----------  ----

Custom Question-Answering Format (VQA)#

1. Data Preparation#

Prepare a TSV file in the QA format as follows:

index	answer	question	image_path
1	Dog	What animal is this?	/root/LMUData/images/custom_mcq/dog.jpg
2	Museum	What building is this?	/root/LMUData/images/custom_mcq/AMNH.jpg
3	Tokyo	Which city's skyline is this?	/root/LMUData/images/custom_mcq/tokyo.jpg
4	Tesla	What is the brand of this car?	/root/LMUData/images/custom_mcq/tesla.jpg
5	Running	What is the person in the picture doing?	/root/LMUData/images/custom_mcq/running.jpg

This file is similar to the MCQ format, where:

  • index is the question number

  • question is the question text

  • answer is the reference answer

  • image_path is the image path (absolute paths are recommended); it can be replaced with an image field containing the base64-encoded image

Place this file in the ~/LMUData directory; the filename (without the extension) then serves as the dataset name for evaluation. For example, if the file is named custom_vqa.tsv, use custom_vqa for evaluation.
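
As noted above, both formats also accept an image column with the base64-encoded file content in place of image_path. A minimal encoding sketch:

import base64

def encode_image(path: str) -> str:
    # Read an image file and return its content base64-encoded as a string.
    with open(path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')

# Example: the value to put in the `image` column of one row.
image_b64 = encode_image('/root/LMUData/images/custom_mcq/dog.jpg')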

2. Custom Evaluation Script#

Below is an example of a custom evaluation script for the QA format. The script automatically loads the dataset, applies a default prompt for question answering, and computes accuracy as the evaluation metric.

import os
import numpy as np
from vlmeval.dataset.image_base import ImageBaseDataset
from vlmeval.dataset.image_vqa import CustomVQADataset
from vlmeval.smp import load, dump, d2df

class CustomDataset:
    def load_data(self, dataset):
        # Load custom dataset
        data_path = os.path.join(os.path.expanduser("~/LMUData"), f'{dataset}.tsv')
        return load(data_path)
        
    def build_prompt(self, line):
        msgs = ImageBaseDataset.build_prompt(self, line)
        # Add prompts or custom instructions here
        msgs[-1]['value'] += '\nAnswer the question in one word or phrase.'
        return msgs
    
    def evaluate(self, eval_file, **judge_kwargs):
        data = load(eval_file)
        assert 'answer' in data and 'prediction' in data
        data['prediction'] = [str(x) for x in data['prediction']]
        data['answer'] = [str(x) for x in data['answer']]
        
        print(data)  # inspect the merged answers and predictions
        
        # ========Compute the evaluation metric as needed=========
        # Exact match
        result = np.mean(data['answer'] == data['prediction'])
        ret = {'Overall': result}
        ret = d2df(ret).round(2)
        # Save the result
        suffix = eval_file.split('.')[-1]
        result_file = eval_file.replace(f'.{suffix}', '_acc.csv')
        dump(ret, result_file)
        return ret
        # ========================================================
        
# Keep the following code to override the methods of the default dataset class
CustomVQADataset.load_data = CustomDataset.load_data
CustomVQADataset.build_prompt = CustomDataset.build_prompt
CustomVQADataset.evaluate = CustomDataset.evaluate
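
The exact match above is strict about case and surrounding whitespace. If your reference answers allow minor variation, a normalized comparison could replace the exact-match line; this is an illustrative tweak, not part of the default script:

def normalized_match(answer, prediction):
    # Compare after lowercasing and trimming surrounding whitespace.
    return str(answer).strip().lower() == str(prediction).strip().lower()

# In evaluate(), replace the exact-match line with:
# result = np.mean([normalized_match(a, p) for a, p in zip(data['answer'], data['prediction'])])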

3. Configuration File#

The configuration file can be in Python dict, YAML, or JSON format. For example, the following config.yaml file:

config.yaml#
eval_backend: VLMEvalKit
eval_config:
  model: 
    - type: qwen-vl-chat   
      name: CustomAPIModel 
      api_base: http://localhost:8000/v1/chat/completions
      key: EMPTY
      temperature: 0.0
      img_size: -1
  data:
    - custom_vqa # Name of the custom dataset, placed in `~/LMUData`
  mode: all
  limit: 10
  reuse: false
  work_dir: outputs
  nproc: 1

4. Running Evaluation#

The complete evaluation script is as follows (assuming the script above is saved as custom_dataset.py):

from custom_dataset import CustomDataset  # importing applies the method overrides to CustomVQADataset
from evalscope.run import run_task

run_task(task_cfg='config.yaml')

The evaluation results are as follows:

{'qwen-vl-chat_custom_vqa_acc': {'Overall': '1.0'}}