CLIP Benchmark#

This framework supports the CLIP Benchmark, which aims to provide a unified framework and benchmark for evaluating and analyzing CLIP (Contrastive Language-Image Pretraining) and its variants. Currently, the framework supports 43 evaluation datasets, including zero-shot retrieval tasks with the evaluation metric of recall@k, and zero-shot classification tasks with the evaluation metric of acc@k.

Supported Datasets#

Dataset Name

Task Type

Notes

muge

zeroshot_retrieval

Chinese Multimodal Dataset

flickr30k

zeroshot_retrieval

flickr8k

zeroshot_retrieval

mscoco_captions

zeroshot_retrieval

mscoco_captions2017

zeroshot_retrieval

imagenet1k

zeroshot_classification

imagenetv2

zeroshot_classification

imagenet_sketch

zeroshot_classification

imagenet-a

zeroshot_classification

imagenet-r

zeroshot_classification

imagenet-o

zeroshot_classification

objectnet

zeroshot_classification

fer2013

zeroshot_classification

voc2007

zeroshot_classification

voc2007_multilabel

zeroshot_classification

sun397

zeroshot_classification

cars

zeroshot_classification

fgvc_aircraft

zeroshot_classification

mnist

zeroshot_classification

stl10

zeroshot_classification

gtsrb

zeroshot_classification

country211

zeroshot_classification

renderedsst2

zeroshot_classification

vtab_caltech101

zeroshot_classification

vtab_cifar10

zeroshot_classification

vtab_cifar100

zeroshot_classification

vtab_clevr_count_all

zeroshot_classification

vtab_clevr_closest_object_distance

zeroshot_classification

vtab_diabetic_retinopathy

zeroshot_classification

vtab_dmlab

zeroshot_classification

vtab_dsprites_label_orientation

zeroshot_classification

vtab_dsprites_label_x_position

zeroshot_classification

vtab_dsprites_label_y_position

zeroshot_classification

vtab_dtd

zeroshot_classification

vtab_eurosat

zeroshot_classification

vtab_kitti_closest_vehicle_distance

zeroshot_classification

vtab_flowers

zeroshot_classification

vtab_pets

zeroshot_classification

vtab_pcam

zeroshot_classification

vtab_resisc45

zeroshot_classification

vtab_smallnorb_label_azimuth

zeroshot_classification

vtab_smallnorb_label_elevation

zeroshot_classification

vtab_svhn

zeroshot_classification

Environment Preparation#

Install the required packages

pip install evalscope[rag] -U

Configure Evaluation Parameters#

task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "clip_benchmark",
        "eval": {
            "models": [
                {
                    "model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
                }
            ],
            "dataset_name": ["muge", "flickr8k"],
            "split": "test",
            "batch_size": 128,
            "num_workers": 1,
            "verbose": True,
            "skip_existing": False,
            "output_dir": "outputs",
            "cache_dir": "cache",
            "limit": 1000,
        },
    },
}

Parameter Description#

  • eval_backend: Default value is RAGEval, indicating the use of the RAGEval evaluation backend.

  • eval_config: A dictionary containing the following fields:

    • tool: The evaluation tool, using clip_benchmark.

    • eval: A dictionary containing the following fields:

      • models: A list of model configurations, each with the following fields:

        • model_name: str The model name or path, e.g., AI-ModelScope/chinese-clip-vit-large-patch14-336px. Supports automatic downloading from the ModelScope repository.

      • dataset_name: List[str] A list of dataset names, e.g., ["muge", "flickr8k", "mnist"]. See Task List.

      • split: str The split of the dataset to use, default is test.

      • batch_size: int Batch size for data loading, default is 128.

      • num_workers: int Number of worker threads for data loading, default is 1.

      • verbose: bool Whether to enable detailed logging, default is True.

      • skip_existing: bool Whether to skip processing if output already exists, default is False.

      • output_dir: str Output directory, default is outputs.

      • cache_dir: str Dataset cache directory, default is cache.

      • limit: Optional[int] Limit the number of samples to process, default is None, e.g., 1000.

Run Evaluation Task#

from evalscope.run import run_task
from evalscope.utils.logger import get_logger

logger = get_logger()

# Run task
run_task(task_cfg=task_cfg) 

Output Evaluation Results#

outputs/chinese-clip-vit-large-patch14-336px/muge_zeroshot_retrieval.json#
{"dataset": "muge", "model": "AI-ModelScope/chinese-clip-vit-large-patch14-336px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@5": 0.8935546875, "text_retrieval_recall@5": 0.876953125}}

Custom Evaluation Dataset#

This framework supports custom evaluation datasets. You only need to prepare the dataset and configure it according to the following format.

Image-Text Retrieval Dataset#

1. Prepare the Dataset Prepare an image_queries.jsonl (fixed file name) image-text retrieval dataset with the following format:

custom_eval/multimodal/text-image-retrieval/image_queries.jsonl#
{"image_path": "custom_eval/multimodal/images/dog.jpg", "query": ["dog"]}
{"image_path": "custom_eval/multimodal/images/AMNH.jpg", "query": ["building"]}
{"image_path": "custom_eval/multimodal/images/tokyo.jpg", "query": ["city", "tokyo"]}
{"image_path": "custom_eval/multimodal/images/tesla.jpg", "query": ["car", "tesla"]}
{"image_path": "custom_eval/multimodal/images/running.jpg", "query": ["man", "running"]}

Where:

  • image_path: Path to the image, supports local paths.

  • query: Text descriptions for image-text retrieval, supports multiple descriptions, e.g., ["dog", "cat"].

2. Configure Evaluation Parameters

task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "clip_benchmark",
        "eval": {
            "models": [
                {
                    "model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
                }
            ],
            "dataset_name": ["custom"],
            "data_dir": "custom_eval/multimodal/text-image-retrieval",
            "split": "test",
            "batch_size": 128,
            "num_workers": 1,
            "verbose": True,
            "skip_existing": False,
            "limit": 1000,
        },
    },
}

Where:

  • dataset_name: Dataset name, must be specified as custom.

  • data_dir: Dataset directory, containing the image_queries.jsonl file.

3. Run the Evaluation Task

from evalscope.run import run_task

run_task(task_cfg=task_cfg)

The evaluation results output as follows:

{"dataset": "custom", "model": "AI-ModelScope/chinese-clip-vit-large-patch14-336px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@5": 1.0, "text_retrieval_recall@5": 1.0}}

Converting Image-Text Retrieval to Text Retrieval#

To facilitate the evaluation of different multimodal retrieval methods, this framework supports converting image-text retrieval problems into text retrieval problems using multimodal large models, and then performing text retrieval evaluation.

1. Prepare the Dataset

Supports Image-Text Retrieval Dataset and Custom Image-Text Retrieval Dataset.

2. Configure Evaluation Parameters

task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "clip_benchmark",
        "eval": {
            "models": [
                {
                    "model_name": "internvl2-8b",
                    "api_base": "http://localhost:8008/v1",
                    "api_key": "xxx",
                    "prompt": "Describe this image in English",
                }
            ],
            "dataset_name": ["muge"],
            "split": "test",
            "task": "image_caption",
            "batch_size": 2,
            "num_workers": 1,
            "verbose": True,
            "skip_existing": False,
            "limit": 10,
        },
    },
}

Parameter Description:

  • A multimodal large model must be configured in the models list:

    • model_name: Name of the multimodal large model, e.g., internvl2-8b.

    • api_base: API address of the multimodal large model, e.g., http://localhost:8008/v1.

    • api_key: API key of the multimodal large model, e.g., xxx.

    • prompt is the prompt input for the multimodal large model, e.g., "Describe this image in English".

  • task: Evaluation task, must be specified as image_caption.

3. Run the Conversion Task

from evalscope.run import run_task
run_task(task_cfg=task_cfg)

The output results are as follows:

2024-10-22 19:56:09,832 - evalscope - INFO - Write files to outputs/internvl2-8b/muge/retrieval_data
2024-10-22 19:56:10,543 - evalscope - INFO - Evaluation results: {'dataset': 'muge', 'model': 'internvl2-8b', 'task': 'image_caption', 'metrics': {'conversion_successful': True, 'save_path': 'outputs/internvl2-8b/muge/retrieval_data'}}
2024-10-22 19:56:10,544 - evalscope - INFO - Dump results to: outputs/internvl2-8b/muge_image_caption.json

The output file directory structure is as follows:

muge
β”œβ”€β”€ retrieval_data
β”‚   β”œβ”€β”€ corpus.jsonl
β”‚   β”œβ”€β”€ queries.jsonl
β”‚   └── qrels
β”‚       └── test.tsv
└── muge_image_caption.json

The specific content of the files is as follows:

outputs/internvl2-8b/muge/retrieval_data/corpus.jsonl#
{"_id":0,"text":"This is an advertisement image showcasing the products of the brand Aoyaqi. The image contains six cans of the brand's drink, with the brand name and graphics printed on the cans. The cans are arranged on a carton, which also has the brand name and graphics. The entire package is predominantly red and yellow, giving a striking and attractive impression."}
{"_id":1,"text":"These are fashionable glasses with a metal frame in rose gold color. The temples are black, and the inner side of the temple has a brand logo that looks like 'The Row'. The design of these glasses is quite modern, suitable for everyday wear."}
{"_id":2,"text":"This image shows a woman taking a selfie with her side profile using her phone. She has long brown hair and is wearing a pair of exquisite earrings that resemble the letter 'A'. The background is an indoor environment with light blue walls and light-colored cabinets."}
{"_id":3,"text":"This is an image of a black plastic bottle with a red label on it. The label has white and yellow text, including the product name, brand, and some graphics. The bottle cap is red and gray."}
{"_id":4,"text":"This is a picture of a living room with a single armchair. The backrest and seat cushion of the chair have a zebra print pattern in black and white, and the frame is black wood with curled armrests. The legs of the chair are black, with an elegant shape. The chair is placed on a carpeted floor, with part of a sofa and decorative painting visible in the background. The decor style of the room is warm and modern."}
{"_id":5,"text":"This is an image of a disposable paper cup. The cup is cylindrical, with a smooth wall and no obvious decorations or patterns. The rim of the cup slightly flares outwards for easy gripping. The cup is light gray or off-white and looks relatively thin. This type of paper cup is often used for drinks or cold foods and is suitable for one-time use."}
{"_id":6,"text":"This image shows four cartoon characters with colorful lights in the background. From left to right, the four characters are:\n\n1. A character in blue clothing with a purple headscarf and hair accessories.\n2. A character in blue-green clothing with blue hair accessories and wings.\n3. A character in pink clothing with red headwear and wings.\n4. A character in red and white clothing with red headwear.\n\nThe background has the words 'New Grimm's Fairy Tales' and 'NEW GREEN'."}
{"_id":7,"text":"This image shows a hand holding blue grapes. The person is wearing a green sweater, and the fingers are slender. The grapes are dark blue with a smooth surface, and each grape looks plump and juicy. There are some green leaves and dry twigs for decoration. The background is a wooden table, giving a natural and fresh feeling."}
{"_id":8,"text":"This is an image of a cute little mug, with a light green body and a round handle. The cup features a cute cartoon design, including a bunny wearing headphones and the words 'Love Learning'. There are two small ears and several stars next to it. The overall design of the mug is simple and cute, suitable for daily use."}
{"_id":9,"text":"This is an image showing a large number of thread-like objects in plastic packaging. These objects are stacked together and look like some kind of fiber or hemp rope, possibly for weaving or processing."}
outputs/internvl2-8b/muge/retrieval_data/queries.jsonl#
{"_id":0,"text":"Tamarind juice drink whole box Yunnan"}
{"_id":1,"text":"Da Vinci glasses"}
{"_id":2,"text":"Rhinestone bow earrings"}
{"_id":3,"text":"Dengzhou yellow wine"}
{"_id":4,"text":"Zebra print armchair"}
{"_id":5,"text":"Pudding cup mold"}
{"_id":6,"text":"Pretty Cure figurine set"}
{"_id":7,"text":"Blueberry model"}
{"_id":8,"text":"Cute drinking cup"}
{"_id":9,"text":"Fried noodles"}
outputs/internvl2-8b/muge/retrieval_data/qrels/test.tsv#
query-id    corpus-id   score
0           0           1
1           1           1
2           2           1
3           3           1
4           4           1
5           5           1
6           6           1
7           7           1
8           8           1
9           9           1

4. Perform Text Retrieval Task

With the dataset ready, you can perform the text retrieval task following the CMTEB tutorial.