CLIP Benchmark#

This framework supports CLIP Benchmark, which provides a unified framework for evaluating and analyzing CLIP (Contrastive Language-Image Pretraining) models and their variants. It currently supports 43 evaluation datasets, covering zero-shot retrieval tasks (evaluated with recall@k) and zero-shot classification tasks (evaluated with acc@k).
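As background on the metrics (an illustrative sketch, not code from the framework): recall@k counts a retrieval query as correct when the ground-truth item appears among the top-k retrieved candidates, and acc@k is the analogous top-k accuracy for classification.

def recall_at_k(ranked_ids, true_id, k):
    """Return 1.0 if the ground-truth item appears in the top-k results."""
    return 1.0 if true_id in ranked_ids[:k] else 0.0

# A text query whose matching image is ranked 3rd still scores under recall@5:
print(recall_at_k(["img7", "img2", "img5", "img1", "img9"], "img5", k=5))  # 1.0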

Supported Datasets#

| Dataset Name | Task Type | Notes |
|--------------|-----------|-------|
| muge | zeroshot_retrieval | Chinese Multimodal Dataset |
| flickr30k | zeroshot_retrieval | |
| flickr8k | zeroshot_retrieval | |
| mscoco_captions | zeroshot_retrieval | |
| mscoco_captions2017 | zeroshot_retrieval | |
| imagenet1k | zeroshot_classification | |
| imagenetv2 | zeroshot_classification | |
| imagenet_sketch | zeroshot_classification | |
| imagenet-a | zeroshot_classification | |
| imagenet-r | zeroshot_classification | |
| imagenet-o | zeroshot_classification | |
| objectnet | zeroshot_classification | |
| fer2013 | zeroshot_classification | |
| voc2007 | zeroshot_classification | |
| voc2007_multilabel | zeroshot_classification | |
| sun397 | zeroshot_classification | |
| cars | zeroshot_classification | |
| fgvc_aircraft | zeroshot_classification | |
| mnist | zeroshot_classification | |
| stl10 | zeroshot_classification | |
| gtsrb | zeroshot_classification | |
| country211 | zeroshot_classification | |
| renderedsst2 | zeroshot_classification | |
| vtab_caltech101 | zeroshot_classification | |
| vtab_cifar10 | zeroshot_classification | |
| vtab_cifar100 | zeroshot_classification | |
| vtab_clevr_count_all | zeroshot_classification | |
| vtab_clevr_closest_object_distance | zeroshot_classification | |
| vtab_diabetic_retinopathy | zeroshot_classification | |
| vtab_dmlab | zeroshot_classification | |
| vtab_dsprites_label_orientation | zeroshot_classification | |
| vtab_dsprites_label_x_position | zeroshot_classification | |
| vtab_dsprites_label_y_position | zeroshot_classification | |
| vtab_dtd | zeroshot_classification | |
| vtab_eurosat | zeroshot_classification | |
| vtab_kitti_closest_vehicle_distance | zeroshot_classification | |
| vtab_flowers | zeroshot_classification | |
| vtab_pets | zeroshot_classification | |
| vtab_pcam | zeroshot_classification | |
| vtab_resisc45 | zeroshot_classification | |
| vtab_smallnorb_label_azimuth | zeroshot_classification | |
| vtab_smallnorb_label_elevation | zeroshot_classification | |
| vtab_svhn | zeroshot_classification | |

Environment Preparation#

Install the required packages:

pip install evalscope[rag] -U

Configure Evaluation Parameters#

task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "clip_benchmark",
        "eval": {
            "models": [
                {
                    "model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
                }
            ],
            "dataset_name": ["muge", "flickr8k"],
            "split": "test",
            "batch_size": 128,
            "num_workers": 1,
            "verbose": True,
            "skip_existing": False,
            "output_dir": "outputs",
            "cache_dir": "cache",
            "limit": 1000,
        },
    },
}

Parameter Description#

  • eval_backend: Default value is RAGEval, indicating the use of the RAGEval evaluation backend.

  • eval_config: A dictionary containing the following fields:

    • tool: The evaluation tool, using clip_benchmark.

    • eval: A dictionary containing the following fields:

      • models: A list of model configurations, each with the following fields:

        • model_name: str The model name or path, e.g., AI-ModelScope/chinese-clip-vit-large-patch14-336px. Supports automatic downloading from the ModelScope repository.

      • dataset_name: List[str] A list of dataset names, e.g., ["muge", "flickr8k", "mnist"]. See the Supported Datasets table above; a shorter example configuration is sketched after this list.

      • split: str The split of the dataset to use, default is test.

      • batch_size: int Batch size for data loading, default is 128.

      • num_workers: int Number of worker threads for data loading, default is 1.

      • verbose: bool Whether to enable detailed logging, default is True.

      • skip_existing: bool Whether to skip processing if output already exists, default is False.

      • output_dir: str Output directory, default is outputs.

      • cache_dir: str Dataset cache directory, default is cache.

      • limit: Optional[int] Limit the number of samples to process, default is None, e.g., 1000.
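Assuming the documented defaults apply when a field is omitted, a configuration can be much shorter than the one above. A minimal sketch that switches to a zero-shot classification dataset (the local path "./my-clip-checkpoint" is a hypothetical illustration of passing a model by path rather than by ModelScope ID):

task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "clip_benchmark",
        "eval": {
            "models": [
                # A ModelScope ID is downloaded automatically; a local path
                # such as "./my-clip-checkpoint" (hypothetical) also works.
                {"model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px"},
            ],
            # A classification dataset this time; scored with acc@k.
            "dataset_name": ["mnist"],
        },
    },
}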

Run Evaluation Task#

from evalscope.run import run_task
from evalscope.utils.logger import get_logger

logger = get_logger()

# Run task
run_task(task_cfg=task_cfg) 

Output Evaluation Results#

outputs/chinese-clip-vit-large-patch14-336px/muge_zeroshot_retrieval.json#
{"dataset": "muge", "model": "AI-ModelScope/chinese-clip-vit-large-patch14-336px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@5": 0.8935546875, "text_retrieval_recall@5": 0.876953125}}

Custom Evaluation Dataset#