CLIP Benchmark

This framework integrates CLIP Benchmark, a unified framework and benchmark for evaluating and analyzing CLIP (Contrastive Language-Image Pretraining) and its variants. It currently supports 43 evaluation datasets, covering zero-shot retrieval tasks (metric: recall@k) and zero-shot classification tasks (metric: acc@k).

Environment Preparation

Install the required packages:

pip install 'evalscope[rag]' -U

Quick Start

The following example shows how to evaluate a CLIP model with minimal configuration:

from evalscope.run import run_task

task_cfg = {
    "work_dir": "outputs",
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "clip_benchmark",
        "eval": {
            "models": [
                {
                    "model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
                }
            ],
            "dataset_name": ["muge"],
            "split": "test",
        },
    },
}

run_task(task_cfg=task_cfg)

The evaluation results are written to a JSON file under the work directory:

outputs/chinese-clip-vit-large-patch14-336px/muge_zeroshot_retrieval.json

{"dataset": "muge", "model": "AI-ModelScope/chinese-clip-vit-large-patch14-336px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@5": 0.8935546875, "text_retrieval_recall@5": 0.876953125}}
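
A result file like the one above can be inspected with the standard `json` module. The sketch below embeds the sample content as a string so that it runs on its own; in practice you would read the file from the output path instead. The `summarize` helper is a hypothetical name used only for illustration.

```python
import json
from typing import List

# Sample content copied from the result file shown above.
raw = (
    '{"dataset": "muge", '
    '"model": "AI-ModelScope/chinese-clip-vit-large-patch14-336px", '
    '"task": "zeroshot_retrieval", '
    '"metrics": {"image_retrieval_recall@5": 0.8935546875, '
    '"text_retrieval_recall@5": 0.876953125}}'
)


def summarize(result: dict) -> List[str]:
    """Format each metric as 'name: value%' (hypothetical helper)."""
    return [
        f"{name}: {value:.2%}"
        for name, value in sorted(result["metrics"].items())
    ]


result = json.loads(raw)
lines = summarize(result)
for line in lines:
    print(line)
```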

Key parameters:

Parameter	Type	Description
`models`	`List[dict]`	Model configuration list. `model_name` is a model name or a local path; named models are downloaded automatically from ModelScope
`dataset_name`	`List[str]`	Dataset name list, see Supported Datasets
`split`	`str`	Dataset split, default `"test"`

Advanced: Multi-Model / Multi-Dataset Batch Evaluation

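To evaluate several models or datasets in one run, add entries to the `models` and `dataset_name` lists. The example below evaluates one model on two datasets and sets the optional parameters explicitly: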

from evalscope.run import run_task

task_cfg = {
    "work_dir": "outputs",
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "clip_benchmark",
        "eval": {
            "models": [
                {
                    "model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
                }
            ],
            "dataset_name": ["muge", "flickr8k"],
            "split": "test",
            "batch_size": 128,
            "num_workers": 1,
            "verbose": True,
            "skip_existing": False,
            "cache_dir": "cache",
            "limit": 1000,
        },
    },
}

run_task(task_cfg=task_cfg)

Additional parameters:

Parameter	Type	Default	Description
`batch_size`	`int`	`128`	Data loading batch size
`num_workers`	`int`	`1`	Number of data loading workers
`verbose`	`bool`	`True`	Enable verbose logging
`skip_existing`	`bool`	`False`	Skip processing if output already exists
`cache_dir`	`str`	`"cache"`	Dataset cache directory
`limit`	`Optional[int]`	`None`	Limit the number of samples to process
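
When the same configuration is reused for several models, it can be convenient to build the dictionary programmatically. The following is a minimal sketch, not part of the library: `build_clip_task_cfg` is a hypothetical helper, and the second model entry is a placeholder path rather than a real model.

```python
from typing import Any, Dict, List


def build_clip_task_cfg(
    model_names: List[str],
    dataset_names: List[str],
    **eval_overrides: Any,
) -> Dict[str, Any]:
    """Assemble a task_cfg dict in the shape shown above (hypothetical helper)."""
    eval_cfg: Dict[str, Any] = {
        "models": [{"model_name": name} for name in model_names],
        "dataset_name": list(dataset_names),
        "split": "test",
    }
    # Optional parameters such as batch_size or limit override the base values.
    eval_cfg.update(eval_overrides)
    return {
        "work_dir": "outputs",
        "eval_backend": "RAGEval",
        "eval_config": {"tool": "clip_benchmark", "eval": eval_cfg},
    }


task_cfg = build_clip_task_cfg(
    model_names=[
        "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
        "path/to/local-clip-model",  # placeholder path, not a real model
    ],
    dataset_names=["muge", "flickr8k"],
    batch_size=64,
    limit=1000,
)

# The result is passed to evalscope exactly as in the examples above:
# from evalscope.run import run_task
# run_task(task_cfg=task_cfg)
```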

Supported Datasets

Dataset Name	Task Type	Notes
muge	zeroshot_retrieval	Chinese Multimodal Dataset
flickr30k	zeroshot_retrieval
flickr8k	zeroshot_retrieval
mscoco_captions	zeroshot_retrieval
mscoco_captions2017	zeroshot_retrieval
imagenet1k	zeroshot_classification
imagenetv2	zeroshot_classification
imagenet_sketch	zeroshot_classification
imagenet-a	zeroshot_classification
imagenet-r	zeroshot_classification
imagenet-o	zeroshot_classification
objectnet	zeroshot_classification
fer2013	zeroshot_classification
voc2007	zeroshot_classification
voc2007_multilabel	zeroshot_classification
sun397	zeroshot_classification
cars	zeroshot_classification
fgvc_aircraft	zeroshot_classification
mnist	zeroshot_classification
stl10	zeroshot_classification
gtsrb	zeroshot_classification
country211	zeroshot_classification
renderedsst2	zeroshot_classification
vtab_caltech101	zeroshot_classification
vtab_cifar10	zeroshot_classification
vtab_cifar100	zeroshot_classification
vtab_clevr_count_all	zeroshot_classification
vtab_clevr_closest_object_distance	zeroshot_classification
vtab_diabetic_retinopathy	zeroshot_classification
vtab_dmlab	zeroshot_classification
vtab_dsprites_label_orientation	zeroshot_classification
vtab_dsprites_label_x_position	zeroshot_classification
vtab_dsprites_label_y_position	zeroshot_classification
vtab_dtd	zeroshot_classification
vtab_eurosat	zeroshot_classification
vtab_kitti_closest_vehicle_distance	zeroshot_classification
vtab_flowers	zeroshot_classification
vtab_pets	zeroshot_classification
vtab_pcam	zeroshot_classification
vtab_resisc45	zeroshot_classification
vtab_smallnorb_label_azimuth	zeroshot_classification
vtab_smallnorb_label_elevation	zeroshot_classification
vtab_svhn	zeroshot_classification
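
A typo in a dataset name is easiest to catch before a long run starts. The sketch below transcribes the table above into a lookup and checks a list of names against it. It only mirrors this page: the names `TASK_BY_DATASET` and `tasks_for` are hypothetical, and the library's own task inference may work differently.

```python
from typing import Dict, List

# Dataset names transcribed from the table above.
RETRIEVAL = ["muge", "flickr30k", "flickr8k", "mscoco_captions", "mscoco_captions2017"]

CLASSIFICATION = [
    "imagenet1k", "imagenetv2", "imagenet_sketch", "imagenet-a", "imagenet-r",
    "imagenet-o", "objectnet", "fer2013", "voc2007", "voc2007_multilabel",
    "sun397", "cars", "fgvc_aircraft", "mnist", "stl10", "gtsrb", "country211",
    "renderedsst2",
] + ["vtab_" + suffix for suffix in (
    "caltech101 cifar10 cifar100 clevr_count_all clevr_closest_object_distance "
    "diabetic_retinopathy dmlab dsprites_label_orientation "
    "dsprites_label_x_position dsprites_label_y_position dtd eurosat "
    "kitti_closest_vehicle_distance flowers pets pcam resisc45 "
    "smallnorb_label_azimuth smallnorb_label_elevation svhn"
).split()]

TASK_BY_DATASET: Dict[str, str] = {
    **{name: "zeroshot_retrieval" for name in RETRIEVAL},
    **{name: "zeroshot_classification" for name in CLASSIFICATION},
}


def tasks_for(dataset_names: List[str]) -> Dict[str, str]:
    """Map each dataset name to its task type, rejecting names not in the table."""
    unknown = [name for name in dataset_names if name not in TASK_BY_DATASET]
    if unknown:
        raise ValueError(f"Unsupported dataset name(s): {unknown}")
    return {name: TASK_BY_DATASET[name] for name in dataset_names}


print(len(TASK_BY_DATASET))
print(tasks_for(["muge", "vtab_svhn"]))
```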

Full Parameter Reference

- eval_backend: Set to RAGEval to use the RAGEval evaluation backend.
- eval_config: A dictionary containing the following fields:
  - tool: The evaluation tool; use clip_benchmark.
  - eval: The evaluation configuration, with the following fields:

Parameter	Type	Default	Description
`models`	`List[dict]`	`[]`	Model configuration list. `model_name` is a model name or a local path; named models are downloaded automatically from ModelScope
`dataset_name`	`List[str]`	`[]`	Dataset name list, see Supported Datasets
`split`	`str`	`"test"`	Dataset split
`task`	`Optional[str]`	`None`	Task type, auto-inferred by default
`batch_size`	`int`	`128`	Data loading batch size
`num_workers`	`int`	`1`	Number of data loading workers
`verbose`	`bool`	`True`	Enable verbose logging
`output_dir`	`str`	`"outputs"`	Output directory for evaluation results
`cache_dir`	`str`	`"cache"`	Dataset cache directory
`skip_existing`	`bool`	`False`	Skip processing if output already exists
`data_dir`	`Optional[str]`	`None`	Custom data directory
`limit`	`Optional[int]`	`None`	Limit the number of samples to process
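
To see which values a partial configuration ends up with, the defaults in the table can be merged with your overrides. This is a sketch under the assumption that the table above is accurate. `EVAL_DEFAULTS` and `resolve_eval_config` are hypothetical names, and the library applies its own defaults internally.

```python
import copy
from typing import Any, Dict

# Defaults transcribed from the table above, not read from the library.
EVAL_DEFAULTS: Dict[str, Any] = {
    "models": [],
    "dataset_name": [],
    "split": "test",
    "task": None,
    "batch_size": 128,
    "num_workers": 1,
    "verbose": True,
    "output_dir": "outputs",
    "cache_dir": "cache",
    "skip_existing": False,
    "data_dir": None,
    "limit": None,
}


def resolve_eval_config(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Merge overrides onto the documented defaults, rejecting unknown keys."""
    unknown = sorted(set(overrides) - set(EVAL_DEFAULTS))
    if unknown:
        raise KeyError(f"Unknown eval parameter(s): {unknown}")
    resolved = copy.deepcopy(EVAL_DEFAULTS)
    resolved.update(overrides)
    return resolved


resolved = resolve_eval_config({"dataset_name": ["muge"], "limit": 1000})
print(resolved["batch_size"], resolved["limit"])
```

Rejecting unknown keys catches misspellings such as `batchsize` early, before any model or dataset is downloaded.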

FAQ

Dataset Download Failure

If ModelScope dataset download fails, try configuring a mirror or manually downloading the dataset, then specify the local path via the data_dir parameter.
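
As a rough illustration, `data_dir` can be added to the `eval` section once the local directory has been checked. The helper name `eval_cfg_with_local_data` is hypothetical, and the directory layout that the tool expects inside `data_dir` is not described on this page, so treat this as a sketch only. A temporary directory stands in for a manually downloaded dataset so that the example runs as written.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict


def eval_cfg_with_local_data(eval_cfg: Dict[str, Any], data_dir: str) -> Dict[str, Any]:
    """Return a copy of eval_cfg with data_dir set, after checking it exists."""
    path = Path(data_dir).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"data_dir does not exist: {path}")
    return {**eval_cfg, "data_dir": str(path)}


# Stand-in for a directory that holds a manually downloaded dataset.
with tempfile.TemporaryDirectory() as tmp:
    eval_cfg = eval_cfg_with_local_data(
        {"dataset_name": ["muge"], "split": "test"},
        tmp,
    )
    print(eval_cfg["data_dir"])
```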

Slow Evaluation

- Increase batch_size (default 128) to improve throughput, keeping the GPU memory limit in mind.
- Increase num_workers (default 1) to speed up data loading.
- Set skip_existing to True to skip evaluations whose output already exists.
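
Before relying on `skip_existing`, it can help to list which result files are already present. The sketch below assumes the output naming seen in the Quick Start example (`<work_dir>/<last component of model name>/<dataset>_<task>.json`). That pattern is inferred from a single example and may not hold in every case. Both helper names are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def expected_result_path(work_dir: str, model_name: str, dataset: str, task: str) -> Path:
    """Guess the result file location from the naming seen in the Quick Start example."""
    model_dir = model_name.rstrip("/").split("/")[-1]
    return Path(work_dir) / model_dir / f"{dataset}_{task}.json"


def pending_runs(work_dir: str, model_name: str, runs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the (dataset, task) pairs whose result file is not on disk yet."""
    return [
        (dataset, task)
        for dataset, task in runs
        if not expected_result_path(work_dir, model_name, dataset, task).exists()
    ]


path = expected_result_path(
    "outputs",
    "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
    "muge",
    "zeroshot_retrieval",
)
print(path.as_posix())
```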

Metric Definitions

zeroshot_classification: reports acc1 (Top-1 accuracy) and acc5 (Top-5 accuracy)
zeroshot_retrieval: reports text_retrieval_recall@k and image_retrieval_recall@k
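
For intuition, both metric families can be computed from ranked predictions in a few lines of plain Python. This is an illustrative sketch, not the benchmark's implementation. It assumes the hit-based reading of recall@k, in which a query counts as a success when at least one relevant item appears among its top k results. The function names and toy data are made up for the example.

```python
from typing import Sequence, Set


def top_k_accuracy(ranked_labels: Sequence[Sequence[int]], true_labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest-ranked classes."""
    hits = sum(
        1 for ranked, truth in zip(ranked_labels, true_labels) if truth in ranked[:k]
    )
    return hits / len(true_labels)


def recall_at_k(ranked_items: Sequence[Sequence[int]], relevant_items: Sequence[Set[int]], k: int) -> float:
    """Fraction of queries with at least one relevant item among the top k results."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_items, relevant_items)
        if relevant.intersection(ranked[:k])
    )
    return hits / len(relevant_items)


# Toy classification data: class indices sorted by descending score per sample.
ranked_labels = [[2, 0, 1], [1, 2, 0], [0, 1, 2], [2, 1, 0]]
true_labels = [2, 2, 1, 0]
print(top_k_accuracy(ranked_labels, true_labels, k=1))  # 0.25
print(top_k_accuracy(ranked_labels, true_labels, k=2))  # 0.75

# Toy retrieval data: item indices sorted by descending similarity per query.
ranked_items = [[3, 1, 0], [0, 2, 3]]
relevant_items = [{1}, {3}]
print(recall_at_k(ranked_items, relevant_items, k=2))  # 0.5
```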