CLIP Benchmark#

This framework supports the CLIP Benchmark, which aims to provide a unified framework and benchmark for evaluating and analyzing CLIP (Contrastive Language-Image Pretraining) and its variants. Currently, the framework supports 43 evaluation datasets, including zero-shot retrieval tasks with the evaluation metric of recall@k, and zero-shot classification tasks with the evaluation metric of acc@k.

Supported Datasets#

Dataset Name	Task Type	Notes
muge	zeroshot_retrieval	Chinese Multimodal Dataset
flickr30k	zeroshot_retrieval
flickr8k	zeroshot_retrieval
mscoco_captions	zeroshot_retrieval
mscoco_captions2017	zeroshot_retrieval
imagenet1k	zeroshot_classification
imagenetv2	zeroshot_classification
imagenet_sketch	zeroshot_classification
imagenet-a	zeroshot_classification
imagenet-r	zeroshot_classification
imagenet-o	zeroshot_classification
objectnet	zeroshot_classification
fer2013	zeroshot_classification
voc2007	zeroshot_classification
voc2007_multilabel	zeroshot_classification
sun397	zeroshot_classification
cars	zeroshot_classification
fgvc_aircraft	zeroshot_classification
mnist	zeroshot_classification
stl10	zeroshot_classification
gtsrb	zeroshot_classification
country211	zeroshot_classification
renderedsst2	zeroshot_classification
vtab_caltech101	zeroshot_classification
vtab_cifar10	zeroshot_classification
vtab_cifar100	zeroshot_classification
vtab_clevr_count_all	zeroshot_classification
vtab_clevr_closest_object_distance	zeroshot_classification
vtab_diabetic_retinopathy	zeroshot_classification
vtab_dmlab	zeroshot_classification
vtab_dsprites_label_orientation	zeroshot_classification
vtab_dsprites_label_x_position	zeroshot_classification
vtab_dsprites_label_y_position	zeroshot_classification
vtab_dtd	zeroshot_classification
vtab_eurosat	zeroshot_classification
vtab_kitti_closest_vehicle_distance	zeroshot_classification
vtab_flowers	zeroshot_classification
vtab_pets	zeroshot_classification
vtab_pcam	zeroshot_classification
vtab_resisc45	zeroshot_classification
vtab_smallnorb_label_azimuth	zeroshot_classification
vtab_smallnorb_label_elevation	zeroshot_classification
vtab_svhn	zeroshot_classification

Environment Preparation#

Install the required packages

pip install evalscope[rag] -U

Configure Evaluation Parameters#

task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "clip_benchmark",
        "eval": {
            "models": [
                {
                    "model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
                }
            ],
            "dataset_name": ["muge", "flickr8k"],
            "split": "test",
            "batch_size": 128,
            "num_workers": 1,
            "verbose": True,
            "skip_existing": False,
            "output_dir": "outputs",
            "cache_dir": "cache",
            "limit": 1000,
        },
    },
}

Parameter Description#

eval_backend: Default value is RAGEval, indicating the use of the RAGEval evaluation backend.
eval_config: A dictionary containing the following fields:
- tool: The evaluation tool, using clip_benchmark.
- eval: A dictionary containing the following fields:
  - models: A list of model configurations, each with the following fields:
    - model_name: str The model name or path, e.g., AI-ModelScope/chinese-clip-vit-large-patch14-336px. Supports automatic downloading from the ModelScope repository.
  - dataset_name: List[str] A list of dataset names, e.g., ["muge", "flickr8k", "mnist"]. See Task List.
  - split: str The split of the dataset to use, default is test.
  - batch_size: int Batch size for data loading, default is 128.
  - num_workers: int Number of worker threads for data loading, default is 1.
  - verbose: bool Whether to enable detailed logging, default is True.
  - skip_existing: bool Whether to skip processing if output already exists, default is False.
  - output_dir: str Output directory, default is outputs.
  - cache_dir: str Dataset cache directory, default is cache.
  - limit: Optional[int] Limit the number of samples to process, default is None, e.g., 1000.

Run Evaluation Task#

from evalscope.run import run_task
from evalscope.utils.logger import get_logger

logger = get_logger()

# Run task
run_task(task_cfg=task_cfg) 

Output Evaluation Results#

outputs/chinese-clip-vit-large-patch14-336px/muge_zeroshot_retrieval.json#

{"dataset": "muge", "model": "AI-ModelScope/chinese-clip-vit-large-patch14-336px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@5": 0.8935546875, "text_retrieval_recall@5": 0.876953125}}