CLIP Benchmark#

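This framework supports CLIP Benchmark, which provides a unified framework and benchmark for evaluating and analyzing CLIP (Contrastive Language-Image Pretraining) and its variants. The framework currently supports 43 evaluation datasets covering two task types: zero-shot retrieval, evaluated with recall@k, and zero-shot classification, evaluated with acc@k.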
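To make the two metrics concrete, here is a minimal pure-Python sketch. It assumes acc@k counts a sample as correct when its true label is among the k highest-scoring classes, and that recall@k counts a query as a hit when at least one relevant item is among its k highest-scoring candidates. The function names (`topk_indices`, `acc_at_k`, `recall_at_k`) are illustrative and are not taken from the framework, whose exact computation may differ.

```python
def topk_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def acc_at_k(score_rows, labels, k):
    """Fraction of samples whose true label is among the top-k classes."""
    hits = sum(
        label in topk_indices(row, k) for row, label in zip(score_rows, labels)
    )
    return hits / len(labels)


def recall_at_k(score_rows, positives, k):
    """Fraction of queries with at least one relevant item in the top-k.

    `positives` holds, for each query, the set of relevant candidate indices
    (an assumed, simplified definition of retrieval recall@k).
    """
    hits = sum(
        bool(set(topk_indices(row, k)) & pos)
        for row, pos in zip(score_rows, positives)
    )
    return hits / len(positives)


# Toy similarity scores: 3 queries (rows) against 3 candidates (columns).
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]             # one true class per sample
positives = [{1}, {1, 2}, {0}]  # relevant candidates per query

print("acc@1    =", acc_at_k(scores, labels, 1))
print("acc@2    =", acc_at_k(scores, labels, 2))
print("recall@2 =", recall_at_k(scores, positives, 2))
```

Both metrics can only grow as k increases, which is why reports usually quote several values of k side by side.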

Supported datasets#

| Dataset name | Task type | Notes |
| --- | --- | --- |
| muge | zeroshot_retrieval | Chinese multimodal image-text dataset |
| flickr30k | zeroshot_retrieval | |
| flickr8k | zeroshot_retrieval | |
| mscoco_captions | zeroshot_retrieval | |
| mscoco_captions2017 | zeroshot_retrieval | |
| imagenet1k | zeroshot_classification | |
| imagenetv2 | zeroshot_classification | |
| imagenet_sketch | zeroshot_classification | |
| imagenet-a | zeroshot_classification | |
| imagenet-r | zeroshot_classification | |
| imagenet-o | zeroshot_classification | |
| objectnet | zeroshot_classification | |
| fer2013 | zeroshot_classification | |
| voc2007 | zeroshot_classification | |
| voc2007_multilabel | zeroshot_classification | |
| sun397 | zeroshot_classification | |
| cars | zeroshot_classification | |
| fgvc_aircraft | zeroshot_classification | |
| mnist | zeroshot_classification | |
| stl10 | zeroshot_classification | |
| gtsrb | zeroshot_classification | |
| country211 | zeroshot_classification | |
| renderedsst2 | zeroshot_classification | |
| vtab_caltech101 | zeroshot_classification | |
| vtab_cifar10 | zeroshot_classification | |
| vtab_cifar100 | zeroshot_classification | |
| vtab_clevr_count_all | zeroshot_classification | |
| vtab_clevr_closest_object_distance | zeroshot_classification | |
| vtab_diabetic_retinopathy | zeroshot_classification | |
| vtab_dmlab | zeroshot_classification | |
| vtab_dsprites_label_orientation | zeroshot_classification | |
| vtab_dsprites_label_x_position | zeroshot_classification | |
| vtab_dsprites_label_y_position | zeroshot_classification | |
| vtab_dtd | zeroshot_classification | |
| vtab_eurosat | zeroshot_classification | |
| vtab_kitti_closest_vehicle_distance | zeroshot_classification | |
| vtab_flowers | zeroshot_classification | |
| vtab_pets | zeroshot_classification | |
| vtab_pcam | zeroshot_classification | |
| vtab_resisc45 | zeroshot_classification | |
| vtab_smallnorb_label_azimuth | zeroshot_classification | |
| vtab_smallnorb_label_elevation | zeroshot_classification | |
| vtab_svhn | zeroshot_classification | |

Environment setup#

Install the dependencies:

pip install 'evalscope[rag]' -U

Configure the evaluation parameters#

task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "clip_benchmark",
        "eval": {
            "models": [
                {
                    "model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
                }
            ],
            "dataset_name": ["muge", "flickr8k"],
            "split": "test",
            "batch_size": 128,
            "num_workers": 1,
            "verbose": True,
            "skip_existing": False,
            "output_dir": "outputs",
            "cache_dir": "cache",
            "limit": 1000,
        },
    },
}

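Parameter reference:

  • eval_backend: default value is RAGEval, which selects the RAGEval evaluation backend.

  • eval_config: dict with the following fields:

    • tool: the evaluation tool; use clip_benchmark

    • eval: dict with the following fields:

      • models: list of model configurations, each with the following fields:

        • model_name: str Model name or path, e.g. AI-ModelScope/chinese-clip-vit-large-patch14-336px. Models can be downloaded automatically from the modelscope repository.

      • dataset_name: List[str] List of dataset names, e.g. ["muge", "flickr8k", "mnist"]; see the list of supported datasets

      • split: str Dataset split to evaluate; defaults to test

      • batch_size: int Batch size for data loading; defaults to 128

      • num_workers: int Number of data-loading workers; defaults to 1

      • verbose: bool Whether to enable verbose logging; defaults to True

      • skip_existing: bool Whether to skip processing when the output already exists; defaults to False

      • output_dir: str Output directory; defaults to outputs

      • cache_dir: str Dataset cache directory; defaults to cache

      • limit: Optional[int] Maximum number of samples to process, e.g. 1000; defaults to None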
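As an illustration of how these fields fit together, the sketch below assembles a configuration for two zero-shot classification datasets from the supported list. `build_clip_task_cfg` is a hypothetical convenience helper written for this example and is not part of evalscope; every key it emits is one of the fields documented above, and the defaults mirror the documented ones.

```python
from typing import List, Optional


def build_clip_task_cfg(
    model_name: str,
    dataset_names: List[str],
    limit: Optional[int] = None,
    batch_size: int = 128,
) -> dict:
    """Hypothetical helper (not part of evalscope): assemble a task config."""
    if not dataset_names:
        raise ValueError("dataset_names must contain at least one dataset")
    return {
        "eval_backend": "RAGEval",
        "eval_config": {
            "tool": "clip_benchmark",
            "eval": {
                "models": [{"model_name": model_name}],
                "dataset_name": list(dataset_names),
                "split": "test",
                "batch_size": batch_size,
                "num_workers": 1,
                "verbose": True,
                "skip_existing": False,
                "output_dir": "outputs",
                "cache_dir": "cache",
                "limit": limit,  # None means no cap on the sample count
            },
        },
    }


# Classification run on two datasets from the table, capped at 500 samples.
cls_task_cfg = build_clip_task_cfg(
    "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
    ["mnist", "vtab_cifar10"],
    limit=500,
)
print(cls_task_cfg["eval_config"]["eval"]["dataset_name"])
```

Keeping limit small is a cheap way to smoke-test a configuration before launching a full evaluation.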

Run the evaluation task#

from evalscope.run import run_task

run_task(task_cfg=task_cfg) 

The evaluation produces the following output:

outputs/chinese-clip-vit-large-patch14-336px/muge_zeroshot_retrieval.json#
{"dataset": "muge", "model": "AI-ModelScope/chinese-clip-vit-large-patch14-336px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@5": 0.8935546875, "text_retrieval_recall@5": 0.876953125}}
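The result is plain JSON, so it can be post-processed with the standard library. The sketch below parses the sample result line shown above and prints each metric as a percentage. The string is embedded so the snippet runs without an evaluation; in practice you would read the file named in the caption above instead, and the assumption that other datasets produce files with the same layout is not confirmed here.

```python
import json

# Sample result copied verbatim from the output shown above.
raw = (
    '{"dataset": "muge", '
    '"model": "AI-ModelScope/chinese-clip-vit-large-patch14-336px", '
    '"task": "zeroshot_retrieval", '
    '"metrics": {"image_retrieval_recall@5": 0.8935546875, '
    '"text_retrieval_recall@5": 0.876953125}}'
)

result = json.loads(raw)
metrics = result["metrics"]

# Print one line per metric, sorted by name for stable output.
for name, value in sorted(metrics.items()):
    print(f"{result['dataset']} / {result['task']} / {name}: {value:.2%}")
```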

Custom evaluation dataset#