CLIP Benchmark#
This framework supports the CLIP Benchmark, a unified framework and benchmark for evaluating and analyzing CLIP (Contrastive Language-Image Pretraining) and its variants. It currently covers 43 evaluation datasets across two task types: zero-shot retrieval (metric: recall@k) and zero-shot classification (metric: acc@k).
Environment Preparation#
Install the required packages:
pip install evalscope[rag] -U
Quick Start#
The following example shows how to evaluate a CLIP model with minimal configuration:
from evalscope.run import run_task

task_cfg = {
    "work_dir": "outputs",
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "clip_benchmark",
        "eval": {
            "models": [
                {
                    "model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
                }
            ],
            "dataset_name": ["muge"],
            "split": "test",
        },
    },
}

run_task(task_cfg=task_cfg)
Output evaluation results:
{"dataset": "muge", "model": "AI-ModelScope/chinese-clip-vit-large-patch14-336px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@5": 0.8935546875, "text_retrieval_recall@5": 0.876953125}}
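The result record is plain JSON, so it can be post-processed with the standard library. The following is a minimal sketch that parses the record shown above and converts each metric to a percentage; the helper name `summarize_result` is hypothetical and is not part of evalscope.

```python
import json

# Result record copied verbatim from the example output above.
result_line = (
    '{"dataset": "muge", '
    '"model": "AI-ModelScope/chinese-clip-vit-large-patch14-336px", '
    '"task": "zeroshot_retrieval", '
    '"metrics": {"image_retrieval_recall@5": 0.8935546875, '
    '"text_retrieval_recall@5": 0.876953125}}'
)


def summarize_result(line: str) -> dict:
    """Hypothetical helper: flatten one result record into percentages."""
    record = json.loads(line)
    summary = {"dataset": record["dataset"], "task": record["task"]}
    for name, value in record["metrics"].items():
        # Express each metric as a percentage rounded to two decimals.
        summary[name] = round(value * 100, 2)
    return summary


summary = summarize_result(result_line)
print(summary)
```

A flattened record like this is convenient when collecting results for several datasets into one comparison table.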
Key parameters:
| Parameter | Type | Description |
|---|---|---|
| `models` | `list[dict]` | Model configuration list. |
| `dataset_name` | `list[str]` | Dataset name list, see Supported Datasets |
| `split` | `str` | Dataset split; the example above uses `test` |
Advanced: Multi-Model / Multi-Dataset Batch Evaluation#
When evaluating multiple models or datasets simultaneously, extend the configuration:
from evalscope.run import run_task

task_cfg = {
    "work_dir": "outputs",
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "clip_benchmark",
        "eval": {
            "models": [
                {
                    "model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
                }
            ],
            "dataset_name": ["muge", "flickr8k"],
            "split": "test",
            "batch_size": 128,
            "num_workers": 1,
            "verbose": True,
            "skip_existing": False,
            "cache_dir": "cache",
            "limit": 1000,
        },
    },
}

run_task(task_cfg=task_cfg)
Additional parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `batch_size` | `int` | `128` | Data loading batch size |
| `num_workers` | `int` | `1` | Number of data loading workers |
| `verbose` | `bool` | | Enable verbose logging |
| `skip_existing` | `bool` | | Skip processing if output already exists |
| `cache_dir` | `str` | | Dataset cache directory |
| `limit` | `int` | | Limit the number of samples to process |
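When the list of models or datasets grows, it can be easier to generate the configuration than to write it by hand. The sketch below builds the same dictionary structure as the example in this section; `build_clip_task_cfg` is a hypothetical helper (not an evalscope API), and `your-org/your-clip-model` is a placeholder model ID rather than a real model.

```python
def build_clip_task_cfg(model_names, dataset_names, **eval_options):
    """Hypothetical helper: assemble a task_cfg for several models and datasets."""
    eval_section = {
        "models": [{"model_name": name} for name in model_names],
        "dataset_name": list(dataset_names),
        "split": "test",
    }
    # Extra keyword arguments (batch_size, limit, ...) are copied into "eval".
    eval_section.update(eval_options)
    return {
        "work_dir": "outputs",
        "eval_backend": "RAGEval",
        "eval_config": {"tool": "clip_benchmark", "eval": eval_section},
    }


task_cfg = build_clip_task_cfg(
    model_names=[
        "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
        "your-org/your-clip-model",  # placeholder, replace with a real model ID
    ],
    dataset_names=["muge", "flickr8k"],
    batch_size=128,
    limit=1000,
)
print(task_cfg["eval_config"]["eval"]["models"])

# To run the evaluation (requires evalscope[rag] to be installed):
# from evalscope.run import run_task
# run_task(task_cfg=task_cfg)
```

Setting a small `limit` first is a cheap way to confirm that every model and dataset loads before committing to a full run.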
Supported Datasets#
| Dataset Name | Task Type | Notes |
|---|---|---|
| `muge` | zeroshot_retrieval | Chinese Multimodal Dataset |
| | zeroshot_retrieval | |
| | zeroshot_retrieval | |
| | zeroshot_retrieval | |
| | zeroshot_retrieval | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
| | zeroshot_classification | |
Full Parameter Reference#
- `eval_backend`: Default value is `RAGEval`, indicating the use of the RAGEval evaluation backend.
- `eval_config`: A dictionary containing the following fields:
  - `tool`: Evaluation tool, using `clip_benchmark`.
  - `eval`: Evaluation configuration, containing the following fields:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `models` | `list[dict]` | | Model configuration list. |
| `dataset_name` | `list[str]` | | Dataset name list, see Supported Datasets |
| `split` | `str` | | Dataset split |
| | | | Task type, auto-inferred by default |
| `batch_size` | `int` | `128` | Data loading batch size |
| `num_workers` | `int` | `1` | Number of data loading workers |
| `verbose` | `bool` | | Enable verbose logging |
| | | | Output directory for evaluation results |
| `cache_dir` | `str` | | Dataset cache directory |
| `skip_existing` | `bool` | | Skip processing if output already exists |
| `data_dir` | `str` | | Custom data directory |
| `limit` | `int` | | Limit the number of samples to process |
FAQ#
Dataset Download Failure#
If a ModelScope dataset download fails, try configuring a mirror, or download the dataset manually and specify its local path via the `data_dir` parameter.
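As a rough illustration of the second option, the sketch below adds `data_dir` to the `eval` section of an existing configuration without modifying the original. The helper `with_local_data` and the path `/path/to/local/datasets` are hypothetical, and the directory layout the loader expects is not described here, so treat this as a starting point only.

```python
import copy


def with_local_data(task_cfg: dict, data_dir: str) -> dict:
    """Hypothetical helper: return a copy of task_cfg that reads data from data_dir."""
    cfg = copy.deepcopy(task_cfg)  # leave the caller's configuration untouched
    cfg["eval_config"]["eval"]["data_dir"] = data_dir
    return cfg


base_cfg = {
    "work_dir": "outputs",
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "clip_benchmark",
        "eval": {
            "models": [
                {"model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px"}
            ],
            "dataset_name": ["muge"],
            "split": "test",
        },
    },
}

# Placeholder path: point this at the manually downloaded dataset.
local_cfg = with_local_data(base_cfg, "/path/to/local/datasets")
print(local_cfg["eval_config"]["eval"]["data_dir"])
```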
Slow Evaluation#
- Increase `batch_size` (default 128) to improve throughput, keeping the GPU memory limit in mind.
- Increase `num_workers` (default 1) to speed up data loading.
- Set `skip_existing` to `True` to skip evaluations that have already completed.
Metric Definitions#
- zeroshot_classification: reports `acc1` (Top-1 accuracy) and `acc5` (Top-5 accuracy)
- zeroshot_retrieval: reports `text_retrieval_recall@k` and `image_retrieval_recall@k`
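To make the two metric families concrete, here is a minimal pure-Python sketch of top-k accuracy and recall@k on toy similarity scores. It assumes one common definition of recall@k (a query counts as a hit if at least one of its correct items appears in the top k); the benchmark's own implementation may differ in details, and the function names are illustrative only.

```python
def topk_indices(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def accuracy_at_k(score_rows, labels, k):
    """Fraction of samples whose true class is among the k top-scoring classes."""
    hits = sum(label in topk_indices(row, k) for row, label in zip(score_rows, labels))
    return hits / len(labels)


def recall_at_k(score_rows, positives, k):
    """Fraction of queries with at least one correct item in the top k."""
    hits = sum(
        bool(set(topk_indices(row, k)) & pos)
        for row, pos in zip(score_rows, positives)
    )
    return hits / len(positives)


# Classification: one row of class scores per image, with its true class index.
class_scores = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
labels = [0, 1, 2]
acc1 = accuracy_at_k(class_scores, labels, k=1)
acc2 = accuracy_at_k(class_scores, labels, k=2)

# Retrieval: one row of candidate scores per query, with its set of correct items.
query_scores = [[0.9, 0.1, 0.2], [0.2, 0.3, 0.8], [0.4, 0.6, 0.5]]
positives = [{0}, {1}, {2}]
recall1 = recall_at_k(query_scores, positives, k=1)
recall2 = recall_at_k(query_scores, positives, k=2)

print(acc1, acc2, recall1, recall2)
```

In the toy data, raising k from 1 to 2 lifts both metrics to 1.0, because every correct class or item that was missed at rank 1 sits at rank 2.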
Custom Evaluation Dataset#
See also