CLIP Benchmark#
This framework supports the CLIP Benchmark, which provides a unified framework for evaluating and analyzing CLIP (Contrastive Language-Image Pretraining) models and their variants. It currently supports 43 evaluation datasets, covering zero-shot retrieval tasks evaluated with recall@k and zero-shot classification tasks evaluated with acc@k.
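For orientation, recall@k counts a retrieval query as correct when the ground-truth match appears among the top-k results. The snippet below is a minimal illustrative sketch of that idea, not the benchmark's internal implementation, and it assumes exactly one correct caption per image (real retrieval sets often have several):
import numpy as np

def recall_at_k(similarity, k=5):
    # similarity[i, j] is the model's score for image i against caption j;
    # the correct caption for image i is assumed to be caption i.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = (top_k == np.arange(len(similarity))[:, None]).any(axis=1)
    return hits.mean()

# Toy example: 3 images, 3 captions; 2 of the 3 correct captions rank first.
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.8, 0.4],
                [0.1, 0.7, 0.6]])
print(recall_at_k(sim, k=1))  # ~0.667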
Supported Datasets#
The benchmark covers 43 datasets in total: 5 zero-shot retrieval datasets (task type zeroshot_retrieval), including muge (a Chinese multimodal dataset) and flickr8k, and 38 zero-shot classification datasets (task type zeroshot_classification), such as mnist. Each entry in the task list records the dataset name, its task type, and optional notes; retrieval datasets are scored with recall@k and classification datasets with acc@k.
Environment Preparation#
Install the required packages:
pip install evalscope[rag] -U
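Optionally, a quick sanity check that the package imports correctly; this assumes torch is installed as part of the dependencies, and the check itself is not required by the benchmark:
import evalscope
import torch

# Confirm the package imports and report whether a GPU is visible.
print("evalscope imported OK; CUDA available:", torch.cuda.is_available())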
Configure Evaluation Parameters#
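# Evaluate a Chinese CLIP model on the muge and flickr8k retrieval datasets
# via the RAGEval backend, limited to 1000 samples.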
task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "clip_benchmark",
        "eval": {
            "models": [
                {
                    "model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
                }
            ],
            "dataset_name": ["muge", "flickr8k"],
            "split": "test",
            "batch_size": 128,
            "num_workers": 1,
            "verbose": True,
            "skip_existing": False,
            "output_dir": "outputs",
            "cache_dir": "cache",
            "limit": 1000,
        },
    },
}
Parameter Description#
- eval_backend: Default value is RAGEval, indicating the use of the RAGEval evaluation backend.
- eval_config: A dictionary containing the following fields:
  - tool: The evaluation tool; set to clip_benchmark.
  - eval: A dictionary containing the following fields:
    - models: A list of model configurations, each with the following fields:
      - model_name: str. The model name or path, e.g., AI-ModelScope/chinese-clip-vit-large-patch14-336px. Supports automatic downloading from the ModelScope repository.
    - dataset_name: List[str]. A list of dataset names, e.g., ["muge", "flickr8k", "mnist"]. See the Task List.
    - split: str. The dataset split to use; default is test.
    - batch_size: int. Batch size for data loading; default is 128.
    - num_workers: int. Number of workers for data loading; default is 1.
    - verbose: bool. Whether to enable detailed logging; default is True.
    - skip_existing: bool. Whether to skip processing if output already exists; default is False.
    - output_dir: str. Output directory; default is outputs.
    - cache_dir: str. Dataset cache directory; default is cache.
    - limit: Optional[int]. Limit the number of samples to process; default is None; e.g., 1000.

A minimal configuration using a subset of these fields is sketched below.
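For a quick smoke test, a smaller configuration can reuse the same fields. This is a sketch under the assumption that omitted fields fall back to the defaults listed above; it is not an additional documented preset:
task_cfg_quick = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "clip_benchmark",
        "eval": {
            "models": [
                {"model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px"},
            ],
            "dataset_name": ["mnist"],  # a single zero-shot classification dataset
            "split": "test",
            "batch_size": 32,
            "limit": 100,  # only 100 samples for a fast end-to-end check
        },
    },
}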
Run Evaluation Task#
from evalscope.run import run_task
from evalscope.utils.logger import get_logger
logger = get_logger()
# Run task
run_task(task_cfg=task_cfg)
Output Evaluation Results#
{"dataset": "muge", "model": "AI-ModelScope/chinese-clip-vit-large-patch14-336px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@5": 0.8935546875, "text_retrieval_recall@5": 0.876953125}}
Custom Evaluation Dataset#
See also: refer to the documentation on custom evaluation datasets for instructions on running the CLIP Benchmark with your own image-text data.