CLIP Benchmark#
This framework supports the CLIP Benchmark, which aims to provide a unified framework and benchmark for evaluating and analyzing CLIP (Contrastive Language-Image Pretraining) and its variants. Currently, the framework supports 43 evaluation datasets, including zero-shot retrieval tasks with the evaluation metric of recall@k, and zero-shot classification tasks with the evaluation metric of acc@k.
Supported Datasets#
Dataset Name |
Task Type |
Notes |
|---|---|---|
zeroshot_retrieval |
Chinese Multimodal Dataset |
|
zeroshot_retrieval |
||
zeroshot_retrieval |
||
zeroshot_retrieval |
||
zeroshot_retrieval |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
||
zeroshot_classification |
Environment Preparation#
Install the required packages
pip install evalscope[rag] -U
Configure Evaluation Parameters#
task_cfg = {
"eval_backend": "RAGEval",
"eval_config": {
"tool": "clip_benchmark",
"eval": {
"models": [
{
"model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
}
],
"dataset_name": ["muge", "flickr8k"],
"split": "test",
"batch_size": 128,
"num_workers": 1,
"verbose": True,
"skip_existing": False,
"output_dir": "outputs",
"cache_dir": "cache",
"limit": 1000,
},
},
}
Parameter Description#
eval_backend: Default value isRAGEval, indicating the use of the RAGEval evaluation backend.eval_config: A dictionary containing the following fields:tool: The evaluation tool, usingclip_benchmark.eval: A dictionary containing the following fields:models: A list of model configurations, each with the following fields:model_name:strThe model name or path, e.g.,AI-ModelScope/chinese-clip-vit-large-patch14-336px. Supports automatic downloading from the ModelScope repository.
dataset_name:List[str]A list of dataset names, e.g.,["muge", "flickr8k", "mnist"]. See Task List.split:strThe split of the dataset to use, default istest.batch_size:intBatch size for data loading, default is128.num_workers:intNumber of worker threads for data loading, default is1.verbose:boolWhether to enable detailed logging, default isTrue.skip_existing:boolWhether to skip processing if output already exists, default isFalse.output_dir:strOutput directory, default isoutputs.cache_dir:strDataset cache directory, default iscache.limit:Optional[int]Limit the number of samples to process, default isNone, e.g.,1000.
Run Evaluation Task#
from evalscope.run import run_task
from evalscope.utils.logger import get_logger
logger = get_logger()
# Run task
run_task(task_cfg=task_cfg)
Output Evaluation Results#
{"dataset": "muge", "model": "AI-ModelScope/chinese-clip-vit-large-patch14-336px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@5": 0.8935546875, "text_retrieval_recall@5": 0.876953125}}
Custom Evaluation Dataset#
This framework supports custom evaluation datasets. You only need to prepare the dataset and configure it according to the following format.
Image-Text Retrieval Dataset#
1. Prepare the Dataset
Prepare an image_queries.jsonl (fixed file name) image-text retrieval dataset with the following format:
{"image_path": "custom_eval/multimodal/images/dog.jpg", "query": ["dog"]}
{"image_path": "custom_eval/multimodal/images/AMNH.jpg", "query": ["building"]}
{"image_path": "custom_eval/multimodal/images/tokyo.jpg", "query": ["city", "tokyo"]}
{"image_path": "custom_eval/multimodal/images/tesla.jpg", "query": ["car", "tesla"]}
{"image_path": "custom_eval/multimodal/images/running.jpg", "query": ["man", "running"]}
Where:
image_path: Path to the image, supports local paths.query: Text descriptions for image-text retrieval, supports multiple descriptions, e.g.,["dog", "cat"].
2. Configure Evaluation Parameters
task_cfg = {
"eval_backend": "RAGEval",
"eval_config": {
"tool": "clip_benchmark",
"eval": {
"models": [
{
"model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
}
],
"dataset_name": ["custom"],
"data_dir": "custom_eval/multimodal/text-image-retrieval",
"split": "test",
"batch_size": 128,
"num_workers": 1,
"verbose": True,
"skip_existing": False,
"limit": 1000,
},
},
}
Where:
dataset_name: Dataset name, must be specified ascustom.data_dir: Dataset directory, containing theimage_queries.jsonlfile.
3. Run the Evaluation Task
from evalscope.run import run_task
run_task(task_cfg=task_cfg)
The evaluation results output as follows:
{"dataset": "custom", "model": "AI-ModelScope/chinese-clip-vit-large-patch14-336px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@5": 1.0, "text_retrieval_recall@5": 1.0}}
Converting Image-Text Retrieval to Text Retrieval#
To facilitate the evaluation of different multimodal retrieval methods, this framework supports converting image-text retrieval problems into text retrieval problems using multimodal large models, and then performing text retrieval evaluation.
1. Prepare the Dataset
Supports Image-Text Retrieval Dataset and Custom Image-Text Retrieval Dataset.
2. Configure Evaluation Parameters
task_cfg = {
"eval_backend": "RAGEval",
"eval_config": {
"tool": "clip_benchmark",
"eval": {
"models": [
{
"model_name": "internvl2-8b",
"api_base": "http://localhost:8008/v1",
"api_key": "xxx",
"prompt": "Describe this image in English",
}
],
"dataset_name": ["muge"],
"split": "test",
"task": "image_caption",
"batch_size": 2,
"num_workers": 1,
"verbose": True,
"skip_existing": False,
"limit": 10,
},
},
}
Parameter Description:
A multimodal large model must be configured in the
modelslist:model_name: Name of the multimodal large model, e.g.,internvl2-8b.api_base: API address of the multimodal large model, e.g.,http://localhost:8008/v1.api_key: API key of the multimodal large model, e.g.,xxx.promptis the prompt input for the multimodal large model, e.g.,"Describe this image in English".
task: Evaluation task, must be specified asimage_caption.
3. Run the Conversion Task
from evalscope.run import run_task
run_task(task_cfg=task_cfg)
The output results are as follows:
2024-10-22 19:56:09,832 - evalscope - INFO - Write files to outputs/internvl2-8b/muge/retrieval_data
2024-10-22 19:56:10,543 - evalscope - INFO - Evaluation results: {'dataset': 'muge', 'model': 'internvl2-8b', 'task': 'image_caption', 'metrics': {'conversion_successful': True, 'save_path': 'outputs/internvl2-8b/muge/retrieval_data'}}
2024-10-22 19:56:10,544 - evalscope - INFO - Dump results to: outputs/internvl2-8b/muge_image_caption.json
The output file directory structure is as follows:
muge
βββ retrieval_data
β βββ corpus.jsonl
β βββ queries.jsonl
β βββ qrels
β βββ test.tsv
βββ muge_image_caption.json
The specific content of the files is as follows:
{"_id":0,"text":"This is an advertisement image showcasing the products of the brand Aoyaqi. The image contains six cans of the brand's drink, with the brand name and graphics printed on the cans. The cans are arranged on a carton, which also has the brand name and graphics. The entire package is predominantly red and yellow, giving a striking and attractive impression."}
{"_id":1,"text":"These are fashionable glasses with a metal frame in rose gold color. The temples are black, and the inner side of the temple has a brand logo that looks like 'The Row'. The design of these glasses is quite modern, suitable for everyday wear."}
{"_id":2,"text":"This image shows a woman taking a selfie with her side profile using her phone. She has long brown hair and is wearing a pair of exquisite earrings that resemble the letter 'A'. The background is an indoor environment with light blue walls and light-colored cabinets."}
{"_id":3,"text":"This is an image of a black plastic bottle with a red label on it. The label has white and yellow text, including the product name, brand, and some graphics. The bottle cap is red and gray."}
{"_id":4,"text":"This is a picture of a living room with a single armchair. The backrest and seat cushion of the chair have a zebra print pattern in black and white, and the frame is black wood with curled armrests. The legs of the chair are black, with an elegant shape. The chair is placed on a carpeted floor, with part of a sofa and decorative painting visible in the background. The decor style of the room is warm and modern."}
{"_id":5,"text":"This is an image of a disposable paper cup. The cup is cylindrical, with a smooth wall and no obvious decorations or patterns. The rim of the cup slightly flares outwards for easy gripping. The cup is light gray or off-white and looks relatively thin. This type of paper cup is often used for drinks or cold foods and is suitable for one-time use."}
{"_id":6,"text":"This image shows four cartoon characters with colorful lights in the background. From left to right, the four characters are:\n\n1. A character in blue clothing with a purple headscarf and hair accessories.\n2. A character in blue-green clothing with blue hair accessories and wings.\n3. A character in pink clothing with red headwear and wings.\n4. A character in red and white clothing with red headwear.\n\nThe background has the words 'New Grimm's Fairy Tales' and 'NEW GREEN'."}
{"_id":7,"text":"This image shows a hand holding blue grapes. The person is wearing a green sweater, and the fingers are slender. The grapes are dark blue with a smooth surface, and each grape looks plump and juicy. There are some green leaves and dry twigs for decoration. The background is a wooden table, giving a natural and fresh feeling."}
{"_id":8,"text":"This is an image of a cute little mug, with a light green body and a round handle. The cup features a cute cartoon design, including a bunny wearing headphones and the words 'Love Learning'. There are two small ears and several stars next to it. The overall design of the mug is simple and cute, suitable for daily use."}
{"_id":9,"text":"This is an image showing a large number of thread-like objects in plastic packaging. These objects are stacked together and look like some kind of fiber or hemp rope, possibly for weaving or processing."}
{"_id":0,"text":"Tamarind juice drink whole box Yunnan"}
{"_id":1,"text":"Da Vinci glasses"}
{"_id":2,"text":"Rhinestone bow earrings"}
{"_id":3,"text":"Dengzhou yellow wine"}
{"_id":4,"text":"Zebra print armchair"}
{"_id":5,"text":"Pudding cup mold"}
{"_id":6,"text":"Pretty Cure figurine set"}
{"_id":7,"text":"Blueberry model"}
{"_id":8,"text":"Cute drinking cup"}
{"_id":9,"text":"Fried noodles"}
query-id corpus-id score
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
4. Perform Text Retrieval Task
With the dataset ready, you can perform the text retrieval task following the CMTEB tutorial.
See also
Refer to Custom Text Retrieval Evaluation