Image Editing Evaluation#
The image editing task evaluates the model’s ability to comprehend and generate image content. By inputting an image and a directive, the model generates a corresponding image based on the instruction. The EvalScope framework supports the evaluation of various image editing models, including Qwen-Image-Edit and FLUX.1-Kontext. Users can assess these models through the EvalScope framework to obtain performance metrics.
Supported Evaluation Datasets#
Please refer to the documentation and search using the ImageEditing tag.
Installation of Dependencies#
Users can install the necessary dependencies using the following command:
pip install evalscope[aigc] -U
End-to-End Evaluation#
Users can configure image editing models and complete model downloading, dataset downloading, model inference, and automatic evaluation with a single command. Taking the Qwen-Image-Edit model and gedit evaluation benchmark as an example:
from dotenv import dotenv_values
env = dotenv_values('.env')
from evalscope import TaskConfig, run_task
from evalscope.constants import EvalType, JudgeStrategy, ModelTask
task_config = TaskConfig(
model='Qwen/Qwen-Image-Edit', # Model ID or local path
model_args={
'pipeline_cls': 'QwenImageEditPipeline', # Pipeline class in diffusers
'precision': 'bfloat16', # Model precision
'device_map': 'cuda:2' # Device mapping, Qwen Image Edit requires approximately 60G of VRAM
},
model_task=ModelTask.IMAGE_GENERATION, # Model task type
eval_type=EvalType.IMAGE_EDITING, # Evaluation task type
generation_config={ # Inference parameters
'true_cfg_scale': 4.0,
'num_inference_steps': 50,
'negative_prompt': ' ',
},
datasets=['gedit'], # Benchmark used
dataset_args={ # Specific parameters for the benchmark
'gedit':{
'extra_params':{
'language': 'cn', # Use Chinese instructions
}
}
},
eval_batch_size=5,
limit=5,
judge_strategy=JudgeStrategy.AUTO,
judge_model_args={ # Configure a VLM model for automatic scoring
'model_id': 'qwen2.5-vl-72b-instruct',
'api_url': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
'api_key': env.get('DASHSCOPE_API_KEY'),
'generation_config': {
'temperature': 0.0,
'max_tokens': 4096,
}
},
)
run_task(task_config)
Example output:
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+=================+===========+======================+===================+=======+=========+=========+
| Qwen-Image-Edit | gedit | Semantic Consistency | background_change | 5 | 8 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Semantic Consistency | color_alter | 5 | 8 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Semantic Consistency | material_alter | 5 | 6.4 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Semantic Consistency | motion_change | 5 | 5.8 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Semantic Consistency | ps_human | 5 | 6.4 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Semantic Consistency | style_change | 5 | 6.2 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Semantic Consistency | subject-add | 5 | 8 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Semantic Consistency | subject-remove | 5 | 8.4 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Semantic Consistency | subject-replace | 5 | 8.4 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Semantic Consistency | text_change | 5 | 8.8 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Semantic Consistency | tone_transfer | 5 | 7.6 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Semantic Consistency | OVERALL | 55 | 7.4545 | - |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Perceptual Quality | background_change | 5 | 7.8 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Perceptual Quality | color_alter | 5 | 6.8 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Perceptual Quality | material_alter | 5 | 7.6 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Perceptual Quality | motion_change | 5 | 7.8 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Perceptual Quality | ps_human | 5 | 6.8 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Perceptual Quality | style_change | 5 | 8 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Perceptual Quality | subject-add | 5 | 7 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Perceptual Quality | subject-remove | 5 | 7.8 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Perceptual Quality | subject-replace | 5 | 7.8 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Perceptual Quality | text_change | 5 | 7.2 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Perceptual Quality | tone_transfer | 5 | 6.2 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Perceptual Quality | OVERALL | 55 | 7.3455 | - |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Overall | background_change | 5 | 7.8967 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Overall | color_alter | 5 | 7.317 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Overall | material_alter | 5 | 6.7933 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Overall | motion_change | 5 | 6.5765 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Overall | ps_human | 5 | 5.6765 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Overall | style_change | 5 | 6.2967 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Overall | subject-add | 5 | 7.3798 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Overall | subject-remove | 5 | 8.0908 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Overall | subject-replace | 5 | 8.0827 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Overall | text_change | 5 | 7.9547 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Overall | tone_transfer | 5 | 6.7417 | default |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
| Qwen-Image-Edit | gedit | Overall | OVERALL | 55 | 7.1642 | - |
+-----------------+-----------+----------------------+-------------------+-------+---------+---------+
Custom Evaluation#
Given the complexity of current AIGC model usage, some developers rely on visual workflow tools like ComfyUI for multi-module generation or use API interfaces to call cloud model services (such as Stable Diffusion WebUI, commercial API platforms). For these scenarios, EvalScope supports a “model-free” evaluation mode, where users only need to provide the storage path of the generated images and the corresponding benchmark id to start the evaluation process without downloading model weights or performing inference locally.
Data Format#
For gedit, you need to provide the following jsonl file:
{"id": "c18278ee2b0b3d8bd18c5279f4a8c636", "image_path": "Qwen-Image-Edit/images/c18278ee2b0b3d8bd18c5279f4a8c636_3.png"}
{"id": "15f01f7d55ad4f8695218594277e451f", "image_path": "Qwen-Image-Edit/images/15f01f7d55ad4f8695218594277e451f_0.png"}
{"id": "d83dad4db56f5c6c1270708a74311725", "image_path": "Qwen-Image-Edit/images/d83dad4db56f5c6c1270708a74311725_3.png"}
{"id": "caabd082c0ed1757df58db3eaea5ac73", "image_path": "Qwen-Image-Edit/images/caabd082c0ed1757df58db3eaea5ac73_1.png"}
{"id": "4a7d36259ad94d238a6e7e7e0bd6b643", "image_path": "Qwen-Image-Edit/images/4a7d36259ad94d238a6e7e7e0bd6b643_0.png"}
Where:
idis the unique sample ID in the evaluation benchmarkimage_pathis the storage path of the generated image
Task Configuration#
Use the following script to load local data for evaluation:
from dotenv import dotenv_values
env = dotenv_values('.env')
from evalscope import TaskConfig, run_task
from evalscope.constants import EvalType, JudgeStrategy, ModelTask
task_config = TaskConfig(
model_id='offline-model', # No need to configure a model, just provide a custom ID
model_task=ModelTask.IMAGE_GENERATION, # Model task type
eval_type=EvalType.IMAGE_EDITING, # Evaluation task type
datasets=['gedit'], # Benchmark used
dataset_args={ # Specific parameters for the benchmark
'gedit':{
'subset_list': ['color_alter', 'material_alter'], # Select evaluation subsets
'extra_params':{
'language': 'cn', # Use Chinese instructions
'local_file': 'outputs/example_edit.jsonl' # Use locally generated images
}
}
},
eval_batch_size=5,
limit=5,
judge_strategy=JudgeStrategy.AUTO,
judge_model_args={ # Configure a VLM model for automatic scoring
'model_id': 'qwen2.5-vl-72b-instruct',
'api_url': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
'api_key': env.get('DASHSCOPE_API_KEY'),
'generation_config': {
'temperature': 0.0,
'max_tokens': 4096,
}
},
)
run_task(task_config)
Example output:
+---------------+-----------+----------------------+----------------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+===============+===========+======================+================+=======+=========+=========+
| offline-model | gedit | Semantic Consistency | color_alter | 1 | 8 | default |
+---------------+-----------+----------------------+----------------+-------+---------+---------+
| offline-model | gedit | Semantic Consistency | material_alter | 1 | 2 | default |
+---------------+-----------+----------------------+----------------+-------+---------+---------+
| offline-model | gedit | Semantic Consistency | OVERALL | 2 | 5 | - |
+---------------+-----------+----------------------+----------------+-------+---------+---------+
| offline-model | gedit | Perceptual Quality | color_alter | 1 | 8 | default |
+---------------+-----------+----------------------+----------------+-------+---------+---------+
| offline-model | gedit | Perceptual Quality | material_alter | 1 | 8 | default |
+---------------+-----------+----------------------+----------------+-------+---------+---------+
| offline-model | gedit | Perceptual Quality | OVERALL | 2 | 8 | - |
+---------------+-----------+----------------------+----------------+-------+---------+---------+
| offline-model | gedit | Overall | color_alter | 1 | 8 | default |
+---------------+-----------+----------------------+----------------+-------+---------+---------+
| offline-model | gedit | Overall | material_alter | 1 | 4 | default |
+---------------+-----------+----------------------+----------------+-------+---------+---------+
| offline-model | gedit | Overall | OVERALL | 2 | 6 | - |
+---------------+-----------+----------------------+----------------+-------+---------+---------+
Visualization#
The EvalScope framework supports visualization of evaluation results, allowing users to generate visual reports for easy comparison of the image editing effects of models:
evalscope app
For documentation, please refer to the visualization documentation.
Examples:
Score Distribution
Effect Comparison