GEdit-Bench
Overview
GEdit-Bench (Grounded Edit Benchmark) is an image editing benchmark grounded in real-world usage scenarios. It evaluates image editing models across 11 editing task categories, with an LLM judge scoring each edited image.
Task Description
Task Type: Image Editing Evaluation
Input: Source image + editing instruction
Output: Edited image evaluated by LLM judge
Languages: English (en) and Chinese (cn)
Key Features
Real-world editing scenarios (background change, color alter, style transfer, etc.)
11 editing task categories
LLM-based evaluation for semantic consistency and perceptual quality
Supports both English and Chinese instructions
Comprehensive scoring: Semantic Consistency, Perceptual Quality, Overall
Evaluation Notes
Default configuration uses 0-shot evaluation
Evaluates on the train split, which contains the test samples
Metrics: Semantic Consistency, Perceptual Similarity (via LLM judge)
Overall score: geometric mean of the Semantic Consistency (SC) and Perceptual Quality (PQ) scores
Configure language via extra_params['language'] (en/cn)
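The overall score mentioned in the notes above can be sketched as follows. This is a minimal illustration, assuming SC and PQ are non-negative numbers on the same scale; `overall_score` is a hypothetical helper name and not part of EvalScope, and the input values are made up.

```python
import math


def overall_score(sc: float, pq: float) -> float:
    """Geometric mean of a Semantic Consistency (SC) and a Perceptual Quality (PQ) score."""
    if sc < 0 or pq < 0:
        raise ValueError("scores must be non-negative")
    return math.sqrt(sc * pq)


# Made-up judge scores for illustration only (not benchmark results).
print(overall_score(8.0, 2.0))  # 4.0
```

Compared with an arithmetic mean (5.0 for the same inputs), the geometric mean pulls the overall score toward the weaker of the two, and a zero on either axis yields an overall score of zero.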
Properties
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | |
Data Statistics
| Metric | Value |
|---|---|
| Total Samples | 606 |
| Prompt Length (Mean) | 42.46 chars |
| Prompt Length (Min/Max) | 11 / 158 chars |
Per-Subset Statistics:
| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|---|---|---|---|---|
| | 40 | 50.2 | 29 | 158 |
| | 40 | 41.5 | 23 | 143 |
| | 40 | 40.8 | 18 | 60 |
| | 40 | 44.05 | 20 | 87 |
| | 70 | 34.17 | 16 | 89 |
| | 60 | 46.27 | 20 | 116 |
| | 60 | 51.13 | 14 | 148 |
| | 57 | 37.3 | 15 | 110 |
| | 60 | 48.95 | 27 | 96 |
| | 99 | 39.71 | 11 | 116 |
| | 40 | 36 | 21 | 63 |
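As a consistency check, the per-subset rows can be reconciled with the totals in Data Statistics. The sketch below copies the figures from the table above (subset names are omitted because they are not shown there) and recomputes the total sample count, the sample-weighted mean prompt length, and the overall min/max. The variable names are illustrative only.

```python
# (samples, prompt_mean, prompt_min, prompt_max) per subset, copied from the table above.
subsets = [
    (40, 50.2, 29, 158),
    (40, 41.5, 23, 143),
    (40, 40.8, 18, 60),
    (40, 44.05, 20, 87),
    (70, 34.17, 16, 89),
    (60, 46.27, 20, 116),
    (60, 51.13, 14, 148),
    (57, 37.3, 15, 110),
    (60, 48.95, 27, 96),
    (99, 39.71, 11, 116),
    (40, 36, 21, 63),
]

total = sum(n for n, _, _, _ in subsets)
# Weight each subset mean by its sample count to recover the overall mean.
weighted_mean = sum(n * mean for n, mean, _, _ in subsets) / total
overall_min = min(lo for _, _, lo, _ in subsets)
overall_max = max(hi for _, _, _, hi in subsets)

print(total, round(weighted_mean, 2), overall_min, overall_max)  # 606 42.46 11 158
```

The recomputed values match the reported 606 samples, 42.46-character mean, and 11 / 158 min/max.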
Image Statistics:
| Metric | Value |
|---|---|
| Total Images | 606 |
| Images per Sample | min: 1, max: 1, mean: 1 |
| Resolution Range | 384x640 - 416x672 |
| Formats | png |
Sample Example
Subset: background_change
{
  "input": [
    {
      "id": "4c309b59",
      "content": [
        {
          "text": "Change the background to a city street."
        },
        {
          "image": "[BASE64_IMAGE: png, ~495.7KB]"
        }
      ]
    }
  ],
  "id": 0,
  "group_id": 0,
  "subset_key": "background_change",
  "metadata": {
    "task_type": "background_change",
    "key": "4a7d36259ad94d238a6e7e7e0bd6b643",
    "instruction": "Change the background to a city street.",
    "instruction_language": "en",
    "input_image": "[BASE64_IMAGE: png, ~495.7KB]",
    "Intersection_exist": true,
    "id": "4a7d36259ad94d238a6e7e7e0bd6b643"
  }
}
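To make the record layout concrete, the following sketch parses a trimmed copy of the sample above and separates the text instruction from the image part. It assumes every content part is a dict carrying a `text` or an `image` key, as in this one sample; `split_content` is a hypothetical helper, not an EvalScope function.

```python
import json

# Trimmed copy of the sample above; the base64 payload stays a placeholder string.
sample_json = """
{
  "input": [
    {
      "id": "4c309b59",
      "content": [
        {"text": "Change the background to a city street."},
        {"image": "[BASE64_IMAGE: png, ~495.7KB]"}
      ]
    }
  ],
  "subset_key": "background_change",
  "metadata": {
    "instruction": "Change the background to a city street.",
    "instruction_language": "en"
  }
}
"""


def split_content(sample: dict) -> tuple[list[str], list[str]]:
    """Collect the text parts and image parts from every input message."""
    texts, images = [], []
    for message in sample["input"]:
        for part in message["content"]:
            if "text" in part:
                texts.append(part["text"])
            if "image" in part:
                images.append(part["image"])
    return texts, images


sample = json.loads(sample_json)
texts, images = split_content(sample)
print(texts[0])
print(len(images), sample["metadata"]["instruction_language"])
```

In this sample the text part of `input` repeats `metadata.instruction`, and there is exactly one image per record, which agrees with the image statistics.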
Prompt Template
No prompt template defined.
Extra Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| language | | | Language of the instruction. Choices: ['en', 'cn'] |
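A small sketch of how the language choice might be passed along: it builds the `dataset_args` mapping with `extra_params` nested under the `gedit` key, following the layout hinted at in the Python usage on this page. `build_gedit_args` and `ALLOWED_LANGUAGES` are hypothetical names introduced for illustration; the validation is an added convenience, not EvalScope behaviour.

```python
ALLOWED_LANGUAGES = ("en", "cn")  # choices listed in the table above


def build_gedit_args(language: str) -> dict:
    """Return a dataset_args mapping for the gedit benchmark (hypothetical helper)."""
    if language not in ALLOWED_LANGUAGES:
        raise ValueError(f"language must be one of {ALLOWED_LANGUAGES}, got {language!r}")
    return {"gedit": {"extra_params": {"language": language}}}


dataset_args = build_gedit_args("cn")
print(dataset_args)  # {'gedit': {'extra_params': {'language': 'cn'}}}
```

The resulting dictionary would then be supplied as the `dataset_args` argument of `TaskConfig`.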
Usage
Using CLI
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets gedit \
    --limit 10  # Remove this line for formal evaluation
Using Python
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['gedit'],
    dataset_args={
        'gedit': {
            # 'subset_list': ['background_change', 'color_alter', 'material_alter'],  # optional, evaluate specific subsets
            # 'extra_params': {},  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)