GEdit-Bench

Overview

GEdit-Bench (Grounded Edit Benchmark) is an image editing benchmark grounded in real-world usage scenarios. It evaluates image editing models across 11 editing task categories, with instructions in English and Chinese, and uses an LLM judge to score each edited image.

Task Description

Task Type: Image Editing Evaluation
Input: Source image + editing instruction
Output: Edited image evaluated by LLM judge
Languages: English (en) and Chinese (cn)

Key Features

Real-world editing scenarios (background change, color alter, style transfer, etc.)
11 editing task categories
LLM-based evaluation for semantic consistency and perceptual quality
Supports both English and Chinese instructions
Comprehensive scoring: Semantic Consistency, Perceptual Quality, Overall

Evaluation Notes

Default configuration uses 0-shot evaluation
Evaluates on the `train` split, which holds the benchmark's test samples
Metrics: Semantic Consistency (SC) and Perceptual Similarity (the PQ score, listed as Perceptual Quality under Key Features), both assigned by the LLM judge
Overall score: geometric mean of the SC and PQ scores, sqrt(SC × PQ)
Configure language via extra_params['language'] (en/cn)
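
The overall score described in these notes can be sketched in a few lines. This is a minimal illustration of a two-value geometric mean, not the benchmark's own implementation; the function name `overall_score` is hypothetical, and the score scale is whatever the judge returns.

```python
import math


def overall_score(sc: float, pq: float) -> float:
    """Geometric mean of two non-negative scores (hypothetical helper)."""
    if sc < 0 or pq < 0:
        raise ValueError("scores must be non-negative")
    return math.sqrt(sc * pq)


# A strong semantic score cannot fully offset a weak perceptual score.
print(overall_score(8.0, 2.0))  # 4.0
```

Because the two scores are multiplied, a zero on either axis gives an overall score of zero, which an arithmetic mean would not.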

Properties

Property	Value
Benchmark Name	`gedit`
Dataset ID	stepfun-ai/GEdit-Bench
Paper	N/A
Tags	`ImageEditing`
Metrics	`Semantic Consistency`, `Perceptual Similarity`
Default Shots	0-shot
Evaluation Split	`train`

Data Statistics

Metric	Value
Total Samples	606
Prompt Length (Mean)	42.46 chars
Prompt Length (Min/Max)	11 / 158 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean	Prompt Min	Prompt Max
`background_change`	40	50.2	29	158
`color_alter`	40	41.5	23	143
`material_alter`	40	40.8	18	60
`motion_change`	40	44.05	20	87
`ps_human`	70	34.17	16	89
`style_change`	60	46.27	20	116
`subject-add`	60	51.13	14	148
`subject-remove`	57	37.3	15	110
`subject-replace`	60	48.95	27	96
`text_change`	99	39.71	11	116
`tone_transfer`	40	36	21	63
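
As a quick consistency check, the per-subset rows can be recombined into the totals reported above. The sketch below copies the sample counts and mean prompt lengths from the table; because the per-subset means are already rounded, the weighted mean only matches the reported value to within rounding.

```python
# (samples, mean prompt length in chars) per subset, copied from the table
SUBSETS = {
    "background_change": (40, 50.2),
    "color_alter": (40, 41.5),
    "material_alter": (40, 40.8),
    "motion_change": (40, 44.05),
    "ps_human": (70, 34.17),
    "style_change": (60, 46.27),
    "subject-add": (60, 51.13),
    "subject-remove": (57, 37.3),
    "subject-replace": (60, 48.95),
    "text_change": (99, 39.71),
    "tone_transfer": (40, 36.0),
}

total = sum(n for n, _ in SUBSETS.values())
# Weight each subset mean by its sample count to recover the global mean.
weighted_mean = sum(n * mean for n, mean in SUBSETS.values()) / total

print(total)                   # 606
print(f"{weighted_mean:.2f}")  # 42.46
```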

Image Statistics:

Metric	Value
Total Images	606
Images per Sample	min: 1, max: 1, mean: 1
Resolution Range	384x640 - 416x672
Formats	png

Sample Example

Subset: background_change

{
  "input": [
    {
      "id": "4c309b59",
      "content": [
        {
          "text": "Change the background to a city street."
        },
        {
          "image": "[BASE64_IMAGE: png, ~495.7KB]"
        }
      ]
    }
  ],
  "id": 0,
  "group_id": 0,
  "subset_key": "background_change",
  "metadata": {
    "task_type": "background_change",
    "key": "4a7d36259ad94d238a6e7e7e0bd6b643",
    "instruction": "Change the background to a city street.",
    "instruction_language": "en",
    "input_image": "[BASE64_IMAGE: png, ~495.7KB]",
    "Intersection_exist": true,
    "id": "4a7d36259ad94d238a6e7e7e0bd6b643"
  }
}
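
To make the record layout concrete, the following sketch separates the instruction text from the image payload in a record shaped like the one above. The record is trimmed to the fields the code touches, the base64 payload is left as the placeholder string, and the helper name `split_content` is hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Trimmed copy of the sample record; the image stays a placeholder string.
sample: Dict[str, Any] = {
    "input": [
        {
            "id": "4c309b59",
            "content": [
                {"text": "Change the background to a city street."},
                {"image": "[BASE64_IMAGE: png, ~495.7KB]"},
            ],
        }
    ],
    "subset_key": "background_change",
    "metadata": {"task_type": "background_change", "instruction_language": "en"},
}


def split_content(record: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Collect text parts and image parts from every message in `input`."""
    texts: List[str] = []
    images: List[str] = []
    for message in record["input"]:
        for part in message["content"]:
            if "text" in part:
                texts.append(part["text"])
            elif "image" in part:
                images.append(part["image"])
    return texts, images


texts, images = split_content(sample)
print(texts[0])     # Change the background to a city street.
print(len(images))  # 1
```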

Prompt Template

No prompt template defined.

Extra Parameters

Parameter	Type	Default	Description
`language`	`str`	`en`	Language of the instruction. Choices: `en`, `cn`.
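
A small sketch of how the `language` parameter might be assembled before it is handed to the evaluator. The helper `gedit_dataset_args` and the constant `ALLOWED_LANGUAGES` are hypothetical; the nested dictionary shape is an assumption based on the `dataset_args` and `extra_params` keys shown in the Python usage on this page.

```python
ALLOWED_LANGUAGES = ("en", "cn")  # choices listed in the table above


def gedit_dataset_args(language: str = "en") -> dict:
    """Build a dataset_args entry for `gedit` (hypothetical helper)."""
    if language not in ALLOWED_LANGUAGES:
        raise ValueError(
            f"language must be one of {ALLOWED_LANGUAGES}, got {language!r}"
        )
    return {"gedit": {"extra_params": {"language": language}}}


# Select the Chinese instructions instead of the default English ones.
dataset_args = gedit_dataset_args("cn")
print(dataset_args)  # {'gedit': {'extra_params': {'language': 'cn'}}}
```

The resulting dictionary would then be passed as the `dataset_args` argument of `TaskConfig`. Validating the value up front surfaces a typo before any model calls are made.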

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets gedit \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['gedit'],
    dataset_args={
        'gedit': {
            # 'subset_list': ['background_change', 'color_alter', 'material_alter'],  # optional, evaluate specific subsets
            # 'extra_params': {'language': 'en'},  # optional, 'en' (default) or 'cn'
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)