SWE-bench_Verified#

Overview#

SWE-bench Verified is a human-validated subset of 500 samples from SWE-bench, designed to test systems’ ability to automatically resolve real-world GitHub issues. Each sample represents a genuine bug fix or feature implementation from popular Python repositories.

Task Description#

  • Task Type: Automated Software Engineering / Bug Fixing

  • Input: GitHub issue description with repository context

  • Output: Code patch (diff format) that resolves the issue

  • Repositories: 12 popular Python projects (Django, Flask, Requests, etc.)
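To make the expected output format concrete, the sketch below builds a small unified diff with Python's standard `difflib` module. The file name `shapes.py` and its contents are hypothetical and are not taken from the benchmark. Patches produced by `git diff` carry extra header lines (such as `diff --git`), so treat this as a simplified illustration of the diff format rather than the exact text a model must emit.

```python
import difflib


def make_patch(path: str, before: str, after: str) -> str:
    """Return a unified diff that turns `before` into `after` for one file."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",  # git-style "a/" and "b/" path prefixes
        tofile=f"b/{path}",
    )
    return "".join(diff)


# Hypothetical file contents, for illustration only.
before = "def area(w, h):\n    return w + h\n"
after = "def area(w, h):\n    return w * h\n"

patch = make_patch("shapes.py", before, after)
print(patch)
```

The printed patch contains the `---`/`+++` file headers, one `@@` hunk header, and the removed and added lines prefixed with `-` and `+`.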

Key Features#

  • 500 human-validated Issue-Pull Request pairs

  • Real-world bugs from production Python repositories

  • Evaluation via unit test verification

  • Docker-based isolated execution environments

  • Tests both bug understanding and code modification skills
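Unit test verification can be sketched as follows. In SWE-bench, each instance lists `FAIL_TO_PASS` tests (failing before the fix, expected to pass after it) and `PASS_TO_PASS` tests (expected to keep passing), and an instance counts as resolved only when all of them pass once the patch is applied. The helper names and the `"PASSED"` status string below are assumptions for illustration; this is not the harness's or evalscope's actual code, and the reported `acc` metric is assumed here to be the fraction of resolved instances.

```python
from typing import Dict, List


def is_resolved(
    status: Dict[str, str],
    fail_to_pass: List[str],
    pass_to_pass: List[str],
) -> bool:
    """True only if every required test passed after applying the patch."""
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test missing from the results is treated as not passed.
    return all(status.get(test) == "PASSED" for test in required)


def accuracy(resolved_flags: List[bool]) -> float:
    """Fraction of resolved instances (0.0 for an empty list)."""
    if not resolved_flags:
        return 0.0
    return sum(resolved_flags) / len(resolved_flags)


# Hypothetical test results for two instances.
good = is_resolved(
    {"test_fix": "PASSED", "test_old": "PASSED"}, ["test_fix"], ["test_old"]
)
bad = is_resolved(
    {"test_fix": "PASSED", "test_old": "FAILED"}, ["test_fix"], ["test_old"]
)
score = accuracy([good, bad])
```

The second instance fixes the target test but breaks an existing one, so it is not counted as resolved.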

Evaluation Notes#

  • Install the SWE-bench evaluation harness before running: pip install swebench==4.1.0

  • Docker images are built/pulled automatically for each repository

  • Timeout of 1800 seconds (30 min) per instance

  • See the usage documentation for detailed setup instructions

  • Supports both local image building and remote image pulling
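Before starting a long run, it can help to confirm the two prerequisites mentioned above: the pinned `swebench` package and a Docker CLI. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` is hypothetical, and finding `docker` on the `PATH` does not guarantee that the Docker daemon is running.

```python
import shutil
from importlib import metadata

EXPECTED_SWEBENCH = "4.1.0"  # version pinned in the notes above


def check_prerequisites() -> dict:
    """Report whether the swebench package and the docker CLI are present."""
    try:
        installed = metadata.version("swebench")
    except metadata.PackageNotFoundError:
        installed = None  # package is not installed in this environment
    return {
        "docker_on_path": shutil.which("docker") is not None,
        "swebench_installed": installed,
        "swebench_matches": installed == EXPECTED_SWEBENCH,
    }


report = check_prerequisites()
print(report)
```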

Properties#

| Property | Value |
| --- | --- |
| Benchmark Name | swe_bench_verified |
| Dataset ID | princeton-nlp/SWE-bench_Verified |
| Paper | N/A |
| Tags | Coding |
| Metrics | acc |
| Default Shots | 0-shot |
| Evaluation Split | test |

Data Statistics#

Statistics not available.

Sample Example#

Sample example not available.

Prompt Template#

Prompt Template:

{question}
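The template consists of a single placeholder, so the prompt sent to the model is the `question` text unchanged. A minimal sketch of the substitution is shown below; the `render_prompt` helper is hypothetical, and how evalscope populates `question` internally (presumably from the oracle inference dataset listed under Extra Parameters) is an assumption.

```python
PROMPT_TEMPLATE = "{question}"


def render_prompt(question: str) -> str:
    """Fill the template; the result equals the question text itself."""
    return PROMPT_TEMPLATE.format(question=question)


# Hypothetical issue text, for illustration only.
prompt = render_prompt("TypeError when calling foo() with a {dict} argument")
```

Braces inside the question text are preserved, because `str.format` only parses placeholders in the template, not in the substituted value.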

Extra Parameters#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| inference_dataset_id | str | princeton-nlp/SWE-bench_oracle | Oracle dataset ID used to fetch inference context. |
| build_docker_images | bool | True | Build Docker images locally for each sample. |
| pull_remote_images_if_available | bool | True | Attempt to pull existing remote Docker images before building. |
| force_arch | str | '' (empty string) | Optionally force the Docker images to be pulled/built for a specific architecture. Choices: ['', 'arm64', 'x86_64'] |
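The parameters above can be overridden per dataset. The sketch below builds a `dataset_args` dictionary with an `extra_params` entry, following the key layout hinted at in the Python usage example; the helper `make_dataset_args` is hypothetical and only adds a local check of the documented `force_arch` choices.

```python
ALLOWED_ARCH = ("", "arm64", "x86_64")  # choices documented for force_arch


def make_dataset_args(
    build_docker_images: bool = True,
    pull_remote_images_if_available: bool = True,
    force_arch: str = "",
) -> dict:
    """Build a dataset_args dict for swe_bench_verified (illustrative helper)."""
    if force_arch not in ALLOWED_ARCH:
        raise ValueError(f"force_arch must be one of {ALLOWED_ARCH!r}")
    return {
        "swe_bench_verified": {
            "extra_params": {
                "build_docker_images": build_docker_images,
                "pull_remote_images_if_available": pull_remote_images_if_available,
                "force_arch": force_arch,
            }
        }
    }


# Example: force x86_64 images, e.g. on an arm64 host.
dataset_args = make_dataset_args(force_arch="x86_64")
# Then pass it on: TaskConfig(..., dataset_args=dataset_args)
```

Validating `force_arch` up front surfaces a typo immediately instead of partway through image preparation.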

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets swe_bench_verified \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['swe_bench_verified'],
    dataset_args={
        'swe_bench_verified': {
            # 'extra_params': {}  # omit to use the default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)