SWE-bench_Lite#

Overview#

SWE-bench Lite is a focused subset of SWE-bench containing 300 Issue-Pull Request pairs from 11 popular Python repositories. It provides a more accessible entry point for evaluating automated software engineering capabilities.

Task Description#

  • Task Type: Automated Software Engineering / Bug Fixing

  • Input: GitHub issue description with repository context

  • Output: Code patch (diff format) that resolves the issue

  • Size: 300 carefully selected test instances
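To make the expected output format concrete, the sketch below uses Python's standard `difflib` to build a unified diff for a one-line fix. The file name `calc.py`, its contents, and the helper `make_patch` are hypothetical illustrations. The evaluation harness may impose additional requirements on patches that are not shown here.

```python
import difflib


def make_patch(path: str, before: str, after: str) -> str:
    """Return a unified diff that turns `before` into `after` for one file."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# Hypothetical buggy file and its fix (not taken from the dataset).
before = "def add(a, b):\n    return a - b\n"
after = "def add(a, b):\n    return a + b\n"

patch = make_patch("calc.py", before, after)
print(patch)
```

The printed text starts with the `--- a/calc.py` and `+++ b/calc.py` header lines, followed by a hunk that removes the buggy line and adds the fixed one.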

Key Features#

  • 300 test Issue-Pull Request pairs

  • 11 popular Python repositories covered

  • Real-world bugs with verified solutions

  • Evaluation via unit test verification

  • More manageable than full SWE-bench while still challenging
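Unit test verification can be pictured roughly as follows. SWE-bench instances carry `FAIL_TO_PASS` and `PASS_TO_PASS` test lists, and an instance counts as resolved when every test in both lists passes after the patch is applied. The function `is_resolved`, the test names, and the status mapping below are a simplified, hypothetical sketch, not the harness's actual implementation.

```python
def is_resolved(
    test_status: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """True if every required test passed; a missing test counts as failed."""
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_status.get(name, False) for name in required)


# Hypothetical test outcomes after applying a model-generated patch.
status = {
    "test_add_positive": True,   # previously failing, now fixed
    "test_add_zero": True,       # was passing, still passing
    "test_sub": False,           # was passing, broken by the patch
}

fixed_only = is_resolved(status, ["test_add_positive"], ["test_add_zero"])
with_regression = is_resolved(status, ["test_add_positive"], ["test_sub"])
print(fixed_only, with_regression)
```

The second call returns `False` because a patch that fixes the issue but breaks a previously passing test does not count as a resolution.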

Evaluation Notes#

  • Install the SWE-bench harness before evaluation: pip install swebench==4.1.0

  • Docker images are built/pulled automatically for each repository

  • See the usage documentation for detailed setup instructions

  • Popular benchmark variant for initial model comparison
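Before launching a run, it may help to confirm that the prerequisites listed above are in place. The sketch below checks the installed `swebench` version and whether a `docker` executable is on the `PATH`. The helper name `preflight` is hypothetical, and it does not verify that the Docker daemon is actually running.

```python
import shutil
from importlib import metadata

REQUIRED_SWEBENCH = "4.1.0"  # version pinned in the notes above


def preflight() -> dict:
    """Report whether swebench and the docker CLI appear to be available."""
    try:
        swebench_version = metadata.version("swebench")
    except metadata.PackageNotFoundError:
        swebench_version = None  # package is not installed
    return {
        "swebench_version": swebench_version,
        "swebench_ok": swebench_version == REQUIRED_SWEBENCH,
        "docker_on_path": shutil.which("docker") is not None,
    }


report = preflight()
print(report)
```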

Properties#

| Property | Value |
|---|---|
| Benchmark Name | swe_bench_lite |
| Dataset ID | princeton-nlp/SWE-bench_Lite |
| Paper | N/A |
| Tags | Coding |
| Metrics | acc |
| Default Shots | 0-shot |
| Evaluation Split | test |
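This page does not define the `acc` metric. Assuming it is the fraction of evaluated instances whose patch resolves the issue, it could be computed as in the sketch below. The function name `accuracy` and the sample flags are hypothetical.

```python
def accuracy(resolved_flags) -> float:
    """Fraction of instances marked resolved; 0.0 for an empty run."""
    flags = list(resolved_flags)
    if not flags:
        return 0.0
    return sum(1 for flag in flags if flag) / len(flags)


# Hypothetical run over 10 instances, 3 of which were resolved.
flags = [True] * 3 + [False] * 7
acc = accuracy(flags)
print(acc)
```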

Data Statistics#

Statistics not available.

Sample Example#

Sample example not available.

Prompt Template#

Prompt Template:

{question}
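The template consists of a single `{question}` placeholder, so the rendered prompt is the question text itself. A minimal sketch, assuming the placeholder is filled with Python's `str.format` (how EvalScope renders it internally is not stated here, and the issue text is made up):

```python
PROMPT_TEMPLATE = "{question}"


def render_prompt(question: str) -> str:
    """Fill the template; braces inside `question` are left untouched."""
    return PROMPT_TEMPLATE.format(question=question)


# Hypothetical issue text used as the question.
prompt = render_prompt("add() returns a - b instead of a + b")
print(prompt)
```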

Extra Parameters#

| Parameter | Type | Default | Description |
|---|---|---|---|
| build_docker_images | bool | True | Build Docker images locally for each sample. |
| pull_remote_images_if_available | bool | True | Attempt to pull existing remote Docker images before building. |
| inference_dataset_id | str | princeton-nlp/SWE-bench_oracle | Oracle dataset ID used to fetch inference context. |
| force_arch | str | '' | Optionally force the Docker images to be pulled/built for a specific architecture. Choices: '', 'arm64', 'x86_64'. |
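The parameters above can be collected into a `dataset_args` dictionary. The sketch below assumes they are nested under an `extra_params` key for the `swe_bench_lite` dataset, following the commented hint in the Python usage example on this page. The helper `build_dataset_args` and its validation are hypothetical and not part of EvalScope.

```python
ALLOWED_ARCH = {"", "arm64", "x86_64"}  # choices listed for force_arch


def build_dataset_args(
    build_docker_images: bool = True,
    pull_remote_images_if_available: bool = True,
    inference_dataset_id: str = "princeton-nlp/SWE-bench_oracle",
    force_arch: str = "",
) -> dict:
    """Build a dataset_args dict, rejecting unsupported architectures."""
    if force_arch not in ALLOWED_ARCH:
        raise ValueError(f"force_arch must be one of {sorted(ALLOWED_ARCH)}")
    return {
        "swe_bench_lite": {
            "extra_params": {
                "build_docker_images": build_docker_images,
                "pull_remote_images_if_available": pull_remote_images_if_available,
                "inference_dataset_id": inference_dataset_id,
                "force_arch": force_arch,
            }
        }
    }


# Example: force arm64 images and keep every other default.
dataset_args = build_dataset_args(force_arch="arm64")
print(dataset_args)
```

The resulting dictionary is meant to be passed as the `dataset_args` argument of `TaskConfig`.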

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets swe_bench_lite \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['swe_bench_lite'],
    dataset_args={
        'swe_bench_lite': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)