SWE-bench_Verified_mini#

Overview#

SWE-bench Verified Mini is a compact subset of SWE-bench Verified. It contains 50 samples selected to preserve the full dataset's distribution of performance, test pass rates, and difficulty, while requiring only 5 GB of storage instead of 130 GB.

Task Description#

Task Type: Automated Software Engineering / Bug Fixing
Input: GitHub issue description with repository context
Output: Code patch (diff format) that resolves the issue
Size: 50 samples (vs 500 in full Verified set)
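
To make the expected output more concrete, the following sketch produces a unified diff with Python's standard `difflib` module. The file name `calc.py` and its contents are hypothetical and only illustrate the general shape of a diff-format patch; the evaluation harness may impose additional requirements on the patch that are not shown here.

```python
import difflib

# Hypothetical file contents before and after a fix (illustration only).
before = "def add(a, b):\n    return a - b\n"
after = "def add(a, b):\n    return a + b\n"


def make_patch(old: str, new: str, path: str) -> str:
    """Return a unified diff that turns `old` into `new` for the file `path`."""
    diff_lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff_lines)


patch = make_patch(before, after, "calc.py")
print(patch)
```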

Key Features#

Representative 50-sample subset of SWE-bench Verified
Same difficulty distribution as full dataset
Dramatically reduced storage requirements (5GB vs 130GB)
Ideal for quick evaluation and development iteration
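Designed to remain statistically representative of the full set for benchmarking, although each of the 50 samples accounts for 2 percentage points of the score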
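
Because the subset has only 50 samples, a single `acc` value carries noticeable sampling uncertainty. The sketch below estimates a 95% Wilson score interval for a pass rate; it is a generic statistical illustration, not part of EvalScope, and the count of 20 resolved samples is a made-up example value.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical result: 20 of the 50 samples resolved.
low, high = wilson_interval(passed=20, total=50)
print(f"acc = {20 / 50:.2f}, approximate 95% interval = [{low:.2f}, {high:.2f}]")
```

An interval this wide suggests treating small score differences between models on the mini set with caution and confirming them on the full Verified set where feasible.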

Evaluation Notes#

Requires `pip install swebench==4.1.0` before evaluation
Docker images are built/pulled automatically
See the usage documentation for detailed setup
Good for rapid prototyping and initial model assessment
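
A quick pre-flight check along these lines can catch a missing dependency before a long run starts. This is a minimal sketch using only the standard library; the function name `check_prerequisites` is hypothetical, and looking for a `docker` executable on `PATH` is an assumption about how Docker is installed on the machine.

```python
import shutil
from importlib import metadata

REQUIRED_SWEBENCH = "4.1.0"  # version pinned in the notes above


def check_prerequisites(required: str = REQUIRED_SWEBENCH) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    try:
        installed = metadata.version("swebench")
    except metadata.PackageNotFoundError:
        problems.append(f"swebench is not installed; run: pip install swebench=={required}")
    else:
        if installed != required:
            problems.append(f"swebench {installed} found, but {required} is expected")
    # Assumption: Docker is available as a `docker` executable on PATH.
    if shutil.which("docker") is None:
        problems.append("docker executable not found on PATH")
    return problems


for problem in check_prerequisites():
    print("WARNING:", problem)
```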

Properties#

Property	Value
Benchmark Name	`swe_bench_verified_mini`
Dataset ID	evalscope/swe-bench-verified-mini
Paper	N/A
Tags	`Coding`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics#

Statistics not available.

Sample Example#

Sample example not available.

Prompt Template#

{question}
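
The template is a bare placeholder, so the prompt sent to the model is just the content substituted for `{question}`. The sketch below mimics that substitution with Python's `str.format`; whether EvalScope renders the template in exactly this way is an assumption, and `render_prompt` and the sample issue text are hypothetical.

```python
PROMPT_TEMPLATE = "{question}"


def render_prompt(question: str, template: str = PROMPT_TEMPLATE) -> str:
    """Fill the `{question}` placeholder with the issue text."""
    return template.format(question=question)


# Hypothetical issue text, for illustration only.
prompt = render_prompt("TypeError raised when calling parse() with an empty string")
print(prompt)
```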

Extra Parameters#

Parameter	Type	Default	Description
`build_docker_images`	`bool`	`True`	Build Docker images locally for each sample.
`pull_remote_images_if_available`	`bool`	`True`	Attempt to pull existing remote Docker images before building.
`inference_dataset_id`	`str`	`princeton-nlp/SWE-bench_oracle`	Oracle dataset ID used to fetch inference context.
`force_arch`	`str`	`''`	Optionally force the Docker images to be pulled/built for a specific architecture. Choices: `''`, `'arm64'`, `'x86_64'`.
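
If `force_arch` needs to be set explicitly, the host architecture can be mapped onto the documented choices. This is a hedged sketch: `suggest_force_arch` and the alias table are hypothetical helpers, not part of EvalScope, and leaving the parameter empty remains the documented default.

```python
import platform
from typing import Optional

# Common names reported by platform.machine(), mapped to the documented choices.
_ARCH_ALIASES = {
    "x86_64": "x86_64",
    "amd64": "x86_64",
    "arm64": "arm64",
    "aarch64": "arm64",
}


def suggest_force_arch(machine: Optional[str] = None) -> str:
    """Map a machine name to 'arm64' or 'x86_64'; return '' when it is unrecognized."""
    name = (machine if machine is not None else platform.machine()).lower()
    return _ARCH_ALIASES.get(name, "")


print(repr(suggest_force_arch()))
```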

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets swe_bench_verified_mini \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['swe_bench_verified_mini'],
    dataset_args={
        'swe_bench_verified_mini': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
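
To override the extra parameters rather than rely on the defaults, an `extra_params` dictionary can be supplied under the benchmark's entry in `dataset_args`, following the commented placeholder in the example above. The sketch below assumes that layout; `build_dataset_args` and `run_with_overrides` are hypothetical helper names, and the default values are copied from the Extra Parameters table.

```python
# Defaults copied from the Extra Parameters table.
DEFAULT_EXTRA_PARAMS = {
    "build_docker_images": True,
    "pull_remote_images_if_available": True,
    "inference_dataset_id": "princeton-nlp/SWE-bench_oracle",
    "force_arch": "",
}


def build_dataset_args(**overrides) -> dict:
    """Merge overrides into the documented defaults and wrap them for `dataset_args`."""
    unknown = set(overrides) - set(DEFAULT_EXTRA_PARAMS)
    if unknown:
        raise ValueError(f"unknown extra parameters: {sorted(unknown)}")
    extra_params = {**DEFAULT_EXTRA_PARAMS, **overrides}
    return {"swe_bench_verified_mini": {"extra_params": extra_params}}


def run_with_overrides(**overrides):
    """Run the benchmark with overridden extra parameters (requires evalscope)."""
    # Imported lazily so build_dataset_args can be used without evalscope installed.
    from evalscope import run_task
    from evalscope.config import TaskConfig

    task_cfg = TaskConfig(
        model="YOUR_MODEL",
        api_url="OPENAI_API_COMPAT_URL",
        api_key="EMPTY_TOKEN",
        datasets=["swe_bench_verified_mini"],
        dataset_args=build_dataset_args(**overrides),
        limit=10,  # Remove this line for formal evaluation
    )
    return run_task(task_cfg=task_cfg)


print(build_dataset_args(force_arch="arm64"))
```

Calling `run_with_overrides(force_arch="arm64")` would then start the evaluation with only that parameter changed.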