SWE-bench_Verified#
Overview#
SWE-bench Verified is a human-validated subset of 500 samples from SWE-bench, designed to test systems’ ability to automatically resolve real-world GitHub issues. Each sample represents a genuine bug fix or feature implementation from popular Python repositories.
Task Description#
Task Type: Automated Software Engineering / Bug Fixing
Input: GitHub issue description with repository context
Output: Code patch (diff format) that resolves the issue
Repositories: 12 popular Python projects (Django, Flask, Requests, etc.)
Key Features#
500 human-validated Issue-Pull Request pairs
Real-world bugs from production Python repositories
Evaluation via unit test verification
Docker-based isolated execution environments
Tests both bug understanding and code modification skills
Evaluation Notes#
Requires
pip install swebench==4.1.0before evaluationDocker images are built/pulled automatically for each repository
Timeout of 1800 seconds (30 min) per instance
See the usage documentation for detailed setup instructions
Supports both local image building and remote image pulling
Properties#
Property |
Value |
|---|---|
Benchmark Name |
|
Dataset ID |
|
Paper |
N/A |
Tags |
|
Metrics |
|
Default Shots |
0-shot |
Evaluation Split |
|
Data Statistics#
Statistics not available.
Sample Example#
Sample example not available.
Prompt Template#
Prompt Template:
{question}
Extra Parameters#
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
|
Oracle dataset ID used to fetch inference context. |
|
|
|
Build Docker images locally for each sample. |
|
|
|
Attempt to pull existing remote Docker images before building. |
|
|
`` |
Optionally force the docker images to be pulled/built for a specific architecture. Choices: [‘’, ‘arm64’, ‘x86_64’] |
Usage#
Using CLI#
evalscope eval \
--model YOUR_MODEL \
--api-url OPENAI_API_COMPAT_URL \
--api-key EMPTY_TOKEN \
--datasets swe_bench_verified \
--limit 10 # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
model='YOUR_MODEL',
api_url='OPENAI_API_COMPAT_URL',
api_key='EMPTY_TOKEN',
datasets=['swe_bench_verified'],
dataset_args={
'swe_bench_verified': {
# extra_params: {} # uses default extra parameters
}
},
limit=10, # Remove this line for formal evaluation
)
run_task(task_cfg=task_cfg)