SWE-bench#
Introduction#
SWE-bench is a benchmark suite for evaluating Large Language Models (LLMs) on real-world software engineering tasks. Each sample corresponds to a real Issue–PR pair from a GitHub repository, and the model is required to produce a code patch that passes the hidden unit tests.
EvalScope’s SWE-bench integration provides two evaluation modes:
Oracle single-turn mode: The retrieved relevant code context is provided to the model in one shot, and the model directly generates a patch.
Agentic multi-turn mode: Only the issue description is provided. The model drives a multi-turn agent loop inside a per-instance SWE-bench Docker container, exploring the codebase via bash, editing source files, and finally submitting a patch (compatible with the mini-swe-agent pipeline).
Comparison Between the Two Modes#
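| Dimension | Oracle Single-turn | Agentic Multi-turn |
|---|---|---|
| Input | Issue plus retrieved code context | Issue description only (problem_statement) |
| Context source | Pre-injected via inference_dataset_id | Model autonomously explores |
| Model interaction | Single request, patch produced directly | Multi-turn agent loop (default cap 250 steps) |
| Tool use | None | bash |
| Submission mechanism | Model’s diff output is the submission | Model prints sentinel COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT |
| Docker stage | Evaluation stage only (running tests) | Both inference and evaluation stages |
| Datasets | swe_bench_verified, swe_bench_verified_mini, swe_bench_lite | swe_bench_verified_agentic, swe_bench_verified_mini_agentic, swe_bench_lite_agentic |
| Cost | Lower, single-turn inference | Higher, multi-turn inference + long-running containers |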
When to choose which
Oracle mode: Suitable for quickly establishing a baseline, comparing different retrieval strategies, or running low-cost trials when compute/time is limited.
Agentic mode: Suitable for evaluating the model’s end-to-end capability as a “software engineering agent” (autonomous code exploration, patch authoring, iterative debugging) — closer to a real development workflow.
Datasets#
Oracle-only: inference datasets (inference_dataset_id)#
These datasets are SWE-bench variants that differ in the retrieval method and in the size limit on the code context:
princeton-nlp/SWE-bench_oracle: Oracle retrieval (idealized retrieval, used as an upper-bound baseline). Default value.
princeton-nlp/SWE-bench_bm25_13K: BM25 retrieval, code context capped at 13,000 tokens (cl100k_base).
princeton-nlp/SWE-bench_bm25_27K: BM25 retrieval, capped at 27,000 tokens.
princeton-nlp/SWE-bench_bm25_40K: BM25 retrieval, capped at 40,000 tokens.
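Switching the retrieval variant only requires changing inference_dataset_id. The following is a minimal sketch that reuses the extra_params layout from the oracle example later in this document; the helper name oracle_dataset_args and the short keys in INFERENCE_DATASETS are hypothetical and not part of EvalScope.

```python
# Inference dataset IDs as listed in this section.
INFERENCE_DATASETS = {
    'oracle': 'princeton-nlp/SWE-bench_oracle',
    'bm25_13K': 'princeton-nlp/SWE-bench_bm25_13K',
    'bm25_27K': 'princeton-nlp/SWE-bench_bm25_27K',
    'bm25_40K': 'princeton-nlp/SWE-bench_bm25_40K',
}


def oracle_dataset_args(dataset: str, retrieval: str = 'oracle') -> dict:
    """Build a dataset_args entry for oracle mode (hypothetical helper)."""
    if retrieval not in INFERENCE_DATASETS:
        raise ValueError(f'unknown retrieval variant: {retrieval!r}')
    return {
        dataset: {
            'extra_params': {
                'inference_dataset_id': INFERENCE_DATASETS[retrieval],
            }
        }
    }


# Example: evaluate swe_bench_verified with the 27K-token BM25 context.
print(oracle_dataset_args('swe_bench_verified', 'bm25_27K'))
```

The returned dictionary can be passed as dataset_args to TaskConfig, alongside any other extra_params you need.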
Agentic-only: evaluation datasets#
Append the _agentic suffix to a core dataset name to trigger the agentic multi-turn evaluation:
swe_bench_verified_agentic: SWE-bench Verified (500 samples)
swe_bench_verified_mini_agentic: SWE-bench Verified Mini (50 samples, ~5GB)
swe_bench_lite_agentic: SWE-bench Lite (300 samples)
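The naming rule can be expressed in a few lines. This is an illustrative sketch only: to_agentic and is_agentic are hypothetical helpers, and the core dataset names are the ones used in the examples in this document.

```python
AGENTIC_SUFFIX = '_agentic'

# Core dataset names used in this document.
CORE_DATASETS = ('swe_bench_verified', 'swe_bench_verified_mini', 'swe_bench_lite')


def to_agentic(name: str) -> str:
    """Return the agentic dataset name for a core dataset (hypothetical helper)."""
    if name not in CORE_DATASETS:
        raise ValueError(f'not a core SWE-bench dataset: {name!r}')
    return name + AGENTIC_SUFFIX


def is_agentic(name: str) -> bool:
    """True if the dataset name selects the agentic multi-turn mode."""
    return name.endswith(AGENTIC_SUFFIX)


for core in CORE_DATASETS:
    print(core, '->', to_agentic(core))
```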
Install Dependencies#
SWE-bench uses Docker to ensure evaluation reproducibility.
Install Docker: Refer to the Docker Installation Guide to install Docker on your machine.
Additional configuration for Linux users: It is recommended to follow the post-installation steps for a better experience.
Install the Python dependencies:
pip install evalscope
pip install swebench==4.1.0
Tip: A properly configured Docker is a prerequisite for running SWE-bench evaluations. Make sure the Docker service is running after installation.
Note
On the first run of a swe_bench task, the system needs to build/download the required Docker images, which is resource-intensive:
Time: Building images for the full SWE-bench Verified can take several hours.
Storage: ~130GB for the full set; ~5GB for the Mini set.
Recommendation: Make sure you have enough disk space and time budget for the first run, and prefer an environment with a stable network.
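Before the first run, it can help to check that Docker is reachable and that enough disk space is free. The sketch below uses only the Python standard library and the approximate sizes quoted above. It assumes the images are stored on the same filesystem as the path you pass in, which may not hold if Docker's data directory lives on a different disk.

```python
import shutil
import subprocess

# Approximate image sizes quoted in this document, in GB.
REQUIRED_GB = {'full': 130, 'mini': 5}


def docker_is_running() -> bool:
    """Return True if the docker CLI is on PATH and `docker info` succeeds."""
    if shutil.which('docker') is None:
        return False
    try:
        result = subprocess.run(
            ['docker', 'info'],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def has_enough_disk(path: str = '.', image_set: str = 'mini') -> bool:
    """Compare free space at `path` against the rough requirement for `image_set`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= REQUIRED_GB[image_set]


if __name__ == '__main__':
    print('Docker running:', docker_is_running())
    print('Enough disk for Mini set:', has_enough_disk('.', 'mini'))
    print('Enough disk for full set:', has_enough_disk('.', 'full'))
```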
Oracle Single-turn Evaluation#
Below is an example of evaluating swe_bench_verified with the qwen-plus model:
import os

from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='qwen-plus',
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    eval_type='openai_api',  # Use API model service
    datasets=['swe_bench_verified'],  # Can also be 'swe_bench_verified_mini' or 'swe_bench_lite'
    dataset_args={
        'swe_bench_verified': {
            'extra_params': {
                'build_docker_images': True,  # Build Docker images required for evaluation. Recommended for first run.
                'pull_remote_images_if_available': True,  # Prefer pulling pre-built remote images when available. Recommended.
                'inference_dataset_id': 'princeton-nlp/SWE-bench_oracle',  # Inference dataset: BM25_13K / 27K / 40K / oracle
            }
        }
    },
    eval_batch_size=5,  # Inference batch size / number of parallel evaluation workers (Docker containers)
    limit=5,  # Limit samples for quick testing; remove for formal evaluation
    generation_config={
        'temperature': 0.1,
    },
)

run_task(task_cfg=task_cfg)
Intermediate evaluation artifacts are saved under outputs/xxxxx/swebench_log, including patch.diff and other files. Example final result:
+-----------+--------------------+----------+----------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+===========+====================+==========+==========+=======+=========+=========+
| qwen-plus | swe_bench_verified | mean_acc | default | 5 | 0.2 | default |
+-----------+--------------------+----------+----------+-------+---------+---------+
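To inspect the generated patches after a run, you can walk the output directory. This is a rough sketch: the directory layout below swebench_log is not specified here, so the code simply searches recursively for files named patch.diff, and find_patches is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def find_patches(root: str = 'outputs') -> List[Path]:
    """Return every patch.diff located somewhere under a swebench_log directory."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    return sorted(
        p for p in root_path.rglob('patch.diff') if 'swebench_log' in p.parts
    )


# Print each patch path together with its length in lines.
for patch in find_patches():
    line_count = len(patch.read_text(errors='replace').splitlines())
    print(f'{patch}: {line_count} lines')
```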
Agentic Multi-turn Evaluation#
In agentic mode, the model receives only the raw problem_statement (no oracle context) and drives a multi-turn agent loop inside a per-instance SWE-bench Docker container: it uses bash to explore /testbed and edit source files, then submits a git diff by printing the sentinel COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT.
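The control flow of such a loop can be sketched as follows. This is a simplified illustration, not EvalScope's implementation: run_agent_loop, next_command, and run_bash are hypothetical names, and the sketch assumes that the submitted diff is whatever the command prints after the sentinel.

```python
from typing import Callable, List, Tuple

SENTINEL = 'COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT'


def run_agent_loop(
    next_command: Callable[[List[Tuple[str, str]]], str],
    run_bash: Callable[[str], str],
    max_steps: int = 250,
) -> Tuple[str, int]:
    """Alternate between model commands and bash output until the sentinel appears.

    Returns (submission, steps_used); submission is '' if the step cap is reached.
    """
    history: List[Tuple[str, str]] = []
    for step in range(1, max_steps + 1):
        command = next_command(history)  # the model decides what to run next
        output = run_bash(command)       # executed inside the container
        history.append((command, output))
        if SENTINEL in output:
            # Assumption: the text printed after the sentinel is the submitted diff.
            return output.split(SENTINEL, 1)[1].strip(), step
    return '', max_steps


# Demo with a scripted "model" and a fake bash executor.
scripted = iter(['ls /testbed', f'echo {SENTINEL} && git diff'])


def fake_model(history: List[Tuple[str, str]]) -> str:
    return next(scripted)


def fake_bash(command: str) -> str:
    if command.startswith(f'echo {SENTINEL}'):
        return SENTINEL + '\ndiff --git a/f.py b/f.py\n'
    return 'README.md\nsetup.py\n'


submission, steps = run_agent_loop(fake_model, fake_bash, max_steps=5)
print(steps, submission)
```

In a real run, next_command would call the model with the conversation so far, and run_bash would execute the command in the per-instance container with a timeout.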
Action Protocol#
The action_protocol parameter controls how the model interacts with bash:
| Protocol | Description | Best for |
|---|---|---|
| toolcall | OpenAI function-calling with a single bash tool | Models with function-calling support |
| backticks | Model wraps commands in backtick-fenced code blocks | Models without function-calling |
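To make the difference concrete, the sketch below shows what each protocol might look like on the wire. Both parts are assumptions for illustration: the tool name bash and its command argument follow the general OpenAI tool-definition format but are not confirmed by this document, and the exact fence syntax expected by the backticks protocol is assumed to be a bash-tagged (or untagged) triple-backtick block.

```python
import re
from typing import Optional

# OpenAI-style tool definition. The tool name and the 'command' argument are assumed.
BASH_TOOL = {
    'type': 'function',
    'function': {
        'name': 'bash',
        'description': 'Run a shell command inside the container.',
        'parameters': {
            'type': 'object',
            'properties': {'command': {'type': 'string'}},
            'required': ['command'],
        },
    },
}

# Three backticks, built programmatically to keep this example easy to embed.
TICKS = '`' * 3

# Assumed fence format: an optional 'bash' or 'sh' tag, then the command lines.
_FENCE = re.compile(TICKS + r'(?:bash|sh)?[ \t]*\n(.*?)\n' + TICKS, re.DOTALL)


def extract_command(reply: str) -> Optional[str]:
    """Return the first fenced command in a model reply, or None if there is none."""
    match = _FENCE.search(reply)
    return match.group(1).strip() if match else None


reply = f'Let me look around first.\n{TICKS}bash\nls -la /testbed\n{TICKS}\n'
print(extract_command(reply))
```

With toolcall, the command arrives as structured arguments and no text parsing is needed, which is why that protocol is preferable when the model supports function calling.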
Run Example#
The example below mirrors test_swe_bench_verified_mini_agentic in tests/benchmark/test_agent.py:
import os

from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='qwen3-max',
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    eval_type='openai_api',
    datasets=['swe_bench_verified_mini_agentic'],  # Can also be 'swe_bench_verified_agentic' or 'swe_bench_lite_agentic'
    dataset_args={
        'swe_bench_verified_mini_agentic': {
            'extra_params': {
                'action_protocol': 'toolcall',  # Action protocol: 'toolcall' or 'backticks'
                'max_steps': 250,  # Maximum agent loop steps
                'command_timeout': 60.0,  # Per-bash-command timeout (seconds)
                'working_dir': '/testbed',  # Working directory inside the container
                'build_docker_images': True,  # Prepare images required for evaluation; recommended for first run
                'pull_remote_images_if_available': True,  # Prefer pulling pre-built remote images
                # 'force_arch': 'arm64',  # Apple Silicon users may explicitly set 'arm64'; default '' = auto-detect
            }
        }
    },
    eval_batch_size=5,  # Number of parallel containers
    limit=3,  # Limit samples for quick testing; remove for formal evaluation
    generation_config={
        'temperature': 0.7,
        'parallel_tool_calls': True,
        'stream': True,
    },
)

run_task(task_cfg=task_cfg)
Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| action_protocol | str | | Bash interaction protocol: toolcall or backticks |
| max_steps | int | 250 | Maximum agent loop steps per sample |
| command_timeout | float | | Per-bash-command timeout in seconds |
| working_dir | str | /testbed | Working directory inside the container (SWE-bench image convention) |
| build_docker_images | bool | | Whether to prepare Docker images (build or pull) |
| pull_remote_images_if_available | bool | | Prefer pulling pre-built remote images when available |
| force_arch | str | '' | Force the image architecture, for example arm64; '' means auto-detect |
Note
Agentic mode requires Docker to be running during both the inference and evaluation stages. Each sample spawns its own container from a pre-built SWE-bench image, so eval_batch_size effectively controls the number of concurrently running containers — tune it according to your machine resources.
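One simple way to pick a starting value is to derive it from the CPU count. The heuristic below is an assumption for illustration, not a recommendation from EvalScope: the budget of two CPUs per container and the cap of eight containers are arbitrary defaults that you should adjust after observing memory and disk pressure on your machine.

```python
import os


def suggest_eval_batch_size(cpus_per_container: int = 2, max_containers: int = 8) -> int:
    """Suggest a container count from the CPU count (hypothetical heuristic)."""
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(max_containers, cpu_count // cpus_per_container))


print('Suggested eval_batch_size:', suggest_eval_batch_size())
```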