SWE-bench

Introduction

SWE-bench is a benchmark suite for evaluating Large Language Models (LLMs) on real-world software engineering tasks. Each sample corresponds to a real Issue–PR pair from a GitHub repository, and the model is required to produce a code patch that passes the hidden unit tests.

EvalScope’s SWE-bench integration provides two evaluation modes:

Oracle single-turn mode: The retrieved relevant code context is provided to the model in one shot, and the model directly generates a patch.
Agentic multi-turn mode: Only the issue description is provided. The model drives a multi-turn agent loop inside a per-instance SWE-bench Docker container, exploring the codebase via bash, editing source files and finally submitting a patch (compatible with the mini-swe-agent pipeline).

Comparison Between the Two Modes

| Dimension | Oracle Single-turn | Agentic Multi-turn |
| --- | --- | --- |
| Input | `problem_statement` + pre-retrieved code context | `problem_statement` only |
| Context source | Pre-injected via `inference_dataset_id` (oracle / BM25) | Model autonomously explores `/testbed` via `bash` |
| Model interaction | Single request, patch produced directly | Multi-turn agent loop (default cap 250 steps) |
| Tool use | None | `bash` tool, supports `toolcall` / `backticks` protocols |
| Submission mechanism | Model's diff output is the submission | Model prints sentinel `COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT`, system collects `git diff` |
| Docker stage | Evaluation stage only (running tests) | Both inference and evaluation stages |
| Datasets | `swe_bench_verified` / `swe_bench_verified_mini` / `swe_bench_lite` | `swe_bench_verified_agentic` / `swe_bench_verified_mini_agentic` / `swe_bench_lite_agentic` |
| Cost | Lower, single-turn inference | Higher, multi-turn inference + long-running containers |

When to choose which

Oracle mode: Suitable for quickly establishing a baseline, comparing different retrieval strategies, or running low-cost trials when compute/time is limited.
Agentic mode: Suitable for evaluating the model's end-to-end capability as a "software engineering agent" (autonomous code exploration, patch authoring, iterative debugging), which is closer to a real development workflow.

Datasets

Core evaluation datasets (shared name roots)

These datasets evaluate models on Issue–PR pairs with unit-test verification. They range from the 500-sample Verified set down to smaller curated subsets:

SWE-bench Verified (swe_bench_verified): 500 manually verified samples from the SWE-bench test set, with strictly controlled quality.
SWE-bench Verified Mini (swe_bench_verified_mini): A 50-sample lightweight subset, ~5GB storage (vs ~130GB for the full set), preserving the difficulty distribution of the original.
SWE-bench Lite (swe_bench_lite): 300 Issue–PR pairs from 11 popular Python projects.

Oracle-only: inference datasets (`inference_dataset_id`)

These are SWE-bench variants whose prompts already contain retrieved code context. They differ in the retrieval method and in the size limit on that context:

princeton-nlp/SWE-bench_oracle: Oracle retrieval (idealized retrieval that serves as an upper-bound baseline). This is the default value.
princeton-nlp/SWE-bench_bm25_13K: BM25 retrieval, code context capped at 13,000 tokens (cl100k_base).
princeton-nlp/SWE-bench_bm25_27K: BM25 retrieval, capped at 27,000 tokens.
princeton-nlp/SWE-bench_bm25_40K: BM25 retrieval, capped at 40,000 tokens.
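
Choosing among these variants usually comes down to how much code context the model can accept. The sketch below picks the largest BM25 variant that fits a given budget; the dataset IDs and caps are copied from the list above, while `pick_inference_dataset` and its arguments are hypothetical helpers that are not part of EvalScope. Note that the cap covers the code context only, so the issue text and instructions need extra headroom.

```python
from typing import Optional

# Dataset IDs and token caps are copied from the list above.
BM25_DATASETS = {
    13_000: "princeton-nlp/SWE-bench_bm25_13K",
    27_000: "princeton-nlp/SWE-bench_bm25_27K",
    40_000: "princeton-nlp/SWE-bench_bm25_40K",
}
ORACLE_DATASET = "princeton-nlp/SWE-bench_oracle"


def pick_inference_dataset(retrieval: str, context_budget: Optional[int] = None) -> str:
    """Hypothetical helper: return an inference_dataset_id for the given retrieval method.

    For BM25, choose the largest variant whose code-context cap fits the budget.
    """
    if retrieval == "oracle":
        return ORACLE_DATASET
    if retrieval != "bm25":
        raise ValueError(f"retrieval must be 'oracle' or 'bm25', got {retrieval!r}")
    if context_budget is None:
        raise ValueError("context_budget is required for BM25 retrieval")
    fitting = [cap for cap in BM25_DATASETS if cap <= context_budget]
    if not fitting:
        raise ValueError(f"no BM25 variant fits a budget of {context_budget} tokens")
    return BM25_DATASETS[max(fitting)]


print(pick_inference_dataset("bm25", context_budget=30_000))
# princeton-nlp/SWE-bench_bm25_27K
```

The returned string can be passed as the `inference_dataset_id` entry of `extra_params`.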

Agentic-only: evaluation datasets

Append the _agentic suffix to a core dataset name to trigger the agentic multi-turn evaluation:

swe_bench_verified_agentic — SWE-bench Verified (500 samples)
swe_bench_verified_mini_agentic — SWE-bench Verified Mini (50 samples, ~5GB)
swe_bench_lite_agentic — SWE-bench Lite (300 samples)
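
The naming rule above can be sketched as a pair of small helpers. The dataset names come from this page; `to_agentic` and `to_core` are hypothetical convenience functions, not EvalScope APIs.

```python
AGENTIC_SUFFIX = "_agentic"
CORE_DATASETS = ("swe_bench_verified", "swe_bench_verified_mini", "swe_bench_lite")


def to_agentic(name: str) -> str:
    """Map a core dataset name to its agentic counterpart."""
    if name not in CORE_DATASETS:
        raise ValueError(f"unknown core dataset: {name!r}")
    return name + AGENTIC_SUFFIX


def to_core(name: str) -> str:
    """Map an agentic dataset name back to its core dataset name."""
    if not name.endswith(AGENTIC_SUFFIX):
        raise ValueError(f"not an agentic dataset name: {name!r}")
    core = name[: -len(AGENTIC_SUFFIX)]
    if core not in CORE_DATASETS:
        raise ValueError(f"unknown core dataset: {core!r}")
    return core


print(to_agentic("swe_bench_verified_mini"))  # swe_bench_verified_mini_agentic
print(to_core("swe_bench_lite_agentic"))      # swe_bench_lite
```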

Install Dependencies

SWE-bench uses Docker to ensure evaluation reproducibility.

Install Docker: Refer to the Docker Installation Guide to install Docker on your machine.
Additional configuration for Linux users: Following Docker's Linux post-installation steps (for example, to run Docker as a non-root user) is recommended.
Install the Python dependencies:

pip install evalscope
pip install swebench==4.1.0

Tip: A properly configured Docker installation is a prerequisite for running SWE-bench evaluations. Make sure the Docker service is running after installation.
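
One way to confirm that the Docker service is reachable before launching an evaluation is to call `docker info`, which exits with a non-zero status when the daemon cannot be contacted. The helper name `docker_is_running` is hypothetical; this is a preflight sketch and not something EvalScope requires.

```python
import shutil
import subprocess


def docker_is_running(binary: str = "docker", timeout: float = 10.0) -> bool:
    """Return True if the Docker CLI is on PATH and the daemon answers `docker info`."""
    if shutil.which(binary) is None:
        return False  # CLI not installed or not on PATH
    try:
        result = subprocess.run(
            [binary, "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # daemon hung or the binary could not be executed
    return result.returncode == 0


if __name__ == "__main__":
    print("Docker ready:", docker_is_running())
```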

Note

On the first run of a swe_bench task, the system needs to build/download the required Docker images, which is resource-intensive:

Time: Building images for the full SWE-bench Verified can take several hours.
Storage: ~130GB for the full set; ~5GB for the Mini set.

Recommendation: Make sure you have enough disk space and time budget for the first run, and prefer an environment with a stable network.
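
A quick disk-space check along these lines can catch a shortfall before the first run. The sizes are the approximate figures quoted above; `has_enough_disk` is a hypothetical helper, and the path to check is an assumption: it should be the filesystem that holds Docker's data (commonly `/var/lib/docker` on Linux), which may differ on your machine.

```python
import shutil

# Approximate image footprints quoted above, in GB.
REQUIRED_GB = {"full": 130, "mini": 5}


def has_enough_disk(path: str, required_gb: float) -> bool:
    """Return True if the filesystem containing `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if __name__ == "__main__":
    # "." is used only so the sketch runs anywhere; point it at Docker's data directory.
    for label, size in REQUIRED_GB.items():
        print(f"{label}: enough space = {has_enough_disk('.', size)}")
```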

Oracle Single-turn Evaluation

Below is an example of evaluating swe_bench_verified with the qwen-plus model:

import os
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='qwen-plus',
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    eval_type='openai_api',  # Use API model service
    datasets=['swe_bench_verified'],  # Can also be 'swe_bench_verified_mini' or 'swe_bench_lite'
    dataset_args={
        'swe_bench_verified': {
            'extra_params': {
                'build_docker_images': True,                              # Build Docker images required for evaluation. Recommended for first run.
                'pull_remote_images_if_available': True,                  # Prefer pulling pre-built remote images when available. Recommended.
                'inference_dataset_id': 'princeton-nlp/SWE-bench_oracle'  # Inference dataset: BM25_13K / 27K / 40K / oracle
            }
        }
    },
    eval_batch_size=5,  # Inference batch size / number of parallel evaluation workers (Docker containers)
    limit=5,            # Limit samples for quick testing; remove for formal evaluation
    generation_config={
        'temperature': 0.1,
    }
)
run_task(task_cfg=task_cfg)

Intermediate evaluation artifacts are saved under outputs/xxxxx/swebench_log (xxxxx is a placeholder for the run-specific directory), including patch.diff and other files. Example final result:

+-----------+--------------------+----------+----------+-------+---------+---------+
| Model     | Dataset            | Metric   | Subset   |   Num |   Score | Cat.0   |
+===========+====================+==========+==========+=======+=========+=========+
| qwen-plus | swe_bench_verified | mean_acc | default  |     5 |     0.2 | default |
+-----------+--------------------+----------+----------+-------+---------+---------+
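
To read the Score column: assuming `mean_acc` is the plain average of per-sample pass/fail outcomes (an assumption about the metric, not taken from the EvalScope source), a score of 0.2 over 5 samples corresponds to one resolved instance. The function below is a hypothetical illustration of that arithmetic.

```python
from typing import Sequence


def mean_acc(resolved: Sequence[bool]) -> float:
    """Average of per-sample outcomes, where True means the patch passed the tests."""
    if not resolved:
        raise ValueError("at least one sample is required")
    return sum(resolved) / len(resolved)


# One resolved instance out of five samples.
print(mean_acc([True, False, False, False, False]))  # 0.2
```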

Agentic Multi-turn Evaluation

In agentic mode, the model receives only the raw problem_statement (no oracle context) and drives a multi-turn agent loop inside a per-instance SWE-bench Docker container. It uses bash to explore /testbed and edit source files, then signals completion by printing the sentinel COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT, at which point the system collects the git diff as the submission.
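
The sentinel check can be pictured roughly as follows. This is only a sketch: the sentinel string comes from this page, but the matching rule (a whole line equal to the sentinel) and the function name `is_submission` are assumptions, and the actual harness may apply a stricter or looser rule.

```python
SENTINEL = "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT"


def is_submission(command_output: str) -> bool:
    """Return True if any line of the command output is exactly the sentinel."""
    return any(line.strip() == SENTINEL for line in command_output.splitlines())


# A command such as `echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT` would produce this output.
print(is_submission("COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT\n"))  # True
print(is_submission("tests passed\n"))                           # False
```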

Action Protocol

The action_protocol parameter controls how the model interacts with bash:

| Protocol | Description | Best for |
| --- | --- | --- |
| `toolcall` (default) | OpenAI function-calling with a single `bash` tool, supports parallel tool calls | Models with function-calling support |
| `backticks` | Model wraps commands in ```mswea_bash_command ... ``` fenced blocks, one command per turn | Models without function-calling |
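
To make the `backticks` protocol concrete, the sketch below extracts the single command from a model reply. The fence label `mswea_bash_command` and the one-command-per-turn rule come from the table above; the regular expression and the `extract_command` helper are assumptions for illustration and may not match the parser EvalScope actually uses.

```python
import re

# Built programmatically so this example can itself sit inside a fenced block.
FENCE = "`" * 3
COMMAND_BLOCK = re.compile(
    re.escape(FENCE) + r"mswea_bash_command[ \t]*\n(.*?)\n?" + re.escape(FENCE),
    re.DOTALL,
)


def extract_command(reply: str) -> str:
    """Return the one bash command in a reply, or raise if there is not exactly one."""
    blocks = COMMAND_BLOCK.findall(reply)
    if len(blocks) != 1:
        raise ValueError(f"expected exactly one command block, found {len(blocks)}")
    return blocks[0].strip()


reply = (
    "I will list the repository first.\n"
    + FENCE + "mswea_bash_command\n"
    + "ls -la /testbed\n"
    + FENCE + "\n"
)
print(extract_command(reply))  # ls -la /testbed
```

Rejecting replies with zero or several blocks keeps the loop deterministic: the agent can be asked to retry rather than having the harness guess which command was intended.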

Run Example

The example below mirrors test_swe_bench_verified_mini_agentic in tests/benchmark/test_agent.py:

import os
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='qwen3-max',
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    eval_type='openai_api',
    datasets=['swe_bench_verified_mini_agentic'],  # Can also be 'swe_bench_verified_agentic' or 'swe_bench_lite_agentic'
    dataset_args={
        'swe_bench_verified_mini_agentic': {
            'extra_params': {
                'action_protocol': 'toolcall',           # Action protocol: 'toolcall' or 'backticks'
                'max_steps': 250,                        # Maximum agent loop steps
                'command_timeout': 60.0,                 # Per-bash-command timeout (seconds)
                'working_dir': '/testbed',               # Working directory inside the container
                'build_docker_images': True,             # Prepare images required for evaluation; recommended for first run
                'pull_remote_images_if_available': True, # Prefer pulling pre-built remote images
                # 'force_arch': 'arm64',                 # Apple Silicon users may explicitly set 'arm64'; default '' = auto-detect
            }
        }
    },
    eval_batch_size=5,  # Number of parallel containers
    limit=3,            # Limit samples for quick testing; remove for formal evaluation
    generation_config={
        'temperature': 0.7,
        'parallel_tool_calls': True,
        'stream': True,
    }
)
run_task(task_cfg=task_cfg)
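
For budgeting, it can help to estimate how long the bash commands alone could take in the worst case. The sketch below assumes one bash command per step, every command hitting `command_timeout`, and samples processed in waves of `eval_batch_size`; it ignores model latency and image preparation, and parallel tool calls could push the real figure higher. The function name is hypothetical.

```python
import math


def worst_case_command_hours(
    num_samples: int,
    eval_batch_size: int,
    max_steps: int = 250,
    command_timeout: float = 60.0,
) -> float:
    """Rough ceiling on hours spent in bash commands, under the assumptions stated above."""
    waves = math.ceil(num_samples / eval_batch_size)  # batches run one after another
    return waves * max_steps * command_timeout / 3600


# Settings from the example above: limit=3, eval_batch_size=5.
print(round(worst_case_command_hours(num_samples=3, eval_batch_size=5), 2))  # 4.17
```

In practice most runs finish far below this ceiling, since agents usually submit well before `max_steps` and few commands run into the timeout.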

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `action_protocol` | str | `toolcall` | Bash interaction protocol: `toolcall` or `backticks` |
| `max_steps` | int | `250` | Maximum agent loop steps per sample |
| `command_timeout` | float | `60.0` | Per-bash-command timeout in seconds |
| `working_dir` | str | `/testbed` | Working directory inside the container (SWE-bench image convention) |
| `build_docker_images` | bool | `True` | Whether to prepare Docker images (build or pull) |
| `pull_remote_images_if_available` | bool | `True` | Prefer pulling pre-built remote images when available |
| `force_arch` | str | `''` | Force the image architecture: `''` / `arm64` / `x86_64`. Empty means auto-detect. |
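
Since `extra_params` is a plain dictionary, a typo in a key or value may only surface deep into a long run. The sketch below merges user overrides with the defaults from the table and rejects unknown keys or out-of-range values before anything is launched. `resolve_extra_params` is a hypothetical helper; whether EvalScope performs similar validation internally is not stated here.

```python
from typing import Any, Dict

# Defaults and allowed values are copied from the parameter table above.
AGENTIC_DEFAULTS: Dict[str, Any] = {
    "action_protocol": "toolcall",
    "max_steps": 250,
    "command_timeout": 60.0,
    "working_dir": "/testbed",
    "build_docker_images": True,
    "pull_remote_images_if_available": True,
    "force_arch": "",
}
ALLOWED_VALUES = {
    "action_protocol": {"toolcall", "backticks"},
    "force_arch": {"", "arm64", "x86_64"},
}


def resolve_extra_params(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Merge overrides into the defaults, raising ValueError on invalid input."""
    unknown = set(overrides) - set(AGENTIC_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {**AGENTIC_DEFAULTS, **overrides}
    for key, allowed in ALLOWED_VALUES.items():
        if params[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}, got {params[key]!r}")
    if params["max_steps"] <= 0 or params["command_timeout"] <= 0:
        raise ValueError("max_steps and command_timeout must be positive")
    return params


params = resolve_extra_params({"action_protocol": "backticks", "max_steps": 100})
print(params["action_protocol"], params["max_steps"], params["command_timeout"])
# backticks 100 60.0
```

The resulting dictionary can be passed as the `extra_params` value for the chosen agentic dataset.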