SWE-bench_Verified_Agentic#
Overview#
SWE-bench Verified Agentic is the agentic-mode evaluation of SWE-bench Verified, a human-validated subset of 500 samples from SWE-bench. Unlike the oracle single-turn variant, the model must autonomously explore the repository, run shell commands, edit source files, and submit a patch through a multi-turn agent loop driven inside a per-instance Docker container.
Task Description#
Task Type: Automated Software Engineering / Bug Fixing (Agentic)
Input: GitHub issue description (no oracle file context)
Output: Code patch (diff format) collected from `git diff` after autonomous editing
Repositories: 12 popular Python projects (Django, Flask, Requests, etc.)
Key Features#
500 human-validated Issue-Pull Request pairs
Multi-turn agent loop (mini-swe-agent `swebench.yaml` compatible)
Per-instance SWE-bench Docker container as the execution sandbox
Sentinel-based patch submission protocol (`COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT`)
Supports both function-calling (`toolcall`) and text-based (`backticks`) action protocols
Evaluation Notes#
Requires `pip install swebench==4.1.0` before evaluation
Docker images are built/pulled automatically for each repository
Timeout of 1800 seconds (30 min) per instance for final patch validation
See the usage documentation for detailed setup instructions
Supports both local image building and remote image pulling
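The prerequisites listed above can be checked before launching a run. The sketch below is not part of EvalScope: `preflight` is a hypothetical helper that uses only the Python standard library to report whether `swebench` 4.1.0 is installed and whether a `docker` executable is on the `PATH`. It does not verify that the Docker daemon is actually running.

```python
import shutil
from importlib import metadata

REQUIRED_SWEBENCH = "4.1.0"  # version pinned in the evaluation notes


def preflight() -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    try:
        installed = metadata.version("swebench")
        if installed != REQUIRED_SWEBENCH:
            problems.append(
                f"swebench {installed} found, expected {REQUIRED_SWEBENCH}"
            )
    except metadata.PackageNotFoundError:
        problems.append("swebench is not installed")
    # Only checks that the CLI is discoverable, not that the daemon is up.
    if shutil.which("docker") is None:
        problems.append("docker executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```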
Agentic Mode#
This benchmark drives a multi-turn agent loop, mirroring mini-swe-agent's `swebench.yaml`, inside a per-instance SWE-bench Docker container. The model issues bash commands to explore `/testbed` and edit source files, then submits its `git diff` patch by printing the sentinel `COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT` followed by the patch contents.
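As a rough illustration of the sentinel protocol, the sketch below separates a submitted patch from ordinary command output. It assumes the sentinel must appear alone on the first non-blank line of the output and that everything after it is the patch. `extract_submission` is a hypothetical name, and the benchmark's actual detection logic may differ.

```python
from typing import Optional

SENTINEL = "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT"


def extract_submission(command_output: str) -> Optional[str]:
    """Return the patch text if the output starts with the sentinel, else None."""
    lines = command_output.lstrip().splitlines(keepends=True)
    # Assumption: the sentinel occupies the first line by itself.
    if lines and lines[0].strip() == SENTINEL:
        return "".join(lines[1:])
    return None


if __name__ == "__main__":
    output = SENTINEL + "\ndiff --git a/x.py b/x.py\n"
    print(extract_submission(output))
```

Under this assumption, a command such as `echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && git diff` would produce output in the expected shape.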
`extra_params.action_protocol` selects between:
`toolcall` (default): OpenAI function-calling protocol with a single `bash` tool. Recommended for any model that supports tool calling.
`backticks`: text-based fallback expecting one ```mswea_bash_command ``` block per turn. For models without function-calling support.
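To make the `backticks` protocol concrete, here is a minimal sketch of pulling the single command block out of a model reply. The regular expression and the `parse_backticks_action` helper are assumptions for illustration, not the benchmark's actual parser. The fence string is built programmatically so the example nests cleanly inside Markdown.

```python
import re

FENCE = "`" * 3  # three backticks, built at runtime to avoid nesting issues

# Assumed block shape: fence + "mswea_bash_command", newline, command, newline, fence.
ACTION_RE = re.compile(
    re.escape(FENCE) + r"mswea_bash_command\s*\n(.*?)\n" + re.escape(FENCE),
    re.DOTALL,
)


def parse_backticks_action(reply: str) -> str:
    """Return the bash command from a reply holding exactly one action block."""
    matches = ACTION_RE.findall(reply)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one action block, found {len(matches)}")
    return matches[0].strip()


if __name__ == "__main__":
    reply = "Listing files.\n" + FENCE + "mswea_bash_command\nls /testbed\n" + FENCE
    print(parse_backticks_action(reply))
```

Rejecting replies with zero or several blocks keeps each turn to a single command, which matches the "one block per turn" expectation stated above.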
Properties#
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 500 |
| Prompt Length (Mean) | 1699.73 chars |
| Prompt Length (Min/Max) | 143 / 24770 chars |
Sample Example#
Subset: default
{
  "input": [
    {
      "id": "360d65af",
      "content": "Modeling's `separability_matrix` does not compute separability correctly for nested CompoundModels\nConsider the following model:\r\n\r\n```python\r\nfrom astropy.modeling import models as m\r\nfrom astropy.modeling.separable import separability_matri ... [TRUNCATED 762 chars] ... [ True, True, False, False],\r\n [False, False, True, True],\r\n [False, False, True, True]])\r\n```\r\nSuddenly the inputs and outputs are no longer separable?\r\n\r\nThis feels like a bug to me, but I might be missing something?\n"
    }
  ],
  "id": 0,
  "group_id": 0,
  "tools": [
    {
      "name": "bash",
      "description": "Execute a bash command inside the sandbox environment. Returns the combined stdout / stderr output of the command.",
      "parameters": {
        "properties": {
          "command": {
            "type": "string",
            "description": "The bash command to execute."
          },
          "timeout": {
            "type": "number",
            "description": "Maximum execution time in seconds (default: 60).",
            "default": 60
          }
        },
        "required": [
          "command"
        ]
      }
    }
  ],
  "metadata": {
    "problem_statement": "Modeling's `separability_matrix` does not compute separability correctly for nested CompoundModels\nConsider the following model:\r\n\r\n```python\r\nfrom astropy.modeling import models as m\r\nfrom astropy.modeling.separable import separability_matri ... [TRUNCATED 762 chars] ... [ True, True, False, False],\r\n [False, False, True, True],\r\n [False, False, True, True]])\r\n```\r\nSuddenly the inputs and outputs are no longer separable?\r\n\r\nThis feels like a bug to me, but I might be missing something?\n",
    "instance_id": "astropy__astropy-12907",
    "base_commit": "d16bfe05a744909de4b27f5875fe0d4ed41ce607",
    "patch": "diff --git a/astropy/modeling/separable.py b/astropy/modeling/separable.py\n--- a/astropy/modeling/separable.py\n+++ b/astropy/modeling/separable.py\n@@ -242,7 +242,7 @@ def _cstack(left, right):\n cright = _coord_matrix(right, 'right', noutp)\n else:\n cright = np.zeros((noutp, right.shape[1]))\n- cright[-right.shape[0]:, -right.shape[1]:] = 1\n+ cright[-right.shape[0]:, -right.shape[1]:] = right\n \n return np.hstack([cleft, cright])\n \n",
    "PASS_TO_PASS": [
      "astropy/modeling/tests/test_separable.py::test_coord_matrix",
      "astropy/modeling/tests/test_separable.py::test_cdot",
      "astropy/modeling/tests/test_separable.py::test_cstack",
      "astropy/modeling/tests/test_separable.py::test_arith_oper",
      "astropy/modeling/tests/test_separable.py::test_separable[compound_model0-result0]",
      "astropy/modeling/tests/test_separable.py::test_separable[compound_model1-result1]",
      "astropy/modeling/tests/test_separable.py::test_separable[compound_model2-result2]",
      "astropy/modeling/tests/test_separable.py::test_separable[compound_model3-result3]",
      "astropy/modeling/tests/test_separable.py::test_separable[compound_model4-result4]",
      "astropy/modeling/tests/test_separable.py::test_separable[compound_model5-result5]",
      "... [TRUNCATED 3 more items] ..."
    ],
    "FAIL_TO_PASS": [
      "astropy/modeling/tests/test_separable.py::test_separable[compound_model6-result6]",
      "astropy/modeling/tests/test_separable.py::test_separable[compound_model9-result9]"
    ],
    "test_patch": "diff --git a/astropy/modeling/tests/test_separable.py b/astropy/modeling/tests/test_separable.py\n--- a/astropy/modeling/tests/test_separable.py\n+++ b/astropy/modeling/tests/test_separable.py\n@@ -28,6 +28,13 @@\n p1 = models.Polynomial1D(1, nam ... [TRUNCATED 931 chars] ... [True, True, False, False, False],\n+ [False, False, True, False, False],\n+ [False, False, False, True, False],\n+ [False, False, False, False, True]]))),\n }\n \n \n",
    "version": "4.3",
    "repo": "astropy/astropy",
    "environment_setup_commit": "298ccb478e6bf092953bca67a3d29dc6c35f6752",
    "hints_text": "",
    "created_at": "2022-03-03T15:14:54Z",
    "docker_image": "swebench/sweb.eval.arm64.astropy_1776_astropy-12907:latest"
  }
}
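The `tools` entry in the sample is close to, but not identical to, the tool format used by OpenAI-style chat completion APIs, which nest the definition under a `function` key and expect the parameter schema to declare `"type": "object"`. The sketch below shows one way such an entry could be wrapped. `to_openai_tool` is a hypothetical helper, and whether EvalScope performs an equivalent conversion internally is an assumption.

```python
def to_openai_tool(tool: dict) -> dict:
    """Wrap a sample-style tool entry in the OpenAI function-calling envelope."""
    params = dict(tool.get("parameters", {}))
    # The sample omits the top-level schema type, so add it if missing.
    params.setdefault("type", "object")
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": params,
        },
    }


# Tool entry abbreviated from the sample above.
bash_tool = {
    "name": "bash",
    "description": "Execute a bash command inside the sandbox environment.",
    "parameters": {
        "properties": {
            "command": {"type": "string", "description": "The bash command to execute."},
            "timeout": {"type": "number", "default": 60},
        },
        "required": ["command"],
    },
}

if __name__ == "__main__":
    print(to_openai_tool(bash_tool)["function"]["name"])
```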
Prompt Template#
Prompt Template:
{question}
Extra Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| `action_protocol` | | `toolcall` | Agent action protocol: `toolcall` (mainline OpenAI function-calling, mirrors mini-swe-agent `swebench.yaml`) or `backticks` (text-based `mswea_bash_command` fallback for models without function-calling support). Choices: [`toolcall`, `backticks`] |
| | | | Maximum number of agent steps per sample. |
| | | | Default per-bash-command timeout in seconds. |
| | | | Working directory inside the SWE-bench container. |
| | | | Build Docker images locally for each sample. |
| | | | Attempt to pull existing remote Docker images before building. |
| | | `''` | Optionally force a specific architecture for image build/pull. Choices: [`''`, `arm64`, `x86_64`] |
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets swe_bench_verified_agentic \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['swe_bench_verified_agentic'],
    dataset_args={
        'swe_bench_verified_agentic': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
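To override a default, the commented-out `extra_params` entry can be filled in. The sketch below builds only the `dataset_args` dictionary, switching to the text-based protocol for a model without function-calling support. `action_protocol` and its two choices come from this page; the other extra parameters are left out because their names are not shown here.

```python
# Select the text-based fallback protocol instead of the default 'toolcall'.
dataset_args = {
    'swe_bench_verified_agentic': {
        'extra_params': {
            'action_protocol': 'backticks',
        }
    }
}

# Pass this as TaskConfig(..., dataset_args=dataset_args) in the example above.
print(dataset_args['swe_bench_verified_agentic']['extra_params'])
```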