SWE-bench_Verified_Agentic

Overview

SWE-bench Verified Agentic is the agentic-mode evaluation of SWE-bench Verified, a human-validated subset of 500 samples from SWE-bench. Unlike the oracle single-turn variant, the model must autonomously explore the repository, run shell commands, edit source files, and submit a patch through a multi-turn agent loop driven inside a per-instance Docker container.

Task Description

Task Type: Automated Software Engineering / Bug Fixing (Agentic)
Input: GitHub issue description (no oracle file context)
Output: Code patch (diff format) collected from git diff after autonomous editing
Repositories: 12 popular Python projects (Django, Flask, Requests, etc.)

Key Features

500 human-validated Issue-Pull Request pairs
Multi-turn agent loop (mini-swe-agent swebench.yaml compatible)
Per-instance SWE-bench Docker container as the execution sandbox
Sentinel-based patch submission protocol (COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT)
Supports both function-calling (toolcall) and text-based (backticks) action protocols

Evaluation Notes

Requires `pip install swebench==4.1.0` before evaluation
Docker images are built or pulled automatically for each instance
Timeout of 1800 seconds (30 min) per instance for final patch validation
See the usage documentation for detailed setup instructions
Supports both local image building and remote image pulling

Agentic Mode

This benchmark drives a multi-turn agent loop, modeled on mini-swe-agent's swebench.yaml, inside a per-instance SWE-bench Docker container. The model issues bash commands to explore /testbed and edit source files, then submits its git diff patch by printing the sentinel COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT followed by the patch contents.
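The loop and the sentinel check can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation: `run_agent`, `extract_submission`, and the two callables are hypothetical names, and the sketch assumes the sentinel must be the first non-empty line of a command's output.

```python
from typing import Callable, Optional

SENTINEL = "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT"


def extract_submission(output: str) -> Optional[str]:
    """Return the text after the sentinel line, or None if nothing was submitted.

    Assumption: the sentinel has to be the first non-empty line of the output.
    """
    lines = output.lstrip().splitlines(keepends=True)
    if lines and lines[0].strip() == SENTINEL:
        return "".join(lines[1:])
    return None


def run_agent(
    next_command: Callable[[str], str],  # hypothetical: model picks the next bash command
    run_bash: Callable[[str], str],      # hypothetical: runs the command in the container
    issue: str,
    max_steps: int = 250,
) -> Optional[str]:
    """Alternate model turns and command executions until a patch is submitted."""
    observation = issue
    for _ in range(max_steps):
        command = next_command(observation)
        observation = run_bash(command)
        patch = extract_submission(observation)
        if patch is not None:
            return patch
    return None  # step budget exhausted without a submission
```

In this sketch a run that never prints the sentinel yields `None`, which an evaluator would presumably treat as an empty patch.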

extra_params.action_protocol selects between:

toolcall (default): OpenAI function-calling protocol with a single bash tool. Recommended for any model that supports tool calling.
backticks: text-based fallback expecting one ```mswea_bash_command ``` block per turn. For models without function-calling support.
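For the backticks protocol, extracting the command from a model reply might look like the sketch below. The regular expression and the function name `parse_backticks_action` are assumptions made for illustration; the pattern the benchmark actually uses may differ.

```python
import re

# Build the fence from single characters so this example can itself sit in a fenced block.
FENCE = "`" * 3

# Assumed pattern: one fenced block tagged mswea_bash_command, command text on its own lines.
ACTION_RE = re.compile(
    FENCE + r"mswea_bash_command\s*\n(.*?)\n" + FENCE,
    re.DOTALL,
)


def parse_backticks_action(reply: str) -> str:
    """Return the single bash command in a reply, or raise if there is not exactly one."""
    blocks = ACTION_RE.findall(reply)
    if len(blocks) != 1:
        raise ValueError(
            f"expected exactly one mswea_bash_command block, found {len(blocks)}"
        )
    return blocks[0].strip()
```

Rejecting replies with zero or several blocks keeps each turn to a single action, matching the one-block-per-turn expectation described above.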

Properties

Property	Value
Benchmark Name	`swe_bench_verified_agentic`
Dataset ID	princeton-nlp/SWE-bench_Verified
Paper	N/A
Tags	`Coding`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics

Metric	Value
Total Samples	500
Prompt Length (Mean)	1699.73 chars
Prompt Length (Min/Max)	143 / 24770 chars

Sample Example

Subset: default

{
  "input": [
    {
      "id": "360d65af",
      "content": "Modeling's `separability_matrix` does not compute separability correctly for nested CompoundModels\nConsider the following model:\r\n\r\n```python\r\nfrom astropy.modeling import models as m\r\nfrom astropy.modeling.separable import separability_matri ... [TRUNCATED 762 chars] ...       [ True,  True, False, False],\r\n       [False, False,  True,  True],\r\n       [False, False,  True,  True]])\r\n```\r\nSuddenly the inputs and outputs are no longer separable?\r\n\r\nThis feels like a bug to me, but I might be missing something?\n"
    }
  ],
  "id": 0,
  "group_id": 0,
  "tools": [
    {
      "name": "bash",
      "description": "Execute a bash command inside the sandbox environment. Returns the combined stdout / stderr output of the command.",
      "parameters": {
        "properties": {
          "command": {
            "type": "string",
            "description": "The bash command to execute."
          },
          "timeout": {
            "type": "number",
            "description": "Maximum execution time in seconds (default: 60).",
            "default": 60
          }
        },
        "required": [
          "command"
        ]
      }
    }
  ],
  "metadata": {
    "problem_statement": "Modeling's `separability_matrix` does not compute separability correctly for nested CompoundModels\nConsider the following model:\r\n\r\n```python\r\nfrom astropy.modeling import models as m\r\nfrom astropy.modeling.separable import separability_matri ... [TRUNCATED 762 chars] ...       [ True,  True, False, False],\r\n       [False, False,  True,  True],\r\n       [False, False,  True,  True]])\r\n```\r\nSuddenly the inputs and outputs are no longer separable?\r\n\r\nThis feels like a bug to me, but I might be missing something?\n",
    "instance_id": "astropy__astropy-12907",
    "base_commit": "d16bfe05a744909de4b27f5875fe0d4ed41ce607",
    "patch": "diff --git a/astropy/modeling/separable.py b/astropy/modeling/separable.py\n--- a/astropy/modeling/separable.py\n+++ b/astropy/modeling/separable.py\n@@ -242,7 +242,7 @@ def _cstack(left, right):\n         cright = _coord_matrix(right, 'right', noutp)\n     else:\n         cright = np.zeros((noutp, right.shape[1]))\n-        cright[-right.shape[0]:, -right.shape[1]:] = 1\n+        cright[-right.shape[0]:, -right.shape[1]:] = right\n \n     return np.hstack([cleft, cright])\n \n",
    "PASS_TO_PASS": [
      "astropy/modeling/tests/test_separable.py::test_coord_matrix",
      "astropy/modeling/tests/test_separable.py::test_cdot",
      "astropy/modeling/tests/test_separable.py::test_cstack",
      "astropy/modeling/tests/test_separable.py::test_arith_oper",
      "astropy/modeling/tests/test_separable.py::test_separable[compound_model0-result0]",
      "astropy/modeling/tests/test_separable.py::test_separable[compound_model1-result1]",
      "astropy/modeling/tests/test_separable.py::test_separable[compound_model2-result2]",
      "astropy/modeling/tests/test_separable.py::test_separable[compound_model3-result3]",
      "astropy/modeling/tests/test_separable.py::test_separable[compound_model4-result4]",
      "astropy/modeling/tests/test_separable.py::test_separable[compound_model5-result5]",
      "... [TRUNCATED 3 more items] ..."
    ],
    "FAIL_TO_PASS": [
      "astropy/modeling/tests/test_separable.py::test_separable[compound_model6-result6]",
      "astropy/modeling/tests/test_separable.py::test_separable[compound_model9-result9]"
    ],
    "test_patch": "diff --git a/astropy/modeling/tests/test_separable.py b/astropy/modeling/tests/test_separable.py\n--- a/astropy/modeling/tests/test_separable.py\n+++ b/astropy/modeling/tests/test_separable.py\n@@ -28,6 +28,13 @@\n p1 = models.Polynomial1D(1, nam ... [TRUNCATED 931 chars] ...          [True,  True,  False, False, False],\n+                        [False, False, True,  False, False],\n+                        [False, False, False, True,  False],\n+                        [False, False, False, False, True]]))),\n }\n \n \n",
    "version": "4.3",
    "repo": "astropy/astropy",
    "environment_setup_commit": "298ccb478e6bf092953bca67a3d29dc6c35f6752",
    "hints_text": "",
    "created_at": "2022-03-03T15:14:54Z",
    "docker_image": "swebench/sweb.eval.arm64.astropy_1776_astropy-12907:latest"
  }
}
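The `FAIL_TO_PASS` and `PASS_TO_PASS` lists in the metadata drive scoring. The sketch below assumes the usual SWE-bench convention: an instance counts as resolved only if every test in both lists passes after the submitted patch is applied, and `acc` is the fraction of resolved instances. The function names and the `"PASSED"` status string are illustrative, not taken from the harness.

```python
from typing import Dict, List, Sequence


def is_resolved(
    test_status: Dict[str, str],
    fail_to_pass: List[str],
    pass_to_pass: List[str],
) -> bool:
    """True only if every required test is reported as passed.

    A test missing from test_status counts as not passed.
    """
    required = [*fail_to_pass, *pass_to_pass]
    return all(test_status.get(test) == "PASSED" for test in required)


def accuracy(resolved_flags: Sequence[bool]) -> float:
    """Fraction of instances resolved; 0.0 for an empty run."""
    if not resolved_flags:
        return 0.0
    return sum(resolved_flags) / len(resolved_flags)
```

Under this convention a patch that fixes the target tests but breaks a previously passing one does not count as resolved.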

Prompt Template

Prompt Template:

{question}

Extra Parameters

Parameter	Type	Default	Description
`action_protocol`	`str`	`toolcall`	Agent action protocol: `toolcall` (mainline OpenAI function calling, mirrors mini-swe-agent swebench.yaml) or `backticks` (text-based mswea_bash_command fallback for models without function-calling support). Choices: ['toolcall', 'backticks']
`max_steps`	`int`	`250`	Maximum number of agent steps per sample.
`command_timeout`	`float`	`60.0`	Default per-bash-command timeout in seconds.
`working_dir`	`str`	`/testbed`	Working directory inside the SWE-bench container.
`build_docker_images`	`bool`	`True`	Build Docker images locally for each sample.
`pull_remote_images_if_available`	`bool`	`True`	Attempt to pull existing remote Docker images before building.
`force_arch`	`str`	`''`	Optionally force a specific architecture for image build/pull. Choices: ['', 'arm64', 'x86_64']
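As a sketch of how these parameters could be overridden, the snippet below merges overrides into the defaults from the table and checks the documented choices. `build_extra_params` is a hypothetical helper, not part of evalscope; the `extra_params` key under the dataset name follows the commented-out placeholder in the Python usage example.

```python
# Defaults and allowed choices copied from the Extra Parameters table.
DEFAULTS = {
    "action_protocol": "toolcall",
    "max_steps": 250,
    "command_timeout": 60.0,
    "working_dir": "/testbed",
    "build_docker_images": True,
    "pull_remote_images_if_available": True,
    "force_arch": "",
}
CHOICES = {
    "action_protocol": {"toolcall", "backticks"},
    "force_arch": {"", "arm64", "x86_64"},
}


def build_extra_params(**overrides):
    """Hypothetical helper: validate overrides, then merge them into the defaults."""
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise KeyError(f"unknown extra parameter(s): {unknown}")
    for name, value in overrides.items():
        allowed = CHOICES.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return {**DEFAULTS, **overrides}


# Dictionary intended for TaskConfig(dataset_args=...).
dataset_args = {
    "swe_bench_verified_agentic": {
        "extra_params": build_extra_params(action_protocol="backticks", max_steps=100),
    }
}
```

Validating on the client side only catches typos early; the benchmark applies its own defaults when `extra_params` is omitted.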

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets swe_bench_verified_agentic \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['swe_bench_verified_agentic'],
    dataset_args={
        'swe_bench_verified_agentic': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)