SWE-bench_Verified_Mini_Agentic#

Overview#

SWE-bench Verified Mini Agentic is the agentic-mode evaluation of SWE-bench Verified Mini, a compact 50-sample subset that maintains the same distribution of performance, test pass rates, and difficulty as the full Verified set while requiring only 5GB of storage instead of 130GB. The model must autonomously explore, edit, and submit a patch through a multi-turn agent loop.

Task Description#

Task Type: Automated Software Engineering / Bug Fixing (Agentic)
Input: GitHub issue description (no oracle file context)
Output: Code patch (diff format) collected from git diff after autonomous editing
Size: 50 samples (vs 500 in full Verified set)

Key Features#

Representative 50-sample subset of SWE-bench Verified
Same difficulty distribution as the full dataset
Dramatically reduced storage requirements (5GB vs 130GB)
Multi-turn agent loop with per-instance Docker sandbox
Ideal for quick agentic evaluation and development iteration

Evaluation Notes#

Requires pip install swebench==4.1.0 before evaluation
Docker images are built/pulled automatically
See the usage documentation for detailed setup
Good for rapid prototyping of agent strategies and initial model assessment

Agentic Mode#

This benchmark drives a multi-turn agent loop (mirrors mini-swe-agent’s swebench.yaml) inside a per-instance SWE-bench Docker container. The model issues bash commands to explore /testbed, edit source files, and finally submits its git diff patch by printing the sentinel COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT followed by the patch contents.

extra_params.action_protocol selects between:

toolcall (default): OpenAI function-calling protocol with a single bash tool. Recommended for any model that supports tool calling.
backticks: text-based fallback expecting one ```mswea_bash_command ``` block per turn. For models without function-calling support.

Properties#

Property	Value
Benchmark Name	`swe_bench_verified_mini_agentic`
Dataset ID	evalscope/swe-bench-verified-mini
Paper	N/A
Tags	`Coding`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics#

Metric	Value
Total Samples	50
Prompt Length (Mean)	1268.5 chars
Prompt Length (Min/Max)	257 / 5362 chars

Sample Example#

Subset: default

{
  "input": [
    {
      "id": "363bb967",
      "content": "AuthenticationForm's username field doesn't set maxlength HTML attribute.\nDescription\n\t\nAuthenticationForm's username field doesn't render with maxlength HTML attribute anymore.\nRegression introduced in #27515 and 5ceaf14686ce626404afb6a5fbd3d8286410bf13.\n​https://groups.google.com/forum/?utm_source=digest&utm_medium=email#!topic/django-developers/qnfSqro0DlA\n​https://forum.djangoproject.com/t/possible-authenticationform-max-length-regression-in-django-2-1/241\n"
    }
  ],
  "id": 0,
  "group_id": 0,
  "tools": [
    {
      "name": "bash",
      "description": "Execute a bash command inside the sandbox environment. Returns the combined stdout / stderr output of the command.",
      "parameters": {
        "properties": {
          "command": {
            "type": "string",
            "description": "The bash command to execute."
          },
          "timeout": {
            "type": "number",
            "description": "Maximum execution time in seconds (default: 60).",
            "default": 60
          }
        },
        "required": [
          "command"
        ]
      }
    }
  ],
  "metadata": {
    "problem_statement": "AuthenticationForm's username field doesn't set maxlength HTML attribute.\nDescription\n\t\nAuthenticationForm's username field doesn't render with maxlength HTML attribute anymore.\nRegression introduced in #27515 and 5ceaf14686ce626404afb6a5fbd3d8286410bf13.\n​https://groups.google.com/forum/?utm_source=digest&utm_medium=email#!topic/django-developers/qnfSqro0DlA\n​https://forum.djangoproject.com/t/possible-authenticationform-max-length-regression-in-django-2-1/241\n",
    "instance_id": "django__django-11790",
    "base_commit": "b1d6b35e146aea83b171c1b921178bbaae2795ed",
    "patch": "diff --git a/django/contrib/auth/forms.py b/django/contrib/auth/forms.py\n--- a/django/contrib/auth/forms.py\n+++ b/django/contrib/auth/forms.py\n@@ -191,7 +191,9 @@ def __init__(self, request=None, *args, **kwargs):\n \n         # Set the max len ... [TRUNCATED 322 chars] ... username_max_length\n+        self.fields['username'].widget.attrs['maxlength'] = username_max_length\n         if self.fields['username'].label is None:\n             self.fields['username'].label = capfirst(self.username_field.verbose_name)\n \n",
    "PASS_TO_PASS": [
      "test_html_autocomplete_attributes (auth_tests.test_forms.AdminPasswordChangeFormTest)",
      "test_missing_passwords (auth_tests.test_forms.AdminPasswordChangeFormTest)",
      "test_non_matching_passwords (auth_tests.test_forms.AdminPasswordChangeFormTest)",
      "test_one_password (auth_tests.test_forms.AdminPasswordChangeFormTest)",
      "test_password_whitespace_not_stripped (auth_tests.test_forms.AdminPasswordChangeFormTest)",
      "test_success (auth_tests.test_forms.AdminPasswordChangeFormTest)",
      "test_field_order (auth_tests.test_forms.PasswordChangeFormTest)",
      "test_html_autocomplete_attributes (auth_tests.test_forms.PasswordChangeFormTest)",
      "test_incorrect_password (auth_tests.test_forms.PasswordChangeFormTest)",
      "test_password_verification (auth_tests.test_forms.PasswordChangeFormTest)",
      "... [TRUNCATED 67 more items] ..."
    ],
    "FAIL_TO_PASS": [
      "test_username_field_max_length_defaults_to_254 (auth_tests.test_forms.AuthenticationFormTest)",
      "test_username_field_max_length_matches_user_model (auth_tests.test_forms.AuthenticationFormTest)"
    ],
    "test_patch": "diff --git a/tests/auth_tests/test_forms.py b/tests/auth_tests/test_forms.py\n--- a/tests/auth_tests/test_forms.py\n+++ b/tests/auth_tests/test_forms.py\n@@ -423,6 +423,7 @@ def test_username_field_max_length_matches_user_model(self):\n         C ... [TRUNCATED 543 chars] ... )\n         self.assertEqual(form.fields['username'].max_length, 254)\n+        self.assertEqual(form.fields['username'].widget.attrs.get('maxlength'), 254)\n         self.assertEqual(form.errors, {})\n \n     def test_username_field_label(self):\n",
    "version": "3.1",
    "repo": "django/django",
    "environment_setup_commit": "0668164b4ac93a5be79f5b87fae83c657124d9ab",
    "hints_text": "Regression test.",
    "created_at": "2019-09-17T14:33:44Z",
    "docker_image": "swebench/sweb.eval.arm64.django_1776_django-11790:latest"
  }
}

Prompt Template#

Prompt Template:

{question}

Extra Parameters#

Parameter	Type	Default	Description
`action_protocol`	`str`	`toolcall`	Agent action protocol: “toolcall” (mainline OpenAI function-calling, mirrors mini-swe-agent swebench.yaml) or “backticks” (textbased mswea_bash_command fallback for models without function-calling support). Choices: [‘toolcall’, ‘backticks’]
`max_steps`	`int`	`250`	Maximum number of agent steps per sample.
`command_timeout`	`float`	`60.0`	Default per-bash-command timeout in seconds.
`working_dir`	`str`	`/testbed`	Working directory inside the SWE-bench container.
`build_docker_images`	`bool`	`True`	Build Docker images locally for each sample.
`pull_remote_images_if_available`	`bool`	`True`	Attempt to pull existing remote Docker images before building.
`force_arch`	`str`	``	Optionally force a specific architecture for image build/pull. Choices: [‘’, ‘arm64’, ‘x86_64’]

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets swe_bench_verified_mini_agentic \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['swe_bench_verified_mini_agentic'],
    dataset_args={
        'swe_bench_verified_mini_agentic': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)