SWE-bench_Multilingual_Agentic#
Overview#
SWE-bench Multilingual Agentic is the agentic-mode evaluation of SWE-bench Multilingual, a 300-task SWE-bench-style benchmark spanning 42 repositories and 9 programming languages. The model autonomously explores the repository, edits source files, and submits a patch through a multi-turn agent loop inside a per-instance Docker container.
Task Description#
Task Type: Automated Software Engineering / Bug Fixing (Agentic, Multilingual)
Input: GitHub issue description (no oracle file context)
Output: Code patch (diff format) collected from git diff after autonomous editing
Languages: C, C++, Go, Java, JavaScript/TypeScript, PHP, Ruby, and Rust
Key Features#
300 curated Issue-Pull Request tasks
42 real-world repositories across 9 programming languages
Multi-turn agent loop with per-instance SWE-bench Docker sandbox
SWE-bench-compatible patch evaluation using fail-to-pass and pass-to-pass tests
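The fail-to-pass / pass-to-pass criterion can be sketched as follows. This is a minimal illustration, not the harness's own code: the function name `is_resolved` and the `results` mapping (test name to pass/fail after the patch is applied) are hypothetical. It assumes the usual SWE-bench rule that an instance counts as resolved only when every FAIL_TO_PASS test and every PASS_TO_PASS test passes.

```python
from typing import Dict, Iterable


def is_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    results: Dict[str, bool],
) -> bool:
    """Return True if every required test passed after applying the patch.

    `results` maps a test name to True (passed) or False (failed).
    A test that is missing from `results` is treated as failed.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(name, False) for name in required)
```

A patch that fixes the target test but breaks a previously passing one is therefore not counted as resolved.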
Evaluation Notes#
Requires pip install swebench==4.1.0 before evaluation
Uses the official SWE-bench Multilingual x86_64 instance images and sets Docker platform to linux/amd64 automatically
Docker images are built/pulled automatically for each instance
Timeout of 1800 seconds (30 min) per instance for final patch validation
See the usage documentation for detailed setup instructions
Supports both local image building and remote image pulling
Agentic Mode#
This benchmark drives a multi-turn agent loop, mirroring mini-swe-agent’s
swebench.yaml, inside a per-instance SWE-bench Docker container. The
model issues bash commands to explore /testbed and edit source files,
then submits its git diff patch by printing the sentinel
COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT followed by the patch contents.
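The submission step might be detected along these lines. This is a sketch under an assumption: the sentinel is expected to be the first non-blank line of a command's output, with everything after it taken as the patch. The helper name `extract_submission` is hypothetical, and the benchmark's actual check may differ.

```python
from typing import Optional

SENTINEL = "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT"


def extract_submission(command_output: str) -> Optional[str]:
    """Return the patch text that follows the sentinel line.

    Returns None when the output does not start with the sentinel,
    meaning the agent loop should continue with another turn.
    """
    lines = command_output.lstrip().splitlines(keepends=True)
    if lines and lines[0].strip() == SENTINEL:
        # Everything after the sentinel line is treated as the patch.
        return "".join(lines[1:])
    return None
```

A final command such as `echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && git diff` would produce output of this shape.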
extra_params.action_protocol selects between:
toolcall (default): OpenAI function-calling protocol with a single bash tool. Recommended for any model that supports tool calling.
backticks: text-based fallback expecting one ```mswea_bash_command ``` block per turn. For models without function-calling support.
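For the backticks protocol, pulling the command out of a model reply could look roughly like this. The function name `parse_backticks_action` and the reject-unless-exactly-one-block policy are assumptions for illustration; only the `mswea_bash_command` fence label comes from the description above.

```python
import re
from typing import Optional

# Build the triple-backtick fence programmatically so this example
# does not contain a literal fence of its own.
FENCE = "`" * 3
_BLOCK = re.compile(
    re.escape(FENCE) + r"mswea_bash_command\s*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def parse_backticks_action(reply: str) -> Optional[str]:
    """Return the single bash command in the reply, or None.

    None is returned when the reply contains zero or more than one
    mswea_bash_command block, so the caller can ask the model to retry.
    """
    blocks = _BLOCK.findall(reply)
    if len(blocks) != 1:
        return None
    return blocks[0].strip()
```

Rejecting replies with several blocks keeps each turn to a single action, which matches the "one block per turn" expectation.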
Properties#
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 300 |
| Prompt Length (Mean) | 2197.94 chars |
| Prompt Length (Min/Max) | 124 / 69351 chars |
Sample Example#
Subset: default
{
"input": [
{
"id": "52ece3cf",
"content": "Support Post aggregation function pow(f1,f2) to cater for square, cube , square root.\n### Description\r\n\r\nPlease describe the feature or change with as much detail as possible. \r\n\r\nAs of now the only supported arithmetic functions are +, -, *, ... [TRUNCATED 284 chars] ... \r\n\r\nThe proposal is to add a `pow` function which enables all the about usecase . Square of a number can be represent by pow(f1,2) , Cube can be represented as power(f1 ,3) , Squar root of a number can be represented by power(f1,0.5) ,\r\n\r\n\n"
}
],
"id": 0,
"group_id": 0,
"tools": [
{
"name": "bash",
"description": "Execute a bash command inside the sandbox environment. Returns the combined stdout / stderr output of the command.",
"parameters": {
"properties": {
"command": {
"type": "string",
"description": "The bash command to execute."
},
"timeout": {
"type": "number",
"description": "Maximum execution time in seconds (default: 60).",
"default": 60
}
},
"required": [
"command"
]
}
}
],
"metadata": {
"problem_statement": "Support Post aggregation function pow(f1,f2) to cater for square, cube , square root.\n### Description\r\n\r\nPlease describe the feature or change with as much detail as possible. \r\n\r\nAs of now the only supported arithmetic functions are +, -, *, ... [TRUNCATED 284 chars] ... \r\n\r\nThe proposal is to add a `pow` function which enables all the about usecase . Square of a number can be represent by pow(f1,2) , Cube can be represented as power(f1 ,3) , Squar root of a number can be represented by power(f1,0.5) ,\r\n\r\n\n",
"instance_id": "apache__druid-13704",
"base_commit": "51dfde02840017092486fb75be2b16566aff6a19",
"patch": "diff --git a/docs/querying/post-aggregations.md b/docs/querying/post-aggregations.md\nindex c75d122eb20a..935ca8fbce16 100644\n--- a/docs/querying/post-aggregations.md\n+++ b/docs/querying/post-aggregations.md\n@@ -36,7 +36,7 @@ There are several ... [TRUNCATED 1110 chars] ... }\n+ },\n+\n+ POW(\"pow\") {\n+ @Override\n+ public double compute(double lhs, double rhs)\n+ {\n+ return Math.pow(lhs, rhs);\n+ }\n };\n \n private static final Map<String, Ops> LOOKUP_MAP = new HashMap<>();\n",
"PASS_TO_PASS": [
"org.apache.druid.query.aggregation.post.ArithmeticPostAggregatorTest#testDiv",
"org.apache.druid.query.aggregation.post.ArithmeticPostAggregatorTest#testQuotient"
],
"FAIL_TO_PASS": [
"org.apache.druid.query.aggregation.post.ArithmeticPostAggregatorTest#testPow"
],
"test_patch": "diff --git a/processing/src/test/java/org/apache/druid/query/aggregation/post/ArithmeticPostAggregatorTest.java b/processing/src/test/java/org/apache/druid/query/aggregation/post/ArithmeticPostAggregatorTest.java\nindex a93034427539..7e1d4d112 ... [TRUNCATED 1358 chars] ... s(1.0, agg.compute(ImmutableMap.of(\"value\", 1)));\n+ Assert.assertEquals(1.0, agg.compute(ImmutableMap.of(\"value\", -1)));\n+ Assert.assertEquals(1.0, agg.compute(ImmutableMap.of(\"value\", .5)));\n+ }\n @Test\n public void testDiv()\n {\n",
"version": "13704",
"repo": "apache/druid",
"environment_setup_commit": null,
"hints_text": "",
"created_at": "2023-01-23 04:10:47",
"docker_image": "swebench/sweb.eval.x86_64.apache_1776_druid-13704:latest"
}
}
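The `docker_image` field in this sample appears to be derived from `instance_id`. The sketch below reproduces that mapping as inferred from this one example (the double underscore becomes `_1776_`). The function name and its parameters are hypothetical, and the lowercasing step is an assumption based on Docker requiring lowercase repository names.

```python
def docker_image_for(
    instance_id: str,
    namespace: str = "swebench",
    arch: str = "x86_64",
    tag: str = "latest",
) -> str:
    """Build an image reference shaped like the sample's docker_image field."""
    # "apache__druid-13704" -> "apache_1776_druid-13704"
    image_key = instance_id.replace("__", "_1776_").lower()
    return f"{namespace}/sweb.eval.{arch}.{image_key}:{tag}"
```

Applied to `apache__druid-13704` with the defaults, this yields the same string as the sample's `docker_image` value.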
Prompt Template#
Prompt Template:
{question}
Extra Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| action_protocol | | toolcall | Agent action protocol: "toolcall" (mainline OpenAI function-calling, mirrors mini-swe-agent swebench.yaml) or "backticks" (text-based mswea_bash_command fallback for models without function-calling support). Choices: ['toolcall', 'backticks'] |
| | | | Maximum number of agent steps per sample. |
| | | | Default per-bash-command timeout in seconds. |
| | | | Build Docker images locally for each sample. |
| | | | Attempt to pull existing remote Docker images before building. |
| | | | DockerHub user/org namespace for remote SWE-bench images. |
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets swe_bench_multilingual_agentic \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['swe_bench_multilingual_agentic'],
    dataset_args={
        'swe_bench_multilingual_agentic': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
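To override the defaults, the `extra_params` entry can be filled in. The helper below is a hypothetical convenience, not part of evalscope. It builds a `dataset_args` dictionary that selects the action protocol and validates the value against the two documented choices, assuming `extra_params` is the key the benchmark reads, as the comment in the example suggests.

```python
ALLOWED_PROTOCOLS = ("toolcall", "backticks")


def build_dataset_args(action_protocol: str = "toolcall") -> dict:
    """Return a dataset_args dict selecting the agent action protocol."""
    if action_protocol not in ALLOWED_PROTOCOLS:
        raise ValueError(
            f"action_protocol must be one of {ALLOWED_PROTOCOLS}, "
            f"got {action_protocol!r}"
        )
    return {
        "swe_bench_multilingual_agentic": {
            "extra_params": {"action_protocol": action_protocol},
        }
    }
```

The result can be passed as `dataset_args=build_dataset_args("backticks")` when constructing `TaskConfig` for a model without function-calling support.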