SWE-bench_Verified_mini#
Overview#
SWE-bench Verified Mini is a compact subset of SWE-bench Verified, containing 50 carefully selected samples that maintain the same distribution of performance, test pass rates, and difficulty as the full dataset while requiring only 5GB of storage instead of 130GB.
Task Description#
Task Type: Automated Software Engineering / Bug Fixing
Input: GitHub issue description with repository context
Output: Code patch (diff format) that resolves the issue
Size: 50 samples (vs 500 in full Verified set)
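To make the expected output format concrete, the following minimal sketch uses Python's standard `difflib` module to produce a unified diff. The file name `calc.py`, its contents, and the helper `make_patch` are hypothetical and are not taken from the benchmark. The sketch only illustrates the general shape of a diff-format patch, not how a model or the evaluation harness generates one.

```python
import difflib

# Hypothetical "before" and "after" contents of a source file (illustration only).
before = """def add(a, b):
    return a - b
""".splitlines(keepends=True)

after = """def add(a, b):
    return a + b
""".splitlines(keepends=True)


def make_patch(old_lines, new_lines, path):
    """Return a unified diff string that turns old_lines into new_lines."""
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile=f"a/{path}",  # git-style path prefixes, assumed here for readability
        tofile=f"b/{path}",
    )
    return "".join(diff)


patch = make_patch(before, after, "calc.py")
print(patch)
```

The printed text starts with `---`/`+++` file headers, followed by a hunk in which removed lines are prefixed with `-` and added lines with `+`.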
Key Features#
Representative 50-sample subset of SWE-bench Verified
Same difficulty distribution as full dataset
Dramatically reduced storage requirements (5GB vs 130GB)
Ideal for quick evaluation and development iteration
Maintains statistical validity for benchmarking
Evaluation Notes#
Requires `pip install swebench==4.1.0` before evaluation
Docker images are built/pulled automatically
See the usage documentation for detailed setup
Good for rapid prototyping and initial model assessment
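Because the notes above pin a specific `swebench` version, a quick pre-flight check can catch a mismatch before a run starts. The sketch below uses only the standard library's `importlib.metadata`. The helper names `installed_version` and `meets_pin` are hypothetical and are not part of evalscope or swebench.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = "4.1.0"  # version pinned in the evaluation notes


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def meets_pin(package, required):
    """Return True only when the installed version equals the pinned one."""
    return installed_version(package) == required


found = installed_version("swebench")
if meets_pin("swebench", REQUIRED):
    print("swebench version OK")
else:
    print(f"expected swebench {REQUIRED}, found {found!r}; "
          f"run: pip install swebench=={REQUIRED}")
```

This checks the Python package only. It does not verify that Docker is installed or running.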
Properties#
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | |
Data Statistics#
Statistics not available.
Sample Example#
Sample example not available.
Prompt Template#
Prompt Template:
{question}
Extra Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | | Build Docker images locally for each sample. |
| | | | Attempt to pull existing remote Docker images before building. |
| | | | Oracle dataset ID used to fetch inference context. |
| | | `` | Optionally force the Docker images to be pulled/built for a specific architecture. Choices: ['', 'arm64', 'x86_64'] |
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets swe_bench_verified_mini \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['swe_bench_verified_mini'],
    dataset_args={
        'swe_bench_verified_mini': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)