MBPP-Plus#
Overview#
MBPP Plus is a hardened version of the MBPP benchmark, created to make evaluation of basic Python program synthesis more reliable. It fixes quality issues in the original dataset and significantly increases test coverage for each problem.
Task Description#
Task Type: Code Generation (Python)
Input: Programming task description with test assertions
Output: Python function implementation
Improvements: Sanitized data with expanded test suites
Key Features#
Corrected reference code and clarified problem statements
Massively expanded test cases per problem (automated generation)
LLM-based and mutation-based test generation strategies
Exposes edge-case bugs missed by original MBPP
More rigorous and accurate code correctness evaluation
Evaluation Notes#
Default configuration uses 0-shot evaluation
Security Warning: By default, code is executed in the local environment. We strongly recommend using sandbox execution. See the sandbox documentation for details.
Supports pass@k metric calculation
Default timeout is 20 seconds per problem
Stricter evaluation than original MBPP benchmark
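The pass@k metric mentioned above is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass all tests. The following is a minimal sketch of that estimator. The function name `pass_at_k` is hypothetical, and it is an assumption that the framework computes the metric this way.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which passed.

    Hypothetical helper for illustration; not taken from the benchmark code.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimate reduces to the fraction of passing samples, c / n.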
Properties#
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | |
| Aggregation | |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 378 |
| Prompt Length (Mean) | 375.53 chars |
| Prompt Length (Min/Max) | 222 / 2801 chars |
Sample Example#
Subset: default
{
"input": [
{
"id": "aa126494",
"content": "You are an expert Python programmer, and here is your task: Write a function to find the shared elements from the given two lists. Your code should pass these tests:\n\nassert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5))\nassert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4))\nassert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14))"
}
],
"target": "\ndef similar_elements(test_tup1, test_tup2):\n return tuple(set(test_tup1) & set(test_tup2))\n",
"id": 0,
"group_id": 0,
"metadata": {
"task_id": 2,
"source_file": "Benchmark Questions Verification V2.ipynb",
"test_imports": [],
"test_list": [
"assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5))",
"assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4))",
"assert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14))"
],
"test": "import numpy as np\nfrom math import inf\n\ndef is_floats(x) -> bool:\n # check if it is float; List[float]; Tuple[float]\n if isinstance(x, float):\n return True\n if isinstance(x, (list, tuple)):\n return all(isinstance(i, fl ... [TRUNCATED] ... wvS', 'tUqF', 'ITntCqEvPi', 'kx'), (1, 2, 3, 4, 5, 6, 7, 8, 9), (11, 12, 13, 14, 15, 16, 17, 19, 20), (19, 5), (1, 2, 3, 6, 7, 8, 9, 10, 12)]\nfor i, (inp, exp) in enumerate(zip(inputs, results)):\n assertion(similar_elements(*inp), exp, 0)\n"
}
}
Note: Some content was truncated for display.
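To illustrate how a sample like the one above can be checked, the sketch below appends the `test_list` assertions to a candidate solution and runs the result in a separate Python process with a 20-second timeout. The helper name `passes_tests` is hypothetical, and this is not the framework's evaluation code. A subprocess is not a sandbox, so untrusted model output should still be run in an isolated environment.

```python
import subprocess
import sys


def passes_tests(candidate_code: str, test_list: list[str], timeout: float = 20.0) -> bool:
    """Return True if the candidate passes every assertion within the timeout.

    Hypothetical helper for illustration only.
    """
    # Build one program: the candidate definition followed by the assertions.
    program = candidate_code + "\n" + "\n".join(test_list) + "\n"
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Treat a timeout as a failure.
        return False
    # A failed assertion or any other exception gives a non-zero exit code.
    return result.returncode == 0


target = (
    "def similar_elements(test_tup1, test_tup2):\n"
    "    return tuple(set(test_tup1) & set(test_tup2))\n"
)
tests = [
    "assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5))",
    "assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4))",
]
```

Calling `passes_tests(target, tests)` runs the reference solution against the two assertions. The same call can be made with the longer script in the `test` field, provided its imports (such as `numpy`) are installed.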
Prompt Template#
You are an expert Python programmer, and here is your task: {question} Your code should pass these tests:
{tests}
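As a rough illustration of how the template is filled, the sketch below substitutes `{question}` and `{tests}` using `str.format`. The blank line before the tests and the newline-joined assertions are inferred from the sample example. The constant `PROMPT_TEMPLATE` and the function `build_prompt` are hypothetical names, not part of the framework.

```python
# Template whitespace is inferred from the sample prompt, not taken from the source code.
PROMPT_TEMPLATE = (
    "You are an expert Python programmer, and here is your task: "
    "{question} Your code should pass these tests:\n\n{tests}"
)


def build_prompt(question: str, test_list: list[str]) -> str:
    """Fill the template with the task text and one assertion per line."""
    return PROMPT_TEMPLATE.format(question=question, tests="\n".join(test_list))


prompt = build_prompt(
    "Write a function to find the shared elements from the given two lists.",
    [
        "assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5))",
        "assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4))",
    ],
)
```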
Sandbox Configuration#
This benchmark requires a sandbox environment for code execution.
{
"image": "python:3.11-slim",
"tools_config": {
"shell_executor": {},
"python_executor": {}
}
}
Usage#
Using CLI#
evalscope eval \
--model YOUR_MODEL \
--api-url OPENAI_API_COMPAT_URL \
--api-key EMPTY_TOKEN \
--datasets mbpp_plus \
--use-sandbox \
--limit 10 # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mbpp_plus'],
    use_sandbox=True,
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)