MBPP-Plus#

Overview#

MBPP Plus is a fortified version of the MBPP benchmark, created to improve evaluation reliability for basic Python programming synthesis. It addresses quality issues in the original dataset and significantly increases test coverage for each problem.

Task Description#

Task Type: Code Generation (Python)
Input: Programming task description with test assertions
Output: Python function implementation
Improvements: Sanitized data with expanded test suites

Key Features#

Corrected reference code and clarified problem statements
Massively expanded test cases per problem (automated generation)
LLM-based and mutation-based test generation strategies
Exposes edge-case bugs missed by original MBPP
More rigorous and accurate code correctness evaluation

Evaluation Notes#

Default configuration uses 0-shot evaluation
Security Warning: By default, code is executed in the local environment. We strongly recommend using sandbox execution. See the sandbox documentation for details.
Supports pass@k metric calculation
Default timeout is 20 seconds per problem
Stricter evaluation than original MBPP benchmark

Properties#

Property	Value
Benchmark Name	`mbpp_plus`
Dataset ID	evalscope/mbppplus
Paper	N/A
Tags	`Coding`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`test`
Aggregation	`mean_and_pass_at_k`

Data Statistics#

Metric	Value
Total Samples	378
Prompt Length (Mean)	375.53 chars
Prompt Length (Min/Max)	222 / 2801 chars

Sample Example#

Subset: default

{
  "input": [
    {
      "id": "aa126494",
      "content": "You are an expert Python programmer, and here is your task: Write a function to find the shared elements from the given two lists. Your code should pass these tests:\n\nassert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5))\nassert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4))\nassert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14))"
    }
  ],
  "target": "\ndef similar_elements(test_tup1, test_tup2):\n  return tuple(set(test_tup1) & set(test_tup2))\n",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "task_id": 2,
    "source_file": "Benchmark Questions Verification V2.ipynb",
    "test_imports": [],
    "test_list": [
      "assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5))",
      "assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4))",
      "assert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14))"
    ],
    "test": "import numpy as np\nfrom math import inf\n\ndef is_floats(x) -> bool:\n    # check if it is float; List[float]; Tuple[float]\n    if isinstance(x, float):\n        return True\n    if isinstance(x, (list, tuple)):\n        return all(isinstance(i, fl ... [TRUNCATED] ... wvS', 'tUqF', 'ITntCqEvPi', 'kx'), (1, 2, 3, 4, 5, 6, 7, 8, 9), (11, 12, 13, 14, 15, 16, 17, 19, 20), (19, 5), (1, 2, 3, 6, 7, 8, 9, 10, 12)]\nfor i, (inp, exp) in enumerate(zip(inputs, results)):\n    assertion(similar_elements(*inp), exp, 0)\n"
  }
}

Note: Some content was truncated for display.

Prompt Template#

Prompt Template:

You are an expert Python programmer, and here is your task: {question} Your code should pass these tests:

{tests}

Sandbox Configuration#

This benchmark requires a sandbox environment for code execution.

{
  "image": "python:3.11-slim",
  "tools_config": {
    "shell_executor": {},
    "python_executor": {}
  }
}

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets mbpp_plus \
    --use-sandbox \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mbpp_plus'],
    use_sandbox=True,
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)