Live-Code-Bench#

Overview#

LiveCodeBench is a contamination-free benchmark for evaluating code generation models on real-world competitive programming problems. It continuously collects new problems from coding platforms to ensure models haven’t seen the test data during training.

Task Description#

Task Type: Competitive Programming / Code Generation
Input: Programming problem description with input/output format
Output: Complete solution code
Source: Problems from LeetCode, Codeforces, and AtCoder

Key Features#

Continuously updated with problems published after model training cutoffs
Problems from major competitive programming platforms
Multiple test cases per problem for thorough evaluation
Date-based filtering to control for data contamination
Supports both local and sandbox code execution
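Date-based filtering can be pictured as keeping only the problems whose contest date falls inside a window. The sketch below is illustrative and is not EvalScope's implementation: the `filter_by_date` helper and the inclusive bounds are assumptions, while the `contest_date` field and its ISO format are taken from the sample record shown on this page.

```python
from datetime import datetime
from typing import Optional


def filter_by_date(samples: list[dict], start_date: Optional[str] = None,
                   end_date: Optional[str] = None) -> list[dict]:
    """Keep samples whose metadata.contest_date lies in [start_date, end_date]."""
    start = datetime.fromisoformat(start_date) if start_date else None
    end = datetime.fromisoformat(end_date) if end_date else None
    kept = []
    for sample in samples:
        contest = datetime.fromisoformat(sample["metadata"]["contest_date"])
        if start is not None and contest < start:
            continue  # published before the window opens
        if end is not None and contest > end:
            continue  # published after the window closes
        kept.append(sample)
    return kept


samples = [
    {"id": 0, "metadata": {"contest_date": "2023-08-21T00:00:00"}},
    {"id": 1, "metadata": {"contest_date": "2024-03-02T00:00:00"}},
]
recent = filter_by_date(samples, start_date="2024-01-01")
print([s["id"] for s in recent])  # [1]
```

Choosing a `start_date` later than a model's training cutoff is what keeps already-seen problems out of the evaluation set.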

Evaluation Notes#

Default configuration uses 0-shot evaluation
Security Warning: By default, code is executed in the local environment. We recommend using sandbox execution. See the sandbox documentation for details.
Use start_date and end_date parameters to filter problems by date
Default timeout is 6 seconds per test case
Supports pass@k metric calculation
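To make the per-test-case timeout concrete, here is a minimal sketch of checking one stdin/stdout test case locally. It is not EvalScope's executor: the `passes_test` helper, the whitespace-stripped comparison, and the use of a child Python process are all assumptions made for illustration. Because it runs code directly on the host, it carries exactly the risk the security warning describes and should only be used with trusted code.

```python
import subprocess
import sys

TIMEOUT_SECONDS = 6  # mirrors the documented default per test case


def passes_test(solution_code: str, stdin_text: str, expected_stdout: str) -> bool:
    """Run the solution in a child Python process and compare its stdout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", solution_code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=TIMEOUT_SECONDS,
        )
    except subprocess.TimeoutExpired:
        return False  # exceeding the time limit counts as a failure
    if result.returncode != 0:
        return False  # runtime errors count as a failure
    # Assumption: trailing whitespace differences are ignored.
    return result.stdout.strip() == expected_stdout.strip()


solution = "n = int(input())\nprint(n * 2)"
print(passes_test(solution, "21\n", "42\n"))  # True
```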

Properties#

Property	Value
Benchmark Name	`live_code_bench`
Dataset ID	AI-ModelScope/code_generation_lite
Paper	N/A
Tags	`Coding`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`test`
Aggregation	`mean_and_pass_at_k`
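For readers unfamiliar with pass@k, the sketch below uses the commonly cited unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of generated samples for a problem and c is the number that pass all tests. Whether `mean_and_pass_at_k` computes exactly this is an assumption; the `pass_at_k` function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Per-problem estimates are then averaged across the benchmark.
per_problem = [pass_at_k(n=10, c=3, k=1), pass_at_k(n=10, c=0, k=1)]
benchmark_score = sum(per_problem) / len(per_problem)
print(round(benchmark_score, 4))
```

With k = 1 the estimator reduces to the fraction of correct samples, c / n, which is why pass@1 and mean accuracy coincide when one sample is drawn per problem.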

Data Statistics#

Metric	Value
Total Samples	1,055
Prompt Length (Mean)	1700.76 chars
Prompt Length (Min/Max)	717 / 4234 chars

Sample Example#

Subset: release_latest

{
  "input": [
    {
      "id": "6b6a6121",
      "content": "### Question:\nThere are three cards with letters $\\texttt{a}$, $\\texttt{b}$, $\\texttt{c}$ placed in a row in some order. You can do the following operation at most once: \n\n \n-  Pick two cards, and swap them.  Is it possible that the row becom ... [TRUNCATED] ... s from stdin solve the problem and write the answer to stdout (do not directly test on the sample inputs). Enclose your code within delimiters as follows.\n```python\n# YOUR CODE HERE\n```\n\n ### Answer: (use the provided format with backticks)\n\n"
    }
  ],
  "target": "",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "evaluation_sample": "{\"inputs\": [\"6\\nabc\\nacb\\nbac\\nbca\\ncab\\ncba\\n\", \"1\\nabc\\n\", \"3\\nabc\\nabc\\nabc\\n\", \"5\\ncab\\nacb\\ncba\\nbac\\nbca\\n\", \"6\\nabc\\nabc\\nabc\\nabc\\nabc\\nabc\\n\"], \"outputs\": [\"YES\\nYES\\nYES\\nNO\\nNO\\nYES\\n\", \"YES\\n\", \"YES\\nYES\\nYES\\n\", \"NO\\nYES\\nYES\\nYES\\nNO\\n\", \"YES\\nYES\\nYES\\nYES\\nYES\\nYES\\n\"], \"fn_name\": null}",
    "contest_date": "2023-08-21T00:00:00"
  }
}

Note: Some content was truncated for display.
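Note that `metadata.evaluation_sample` is itself a JSON-encoded string, so it has to be decoded a second time before the test cases can be used. The following sketch uses a shortened record in the same shape as the sample above; reading `fn_name: null` as "the solution reads stdin and writes stdout" is an assumption.

```python
import json

# Shortened record in the same shape as the sample's metadata field.
metadata = {
    "evaluation_sample": json.dumps({
        "inputs": ["1\nabc\n", "3\nabc\nabc\nabc\n"],
        "outputs": ["YES\n", "YES\nYES\nYES\n"],
        "fn_name": None,
    }),
    "contest_date": "2023-08-21T00:00:00",
}

# Second decode: the field value is a string containing JSON.
evaluation = json.loads(metadata["evaluation_sample"])

# Inputs and outputs are parallel lists, one entry per test case.
test_cases = list(zip(evaluation["inputs"], evaluation["outputs"]))

# Assumption: a null fn_name marks a stdin/stdout-style problem.
uses_stdin = evaluation["fn_name"] is None
print(len(test_cases), uses_stdin)  # 2 True
```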

Prompt Template#

Prompt Template:

### Question:
{question_content}

{format_prompt} ### Answer: (use the provided format with backticks)
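As an illustration of how this template is filled and how a fenced answer might be pulled back out of a model response, consider the sketch below. The `build_prompt` and `extract_code` helpers are hypothetical, and taking the last fenced block is an assumed convention rather than EvalScope's documented behaviour. The fence string is built programmatically only to avoid nesting literal fences inside this example.

```python
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built indirectly to keep this example well-formed

PROMPT_TEMPLATE = (
    "### Question:\n{question_content}\n\n"
    "{format_prompt} ### Answer: (use the provided format with backticks)"
)

# Matches a fenced block with an optional "python" language tag.
CODE_BLOCK = re.compile(FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def build_prompt(question_content: str, format_prompt: str) -> str:
    """Fill the two placeholders of the prompt template."""
    return PROMPT_TEMPLATE.format(
        question_content=question_content, format_prompt=format_prompt
    )


def extract_code(response: str) -> Optional[str]:
    """Return the body of the last fenced block, or None if there is none."""
    blocks = CODE_BLOCK.findall(response)
    return blocks[-1].strip() if blocks else None


response = f"Here is my solution.\n{FENCE}python\nprint(input()[::-1])\n{FENCE}\n"
print(extract_code(response))  # print(input()[::-1])
```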

Extra Parameters#

Parameter	Type	Default	Description
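`start_date`	`str | null`	`None`	Start of the date range used to filter problems by date.
`end_date`	`str | null`	`None`	End of the date range used to filter problems by date.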
`debug`	`bool`	`False`	Enable verbose debug logging and bypass certain safety checks.
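A sketch of how these parameters might be supplied through `dataset_args` follows. The `extra_params` key is inferred from the commented placeholder in the Python usage example on this page, and the `YYYY-MM-DD` date format and the specific dates are assumptions, so check the accepted format before relying on it.

```python
from datetime import date

# Hypothetical evaluation window; the YYYY-MM-DD format is an assumption.
start_date = "2024-08-01"
end_date = "2025-02-01"

# Guard against an inverted window before launching a long evaluation run.
if date.fromisoformat(start_date) > date.fromisoformat(end_date):
    raise ValueError("start_date must not be later than end_date")

dataset_args = {
    "live_code_bench": {
        "extra_params": {
            "start_date": start_date,
            "end_date": end_date,
            "debug": False,
        }
    }
}
```

The resulting dictionary would be passed as the `dataset_args` argument of `TaskConfig`.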

Sandbox Configuration#

Sandbox execution is recommended for this benchmark (see the security warning under Evaluation Notes). The sandbox configuration is:

{
  "image": "python:3.11-slim",
  "tools_config": {
    "shell_executor": {},
    "python_executor": {}
  }
}

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets live_code_bench \
    --use-sandbox \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['live_code_bench'],
    use_sandbox=True,
    dataset_args={
        'live_code_bench': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)