Sandbox Environment Usage#

Evaluating the coding ability of an LLM means executing model-generated code, so the evaluation should run in an isolated environment; otherwise faulty code could run in the development environment and cause irreversible damage. EvalScope integrates the ms-enclave sandbox environment, which lets users evaluate model code capabilities in a controlled environment with benchmarks such as HumanEval and LiveCodeBench.

This page describes two ways to use the sandbox:

  • Local usage: Set up the sandbox environment on a local machine and conduct evaluation locally, requiring Docker support on the local machine;

  • Remote usage: Set up the sandbox environment on a remote server and conduct evaluation through API interfaces, requiring Docker support on the remote machine.

Tip

  • Each dataset's default sandbox configuration is defined in its dataset configuration; see HumanEval for an example.

  • Adjust the number of sandbox worker processes to the resources available on the local machine. A reasonable upper bound is half of the local CPU cores: on an 8-core machine, use at most 4 workers. Multi-language images such as volcengine/sandbox-fusion consume more resources, so reduce the number of worker processes further when using them.

  • The same sandbox infrastructure also powers environment='docker' in Agent Evaluation; fields such as image / timeout can be reused in agent_config.environment_extra.
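The half-the-cores guideline from the tip above can be sketched as a small helper. This is only an illustration: `recommended_workers` and its `heavy_image` flag are hypothetical names, not part of EvalScope, and the extra halving for resource-hungry images is an assumed heuristic rather than a documented rule.

```python
import os


def recommended_workers(cpu_cores=None, heavy_image=False):
    """Suggest a sandbox worker count of at most half the CPU cores.

    Hypothetical helper for illustration; not an EvalScope API.
    """
    # Fall back to the detected core count (or 1 if it cannot be detected).
    cores = cpu_cores if cpu_cores is not None else (os.cpu_count() or 1)
    workers = max(1, cores // 2)
    if heavy_image:
        # Assumed heuristic: halve again for multi-language images.
        workers = max(1, workers // 2)
    return workers


print(recommended_workers(8))  # 4, matching the 8-core example above
```

The result is a starting point only; the right value still depends on memory and on what else is running on the machine.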

0. Unified Sandbox Configuration#

EvalScope manages sandbox settings via the nested field TaskConfig.sandbox (mapped to SandboxTaskConfig).

SandboxTaskConfig fields:

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| enabled | bool | Whether to enable the sandbox | false |
| engine | str | Sandbox engine: docker / volcengine, etc. | docker |
| default_config | dict | Task-level sandbox config; merged with BenchmarkMeta.sandbox_config, and used as the default per-sample environment config in Agent mode | {} |
| manager_config | dict | Forwarded to the ms_enclave manager (e.g. base_url for remote docker, volcengine credentials) | {} |
| pool_size | int \| None | Warmup pool size for pooled execution; falls back to eval_batch_size when None | None |

sandbox accepts both a SandboxTaskConfig instance and an equivalent dict — the two forms behave identically:

from evalscope import TaskConfig
from evalscope.config import SandboxTaskConfig

# Option A: use SandboxTaskConfig (recommended; type hints available)
TaskConfig(
    sandbox=SandboxTaskConfig(
        enabled=True,
        engine='docker',
        manager_config={'base_url': 'http://remote:1234'},
    ),
)

# Option B: pass a plain dict
TaskConfig(
    sandbox={
        'enabled': True,
        'engine': 'docker',
        'manager_config': {'base_url': 'http://remote:1234'},
    },
)
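To make the documented defaults and the dict/instance equivalence concrete, the sketch below mirrors the listed fields in a stand-in dataclass. `SandboxSettings`, `coerce`, and `resolve_pool_size` are hypothetical names used only for illustration; the real SandboxTaskConfig may be implemented differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Union


@dataclass
class SandboxSettings:
    """Stand-in that mirrors the documented fields and their defaults."""
    enabled: bool = False
    engine: str = 'docker'
    default_config: Dict[str, Any] = field(default_factory=dict)
    manager_config: Dict[str, Any] = field(default_factory=dict)
    pool_size: Optional[int] = None


def coerce(value: Union[SandboxSettings, Dict[str, Any]]) -> SandboxSettings:
    """Accept either an instance or an equivalent dict."""
    if isinstance(value, SandboxSettings):
        return value
    return SandboxSettings(**value)


def resolve_pool_size(settings: SandboxSettings, eval_batch_size: int) -> int:
    """Fall back to eval_batch_size when pool_size is None, as documented."""
    if settings.pool_size is not None:
        return settings.pool_size
    return eval_batch_size


from_dict = coerce({'enabled': True, 'engine': 'docker'})
from_obj = coerce(SandboxSettings(enabled=True, engine='docker'))
assert from_dict == from_obj  # both forms yield the same settings
print(resolve_pool_size(from_dict, eval_batch_size=5))  # 5
```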

1. Local Usage#

Use Docker to set up a sandbox environment on a local machine and conduct evaluation locally, requiring Docker support on the local machine.

Environment Setup#

  1. Install Docker: Please ensure Docker is installed on your machine. You can download and install Docker from the Docker official website.

  2. Install sandbox environment dependencies: Install packages like ms-enclave in your local Python environment:

pip install 'evalscope[sandbox]'
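Before running an evaluation, it can help to confirm that both prerequisites are present. The following is a minimal sketch using only the standard library; `check_local_prerequisites` is a hypothetical helper, and the import name `ms_enclave` is assumed from the log prefix used by the sandbox manager. Finding the docker executable does not prove that the Docker daemon is running.

```python
import importlib.util
import shutil


def check_local_prerequisites():
    """Report whether the Docker CLI and the ms_enclave package can be found."""
    return {
        # True if a `docker` executable is on PATH (the daemon is not checked).
        'docker_cli': shutil.which('docker') is not None,
        # True if the ms_enclave package is importable (assumed import name).
        'ms_enclave': importlib.util.find_spec('ms_enclave') is not None,
    }


for name, ok in check_local_prerequisites().items():
    print(f"{name}: {'found' if ok else 'missing'}")
```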

Parameter Configuration#

Use the sandbox field on TaskConfig to enable the sandbox; all other parameters are the same as in a regular evaluation.

The following is a complete example that evaluates a model on HumanEval:

from dotenv import dotenv_values
from evalscope import TaskConfig, run_task

# Load DASHSCOPE_API_KEY from a local .env file
env = dotenv_values('.env')

task_config = TaskConfig(
    model='qwen-plus',
    datasets=['humaneval'],
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=env.get('DASHSCOPE_API_KEY'),
    eval_type='openai_api',
    eval_batch_size=5,
    limit=5,
    generation_config={
        'max_tokens': 4096,
        'temperature': 0.0,
        'seed': 42,
    },
    sandbox={
        'enabled': True,    # enable sandbox
        'engine': 'docker', # specify sandbox engine
    },
)

run_task(task_config)

During model evaluation, EvalScope will automatically start and manage the sandbox environment, ensuring code runs in an isolated environment. The console will display output like:

[INFO:ms_enclave] Local sandbox manager started
...

2. Remote Usage#

Set up the sandbox environment on a remote server and conduct evaluation through API interfaces, requiring Docker support on the remote machine.

Environment Setup#

The remote machine and the local machine must each be set up separately.

Remote Machine#

The environment installation on the remote machine is similar to the local usage method described above:

  1. Install Docker: Please ensure Docker is installed on your machine. You can download and install Docker from the Docker official website.

  2. Install sandbox environment dependencies: Install packages like ms-enclave in the remote Python environment:

pip install 'evalscope[sandbox]'

  3. Start sandbox server: Run the following command to start the sandbox server:

ms-enclave server --host 0.0.0.0 --port 1234
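Once the server is running, a quick way to confirm that the local machine can reach it is a plain TCP connection test. The sketch below uses only the standard library; `can_connect` is a hypothetical helper, the host is a placeholder, and a successful connection shows only that the port is open, not that the sandbox API itself is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# '127.0.0.1' is a placeholder; replace it with the remote host's address.
print(can_connect('127.0.0.1', 1234))
```

If this prints False, check firewall rules and that the server was started with a host binding that accepts external connections, such as the 0.0.0.0 used above.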

Local Machine#

In this mode the local machine does not need Docker, but EvalScope must be installed:

pip install 'evalscope[sandbox]'

Parameter Configuration#

Use the sandbox field on TaskConfig to enable the sandbox, and specify the remote sandbox server’s API address in manager_config.

A complete example follows:

from dotenv import dotenv_values
from evalscope import TaskConfig, run_task

# Load DASHSCOPE_API_KEY from a local .env file
env = dotenv_values('.env')

task_config = TaskConfig(
    model='qwen-plus',
    datasets=['humaneval'],
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=env.get('DASHSCOPE_API_KEY'),
    eval_type='openai_api',
    eval_batch_size=5,
    limit=5,
    generation_config={
        'max_tokens': 4096,
        'temperature': 0.0,
        'seed': 42,
    },
    sandbox={
        'enabled': True,    # enable sandbox
        'engine': 'docker', # specify sandbox engine
        'manager_config': {
            'base_url': 'http://<remote_host>:1234'  # remote sandbox manager URL
        },
    },
)

run_task(task_config)

During model evaluation, EvalScope communicates with the remote sandbox server through its API, ensuring code runs in an isolated environment. The console will display output like:

[INFO:ms_enclave] HTTP sandbox manager started, connected to http://<remote_host>:1234
...
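The local and remote configurations differ only in manager_config, so a single helper can produce either one. This is a minimal sketch: `build_sandbox_config` is a hypothetical function, and it uses only the keys that appear in the examples above.

```python
def build_sandbox_config(remote_url=None, engine='docker'):
    """Build the dict passed as TaskConfig(sandbox=...).

    Hypothetical helper; the keys are the ones used in the examples above.
    """
    config = {'enabled': True, 'engine': engine}
    if remote_url:
        # Only remote usage needs manager_config.base_url.
        config['manager_config'] = {'base_url': remote_url}
    return config


# Local usage: no manager_config is needed.
print(build_sandbox_config())
# Remote usage: point at the sandbox server started on the remote machine.
print(build_sandbox_config('http://<remote_host>:1234'))
```

Keeping the URL as the only switch makes it easy to move the same evaluation script between a laptop with Docker and a machine that relies on a shared sandbox server.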

3. Additional Supported Sandboxes#

Using VolcEngine Sandbox Environment#

EvalScope also supports using the VolcEngine Sandbox Environment. Users can choose to use the sandbox images provided by VolcEngine for code evaluation.

  1. Install Docker: Please ensure that Docker is installed on your machine. You can download and install Docker from the Docker official website.

  2. Start VolcEngine Sandbox Server: Use the following command to start the VolcEngine sandbox server:

docker run -it -p 8080:8080 vemlp-cn-beijing.cr.volces.com/preset-images/code-sandbox:server-20250609

  3. Configure Evaluation Parameters: When using the VolcEngine sandbox environment, set the engine parameter to volcengine, and ensure that the VolcEngine sandbox server has been installed and is running on the remote machine.

    ...
    sandbox={
        'enabled': True,        # Enable sandbox
        'engine': 'volcengine', # Use the VolcEngine sandbox
        'manager_config': {
            'base_url': 'http://<remote_host>:8080',  # Remote VolcEngine sandbox manager URL
            'dataset_language_map': {  # Optional, specify programming languages for datasets
                'r': 'R',
                'd_ut': 'D_ut',
                'ts': 'typescript',
            },
        },
    },
    ...
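One plausible reading of dataset_language_map is a lookup from a dataset's language identifier to the language name the sandbox expects. The sketch below illustrates that reading only; `resolve_language` is a hypothetical function, and falling back to the original identifier for unmapped entries is an assumption, not documented behavior.

```python
# The map from the configuration above; its exact semantics are assumed here.
DATASET_LANGUAGE_MAP = {
    'r': 'R',
    'd_ut': 'D_ut',
    'ts': 'typescript',
}


def resolve_language(dataset_language, language_map=None):
    """Return the mapped sandbox language, or the input name if unmapped."""
    language_map = language_map or {}
    return language_map.get(dataset_language, dataset_language)


print(resolve_language('ts', DATASET_LANGUAGE_MAP))      # typescript
print(resolve_language('python', DATASET_LANGUAGE_MAP))  # python
```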

Console output will be as follows:

2026-01-20 08:07:22 [debug    ] running command python /tmp/tmpyhpyii_k/tmpf_q4hcoq.py [sandbox.runners.base]
2026-01-20 08:07:23 [debug    ] stop running command python /tmp/tmpv1hp4llu/tmpowprm7uz.py [sandbox.runners.base]
2026-01-20 08:07:23 [debug    ] stop running command python /tmp/tmpeytkvisz/tmpxkproid8.py [sandbox.runners.base]
2026-01-20 08:07:23 [debug    ] stop running command python /tmp/tmpyhpyii_k/tmpf_q4hcoq.py [sandbox.runners.base]
2026-01-20 08:07:23 [debug    ] stop running command python /tmp/tmpqcbuep0x/tmplxz4z9cw.py [sandbox.runners.base]
2026-01-20 08:07:23 [debug    ] stop running command python /tmp/tmp3msgd871/tmplldfzrp9.py [sandbox.runners.base]
...