Sandbox Environment Usage#

Evaluating an LLM's coding ability means executing model-generated code, so the evaluation should run in an isolated environment. This keeps faulty code out of the development environment, where it could cause irreversible damage. EvalScope integrates the ms-enclave sandbox environment, which lets users evaluate model code capabilities in a controlled environment with benchmarks such as HumanEval and LiveCodeBench.

The following introduces two different sandbox usage methods:

Local usage: Set up the sandbox environment on a local machine and conduct evaluation locally, requiring Docker support on the local machine;
Remote usage: Set up the sandbox environment on a remote server and conduct evaluation through its API, requiring Docker support on the remote machine.

1. Local Usage#

Use Docker to set up the sandbox environment on your local machine and run the evaluation there. Docker must be installed locally.

Environment Setup#

Install Docker: Please ensure Docker is installed on your machine. You can download and install Docker from the Docker official website.
Install sandbox environment dependencies: Install packages like ms-enclave in your local Python environment:

pip install "evalscope[sandbox]"
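Before running an evaluation, it can help to confirm that Docker is actually usable from Python. The following is a minimal sketch that uses only the standard library; the helper name `docker_available` is hypothetical and is not part of EvalScope or ms-enclave. It assumes that a zero exit code from `docker info` means the daemon is reachable.

```python
import shutil
import subprocess


def docker_available(timeout: float = 10.0) -> bool:
    """Return True if the docker CLI is on PATH and `docker info` succeeds.

    Hypothetical helper for illustration; not an EvalScope or ms-enclave API.
    """
    if shutil.which("docker") is None:
        return False  # docker CLI not found on PATH
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # CLI could not be executed or the daemon did not answer in time
    return result.returncode == 0
```

Calling `docker_available()` before `run_task` gives an early, readable failure instead of an error raised partway through the evaluation.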

Parameter Configuration#

When running an evaluation, add the use_sandbox and sandbox_type parameters to enable the sandbox environment automatically. All other parameters are the same as in a regular evaluation.

Here is a complete example of model evaluation on HumanEval:

from dotenv import dotenv_values
from evalscope import TaskConfig, run_task

# Load API credentials from a local .env file
env = dotenv_values('.env')

task_config = TaskConfig(
    model='qwen-plus',
    datasets=['humaneval'],
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=env.get('DASHSCOPE_API_KEY'),
    eval_type='openai_api',
    eval_batch_size=5,
    limit=5,
    generation_config={
        'max_tokens': 4096,
        'temperature': 0.0,
        'seed': 42,
    },
    use_sandbox=True,  # enable sandbox
    sandbox_type='docker',  # specify sandbox type
    judge_worker_num=5,  # specify number of sandbox workers during evaluation
)

run_task(task_config)

During model evaluation, EvalScope will automatically start and manage the sandbox environment, ensuring code runs in an isolated environment. The console will display output like:

[INFO:ms_enclave] Local sandbox manager started
...
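In the example above, `env.get('DASHSCOPE_API_KEY')` returns `None` when the key is absent from the `.env` file, and the problem then surfaces only once the API is called. A small guard can fail earlier with a clearer message. This is a sketch under that assumption; `require_setting` is a hypothetical helper, not an EvalScope function.

```python
from typing import Mapping, Optional


def require_setting(env: Mapping[str, Optional[str]], name: str) -> str:
    """Return env[name], raising ValueError if it is missing or empty.

    Hypothetical helper for illustration; works with the mapping
    returned by dotenv_values('.env').
    """
    value = env.get(name)
    if not value:
        raise ValueError(f"{name} is missing or empty; add it to your .env file")
    return value
```

With this helper, the `api_key` argument could be written as `api_key=require_setting(env, 'DASHSCOPE_API_KEY')`.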

2. Remote Usage#

Set up the sandbox environment on a remote server and run the evaluation against it through its API. Docker must be installed on the remote machine.

Environment Setup#

The remote machine and the local machine each need their own installation and configuration.

Remote Machine#

The environment installation on the remote machine is similar to the local usage method described above:

Install Docker: Please ensure Docker is installed on the remote machine. You can download and install Docker from the Docker official website.
Install sandbox environment dependencies: Install packages like ms-enclave in the remote Python environment:

pip install "evalscope[sandbox]"

Start sandbox server: Run the following command to start the sandbox server:

ms-enclave server --host 0.0.0.0 --port 1234

Local Machine#

The local machine does not need Docker in this setup, but EvalScope must be installed:

pip install "evalscope[sandbox]"
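Once the sandbox server is running on the remote machine, a quick connectivity check from the local machine can rule out firewall or address problems before an evaluation starts. The sketch below uses only the standard library and the port from the server command above (1234). `port_reachable` is a hypothetical helper. It only confirms that something accepts TCP connections on that port, not that the listener is the ms-enclave server.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it does not speak the
    sandbox server's API, it only tests basic network reachability.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False
```

For example, `port_reachable('<remote_host>', 1234)` should return `True` before `base_url` is pointed at that host.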

Parameter Configuration#

When running an evaluation, add the use_sandbox parameter to enable the sandbox environment automatically, and specify the remote sandbox server's API address in sandbox_manager_config.

A complete example is as follows:

from dotenv import dotenv_values
from evalscope import TaskConfig, run_task

# Load API credentials from a local .env file
env = dotenv_values('.env')

task_config = TaskConfig(
    model='qwen-plus',
    datasets=['humaneval'],
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=env.get('DASHSCOPE_API_KEY'),
    eval_type='openai_api',
    eval_batch_size=5,
    limit=5,
    generation_config={
        'max_tokens': 4096,
        'temperature': 0.0,
        'seed': 42,
    },
    use_sandbox=True,  # enable sandbox
    sandbox_type='docker',  # specify sandbox type
    sandbox_manager_config={
        'base_url': 'http://<remote_host>:1234'  # remote sandbox manager URL
    },
    judge_worker_num=5,  # specify number of sandbox workers during evaluation
)

run_task(task_config)

During model evaluation, EvalScope communicates with the remote sandbox server through its API, so the code still runs in an isolated environment. The console will display output like:

[INFO:ms_enclave] HTTP sandbox manager started, connected to http://<remote_host>:1234
...
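The local and remote examples on this page differ only in the sandbox_manager_config entry. One way to keep both modes in a single script is to build the sandbox-related arguments in one place. The sketch below is an illustration only: `sandbox_kwargs` is a hypothetical helper, and the keys it returns are exactly the TaskConfig parameters shown in the two examples.

```python
from typing import Any, Dict, Optional


def sandbox_kwargs(remote_url: Optional[str] = None, workers: int = 5) -> Dict[str, Any]:
    """Build the sandbox-related TaskConfig arguments used on this page.

    Hypothetical helper for illustration. With no remote_url the result
    matches the local example; with one, it adds sandbox_manager_config
    as in the remote example.
    """
    kwargs: Dict[str, Any] = {
        "use_sandbox": True,          # enable sandbox
        "sandbox_type": "docker",     # specify sandbox type
        "judge_worker_num": workers,  # number of sandbox workers
    }
    if remote_url:
        # Point EvalScope at the remote sandbox manager
        kwargs["sandbox_manager_config"] = {"base_url": remote_url}
    return kwargs
```

The result can be unpacked into the configuration, for instance `TaskConfig(model='qwen-plus', datasets=['humaneval'], **sandbox_kwargs('http://<remote_host>:1234'))`, with the remaining arguments as in the examples above.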