GPT-OSS Model Evaluation

On August 6, 2025, OpenAI released two open-weight models:

  • gpt-oss-120b: Suited to production environments, general-purpose tasks, and scenarios that demand strong reasoning. It can run on a single H100 GPU (117B total parameters, of which 5.1B are active per token).

  • gpt-oss-20b: Suited to low-latency, local, or specialized use cases (21B total parameters, of which 3.6B are active per token).

Let's use the EvalScope model evaluation framework to quickly test the inference speed and benchmark performance of these models.

Environment Setup

To make model deployment easier and improve inference speed, we use vLLM to launch a web service compatible with the OpenAI API format.

⚠️ Note: As of August 6, 2025, vLLM 0.10.1, the version that supports the gpt-oss models, has not been officially released. You need to install a pre-release vLLM build together with the gpt-oss dependencies, as shown below. We recommend creating a fresh Python 3.12 environment so that your existing environment is not affected.

  1. Create and activate a new conda environment:

conda create -n gpt_oss_vllm python=3.12
conda activate gpt_oss_vllm
  2. Install the necessary dependencies:

# Install PyTorch-nightly and vLLM
pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128
# Install FlashInfer
pip install flashinfer-python==0.2.10
# Install evalscope
pip install 'evalscope[perf]' -U
  3. Start the model service:

We successfully launched the gpt-oss-20b model service on a single H20 GPU. Use one of the two commands below, depending on where the model weights are downloaded from.

To download the model via ModelScope (recommended):

VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 VLLM_USE_MODELSCOPE=true vllm serve openai-mirror/gpt-oss-20b --served-model-name gpt-oss-20b --trust_remote_code --port 8801

To download the model via HuggingFace:

VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 vllm serve openai/gpt-oss-20b --served-model-name gpt-oss-20b --trust_remote_code --port 8801
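Before running any tests, it can help to confirm that the service is reachable. The following is a minimal sketch, assuming the server exposes the OpenAI-style `GET /v1/models` listing endpoint on the port used above. The helper names `parse_model_ids` and `list_served_models` are illustrative and are not part of vLLM or EvalScope.

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str = "http://127.0.0.1:8801/v1",
                       timeout: float = 5.0) -> list[str]:
    """Query the running service and return the ids of the models it serves."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))


# Example (requires the service from the previous step to be running):
# assert "gpt-oss-20b" in list_served_models()
```

If the name passed to `--served-model-name` does not appear in the returned list, the later `--model gpt-oss-20b` arguments would likely fail to resolve.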

Inference Speed Test

We use EvalScope's performance testing feature to measure the model's inference speed.

Test environment:

  • GPU: 1 × H20 (96 GB)

  • vLLM version: 0.10.1+gptoss

  • Prompt length: 1024 tokens

  • Output length: 1024 tokens

Run the test script:

evalscope perf \
  --parallel 1 10 50 100 \
  --number 5 20 100 200 \
  --model gpt-oss-20b \
  --url http://127.0.0.1:8801/v1/completions \
  --api openai \
  --dataset random \
  --max-tokens 1024 \
  --min-tokens 1024 \
  --prefix-length 0 \
  --min-prompt-length 1024 \
  --max-prompt-length 1024 \
  --log-every-n-query 20 \
  --tokenizer-path openai-mirror/gpt-oss-20b \
  --extra-args '{"ignore_eos": true}'
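The `--parallel` and `--number` lists appear to be paired by position, so each concurrency level gets its own request count. Below is a small sketch of the resulting workload, assuming that pairing and assuming that `ignore_eos` together with equal `--min-tokens` and `--max-tokens` makes every request emit exactly 1024 tokens. The variable names are illustrative.

```python
# Values copied from the evalscope perf command above.
parallel = [1, 10, 50, 100]   # concurrent requests per run
number = [5, 20, 100, 200]    # total requests per run
output_tokens = 1024          # --min-tokens == --max-tokens, EOS ignored

# Assumption: the i-th concurrency level is run with the i-th request count.
runs = list(zip(parallel, number))
total_requests = sum(n for _, n in runs)
total_generated = total_requests * output_tokens

for conc, n in runs:
    print(f"concurrency={conc:>3}  requests={n:>3}  output tokens={n * output_tokens}")
print(f"total requests: {total_requests}, total generated tokens: {total_generated}")
```

The computed total of 332,800 tokens agrees with the Total Generated value in the performance report that follows, which is consistent with the pairing assumption.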

Output of the performance test:

╭──────────────────────────────────────────────────────────╮
│ Performance Test Summary Report                          │
╰──────────────────────────────────────────────────────────╯

Basic Information:
┌───────────────────────┬──────────────────────────────────┐
│ Model                 │ gpt-oss-20b                      │
│ Total Generated       │ 332,800.0 tokens                 │
│ Total Test Time       │ 154.57 seconds                   │
│ Avg Output Rate       │ 2153.10 tokens/sec               │
└───────────────────────┴──────────────────────────────────┘


                                    Detailed Performance Metrics
┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│    1 │ 0.15 │    6.811 │    6.854 │  150.34 │    0.094 │   0.096 │    0.007 │   0.007 │    100.0%│
│   10 │ 0.96 │   10.374 │   10.708 │  986.63 │    0.865 │   1.278 │    0.009 │   0.010 │    100.0%│
│   50 │ 2.47 │   20.222 │   22.612 │ 2529.14 │    2.051 │   5.446 │    0.018 │   0.020 │    100.0%│
│  100 │ 3.37 │   29.570 │   35.594 │ 3455.61 │    2.354 │   6.936 │    0.027 │   0.028 │    100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘


               Best Performance Configuration               
 Highest RPS         Concurrency 100 (3.37 req/sec)         
 Lowest Latency      Concurrency 1 (6.811 seconds)          

Performance Recommendations:
• The system seems not to have reached its performance bottleneck, try higher concurrency
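The columns of the detailed metrics table are related to one another, which gives a quick sanity check on the numbers. The sketch below uses values copied from that table. By Little's law, concurrency should be close to RPS times average latency, and because every request produces 1024 tokens, generation throughput should be close to RPS times 1024. Small deviations are expected because the reported values are rounded.

```python
# (concurrency, RPS, avg latency in s, generated tokens/s) from the report above.
rows = [
    (1, 0.15, 6.811, 150.34),
    (10, 0.96, 10.374, 986.63),
    (50, 2.47, 20.222, 2529.14),
    (100, 3.37, 29.570, 3455.61),
]
OUTPUT_TOKENS = 1024  # fixed output length per request

baseline = rows[0][3]  # throughput at concurrency 1
for conc, rps, latency, toks_per_s in rows:
    in_flight = rps * latency        # Little's law estimate of concurrency
    est_toks = rps * OUTPUT_TOKENS   # throughput implied by RPS
    speedup = toks_per_s / baseline  # scaling relative to concurrency 1
    print(f"conc={conc:>3}  rps*lat={in_flight:6.2f}  "
          f"rps*1024={est_toks:8.1f}  reported={toks_per_s:8.2f}  "
          f"speedup={speedup:5.1f}x")
```

Going from concurrency 1 to 100 raises throughput by roughly 23x while average latency grows by about 4.3x, which is in line with the recommendation that higher concurrency is worth trying.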

Benchmark Evaluation

We use EvalScope's benchmark evaluation feature to assess the model's capabilities, taking the AIME 2025 mathematical reasoning benchmark as an example.

Run the test script:

from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='gpt-oss-20b',  # Model name
    api_url='http://127.0.0.1:8801/v1',  # Model service address
    eval_type='openai_api', # Evaluation type, here using openai_api evaluation
    datasets=['aime25'],  # Dataset to test
    generation_config={
        'extra_body': {"reasoning_effort": "high"}  # Model generation parameters, set to high reasoning level
    },
    eval_batch_size=10, # Concurrent batch size
    timeout=60000, # Timeout in seconds
)

run_task(task_cfg=task_cfg)
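The script above fixes the reasoning effort at high. To compare reasoning levels, one option is to generate one configuration per level. The sketch below assumes the service accepts the values `low`, `medium`, and `high`. `build_task_kwargs` is a hypothetical helper that only assembles keyword arguments mirroring the `TaskConfig` call above, and it does not contact the service.

```python
def build_task_kwargs(effort: str) -> dict:
    """Return TaskConfig keyword arguments for one reasoning-effort level."""
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"unexpected reasoning effort: {effort!r}")
    return {
        "model": "gpt-oss-20b",
        "api_url": "http://127.0.0.1:8801/v1",
        "eval_type": "openai_api",
        "datasets": ["aime25"],
        "generation_config": {"extra_body": {"reasoning_effort": effort}},
        "eval_batch_size": 10,
        "timeout": 60000,
    }


configs = {effort: build_task_kwargs(effort) for effort in ("low", "medium", "high")}

# With evalscope installed and the service running, each entry can be evaluated:
# from evalscope import TaskConfig, run_task
# for effort, kwargs in configs.items():
#     run_task(task_cfg=TaskConfig(**kwargs))
```

Keeping every other argument identical across the runs means that any difference in score can be attributed to the reasoning level and to run-to-run variation.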

Sample output:

The overall score in this run is 0.8. Scores can differ between runs, so try different generation parameters and repeat the test to see how the results vary.

+-------------+-----------+---------------+-------------+-------+---------+---------+
| Model       | Dataset   | Metric        | Subset      |   Num |   Score | Cat.0   |
+=============+===========+===============+=============+=======+=========+=========+
| gpt-oss-20b | aime25    | AveragePass@1 | AIME2025-I  |    15 |     0.8 | default |
+-------------+-----------+---------------+-------------+-------+---------+---------+
| gpt-oss-20b | aime25    | AveragePass@1 | AIME2025-II |    15 |     0.8 | default |
+-------------+-----------+---------------+-------------+-------+---------+---------+
| gpt-oss-20b | aime25    | AveragePass@1 | OVERALL     |    30 |     0.8 | -       |
+-------------+-----------+---------------+-------------+-------+---------+---------+ 
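To read the table: assuming a single sample per problem, a score of 0.8 on a 15-problem subset corresponds to 12 correct answers, and the OVERALL row is consistent with an average of the subset scores weighted by problem count. The following sketch works through that arithmetic. The weighting scheme is an assumption about how the overall score is aggregated.

```python
# (subset name, number of problems, score) copied from the table above.
subsets = [("AIME2025-I", 15, 0.8), ("AIME2025-II", 15, 0.8)]

total = sum(num for _, num, _ in subsets)
# Assumption: OVERALL is the average of subset scores weighted by problem count.
overall = sum(num * score for _, num, score in subsets) / total
# Assumption: one sample per problem, so num * score is the number solved.
solved = {name: round(num * score) for name, num, score in subsets}

print(f"overall={overall:.2f} over {total} problems, solved per subset: {solved}")
```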

For more supported benchmarks, please refer to the EvalScope documentation.

Result Visualization

EvalScope can visualize evaluation results, letting you inspect the model's output for each sample:

pip install 'evalscope[app]'
evalscope app --lang en

Summary

Following the steps above, we tested the inference speed and benchmark performance of gpt-oss-20b with EvalScope. On a single H20 GPU, the model reached about 3,456 generated tokens per second at a concurrency of 100 and scored 0.8 on AIME 2025 with high reasoning effort, which makes it a reasonable candidate for production and high-throughput scenarios.