
QwQ-32B-Preview

QwQ-32B-Preview is an experimental research model developed by the Qwen team to advance the reasoning capabilities of AI (model ID: Qwen/QwQ-32B-Preview).

EvalScope's Speed Benchmark tool was used to measure the GPU memory usage and inference speed of QwQ-32B-Preview under different configurations. Each test generates 2048 tokens from prompts of 1, 6144, 14336, and 30720 tokens.

Local Transformers Inference Speed

Test Environment

NVIDIA A100 80GB * 1
CUDA 12.1
PyTorch 2.3.1
Flash Attention 2.5.8
Transformers 4.46.0
EvalScope 0.7.0
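
Since the results depend on the library versions listed in the test environments, it may help to record what is installed before running a benchmark. The following is a minimal sketch using only the Python standard library; the distribution names in `PACKAGES` are assumptions based on the usual PyPI names and may need adjusting for your installation.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names for the libraries listed in the
# test environments; adjust these if your installation differs.
PACKAGES = ["torch", "flash-attn", "transformers", "evalscope", "vllm"]


def installed_versions(packages=PACKAGES):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Comparing this output with the versions listed here is a quick way to spot differences that could explain diverging numbers.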

Stress Testing Command

pip install 'evalscope[perf]' -U

CUDA_VISIBLE_DEVICES=0 evalscope perf \
 --parallel 1 \
 --model Qwen/QwQ-32B-Preview \
 --attn-implementation flash_attention_2 \
 --log-every-n-query 1 \
 --connect-timeout 60000 \
 --read-timeout 60000 \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --api local \
 --dataset speed_benchmark

Test Results

+---------------+-----------------+----------------+
| Prompt Tokens | Speed(tokens/s) | GPU Memory(GB) |
+---------------+-----------------+----------------+
|       1       |      17.92      |     61.58      |
|     6144      |      12.61      |     63.72      |
|     14336     |      9.01       |     67.31      |
|     30720     |      5.61       |     74.47      |
+---------------+-----------------+----------------+
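
To make the table easier to interpret, the sketch below converts each speed into an approximate time to generate the 2048 output tokens and estimates how GPU memory grows with prompt length. It assumes the reported speed is an average rate over the whole generation and that memory grows roughly linearly between the shortest and longest prompts; the helper names are illustrative and not part of EvalScope.

```python
# Figures copied from the Transformers results table above.
transformers_results = [
    # (prompt_tokens, tokens_per_second, gpu_memory_gb)
    (1, 17.92, 61.58),
    (6144, 12.61, 63.72),
    (14336, 9.01, 67.31),
    (30720, 5.61, 74.47),
]

GENERATED_TOKENS = 2048  # matches --max-tokens / --min-tokens in the command


def generation_seconds(tokens_per_second, generated_tokens=GENERATED_TOKENS):
    """Approximate wall-clock time to produce the fixed-length output."""
    return generated_tokens / tokens_per_second


def memory_growth_gb_per_1k(results):
    """Average extra GPU memory per 1,000 additional prompt tokens,
    measured between the shortest and longest prompt in the table."""
    first_len, _, first_mem = results[0]
    last_len, _, last_mem = results[-1]
    return (last_mem - first_mem) / (last_len - first_len) * 1000


if __name__ == "__main__":
    for prompt_tokens, speed, memory in transformers_results:
        seconds = generation_seconds(speed)
        print(f"{prompt_tokens:>6} prompt tokens: ~{seconds:.0f} s, {memory:.2f} GB")
    growth = memory_growth_gb_per_1k(transformers_results)
    print(f"~{growth:.2f} GB per 1K prompt tokens")
```

Under these assumptions, generating 2048 tokens takes roughly two minutes with a 1-token prompt and about six minutes with a 30720-token prompt, while memory grows by roughly 0.4 GB per additional 1,000 prompt tokens.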

vLLM Inference Speed

Test Environment

NVIDIA A100 80GB * 2
CUDA 12.1
vLLM 0.6.3
PyTorch 2.4.0
Flash Attention 2.6.3
Transformers 4.46.0

Test Command

CUDA_VISIBLE_DEVICES=0,1 evalscope perf \
 --parallel 1 \
 --model Qwen/QwQ-32B-Preview \
 --log-every-n-query 1 \
 --connect-timeout 60000 \
 --read-timeout 60000 \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --api local_vllm \
 --dataset speed_benchmark

Test Results

+---------------+-----------------+
| Prompt Tokens | Speed(tokens/s) |
+---------------+-----------------+
|       1       |      38.17      |
|     6144      |      36.63      |
|     14336     |      35.01      |
|     30720     |      31.68      |
+---------------+-----------------+
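
The two results tables can be compared directly with a few lines of arithmetic. The sketch below computes the vLLM-to-Transformers speed ratio at each prompt length and the fractional slowdown from the shortest to the longest prompt. The vLLM run used two A100 GPUs and the Transformers run used one, so the ratios reflect the full setups rather than the frameworks alone; the function names are illustrative.

```python
# Tokens per second keyed by prompt length, copied from the two results tables.
transformers_speed = {1: 17.92, 6144: 12.61, 14336: 9.01, 30720: 5.61}
vllm_speed = {1: 38.17, 6144: 36.63, 14336: 35.01, 30720: 31.68}


def speedup(baseline, candidate):
    """Ratio of candidate speed to baseline speed at each prompt length."""
    return {n: candidate[n] / baseline[n] for n in baseline}


def relative_slowdown(speeds):
    """Fractional drop in speed from the shortest to the longest prompt."""
    shortest, longest = min(speeds), max(speeds)
    return 1 - speeds[longest] / speeds[shortest]


if __name__ == "__main__":
    for prompt_tokens, ratio in speedup(transformers_speed, vllm_speed).items():
        print(f"{prompt_tokens:>6} prompt tokens: vLLM setup is {ratio:.2f}x faster")
    print(f"Transformers slowdown: {relative_slowdown(transformers_speed):.0%}")
    print(f"vLLM slowdown: {relative_slowdown(vllm_speed):.0%}")
```

On these figures the ratio rises from about 2.1x at a 1-token prompt to about 5.6x at 30720 tokens, because the Transformers speed drops by roughly 69% across that range while the vLLM speed drops by roughly 17%.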
