Examples#
Using Local Model Inference#
This project supports local inference using transformers and vLLM (vLLM must be installed first). The --model parameter accepts either a ModelScope model name, such as Qwen/Qwen2.5-0.5B-Instruct, or a path to local model weights, such as /path/to/model_weights; in both cases the --url parameter is not needed.
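For instance, a minimal sketch pointing --model directly at a local weights directory (reusing the placeholder path above; note that no --url is passed):
evalscope perf \
--model /path/to/model_weights \
--api local \
--number 20 \
--parallel 2 \
--dataset openqa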
Inference using transformers (--attn-implementation is optional; one of flash_attention_2, eager, or sdpa):
evalscope perf \
--model 'Qwen/Qwen2.5-0.5B-Instruct' \
--attn-implementation flash_attention_2 \
--number 20 \
--parallel 2 \
--api local \
--dataset openqa
Inference using vLLM
evalscope perf \
--model 'Qwen/Qwen2.5-0.5B-Instruct' \
--number 20 \
--parallel 2 \
--api local_vllm \
--dataset openqa
Using prompt#
evalscope perf \
--url 'http://127.0.0.1:8000/v1/chat/completions' \
--parallel 2 \
--model 'qwen2.5' \
--log-every-n-query 10 \
--number 20 \
--api openai \
--temperature 0.9 \
--max-tokens 1024 \
--prompt 'Write a science fiction story, please begin your performance'
You can also use a local file as a prompt:
evalscope perf \
--url 'http://127.0.0.1:8000/v1/chat/completions' \
--parallel 2 \
--model 'qwen2.5' \
--log-every-n-query 10 \
--number 20 \
--api openai \
--temperature 0.9 \
--max-tokens 1024 \
--prompt @prompt.txt
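For instance, the prompt file can be created beforehand; the @ prefix tells evalscope to read the prompt from that file:
echo 'Write a science fiction story, please begin your performance' > prompt.txt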
Complex Requests#
Using stop, stream, temperature, etc.:
evalscope perf \
--url 'http://127.0.0.1:8000/v1/chat/completions' \
--parallel 2 \
--model 'qwen2.5' \
--log-every-n-query 10 \
--read-timeout 120 \
--connect-timeout 120 \
--number 20 \
--max-prompt-length 128000 \
--min-prompt-length 128 \
--api openai \
--temperature 0.7 \
--max-tokens 1024 \
--stop '<|im_end|>' \
--dataset openqa \
--stream
Using query-template#
You can set request parameters in the query-template:
evalscope perf \
--url 'http://127.0.0.1:8000/v1/chat/completions' \
--parallel 2 \
--model 'qwen2.5' \
--log-every-n-query 10 \
--read-timeout 120 \
--connect-timeout 120 \
--number 20 \
--max-prompt-length 128000 \
--min-prompt-length 128 \
--api openai \
--query-template '{"model": "%m", "messages": [{"role": "user","content": "%p"}], "stream": true, "skip_special_tokens": false, "stop": ["<|im_end|>"], "temperature": 0.7, "max_tokens": 1024}' \
--dataset openqa
Here %m and %p will be replaced by the model name and the prompt, respectively.
You can also save the query template as a local JSON file, for example template.json:
{
    "model": "%m",
    "messages": [
        {
            "role": "user",
            "content": "%p"
        }
    ],
    "stream": true,
    "skip_special_tokens": false,
    "stop": ["<|im_end|>"],
    "temperature": 0.7,
    "max_tokens": 1024
}
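As a minimal sketch, the file can be written from the shell before launching the benchmark:
cat > template.json <<'EOF'
{"model": "%m", "messages": [{"role": "user", "content": "%p"}], "stream": true, "skip_special_tokens": false, "stop": ["<|im_end|>"], "temperature": 0.7, "max_tokens": 1024}
EOF
Then reference it with @template.json: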
evalscope perf \
--url 'http://127.0.0.1:8000/v1/chat/completions' \
--parallel 2 \
--model 'qwen2.5' \
--log-every-n-query 10 \
--read-timeout 120 \
--connect-timeout 120 \
--number 20 \
--max-prompt-length 128000 \
--min-prompt-length 128 \
--api openai \
--query-template @template.json \
--dataset openqa
Using the Random Dataset#
Randomly generate prompts based on prefix-length, max-prompt-length, and min-prompt-length. The tokenizer-path parameter must be specified. The number of tokens in each generated prompt is uniformly distributed between prefix-length + min-prompt-length and prefix-length + max-prompt-length. Within a single test, all requests share the same prefix.
Note
Due to the influence of the chat template and the tokenization algorithm, the number of tokens in the generated prompts may deviate slightly from the specified values rather than matching an exact token count.
Execute the following command:
evalscope perf \
--parallel 20 \
--model Qwen2.5-0.5B-Instruct \
--url http://127.0.0.1:8801/v1/chat/completions \
--api openai \
--dataset random \
--min-tokens 128 \
--max-tokens 128 \
--prefix-length 64 \
--min-prompt-length 1024 \
--max-prompt-length 2048 \
--number 100 \
--tokenizer-path Qwen/Qwen2.5-0.5B-Instruct \
--debug
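With the settings above, each prompt consists of a shared 64-token prefix plus a random portion of between 1024 and 2048 tokens, so prompts fall roughly in the 1088 to 2112 token range (subject to the tokenization caveat above), while --min-tokens and --max-tokens pin each response to 128 output tokens.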
Using wandb to Record Test Results#
Please install wandb:
pip install wandb
When starting, add the following parameters:
--wandb-api-key 'wandb_api_key'
--name 'name_of_wandb_log'
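Put together, a minimal sketch of a run that logs to wandb (the API key and run name below are placeholders):
evalscope perf \
--url 'http://127.0.0.1:8000/v1/chat/completions' \
--parallel 2 \
--model 'qwen2.5' \
--number 20 \
--api openai \
--dataset openqa \
--wandb-api-key 'wandb_api_key' \
--name 'name_of_wandb_log'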
Using swanlab to Record Test Results#
Please install swanlab:
pip install swanlab
When starting, add the following parameters:
--swanlab-api-key 'swanlab_api_key'
--name 'name_of_swanlab_log'
Debugging Requests#
Use the --debug option to output requests and responses.
Non-stream Mode Output Example
2024-11-27 11:25:34,161 - evalscope - http_client.py - on_request_start - 116 - DEBUG - Starting request: <TraceRequestStartParams(method='POST', url=URL('http://127.0.0.1:8000/v1/completions'), headers=<CIMultiDict('Content-Type': 'application/json', 'user-agent': 'modelscope_bench', 'Authorization': 'Bearer EMPTY')>)>
2024-11-27 11:25:34,163 - evalscope - http_client.py - on_request_chunk_sent - 128 - DEBUG - Request sent: <method='POST', url=URL('http://127.0.0.1:8000/v1/completions'), truncated_chunk='{"prompt": "hello", "model": "qwen2.5"}'>
2024-11-27 11:25:38,172 - evalscope - http_client.py - on_response_chunk_received - 140 - DEBUG - Request received: <method='POST', url=URL('http://127.0.0.1:8000/v1/completions'), truncated_chunk='{"id":"cmpl-a4565eb4fc6b4a5697f38c0adaf9b70b","object":"text_completion","created":1732677934,"model":"qwen2.5","choices":[{"index":0,"text":",everyone!今天我给您撒个谎哦。 ))\\n\\n今天开心的事。","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":1,"total_tokens":17,"completion_tokens":16}}'>
Stream Mode Output Example
2024-11-27 20:02:24,760 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"重要的"},"finish_reason":null}],"usage":null}
2024-11-27 20:02:24,803 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":""},"finish_reason":null}],"usage":null}
2024-11-27 20:02:24,847 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":",以便"},"finish_reason":null}],"usage":null}
2024-11-27 20:02:24,890 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"及时"},"finish_reason":null}],"usage":null}
2024-11-27 20:02:24,933 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"得到"},"finish_reason":null}],"usage":null}
2024-11-27 20:02:24,976 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"帮助"},"finish_reason":null}],"usage":null}
2024-11-27 20:02:25,023 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"和支持"},"finish_reason":null}],"usage":null}
2024-11-27 20:02:25,066 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":""},"finish_reason":null}],"usage":null}
2024-11-27 20:02:25,109 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":""},"finish_reason":null}],"usage":null}
2024-11-27 20:02:25,111 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"。<|im_end|>"},"finish_reason":null}],"usage":null}
2024-11-27 20:02:25,113 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":50,"completion_tokens":260,"total_tokens":310}}
2024-11-27 20:02:25,113 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: [DONE]