LongBench-Write#
Description#
LongWriter supports 10,000+ word generation from long-context LLMs. We can use the LongBench-Write benchmark, which focuses on measuring both the quality and the length of long outputs.
GitHub: LongWriter
Technical Report: Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key
Usage#
Installation#
pip install evalscope[framework] -U
pip install vllm -U
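A quick sanity check that both packages installed correctly and are importable:
python -c "import evalscope, vllm; print('ok')"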
Task configuration#
There are a few ways to configure the task: dict, JSON, and YAML.
Configuration with dict:
task_cfg = dict(stage=['infer', 'eval_l', 'eval_q'],
                model='ZhipuAI/LongWriter-glm4-9b',
                input_data_path=None,
                output_dir='./outputs',
                infer_config={
                    'openai_api_base': 'http://127.0.0.1:8000/v1/chat/completions',
                    'is_chat': True,
                    'verbose': False,
                    'generation_kwargs': {
                        'max_new_tokens': 32768,
                        'temperature': 0.5,
                        'repetition_penalty': 1.0
                    },
                    'proc_num': 16,
                },
                eval_config={
                    # No need to set OpenAI info if skipping the stage `eval_q`
                    'openai_api_key': None,
                    'openai_api_base': 'https://api.openai.com/v1/chat/completions',
                    'openai_gpt_model': 'gpt-4o-2024-05-13',
                    'generation_kwargs': {
                        'max_new_tokens': 1024,
                        'temperature': 0.5,
                        'stop': None
                    },
                    'proc_num': 8
                }
                )
Arguments:
- `stage`: The stages to run; multiple stages may be listed. `infer` runs the inference process, `eval_l` runs the length evaluation, and `eval_q` runs the quality evaluation with a model-as-judge.
- `model`: Model id on the ModelScope hub, or a local model directory. Refer to LongWriter-glm4-9b for more details.
- `input_data_path`: Input data path. Defaults to `None`, which means the built-in longbench_write dataset is used.
- `output_dir`: Output root directory.
- `openai_api_key`: OpenAI API key, required when the `eval_q` stage uses `Model-as-Judge`. Defaults to `None` if not needed.
- `openai_gpt_model`: Judge model name from OpenAI. Defaults to `gpt-4o-2024-05-13`.
- `generation_kwargs`: The generation configs.
- `proc_num`: Number of processes for inference and evaluation.
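As a concrete example, the sketch below runs only the `infer` and `eval_l` stages, so the OpenAI fields in `eval_config` can stay `None` (settings are illustrative; adjust the endpoint and paths to your deployment):
# Sketch: length-only evaluation, skipping the model-as-judge stage,
# so no OpenAI credentials are required.
task_cfg = dict(
    stage=['infer', 'eval_l'],      # skip 'eval_q'
    model='ZhipuAI/LongWriter-glm4-9b',
    input_data_path=None,           # use the built-in longbench_write data
    output_dir='./outputs',
    infer_config={
        'openai_api_base': 'http://127.0.0.1:8000/v1/chat/completions',
        'is_chat': True,
        'verbose': False,
        'generation_kwargs': {'max_new_tokens': 32768, 'temperature': 0.5},
        'proc_num': 16,
    },
    eval_config={
        'openai_api_key': None,     # judge not used when 'eval_q' is skipped
        'proc_num': 8,
    },
)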
Configuration with json (Optional):
{
    "stage": [
        "infer",
        "eval_l",
        "eval_q"
    ],
    "model": "ZhipuAI/LongWriter-glm4-9b",
    "input_data_path": null,
    "output_dir": "./outputs",
    "infer_config": {
        "openai_api_base": "http://127.0.0.1:8000/v1/chat/completions",
        "is_chat": true,
        "verbose": false,
        "generation_kwargs": {
            "max_new_tokens": 32768,
            "temperature": 0.5,
            "repetition_penalty": 1.0
        },
        "proc_num": 16
    },
    "eval_config": {
        "openai_api_key": null,
        "openai_api_base": "https://api.openai.com/v1/chat/completions",
        "openai_gpt_model": "gpt-4o-2024-05-13",
        "generation_kwargs": {
            "max_new_tokens": 1024,
            "temperature": 0.5,
            "stop": null
        },
        "proc_num": 8
    }
}
Refer to default_task.json for more details.
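Since `run_task` takes a plain dict (as shown below in Run Evaluation), a JSON configuration can simply be loaded first. A minimal sketch:
import json

# Load the task configuration from the JSON file shown above.
with open('default_task.json', 'r', encoding='utf-8') as f:
    task_cfg = json.load(f)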
Configuration with yaml (Optional):
stage:
  - infer
  - eval_l
  - eval_q
model: "ZhipuAI/LongWriter-glm4-9b"
input_data_path: null
output_dir: "./outputs"
infer_config:
  openai_api_base: "http://127.0.0.1:8000/v1/chat/completions"
  is_chat: true
  verbose: false
  generation_kwargs:
    max_new_tokens: 32768
    temperature: 0.5
    repetition_penalty: 1.0
  proc_num: 16
eval_config:
  openai_api_key: null
  openai_api_base: "https://api.openai.com/v1/chat/completions"
  openai_gpt_model: "gpt-4o-2024-05-13"
  generation_kwargs:
    max_new_tokens: 1024
    temperature: 0.5
    stop: null
  proc_num: 8
Refer to default_task.yaml for more details.
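Likewise, a YAML configuration can be loaded into a dict with PyYAML (a minimal sketch; run `pip install pyyaml` first if it is not already present):
import yaml

# Load the task configuration from the YAML file shown above.
with open('default_task.yaml', 'r', encoding='utf-8') as f:
    task_cfg = yaml.safe_load(f)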
Run Model Inference#
We recommend using vLLM to deploy the model.
Environment:
A100 (80G) x 1
To start vLLM server, run the following command:
CUDA_VISIBLE_DEVICES=0 VLLM_USE_MODELSCOPE=True vllm serve --max-model-len=65536 --gpu_memory_utilization=0.95 --trust-remote-code ZhipuAI/LongWriter-glm4-9b
Arguments:
- `max-model-len`: The maximum context length of the model (prompt plus generated tokens).
- `gpu_memory_utilization`: The fraction of GPU memory to use.
- `trust-remote-code`: Whether to trust remote code shipped with the model repository.
- `model`: A model id on the ModelScope/HuggingFace hub, or a local model directory.
Note: To use multiple GPUs, set `CUDA_VISIBLE_DEVICES=0,1,2,3` (or similar) instead.
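Once the server is running, you can confirm it is reachable before launching inference; vLLM's OpenAI-compatible server exposes a `/v1/models` endpoint that lists the served model:
curl http://127.0.0.1:8000/v1/models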
Run Evaluation#
from evalscope.third_party.longbench_write import run_task

# `task_cfg` is the configuration dict (or loaded JSON/YAML) defined above.
run_task(task_cfg=task_cfg)
Results and metrics#
See `eval_length.jsonl` and `eval_quality.jsonl` in the output directory.
Metrics:
- `score_l`: The average score of the length evaluation.
- `score_q`: The average score of the quality evaluation.
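Both files are JSONL, one JSON object per line. A minimal sketch for inspecting them, assuming they sit directly under the configured `output_dir` (the exact subdirectory layout and record fields may vary by evalscope version, so we only pretty-print whatever is there):
import json

# Pretty-print every record from the two result files.
for name in ('eval_length.jsonl', 'eval_quality.jsonl'):
    with open(f'./outputs/{name}', encoding='utf-8') as f:
        for line in f:
            print(json.dumps(json.loads(line), indent=2, ensure_ascii=False))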