LongBench-Write#
Description#
LongWriter enables LLMs to generate content of more than 10,000 words. The LongBench-Write benchmark evaluates both the quality of long outputs and the model's ability to generate text of a requested length.
GitHub: LongWriter
Technical report: Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key
Usage#
Installation#
pip install evalscope[framework] -U
pip install vllm -U
Task configuration#
Supported task configuration formats: dict, json, yaml.
Configuration as a dict:
task_cfg = dict(
    stage=['infer', 'eval_l', 'eval_q'],
    model='ZhipuAI/LongWriter-glm4-9b',
    input_data_path=None,
    output_dir='./outputs',
    infer_config={
        'openai_api_base': 'http://127.0.0.1:8000/v1/chat/completions',
        'is_chat': True,
        'verbose': False,
        'generation_kwargs': {
            'max_new_tokens': 32768,
            'temperature': 0.5,
            'repetition_penalty': 1.0
        },
        'proc_num': 16,
    },
    eval_config={
        # No need to set OpenAI info if skipping the stage `eval_q`
        'openai_api_key': None,
        'openai_api_base': 'https://api.openai.com/v1/chat/completions',
        'openai_gpt_model': 'gpt-4o-2024-05-13',
        'generation_kwargs': {
            'max_new_tokens': 1024,
            'temperature': 0.5,
            'stop': None
        },
        'proc_num': 8
    }
)
Arguments:
- stage: the stages to run; any subset of the following.
  - infer -- run the inference process.
  - eval_l -- run the length evaluation process.
  - eval_q -- run the quality evaluation process with the model-as-judge.
- model: model id on the ModelScope hub, or a local model directory. Refer to LongWriter-glm4-9b for more details.
- input_data_path: input data path. Defaults to None, which means the built-in longbench_write data is used.
- output_dir: output root directory.
- openai_api_key: OpenAI API key, needed when the stage eval_q is enabled to use Model-as-Judge. Defaults to None if not needed.
- openai_gpt_model: judge model name from OpenAI. Defaults to gpt-4o-2024-05-13.
- generation_kwargs: the generation configs.
- proc_num: process number for inference and evaluation.
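For example, if no OpenAI key is available, the eval_q stage can simply be dropped from the stage list and the judge-related fields left unset; a minimal sketch:

```python
# Minimal sketch: run only inference and length evaluation,
# so none of the eval_config OpenAI fields are required.
task_cfg = dict(
    stage=['infer', 'eval_l'],          # 'eval_q' skipped -> no judge model needed
    model='ZhipuAI/LongWriter-glm4-9b',
    input_data_path=None,               # use the built-in longbench_write data
    output_dir='./outputs',
    infer_config={
        'openai_api_base': 'http://127.0.0.1:8000/v1/chat/completions',
        'is_chat': True,
        'generation_kwargs': {'max_new_tokens': 32768, 'temperature': 0.5},
        'proc_num': 16,
    },
)
```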
Configuration as json (optional):
{
    "stage": [
        "infer",
        "eval_l",
        "eval_q"
    ],
    "model": "ZhipuAI/LongWriter-glm4-9b",
    "input_data_path": null,
    "output_dir": "./outputs",
    "infer_config": {
        "openai_api_base": "http://127.0.0.1:8000/v1/chat/completions",
        "is_chat": true,
        "verbose": false,
        "generation_kwargs": {
            "max_new_tokens": 32768,
            "temperature": 0.5,
            "repetition_penalty": 1.0
        },
        "proc_num": 16
    },
    "eval_config": {
        "openai_api_key": null,
        "openai_api_base": "https://api.openai.com/v1/chat/completions",
        "openai_gpt_model": "gpt-4o-2024-05-13",
        "generation_kwargs": {
            "max_new_tokens": 1024,
            "temperature": 0.5,
            "stop": null
        },
        "proc_num": 8
    }
}
Refer to default_task.json for more details.
Configuration as yaml (optional):
stage:
  - infer
  - eval_l
  - eval_q
model: "ZhipuAI/LongWriter-glm4-9b"
input_data_path: null
output_dir: "./outputs"
infer_config:
  openai_api_base: "http://127.0.0.1:8000/v1/chat/completions"
  is_chat: true
  verbose: false
  generation_kwargs:
    max_new_tokens: 32768
    temperature: 0.5
    repetition_penalty: 1.0
  proc_num: 16
eval_config:
  openai_api_key: null
  openai_api_base: "https://api.openai.com/v1/chat/completions"
  openai_gpt_model: "gpt-4o-2024-05-13"
  generation_kwargs:
    max_new_tokens: 1024
    temperature: 0.5
    stop: null
  proc_num: 8
Refer to default_task.yaml for more details.
Start the model inference service#
vLLM is recommended for deploying the model service.
Hardware environment:
A100(80G) x 1
Start the vLLM service:
CUDA_VISIBLE_DEVICES=0 VLLM_USE_MODELSCOPE=True vllm serve --max-model-len=65536 --gpu_memory_utilization=0.95 --trust-remote-code ZhipuAI/LongWriter-glm4-9b
Arguments:
- max-model-len: the maximum sequence length the model can handle.
- gpu_memory_utilization: the fraction of GPU memory to use.
- trust-remote-code: whether to trust remote code from the model repository.
- model: a model id on the ModelScope/HuggingFace hub, or a local model directory.
Tip: use CUDA_VISIBLE_DEVICES=0,1,2,3 (together with vLLM's --tensor-parallel-size option) to configure a multi-GPU environment.
Run the evaluation task#
from evalscope.third_party.longbench_write import run_task
run_task(task_cfg=task_cfg)
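Once the task completes, the per-sample scores in the output .jsonl files can be aggregated yourself; a minimal sketch (the `score` field name here is an assumption — check the actual output files for the real schema):

```python
import json

def average_score(jsonl_path, key='score'):
    """Average a numeric field over a .jsonl results file.

    `key` is a hypothetical field name; inspect eval_length.jsonl /
    eval_quality.jsonl to find the actual per-sample score field.
    """
    scores = []
    with open(jsonl_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                scores.append(json.loads(line)[key])
    return sum(scores) / len(scores) if scores else 0.0
```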
Results and metrics#
See eval_length.jsonl and eval_quality.jsonl in the outputs directory.
Metrics:
- score_l: the average score of the length evaluation.
- score_q: the average score of the quality evaluation.
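For intuition, the LongWriter report describes a per-sample length score that is full marks when the output length matches the requirement and decays linearly with the relative error; a sketch of that idea (verify the exact formula against the technical report before relying on it):

```python
def length_score(required_len, output_len):
    """Per-sample length score in [0, 100]: 100 when the output length
    exactly matches the requirement, decreasing linearly with the
    relative length error, floored at 0. score_l would then be the
    average of this value over all samples."""
    return 100.0 * max(0.0, 1.0 - abs(output_len - required_len) / required_len)
```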