LongBench-Write#
Description#
LongWriter enables long-context LLMs to generate outputs of 10,000+ words. The LongBench-Write benchmark measures both the quality and the length of such long outputs.
Refer to LongWriter
Usage#
Installation#
pip install evalscope[framework]
Task configuration#
There are three ways to configure the task: dict, json, and yaml.
Configuration with dict:
task_cfg = dict(stage=['infer', 'eval_l', 'eval_q'],
                model='ZhipuAI/LongWriter-glm4-9b',
                input_data_path=None,
                output_dir='./outputs',
                openai_api_key=None,
                openai_gpt_model='gpt-4o-2024-05-13',
                infer_generation_kwargs={
                    'max_new_tokens': 32768,
                    'temperature': 0.5
                },
                eval_generation_kwargs={
                    'max_new_tokens': 1024,
                    'temperature': 0.5,
                    'stop': None
                },
                proc_num=8)
Arguments:
stage: the stages to run; multiple stages may be listed.
  infer: run the inference process.
  eval_l: run the length evaluation process.
  eval_q: run the quality evaluation process.
model: model id on the ModelScope hub, or a local model dir.
input_data_path: input data path. Defaults to None, which means longbench_write is used.
output_dir: output root directory.
openai_api_key: OpenAI API key, used when the eval_q stage is enabled to use Model-as-Judge. Defaults to None if not needed.
openai_gpt_model: name of the judge model from OpenAI. Defaults to gpt-4o-2024-05-13.
infer_generation_kwargs: the generation kwargs for the models to be evaluated.
eval_generation_kwargs: the generation kwargs for the judge model.
proc_num: number of processes to use.
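The relationship between the stage list and openai_api_key can be checked before launching a long run. The following is a minimal sketch, not part of evalscope: check_task_cfg and VALID_STAGES are hypothetical names, and the check assumes the key has to be supplied through the config whenever eval_q is enabled.

```python
# Hypothetical pre-flight check; not an evalscope API.
VALID_STAGES = ('infer', 'eval_l', 'eval_q')


def check_task_cfg(task_cfg: dict) -> list:
    """Return a list of human-readable problems found in a task config."""
    problems = []
    stages = task_cfg.get('stage') or []
    for name in stages:
        if name not in VALID_STAGES:
            problems.append(f'unknown stage: {name!r}')
    # eval_q relies on an OpenAI judge model, so an API key is expected
    # (assumption: the key is passed via the config, not found elsewhere).
    if 'eval_q' in stages and not task_cfg.get('openai_api_key'):
        problems.append("stage 'eval_q' is enabled but openai_api_key is not set")
    return problems


cfg = dict(stage=['infer', 'eval_l', 'eval_q'], openai_api_key=None)
for problem in check_task_cfg(cfg):
    print(problem)
```

Dropping eval_q from the stage list, or setting the key, makes the returned list empty.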
Configuration with json (Optional):
{
    "stage": ["infer", "eval_l", "eval_q"],
    "model": "ZhipuAI/LongWriter-glm4-9b",
    "input_data_path": null,
    "output_dir": "./outputs",
    "openai_api_key": null,
    "openai_gpt_model": "gpt-4o-2024-05-13",
    "infer_generation_kwargs": {
        "max_new_tokens": 32768,
        "temperature": 0.5
    },
    "eval_generation_kwargs": {
        "max_new_tokens": 1024,
        "temperature": 0.5,
        "stop": null
    },
    "proc_num": 8
}
Refer to default_task.json for more details.
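One way to use a JSON config is to read it into a plain dict and pass that dict along exactly like the dict configuration. This is a sketch using only the standard library; load_json_cfg is a hypothetical helper and my_task.json is a placeholder path, neither comes from evalscope.

```python
import json


def load_json_cfg(path: str) -> dict:
    """Read a JSON task config file into a plain dict."""
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)


# Placeholder path; JSON null values become Python None.
# task_cfg = load_json_cfg('my_task.json')
```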
Configuration with yaml (Optional):
stage:
  - infer
  - eval_l
  - eval_q
model: ZhipuAI/LongWriter-glm4-9b
input_data_path: null
output_dir: ./outputs
openai_api_key: null
openai_gpt_model: gpt-4o-2024-05-13
infer_generation_kwargs:
  max_new_tokens: 32768
  temperature: 0.5
eval_generation_kwargs:
  max_new_tokens: 1024
  temperature: 0.5
  stop: null
proc_num: 8
Refer to default_task.yaml for more details.
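A YAML config can likewise be read into a dict first. The sketch below assumes the third-party PyYAML package is installed; load_yaml_cfg is a hypothetical helper and my_task.yaml is a placeholder path. Note that the nested keys under infer_generation_kwargs and eval_generation_kwargs must be indented in the file, otherwise they are parsed as top-level keys.

```python
def load_yaml_cfg(path: str) -> dict:
    """Read a YAML task config file into a plain dict."""
    import yaml  # PyYAML (third-party); imported lazily so the module loads without it

    with open(path, 'r', encoding='utf-8') as f:
        # safe_load builds only plain Python types (dict, list, str, int, None, ...).
        return yaml.safe_load(f)


# Placeholder path; YAML null values become Python None.
# task_cfg = load_yaml_cfg('my_task.yaml')
```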
Run the task#
from evalscope.third_party.longbench_write import run_task
run_task(task_cfg=task_cfg)
Results and metrics#
See eval_length.jsonl and eval_quality.jsonl in the outputs dir.
Metrics:
score_l: The average score of the length evaluation.
score_q: The average score of the quality evaluation.
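The result files are in JSON Lines format (one JSON object per line), so they can be inspected with the standard library. This is a generic sketch: read_jsonl is a hypothetical helper, the path must be filled in with the actual file location under the outputs dir, and the record fields are not assumed here, so the example only lists whatever keys are present.

```python
import json


def read_jsonl(path: str) -> list:
    """Read a JSON Lines file into a list of parsed records, skipping blank lines."""
    records = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Replace the placeholder with the real path to eval_length.jsonl or eval_quality.jsonl.
# records = read_jsonl('<path to eval_length.jsonl>')
# if records and isinstance(records[0], dict):
#     print(sorted(records[0].keys()))  # discover which fields the file contains
```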