LongBench-Write#

Description#

LongWriter enables 10,000+ word generation from long-context LLMs. The LongBench-Write benchmark focuses on measuring both the quality and the length of such long outputs.

GitHub: LongWriter

Technical Report: Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key

Usage#

Installation#

pip install evalscope[framework] -U
pip install vllm -U
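
A quick way to confirm that both packages are importable after installation (a minimal sketch using only the Python standard library):

from importlib.metadata import version

# Print the installed versions of the two required packages.
print('evalscope:', version('evalscope'))
print('vllm:', version('vllm'))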

Task configuration#

There are three ways to configure the task: a Python dict, a JSON file, or a YAML file.

  1. Configuration with dict:

task_cfg = dict(stage=['infer', 'eval_l', 'eval_q'],
                model='ZhipuAI/LongWriter-glm4-9b',
                input_data_path=None,
                output_dir='./outputs',
                infer_config={
                    'openai_api_base': 'http://127.0.0.1:8000/v1/chat/completions', 
                    'is_chat': True, 
                    'verbose': False, 
                    'generation_kwargs': {
                        'max_new_tokens': 32768, 
                        'temperature': 0.5, 
                        'repetition_penalty': 1.0
                    },
                    'proc_num': 16,
                },
                eval_config={
                    # No need to set OpenAI info if skipping the stage `eval_q`
                    'openai_api_key': None,   
                    'openai_api_base': 'https://api.openai.com/v1/chat/completions', 
                    'openai_gpt_model': 'gpt-4o-2024-05-13', 
                    'generation_kwargs': {
                        'max_new_tokens': 1024, 
                        'temperature': 0.5, 
                        'stop': None
                    }, 
                    'proc_num': 8
                }
            )

  • Arguments:

    • stage: The stages to run. infer runs model inference; eval_l runs the output-length evaluation; eval_q runs the output-quality evaluation with a model-as-judge.

    • model: model id on the ModelScope hub, or local model dir. Refer to LongWriter-glm4-9b for more details.

    • input_data_path: Path to the input data. Defaults to None, which means the built-in longbench_write dataset is used.

    • output_dir: output root directory.

    • openai_api_key: OpenAI API key, required when the eval_q stage uses model-as-judge. Defaults to None; it can be left unset if eval_q is skipped (see the sketch after this list).

    • openai_gpt_model: Judge model name from OpenAI. Defaults to gpt-4o-2024-05-13.

    • generation_kwargs: The generation configs.

    • proc_num: Number of parallel processes used for inference and evaluation.
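
If you do not have an OpenAI API key, you can skip the model-as-judge stage entirely, as shown in this minimal sketch that tweaks the task_cfg dict defined above:

# Run only inference and length evaluation; no OpenAI key is needed
# because the model-as-judge stage eval_q is skipped (see the comment
# in eval_config above).
task_cfg['stage'] = ['infer', 'eval_l']
task_cfg['eval_config']['openai_api_key'] = None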

  2. Configuration with json (Optional):

{
    "stage": [
        "infer",
        "eval_l",
        "eval_q"
    ],
    "model": "ZhipuAI/LongWriter-glm4-9b",
    "input_data_path": null,
    "output_dir": "./outputs",
    "infer_config": {
        "openai_api_base": "http://127.0.0.1:8000/v1/chat/completions",
        "is_chat": true,
        "verbose": false,
        "generation_kwargs": {
            "max_new_tokens": 32768,
            "temperature": 0.5,
            "repetition_penalty": 1.0
        },
        "proc_num": 16
    },
    "eval_config": {
        "openai_api_key": null,
        "openai_api_base": "https://api.openai.com/v1/chat/completions",
        "openai_gpt_model": "gpt-4o-2024-05-13",
        "generation_kwargs": {
            "max_new_tokens": 1024,
            "temperature": 0.5,
            "stop": null
        },
        "proc_num": 8
    }
}

Refer to default_task.json for more details.

  3. Configuration with yaml (Optional):

stage:
  - infer
  - eval_l
  - eval_q
model: "ZhipuAI/LongWriter-glm4-9b"
input_data_path: null
output_dir: "./outputs"
infer_config:
  openai_api_base: "http://127.0.0.1:8000/v1/chat/completions"
  is_chat: true
  verbose: false
  generation_kwargs:
    max_new_tokens: 32768
    temperature: 0.5
    repetition_penalty: 1.0
  proc_num: 16
eval_config:
  openai_api_key: null
  openai_api_base: "https://api.openai.com/v1/chat/completions"
  openai_gpt_model: "gpt-4o-2024-05-13"
  generation_kwargs:
    max_new_tokens: 1024
    temperature: 0.5
    stop: null
  proc_num: 8

Refer to default_task.yaml for more details.
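
Either file can be loaded into the same task_cfg dict shown above before running the task. A minimal sketch (assumes PyYAML is installed; the file names are the defaults referenced above):

import json
import yaml

# Load the task configuration from the YAML file...
with open('default_task.yaml', 'r') as f:
    task_cfg = yaml.safe_load(f)

# ...or, equivalently, from the JSON file.
with open('default_task.json', 'r') as f:
    task_cfg = json.load(f)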

Run Model Inference#

We recommend using vLLM to deploy the model.

Environment:

  • A100 (80GB) x 1

To start vLLM server, run the following command:

CUDA_VISIBLE_DEVICES=0 VLLM_USE_MODELSCOPE=True vllm serve --max-model-len=65536 --gpu_memory_utilization=0.95 --trust-remote-code ZhipuAI/LongWriter-glm4-9b

  • Arguments:

    • max-model-len: The maximum context length of the model (prompt plus generated tokens).

    • gpu_memory_utilization: The fraction of GPU memory that vLLM is allowed to use.

    • trust-remote-code: Whether to trust custom model code from the remote repository.

    • model: Could be a model id on the ModelScope/HuggingFace hub, or a local model dir.

  • Note: Alternatively, you can use multiple GPUs by setting CUDA_VISIBLE_DEVICES=0,1,2,3 and passing a matching --tensor-parallel-size to vllm serve.
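
Once the server is running, you can sanity-check the OpenAI-compatible endpoint that infer_config points at. This is a minimal sketch using the requests library; the prompt and max_tokens value are illustrative only:

import requests

# Send a test chat request to the local vLLM server started above.
resp = requests.post(
    'http://127.0.0.1:8000/v1/chat/completions',
    json={
        'model': 'ZhipuAI/LongWriter-glm4-9b',
        'messages': [{'role': 'user', 'content': 'Write a 1000-word story about a lighthouse.'}],
        'max_tokens': 1024,
        'temperature': 0.5,
    },
)
resp.raise_for_status()
print(resp.json()['choices'][0]['message']['content'])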

Run Evaluation#

from evalscope.third_party.longbench_write import run_task

run_task(task_cfg=task_cfg)

Results and metrics#

See eval_length.jsonl and eval_quality.jsonl in the output_dir (./outputs by default).

  • Metrics:

    • score_l: The average score of the length evaluation.

    • score_q: The average score of the quality evaluation.
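
You can inspect the per-sample records directly from the JSONL files, as in this minimal sketch (the exact location and field names of the records depend on the EvalScope version; the 'score' key below is an assumption to be checked against your actual output):

import json

# Each line of the JSONL output is one evaluation record.
# NOTE: the path and the 'score' field name are assumptions; adjust them
# to match the files actually produced under your output_dir.
with open('./outputs/eval_length.jsonl', 'r') as f:
    records = [json.loads(line) for line in f if line.strip()]

scores = [r['score'] for r in records if 'score' in r]
if scores:
    print(f'average length score (score_l): {sum(scores) / len(scores):.2f}')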