Unified Evaluation

Once the sampled data has been generated, it can be evaluated in a single unified run.

Evaluation Configuration

Configure the evaluation task, for example:

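from evalscope import TaskConfig, run_task
from evalscope.constants import EvalType  # needed for EvalType.SERVICE below

task_cfg = TaskConfig(
    model='qwen2.5',
    # Endpoint of the model service under evaluation
    api_url='http://127.0.0.1:8801/v1/chat/completions',
    api_key='EMPTY',
    eval_type=EvalType.SERVICE,
    # The dataset name is fixed to 'data_collection' for mixed datasets
    datasets=['data_collection'],
    dataset_args={'data_collection': {
        # Local path of the sampled mixed dataset
        'local_path': 'outputs/mixed_data.jsonl'
    }},
)
run_task(task_cfg=task_cfg)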

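Note the following about this configuration:

  • The dataset name given in datasets is fixed to data_collection, which indicates that a mixed dataset is being evaluated

  • dataset_args must specify local_path, the local path of the dataset to evaluate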
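Because local_path must point to an existing file, it can help to check the file before starting the evaluation. The following is a minimal sketch using only the Python standard library; count_jsonl_records is a hypothetical helper, not part of evalscope, and it assumes only that the file holds one JSON value per line (it does not validate the record schema).

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Count the non-empty lines in a JSONL file, raising ValueError on invalid JSON.

    Hypothetical helper for illustration; not part of evalscope.
    """
    count = 0
    with Path(path).open(encoding='utf-8') as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f'{path}:{line_no}: invalid JSON ({exc.msg})') from exc
            count += 1
    return count


if __name__ == '__main__':
    # Same path as the local_path used in the configuration above
    data_path = Path('outputs/mixed_data.jsonl')
    if data_path.exists():
        print(f'{data_path}: {count_jsonl_records(data_path)} records')
    else:
        print(f'{data_path} not found; generate the sampled data first')
```

Running the check first surfaces a missing or truncated file immediately, before any requests are sent to the model service.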

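Evaluation Results

By default, the evaluation results are saved under the outputs/ directory and include reports at four levels:

  • subset_level: the average score and sample count for each subset

  • dataset_level: the average score and sample count for each dataset

  • task_level: the average score and sample count for each task

  • tag_level: the average score and sample count for each tag; the schema name is also treated as a tag and appears in the tags column

For example, the evaluation results look like this: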

2024-12-30 20:03:54,582 - evalscope - INFO - subset_level Report:
+-----------+------------------+---------------+---------------+-------+
| task_type |   dataset_name   |  subset_name  | average_score | count |
+-----------+------------------+---------------+---------------+-------+
|   math    | competition_math |    default    |      0.0      |  38   |
| reasoning |       race       |     high      |    0.3704     |  27   |
| reasoning |       race       |    middle     |      0.5      |  12   |
| reasoning |       arc        |   ARC-Easy    |    0.5833     |  12   |
|   math    |      gsm8k       |     main      |    0.1667     |   6   |
| reasoning |       arc        | ARC-Challenge |      0.4      |   5   |
+-----------+------------------+---------------+---------------+-------+
2024-12-30 20:03:54,582 - evalscope - INFO - dataset_level Report:
+-----------+------------------+---------------+-------+
| task_type |   dataset_name   | average_score | count |
+-----------+------------------+---------------+-------+
| reasoning |       race       |    0.4103     |  39   |
|   math    | competition_math |      0.0      |  38   |
| reasoning |       arc        |    0.5294     |  17   |
|   math    |      gsm8k       |    0.1667     |   6   |
+-----------+------------------+---------------+-------+
2024-12-30 20:03:54,582 - evalscope - INFO - task_level Report:
+-----------+---------------+-------+
| task_type | average_score | count |
+-----------+---------------+-------+
| reasoning |    0.4464     |  56   |
|   math    |    0.0227     |  44   |
+-----------+---------------+-------+
2024-12-30 20:03:54,583 - evalscope - INFO - tag_level Report:
+----------------+---------------+-------+
|      tags      | average_score | count |
+----------------+---------------+-------+
|       en       |     0.26      |  100  |
| math&reasoning |     0.26      |  100  |
|   reasoning    |    0.4464     |  56   |
|      math      |    0.0227     |  44   |
+----------------+---------------+-------+
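The four reports are consistent with one another: each higher-level score in this sample equals the count-weighted mean of the subset-level scores beneath it. The sketch below recomputes the dataset-level, task-level, and overall figures from the subset_level rows above. It assumes count-weighted averaging, which reproduces the numbers shown but is not stated as the aggregation method; the helper weighted_average is hypothetical and not part of evalscope.

```python
from collections import defaultdict

# (task_type, dataset_name, subset_name, average_score, count),
# copied from the subset_level report above
SUBSET_ROWS = [
    ('math', 'competition_math', 'default', 0.0, 38),
    ('reasoning', 'race', 'high', 0.3704, 27),
    ('reasoning', 'race', 'middle', 0.5, 12),
    ('reasoning', 'arc', 'ARC-Easy', 0.5833, 12),
    ('math', 'gsm8k', 'main', 0.1667, 6),
    ('reasoning', 'arc', 'ARC-Challenge', 0.4, 5),
]


def weighted_average(rows, key):
    """Group rows by key(row); return {group: (count-weighted mean score, total count)}.

    Assumption: higher-level scores are count-weighted means of subset scores.
    """
    score_sum = defaultdict(float)
    total = defaultdict(int)
    for row in rows:
        _task, _dataset, _subset, score, count = row
        group = key(row)
        score_sum[group] += score * count
        total[group] += count
    return {g: (round(score_sum[g] / total[g], 4), total[g]) for g in total}


dataset_level = weighted_average(SUBSET_ROWS, key=lambda r: r[1])
task_level = weighted_average(SUBSET_ROWS, key=lambda r: r[0])
overall = weighted_average(SUBSET_ROWS, key=lambda r: 'all')

if __name__ == '__main__':
    print('dataset_level:', dataset_level)
    print('task_level:', task_level)
    print('overall:', overall)
```

Under this assumption, race comes out at 0.4103 over 39 samples, reasoning at 0.4464 over 56, and the overall score at 0.26 over 100, matching the dataset_level, task_level, and tag_level reports respectively.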