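Unified Evaluation
Once the sampled data has been generated, it can be evaluated in a single unified run.
Evaluation Configuration
Configure the evaluation task, for example: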
    from evalscope import TaskConfig, run_task
    from evalscope.constants import EvalType

    task_cfg = TaskConfig(
        model='qwen2.5',
        api_url='http://127.0.0.1:8801/v1/chat/completions',
        api_key='EMPTY',
        eval_type=EvalType.SERVICE,  # evaluate a model served behind an API
        datasets=['data_collection'],  # fixed name for a mixed dataset
        dataset_args={'data_collection': {
            'local_path': 'outputs/mixed_data.jsonl'  # local path of the sampled data
        }},
    )
    run_task(task_cfg=task_cfg)
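Note that:
- The dataset name given in datasets is fixed to data_collection, which indicates that a mixed dataset is being evaluated.
- dataset_args must specify local_path, the local path of the evaluation dataset.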
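Because local_path must point to an existing JSONL file, it can help to sanity-check the file before starting a run. The sketch below is not part of EvalScope: check_jsonl is a hypothetical helper that only verifies each non-empty line parses as JSON and makes no assumption about the fields inside each record.

```python
import json
from pathlib import Path


def check_jsonl(path):
    """Count non-empty lines in a JSONL file, raising if a line is not valid JSON.

    Hypothetical helper for illustration; it does not validate the record schema.
    """
    count = 0
    with Path(path).open(encoding='utf-8') as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f'{path}:{line_no} is not valid JSON: {exc}') from exc
            count += 1
    return count


# Example usage (path taken from the configuration above):
# n = check_jsonl('outputs/mixed_data.jsonl')
# print(f'{n} samples ready for evaluation')
```

A missing file surfaces as FileNotFoundError from open, so a wrong path is reported before any model requests are sent.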
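Evaluation Results
Evaluation results are saved under the outputs/ directory by default and include reports at four levels:
- subset_level: the average score and sample count for each subset
- dataset_level: the average score and sample count for each dataset
- task_level: the average score and sample count for each task
- tag_level: the average score and sample count for each tag; the schema name is also treated as a tag and appears in the tags column
For example, the evaluation results look like this: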
2024-12-30 20:03:54,582 - evalscope - INFO - subset_level Report:
+-----------+------------------+---------------+---------------+-------+
| task_type | dataset_name | subset_name | average_score | count |
+-----------+------------------+---------------+---------------+-------+
| math | competition_math | default | 0.0 | 38 |
| reasoning | race | high | 0.3704 | 27 |
| reasoning | race | middle | 0.5 | 12 |
| reasoning | arc | ARC-Easy | 0.5833 | 12 |
| math | gsm8k | main | 0.1667 | 6 |
| reasoning | arc | ARC-Challenge | 0.4 | 5 |
+-----------+------------------+---------------+---------------+-------+
2024-12-30 20:03:54,582 - evalscope - INFO - dataset_level Report:
+-----------+------------------+---------------+-------+
| task_type | dataset_name | average_score | count |
+-----------+------------------+---------------+-------+
| reasoning | race | 0.4103 | 39 |
| math | competition_math | 0.0 | 38 |
| reasoning | arc | 0.5294 | 17 |
| math | gsm8k | 0.1667 | 6 |
+-----------+------------------+---------------+-------+
2024-12-30 20:03:54,582 - evalscope - INFO - task_level Report:
+-----------+---------------+-------+
| task_type | average_score | count |
+-----------+---------------+-------+
| reasoning | 0.4464 | 56 |
| math | 0.0227 | 44 |
+-----------+---------------+-------+
2024-12-30 20:03:54,583 - evalscope - INFO - tag_level Report:
+----------------+---------------+-------+
| tags | average_score | count |
+----------------+---------------+-------+
| en | 0.26 | 100 |
| math&reasoning | 0.26 | 100 |
| reasoning | 0.4464 | 56 |
| math | 0.0227 | 44 |
+----------------+---------------+-------+
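The numbers in the sample report are consistent with each higher level being a count-weighted average of the level below it. This is an inference from the printed values, not a description of EvalScope's internals. A minimal sketch, with the subset_level rows copied from the log above and a hypothetical weighted_rollup helper:

```python
from collections import defaultdict

# (task_type, dataset_name, subset_name, average_score, count),
# copied from the subset_level report above.
SUBSET_ROWS = [
    ('math', 'competition_math', 'default', 0.0, 38),
    ('reasoning', 'race', 'high', 0.3704, 27),
    ('reasoning', 'race', 'middle', 0.5, 12),
    ('reasoning', 'arc', 'ARC-Easy', 0.5833, 12),
    ('math', 'gsm8k', 'main', 0.1667, 6),
    ('reasoning', 'arc', 'ARC-Challenge', 0.4, 5),
]


def weighted_rollup(rows, key_index):
    """Group rows by the column at key_index and return {key: (average, count)}.

    The average is weighted by each row's sample count and rounded to 4 places.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of score * count, total count]
    for row in rows:
        score, count = row[3], row[4]
        bucket = totals[row[key_index]]
        bucket[0] += score * count
        bucket[1] += count
    return {key: (round(total / n, 4), n) for key, (total, n) in totals.items()}


dataset_level = weighted_rollup(SUBSET_ROWS, key_index=1)  # group by dataset_name
task_level = weighted_rollup(SUBSET_ROWS, key_index=0)     # group by task_type
print(dataset_level['race'])  # (0.4103, 39)
print(task_level['math'])     # (0.0227, 44)
```

Weighting by count explains why the math task scores 0.0227 rather than the simple mean of its two datasets: the 38 competition_math samples at 0.0 dominate the 6 gsm8k samples at 0.1667.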