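Defining a data mixing schema
A data mixing schema defines which datasets are used for evaluation and how they are grouped. Defining it is the first step of a data mixing evaluation.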
Creating a schema
An example of a data mixing schema (CollectionSchema) is shown below:
Simple example
from evalscope.collections import CollectionSchema, DatasetInfo

simple_schema = CollectionSchema(name='reasoning', datasets=[
    DatasetInfo(name='arc', weight=1, task_type='reasoning', tags=['en']),
    DatasetInfo(name='ceval', weight=1, task_type='reasoning', tags=['zh'], args={'subset_list': ['logic']})
])
Where:
name is the name of the data mixing schema.
datasets is the list of datasets. Each dataset (DatasetInfo) has the attributes name, weight, task_type, tags, and args.
Complex example
complex_schema = CollectionSchema(name='math&reasoning', datasets=[
    CollectionSchema(name='math', weight=3, datasets=[
        DatasetInfo(name='gsm8k', weight=1, task_type='math', tags=['en']),
        DatasetInfo(name='competition_math', weight=1, task_type='math', tags=['en']),
        DatasetInfo(name='cmmlu', weight=1, task_type='math_examination', tags=['zh'], args={'subset_list': ['college_mathematics', 'high_school_mathematics']}),
        DatasetInfo(name='ceval', weight=1, task_type='math_examination', tags=['zh'], args={'subset_list': ['advanced_mathematics', 'high_school_mathematics', 'discrete_mathematics', 'middle_school_mathematics']}),
    ]),
    CollectionSchema(name='reasoning', weight=1, datasets=[
        DatasetInfo(name='arc', weight=1, task_type='reasoning', tags=['en']),
        DatasetInfo(name='ceval', weight=1, task_type='reasoning_examination', tags=['zh'], args={'subset_list': ['logic']}),
        DatasetInfo(name='race', weight=1, task_type='reasoning', tags=['en']),
    ]),
])
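weight is the weight of a data mixing schema. It is a float used for weighted sampling and defaults to 1.0. Weights must be greater than 0, and they are normalized across all data at sampling time.
datasets may itself contain CollectionSchema objects, which allows datasets to be nested. During evaluation, the name of each enclosing CollectionSchema is recursively added to the tags of every sample.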
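The normalization and tag propagation described above can be sketched in plain Python. This is only an illustration of the arithmetic, not the evalscope implementation: plain dicts stand in for CollectionSchema and DatasetInfo, and flatten_weights is a hypothetical helper. It assumes weights are normalized among siblings at each nesting level and then multiplied down the tree.

```python
# Illustrative sketch only: plain dicts stand in for CollectionSchema / DatasetInfo,
# and flatten_weights is a hypothetical helper, not part of evalscope.

def flatten_weights(node, parent_weight=1.0, parent_tags=()):
    """Return (name, weight, tags) leaves, normalizing weights among siblings."""
    # The enclosing schema's name is appended to the tags inherited so far.
    tags = list(parent_tags) + [node['name']]
    children = node['datasets']
    total = sum(child['weight'] for child in children)
    leaves = []
    for child in children:
        # Share of the parent's weight that this child receives.
        share = parent_weight * child['weight'] / total
        if 'datasets' in child:
            # Nested schema: recurse with the accumulated weight and tags.
            leaves.extend(flatten_weights(child, share, tags))
        else:
            leaves.append((child['name'], share, child.get('tags', []) + tags))
    return leaves


schema = {
    'name': 'math&reasoning',
    'datasets': [
        {'name': 'math', 'weight': 3, 'datasets': [
            {'name': 'gsm8k', 'weight': 1, 'tags': ['en']},
            {'name': 'competition_math', 'weight': 1, 'tags': ['en']},
            {'name': 'cmmlu', 'weight': 1, 'tags': ['zh']},
            {'name': 'ceval', 'weight': 1, 'tags': ['zh']},
        ]},
        {'name': 'reasoning', 'weight': 1, 'datasets': [
            {'name': 'arc', 'weight': 1, 'tags': ['en']},
            {'name': 'ceval', 'weight': 1, 'tags': ['zh']},
            {'name': 'race', 'weight': 1, 'tags': ['en']},
        ]},
    ],
}

leaves = flatten_weights(schema)
for name, weight, tags in leaves:
    print(name, round(weight, 4), tags)
```

Under these assumptions the math group receives 3/4 of the total weight, so each of its four datasets ends up with 0.75 × 0.25 = 0.1875, and each of the three reasoning datasets receives 0.25 / 3.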
Using a schema
View the schema you created:
print(simple_schema)
{
    "name": "reasoning",
    "datasets": [
        {
            "name": "arc",
            "weight": 1,
            "task_type": "reasoning",
            "tags": [
                "en",
                "reasoning"
            ],
            "args": {}
        },
        {
            "name": "ceval",
            "weight": 1,
            "task_type": "reasoning",
            "tags": [
                "zh",
                "reasoning"
            ],
            "args": {
                "subset_list": [
                    "logic"
                ]
            }
        }
    ]
}
View the flattened schema (weights are normalized automatically):
print(complex_schema.flatten())
DatasetInfo(name='gsm8k', weight=0.1875, task_type='math', tags=['en', 'math&reasoning', 'math'], args={})
DatasetInfo(name='competition_math', weight=0.1875, task_type='math', tags=['en', 'math&reasoning', 'math'], args={})
DatasetInfo(name='cmmlu', weight=0.1875, task_type='math', tags=['zh', 'math&reasoning', 'math'], args={'subset_list': ['college_mathematics', 'high_school_mathematics']})
DatasetInfo(name='ceval', weight=0.1875, task_type='math', tags=['zh', 'math&reasoning', 'math'], args={'subset_list': ['advanced_mathematics', 'high_school_mathematics', 'discrete_mathematics', 'middle_school_mathematics']})
DatasetInfo(name='arc', weight=0.08333333333333333, task_type='reasoning', tags=['en', 'math&reasoning', 'reasoning'], args={})
DatasetInfo(name='ceval', weight=0.08333333333333333, task_type='reasoning', tags=['zh', 'math&reasoning', 'reasoning'], args={'subset_list': ['logic']})
DatasetInfo(name='race', weight=0.08333333333333333, task_type='reasoning', tags=['en', 'math&reasoning', 'reasoning'], args={})
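To illustrate what weighted sampling with these normalized weights means, here is a minimal sketch using only the Python standard library. It does not show how evalscope samples internally. The sample_counts helper, the sample budget of 1000, and the disambiguated ceval keys are all assumptions made for this example.

```python
import random

# Weights copied from the flatten() output above. The two 'ceval' entries are
# given distinct keys here only so they can live in one dict (an assumption
# of this sketch, not an evalscope naming convention).
weights = {
    'gsm8k': 0.1875,
    'competition_math': 0.1875,
    'cmmlu': 0.1875,
    'ceval(math)': 0.1875,
    'arc': 1 / 12,
    'ceval(logic)': 1 / 12,
    'race': 1 / 12,
}


def sample_counts(weights, n, seed=0):
    """Draw n dataset names in proportion to their weights and count the draws."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same draw
    names = list(weights)
    picks = rng.choices(names, weights=[weights[k] for k in names], k=n)
    return {name: picks.count(name) for name in names}


counts = sample_counts(weights, n=1000)
print(counts)
```

With a budget of 1000 samples, each math dataset is expected to contribute about 188 samples and each reasoning dataset about 83. The actual counts from a random draw vary around those values.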
Save the schema:
simple_schema.dump_json('outputs/schema.json')
Load a schema from a JSON file:
schema = CollectionSchema.from_json('outputs/schema.json')