Defining the Data Mixing Schema#
The data mixing schema defines which datasets are used for evaluation and how the data is grouped. This is the first step in the mixed data evaluation process.
Creating the Schema#
An example of a data mixing schema (CollectionSchema) is shown below:
Simple Example
from evalscope.collections import CollectionSchema, DatasetInfo
simple_schema = CollectionSchema(name='reasoning', datasets=[
DatasetInfo(name='arc', weight=1, task_type='reasoning', tags=['en']),
DatasetInfo(name='ceval', weight=1, task_type='reasoning', tags=['zh'], args={'subset_list': ['logic']})
])
Where:
name
is the name of the data mixing schema.datasets
is a list of datasets, where each dataset (DatasetInfo) includes attributes such asname
,weight
,task_type
,tags
, andargs
.name
is the name of the dataset. Supported dataset names can be found in the dataset list.weight
is the weight of the dataset, used for weighted sampling. The default is 1.0, and all data will be normalized during sampling. (The value must be greater than 0)task_type
is the task type of the dataset and can be filled in as needed.tags
are labels for the dataset, which can also be filled in as needed.args
are parameters for the dataset, and the configurable parameters can be found in the dataset parameters.hierarchy
is the hierarchy of the dataset, which is automatically generated by the schema.
Complex Example
complex_schema = CollectionSchema(name='math&reasoning', datasets=[
CollectionSchema(name='math', weight=3, datasets=[
DatasetInfo(name='gsm8k', weight=1, task_type='math', tags=['en']),
DatasetInfo(name='competition_math', weight=1, task_type='math', tags=['en']),
DatasetInfo(name='cmmlu', weight=1, task_type='math', tags=['zh'], args={'subset_list': ['college_mathematics', 'high_school_mathematics']}),
DatasetInfo(name='ceval', weight=1, task_type='math', tags=['zh'], args={'subset_list': ['advanced_mathematics', 'high_school_mathematics', 'discrete_mathematics', 'middle_school_mathematics']}),
]),
CollectionSchema(name='reasoning', weight=1, datasets=[
DatasetInfo(name='arc', weight=1, task_type='reasoning', tags=['en']),
DatasetInfo(name='ceval', weight=1, task_type='reasoning', tags=['zh'], args={'subset_list': ['logic']}),
DatasetInfo(name='race', weight=1, task_type='reasoning', tags=['en']),
]),
])
weight
is the weight of the data mixing schema, used for weighted sampling. The default is 1.0, and all data will be normalized during sampling. (The value must be greater than 0)datasets
can contain CollectionSchema, enabling the nesting of datasets. During evaluation, the name of theCollectionSchema
will be recursively added to the tags of each sample.
Using the Schema#
To view the created schema:
print(simple_schema)
{
"name": "reasoning",
"datasets": [
{
"name": "arc",
"weight": 1,
"task_type": "reasoning",
"tags": [
"en",
"reasoning"
],
"args": {}
},
{
"name": "ceval",
"weight": 1,
"task_type": "reasoning",
"tags": [
"zh",
"reasoning"
],
"args": {
"subset_list": [
"logic"
]
}
}
]
}
To view the flatten result of the schema (automatically normalized weights):
print(complex_schema.flatten())
DatasetInfo(name='gsm8k', weight=0.1875, task_type='math', tags=['en', 'math&reasoning', 'math'], args={})
DatasetInfo(name='competition_math', weight=0.1875, task_type='math', tags=['en', 'math&reasoning', 'math'], args={})
DatasetInfo(name='cmmlu', weight=0.1875, task_type='math', tags=['zh', 'math&reasoning', 'math'], args={'subset_list': ['college_mathematics', 'high_school_mathematics']})
DatasetInfo(name='ceval', weight=0.1875, task_type='math', tags=['zh', 'math&reasoning', 'math'], args={'subset_list': ['advanced_mathematics', 'high_school_mathematics', 'discrete_mathematics', 'middle_school_mathematics']})
DatasetInfo(name='arc', weight=0.08333333333333333, task_type='reasoning', tags=['en', 'math&reasoning', 'reasoning'], args={})
DatasetInfo(name='ceval', weight=0.08333333333333333, task_type='reasoning', tags=['zh', 'math&reasoning', 'reasoning'], args={'subset_list': ['logic']})
DatasetInfo(name='race', weight=0.08333333333333333, task_type='reasoning', tags=['en', 'math&reasoning', 'reasoning'], args={})
To save the schema:
schema.dump_json('outputs/schema.json')
To load the schema from a JSON file:
schema = CollectionSchema.from_json('outputs/schema.json')