Defining Your Schema#
In a nutshell: Schema = your capability checklist + relative value (weight) for each capability + optional hierarchical grouping.
Core Concepts & Objects#
CollectionSchema: A nestable grouping node that forms hierarchical structures and reusable modules.
DatasetInfo: The final evaluable dataset entry (leaf node), containing name, weight, and adapter parameters.
flatten(): Expands the entire tree to get a normalized list of leaf datasets (with hierarchical path
hierarchy).
Field Overview#
CollectionSchema:
name: str (required) - Group/index nameweight: float (optional, default 1.0) - Relative weight for the entire subgroup (combined with child nodes, then normalized)datasets: list[DatasetInfo | CollectionSchema]
DatasetInfo:
name: str (required) - Dataset registration nameweight: float (required, > 0)task_type: str (optional) - Custom capability labeltags: list[str] (optional) - Custom tags; parent CollectionSchema names are automatically appendedargs: dict (optional) - Pass-through to dataset adapter (e.g., subset_list, local_path, review_timeout, etc.)
Why Set Weights?#
Sampling Phase:
Under weighted sampling strategy, sample proportion ∝ normalized weight.
Normalization for flat (non-nested) structure: $\( \mathbf{p_i} = \frac{\mathbf{w_i}}{\sum_{j=1}^{n} \mathbf{w_j}} \)$
For multi-level nesting (product of weights along the leaf path, then overall normalization): $\( \mathbf{p_i} = \frac{\prod_{k \in \text{path}(i)} \mathbf{w_k}}{\sum_{j=1}^{n} \prod_{k \in \text{path}(j)} \mathbf{w_k}} \)$
Scoring Phase:
Aggregation uses linear weighting: $\( \mathbf{S} = \sum_{i=1}^{n} \boldsymbol{\alpha_i} \, \mathbf{s_i} \)\( where: \)\( \boldsymbol{\alpha_i} = \mathbf{p_i}, \quad \sum_{i=1}^{n} \boldsymbol{\alpha_i} = 1 \)$
Minimal Python Example#
from evalscope.collections import CollectionSchema, DatasetInfo
simple_schema = CollectionSchema(
name='reasoning_index',
datasets=[
DatasetInfo(name='arc', weight=2.0, task_type='reasoning', tags=['en']),
DatasetInfo(name='ceval', weight=3.0, task_type='reasoning', tags=['zh'], args={'subset_list': ['logic']}),
],
)
Nested & Multi-Capability Example#
from evalscope.collections import CollectionSchema, DatasetInfo
complex_schema = CollectionSchema(name='math_index', datasets=[
CollectionSchema(name='math', weight=3.0, datasets=[
DatasetInfo(name='gsm8k', weight=1.0, task_type='math', tags=['en']),
DatasetInfo(name='aime25', weight=1.0, task_type='math', tags=['en']),
]),
CollectionSchema(name='reasoning', weight=1.0, datasets=[
DatasetInfo(name='arc', weight=1.0, task_type='reasoning', tags=['en']),
DatasetInfo(name='ceval', weight=1.0, task_type='reasoning', tags=['zh'], args={'subset_list': ['logic']}),
]),
])
Hierarchical Effects:
Parent names are appended to each leaf’s tags/hierarchy, supporting multi-level aggregated analysis.
Parent weight scales the entire group, then jointly normalizes with child nodes.
Flatten / Serialize / Restore#
print(simple_schema.flatten()) # Flatten + normalize
print(complex_schema.flatten()) # View normalized weights after nesting
complex_schema.dump_json('outputs/complex_schema.json')
loaded = CollectionSchema.from_json('outputs/complex_schema.json')
Output Example (automatically normalized weights):
>>> [DatasetInfo(name='arc', weight=0.4, task_type='reasoning', tags=['en'], args={}, hierarchy=['reasoning_index']),
DatasetInfo(name='ceval', weight=0.6, task_type='reasoning', tags=['zh'], args={'subset_list': ['logic']}, hierarchy=['reasoning_index'])]
>>> [DatasetInfo(name='gsm8k', weight=0.375, task_type='math', tags=['en'], args={}, hierarchy=['math_index', 'math']),
DatasetInfo(name='aime25', weight=0.375, task_type='math', tags=['en'], args={}, hierarchy=['math_index', 'math']),
DatasetInfo(name='arc', weight=0.125, task_type='reasoning', tags=['en'], args={}, hierarchy=['math_index', 'reasoning']),
DatasetInfo(name='ceval', weight=0.125, task_type='reasoning', tags=['zh'], args={'subset_list': ['logic']}, hierarchy=['math_index', 'reasoning'])]
Common Issues#
Weights must be > 0; directly remove entries with weight = 0.
Deep nesting can easily cause some leaf weights to be “diluted”; verify manually by flattening first.
Too many tags affect downstream visualization dimensions; keep them concise (e.g., en / zh / math / rag).
argsmust match corresponding dataset adapter fields; incorrect fields will be ignored or trigger warnings.Cramming too many capabilities into one Schema hinders iteration; split into multiple versions for comparison.
Reference Cases#
Case A: Enterprise RAG Assistant Index#
Pain Point: High scores on general leaderboards, but inaccurate internal knowledge retrieval.
Approach: Knowledge tasks + long-context retrieval + instruction following.
Configuration Example:
from evalscope.collections import CollectionSchema, DatasetInfo
schema = CollectionSchema(name='rag_index', datasets=[
DatasetInfo(name='chinese_simpleqa', weight=0.3, task_type='knowledge', tags=['rag']),
DatasetInfo(name='aa_lcr', weight=0.3, task_type='long_context', tags=['rag']),
DatasetInfo(name='ifeval', weight=0.4, task_type='instruction_following', tags=['rag']),
])
Case B: Code Completion & Analysis Index#
Pain Point: General leaderboards cannot comprehensively reflect code understanding and generation capabilities.
Approach: Code generation + real issues.
Configuration Example:
from evalscope.collections import CollectionSchema, DatasetInfo
schema = CollectionSchema(name='code_index', datasets=[
DatasetInfo(name='humaneval', weight=2.0, task_type='code', tags=['python']),
DatasetInfo(name='live_code_bench', weight=1.0, task_type='code', tags=['zh', 'finance'], args={'subset_list': ['v5'], 'review_timeout': 6, 'extra_params': {'start_date': '2024-08-01','end_date': '2025-02-28'},}),
])