Sampling Your Index Data#
In a nutshell: Given a defined Collection Schema, sample from multiple datasets with a chosen strategy to produce one mixed JSONL file that can be evaluated in a single run.
Output#
A single JSONL file (mixed samples) for a subsequent evalscope eval run. Each line represents a standardized sample.
Field Description#
| Field | Description |
|---|---|
| index | Sample sequence number (renumbered in the mixed set) |
| prompt | Original input structure (may include question / context / code, etc.) |
| tags | Merged tags (dataset’s own + upper hierarchy) |
| task_type | Task type or capability classification |
| weight | Leaf node normalized weight (for interpretation, not equal to actual sample repetition weight) |
| dataset_name | Original dataset name |
| subset_name | Subset or difficulty label (can be empty) |
Example:
{
  "index": 0,
  "prompt": {"question": "What is the capital of France?"},
  "tags": ["en", "reasoning"],
  "task_type": "question_answering",
  "weight": 1.0,
  "dataset_name": "arc",
  "subset_name": "ARC-Easy"
}
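A consumer of the mixed file may want to confirm that each line carries the documented fields before evaluation. The following is a minimal sketch using only the standard library; the helper name `validate_sample` and the in-memory example line are hypothetical and not part of EvalScope.

```python
import json

# Field names taken from the Field Description table above.
REQUIRED_FIELDS = {
    "index", "prompt", "tags", "task_type",
    "weight", "dataset_name", "subset_name",
}

def validate_sample(line: str) -> dict:
    """Parse one JSONL line; raise ValueError if a documented field is missing."""
    sample = json.loads(line)
    missing = REQUIRED_FIELDS - sample.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return sample

# One line of the mixed JSONL, built here from the documented example record.
example_line = json.dumps({
    "index": 0,
    "prompt": {"question": "What is the capital of France?"},
    "tags": ["en", "reasoning"],
    "task_type": "question_answering",
    "weight": 1.0,
    "dataset_name": "arc",
    "subset_name": "ARC-Easy",
})

sample = validate_sample(example_line)
print(sample["dataset_name"], sample["subset_name"])  # arc ARC-Easy
```

For a real file, the same helper could be applied to each non-empty line read from the JSONL output.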
Three Sampling Strategies (Formulas & Use Cases)#
Given:
Total sample size: \(N\)
Number of datasets: \(K\)
Schema weights: \(w_i\)
Original dataset sample size: \(m_i\)
1. Weighted Sampling#
Formula (expected sample count):
\[
n_i \approx N \cdot \frac{w_i}{\sum_{j=1}^{K} w_j}
\]
Use case: You’ve expressed business priorities through weights and want “high-weight capabilities” to receive a larger share of the samples.
Characteristics: Reinforces your stated priorities; a dataset with a much higher weight dominates the mix in proportion to that weight.
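The weighted formula can be checked with a few lines of arithmetic. This is a sketch of the expected counts only, not EvalScope's sampler; the function name `weighted_expected_counts` is hypothetical, and the weights (arc=2.0, ceval=3.0) are the ones used in this page's schema example.

```python
def weighted_expected_counts(total: int, weights: dict[str, float]) -> dict[str, float]:
    """Expected samples per dataset: n_i = N * w_i / sum_j(w_j)."""
    weight_sum = sum(weights.values())
    return {name: total * w / weight_sum for name, w in weights.items()}

expected = weighted_expected_counts(10, {"arc": 2.0, "ceval": 3.0})
print(expected)  # {'arc': 4.0, 'ceval': 6.0}
```

Because these are expected values, an actual draw may deviate from them slightly.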
2. Stratified Sampling#
Formula:
\[
n_i \approx N \cdot \frac{m_i}{\sum_{j=1}^{K} m_j}
\]
Use case: Preserve original distribution (e.g., let datasets with larger corpus sizes be more representative).
Characteristics: Doesn’t reflect business bias; avoids over-amplifying small datasets. Guarantees at least 1 sample per dataset.
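The interaction between proportional allocation and the at-least-1 guarantee is easier to see in code. The sketch below is one possible way to combine them, under the assumption that any surplus created by the minimum is trimmed from the largest allocation; EvalScope's actual rounding may differ, and the function name `stratified_counts` is hypothetical.

```python
def stratified_counts(total: int, sizes: dict[str, int]) -> dict[str, int]:
    """Allocate roughly N * m_i / sum_j(m_j) samples, with at least 1 per dataset."""
    size_sum = sum(sizes.values())
    # Proportional share, rounded, but never below one sample per dataset.
    counts = {name: max(1, round(total * m / size_sum)) for name, m in sizes.items()}
    # Assumption: the minimum can push the sum above `total`,
    # so trim the surplus from the largest allocations.
    surplus = sum(counts.values()) - total
    while surplus > 0:
        largest = max(counts, key=counts.get)
        if counts[largest] <= 1:
            break  # every dataset is already at the minimum
        counts[largest] -= 1
        surplus -= 1
    return counts

print(stratified_counts(10, {"arc": 2000, "ceval": 10}))  # {'arc': 9, 'ceval': 1}
```

Rounding can also leave the sum slightly below N; this sketch does not top it up, which is consistent with the formula being approximate.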
3. Uniform Sampling#
Formula:
\[
n_i \approx \frac{N}{K}
\]
Use case: You want each capability to have an “equal voice,” which makes it easier to compare models across capabilities.
Characteristics: Ignores original size and weights; makes weaknesses in individual capabilities easier to compare.
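When N is not divisible by K, the uniform formula leaves a remainder to place somewhere. The sketch below assumes the remainder goes to the first datasets in order; that rule and the function name `uniform_counts` are illustrative assumptions, not confirmed EvalScope behavior.

```python
def uniform_counts(total: int, names: list[str]) -> dict[str, int]:
    """Allocate about N / K samples per dataset."""
    base, remainder = divmod(total, len(names))
    # Assumption: the first `remainder` datasets each receive one extra sample.
    return {name: base + (1 if i < remainder else 0) for i, name in enumerate(names)}

print(uniform_counts(10, ["arc", "ceval"]))  # {'arc': 5, 'ceval': 5}
print(uniform_counts(10, ["a", "b", "c"]))   # {'a': 4, 'b': 3, 'c': 3}
```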
Strategy Selection Recommendation:
For “business decision index”: Weighted
For “objective coverage / preserving corpus structure”: Stratified
For “capability alignment / diagnosis”: Uniform
Prerequisite Schema Example#
from evalscope.collections import CollectionSchema, DatasetInfo

schema = CollectionSchema(
    name='reasoning_index',
    datasets=[
        DatasetInfo(name='arc', weight=2.0, task_type='reasoning', tags=['en']),
        DatasetInfo(name='ceval', weight=3.0, task_type='reasoning', tags=['zh'], args={'subset_list': ['logic']}),
    ],
)
Weighted Sampling (Emphasizing Business Importance)#
from evalscope.collections import WeightedSampler
from evalscope.utils.io_utils import dump_jsonl_data
sampler = WeightedSampler(schema)
mixed_data = sampler.sample(10) # N=10
dump_jsonl_data(mixed_data, 'outputs/weighted_mixed_data.jsonl')
Expected: arc : ceval ≈ 2 : 3 → 4 : 6
Stratified Sampling (Preserving Source Data Size Distribution)#
from evalscope.collections import StratifiedSampler
from evalscope.utils.io_utils import dump_jsonl_data
sampler = StratifiedSampler(schema)
mixed_data = sampler.sample(10)
dump_jsonl_data(mixed_data, 'outputs/stratified_mixed_data.jsonl')
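Expected: Allocated approximately in proportion to the original sample sizes (example: if arc=2000 and ceval=10, ceval’s proportional share of 10 samples rounds to 0, so the at-least-1 guarantee applies → roughly 9 : 1)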
Uniform Sampling (Aligning Capability Comparison)#
from evalscope.collections import UniformSampler
from evalscope.utils.io_utils import dump_jsonl_data
sampler = UniformSampler(schema)
mixed_data = sampler.sample(10)
dump_jsonl_data(mixed_data, 'outputs/uniform_mixed_data.jsonl')
Expected: arc : ceval = 5 : 5
Common Issues#
Weights don’t affect Uniform / Stratified allocation (only Weighted uses them).
Dataset with too few samples: Stratified sampling will force at least 1 sample; increase N or switch to Weighted.
The weight field in JSONL is the leaf normalized weight, not equal to the sample’s appearance probability (especially for Stratified and Uniform strategies).
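To see the difference between the weight field and the actual share of samples, one can count records per dataset in the mixed output. The sketch below uses hypothetical in-memory records (5 arc and 5 ceval samples, as a Uniform run with N=10 would be expected to produce) with illustrative weight values of 0.4 and 0.6; the helper name `dataset_shares` is also hypothetical.

```python
from collections import Counter

def dataset_shares(samples: list[dict]) -> dict[str, float]:
    """Fraction of the mixed set contributed by each dataset."""
    counts = Counter(s["dataset_name"] for s in samples)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

# Hypothetical records; the weight values are illustrative only.
samples = (
    [{"dataset_name": "arc", "weight": 0.4}] * 5
    + [{"dataset_name": "ceval", "weight": 0.6}] * 5
)

shares = dataset_shares(samples)
print(shares)  # {'arc': 0.5, 'ceval': 0.5}
```

Here each dataset contributes half of the samples even though the recorded weights differ, which is why the weight field should be read as an interpretation aid rather than a sampling probability.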