Sampling Your Index Data#

In a nutshell: Given a defined Collection Schema, sample from multiple datasets with a chosen strategy and write the result to a single mixed JSONL file, so that all datasets can be evaluated together in one run.

Output#

A single JSONL file of mixed samples, used as input for the subsequent evalscope eval.
Each line is one standardized sample.

Field Description#

Field	Description
index	Sample sequence number (renumbered in the mixed set)
prompt	Original input structure (may include question / context / code, etc.)
tags	Merged tags (the dataset’s own tags plus those inherited from higher levels of the schema hierarchy)
task_type	Task type or capability classification
weight	Normalized weight of the leaf node (for interpretation only; not the actual frequency with which the sample is drawn)
dataset_name	Original dataset name
subset_name	Subset or difficulty label (can be empty)

Example:

{
  "index": 0,
  "prompt": {"question": "What is the capital of France?"},
  "tags": ["en", "reasoning"],
  "task_type": "question_answering",
  "weight": 1.0,
  "dataset_name": "arc",
  "subset_name": "ARC-Easy"
}
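
A quick way to sanity-check a mixed file is to count samples per dataset. The following is a minimal sketch using only the Python standard library; the temporary file path and the second record are made up for illustration and simply follow the field layout described above.

```python
import json
import os
import tempfile
from collections import Counter

# Two illustrative records that follow the documented field layout (made-up data).
records = [
    {"index": 0, "prompt": {"question": "What is the capital of France?"},
     "tags": ["en", "reasoning"], "task_type": "question_answering",
     "weight": 1.0, "dataset_name": "arc", "subset_name": "ARC-Easy"},
    {"index": 1, "prompt": {"question": "1 + 1 = ?"},
     "tags": ["zh", "reasoning"], "task_type": "reasoning",
     "weight": 1.0, "dataset_name": "ceval", "subset_name": "logic"},
]


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write the records to a temporary JSONL file, then read them back.
path = os.path.join(tempfile.mkdtemp(), "mixed_data.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

samples = load_jsonl(path)
per_dataset = Counter(s["dataset_name"] for s in samples)
print(dict(per_dataset))
```

On a real mixed file, the same `Counter` gives the per-dataset allocation that the chosen strategy actually produced.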

Three Sampling Strategies (Formulas & Use Cases)#

Given:

Total sample size: $N$
Number of datasets: $K$
Schema weights: $w_i$
Original dataset sample size: $m_i$

1. Weighted Sampling#

Formula (expected sample count):
$$ n_i \approx N \cdot \frac{w_i}{\sum_{j=1}^{K} w_j} $$
Use case: You’ve expressed business priorities through weights and want high-weight capabilities to take a larger share of the samples.
Characteristics: Reinforces the priorities encoded in the weights; a dataset with a much higher weight dominates the mix proportionally.
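
The weighted formula can be checked by hand with a few lines of Python. This is only a sketch of the arithmetic, not the library's implementation; `weighted_allocation` is a hypothetical helper, and the weights 2.0 and 3.0 are the ones used in the schema example later in this section.

```python
def weighted_allocation(total, weights):
    """Expected per-dataset counts: N * w_i / sum(w)."""
    weight_sum = sum(weights.values())
    return {name: total * w / weight_sum for name, w in weights.items()}


expected = weighted_allocation(10, {"arc": 2.0, "ceval": 3.0})
print(expected)  # {'arc': 4.0, 'ceval': 6.0}
```

These are expected values; how fractional counts are rounded to whole samples is left to the sampler.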

2. Stratified Sampling#

Formula:
$$ n_i \approx N \cdot \frac{m_i}{\sum_{j=1}^{K} m_j} $$
Use case: Preserve the original distribution (e.g., let datasets with more samples be represented proportionally more).
Characteristics: Doesn’t reflect business bias; avoids over-amplifying small datasets. Guarantees at least 1 sample per dataset.
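
To see how the proportional formula interacts with the one-sample minimum, here is a small sketch. `stratified_allocation` is a hypothetical helper; flooring the shares and handing any leftover to the largest dataset are assumptions made for this illustration, and the library's actual rounding may differ.

```python
import math


def stratified_allocation(total, sizes, minimum=1):
    """Proportional shares N * m_i / sum(m), floored, with at least `minimum` each.

    Any leftover (positive or negative) is applied to the largest dataset,
    without letting it drop below `minimum`. This rule is an assumption of
    the sketch, not necessarily what the library does.
    """
    size_sum = sum(sizes.values())
    counts = {name: max(minimum, math.floor(total * m / size_sum))
              for name, m in sizes.items()}
    leftover = total - sum(counts.values())
    largest = max(sizes, key=sizes.get)
    counts[largest] = max(minimum, counts[largest] + leftover)
    return counts


counts = stratified_allocation(10, {"arc": 2000, "ceval": 10})
print(counts)  # {'arc': 9, 'ceval': 1}
```

Without the minimum, ceval's share (10 × 10 / 2010 ≈ 0.05) would round down to zero samples.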

3. Uniform Sampling#

Formula:
$$ n_i \approx \frac{N}{K} $$
Use case: You want each capability to have an “equal voice,” which makes it easier to compare models across capabilities.
Characteristics: Ignores original size and weights; makes it easier to spot weaknesses in individual capabilities.
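
The uniform split is simple enough to sketch directly. `uniform_allocation` is a hypothetical helper, and giving the remainder to the first datasets when N is not divisible by K is an assumption of this sketch; the formula itself only says each dataset gets about N / K.

```python
def uniform_allocation(total, names):
    """Split N as evenly as possible; the first (N mod K) datasets get one extra."""
    base, extra = divmod(total, len(names))
    return {name: base + (1 if i < extra else 0) for i, name in enumerate(names)}


print(uniform_allocation(10, ["arc", "ceval"]))  # {'arc': 5, 'ceval': 5}
print(uniform_allocation(10, ["a", "b", "c"]))   # {'a': 4, 'b': 3, 'c': 3}
```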

Strategy Selection Recommendation:

For “business decision index”: Weighted
For “objective coverage / preserving corpus structure”: Stratified
For “capability alignment / diagnosis”: Uniform

Prerequisite Schema Example#

from evalscope.collections import CollectionSchema, DatasetInfo  

schema = CollectionSchema(
    name='reasoning_index',
    datasets=[
        DatasetInfo(name='arc', weight=2.0, task_type='reasoning', tags=['en']),
        DatasetInfo(name='ceval', weight=3.0, task_type='reasoning', tags=['zh'], args={'subset_list': ['logic']}),
    ],
)

Weighted Sampling (Emphasizing Business Importance)#

from evalscope.collections import WeightedSampler
from evalscope.utils.io_utils import dump_jsonl_data

# `schema` is the CollectionSchema from the prerequisite schema example.
sampler = WeightedSampler(schema)
mixed_data = sampler.sample(10)  # N=10
dump_jsonl_data(mixed_data, 'outputs/weighted_mixed_data.jsonl')

Expected: arc : ceval ≈ 2 : 3, i.e. about 4 : 6 samples for N=10

Stratified Sampling (Preserving Source Data Size Distribution)#

from evalscope.collections import StratifiedSampler
from evalscope.utils.io_utils import dump_jsonl_data

# `schema` is the CollectionSchema from the prerequisite schema example.
sampler = StratifiedSampler(schema)
mixed_data = sampler.sample(10)  # N=10
dump_jsonl_data(mixed_data, 'outputs/stratified_mixed_data.jsonl')

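Expected: Allocation roughly proportional to the original sample sizes. For example, with arc=2000 and ceval=10, ceval’s proportional share of 10 samples rounds down to 0, but the minimum of 1 sample per dataset brings the split to roughly 9 : 1.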

Uniform Sampling (Aligning Capability Comparison)#

from evalscope.collections import UniformSampler
from evalscope.utils.io_utils import dump_jsonl_data

# `schema` is the CollectionSchema from the prerequisite schema example.
sampler = UniformSampler(schema)
mixed_data = sampler.sample(10)  # N=10
dump_jsonl_data(mixed_data, 'outputs/uniform_mixed_data.jsonl')

Expected: arc : ceval = 5 : 5

Common Issues#

Weights don’t affect Uniform / Stratified allocation (only Weighted uses them).
A dataset has too few samples: Stratified sampling still forces at least 1 sample from it; to give it more, increase N or switch to Weighted.
The weight field in the JSONL is the normalized leaf weight; it is not the probability that a sample appears in the mixed set (especially under the Stratified and Uniform strategies).
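
The last point can be illustrated by comparing the weight field with the actual share of samples. The records below are hypothetical: they mimic a Uniform allocation for N=10 and assume normalized weights of 0.4 and 0.6, which are illustrative values rather than output from a real run.

```python
from collections import Counter

# Hypothetical mixed set: 5 samples per dataset (as Uniform would allocate for N=10),
# with illustrative normalized weights 0.4 and 0.6 in the weight field.
samples = (
    [{"dataset_name": "arc", "weight": 0.4} for _ in range(5)]
    + [{"dataset_name": "ceval", "weight": 0.6} for _ in range(5)]
)

counts = Counter(s["dataset_name"] for s in samples)
weights = {s["dataset_name"]: s["weight"] for s in samples}

# Compare the recorded weight with the fraction of samples actually present.
for name, n in counts.items():
    share = n / len(samples)
    print(f"{name}: weight field={weights[name]}, actual share={share}")
```

Here each dataset makes up half of the mixed set even though the weight fields differ, which is why the field should be read as schema metadata rather than as a sampling frequency.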