Data-Collection#

Overview#

Data-Collection is a framework for mixing samples from multiple evaluation datasets into a single evaluation suite. One run can then assess a model across several benchmarks, using a chosen subset of samples from each.

Task Description#

  • Task Type: Multi-Dataset Unified Evaluation

  • Input: Mixed samples from multiple benchmark datasets

  • Output: Aggregated scores across tasks, datasets, and categories

  • Flexibility: Supports custom dataset collections

Key Features#

  • Mix multiple benchmarks into one evaluation

  • Hierarchical reporting (subset, dataset, task, tag, category levels)

  • Sample-level weighting support

  • Automatic adapter initialization for each dataset

  • Comprehensive aggregation (micro, macro, weighted averages)
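
The three aggregation styles listed above can give different numbers for the same run. The sketch below is a minimal illustration, not EvalScope's implementation: the `results` rows, the dataset names, and the helper functions are all hypothetical, and the library's exact weighting rules may differ.

```python
from collections import defaultdict

# Hypothetical per-sample results as (dataset, sample_weight, score).
# These rows are made up for illustration; they are not EvalScope output.
results = [
    ("gsm8k", 1.0, 1.0),
    ("gsm8k", 1.0, 0.0),
    ("gsm8k", 1.0, 1.0),
    ("gsm8k", 1.0, 1.0),
    ("arc", 3.0, 0.0),
    ("arc", 3.0, 1.0),
]


def micro_average(rows):
    """Mean over all samples, so larger datasets count for more."""
    return sum(score for _, _, score in rows) / len(rows)


def macro_average(rows):
    """Mean of per-dataset means, so every dataset counts equally."""
    by_dataset = defaultdict(list)
    for dataset, _, score in rows:
        by_dataset[dataset].append(score)
    means = [sum(scores) / len(scores) for scores in by_dataset.values()]
    return sum(means) / len(means)


def weighted_average(rows):
    """Mean over samples, with each score scaled by its sample weight."""
    total_weight = sum(weight for _, weight, _ in rows)
    return sum(weight * score for _, weight, score in rows) / total_weight


micro = micro_average(results)
macro = macro_average(results)
weighted = weighted_average(results)
print(f"micro={micro:.4f} macro={macro:.4f} weighted={weighted:.4f}")
```

With these rows the micro average is pulled toward the larger dataset, the macro average treats both datasets equally, and the weighted average shifts toward the dataset whose samples carry more weight.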

Evaluation Notes#

  • The dataset must be pre-compiled as a collection before evaluation

  • Supports various task types (MCQ, QA, coding, etc.)

  • Generates multi-level reports for detailed analysis

  • See the Collection Guide for usage
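
To make "pre-compiled as a collection" more concrete, the sketch below mixes samples from two sources in proportion to per-dataset weights and serializes the result as JSON Lines. It is only an illustration of the idea under stated assumptions: `build_collection`, `to_jsonl`, the record fields, and the dataset names are hypothetical and do not reflect EvalScope's actual collection schema or tooling, which the Collection Guide describes.

```python
import json
import random


def build_collection(pools, weights, total, seed=0):
    """Draw about `total` samples from `pools`, split in proportion to `weights`.

    Hypothetical helper for illustration; not part of EvalScope.
    """
    rng = random.Random(seed)
    weight_sum = sum(weights.values())
    mixed = []
    for name, samples in pools.items():
        # Quota for this dataset, capped by how many samples it actually has.
        quota = min(len(samples), round(total * weights[name] / weight_sum))
        for sample in rng.sample(samples, quota):
            mixed.append({"dataset": name, "weight": weights[name], "sample": sample})
    rng.shuffle(mixed)  # Interleave datasets so a --limit run sees a mix.
    return mixed


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


# Made-up sample pools standing in for real benchmark data.
pools = {
    "math_set": [{"question": f"math question {i}"} for i in range(20)],
    "code_set": [{"question": f"code question {i}"} for i in range(20)],
}
weights = {"math_set": 3, "code_set": 1}

collection = build_collection(pools, weights, total=8)
jsonl_text = to_jsonl(collection)
```

Fixing the random seed keeps the compiled collection reproducible, which matters when the same mixed set is used to compare several models.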

Properties#

Property            Value
Benchmark Name      data_collection
Dataset ID          N/A
Paper               N/A
Tags                Custom
Metrics             acc
Default Shots       0-shot
Evaluation Split    test

Data Statistics#

Statistics not available.

Sample Example#

Sample example not available.

Prompt Template#

No prompt template defined.

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets data_collection \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['data_collection'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)