👍 Contribute Benchmark#

EvalScope, the official evaluation tool of ModelScope, is continuously improving its benchmark evaluation features. This tutorial shows how to add your own benchmark and share it with the community. Let’s work together to make EvalScope even better!

This tutorial covers two types of benchmark evaluations: General Text Reasoning and Multiple Choice. Adding either one involves three main steps: uploading the dataset, registering the dataset, and writing the evaluation task.

Basic Concepts#

Tip

You can skip this section and start directly from Preparing Benchmark Evaluation Dataset. Refer back to it when you encounter code you don’t understand.

The evaluation process of EvalScope mainly includes the following steps:

Data Preparation: Load and preprocess the dataset through DataAdapter.
Task Definition: Define the configuration of the evaluation task through TaskConfig, including models, datasets, evaluation metrics, etc.
Evaluation Execution: Execute the evaluation task through the run_task function and output the evaluation results.

Of these, DataAdapter is the core component of benchmark evaluation and the one this tutorial focuses on.

DataAdapter Architecture and Call Flow#

DataAdapter adopts a Pipeline architecture, supporting custom behavior through hook methods. Taking DefaultDataAdapter as an example, the complete evaluation process is as follows:

1. Data Loading Phase
   load_dataset() 
   ├── load() 
   │   ├── load_from_remote() / load_from_disk()
   │   │   ├── load_subsets()
   │   │   │   └── load_subset() / load_fewshot_subset()
   │   │   │       └── record_to_sample() [User Implementation]
   │   │   └── _post_process_samples()
   │   │       └── process_sample_input()
   │   │           ├── sample_to_fewshot() [Optional User Implementation]
   │   │           ├── format_fewshot_template() [Optional User Implementation]
   │   │           └── format_prompt_template() [Optional User Implementation]
   │   └── Returns DatasetDict

2. Model Inference Phase (Per Sample)
   run_inference()
   ├── _on_inference_start() [Hook Method]
   ├── _on_inference() [Hook Method]
   └── _on_inference_end() [Hook Method]
       └── Returns TaskState

3. Metric Calculation Phase (Per Sample)
   calculate_metrics()
   ├── filter_prediction()
   │   └── extract_answer() [Optional User Implementation]
   ├── match_score() / llm_match_score()
   └── Returns SampleScore

4. Result Aggregation Phase
   aggregate_scores()
   └── Returns List[AggScore]

5. Report Generation Phase
   generate_report()
   ├── _on_generate_report() [Hook Method]
   └── _on_generate_report_end() [Hook Method]
       └── Returns Report
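
To make the five phases concrete, here is a toy walk-through in plain Python. It does not use EvalScope: `ToySample`, `fake_model`, the two records, and the simplified function bodies are hypothetical stand-ins that only mirror the order of calls in the flow above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class ToySample:
    """Stand-in for Sample: only the fields this sketch needs."""
    input: str
    target: str


def record_to_sample(record: Dict[str, str]) -> ToySample:
    # Phase 1: turn a raw record into a sample.
    return ToySample(input=record['question'], target=record['answer'])


def fake_model(prompt: str) -> str:
    # Phase 2: hypothetical "model" that only knows one fact.
    return 'ANSWER: 4' if '2 + 2' in prompt else 'ANSWER: 0'


def extract_answer(prediction: str) -> str:
    # Phase 3a: keep only the text after the last 'ANSWER:' marker.
    return prediction.split('ANSWER:')[-1].strip()


def match_score(extracted: str, target: str) -> Dict[str, float]:
    # Phase 3b: exact-match accuracy for one sample.
    return {'acc': 1.0 if extracted == target else 0.0}


def aggregate_scores(scores: List[Dict[str, float]]) -> Dict[str, float]:
    # Phase 4: average the metric over all samples.
    return {'mean_acc': mean(s['acc'] for s in scores)}


records = [
    {'question': 'What is 2 + 2?', 'answer': '4'},
    {'question': 'What is 3 + 5?', 'answer': '8'},
]
samples = [record_to_sample(r) for r in records]
scores = [match_score(extract_answer(fake_model(s.input)), s.target) for s in samples]
report = aggregate_scores(scores)  # Phase 5 would render this as a report
print(report)
```

In a real adapter you would normally write only the pieces marked as user implementations; the base class drives the rest.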

Core Data Structures#

1. Sample Object#

Represents a single evaluation sample, including input, target answer, and metadata:

@dataclass
class Sample:
    input: Any                    # Input content (question text or list of chat messages)
    target: str                   # Target answer (correct answer)
    choices: Optional[List[str]] = None    # Choices (used for multiple choice questions)
    subset_key: Optional[str] = None       # Subset division key (used for grouping by category)
    metadata: Optional[Dict] = None        # Metadata (reasoning process, ID, etc.)
    tools: Optional[List] = None           # Tool call information

2. TaskState Object#

Represents the complete state of a single inference task:

@dataclass
class TaskState:
    model: str                    # Model name
    sample: Sample               # Input sample
    messages: List[ChatMessage]  # Chat message history
    output: ModelOutput          # Model raw output
    completed: bool              # Whether the task is completed
    sample_id: Optional[str] = None      # Sample ID
    group_id: Optional[str] = None       # Group ID
    metadata: Optional[Dict] = None      # Task metadata

3. ModelOutput Object#

Represents the raw output of the model:

@dataclass
class ModelOutput:
    completion: str              # Text generated by the model
    message: ChatMessage         # Formatted chat message
    # Other model-specific fields...

4. Score Object#

Represents the scoring result of a single sample:

@dataclass
class Score:
    value: Dict[str, float]      # Scores for each metric {"acc": 1.0, "f1": 0.8}
    extracted_prediction: str    # Extracted prediction answer
    prediction: str              # Raw prediction text
    metadata: Optional[Dict] = None  # Scoring metadata

5. SampleScore Object#

Encapsulates the complete scoring information of a single sample:

@dataclass
class SampleScore:
    score: Score                 # Scoring object
    sample_id: Optional[str]     # Unique identifier for the sample
    group_id: Optional[str]      # Group identifier
    sample_metadata: Optional[Dict] = None  # Sample metadata

6. AggScore Object#

Represents aggregated scoring statistics:

@dataclass
class AggScore:
    metric: str                  # Metric name
    value: float                 # Aggregated value (e.g., average score)
    subset: str                  # Subset name
    num_samples: int             # Number of samples
    agg_method: str              # Aggregation method (mean, median, etc.)
    metadata: Optional[Dict] = None  # Aggregation metadata
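
As a rough illustration of how per-sample score dictionaries could turn into aggregated records, here is a minimal mean aggregation in plain Python. `ToyAggScore` and `mean_aggregate` are hypothetical names for this sketch and are not part of EvalScope.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyAggScore:
    """Stand-in mirroring the AggScore fields listed above (without metadata)."""
    metric: str
    value: float
    subset: str
    num_samples: int
    agg_method: str


def mean_aggregate(values: List[Dict[str, float]], subset: str) -> List[ToyAggScore]:
    """Average every metric found in the per-sample score dictionaries."""
    by_metric: Dict[str, List[float]] = defaultdict(list)
    for value in values:
        for metric, score in value.items():
            by_metric[metric].append(score)
    return [
        ToyAggScore(metric=m, value=sum(v) / len(v), subset=subset,
                    num_samples=len(v), agg_method='mean')
        for m, v in by_metric.items()
    ]


# Two hypothetical samples, each scored on two metrics.
per_sample = [{'acc': 1.0, 'f1': 0.5}, {'acc': 0.0, 'f1': 1.0}]
agg = mean_aggregate(per_sample, subset='main')
```

Each metric yields one record per subset, which matches the one-row-per-subset layout of the result table shown later in this tutorial.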

7. DatasetDict Object#

Manages multiple dataset subsets:

class DatasetDict(dict):
    """Dataset dictionary, keys are subset names, values are Dataset objects"""
    
    @classmethod
    def from_dataset(cls, dataset, subset_list=None, limit=None, repeats=1):
        """Create a multi-subset dataset dictionary from a single dataset"""
        pass
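
One way to picture what `from_dataset` does is as grouping samples by their `subset_key`. The sketch below is a simplified stand-in, not EvalScope's implementation: `group_by_subset`, the `'default'` fallback name, and the dict-based samples are assumptions, and the `repeats` parameter is left out.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def group_by_subset(
    samples: List[dict],
    subset_list: Optional[List[str]] = None,
    limit: Optional[int] = None,
) -> Dict[str, List[dict]]:
    """Group samples by 'subset_key', optionally keeping only some subsets
    and at most `limit` samples per subset."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for sample in samples:
        key = sample.get('subset_key') or 'default'  # hypothetical fallback name
        if subset_list is not None and key not in subset_list:
            continue  # subset was not requested
        if limit is not None and len(grouped[key]) >= limit:
            continue  # subset is already full
        grouped[key].append(sample)
    return dict(grouped)


samples = [
    {'input': 'q1', 'subset_key': 'math'},
    {'input': 'q2', 'subset_key': 'law'},
    {'input': 'q3', 'subset_key': 'math'},
    {'input': 'q4', 'subset_key': 'math'},
]
subsets = group_by_subset(samples, subset_list=['math'], limit=2)
```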

Core Methods of DataAdapter#

Based on the above call flow, here are the key methods that need to be implemented by the user or can be optionally overridden:

Methods That Must Be Implemented#

record_to_sample(record: Dict[str, Any]) -> Sample

Purpose: Convert raw data records into standard Sample objects
Input: Raw record dictionary from the dataset
Output: Standardized Sample object
Example:

def record_to_sample(self, record: Dict[str, Any]) -> Sample:
    return Sample(
        input=record['question'],
        target=record['answer'],
        metadata={'reasoning': record.get('explanation', '')}
    )
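
To try this mapping without installing EvalScope, the sketch below re-creates it with a minimal stand-in `Sample` dataclass and two hypothetical records. It shows that a record without an `explanation` field still converts cleanly, because `record.get` falls back to an empty string.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Sample:
    """Minimal stand-in for EvalScope's Sample, so the sketch runs on its own."""
    input: Any
    target: str
    metadata: Optional[Dict] = None


def record_to_sample(record: Dict[str, Any]) -> Sample:
    # Same mapping as the example above, written as a plain function.
    return Sample(
        input=record['question'],
        target=record['answer'],
        metadata={'reasoning': record.get('explanation', '')},
    )


# Hypothetical records: the second one has no 'explanation' field.
with_expl = record_to_sample(
    {'question': '1 + 1?', 'answer': '2', 'explanation': 'Add the ones.'}
)
without_expl = record_to_sample({'question': '2 + 3?', 'answer': '5'})
```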

Methods That Can Be Optionally Implemented#

sample_to_fewshot(sample: Sample) -> str
- Purpose: Convert sample into a few-shot example string
- Input: Sample object
- Output: Formatted few-shot example text
- Call Timing: When constructing few-shot prompts
extract_answer(prediction: str, task_state: TaskState) -> str
- Purpose: Extract the final answer from the model’s raw output
- Input: Model prediction text and task state
- Output: Extracted answer string
- Call Timing: Before calculating metrics for answer cleaning
format_prompt_template(sample: Sample) -> str
- Purpose: Format the basic prompt template
- Input: Sample object
- Output: Formatted prompt text
- Default Implementation: Uses prompt_template.format(question=sample.input)
format_fewshot_template(fewshot: str, sample: Sample) -> str
- Purpose: Format the prompt template containing few-shot examples
- Input: Few-shot example string and Sample object
- Output: Complete few-shot prompt
- Default Implementation: Uses few_shot_prompt_template.format()
sample_filter(sample: Sample) -> bool
- Purpose: Filter dataset samples
- Input: Sample object
- Output: Whether to retain the sample
- Default Implementation: Returns True (retains all samples)
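
The following sketch illustrates two of these methods in isolation. For brevity it works on plain strings rather than Sample objects; the template text and the empty-target filter rule are hypothetical, and only the `prompt_template.format(question=...)` call mirrors the documented default.

```python
from typing import List

PROMPT_TEMPLATE = 'Question: {question}\nAnswer:'  # hypothetical template


def format_prompt_template(question: str) -> str:
    # Mirrors the documented default: prompt_template.format(question=sample.input)
    return PROMPT_TEMPLATE.format(question=question)


def sample_filter(target: str) -> bool:
    # Hypothetical override: drop samples whose target is empty.
    return bool(target.strip())


# (question, target) pairs; the second has a blank target and is filtered out.
pairs = [('What is 2 + 2?', '4'), ('Unlabelled question', '  ')]
prompts: List[str] = [format_prompt_template(q) for q, t in pairs if sample_filter(t)]
```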

Hook Method System#

DataAdapter provides a hook method system, supporting custom logic insertion at key points:

Inference Phase Hooks#

_on_inference_start(model, sample): Before inference starts
_on_inference(model, sample): Execute inference
_on_inference_end(model, sample, model_output, output_dir): After inference ends

Report Generation Hooks#

_on_generate_report(scores, model_name): Generate report
_on_generate_report_end(report, output_dir): After report generation
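
The hook system follows the familiar template-method pattern: a fixed driver method calls overridable hooks in order. The toy classes below illustrate that pattern for the inference hooks only; their simplified signatures (a single string in and out) are assumptions and do not match the real EvalScope hooks.

```python
from typing import List


class ToyAdapter:
    """Stand-in base class: run_inference() calls three hooks in a fixed order."""

    def run_inference(self, prompt: str) -> str:
        prompt = self._on_inference_start(prompt)
        output = self._on_inference(prompt)
        return self._on_inference_end(output)

    def _on_inference_start(self, prompt: str) -> str:
        return prompt

    def _on_inference(self, prompt: str) -> str:
        return f'echo: {prompt}'  # hypothetical "model"

    def _on_inference_end(self, output: str) -> str:
        return output


class LoggingAdapter(ToyAdapter):
    """Overrides two hooks and leaves the inference step untouched."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def _on_inference_start(self, prompt: str) -> str:
        self.events.append('start')
        return prompt.strip()

    def _on_inference_end(self, output: str) -> str:
        self.events.append('end')
        return output.upper()


adapter = LoggingAdapter()
result = adapter.run_inference('  hello ')
```

Overriding a hook changes one step while the overall call order stays under the control of the base class.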

Adapter Types#

EvalScope provides two main adapter base classes:

DefaultDataAdapter: Basic adapter for general text reasoning tasks
- Suitable for open-ended question answering, mathematical reasoning, code generation, etc.
- Requires custom answer extraction logic
MultiChoiceAdapter: Specialized adapter for multiple choice tasks
- Inherits from DefaultDataAdapter
- Built-in choice formatting and answer extraction logic
- Supports single-choice and multiple-choice modes

Principles for choosing adapter types:

If the task involves selecting answers from fixed options → Use MultiChoiceAdapter
If the task requires generating open-ended answers → Use DefaultDataAdapter

1. Preparing Benchmark Evaluation Dataset#

You have two ways to prepare the benchmark evaluation dataset:

Upload to ModelScope (Recommended): Upload the dataset to the ModelScope platform so that other users can load it easily and more people benefit from your contribution. To upload, refer to the Dataset Upload Tutorial.
Local Use: Evaluate directly from a local dataset. This suits datasets that are still in development or contain sensitive information.

Regardless of the method chosen, ensure that the data format is correct and can be loaded. If using a local dataset, you can test with the following code:

from modelscope import MsDataset

dataset = MsDataset.load("/path/to/your/dataset")  # Replace with your dataset
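
If your local dataset happens to be stored as JSON Lines (an assumption; your format may differ), a quick standard-library check can catch malformed records before you load the data. `find_bad_lines`, the file name, and the required field names below are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def find_bad_lines(path: Path, required: List[str]) -> List[int]:
    """Return 1-based line numbers that are not valid JSON objects
    or that lack one of the required fields."""
    bad = []
    with path.open(encoding='utf-8') as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad.append(lineno)
                continue
            if not isinstance(record, dict) or any(k not in record for k in required):
                bad.append(lineno)
    return bad


# Hypothetical three-line file: line 2 lacks 'answer', line 3 is not JSON.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / 'test.jsonl'
    data_file.write_text(
        '{"question": "1 + 1?", "answer": "2"}\n'
        '{"question": "2 + 2?"}\n'
        'not json\n',
        encoding='utf-8',
    )
    bad_lines = find_bad_lines(data_file, required=['question', 'answer'])
```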

2. Creating File Structure#

First, fork the EvalScope repository (that is, create your own copy of it) and clone it locally.

git clone https://github.com/your_username/evalscope.git
cd evalscope

Then add your benchmark under the evalscope/benchmarks/ directory, using the following structure:

evalscope/benchmarks/
├── benchmark_name
│   ├── __init__.py
│   ├── benchmark_name_adapter.py
│   └── ...

Specifically for GSM8K and MMLU-Pro, the structure is as follows:

evalscope/benchmarks/
├── gsm8k
│   ├── __init__.py
│   └── gsm8k_adapter.py
├── mmlu_pro
│   ├── __init__.py
│   ├── mmlu_pro_adapter.py
│   └── ...

3. Writing Evaluation Logic#

Below, we will take GSM8K and MMLU-Pro as examples to introduce two types of evaluation tasks: General Text Reasoning and Multiple Choice.

General Text Reasoning#

General text reasoning tasks usually require the model to analyze and reason about a given problem and then generate an answer. We use GSM8K (mathematical reasoning) as the example.

We need to register the benchmark and implement the GSM8KAdapter class in gsm8k_adapter.py:

from typing import Any, Dict
from evalscope.api.benchmark import BenchmarkMeta, DefaultDataAdapter
from evalscope.api.dataset import Sample
from evalscope.api.evaluator import TaskState
from evalscope.api.registry import register_benchmark
from evalscope.constants import Tags

# Define prompt template
PROMPT_TEMPLATE = """
Solve the following math problem step by step. The last line of your response should be of the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer to the problem.

{question}

Remember to put your answer on its own line at the end in the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer to the problem, and you do not need to use a \\boxed command.

Reasoning:
""".lstrip()

# Register benchmark evaluation
@register_benchmark(
    BenchmarkMeta(
        name='gsm8k',                          # Benchmark test name
        pretty_name='GSM8K',                   # Readable name
        dataset_id='AI-ModelScope/gsm8k',      # Dataset ID or local path
        tags=[Tags.MATH, Tags.REASONING],      # Tags
        description='GSM8K (Grade School Math 8K) is a dataset of grade school math problems, designed to evaluate the mathematical reasoning abilities of AI models.',
        subset_list=['main'],                  # Subset list
        few_shot_num=4,                       # Few-shot example number
        train_split='train',                  # Training set split name
        eval_split='test',                    # Evaluation set split name
        metric_list=['acc'],                  # Evaluation metrics
        prompt_template=PROMPT_TEMPLATE,      # Prompt template
    )
)
class GSM8KAdapter(DefaultDataAdapter):
    
    def record_to_sample(self, record: Dict[str, Any]) -> Sample:
        """Convert raw data records into Sample objects"""
        DELIM = '####'
        question = record['question']
        answer = record['answer'].split(DELIM)
        target = answer.pop().strip()  # Extract final answer
        reasoning = DELIM.join(answer)  # Extract reasoning process
        
        return Sample(
            input=question,
            target=target,
            metadata={'reasoning': reasoning.strip()}
        )
    
    def sample_to_fewshot(self, sample: Sample) -> str:
        """Convert sample into few-shot example"""
        if sample.metadata:
            return (
                f'{sample.input}\n\nReasoning:\n' + 
                f"{sample.metadata['reasoning']}\n\n" + 
                f'ANSWER: {sample.target}'
            )
        else:
            return ''
    
    def extract_answer(self, prediction: str, task_state: TaskState) -> str:
        """Extract answer from model prediction"""
        from evalscope.filters.extraction import RegexFilter
        
        # Use regular expression to extract numeric answer
        regex = RegexFilter(regex_pattern=r'(-?[0-9.,]{2,})|(-?[0-9]+)', group_select=-1)
        res = regex(prediction)
        return res.replace(',', '').replace('+', '').strip().strip('.')
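
To see what the answer handling in this adapter does without running EvalScope, the sketch below approximates it with the standard `re` module. It assumes that `group_select=-1` means "take the last match"; `extract_last_number`, `split_gsm8k_answer`, and the empty-string fallback are hypothetical helpers, not part of EvalScope.

```python
import re
from typing import Tuple

# Same pattern as in the adapter above.
NUMBER_PATTERN = re.compile(r'(-?[0-9.,]{2,})|(-?[0-9]+)')


def extract_last_number(prediction: str) -> str:
    """Return the last number-like token, without commas or a trailing period."""
    matches = [m.group(0) for m in NUMBER_PATTERN.finditer(prediction)]
    if not matches:
        return ''  # hypothetical fallback when no number is found
    return matches[-1].replace(',', '').replace('+', '').strip().strip('.')


def split_gsm8k_answer(answer: str, delim: str = '####') -> Tuple[str, str]:
    """Split a GSM8K-style answer into (reasoning, final target)."""
    parts = answer.split(delim)
    target = parts.pop().strip()
    return delim.join(parts).strip(), target


prediction = 'She keeps 18 - 3 = 15 eggs.\nANSWER: 1,234.'
extracted = extract_last_number(prediction)
reasoning, target = split_gsm8k_answer('16 - 3 - 4 = 9 eggs.\n9 * 2 = 18\n#### 18')
```

Taking the last match matters because the reasoning text usually contains intermediate numbers before the final answer line.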

Multiple Choice#

Multiple choice tasks require the model to select the correct answer from the given options. Taking MMLU-Pro as an example, we need to inherit from MultiChoiceAdapter:

from typing import Any, Dict
from evalscope.api.benchmark import BenchmarkMeta, MultiChoiceAdapter
from evalscope.api.dataset import Sample
from evalscope.api.registry import register_benchmark
from evalscope.constants import Tags

# Define prompt template
USER_PROMPT_TEMPLATE = """Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}. Think step by step before answering.

Question:
{question}
Options:
{choices}
""".lstrip()

SUBSET_LIST = [
    'computer science', 'math', 'chemistry', 'engineering', 'law', 'biology', 
    'health', 'physics', 'business', 'philosophy', 'economics', 'other', 
    'psychology', 'history'
]

@register_benchmark(
    BenchmarkMeta(
        name='mmlu_pro',
        pretty_name='MMLU-Pro',
        tags=[Tags.MULTIPLE_CHOICE, Tags.KNOWLEDGE],
        description='MMLU-Pro is a benchmark for evaluating language models on multiple-choice questions across various subjects.',
        dataset_id='modelscope/MMLU-Pro',
        subset_list=SUBSET_LIST,
        metric_list=['acc'],
        few_shot_num=5,
        train_split='validation',
        eval_split='test',
        prompt_template=USER_PROMPT_TEMPLATE,
    )
)
class MMLUProAdapter(MultiChoiceAdapter):
    
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.reformat_subset = True  # Enable subset division
    
    def record_to_sample(self, record: Dict[str, Any]) -> Sample:
        """Convert raw data records into Sample objects"""
        return Sample(
            input=record['question'],
            choices=record['options'],      # Choice list
            target=record['answer'],        # Correct answer (e.g., 'A')
            subset_key=record['category'].lower(),  # Key for subset division
            metadata={
                'cot_content': record['cot_content'],
                'subject': record['category'].lower(),
                'question_id': record['question_id'],
            },
        )
    
    def sample_to_fewshot(self, sample: Sample) -> str:
        """Convert sample into few-shot example"""
        q_str = f"""Question:\n{str(sample.input)}"""
        options = sample.choices if sample.choices is not None else []
        
        # Format choices
        opt_str_list = []
        for i, opt in enumerate(options):
            opt_str_list.append(f"""{chr(65 + i)} {opt}""")
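        # Join outside the f-string: a backslash inside an f-string
        # expression is a SyntaxError before Python 3.12
        opt_str = 'Options:\n' + '\n'.join(opt_str_list)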
        
        # Handle answer and reasoning process
        ans_str = sample.metadata['cot_content'] if sample.metadata is not None else ''
        ans_str = ans_str.replace('The answer is', 'ANSWER:')
        ans_opt = ans_str.split('ANSWER:')[-1].split('.')[0].strip().strip('(').strip(')')
        ans_str = ans_str.replace(f'ANSWER: ({ans_opt})', f'ANSWER: {ans_opt}')
        
        final_str = '\n'.join([q_str, opt_str, ans_str])
        return final_str
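
The string handling in `sample_to_fewshot` is easier to follow on a concrete input. The sketch below copies the same replace and split steps into standalone functions; `normalize_cot_answer`, `format_options`, and the sample text are hypothetical names and data used only for illustration.

```python
from typing import List


def normalize_cot_answer(cot_content: str) -> str:
    """Rewrite 'The answer is (X).' into 'ANSWER: X.' using the steps above."""
    ans_str = cot_content.replace('The answer is', 'ANSWER:')
    # Take the text after the last 'ANSWER:', up to the first period,
    # then drop the surrounding parentheses.
    ans_opt = ans_str.split('ANSWER:')[-1].split('.')[0].strip().strip('(').strip(')')
    return ans_str.replace(f'ANSWER: ({ans_opt})', f'ANSWER: {ans_opt}')


def format_options(options: List[str]) -> str:
    """Label options with consecutive letters starting at 'A'."""
    return '\n'.join(f'{chr(65 + i)} {opt}' for i, opt in enumerate(options))


cot = "Let's think step by step. 2 + 2 = 4. The answer is (B)."
normalized = normalize_cot_answer(cot)
options_block = format_options(['3', '4', '5'])
```

The rewritten answer line uses the same `ANSWER: $LETTER` form that the prompt template asks the model to produce, so few-shot examples and expected output stay consistent.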

Key Differences Explanation#

General Text Reasoning vs Multiple Choice:

Inherited Base Class:
- General Text Reasoning: Inherits DefaultDataAdapter
- Multiple Choice: Inherits MultiChoiceAdapter
Sample Object Structure:
- General Text Reasoning: Mainly includes input and target
- Multiple Choice: Additionally includes choices (choice list)
Answer Extraction Method:
- General Text Reasoning: Requires custom extract_answer() method
- Multiple Choice: MultiChoiceAdapter provides standard answer extraction logic
Prompt Template:
- General Text Reasoning: Focuses more on guiding the reasoning process
- Multiple Choice: Focuses on displaying choices and answer format

4. Running Evaluation#

Run a small debug evaluation to check that the benchmark works as expected.

GSM8K Example:

from evalscope import run_task, TaskConfig

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['gsm8k'],
    limit=10,
    debug=True
)
run_task(task_cfg=task_cfg)

MMLU-Pro Example:

from evalscope import run_task, TaskConfig

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['mmlu_pro'],
    limit=10,
    dataset_args={'mmlu_pro': {'subset_list': ['computer science', 'math']}},
    debug=True
)
run_task(task_cfg=task_cfg)

Output Example:

+-----------------------+-----------+-----------------+------------------+-------+---------+---------+
| Model                 | Dataset   | Metric          | Subset           |   Num |   Score | Cat.0   |
+=======================+===========+=================+==================+=======+=========+=========+
| Qwen2.5-0.5B-Instruct | gsm8k     | mean_acc        | main             |    10 |     0.3 | default |
+-----------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen2.5-0.5B-Instruct | mmlu_pro  | mean_acc        | computer science |    10 |     0.1 | default |
+-----------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen2.5-0.5B-Instruct | mmlu_pro  | mean_acc        | math             |    10 |     0.1 | default |
+-----------------------+-----------+-----------------+------------------+-------+---------+---------+

5. Benchmark Evaluation Document Generation#

After implementing the benchmark evaluation, you can use the tools provided by EvalScope to generate its documentation in the standard format. This keeps the documentation consistent and easy for other users to understand and use.

To generate the Chinese and English documents, run the following commands, which build them from the registration information:

pip install -e '.[docs]'
make docs

6. Submitting PR#

After completing the implementation of these methods and document generation, your benchmark evaluation is ready! You can submit a PR. Before submitting, please run the following command, which will automatically format the code:

make lint

Once there are no formatting issues, we will merge your contribution as soon as possible so that more users can use the benchmark evaluation you contributed. If you don’t know how to submit a PR, you can check our Guide. Give it a try 🚀