πŸ‘ Contribute Benchmark#

EvalScope, the official evaluation tool of ModelScope, is continuously improving its benchmark evaluation features! You are invited to follow this tutorial to add your own benchmark evaluation and share your contribution with the community. Let's work together to make EvalScope even better!

Below, using MMLU-Pro as an example, we walk through how to add a benchmark evaluation in three main steps: uploading the dataset, registering the benchmark, and writing the evaluation logic.

Upload Benchmark Evaluation Dataset#

Upload the benchmark dataset to ModelScope so that users can load it with a single call and more people can benefit from it. If the dataset already exists on ModelScope, you can skip this step.

See also

For example: modelscope/MMLU-Pro. For details, refer to the dataset upload tutorial.

Make sure the dataset can be loaded via ModelScope by testing it with the following code:

from modelscope import MsDataset

dataset = MsDataset.load("modelscope/MMLU-Pro")  # Replace with your dataset
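
To confirm that the fields the adapter will rely on are present (for MMLU-Pro: question, options, answer, cot_content, and category), you can also print a sample record. This is a minimal sketch that assumes the dataset exposes a test split:

from modelscope import MsDataset

# Minimal sketch (assumes a 'test' split); inspect one record to confirm the
# fields the adapter will use: question, options, answer, cot_content, category
test_data = MsDataset.load("modelscope/MMLU-Pro", split="test")
print(next(iter(test_data)))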

Register Benchmark Evaluation#

Add the benchmark evaluation in EvalScope.

Create File Structure#

First, fork the EvalScope repository to create a personal copy of it, and then clone it locally.

Then, add the benchmark evaluation under the evalscope/benchmarks/ directory, with the following structure:

evalscope/benchmarks/
β”œβ”€β”€ benchmark_name
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ benchmark_name_adapter.py
β”‚   └── ...

For MMLU-Pro, the structure is as follows:

evalscope/benchmarks/
β”œβ”€β”€ mmlu_pro
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ mmlu_pro_adapter.py
β”‚   └── ...
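
The contents of __init__.py depend on the repository's conventions; in many cases it simply imports the adapter module so that the @Benchmark.register decorator (introduced below) runs when the package is imported. The following is a minimal sketch under that assumption; mirror an existing benchmark package in the repository if its convention differs:

# evalscope/benchmarks/mmlu_pro/__init__.py (minimal sketch; follow the
# convention used by the existing benchmark packages in the repository)
from evalscope.benchmarks.mmlu_pro.mmlu_pro_adapter import MMLUProAdapter  # noqa: F401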

Register Benchmark#

We need to register the benchmark in benchmark_name_adapter.py so that EvalScope can load the benchmark we added. Using MMLU-Pro as an example, this mainly involves the following:

  • Import Benchmark and DataAdapter

  • Register the benchmark with @Benchmark.register, specifying:

    • name: Name of the benchmark

    • dataset_id: ModelScope dataset ID, used to load the benchmark dataset

    • model_adapter: Default model adapter for the benchmark, supporting two types:

      • OutputType.GENERATION: General text-generation evaluation; returns the text the model generates for the input prompt

      • OutputType.MULTIPLE_CHOICE: Multiple-choice evaluation; computes option probabilities from the logits and returns the option with the highest probability

    • output_types: Output types supported by the benchmark (multiple values allowed):

      • OutputType.GENERATION: General text-generation evaluation

      • OutputType.MULTIPLE_CHOICE: Multiple-choice evaluation via output logits

    • subset_list: Sub-datasets (subsets) of the benchmark dataset

    • metric_list: Evaluation metrics for the benchmark

    • few_shot_num: Number of in-context learning (few-shot) examples used during evaluation

    • train_split: Split of the benchmark dataset used for sampling few-shot examples

    • eval_split: Split of the benchmark dataset used for evaluation

    • prompt_template: Prompt template for the benchmark

  • Create the MMLUProAdapter class, which inherits from DataAdapter.

Tip

The values of subset_list, train_split, and eval_split can be obtained from the dataset preview on ModelScope, for example the MMLU-Pro preview.

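If you prefer to check these values programmatically, a rough sketch like the one below can list the candidate subset values; it assumes the MMLU-Pro records carry a category field and a test split, as used later in the adapter:

from modelscope import MsDataset

# Rough sketch: collect the distinct 'category' values of the test split to
# build SUBSET_LIST (assumes the split and field names shown in the preview)
test_data = MsDataset.load("modelscope/MMLU-Pro", split="test")
print(sorted({row["category"] for row in test_data}))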

Example code is as follows:

from evalscope.benchmarks import Benchmark, DataAdapter
from evalscope.constants import EvalType, OutputType

SUBSET_LIST = [
    'computer science', 'math', 'chemistry', 'engineering', 'law', 'biology', 'health', 'physics', 'business',
    'philosophy', 'economics', 'other', 'psychology', 'history'
]  # customize your subset list

@Benchmark.register(
    name='mmlu_pro',
    pretty_name='MMLU-Pro',
    dataset_id='modelscope/MMLU-Pro',
    model_adapter=OutputType.GENERATION,
    output_types=[OutputType.MULTIPLE_CHOICE, OutputType.GENERATION],
    subset_list=SUBSET_LIST,
    metric_list=['AverageAccuracy'],
    few_shot_num=5,
    train_split='validation',
    eval_split='test',
    prompt_template=
    'The following are multiple choice questions (with answers) about {subset_name}. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n{query}',  # noqa: E501
)
class MMLUProAdapter(DataAdapter):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

Write Evaluation Logic#

After registering the benchmark, implement the evaluation logic in your DataAdapter so that EvalScope can run evaluation tasks on it. The following methods need to be implemented:

  • gen_prompt: Generate the model's input prompt.

  • get_gold_answer: Parse the gold (standard) answer from the dataset.

  • parse_pred_result: Parse the model output; the parsing logic differs depending on eval_type.

  • match: Compare the parsed model output with the gold answer and return a score.

Note

If the default loading logic does not meet your needs, you can override the load method, for example to split the dataset into subsets based on a specific field.

The complete example code is as follows (ResponseParser and exact_match used below are EvalScope built-ins; import them the same way the existing adapters under evalscope/benchmarks do):

from typing import Any, Dict

class MMLUProAdapter(DataAdapter):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        
        self.choices = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
    
    def load(self, **kwargs):
        # default load all data
        kwargs['subset_list'] = ['default']
        data_dict = super().load(**kwargs)
        # use `category` as subset key
        return self.reformat_subset(data_dict, subset_key='category')
    
    def gen_prompt(self, input_d: Dict, subset_name: str, few_shot_list: list, **kwargs) -> Any:
        # Prepend few-shot examples (if any), then append the question and its options
        if self.few_shot_num > 0:
            prefix = self.format_fewshot_examples(few_shot_list)
        else:
            prefix = ''
        query = prefix + 'Q: ' + input_d['question'] + '\n' + \
            self.__form_options(input_d['options']) + '\n'

        # Fill the registered prompt_template with the subset name and the query
        full_prompt = self.prompt_template.format(subset_name=subset_name, query=query)
        return self.gen_prompt_data(full_prompt)
    
    def format_fewshot_examples(self, few_shot_list):
        # load few-shot prompts for each category
        prompts = ''
        for index, d in enumerate(few_shot_list):
            prompts += 'Q: ' + d['question'] + '\n' + \
                self.__form_options(d['options']) + '\n' + \
                d['cot_content'] + '\n\n'
        return prompts

    def __form_options(self, options: list):
        option_str = 'Options are:\n'
        for opt, choice in zip(options, self.choices):
            option_str += f'({choice}): {opt}' + '\n'
        return option_str
    
    def get_gold_answer(self, input_d: dict) -> str:
        """
        Parse the raw input labels (gold).

        Args:
            input_d: input raw data. Depending on the dataset.

        Returns:
            The parsed input. e.g. gold answer ... Depending on the dataset.
        """
        return input_d['answer']


    def parse_pred_result(self, result: str, raw_input_d: dict = None, eval_type: str = EvalType.CHECKPOINT) -> str:
        """
        Parse the predicted result and extract proper answer.

        Args:
            result: Predicted answer from the model. Usually a string for chat.
            raw_input_d: The raw input. Depending on the dataset.
            eval_type: 'checkpoint' or 'service' or `custom`, default: 'checkpoint'

        Returns:
            The parsed answer. Depending on the dataset. Usually a string for chat.
        """
        if self.model_adapter == OutputType.MULTIPLE_CHOICE:
            return result
        else:
            return ResponseParser.parse_first_option(result)


    def match(self, gold: str, pred: str) -> float:
        """
        Match the gold answer and the predicted answer.

        Args:
            gold (Any): The golden answer. Usually a string for chat/multiple-choice-questions.
                        e.g. 'A', extracted from get_gold_answer method.
            pred (Any): The predicted answer. Usually a string for chat/multiple-choice-questions.
                        e.g. 'B', extracted from parse_pred_result method.

        Returns:
            The match result. Usually a score (float) for chat/multiple-choice-questions.
        """
        return exact_match(gold=gold, pred=pred)
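
To get an intuition for the prompt that gen_prompt assembles, here is a standalone illustration with a made-up record (hypothetical data, not taken from MMLU-Pro); it reproduces the same "Q: ... / Options are:" layout outside the adapter:

# Standalone illustration with a hypothetical record; mirrors the query layout
# built by gen_prompt and __form_options inside the adapter
choices = ['A', 'B', 'C', 'D']
record = {'question': 'What is 2 + 2?', 'options': ['3', '4', '5', '22'], 'answer': 'B'}

options_str = 'Options are:\n' + ''.join(f'({c}): {o}\n' for c, o in zip(choices, record['options']))
query = 'Q: ' + record['question'] + '\n' + options_str + '\n'
print(query)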

Run Evaluation#

Run the code to verify that it works as expected:

from evalscope import run_task, TaskConfig

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['mmlu_pro'],  # the benchmark name registered above
    limit=10,  # evaluate only 10 samples per subset for a quick check
    dataset_args={'mmlu_pro': {'subset_list': ['computer science', 'math']}},
    debug=True
)
run_task(task_cfg=task_cfg)

Output is as follows:

+-----------------------+-----------+-----------------+------------------+-------+---------+---------+
| Model                 | Dataset   | Metric          | Subset           |   Num |   Score | Cat.0   |
+=======================+===========+=================+==================+=======+=========+=========+
| Qwen2.5-0.5B-Instruct | mmlu_pro  | AverageAccuracy | computer science |    10 |     0.1 | default |
+-----------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen2.5-0.5B-Instruct | mmlu_pro  | AverageAccuracy | math             |    10 |     0.1 | default |
+-----------------------+-----------+-----------------+------------------+-------+---------+---------+

If everything runs smoothly, you can submit a PR. We will review and merge it as soon as possible so that more users can benefit from the benchmark you've contributed. If you're unsure how to submit a PR, check out our guide. Give it a try! πŸš€