πŸ‘ Contribute Benchmark#

EvalScope, ModelScope's official evaluation tool, is continuously improving its benchmark evaluation capabilities! We invite you to follow this tutorial to add your own evaluation benchmarks and share your contributions with the community. Let's work together to help EvalScope grow and make the tool even better!

Below, using MMLU-Pro as an example, we explain how to add a benchmark evaluation. The process involves three main steps: uploading the dataset, registering the benchmark, and writing the evaluation logic.

Upload Benchmark Evaluation Dataset#

Upload the benchmark dataset to ModelScope so that users can load it with a single click, which benefits more users. If the dataset already exists on ModelScope, you can skip this step.

See also

For example: the modelscope/MMLU-Pro dataset; for upload instructions, refer to the dataset upload tutorial.
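
Uploading requires logging in with your ModelScope SDK token first; below is a minimal sketch of the login step (the token placeholder is illustrative, and the upload workflow itself is described in the tutorial linked above):

from modelscope.hub.api import HubApi

# Log in once with the SDK token from your ModelScope account page;
# the actual upload steps follow the dataset upload tutorial.
api = HubApi()
api.login('YOUR_MODELSCOPE_SDK_TOKEN')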

Please ensure the data can be loaded by ModelScope. The test code is as follows:

from modelscope import MsDataset

dataset = MsDataset.load("modelscope/MMLU-Pro")  # Replace with your dataset
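
If the dataset has multiple subsets and splits, you can also spot-check a single split and inspect the fields that the adapter will rely on later (the subset_name and split arguments below assume the standard MsDataset.load signature):

from modelscope import MsDataset

# Load only the test split of the default subset and print one sample to
# confirm fields such as 'question', 'options', 'answer', and 'category'.
dataset = MsDataset.load("modelscope/MMLU-Pro", subset_name="default", split="test")
print(next(iter(dataset)))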

Register Benchmark Evaluation#

Add benchmark evaluations in EvalScope.

Create File Structure#

First, fork the EvalScope repository to create your own copy, and clone it to your local machine.

Then, add the benchmark evaluation in the evalscope/benchmarks/ directory, with the structure as follows:

evalscope/benchmarks/
β”œβ”€β”€ benchmark_name
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ benchmark_name_adapter.py
β”‚   └── ...

Specifically for MMLU-Pro, the structure is as follows:

evalscope/benchmarks/
β”œβ”€β”€ mmlu_pro
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ mmlu_pro_adapter.py
β”‚   └── ...
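
The __init__.py usually just marks the directory as a package; in many registry-based setups it also re-exports the adapter so that the @Benchmark.register decorator runs on import. Check an existing benchmark directory in the repository for the exact convention of your EvalScope version; a minimal sketch:

# evalscope/benchmarks/mmlu_pro/__init__.py
# Sketch only: some EvalScope versions auto-discover *_adapter.py modules,
# in which case this file can remain empty.
from evalscope.benchmarks.mmlu_pro.mmlu_pro_adapter import MMLUProAdapter  # noqa: F401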

Register Benchmark#

We need to register the Benchmark in benchmark_name_adapter.py so that EvalScope can load the benchmark we are adding. Taking MMLU-Pro as an example, this mainly involves the following:

  • Import Benchmark and DataAdapter

  • Register Benchmark, specifying:

    • name: Benchmark test name

    • dataset_id: Benchmark test dataset ID, used to load the benchmark test dataset

    • model_adapter: Benchmark test model adapter. This model adapter is used for locally loading model inference, supporting three types:

      • ChatGenerationModelAdapter: General text generation model evaluation, returns the text generated by the model through input prompts

      • MultiChoiceModelAdapter: Multiple-choice question evaluation, calculates the probabilities of options through logits, returning the option with the highest probability

      • ContinuationLogitsModelAdapter: Multiple-choice text evaluation, calculates the log likelihood of each context-continuation pair, returning a list of log likelihood values

    • subset_list: Sub-datasets of the benchmark test dataset

    • metric_list: Evaluation metrics for the benchmark test

    • few_shot_num: Number of samples for In Context Learning evaluation

    • train_split: Benchmark test training set, used to sample ICL examples

    • eval_split: Benchmark test evaluation set

    • prompt_template: Benchmark test prompt template

  • Create MMLUProAdapter class, inheriting from DataAdapter.

Tip

subset_list, train_split, and eval_split can be obtained from the dataset preview page, for example the MMLU-Pro preview.


The sample code is as follows:

from evalscope.benchmarks import Benchmark, DataAdapter
from evalscope.metrics import WeightedAverageAccuracy
from evalscope.models import ChatGenerationModelAdapter


@Benchmark.register(
    name='mmlu_pro',
    dataset_id='modelscope/mmlu-pro',
    model_adapter=ChatGenerationModelAdapter,
    subset_list=['default'],
    metric_list=[WeightedAverageAccuracy],
    few_shot_num=0,
    train_split='validation',
    eval_split='test',
    prompt_template='You are a knowledge expert, you are supposed to answer the multi-choice question to derive your final answer as `The answer is ...`.',
)
class MMLUProAdapter(DataAdapter):

    def __init__(self, **kwargs):

        super().__init__(**kwargs)

Writing Evaluation Logic#

After registering the benchmark, you can add the evaluation logic in EvalScope by implementing the following methods in your DataAdapter subclass:

  • gen_prompt: Generate model input prompt

    • For the class ChatGenerationModelAdapter, the output format is: {'data': [full_prompt], 'system_prompt': (str, optional)} where full_prompt: str is the constructed prompt for each data sample.

    • For the class MultiChoiceModelAdapter, the output format is: {'data': [full_prompt], 'multi_choices': self.choices} where full_prompt: str is the constructed prompt for each data sample.

    • For the class ContinuationEvalModelAdapter, the output format is: {'data': ctx_continuation_pair_list, 'multi_choices': self.choices} where ctx_continuation_pair_list: list is the list of context-continuation pairs.

Note

If the default gen_prompt logic does not meet your needs, you can override the gen_prompts method to customize how dataset samples are converted into prompts.

  • get_gold_answer: Parse the standard answer from the dataset

  • parse_pred_result: Parse the model output, which can return different answer parsing methods based on different eval_type

  • match: Match the model output and the standard answer from the dataset, providing a score

The complete example code is as follows:

from collections import defaultdict
from typing import Dict

from evalscope.benchmarks import DataAdapter
from evalscope.constants import AnswerKeys, EvalType
from evalscope.metrics import exact_match
from evalscope.utils import ResponseParser  # adjust the path if it differs in your EvalScope version


# @Benchmark.register(...) decorator omitted here; see the registration snippet above.
class MMLUProAdapter(DataAdapter):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        
        self.choices = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
        self.categories = ['computer science', 'math', 'chemistry', 'engineering', 'law', 'biology',
                            'health', 'physics', 'business', 'philosophy', 'economics', 'other',
                            'psychology', 'history']
        
    
    def gen_prompts(self, data_dict: dict, **kwargs) -> Dict[str, list]:
        """
        Generate model prompt from raw input, unify the prompt format for MMLU-Pro benchmark.
        Return a dict with category as key and list of prompts as value.
        """
        
        data_dict = data_dict[self.subset_list[0]]  # Only one subset for MMLU-Pro
        fewshot_prompts = self.get_fewshot_examples(data_dict)
        
        #  Use the category as key to group the prompts
        res_dict = defaultdict(list)
        # generate prompts for each test sample
        for entry in data_dict[self.eval_split]:
            prefix = fewshot_prompts[entry['category']]
            query = prefix + 'Q: ' + entry['question'] + '\n' + \
                self.__form_options(entry['options']) + '\n'
            
            prompt_d = {
                'data': [query],
                'system_prompt': self.prompt_template,
                AnswerKeys.RAW_INPUT: entry
            }
            
            res_dict[entry['category']].append(prompt_d)
        return res_dict
    
    def get_fewshot_examples(self, data_dict: dict):
        # load 5-shot prompts for each category
        prompts = {c: '' for c in self.categories}
        for d in data_dict[self.train_split]:
            prompts[d['category']] += 'Q:' + ' ' + d['question'] + '\n' + \
                self.__form_options(d['options']) + '\n' + \
                d['cot_content'] + '\n\n'
        return prompts
    
    
    def __form_options(self, options: list):
        option_str = 'Options are:\n'
        for opt, choice in zip(options, self.choices):
            option_str += f'({choice}): {opt}' + '\n'
        return option_str
    
    def get_gold_answer(self, input_d: dict) -> str:
        """
        Parse the raw input labels (gold).

        Args:
            input_d: input raw data. Depending on the dataset.

        Returns:
            The parsed input. e.g. gold answer ... Depending on the dataset.
        """
        return input_d['answer']


    def parse_pred_result(self, result: str, raw_input_d: dict = None, eval_type: str = EvalType.CHECKPOINT) -> str:
        """
        Parse the predicted result and extract proper answer.

        Args:
            result: Predicted answer from the model. Usually a string for chat.
            raw_input_d: The raw input. Depending on the dataset.
            eval_type: 'checkpoint' or 'service' or `custom`, default: 'checkpoint'

        Returns:
            The parsed answer. Depending on the dataset. Usually a string for chat.
        """
        return ResponseParser.parse_first_option(result)


    def match(self, gold: str, pred: str) -> float:
        """
        Match the gold answer and the predicted answer.

        Args:
            gold (Any): The golden answer. Usually a string for chat/multiple-choice-questions.
                        e.g. 'A', extracted from get_gold_answer method.
            pred (Any): The predicted answer. Usually a string for chat/multiple-choice-questions.
                        e.g. 'B', extracted from parse_pred_result method.

        Returns:
            The match result. Usually a score (float) for chat/multiple-choice-questions.
        """
        return exact_match(gold=gold, pred=pred)

Running Evaluation#

Debug the code to see if it runs properly.

from evalscope import run_task
task_cfg = {'model': 'qwen/Qwen2-0.5B-Instruct',
            'datasets': ['mmlu_pro'],
            'limit': 2,
            'debug': True}
run_task(task_cfg=task_cfg)

The output will be as follows:

+---------------------+-------------------------------------------+
| Model               | mmlu-pro                                  |
+=====================+===========================================+
| Qwen2-0.5B-Instruct | (mmlu-pro/WeightedAverageAccuracy) 0.1429 |
+---------------------+-------------------------------------------+ 
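
Equivalently, the task can be configured with a TaskConfig object instead of a plain dict (the field names below are assumed to mirror the dictionary keys used above):

from evalscope import TaskConfig, run_task

# Same debug run as above, expressed with TaskConfig; field names are assumed
# to mirror the dict keys (model, datasets, limit, debug).
task_cfg = TaskConfig(
    model='qwen/Qwen2-0.5B-Instruct',
    datasets=['mmlu_pro'],
    limit=2,
    debug=True,
)
run_task(task_cfg=task_cfg)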

If everything runs smoothly, you can submit a PR to let more users use the benchmark evaluation you contributed. Give it a try! πŸš€