Contribute Benchmark
EvalScope, ModelScope's official evaluation tool, is continuously improving its benchmark evaluation capabilities. We invite you to follow this tutorial to add your own benchmark evaluation and share it with the community. Let's work together to make EvalScope even better!
Below, using MMLU-Pro as an example, we introduce how to add a benchmark evaluation. This primarily involves three steps: uploading the dataset, registering the benchmark, and writing the evaluation logic.
Upload Benchmark Evaluation Dataset
Upload the benchmark evaluation dataset to ModelScope so that users can load it with one click, benefiting more users. If the dataset already exists on ModelScope, you can skip this step.
See also
For example: modelscope/MMLU-Pro. Refer to the dataset upload tutorial.
Ensure that the data can be loaded by ModelScope; test it with the following code:
from modelscope import MsDataset
dataset = MsDataset.load("modelscope/MMLU-Pro") # Replace with your dataset
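Optionally, you can also print a sample to confirm the available fields and splits, which you will need later for subset_list, train_split, and eval_split. A minimal sketch (the split name below is an assumption; check the dataset preview page for the actual split and column names):
from modelscope import MsDataset

# Load a single split and print one sample to inspect the available columns.
# 'validation' is an assumed split name; adjust it to match your dataset.
dataset = MsDataset.load("modelscope/MMLU-Pro", split="validation")
print(next(iter(dataset)))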
Register Benchmark Evaluation
Add the benchmark evaluation to EvalScope.
Create File Structure
First, fork the EvalScope repository to create a personal copy, then clone it locally.
Then, add the benchmark evaluation under the evalscope/benchmarks/ directory, with the following structure:
evalscope/benchmarks/
├── benchmark_name
│   ├── __init__.py
│   ├── benchmark_name_adapter.py
│   └── ...
For MMLU-Pro, the structure is as follows:
evalscope/benchmarks/
├── mmlu_pro
│   ├── __init__.py
│   ├── mmlu_pro_adapter.py
│   └── ...
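The __init__.py file usually just imports the adapter so that the registration decorator (introduced below) runs when EvalScope scans the benchmarks package. A minimal sketch, assuming this convention (check an existing benchmark in the repository for the exact pattern):
# evalscope/benchmarks/mmlu_pro/__init__.py
# Importing the adapter ensures @Benchmark.register executes on package import.
from evalscope.benchmarks.mmlu_pro.mmlu_pro_adapter import MMLUProAdapter  # noqa: F401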
Register Benchmark
We need to register the Benchmark in benchmark_name_adapter.py so that EvalScope can load the benchmark we added. Using MMLU-Pro as an example, this mainly involves the following:
- Import Benchmark and DataAdapter.
- Register the Benchmark, specifying:
  - name: name of the benchmark
  - dataset_id: dataset ID of the benchmark, used to load the dataset
  - model_adapter: default model adapter for the benchmark, supporting two types:
    - OutputType.GENERATION: general text-generation evaluation; returns the text the model generates for the input prompt
    - OutputType.MULTIPLE_CHOICE: multiple-choice evaluation; computes option probabilities from logits and returns the option with the highest probability
  - output_types: output types supported by the benchmark (multiple selections allowed):
    - OutputType.GENERATION: general text-generation evaluation
    - OutputType.MULTIPLE_CHOICE: multiple-choice evaluation based on output logits
  - subset_list: sub-datasets of the benchmark dataset
  - metric_list: evaluation metrics for the benchmark
  - few_shot_num: number of in-context learning examples used for evaluation
  - train_split: training split of the benchmark, used for sampling ICL examples
  - eval_split: evaluation split of the benchmark
  - prompt_template: prompt template for the benchmark
- Create an MMLUProAdapter class that inherits from DataAdapter.
Tip
subset_list, train_split, and eval_split can be obtained from the dataset preview, for example the MMLU-Pro preview.
Example code is as follows:
from evalscope.benchmarks import Benchmark, DataAdapter
from evalscope.constants import EvalType, OutputType

SUBSET_LIST = [
    'computer science', 'math', 'chemistry', 'engineering', 'law', 'biology', 'health', 'physics', 'business',
    'philosophy', 'economics', 'other', 'psychology', 'history'
]  # customize your subset list


@Benchmark.register(
    name='mmlu_pro',
    pretty_name='MMLU-Pro',
    dataset_id='modelscope/MMLU-Pro',
    model_adapter=OutputType.GENERATION,
    output_types=[OutputType.MULTIPLE_CHOICE, OutputType.GENERATION],
    subset_list=SUBSET_LIST,
    metric_list=['AverageAccuracy'],
    few_shot_num=5,
    train_split='validation',
    eval_split='test',
    prompt_template=
    'The following are multiple choice questions (with answers) about {subset_name}. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n{query}',  # noqa: E501
)
class MMLUProAdapter(DataAdapter):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
Write Evaluation Logic
After completing the DataAdapter registration, you can add the evaluation logic in EvalScope. The following methods need to be implemented:
- gen_prompt: generate the model input prompt.
- get_gold_answer: parse the gold (reference) answer from the dataset.
- parse_pred_result: parse the model output; the parsing logic can differ depending on eval_type.
- match: match the model output against the gold answer and produce a score.
Note
If the default load logic does not meet your requirements, you can override the load method, for example to regroup the dataset into subsets based on a specified field.
Complete example code is as follows:
from typing import Any, Dict

from evalscope.benchmarks import Benchmark, DataAdapter
from evalscope.constants import EvalType, OutputType
# Helper utilities used below; adjust the import paths if your EvalScope
# version places exact_match or ResponseParser elsewhere.
from evalscope.metrics import exact_match
from evalscope.utils.utils import ResponseParser


class MMLUProAdapter(DataAdapter):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.choices = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']

    def load(self, **kwargs):
        # default load all data
        kwargs['subset_list'] = ['default']
        data_dict = super().load(**kwargs)
        # use `category` as subset key
        return self.reformat_subset(data_dict, subset_key='category')

    def gen_prompt(self, input_d: Dict, subset_name: str, few_shot_list: list, **kwargs) -> Any:
        if self.few_shot_num > 0:
            prefix = self.format_fewshot_examples(few_shot_list)
        else:
            prefix = ''
        query = prefix + 'Q: ' + input_d['question'] + '\n' + \
            self.__form_options(input_d['options']) + '\n'

        full_prompt = self.prompt_template.format(subset_name=subset_name, query=query)
        return self.gen_prompt_data(full_prompt)

    def format_fewshot_examples(self, few_shot_list):
        # load few-shot prompts for each category
        prompts = ''
        for index, d in enumerate(few_shot_list):
            prompts += 'Q: ' + d['question'] + '\n' + \
                self.__form_options(d['options']) + '\n' + \
                d['cot_content'] + '\n\n'
        return prompts

    def __form_options(self, options: list):
        option_str = 'Options are:\n'
        for opt, choice in zip(options, self.choices):
            option_str += f'({choice}): {opt}' + '\n'
        return option_str

    def get_gold_answer(self, input_d: dict) -> str:
        """
        Parse the raw input labels (gold).

        Args:
            input_d: input raw data. Depending on the dataset.

        Returns:
            The parsed input. e.g. gold answer ... Depending on the dataset.
        """
        return input_d['answer']

    def parse_pred_result(self, result: str, raw_input_d: dict = None, eval_type: str = EvalType.CHECKPOINT) -> str:
        """
        Parse the predicted result and extract the proper answer.

        Args:
            result: Predicted answer from the model. Usually a string for chat.
            raw_input_d: The raw input. Depending on the dataset.
            eval_type: 'checkpoint' or 'service' or 'custom', default: 'checkpoint'

        Returns:
            The parsed answer. Depending on the dataset. Usually a string for chat.
        """
        if self.model_adapter == OutputType.MULTIPLE_CHOICE:
            return result
        else:
            return ResponseParser.parse_first_option(result)

    def match(self, gold: str, pred: str) -> float:
        """
        Match the gold answer and the predicted answer.

        Args:
            gold (Any): The golden answer. Usually a string for chat/multiple-choice questions.
                e.g. 'A', extracted by the get_gold_answer method.
            pred (Any): The predicted answer. Usually a string for chat/multiple-choice questions.
                e.g. 'B', extracted by the parse_pred_result method.

        Returns:
            The match result. Usually a score (float) for chat/multiple-choice questions.
        """
        return exact_match(gold=gold, pred=pred)
Run Evaluation
Debug the code to verify that it runs correctly.
from evalscope import run_task, TaskConfig

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['mmlu_pro'],
    limit=10,
    dataset_args={'mmlu_pro': {'subset_list': ['computer science', 'math']}},
    debug=True
)
run_task(task_cfg=task_cfg)
Output is as follows:
+-----------------------+-----------+-----------------+------------------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+=======================+===========+=================+==================+=======+=========+=========+
| Qwen2.5-0.5B-Instruct | mmlu_pro | AverageAccuracy | computer science | 10 | 0.1 | default |
+-----------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen2.5-0.5B-Instruct | mmlu_pro | AverageAccuracy | math | 10 | 0.1 | default |
+-----------------------+-----------+-----------------+------------------+-------+---------+---------+
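While debugging, it can also be handy to override the registered defaults per dataset through dataset_args instead of editing the adapter. A sketch, assuming your EvalScope version accepts the registration parameters (such as few_shot_num) as dataset_args keys:
from evalscope import run_task, TaskConfig

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['mmlu_pro'],
    limit=5,
    # Key names mirror the registration arguments above and may differ
    # across EvalScope versions; here we run zero-shot on a single subset.
    dataset_args={'mmlu_pro': {'subset_list': ['math'], 'few_shot_num': 0}},
    debug=True
)
run_task(task_cfg=task_cfg)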
If everything runs smoothly, you can submit a PR. We will review and merge your contribution as soon as possible, allowing more users to benefit from the benchmark evaluation you've contributed. If you're unsure how to submit a PR, you can check out our guide. Give it a try!