# Contribute Benchmark
EvalScope, as the official evaluation tool of ModelScope, is continuously optimizing its benchmark evaluation features! We invite you to refer to this tutorial to easily add your own evaluation benchmarks and share your contributions with the community. Let's work together to support the growth of EvalScope and make our tools even better!
Below, using MMLU-Pro as an example, we will walk through how to add a benchmark evaluation. This involves three main steps: uploading the dataset, registering the benchmark, and writing the evaluation logic.
## Upload the Benchmark Evaluation Dataset
Upload the benchmark evaluation dataset to ModelScope so that users can load it with a single click and more users can benefit from it. If the dataset is already hosted on ModelScope, you can skip this step.
**See also:** For example, modelscope/MMLU-Pro; refer to the dataset upload tutorial for how to upload your own dataset.
Please ensure the data can be loaded by ModelScope. The test code is as follows:

```python
from modelscope import MsDataset

dataset = MsDataset.load("modelscope/MMLU-Pro")  # Replace with your dataset
```
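Optionally, you can also load a specific split and inspect one record to confirm the fields your adapter will rely on later. This is a quick sketch; the `subset_name`/`split` values and field names below assume the MMLU-Pro layout:

```python
from modelscope import MsDataset

# Load only the test split of the default subset (names assume the MMLU-Pro layout)
test_ds = MsDataset.load("modelscope/MMLU-Pro", subset_name="default", split="test")

# Each record should be dict-like; MMLU-Pro samples are expected to carry
# 'question', 'options', 'answer', 'category' and 'cot_content' fields.
sample = next(iter(test_ds))
print(sample)
```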
## Register the Benchmark Evaluation
Next, add the benchmark evaluation to EvalScope.
### Create the File Structure
First, fork the EvalScope repository to create your own copy and clone it to your local machine. Then add the benchmark evaluation under the `evalscope/benchmarks/` directory, with the following structure:
```text
evalscope/benchmarks/
├── benchmark_name
│   ├── __init__.py
│   ├── benchmark_name_adapter.py
│   └── ...
```
Specifically for MMLU-Pro, the structure is as follows:
```text
evalscope/benchmarks/
├── mmlu_pro
│   ├── __init__.py
│   ├── mmlu_pro_adapter.py
│   └── ...
```
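The `__init__.py` typically just exposes the adapter module so that the registration decorator runs when the benchmarks package is imported. A minimal sketch is shown below; the exact form EvalScope expects may differ, so mirror an existing benchmark's `__init__.py`:

```python
# evalscope/benchmarks/mmlu_pro/__init__.py
# Sketch only: re-export the adapter so @Benchmark.register runs on import.
# Check an existing benchmark package for the canonical form.
from evalscope.benchmarks.mmlu_pro.mmlu_pro_adapter import MMLUProAdapter  # noqa: F401
```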
### Register the Benchmark
We need to register the `Benchmark` in `benchmark_name_adapter.py` so that EvalScope can load the benchmark we added. Taking MMLU-Pro as an example, this mainly involves the following:
1. Import `Benchmark` and `DataAdapter`.
2. Register the `Benchmark`, specifying:
   - `name`: the benchmark name.
   - `dataset_id`: the benchmark dataset ID, used to load the benchmark dataset.
   - `model_adapter`: the model adapter for the benchmark, used to run inference with a locally loaded model. Three types are supported:
     - `ChatGenerationModelAdapter`: general text-generation evaluation; returns the text the model generates from the input prompt.
     - `MultiChoiceModelAdapter`: multiple-choice evaluation; computes the probability of each option from the logits and returns the option with the highest probability.
     - `ContinuationLogitsModelAdapter`: multiple-choice text evaluation; computes the log-likelihood of each context-continuation pair and returns the list of log-likelihood values.
   - `subset_list`: the sub-datasets of the benchmark dataset.
   - `metric_list`: the evaluation metrics for the benchmark.
   - `few_shot_num`: the number of examples used for in-context-learning evaluation.
   - `train_split`: the training split of the benchmark, used to sample ICL examples.
   - `eval_split`: the evaluation split of the benchmark.
   - `prompt_template`: the prompt template for the benchmark.
3. Create the `MMLUProAdapter` class, inheriting from `DataAdapter`.
**Tip:** `subset_list`, `train_split`, and `eval_split` can be obtained from the dataset preview, for example the MMLU-Pro preview.
The sample code is as follows:
```python
from evalscope.benchmarks import Benchmark, DataAdapter
from evalscope.metrics import WeightedAverageAccuracy
from evalscope.models import ChatGenerationModelAdapter


@Benchmark.register(
    name='mmlu_pro',
    dataset_id='modelscope/mmlu-pro',
    model_adapter=ChatGenerationModelAdapter,
    subset_list=['default'],
    metric_list=[WeightedAverageAccuracy],
    few_shot_num=0,
    train_split='validation',
    eval_split='test',
    prompt_template='You are a knowledge expert, you are supposed to answer the multi-choice question to derive your final answer as `The answer is ...`.',
)
class MMLUProAdapter(DataAdapter):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
```
## Writing the Evaluation Logic
Once the `DataAdapter` is registered, you can add the evaluation logic in EvalScope. The following methods need to be implemented:
- `gen_prompt`: generates the model input prompt. The expected output format depends on the model adapter (see the illustrative example after this list):
  - For `ChatGenerationModelAdapter`, the output format is `{'data': [full_prompt], 'system_prompt': (str, optional)}`, where `full_prompt: str` is the prompt constructed for each data sample.
  - For `MultiChoiceModelAdapter`, the output format is `{'data': [full_prompt], 'multi_choices': self.choices}`, where `full_prompt: str` is the prompt constructed for each data sample.
  - For `ContinuationLogitsModelAdapter`, the output format is `{'data': ctx_continuation_pair_list, 'multi_choices': self.choices}`, where `ctx_continuation_pair_list: list` is the list of context-continuation pairs.
  **Note:** If the logic provided by `gen_prompt` does not meet your expectations, you can override the `gen_prompts` method to customize how dataset samples are converted into prompts.
- `get_gold_answer`: parses the reference (gold) answer from the dataset sample.
- `parse_pred_result`: parses the model output; the parsing strategy can differ depending on `eval_type`.
- `match`: compares the parsed prediction with the gold answer and returns a score.
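For intuition, with `ChatGenerationModelAdapter` a single prompt entry generated for MMLU-Pro looks roughly like the following. The question text is illustrative, not taken from the dataset:

```python
# Illustrative shape of one prompt entry produced by gen_prompts (values are made up)
prompt_d = {
    'data': ['Q: What is 2 + 2?\nOptions are:\n(A): 3\n(B): 4\n...\n'],
    'system_prompt': 'You are a knowledge expert, you are supposed to answer the multi-choice '
                     'question to derive your final answer as `The answer is ...`.',
    # The raw dataset record is also kept alongside (under the AnswerKeys.RAW_INPUT key)
    # so that get_gold_answer can read the reference answer later.
}
```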
The complete example code is as follows:
```python
from collections import defaultdict
from typing import Dict

# AnswerKeys, EvalType, ResponseParser and exact_match are EvalScope utilities;
# import them alongside Benchmark/DataAdapter as the existing benchmark adapters do.


class MMLUProAdapter(DataAdapter):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        self.choices = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
        self.categories = ['computer science', 'math', 'chemistry', 'engineering', 'law', 'biology',
                           'health', 'physics', 'business', 'philosophy', 'economics', 'other',
                           'psychology', 'history']

    def gen_prompts(self, data_dict: dict, **kwargs) -> Dict[str, list]:
        """
        Generate model prompts from the raw input, unifying the prompt format for the MMLU-Pro benchmark.
        Return a dict with the category as key and a list of prompts as value.
        """
        data_dict = data_dict[self.subset_list[0]]  # Only one subset for MMLU-Pro
        fewshot_prompts = self.get_fewshot_examples(data_dict)

        # Use the category as key to group the prompts
        res_dict = defaultdict(list)

        # Generate prompts for each test sample
        for entry in data_dict[self.eval_split]:
            prefix = fewshot_prompts[entry['category']]
            query = prefix + 'Q: ' + entry['question'] + '\n' + \
                self.__form_options(entry['options']) + '\n'

            prompt_d = {
                'data': [query],
                'system_prompt': self.prompt_template,
                AnswerKeys.RAW_INPUT: entry
            }
            res_dict[entry['category']].append(prompt_d)
        return res_dict

    def get_fewshot_examples(self, data_dict: dict):
        # Load 5-shot prompts for each category
        prompts = {c: '' for c in self.categories}
        for d in data_dict[self.train_split]:
            prompts[d['category']] += 'Q:' + ' ' + d['question'] + '\n' + \
                self.__form_options(d['options']) + '\n' + \
                d['cot_content'] + '\n\n'
        return prompts

    def __form_options(self, options: list):
        option_str = 'Options are:\n'
        for opt, choice in zip(options, self.choices):
            option_str += f'({choice}): {opt}' + '\n'
        return option_str

    def get_gold_answer(self, input_d: dict) -> str:
        """
        Parse the raw input labels (gold).

        Args:
            input_d: Input raw data. Depends on the dataset.

        Returns:
            The parsed gold answer. Depends on the dataset.
        """
        return input_d['answer']

    def parse_pred_result(self, result: str, raw_input_d: dict = None, eval_type: str = EvalType.CHECKPOINT) -> str:
        """
        Parse the predicted result and extract the proper answer.

        Args:
            result: Predicted answer from the model. Usually a string for chat.
            raw_input_d: The raw input. Depends on the dataset.
            eval_type: 'checkpoint', 'service' or 'custom'; default: 'checkpoint'.

        Returns:
            The parsed answer. Depends on the dataset. Usually a string for chat.
        """
        return ResponseParser.parse_first_option(result)

    def match(self, gold: str, pred: str) -> float:
        """
        Match the gold answer and the predicted answer.

        Args:
            gold (Any): The golden answer. Usually a string for chat/multiple-choice questions,
                e.g. 'A', extracted by the get_gold_answer method.
            pred (Any): The predicted answer. Usually a string for chat/multiple-choice questions,
                e.g. 'B', extracted by the parse_pred_result method.

        Returns:
            The match result. Usually a score (float) for chat/multiple-choice questions.
        """
        return exact_match(gold=gold, pred=pred)
```
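As a sanity check on the answer flow, here is what the three answer-handling methods are expected to produce for a single sample. The values are illustrative, not taken from the dataset:

```python
# Illustrative answer flow for one MMLU-Pro sample (values are made up)
gold = 'B'                            # get_gold_answer(input_d) reads the 'answer' field
model_output = 'The answer is (B).'   # raw text returned by the model
pred = 'B'                            # what parse_pred_result should extract from the output
score = 1.0 if pred == gold else 0.0  # the behaviour expected from match(gold, pred)
print(score)  # 1.0
```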
## Running the Evaluation
Debug the code to see if it runs properly:

```python
from evalscope import run_task

task_cfg = {
    'model': 'qwen/Qwen2-0.5B-Instruct',
    'datasets': ['mmlu_pro'],
    'limit': 2,
    'debug': True
}

run_task(task_cfg=task_cfg)
```
The output will be as follows:
```text
+---------------------+-------------------------------------------+
| Model               | mmlu-pro                                  |
+=====================+===========================================+
| Qwen2-0.5B-Instruct | (mmlu-pro/WeightedAverageAccuracy) 0.1429 |
+---------------------+-------------------------------------------+
```
If everything runs smoothly, you can submit a PR so that more users can use the benchmark evaluation you contributed. Give it a try!