# 👍 Contributing a Benchmark
As ModelScope's official evaluation tool, EvalScope is continuously improving its benchmark evaluation capabilities! We invite you to follow this tutorial to add your own evaluation benchmark and share your contribution with the community. Help EvalScope grow and make the tool even better!
The following uses MMLU-Pro as an example to show how to add a benchmark. The process consists of three steps: uploading the dataset, registering the benchmark, and writing the evaluation logic.
## Upload the Benchmark Dataset
Upload the benchmark dataset to ModelScope so that users can load it with a single call and more users can benefit from it. If the dataset already exists on ModelScope, you can skip this step.
> **See also**
> For example: modelscope/MMLU-Pro. Refer to the dataset upload tutorial.
Make sure the dataset can be loaded by modelscope. Test code:
```python
from modelscope import MsDataset

dataset = MsDataset.load("modelscope/MMLU-Pro")  # replace with your dataset
```
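If the load succeeds, it is also worth quickly checking that the fields the adapter will rely on later in this tutorial (`question`, `options`, `answer`, `cot_content`, `category`) are present. Below is a minimal sketch, assuming the split can be iterated; adjust the field names for your own dataset.

```python
from modelscope import MsDataset

# Load only the evaluation split; 'test' matches eval_split in the registration below.
test_set = MsDataset.load("modelscope/MMLU-Pro", split="test")

# Peek at one sample and confirm the fields the adapter relies on are present.
sample = next(iter(test_set))
print(sample)
for field in ["question", "options", "answer", "cot_content", "category"]:
    assert field in sample, f"missing field: {field}"
```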
## Register the Benchmark

Add the benchmark to EvalScope.
### Create the File Structure
First, fork the EvalScope repository, i.e. create your own copy of the EvalScope repository, and clone it locally.

Then add your benchmark under the `evalscope/benchmarks/` directory with the following structure:
```text
evalscope/benchmarks/
├── benchmark_name
│   ├── __init__.py
│   ├── benchmark_name_adapter.py
│   └── ...
```
For MMLU-Pro specifically, the structure is:
```text
evalscope/benchmarks/
├── mmlu_pro
│   ├── __init__.py
│   ├── mmlu_pro_adapter.py
│   └── ...
```
### Register the `Benchmark`
We need to register the `Benchmark` in `benchmark_name_adapter.py` so that EvalScope can load the benchmark we added. Using MMLU-Pro as an example, this mainly involves:
- Importing `Benchmark` and `DataAdapter`.
- Registering the `Benchmark`, specifying:
  - `name`: the benchmark name.
  - `dataset_id`: the ID of the benchmark dataset, used to load it.
  - `model_adapter`: the default model adapter for the benchmark. Two types are supported:
    - `OutputType.GENERATION`: general text-generation evaluation; the model receives a prompt and returns generated text.
    - `OutputType.MULTIPLE_CHOICE`: multiple-choice evaluation; option probabilities are computed from the logits and the highest-probability option is returned.
  - `output_types`: the output types supported by the benchmark (multiple values allowed):
    - `OutputType.GENERATION`: general text-generation evaluation.
    - `OutputType.MULTIPLE_CHOICE`: multiple-choice evaluation that outputs logits.
  - `subset_list`: the subsets of the benchmark dataset.
  - `metric_list`: the evaluation metrics.
  - `few_shot_num`: the number of in-context-learning examples used for evaluation.
  - `train_split`: the training split, used to sample ICL examples.
  - `eval_split`: the evaluation split.
  - `prompt_template`: the prompt template.
- Creating the `MMLUProAdapter` class, which inherits from `DataAdapter`.
Example code:
```python
from evalscope.benchmarks import Benchmark, DataAdapter
from evalscope.constants import EvalType, OutputType

SUBSET_LIST = [
    'computer science', 'math', 'chemistry', 'engineering', 'law', 'biology', 'health', 'physics', 'business',
    'philosophy', 'economics', 'other', 'psychology', 'history'
]  # custom list of subsets


@Benchmark.register(
    name='mmlu_pro',
    pretty_name='MMLU-Pro',
    dataset_id='modelscope/MMLU-Pro',
    model_adapter=OutputType.GENERATION,
    output_types=[OutputType.MULTIPLE_CHOICE, OutputType.GENERATION],
    subset_list=SUBSET_LIST,
    metric_list=['AverageAccuracy'],
    few_shot_num=5,
    train_split='validation',
    eval_split='test',
    prompt_template=
    'The following are multiple choice questions (with answers) about {subset_name}. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n{query}',  # noqa: E501
)
class MMLUProAdapter(DataAdapter):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
```
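To see what the registered `prompt_template` produces, here is a quick plain-Python check (no EvalScope required). The `query` value is a made-up example; in practice it is assembled by `gen_prompt`, described in the next section, via exactly this `format` call.

```python
# Plain-Python illustration of how prompt_template is filled in by gen_prompt below.
# The query string here is a made-up example.
template = ('The following are multiple choice questions (with answers) about {subset_name}. '
            'Think step by step and then finish your answer with "the answer is (X)" '
            'where X is the correct letter choice.\n{query}')

query = 'Q: What is 2 + 3?\nOptions are:\n(A): 4\n(B): 5\n(C): 6\n\n'
print(template.format(subset_name='math', query=query))
```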
### Write the Evaluation Logic
Once the `DataAdapter` is implemented, the evaluation task can be added to EvalScope. The following methods need to be implemented:

- `gen_prompt`: generate the input prompt for the model.
- `get_gold_answer`: parse the gold answer from the dataset.
- `parse_pred_result`: parse the model output; different parsing strategies can be applied depending on the `eval_type`.
- `match`: match the model output against the gold answer and return a score.
> **Note**
> If the default `load` logic does not meet your needs, you can override the `load` method, for example to split the dataset into subsets according to a specified field.
The complete example code is as follows:
```python
from typing import Any, Dict

# Note: ResponseParser and exact_match used below are helper utilities provided by EvalScope;
# import them together with the Benchmark/DataAdapter imports shown in the registration snippet above.


class MMLUProAdapter(DataAdapter):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.choices = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']

    def load(self, **kwargs):
        # default load all data
        kwargs['subset_list'] = ['default']
        data_dict = super().load(**kwargs)
        # use `category` as subset key
        return self.reformat_subset(data_dict, subset_key='category')

    def gen_prompt(self, input_d: Dict, subset_name: str, few_shot_list: list, **kwargs) -> Any:
        if self.few_shot_num > 0:
            prefix = self.format_fewshot_examples(few_shot_list)
        else:
            prefix = ''
        query = prefix + 'Q: ' + input_d['question'] + '\n' + \
            self.__form_options(input_d['options']) + '\n'

        full_prompt = self.prompt_template.format(subset_name=subset_name, query=query)
        return self.gen_prompt_data(full_prompt)

    def format_fewshot_examples(self, few_shot_list):
        # load few-shot prompts for each category
        prompts = ''
        for index, d in enumerate(few_shot_list):
            prompts += 'Q: ' + d['question'] + '\n' + \
                self.__form_options(d['options']) + '\n' + \
                d['cot_content'] + '\n\n'
        return prompts

    def __form_options(self, options: list):
        option_str = 'Options are:\n'
        for opt, choice in zip(options, self.choices):
            option_str += f'({choice}): {opt}' + '\n'
        return option_str

    def get_gold_answer(self, input_d: dict) -> str:
        """
        Parse the raw input labels (gold).

        Args:
            input_d: input raw data. Depending on the dataset.

        Returns:
            The parsed input. e.g. gold answer ... Depending on the dataset.
        """
        return input_d['answer']

    def parse_pred_result(self, result: str, raw_input_d: dict = None, eval_type: str = EvalType.CHECKPOINT) -> str:
        """
        Parse the predicted result and extract proper answer.

        Args:
            result: Predicted answer from the model. Usually a string for chat.
            raw_input_d: The raw input. Depending on the dataset.
            eval_type: 'checkpoint' or 'service' or 'custom', default: 'checkpoint'

        Returns:
            The parsed answer. Depending on the dataset. Usually a string for chat.
        """
        if self.model_adapter == OutputType.MULTIPLE_CHOICE:
            return result
        else:
            return ResponseParser.parse_first_option(result)

    def match(self, gold: str, pred: str) -> float:
        """
        Match the gold answer and the predicted answer.

        Args:
            gold (Any): The golden answer. Usually a string for chat/multiple-choice-questions.
                        e.g. 'A', extracted from get_gold_answer method.
            pred (Any): The predicted answer. Usually a string for chat/multiple-choice-questions.
                        e.g. 'B', extracted from parse_pred_result method.

        Returns:
            The match result. Usually a score (float) for chat/multiple-choice-questions.
        """
        return exact_match(gold=gold, pred=pred)
```
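To make the contract between these methods concrete, here is a tiny standalone illustration in plain Python (toy stand-ins, not the EvalScope implementation): `get_gold_answer` yields a letter such as `'A'`, `parse_pred_result` extracts the letter the model chose from its generated text, and `match` compares the two to produce a score.

```python
import re

# Toy stand-ins for illustration only; EvalScope's ResponseParser and exact_match
# are more robust than this.
def toy_parse_first_option(text: str) -> str:
    m = re.search(r'\(([A-J])\)', text)  # look for a letter in parentheses, e.g. "(B)"
    return m.group(1) if m else ''

def toy_exact_match(gold: str, pred: str) -> float:
    return float(gold.strip() == pred.strip())

pred = toy_parse_first_option('Let us think step by step ... the answer is (B).')
print(pred)                        # 'B'
print(toy_exact_match('A', pred))  # 0.0
print(toy_exact_match('B', pred))  # 1.0
```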
## Run the Evaluation
Debug the code and check that it runs correctly:
```python
from evalscope import run_task, TaskConfig

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['mmlu_pro'],
    limit=10,
    dataset_args={'mmlu_pro': {'subset_list': ['computer science', 'math']}},
    debug=True
)
run_task(task_cfg=task_cfg)
```
The output looks like this:
```text
+-----------------------+-----------+-----------------+------------------+-------+---------+---------+
| Model                 | Dataset   | Metric          | Subset           |   Num |   Score | Cat.0   |
+=======================+===========+=================+==================+=======+=========+=========+
| Qwen2.5-0.5B-Instruct | mmlu_pro  | AverageAccuracy | computer science |    10 |     0.1 | default |
+-----------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen2.5-0.5B-Instruct | mmlu_pro  | AverageAccuracy | math             |    10 |     0.1 | default |
+-----------------------+-----------+-----------------+------------------+-------+---------+---------+
```
If everything runs correctly, you can submit a PR. We will merge your contribution as soon as possible so that more users can use the benchmark you contributed. If you are not sure how to submit a PR, check out our guide. Give it a try 🚀