Contribute Benchmark
EvalScope, as the official evaluation tool of ModelScope, is continuously optimizing its benchmark evaluation features! We invite you to refer to this tutorial to easily add your own benchmark and share your contributions with the community. Let's work together to enhance EvalScope and make our tool even better!
Below, we describe how to add two types of benchmark evaluation: General Text Reasoning and Multiple Choice. Each involves three main steps: uploading the dataset, registering the dataset, and writing the evaluation task.
Basic Concepts
Tip
You can skip this section and start directly from Preparing Benchmark Evaluation Dataset. Refer back to the specific implementation when encountering code you don't understand.
The evaluation process of EvalScope mainly includes the following steps:
Data Preparation: Load and preprocess the dataset through DataAdapter.
Task Definition: Define the configuration of the evaluation task through TaskConfig, including models, datasets, evaluation metrics, etc.
Evaluation Execution: Execute the evaluation task through the run_task function and output the evaluation results.
Among them, DataAdapter is the core component of benchmark evaluation that we need to focus on.
DataAdapter Architecture and Call Flow
DataAdapter adopts a Pipeline architecture, supporting custom behavior through hook methods. Taking DefaultDataAdapter as an example, the complete evaluation process is as follows:
1. Data Loading Phase
load_dataset()
├── load()
│   ├── load_from_remote() / load_from_disk()
│   │   ├── load_subsets()
│   │   │   ├── load_subset() / load_fewshot_subset()
│   │   │   └── record_to_sample() [User Implementation]
│   │   └── _post_process_samples()
│   │       └── process_sample_input()
│   │           ├── sample_to_fewshot() [User Implementation]
│   │           ├── format_fewshot_template() [Optional User Implementation]
│   │           └── format_prompt_template() [Optional User Implementation]
│   └── Returns DatasetDict
2. Model Inference Phase (Per Sample)
run_inference()
├── _on_inference_start() [Hook Method]
├── _on_inference() [Hook Method]
├── _on_inference_end() [Hook Method]
└── Returns TaskState
3. Metric Calculation Phase (Per Sample)
calculate_metrics()
├── filter_prediction()
│   └── extract_answer() [Optional User Implementation]
├── match_score() / llm_match_score()
└── Returns SampleScore
4. Result Aggregation Phase
aggregate_scores()
└── Returns List[AggScore]
5. Report Generation Phase
generate_report()
├── _on_generate_report() [Hook Method]
├── _on_generate_report_end() [Hook Method]
└── Returns Report
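The phases above can be traced end to end with a toy pipeline. This is only a sketch and not EvalScope code: `ToyAdapter`, `fake_model`, and the simplified signatures are hypothetical stand-ins that borrow the method names from the call flow, and samples are plain dictionaries rather than Sample objects.

```python
from statistics import mean
from typing import Any, Callable, Dict, List


class ToyAdapter:
    """Hypothetical stand-in for a data adapter; not the EvalScope class."""

    prompt_template = 'Q: {question}\nA:'

    # Phase 1: turn raw records into samples (the user-implemented step).
    def record_to_sample(self, record: Dict[str, Any]) -> Dict[str, str]:
        return {'input': record['question'], 'target': record['answer']}

    def load_dataset(self, records: List[Dict[str, Any]]) -> List[Dict[str, str]]:
        return [self.record_to_sample(r) for r in records]

    # Phase 2: build the prompt and run the model on one sample.
    def run_inference(self, model: Callable[[str], str], sample: Dict[str, str]) -> str:
        prompt = self.prompt_template.format(question=sample['input'])
        return model(prompt)

    # Phase 3: clean the prediction, then score it against the target.
    def extract_answer(self, prediction: str) -> str:
        return prediction.strip()

    def calculate_metrics(self, prediction: str, sample: Dict[str, str]) -> Dict[str, float]:
        extracted = self.extract_answer(prediction)
        return {'acc': 1.0 if extracted == sample['target'] else 0.0}

    # Phase 4: aggregate per-sample scores into one value per metric.
    def aggregate_scores(self, scores: List[Dict[str, float]]) -> Dict[str, float]:
        return {'acc': mean(s['acc'] for s in scores)}


def fake_model(prompt: str) -> str:
    """Hypothetical model that always answers '4' with stray whitespace."""
    return ' 4 '


adapter = ToyAdapter()
samples = adapter.load_dataset([
    {'question': '2 + 2 = ?', 'answer': '4'},
    {'question': '3 + 3 = ?', 'answer': '6'},
])
scores = [
    adapter.calculate_metrics(adapter.run_inference(fake_model, s), s)
    for s in samples
]
report = adapter.aggregate_scores(scores)  # Phase 5 would render this as a report
print(report)  # {'acc': 0.5}
```

Only `record_to_sample` and `extract_answer` contain benchmark-specific logic here, which reflects why those are the methods a contributor usually writes.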
Core Data Structures
1. Sample Object
Represents a single evaluation sample, including input, target answer, and metadata:
@dataclass
class Sample:
    input: Any                            # Input content (question text or list of chat messages)
    target: str                           # Target answer (correct answer)
    choices: Optional[List[str]] = None   # Choices (used for multiple choice questions)
    subset_key: Optional[str] = None      # Subset division key (used for grouping by category)
    metadata: Optional[Dict] = None       # Metadata (reasoning process, ID, etc.)
    tools: Optional[List] = None          # Tool call information
2. TaskState Object
Represents the complete state of a single inference task:
@dataclass
class TaskState:
    model: str                        # Model name
    sample: Sample                    # Input sample
    messages: List[ChatMessage]       # Chat message history
    output: ModelOutput               # Model raw output
    completed: bool                   # Whether the task is completed
    sample_id: Optional[str] = None   # Sample ID
    group_id: Optional[str] = None    # Group ID
    metadata: Optional[Dict] = None   # Task metadata
3. ModelOutput Object
Represents the raw output of the model:
@dataclass
class ModelOutput:
    completion: str        # Text generated by the model
    message: ChatMessage   # Formatted chat message
    # Other model-specific fields...
4. Score Object
Represents the scoring result of a single sample:
@dataclass
class Score:
    value: Dict[str, float]     # Scores for each metric {"acc": 1.0, "f1": 0.8}
    extracted_prediction: str   # Extracted prediction answer
    prediction: str             # Raw prediction text
    metadata: Dict = None       # Scoring metadata
5. SampleScore Object
Encapsulates the complete scoring information of a single sample:
@dataclass
class SampleScore:
    score: Score                             # Scoring object
    sample_id: Optional[str]                 # Unique identifier for the sample
    group_id: Optional[str]                  # Group identifier
    sample_metadata: Optional[Dict] = None   # Sample metadata
6. AggScore Object
Represents aggregated scoring statistics:
@dataclass
class AggScore:
    metric: str             # Metric name
    value: float            # Aggregated value (e.g., average score)
    subset: str             # Subset name
    num_samples: int        # Number of samples
    agg_method: str         # Aggregation method (mean, median, etc.)
    metadata: Dict = None   # Aggregation metadata
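To make the relationship between per-sample scores and aggregated scores concrete, here is a minimal sketch of mean aggregation. `ToyScore`, `ToyAggScore`, and `aggregate_mean` are hypothetical stand-ins that mirror only a few of the fields listed above; EvalScope's own aggregation may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyScore:
    """Stand-in for a per-sample score: metric name -> value."""
    value: Dict[str, float]


@dataclass
class ToyAggScore:
    """Stand-in for an aggregated score over one subset."""
    metric: str
    value: float
    subset: str
    num_samples: int
    agg_method: str


def aggregate_mean(scores: List[ToyScore], subset: str) -> List[ToyAggScore]:
    """Average every metric over the samples that report it."""
    by_metric: Dict[str, List[float]] = defaultdict(list)
    for score in scores:
        for metric, value in score.value.items():
            by_metric[metric].append(value)
    return [
        ToyAggScore(metric=name, value=sum(values) / len(values), subset=subset,
                    num_samples=len(values), agg_method='mean')
        for name, values in by_metric.items()
    ]


per_sample = [ToyScore({'acc': 1.0}), ToyScore({'acc': 0.0}),
              ToyScore({'acc': 1.0}), ToyScore({'acc': 1.0})]
agg = aggregate_mean(per_sample, subset='main')
print(agg[0].metric, agg[0].value, agg[0].num_samples)  # acc 0.75 4
```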
7. DatasetDict Object
Manages multiple dataset subsets:
class DatasetDict(dict):
    """Dataset dictionary, keys are subset names, values are Dataset objects"""

    @classmethod
    def from_dataset(cls, dataset, subset_list=None, limit=None, repeats=1):
        """Create a multi-subset dataset dictionary from a single dataset"""
        pass
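The idea of splitting one dataset into named subsets can be sketched in plain Python. This is an illustration only: `group_by_subset` is a hypothetical helper, samples are plain dictionaries, and the handling of missing keys and of `limit` (applied per subset here) are assumptions rather than documented behavior of `from_dataset`. The `repeats` parameter is left out.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def group_by_subset(samples: List[dict],
                    subset_list: Optional[List[str]] = None,
                    limit: Optional[int] = None) -> Dict[str, List[dict]]:
    """Split one flat sample list into named subsets.

    Assumptions (not confirmed by the source): samples without a
    subset_key go to 'default', and limit caps each subset separately.
    """
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for sample in samples:
        key = sample.get('subset_key') or 'default'
        if subset_list is not None and key not in subset_list:
            continue  # subset was not requested
        if limit is not None and len(grouped[key]) >= limit:
            continue  # this subset is already full
        grouped[key].append(sample)
    return dict(grouped)


samples = [
    {'input': 'q1', 'subset_key': 'math'},
    {'input': 'q2', 'subset_key': 'law'},
    {'input': 'q3', 'subset_key': 'math'},
    {'input': 'q4', 'subset_key': 'history'},
]
subsets = group_by_subset(samples, subset_list=['math', 'law'], limit=1)
print({name: [s['input'] for s in items] for name, items in subsets.items()})
# {'math': ['q1'], 'law': ['q2']}
```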
Core Methods of DataAdapter
Based on the above call flow, here are the key methods that need to be implemented by the user or can be optionally overridden:
Methods That Must Be Implemented
record_to_sample(record: Dict[str, Any]) -> Sample
Purpose: Convert raw data records into standard Sample objects
Input: Raw record dictionary from the dataset
Output: Standardized Sample object
Example:
def record_to_sample(self, record: Dict[str, Any]) -> Sample:
    return Sample(
        input=record['question'],
        target=record['answer'],
        metadata={'reasoning': record.get('explanation', '')}
    )
Methods That Can Be Optionally Implemented
sample_to_fewshot(sample: Sample) -> str
Purpose: Convert sample into a few-shot example string
Input: Sample object
Output: Formatted few-shot example text
Call Timing: When constructing few-shot prompts
extract_answer(prediction: str, task_state: TaskState) -> str
Purpose: Extract the final answer from the model's raw output
Input: Model prediction text and task state
Output: Extracted answer string
Call Timing: Before calculating metrics for answer cleaning
format_prompt_template(sample: Sample) -> str
Purpose: Format the basic prompt template
Input: Sample object
Output: Formatted prompt text
Default Implementation: Uses prompt_template.format(question=sample.input)
format_fewshot_template(fewshot: str, sample: Sample) -> str
Purpose: Format the prompt template containing few-shot examples
Input: Few-shot example string and Sample object
Output: Complete few-shot prompt
Default Implementation: Uses few_shot_prompt_template.format()
sample_filter(sample: Sample) -> bool
Purpose: Filter dataset samples
Input: Sample object
Output: Whether to retain the sample
Default Implementation: Returns True (retains all samples)
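The behavior of three of these optional methods can be sketched with plain functions. This is a simplified illustration, not the library's implementation: the functions take strings instead of Sample and TaskState objects, and `PROMPT_TEMPLATE`, the `ANSWER:` convention, and the non-empty filter rule are assumptions chosen for the example.

```python
import re

# Hypothetical template; only the {question} placeholder matches the documented default.
PROMPT_TEMPLATE = 'Question: {question}\nEnd with "ANSWER: <your answer>".'


def format_prompt(question: str) -> str:
    """Mirror the documented default: prompt_template.format(question=...)."""
    return PROMPT_TEMPLATE.format(question=question)


def extract_answer(prediction: str) -> str:
    """Return the text after the last 'ANSWER:' marker, or '' if there is none."""
    matches = re.findall(r'ANSWER:\s*(.+)', prediction)
    return matches[-1].strip() if matches else ''


def sample_filter(question: str) -> bool:
    """Keep only samples with a non-empty question (hypothetical rule)."""
    return bool(question.strip())


print(format_prompt('What is 40 + 2?'))
print(extract_answer('First, 40 + 2 = 42.\nANSWER: 42'))  # 42
print(sample_filter('   '))  # False
```

Taking the last marker rather than the first is a deliberate choice in this sketch: models often restate the answer format while reasoning, and the final occurrence is the one most likely to hold the actual answer.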
Hook Method System
DataAdapter provides a hook method system, supporting custom logic insertion at key points:
Inference Phase Hooks
_on_inference_start(model, sample): Before inference starts
_on_inference(model, sample): Execute inference
_on_inference_end(model, sample, model_output, output_dir): After inference ends
Report Generation Hooks
_on_generate_report(scores, model_name): Generate report
_on_generate_report_end(report, output_dir): After report generation
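The hook pattern can be sketched as follows. `ToyBaseAdapter` and `LoggingAdapter` are hypothetical classes written for this example; only the hook names and their argument lists are taken from the list above, and the real base class does considerably more work in each step.

```python
from typing import Any, Callable, List


class ToyBaseAdapter:
    """Hypothetical base class showing the hook call order only."""

    def run_inference(self, model: Callable[[str], str], sample: str, output_dir: str) -> Any:
        self._on_inference_start(model, sample)
        model_output = self._on_inference(model, sample)
        return self._on_inference_end(model, sample, model_output, output_dir)

    def _on_inference_start(self, model, sample) -> None:
        pass  # default: nothing to prepare

    def _on_inference(self, model, sample) -> str:
        return model(sample)

    def _on_inference_end(self, model, sample, model_output, output_dir) -> str:
        return model_output


class LoggingAdapter(ToyBaseAdapter):
    """Overrides two hooks to record what happened around inference."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def _on_inference_start(self, model, sample) -> None:
        self.events.append(f'start:{sample}')

    def _on_inference_end(self, model, sample, model_output, output_dir) -> str:
        self.events.append(f'end:{model_output}')
        # Delegate to the base class so its default behavior is preserved.
        return super()._on_inference_end(model, sample, model_output, output_dir)


adapter = LoggingAdapter()
result = adapter.run_inference(str.upper, 'hello', output_dir='./outputs')
print(result, adapter.events)  # HELLO ['start:hello', 'end:HELLO']
```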
Adapter Types
EvalScope provides two main adapter base classes:
DefaultDataAdapter: Basic adapter for general text reasoning tasks
Suitable for open-ended question answering, mathematical reasoning, code generation, etc.
Requires custom answer extraction logic
MultiChoiceAdapter: Specialized adapter for multiple choice tasks
Inherits from DefaultDataAdapter
Built-in choice formatting and answer extraction logic
Supports single-choice and multiple-choice modes
Principles for choosing adapter types:
If the task involves selecting answers from fixed options → Use MultiChoiceAdapter
If the task requires generating open-ended answers → Use DefaultDataAdapter
1. Preparing Benchmark Evaluation Dataset
You have two ways to prepare the benchmark evaluation dataset:
Upload to ModelScope (Recommended): Upload the dataset to the ModelScope platform so that other users can load it easily and more people can benefit from your contribution. To upload a dataset to ModelScope, refer to the Dataset Upload Tutorial.
Local Use: You can also directly use the local dataset for evaluation, suitable for datasets that are still in development or contain sensitive information.
Regardless of the method chosen, ensure that the data format is correct and can be loaded. If using a local dataset, you can test with the following code:
from modelscope import MsDataset
dataset = MsDataset.load("/path/to/your/dataset") # Replace with your dataset
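Before loading, it can also help to check the raw file directly. The following is a minimal sketch under stated assumptions: the local dataset is a JSON Lines file, and the field names `question` and `answer` are hypothetical placeholders for whatever your `record_to_sample` reads. `check_jsonl` is an illustrative helper, not part of EvalScope or ModelScope.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Sequence

REQUIRED_FIELDS = ('question', 'answer')  # hypothetical field names


def check_jsonl(path: str, required: Sequence[str] = REQUIRED_FIELDS) -> List[str]:
    """Return a list of problems found; an empty list means the file looks fine."""
    problems: List[str] = []
    lines = Path(path).read_text(encoding='utf-8').splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f'line {line_no}: invalid JSON ({exc.msg})')
            continue
        if not isinstance(record, dict):
            problems.append(f'line {line_no}: not a JSON object')
            continue
        missing = [name for name in required if name not in record]
        if missing:
            problems.append(f'line {line_no}: missing {missing}')
    return problems


# Usage: write a tiny file with one good and one incomplete record, then check it.
with tempfile.TemporaryDirectory() as tmp:
    sample_file = Path(tmp) / 'data.jsonl'
    sample_file.write_text(
        '{"question": "1 + 1?", "answer": "2"}\n{"question": "2 + 2?"}\n',
        encoding='utf-8',
    )
    print(check_jsonl(str(sample_file)))  # ["line 2: missing ['answer']"]
```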
2. Creating File Structure
First, fork the EvalScope repository (that is, create your own copy of it) and clone it locally.
git clone https://github.com/your_username/evalscope.git
cd evalscope
Then, add the benchmark evaluation under the evalscope/benchmarks/ directory, with the following structure:
evalscope/benchmarks/
├── benchmark_name
│   ├── __init__.py
│   ├── benchmark_name_adapter.py
│   └── ...
Specifically for GSM8K and MMLU-Pro, the structure is as follows:
evalscope/benchmarks/
├── gsm8k
│   ├── __init__.py
│   └── gsm8k_adapter.py
├── mmlu_pro
│   ├── __init__.py
│   ├── mmlu_pro_adapter.py
│   └── ...
3. Writing Evaluation Logic
Below, we take GSM8K and MMLU-Pro as examples to introduce two types of evaluation tasks: General Text Reasoning and Multiple Choice.
General Text Reasoning
General text reasoning tasks usually require the model to analyze and reason about the given problem and then generate an answer. Taking GSM8K (mathematical reasoning) as an example, we need to register the benchmark and implement the GSM8KAdapter class in gsm8k_adapter.py:
from typing import Any, Dict

from evalscope.api.benchmark import BenchmarkMeta, DefaultDataAdapter
from evalscope.api.dataset import Sample
from evalscope.api.evaluator import TaskState
from evalscope.api.registry import register_benchmark
from evalscope.constants import Tags

# Define prompt template
PROMPT_TEMPLATE = """
Solve the following math problem step by step. The last line of your response should be of the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer to the problem.
{question}
Remember to put your answer on its own line at the end in the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer to the problem, and you do not need to use a \\boxed command.
Reasoning:
""".lstrip()


# Register benchmark evaluation
@register_benchmark(
    BenchmarkMeta(
        name='gsm8k',                        # Benchmark test name
        pretty_name='GSM8K',                 # Readable name
        dataset_id='AI-ModelScope/gsm8k',    # Dataset ID or local path
        tags=[Tags.MATH, Tags.REASONING],    # Tags
        description='GSM8K (Grade School Math 8K) is a dataset of grade school math problems, designed to evaluate the mathematical reasoning abilities of AI models.',
        subset_list=['main'],                # Subset list
        few_shot_num=4,                      # Few-shot example number
        train_split='train',                 # Training set split name
        eval_split='test',                   # Evaluation set split name
        metric_list=['acc'],                 # Evaluation metrics
        prompt_template=PROMPT_TEMPLATE,     # Prompt template
    )
)
class GSM8KAdapter(DefaultDataAdapter):

    def record_to_sample(self, record: Dict[str, Any]) -> Sample:
        """Convert raw data records into Sample objects"""
        DELIM = '####'
        question = record['question']
        answer = record['answer'].split(DELIM)
        target = answer.pop().strip()     # Extract final answer
        reasoning = DELIM.join(answer)    # Extract reasoning process
        return Sample(
            input=question,
            target=target,
            metadata={'reasoning': reasoning.strip()}
        )

    def sample_to_fewshot(self, sample: Sample) -> str:
        """Convert sample into few-shot example"""
        if sample.metadata:
            return (
                f'{sample.input}\n\nReasoning:\n' +
                f"{sample.metadata['reasoning']}\n\n" +
                f'ANSWER: {sample.target}'
            )
        else:
            return ''

    def extract_answer(self, prediction: str, task_state: TaskState) -> str:
        """Extract answer from model prediction"""
        from evalscope.filters.extraction import RegexFilter

        # Use regular expression to extract numeric answer
        regex = RegexFilter(regex_pattern=r'(-?[0-9.,]{2,})|(-?[0-9]+)', group_select=-1)
        res = regex(prediction)
        return res.replace(',', '').replace('+', '').strip().strip('.')
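The string handling in this adapter can be tried out without installing EvalScope. The sketch below reuses the delimiter and the regular expression from the adapter but replaces `RegexFilter` with the standard `re` module; treating `group_select=-1` as "take the last match" is an assumption, and `split_record` and `extract_number` are hypothetical helper names.

```python
import re
from typing import Tuple

DELIM = '####'
NUMBER_PATTERN = re.compile(r'(-?[0-9.,]{2,})|(-?[0-9]+)')


def split_record(answer_field: str) -> Tuple[str, str]:
    """Split a GSM8K-style answer into (reasoning, final_answer)."""
    parts = answer_field.split(DELIM)
    target = parts.pop().strip()           # text after the last '####'
    reasoning = DELIM.join(parts).strip()  # everything before it
    return reasoning, target


def extract_number(prediction: str) -> str:
    """Take the last number-like match and strip separators and a trailing period."""
    matches = [m.group(0) for m in NUMBER_PATTERN.finditer(prediction)]
    if not matches:
        return ''
    return matches[-1].replace(',', '').replace('+', '').strip().strip('.')


reasoning, target = split_record(
    '48 / 2 = 24 clips in May.\n48 + 24 = 72 clips in total.\n#### 72'
)
print(target)  # 72
print(extract_number('She sold 1,200 in April.\nANSWER: 1,272.'))  # 1272
```

Normalizing `1,272.` to `1272` matters because the extracted prediction is compared with a target that carries no thousands separator or trailing period.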
Multiple Choice
Multiple choice tasks require the model to select the correct answer from given options. Taking MMLU-Pro as an example, we need to inherit from MultiChoiceAdapter:
from typing import Any, Dict

from evalscope.api.benchmark import BenchmarkMeta, MultiChoiceAdapter
from evalscope.api.dataset import Sample
from evalscope.api.registry import register_benchmark
from evalscope.constants import Tags

# Define prompt template
USER_PROMPT_TEMPLATE = """Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}. Think step by step before answering.
Question:
{question}
Options:
{choices}
""".lstrip()

SUBSET_LIST = [
    'computer science', 'math', 'chemistry', 'engineering', 'law', 'biology',
    'health', 'physics', 'business', 'philosophy', 'economics', 'other',
    'psychology', 'history'
]


@register_benchmark(
    BenchmarkMeta(
        name='mmlu_pro',
        pretty_name='MMLU-Pro',
        tags=[Tags.MULTIPLE_CHOICE, Tags.KNOWLEDGE],
        description='MMLU-Pro is a benchmark for evaluating language models on multiple-choice questions across various subjects.',
        dataset_id='modelscope/MMLU-Pro',
        subset_list=SUBSET_LIST,
        metric_list=['acc'],
        few_shot_num=5,
        train_split='validation',
        eval_split='test',
        prompt_template=USER_PROMPT_TEMPLATE,
    )
)
class MMLUProAdapter(MultiChoiceAdapter):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.reformat_subset = True  # Enable subset division

    def record_to_sample(self, record: Dict[str, Any]) -> Sample:
        """Convert raw data records into Sample objects"""
        return Sample(
            input=record['question'],
            choices=record['options'],               # Choice list
            target=record['answer'],                 # Correct answer (e.g., 'A')
            subset_key=record['category'].lower(),   # Key for subset division
            metadata={
                'cot_content': record['cot_content'],
                'subject': record['category'].lower(),
                'question_id': record['question_id'],
            },
        )

    def sample_to_fewshot(self, sample: Sample) -> str:
        """Convert sample into few-shot example"""
        q_str = f"""Question:\n{str(sample.input)}"""
        options = sample.choices if sample.choices is not None else []
        # Format choices
        opt_str_list = []
        for i, opt in enumerate(options):
            opt_str_list.append(f"""{chr(65 + i)} {opt}""")
        # Join outside the f-string: a backslash inside an f-string expression
        # is a SyntaxError before Python 3.12
        opt_str = 'Options:\n' + '\n'.join(opt_str_list)
        # Handle answer and reasoning process
        ans_str = sample.metadata['cot_content'] if sample.metadata is not None else ''
        ans_str = ans_str.replace('The answer is', 'ANSWER:')
        ans_opt = ans_str.split('ANSWER:')[-1].split('.')[0].strip().strip('(').strip(')')
        ans_str = ans_str.replace(f'ANSWER: ({ans_opt})', f'ANSWER: {ans_opt}')
        final_str = '\n'.join([q_str, opt_str, ans_str])
        return final_str
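To see what this few-shot formatting produces, the same string logic can be run outside the adapter. The sketch below is a standalone rewrite that takes plain arguments instead of a Sample object; `format_fewshot` and the example question are hypothetical, and the chain-of-thought text only imitates the "The answer is (X)." ending that the adapter code expects.

```python
from typing import List


def format_fewshot(question: str, options: List[str], cot_content: str) -> str:
    """Render one few-shot example as question, lettered options, and reasoning."""
    q_str = f'Question:\n{question}'
    option_lines = [f'{chr(65 + i)} {opt}' for i, opt in enumerate(options)]
    opt_str = 'Options:\n' + '\n'.join(option_lines)
    # Rewrite "The answer is (B)." into the "ANSWER: B." form the prompt asks for.
    ans_str = cot_content.replace('The answer is', 'ANSWER:')
    ans_opt = ans_str.split('ANSWER:')[-1].split('.')[0].strip().strip('(').strip(')')
    ans_str = ans_str.replace(f'ANSWER: ({ans_opt})', f'ANSWER: {ans_opt}')
    return '\n'.join([q_str, opt_str, ans_str])


example = format_fewshot(
    question='Which data structure is first-in, first-out?',
    options=['Stack', 'Queue', 'Tree'],
    cot_content="A: Let's think step by step. A queue serves items in arrival order. The answer is (B).",
)
print(example)
```

Rewriting the ending keeps the few-shot examples consistent with the `ANSWER: $LETTER` format requested in the prompt template, so the model sees the same convention it is asked to follow.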
Key Differences Explanation
General Text Reasoning vs Multiple Choice:
Inherited Base Class:
General Text Reasoning: Inherits DefaultDataAdapter
Multiple Choice: Inherits MultiChoiceAdapter
Sample Object Structure:
General Text Reasoning: Mainly includes input and target
Multiple Choice: Additionally includes choices (choice list)
Answer Extraction Method:
General Text Reasoning: Requires custom extract_answer() method
Multiple Choice: MultiChoiceAdapter provides standard answer extraction logic
Prompt Template:
General Text Reasoning: Focuses more on guiding the reasoning process
Multiple Choice: Focuses on displaying choices and answer format
4. Running Evaluation
Run a small evaluation in debug mode to check that the new benchmark works as expected.
GSM8K Example:
from evalscope import run_task, TaskConfig

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['gsm8k'],
    limit=10,
    debug=True
)
run_task(task_cfg=task_cfg)
MMLU-Pro Example:
from evalscope import run_task, TaskConfig

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['mmlu_pro'],
    limit=10,
    dataset_args={'mmlu_pro': {'subset_list': ['computer science', 'math']}},
    debug=True
)
run_task(task_cfg=task_cfg)
Output Example:
+-----------------------+-----------+-----------------+------------------+-------+---------+---------+
| Model                 | Dataset   | Metric          | Subset           |   Num |   Score | Cat.0   |
+=======================+===========+=================+==================+=======+=========+=========+
| Qwen2.5-0.5B-Instruct | gsm8k     | mean_acc        | main             |    10 |     0.3 | default |
+-----------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen2.5-0.5B-Instruct | mmlu_pro  | mean_acc        | computer science |    10 |     0.1 | default |
+-----------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen2.5-0.5B-Instruct | mmlu_pro  | mean_acc        | math             |    10 |     0.1 | default |
+-----------------------+-----------+-----------------+------------------+-------+---------+---------+
5. Benchmark Evaluation Document Generation
After completing the benchmark evaluation implementation, you can use the tools provided by EvalScope to generate standard documents. This will ensure that your benchmark evaluation has a consistent document format and can be easily understood and used by other users.
To generate Chinese and English documents, run the following command, which will generate documents based on registration information:
pip install -e '.[docs]'
make docs
6. Submitting PR
After completing the implementation of these methods and document generation, your benchmark evaluation is ready! You can submit a PR. Before submitting, please run the following command, which will automatically format the code:
make lint
Make sure there are no formatting issues, and we will merge your contribution as soon as possible so that more users can use the benchmark evaluation you contributed. If you don't know how to submit a PR, you can check our Guide. Give it a try!