Large Language Model
This framework supports two predefined dataset formats: multiple-choice questions (MCQ) and question-answering (QA). The workflow is as follows:
Multiple-Choice Question Format (MCQ)
This format is suitable for scenarios where users are dealing with multiple-choice questions, and the evaluation metric is accuracy.
1. Data Preparation
Prepare CSV files in the MCQ format. The dataset directory should contain two files:
mcq/
├── example_dev.csv   # Name must be *_dev.csv; used for few-shot evaluation, and may be empty for zero-shot evaluation
└── example_val.csv   # Name must be *_val.csv; contains the actual evaluation data
The CSV file must follow this format:
id,question,A,B,C,D,answer,explanation
1,Generally, the amino acids that make up animal proteins are____,4,22,20,19,C,1. Currently, there are 20 known amino acids that constitute animal proteins.
2,Among the substances present in the blood, which of the following is not a metabolic end product____?,Urea,Uric acid,Pyruvate,Carbon dioxide,C,"Metabolic end products refer to substances produced during metabolism in the organism that cannot be reused and need to be excreted. Pyruvate is a product of carbohydrate metabolism that can be further metabolized for energy or to synthesize other substances, and is not a metabolic end product."
Where:
- id: the evaluation sequence number
- question: the question text
- A, B, C, D: the options (leave empty if there are fewer than four options)
- answer: the correct option
- explanation: the explanation for the answer (optional)
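For convenience, here is a minimal Python sketch (standard-library csv only; the row content is taken from the example above) that writes the two files in this layout:
import csv
import os

header = ["id", "question", "A", "B", "C", "D", "answer", "explanation"]
rows = [
    ["1", "Generally, the amino acids that make up animal proteins are____",
     "4", "22", "20", "19", "C",
     "Currently, there are 20 known amino acids that constitute animal proteins."],
]

os.makedirs("mcq", exist_ok=True)
# *_val.csv holds the actual evaluation data; *_dev.csv holds few-shot
# examples and may contain only the header row for zero-shot evaluation.
for path in ["mcq/example_dev.csv", "mcq/example_val.csv"]:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)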
2. Configuration File
from evalscope.config import TaskConfig
from evalscope.run import run_task

task_cfg = TaskConfig(
    model='qwen/Qwen2-0.5B-Instruct',
    datasets=['ceval'],  # Dataset format; fixed as 'ceval' for multiple-choice questions
    dataset_args={
        'ceval': {
            "local_path": "custom_eval/text/mcq",  # Path to the custom dataset
            "subset_list": [
                "example"  # Name of the evaluation dataset, i.e. the * in *_dev.csv above
            ]
        }
    },
)
run_task(task_cfg=task_cfg)
Run Result:
+---------------------+---------------+
| Model | mcq |
+=====================+===============+
| Qwen2-0.5B-Instruct | (mcq/acc) 0.5 |
+---------------------+---------------+
Question-Answering Format (QA)
This format is suitable for scenarios where users are dealing with question-answer pairs, and the evaluation metrics are ROUGE and BLEU.
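For intuition about these metrics, the following standalone sketch scores one prediction against a reference using the third-party rouge-score and nltk packages. This is only an illustration of how ROUGE and BLEU behave; it is not necessarily the implementation this framework uses internally:
from rouge_score import rouge_scorer  # pip install rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction  # pip install nltk

reference = "The capital of China is Beijing."
prediction = "Beijing is the capital of China."

# ROUGE-n/ROUGE-L report recall (r), precision (p), and F-measure (f)
# over n-gram (or longest-common-subsequence) overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, score in scorer.score(reference, prediction).items():
    print(name, f"r={score.recall:.3f} p={score.precision:.3f} f={score.fmeasure:.3f}")

# BLEU-n measures n-gram precision; smoothing avoids zero scores on short texts.
bleu2 = sentence_bleu([reference.split()], prediction.split(),
                      weights=(0.5, 0.5),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-2 = {bleu2:.3f}")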
1. Data Preparation
Prepare a JSONL file in the QA format. The dataset directory should contain one file:
qa/
└── example.jsonl
The JSONL file must follow this format:
{"query": "What is the capital of China?", "response": "The capital of China is Beijing."}
{"query": "What is the tallest mountain in the world?", "response": "It is Mount Everest."}
{"query": "Why can't you find penguins in the Arctic?", "response": "Because most penguins live in the Antarctic."}
2. Configuration File
from evalscope.config import TaskConfig
from evalscope.run import run_task

task_cfg = TaskConfig(
    model='qwen/Qwen2-0.5B-Instruct',
    datasets=['general_qa'],  # Dataset format; fixed as 'general_qa' for question answering
    dataset_args={
        'general_qa': {
            "local_path": "custom_eval/text/qa",  # Path to the custom dataset
            "subset_list": [
                "example"  # Name of the evaluation dataset, i.e. the * in *.jsonl above
            ]
        }
    },
)
run_task(task_cfg=task_cfg)
Run Result:
+---------------------+------------------------------------+
| Model | qa |
+=====================+====================================+
| Qwen2-0.5B-Instruct | (qa/rouge-1-r) 0.888888888888889 |
| | (qa/rouge-1-p) 0.2386966558963222 |
| | (qa/rouge-1-f) 0.3434493794481033 |
| | (qa/rouge-2-r) 0.6166666666666667 |
| | (qa/rouge-2-p) 0.14595543345543344 |
| | (qa/rouge-2-f) 0.20751474380718113 |
| | (qa/rouge-l-r) 0.888888888888889 |
| | (qa/rouge-l-p) 0.23344334652802393 |
| | (qa/rouge-l-f) 0.33456027373987435 |
| | (qa/bleu-1) 0.23344334652802393 |
| | (qa/bleu-2) 0.14571148341640142 |
| | (qa/bleu-3) 0.0625 |
| | (qa/bleu-4) 0.05555555555555555 |
+---------------------+------------------------------------+
(Optional) Custom Evaluation Using the ms-swift Framework
See also: the ms-swift framework supports two patterns of evaluation sets: the multiple-choice format (CEval) and the question-answering format (General-QA). Reference: ms-swift Custom Evaluation Sets.