# MMLU
This is a massive multitask test consisting of multiple-choice questions from many branches of knowledge. It covers the humanities, the social sciences, the hard sciences, and other important areas of study, spanning 57 tasks in total, including elementary mathematics, US history, computer science, law, and more. Achieving high accuracy on this test requires a model to have extensive world knowledge and problem-solving ability. Dataset link
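To make the task format concrete, here is a minimal sketch that loads the test split and renders one item as a plain 0-shot multiple-choice prompt. It assumes the `cais/mmlu` mirror on the Hugging Face Hub; the prompt template is illustrative and may differ from the one used to produce the results below.

```python
# Minimal sketch: load MMLU and format one item as a 0-shot prompt.
# Assumes the `cais/mmlu` Hub mirror; the template below is illustrative.
from datasets import load_dataset

CHOICE_LABELS = ["A", "B", "C", "D"]

def format_question(example: dict) -> str:
    """Render one MMLU item as a plain multiple-choice prompt."""
    lines = [example["question"]]
    lines += [f"{label}. {text}" for label, text in zip(CHOICE_LABELS, example["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

mmlu = load_dataset("cais/mmlu", "all", split="test")
print(format_question(mmlu[0]))
print("Gold:", CHOICE_LABELS[mmlu[0]["answer"]])  # `answer` is stored as an index 0-3
```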
## Experiment Settings
- Split: test
- Total num: 13985
- 0-shot
## Experiment Results
| Model | Revision | Precision | Humanities | STEM | SocialScience | Other | WeightedAvg | Target | Delta |
|---|---|---|---|---|---|---|---|---|---|
| Baichuan2-7B-Base | v1.0.2 | fp16 | 0.4111 | 0.3807 | 0.5233 | 0.504 | 0.4506 | - | - |
| Baichuan2-7B-Chat | v1.0.4 | fp16 | 0.4439 | 0.374 | 0.5524 | 0.5458 | 0.4762 | - | - |
| chatglm2-6b | v1.0.12 | fp16 | 0.3834 | 0.3413 | 0.4708 | 0.4445 | 0.4077 | 0.4546 (CoT) | -4.69% |
| chatglm3-6b-base | v1.0.1 | fp16 | 0.5435 | 0.5087 | 0.7227 | 0.6471 | 0.5992 | 0.614 | -1.48% |
| internlm-chat-7b | v1.0.1 | fp16 | 0.4005 | 0.3547 | 0.4953 | 0.4796 | 0.4297 | - | - |
| Llama-2-13b-ms | v1.0.2 | fp16 | 0.4371 | 0.3887 | 0.5579 | 0.5437 | 0.4778 | - | - |
| Llama-2-7b-ms | v1.0.2 | fp16 | 0.3146 | 0.3037 | 0.4134 | 0.3885 | 0.3509 | - | - |
| Qwen-14B-Chat | v1.0.6 | bf16 | 0.5326 | 0.5397 | 0.7184 | 0.6859 | 0.6102 | - | - |
| Qwen-7B | v1.1.6 | bf16 | 0.387 | 0.4 | 0.5403 | 0.5139 | 0.4527 | - | - |
| Qwen-7B-Chat-Int8 | v1.1.6 | int8 | 0.4322 | 0.4277 | 0.6088 | 0.5778 | 0.5035 | - | - |
Target -- the model's officially reported score on this dataset
Delta -- the difference between the WeightedAvg score and the Target score
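As a sanity check on these two columns, the sketch below reproduces Delta directly from table values, and shows one plausible definition of WeightedAvg (per-category accuracy weighted by question count; the counts themselves are not listed in this document, so that function is left uncalled).

```python
# Minimal sketch of the summary columns. WeightedAvg is assumed to weight
# each category's accuracy by its question count; Delta is the gap to the
# officially reported Target, in percentage points.

def weighted_avg(scores: dict[str, float], counts: dict[str, int]) -> float:
    """Question-count-weighted mean over the four MMLU categories."""
    total = sum(counts.values())
    return sum(scores[cat] * counts[cat] for cat in scores) / total

def delta(weighted: float, target: float) -> str:
    """Signed gap to the Target score, formatted in percentage points."""
    return f"{(weighted - target) * 100:+.2f}%"

# Values taken from the 0-shot table above:
print(delta(0.4077, 0.4546))  # chatglm2-6b       -> -4.69%
print(delta(0.5992, 0.6140))  # chatglm3-6b-base  -> -1.48%
```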
### Settings: (Split: test, Total num: 13985, 5-shot)
| Model | Revision | Precision | Humanities | STEM | SocialScience | Other | WeightedAvg | Avg | Target | Delta |
|---|---|---|---|---|---|---|---|---|---|---|
| Baichuan2-7B-Base | v1.0.2 | fp16 | 0.4295 | 0.398 | 0.5736 | 0.5325 | 0.4781 | 0.4918 | 0.5416 (official) | -4.98% |
| Baichuan2-7B-Chat | v1.0.4 | fp16 | 0.4344 | 0.3937 | 0.5814 | 0.5462 | 0.4837 | 0.5029 | 0.5293 (official) | -2.64% |
| chatglm2-6b | v1.0.12 | fp16 | 0.3941 | 0.376 | 0.4897 | 0.4706 | 0.4288 | 0.4442 | - | - |
| chatglm3-6b-base | v1.0.1 | fp16 | 0.5356 | 0.4847 | 0.7175 | 0.6273 | 0.5857 | 0.5995 | - | - |
| internlm-chat-7b | v1.0.1 | fp16 | 0.4171 | 0.3903 | 0.5772 | 0.5493 | 0.4769 | 0.4876 | - | - |
| Llama-2-13b-ms | v1.0.2 | fp16 | 0.484 | 0.4133 | 0.6157 | 0.5809 | 0.5201 | 0.5327 | 0.548 (official) | -1.53% |
| Llama-2-7b-ms | v1.0.2 | fp16 | 0.3747 | 0.3363 | 0.4372 | 0.4514 | 0.3979 | 0.4089 | 0.453 (official) | -4.41% |
| Qwen-14B-Chat | v1.0.6 | bf16 | 0.574 | 0.553 | 0.7403 | 0.684 | 0.6313 | 0.6414 | 0.646 (official) | -0.46% |
| Qwen-7B | v1.1.6 | bf16 | 0.4587 | 0.426 | 0.6078 | 0.5629 | 0.5084 | 0.5151 | 0.567 (official) | -5.2% |
| Qwen-7B-Chat-Int8 | v1.1.6 | int8 | 0.4697 | 0.4383 | 0.6284 | 0.5967 | 0.5271 | 0.5347 | 0.554 (official) | -1.93% |
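For the 5-shot runs, in-context exemplars are commonly drawn from MMLU's per-subject `dev` split (five items per subject). The sketch below builds such a prompt under that assumption, again via the `cais/mmlu` mirror; the subject and template are illustrative rather than the exact configuration behind these numbers.

```python
# Minimal sketch: build a 5-shot MMLU prompt, assuming the exemplars come
# from the per-subject `dev` split of the `cais/mmlu` Hub mirror.
from datasets import load_dataset

LABELS = ["A", "B", "C", "D"]

def render(example: dict, with_answer: bool) -> str:
    """Render one item; optionally append its gold answer (for exemplars)."""
    lines = [example["question"]]
    lines += [f"{label}. {choice}" for label, choice in zip(LABELS, example["choices"])]
    suffix = f" {LABELS[example['answer']]}" if with_answer else ""
    return "\n".join(lines) + f"\nAnswer:{suffix}"

subject = "abstract_algebra"  # illustrative pick among the 57 tasks
dev = load_dataset("cais/mmlu", subject, split="dev")    # 5 exemplars per subject
test = load_dataset("cais/mmlu", subject, split="test")

prompt = "\n\n".join(render(ex, with_answer=True) for ex in dev)
prompt += "\n\n" + render(test[0], with_answer=False)
print(prompt)
```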