# MMLU
MMLU is a large-scale, multi-task benchmark composed of multiple-choice questions drawn from a wide range of knowledge domains. It covers 57 tasks spanning the humanities, social sciences, hard sciences, and other important areas of study, including basic mathematics, American history, computer science, and law. Achieving high accuracy on this test requires both broad world knowledge and problem-solving ability.

Dataset Link
## Experimental Setup
- Split: test
- Total number: 13985
- 0-shot
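Each MMLU item is a four-way multiple-choice question. As a rough illustration of the 0-shot setting used here, the sketch below loads the test split and renders one item as a plain prompt. The dataset id `cais/mmlu`, its field names, and the prompt format are assumptions based on the public Hugging Face mirror, not the exact data copy or prompt template used for the runs below (which is why the mirror's question count may differ slightly from the 13985 reported here).

```python
# Minimal sketch: load MMLU and format one question as a 0-shot prompt.
# "cais/mmlu" and its field names are assumptions (public HF mirror).
from datasets import load_dataset

CHOICE_LABELS = ["A", "B", "C", "D"]

def format_zero_shot(question: str, choices: list[str]) -> str:
    """Render one MMLU item as a plain multiple-choice prompt."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(CHOICE_LABELS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

if __name__ == "__main__":
    # The "all" config merges the 57 subjects into a single split.
    mmlu = load_dataset("cais/mmlu", "all", split="test")
    sample = mmlu[0]
    print(format_zero_shot(sample["question"], sample["choices"]))
    print("Gold answer:", CHOICE_LABELS[sample["answer"]])
```

A 5-shot prompt (as in the second table below) would simply prepend five solved examples from the same subject before the test question.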
## Experimental Results
| Model | Revision | Precision | Humanities | STEM | Social Science | Other | Weighted Avg | Target | Delta |
|---|---|---|---|---|---|---|---|---|---|
| Baichuan2-7B-Base | v1.0.2 | fp16 | 0.4111 | 0.3807 | 0.5233 | 0.504 | 0.4506 | - | - |
| Baichuan2-7B-Chat | v1.0.4 | fp16 | 0.4439 | 0.374 | 0.5524 | 0.5458 | 0.4762 | - | - |
| chatglm2-6b | v1.0.12 | fp16 | 0.3834 | 0.3413 | 0.4708 | 0.4445 | 0.4077 | 0.4546 (CoT) | -4.69% |
| chatglm3-6b-base | v1.0.1 | fp16 | 0.5435 | 0.5087 | 0.7227 | 0.6471 | 0.5992 | 0.614 | -1.48% |
| internlm-chat-7b | v1.0.1 | fp16 | 0.4005 | 0.3547 | 0.4953 | 0.4796 | 0.4297 | - | - |
| Llama-2-13b-ms | v1.0.2 | fp16 | 0.4371 | 0.3887 | 0.5579 | 0.5437 | 0.4778 | - | - |
| Llama-2-7b-ms | v1.0.2 | fp16 | 0.3146 | 0.3037 | 0.4134 | 0.3885 | 0.3509 | - | - |
| Qwen-14B-Chat | v1.0.6 | bf16 | 0.5326 | 0.5397 | 0.7184 | 0.6859 | 0.6102 | - | - |
| Qwen-7B | v1.1.6 | bf16 | 0.387 | 0.4 | 0.5403 | 0.5139 | 0.4527 | - | - |
| Qwen-7B-Chat-Int8 | v1.1.6 | int8 | 0.4322 | 0.4277 | 0.6088 | 0.5778 | 0.5035 | - | - |
- Target: the officially declared score of the model on this dataset.
- Delta: the difference between the weighted average score and the target score.
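As a concrete example of how Delta is derived, the snippet below reproduces the chatglm2-6b row of the 0-shot table above. Note that in the 5-shot table further down, the reported Delta values appear to line up with the Avg column rather than the Weighted Avg.

```python
# Delta = measured score minus the officially declared (target) score,
# reported in percentage points. Values from the chatglm2-6b 0-shot row.
weighted_avg = 0.4077   # measured weighted average accuracy
target = 0.4546         # official CoT score declared for the model
delta = (weighted_avg - target) * 100
print(f"Delta = {delta:+.2f}%")   # -> Delta = -4.69%
```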
## Settings (Split: test, Total number: 13985, 5-shot)
| Model | Revision | Precision | Humanities | STEM | Social Science | Other | Weighted Avg | Avg | Target | Delta |
|---|---|---|---|---|---|---|---|---|---|---|
| Baichuan2-7B-Base | v1.0.2 | fp16 | 0.4295 | 0.398 | 0.5736 | 0.5325 | 0.4781 | 0.4918 | 0.5416 (official) | -4.98% |
| Baichuan2-7B-Chat | v1.0.4 | fp16 | 0.4344 | 0.3937 | 0.5814 | 0.5462 | 0.4837 | 0.5029 | 0.5293 (official) | -2.64% |
| chatglm2-6b | v1.0.12 | fp16 | 0.3941 | 0.376 | 0.4897 | 0.4706 | 0.4288 | 0.4442 | - | - |
| chatglm3-6b-base | v1.0.1 | fp16 | 0.5356 | 0.4847 | 0.7175 | 0.6273 | 0.5857 | 0.5995 | - | - |
| internlm-chat-7b | v1.0.1 | fp16 | 0.4171 | 0.3903 | 0.5772 | 0.5493 | 0.4769 | 0.4876 | - | - |
| Llama-2-13b-ms | v1.0.2 | fp16 | 0.484 | 0.4133 | 0.6157 | 0.5809 | 0.5201 | 0.5327 | 0.548 (official) | -1.53% |
| Llama-2-7b-ms | v1.0.2 | fp16 | 0.3747 | 0.3363 | 0.4372 | 0.4514 | 0.3979 | 0.4089 | 0.453 (official) | -4.41% |
| Qwen-14B-Chat | v1.0.6 | bf16 | 0.574 | 0.553 | 0.7403 | 0.684 | 0.6313 | 0.6414 | 0.646 (official) | -0.46% |
| Qwen-7B | v1.1.6 | bf16 | 0.4587 | 0.426 | 0.6078 | 0.5629 | 0.5084 | 0.5151 | 0.567 (official) | -5.2% |
| Qwen-7B-Chat-Int8 | v1.1.6 | int8 | 0.4697 | 0.4383 | 0.6284 | 0.5967 | 0.5271 | 0.5347 | 0.554 (official) | -1.93% |
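The 5-shot table reports both a Weighted Avg and an Avg column. A plausible reading, treated purely as an assumption in the sketch below, is that Avg is the plain mean of per-subject accuracies while Weighted Avg weights each subject by its number of questions (equivalently, overall accuracy across all items). The subject names and counts in the code are hypothetical placeholders, not the actual split sizes used in these runs.

```python
# Hypothetical illustration of the assumed difference between "Avg" and
# "Weighted Avg". Accuracies and question counts below are placeholders.
subjects = {
    # subject: (accuracy, number_of_questions)
    "abstract_algebra": (0.30, 100),
    "high_school_us_history": (0.55, 204),
    "college_computer_science": (0.48, 100),
    "professional_law": (0.38, 1534),
}

# Plain mean over subjects ("Avg", assumed).
avg = sum(acc for acc, _ in subjects.values()) / len(subjects)

# Mean weighted by each subject's question count ("Weighted Avg", assumed),
# i.e. overall accuracy over all questions.
weighted_avg = sum(acc * n for acc, n in subjects.values()) / sum(
    n for _, n in subjects.values()
)

print(f"Avg = {avg:.4f}")
print(f"Weighted Avg = {weighted_avg:.4f}")
```

Under this reading, subjects with many questions (such as professional law) pull the Weighted Avg toward their own accuracy, which is why the two columns can differ noticeably for the same model.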