LLM Benchmarks

The supported LLM benchmarks are listed below. Click a dataset name to view its details.

| Dataset Name | Standard Name | Task Categories |
| --- | --- | --- |
| aa_lcr | AA-LCR | Knowledge, LongContext, Reasoning |
| aime24 | AIME-2024 | Math, Reasoning |
| aime25 | AIME-2025 | Math, Reasoning |
| alpaca_eval | AlpacaEval2.0 | Arena, InstructionFollowing |
| amc | AMC | Math, Reasoning |
| anat_em | AnatEM | Knowledge, NER |
| arc | ARC | MCQ, Reasoning |
| arena_hard | ArenaHard | Arena, InstructionFollowing |
| bbh | BBH | Reasoning |
| bc2gm | BC2GM | Knowledge, NER |
| bc4chemd | BC4CHEMD | Knowledge, NER |
| bc5cdr | BC5CDR | Knowledge, NER |
| biomix_qa | BioMixQA | Knowledge, MCQ, Medical |
| broad_twitter_corpus | BroadTwitterCorpus | Knowledge, NER |
| ceval | C-Eval | Chinese, Knowledge, MCQ |
| chinese_simpleqa | Chinese-SimpleQA | Chinese, Knowledge, QA |
| cl_bench | CL-bench | InstructionFollowing, Reasoning |
| cmmlu | C-MMLU | Chinese, Knowledge, MCQ |
| coin_flip | CoinFlip | Reasoning, Yes/No |
| commonsense_qa | CommonsenseQA | Commonsense, MCQ, Reasoning |
| competition_math | Competition-MATH | Math, Reasoning |
| conll2003 | CoNLL2003 | Knowledge, NER |
| conllpp | CoNLL++ | Knowledge, NER |
| copious | Copious | Knowledge, NER |
| cross_ner | CrossNER | Knowledge, NER |
| data_collection | Data-Collection | Custom |
| docmath | DocMath | LongContext, Math, Reasoning |
| drivel_binary | DrivelologyBinaryClassification | Yes/No |
| drivel_multilabel | DrivelologyMultilabelClassification | MCQ |
| drivel_selection | DrivelologyNarrativeSelection | MCQ |
| drivel_writing | DrivelologyNarrativeWriting | Knowledge, Reasoning |
| drop | DROP | Reasoning |
| eq_bench | EQ-Bench | InstructionFollowing |
| fin_ner | FinNER | Knowledge, NER |
| frames | FRAMES | LongContext, Reasoning |
| general_arena | GeneralArena | Arena, Custom |
| general_mcq | General-MCQ | Custom, MCQ |
| general_qa | General-QA | Custom, QA |
| genia_ner | GeniaNER | Knowledge, NER |
| gpqa_diamond | GPQA-Diamond | Knowledge, MCQ |
| gsm8k | GSM8K | Math, Reasoning |
| halueval | HaluEval | Hallucination, Knowledge, Yes/No |
| harvey_ner | HarveyNER | Knowledge, NER |
| health_bench | HealthBench | Knowledge, Medical, QA |
| hellaswag | HellaSwag | Commonsense, Knowledge, MCQ |
| hle | Humanity's-Last-Exam | Knowledge, QA |
| hmmt25 | HMMT25 | Math, Reasoning |
| humaneval | HumanEval | Coding |
| humaneval_plus | HumanEvalPlus | Coding |
| ifbench | IFBench | InstructionFollowing |
| ifeval | IFEval | InstructionFollowing |
| iquiz | IQuiz | Chinese, Knowledge, MCQ |
| jnlpba | JNLPBA | Knowledge, NER |
| jnlpba_rare | JNLPBA-Rare | Knowledge, NER |
| live_code_bench | Live-Code-Bench | Coding |
| logi_qa | LogiQA | MCQ, Reasoning |
| maritime_bench | MaritimeBench | Chinese, Knowledge, MCQ |
| math_500 | MATH-500 | Math, Reasoning |
| math_qa | MathQA | MCQ, Math, Reasoning |
| mbpp | MBPP | Coding |
| mbpp_plus | MBPP-Plus | Coding |
| med_mcqa | Med-MCQA | Knowledge, MCQ |
| mgsm | MGSM | Math, MultiLingual, Reasoning |
| minerva_math | Minerva-Math | Math, Reasoning |
| mit_movie_trivia | MIT-Movie-Trivia | Knowledge, NER |
| mit_restaurant | MIT-Restaurant | Knowledge, NER |
| mmlu | MMLU | Knowledge, MCQ |
| mmlu_pro | MMLU-Pro | Knowledge, MCQ |
| mmlu_redux | MMLU-Redux | Knowledge, MCQ |
| mri_mcqa | MRI-MCQA | Knowledge, MCQ, Medical |
| multi_if | Multi-IF | InstructionFollowing, MultiLingual, MultiTurn |
| multi_nerd | MultiNERD | Knowledge, NER |
| multiple_humaneval | MultiPL-E HumanEval | Coding |
| multiple_mbpp | MultiPL-E MBPP | Coding |
| music_trivia | MusicTrivia | Knowledge, MCQ |
| musr | MuSR | MCQ, Reasoning |
| ncbi | NCBI | Knowledge, NER |
| needle_haystack | Needle-in-a-Haystack | LongContext, Retrieval |
| ontonotes5 | OntoNotes5 | Knowledge, NER |
| openai_mrcr | OpenAI MRCR | LongContext, Retrieval |
| piqa | PIQA | Commonsense, MCQ, Reasoning |
| poly_math | PolyMath | Math, MultiLingual, Reasoning |
| process_bench | ProcessBench | Math, Reasoning |
| pubmedqa | PubMedQA | Knowledge, Yes/No |
| qasc | QASC | Knowledge, MCQ |
| race | RACE | MCQ, Reasoning |
| refcoco | RefCOCO | Grounding, ImageCaptioning, Knowledge, MultiModal |
| scicode | SciCode | Coding |
| sciq | SciQ | Knowledge, MCQ, ReadingComprehension |
| simple_qa | SimpleQA | Knowledge, QA |
| siqa | SIQA | Commonsense, MCQ, Reasoning |
| super_gpqa | SuperGPQA | Knowledge, MCQ |
| swe_bench_lite | SWE-bench_Lite | Coding |
| swe_bench_verified | SWE-bench_Verified | Coding |
| swe_bench_verified_mini | SWE-bench_Verified_mini | Coding |
| terminal_bench_v2 | Terminal-Bench-2.0 | Coding |
| trivia_qa | TriviaQA | QA, ReadingComprehension |
| truthful_qa | TruthfulQA | Knowledge |
| tweebank_ner | TweeBankNER | Knowledge, NER |
| tweet_ner_7 | TweetNER7 | Knowledge, NER |
| winogrande | Winogrande | MCQ, Reasoning |
| wmt24pp | WMT2024++ | MachineTranslation, MultiLingual |
| wnut2017 | WNUT2017 | Knowledge, NER |
| zebralogicbench | ZebraLogicBench | Reasoning |
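
Because each dataset carries one or more task categories, the list above can be narrowed to the benchmarks relevant to a given capability. The following is a minimal sketch in plain Python, not part of any library: the `BENCHMARKS` dictionary is a hand-copied excerpt of a few entries from the list, and `select_by_category` is a hypothetical helper name introduced only for illustration.

```python
# Hypothetical excerpt of the benchmark list: dataset name -> task categories.
# Only a handful of entries are copied here; extend it as needed.
BENCHMARKS = {
    "aime24": {"Math", "Reasoning"},
    "gsm8k": {"Math", "Reasoning"},
    "docmath": {"LongContext", "Math", "Reasoning"},
    "humaneval": {"Coding"},
    "ceval": {"Chinese", "Knowledge", "MCQ"},
    "needle_haystack": {"LongContext", "Retrieval"},
}


def select_by_category(benchmarks, *categories):
    """Return the sorted dataset names tagged with every requested category."""
    wanted = set(categories)
    # `wanted <= tags` is a subset test: all requested categories must be present.
    return sorted(name for name, tags in benchmarks.items() if wanted <= tags)


print(select_by_category(BENCHMARKS, "Math"))
# → ['aime24', 'docmath', 'gsm8k']
print(select_by_category(BENCHMARKS, "LongContext", "Reasoning"))
# → ['docmath']
```

Passing several categories requires a dataset to carry all of them, which is useful for picking, say, only the long-context reasoning benchmarks.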