LLM Benchmarks

The following LLM benchmarks are supported. Click a dataset name to view its details.

Dataset Name	Standard Name	Task Categories
`aa_lcr`	AA-LCR	`Knowledge`, `LongContext`, `Reasoning`
`aime24`	AIME-2024	`Math`, `Reasoning`
`aime25`	AIME-2025	`Math`, `Reasoning`
`aime26`	AIME-2026	`Math`, `Reasoning`
`alpaca_eval`	AlpacaEval2.0	`Arena`, `InstructionFollowing`
`amc`	AMC	`Math`, `Reasoning`
`anat_em`	AnatEM	`Knowledge`, `NER`
`arc`	ARC	`MCQ`, `Reasoning`
`arena_hard`	ArenaHard	`Arena`, `InstructionFollowing`
`bbh`	BBH	`Reasoning`
`bc2gm`	BC2GM	`Knowledge`, `NER`
`bc4chemd`	BC4CHEMD	`Knowledge`, `NER`
`bc5cdr`	BC5CDR	`Knowledge`, `NER`
`biomix_qa`	BioMixQA	`Knowledge`, `MCQ`, `Medical`
`broad_twitter_corpus`	BroadTwitterCorpus	`Knowledge`, `NER`
`ceval`	C-Eval	`Chinese`, `Knowledge`, `MCQ`
`chinese_simpleqa`	Chinese-SimpleQA	`Chinese`, `Knowledge`, `QA`
`cl_bench`	CL-bench	`InstructionFollowing`, `Reasoning`
`cmmlu`	C-MMLU	`Chinese`, `Knowledge`, `MCQ`
`coin_flip`	CoinFlip	`Reasoning`, `Yes/No`
`commonsense_qa`	CommonsenseQA	`Commonsense`, `MCQ`, `Reasoning`
`competition_math`	Competition-MATH	`Math`, `Reasoning`
`conll2003`	CoNLL2003	`Knowledge`, `NER`
`conllpp`	CoNLL++	`Knowledge`, `NER`
`copious`	Copious	`Knowledge`, `NER`
`cross_ner`	CrossNER	`Knowledge`, `NER`
`data_collection`	Data-Collection	`Custom`
`docmath`	DocMath	`LongContext`, `Math`, `Reasoning`
`drivel_binary`	DrivelologyBinaryClassification	`Yes/No`
`drivel_multilabel`	DrivelologyMultilabelClassification	`MCQ`
`drivel_selection`	DrivelologyNarrativeSelection	`MCQ`
`drivel_writing`	DrivelologyNarrativeWriting	`Knowledge`, `Reasoning`
`drop`	DROP	`Reasoning`
`eq_bench`	EQ-Bench	`InstructionFollowing`
`fin_ner`	FinNER	`Knowledge`, `NER`
`frames`	FRAMES	`LongContext`, `Reasoning`
`general_arena`	GeneralArena	`Arena`, `Custom`
`general_mcq`	General-MCQ	`Custom`, `MCQ`
`general_qa`	General-QA	`Custom`, `QA`
`genia_ner`	GeniaNER	`Knowledge`, `NER`
`gpqa_diamond`	GPQA-Diamond	`Knowledge`, `MCQ`
`gsm8k`	GSM8K	`Math`, `Reasoning`
`halueval`	HaluEval	`Hallucination`, `Knowledge`, `Yes/No`
`harvey_ner`	HarveyNER	`Knowledge`, `NER`
`health_bench`	HealthBench	`Knowledge`, `Medical`, `QA`
`hellaswag`	HellaSwag	`Commonsense`, `Knowledge`, `MCQ`
`hle`	Humanity's-Last-Exam	`Knowledge`, `QA`
`hmmt25`	HMMT25	`Math`, `Reasoning`
`humaneval`	HumanEval	`Coding`
`humaneval_plus`	HumanEvalPlus	`Coding`
`ifbench`	IFBench	`InstructionFollowing`
`ifeval`	IFEval	`InstructionFollowing`
`iquiz`	IQuiz	`Chinese`, `Knowledge`, `MCQ`
`jnlpba`	JNLPBA	`Knowledge`, `NER`
`jnlpba_rare`	JNLPBA-Rare	`Knowledge`, `NER`
`live_code_bench`	Live-Code-Bench	`Coding`
`logi_qa`	LogiQA	`MCQ`, `Reasoning`
`longbench_v2`	LongBench-v2	`LongContext`, `MCQ`, `ReadingComprehension`
`maritime_bench`	MaritimeBench	`Chinese`, `Knowledge`, `MCQ`
`math_500`	MATH-500	`Math`, `Reasoning`
`math_qa`	MathQA	`MCQ`, `Math`, `Reasoning`
`mbpp`	MBPP	`Coding`
`mbpp_plus`	MBPP-Plus	`Coding`
`med_mcqa`	Med-MCQA	`Knowledge`, `MCQ`
`mgsm`	MGSM	`Math`, `MultiLingual`, `Reasoning`
`minerva_math`	Minerva-Math	`Math`, `Reasoning`
`mit_movie_trivia`	MIT-Movie-Trivia	`Knowledge`, `NER`
`mit_restaurant`	MIT-Restaurant	`Knowledge`, `NER`
`mmlu`	MMLU	`Knowledge`, `MCQ`
`mmlu_pro`	MMLU-Pro	`Knowledge`, `MCQ`
`mmlu_redux`	MMLU-Redux	`Knowledge`, `MCQ`
`mmmlu`	MMMLU	`Knowledge`, `MCQ`, `MultiLingual`
`mri_mcqa`	MRI-MCQA	`Knowledge`, `MCQ`, `Medical`
`multi_if`	Multi-IF	`InstructionFollowing`, `MultiLingual`, `MultiTurn`
`multi_nerd`	MultiNERD	`Knowledge`, `NER`
`multiple_humaneval`	MultiPL-E HumanEval	`Coding`
`multiple_mbpp`	MultiPL-E MBPP	`Coding`
`music_trivia`	MusicTrivia	`Knowledge`, `MCQ`
`musr`	MuSR	`MCQ`, `Reasoning`
`ncbi`	NCBI	`Knowledge`, `NER`
`needle_haystack`	Needle-in-a-Haystack	`LongContext`, `Retrieval`
`ontonotes5`	OntoNotes5	`Knowledge`, `NER`
`openai_mrcr`	OpenAI MRCR	`LongContext`, `Retrieval`
`piqa`	PIQA	`Commonsense`, `MCQ`, `Reasoning`
`poly_math`	PolyMath	`Math`, `MultiLingual`, `Reasoning`
`process_bench`	ProcessBench	`Math`, `Reasoning`
`pubmedqa`	PubMedQA	`Knowledge`, `Yes/No`
`qasc`	QASC	`Knowledge`, `MCQ`
`race`	RACE	`MCQ`, `Reasoning`
`refcoco`	RefCOCO	`Grounding`, `ImageCaptioning`, `Knowledge`, `MultiModal`
`scicode`	SciCode	`Coding`
`sciq`	SciQ	`Knowledge`, `MCQ`, `ReadingComprehension`
`simple_qa`	SimpleQA	`Knowledge`, `QA`
`siqa`	SIQA	`Commonsense`, `MCQ`, `Reasoning`
`super_gpqa`	SuperGPQA	`Knowledge`, `MCQ`
`swe_bench_lite`	SWE-bench_Lite	`Coding`
`swe_bench_verified`	SWE-bench_Verified	`Coding`
`swe_bench_verified_mini`	SWE-bench_Verified_mini	`Coding`
`terminal_bench_v2`	Terminal-Bench-2.0	`Coding`
`tool_bench`	ToolBench-Static	`FunctionCalling`, `Reasoning`
`trivia_qa`	TriviaQA	`QA`, `ReadingComprehension`
`truthful_qa`	TruthfulQA	`Knowledge`
`tweebank_ner`	TweeBankNER	`Knowledge`, `NER`
`tweet_ner_7`	TweetNER7	`Knowledge`, `NER`
`winogrande`	Winogrande	`MCQ`, `Reasoning`
`wmt24pp`	WMT2024++	`MachineTranslation`, `MultiLingual`
`wnut2017`	WNUT2017	`Knowledge`, `NER`
`zebralogicbench`	ZebraLogicBench	`Reasoning`
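
Because each row lists one or more task categories, it can be handy to look up datasets by category. The following is a minimal sketch that assumes the rows are available as tab-separated text in the same layout as the table above. The names `ROWS` and `index_by_category` are hypothetical helpers for illustration and are not part of any library; only four rows are copied in to keep the example short.

```python
from collections import defaultdict

# A few rows copied from the table above (dataset name, standard name, categories),
# separated by tabs as in the listing. Extend with more rows as needed.
ROWS = """\
`aime24`\tAIME-2024\t`Math`, `Reasoning`
`gsm8k`\tGSM8K\t`Math`, `Reasoning`
`ceval`\tC-Eval\t`Chinese`, `Knowledge`, `MCQ`
`humaneval`\tHumanEval\t`Coding`
"""


def index_by_category(rows: str) -> dict[str, list[str]]:
    """Map each task category to the dataset names tagged with it."""
    index = defaultdict(list)
    for line in rows.strip().splitlines():
        name, _standard_name, categories = line.split("\t")
        for category in categories.split(","):
            # Drop surrounding whitespace and the inline-code backticks.
            index[category.strip().strip("`")].append(name.strip("`"))
    return dict(index)


by_category = index_by_category(ROWS)
print(by_category["Math"])  # ['aime24', 'gsm8k']
```

The dataset names recovered this way are the identifiers in the first column, which is the form to use when referring to a benchmark by name.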