| Benchmark Name | Pretty Name | Task Categories |
|---|---|---|
| aa_lcr | AA-LCR | Knowledge, LongContext, Reasoning |
| aime24 | AIME-2024 | Math, Reasoning |
| aime25 | AIME-2025 | Math, Reasoning |
| aime26 | AIME-2026 | Math, Reasoning |
| alpaca_eval | AlpacaEval2.0 | Arena, InstructionFollowing |
| amc | AMC | Math, Reasoning |
| anat_em | AnatEM | Knowledge, NER |
| arc | ARC | MCQ, Reasoning |
| arena_hard | ArenaHard | Arena, InstructionFollowing |
| bbh | BBH | Reasoning |
| bc2gm | BC2GM | Knowledge, NER |
| bc4chemd | BC4CHEMD | Knowledge, NER |
| bc5cdr | BC5CDR | Knowledge, NER |
| biomix_qa | BioMixQA | Knowledge, MCQ, Medical |
| broad_twitter_corpus | BroadTwitterCorpus | Knowledge, NER |
| ceval | C-Eval | Chinese, Knowledge, MCQ |
| chinese_simpleqa | Chinese-SimpleQA | Chinese, Knowledge, QA |
| cl_bench | CL-bench | InstructionFollowing, Reasoning |
| cmmlu | C-MMLU | Chinese, Knowledge, MCQ |
| coin_flip | CoinFlip | Reasoning, Yes/No |
| commonsense_qa | CommonsenseQA | Commonsense, MCQ, Reasoning |
| competition_math | Competition-MATH | Math, Reasoning |
| conll2003 | CoNLL2003 | Knowledge, NER |
| conllpp | CoNLL++ | Knowledge, NER |
| copious | Copious | Knowledge, NER |
| cross_ner | CrossNER | Knowledge, NER |
| data_collection | Data-Collection | Custom |
| docmath | DocMath | LongContext, Math, Reasoning |
| drivel_binary | DrivelologyBinaryClassification | Yes/No |
| drivel_multilabel | DrivelologyMultilabelClassification | MCQ |
| drivel_selection | DrivelologyNarrativeSelection | MCQ |
| drivel_writing | DrivelologyNarrativeWriting | Knowledge, Reasoning |
| drop | DROP | Reasoning |
| eq_bench | EQ-Bench | InstructionFollowing |
| fin_ner | FinNER | Knowledge, NER |
| frames | FRAMES | LongContext, Reasoning |
| general_arena | GeneralArena | Arena, Custom |
| general_mcq | General-MCQ | Custom, MCQ |
| general_qa | General-QA | Custom, QA |
| genia_ner | GeniaNER | Knowledge, NER |
| gpqa_diamond | GPQA-Diamond | Knowledge, MCQ |
| gsm8k | GSM8K | Math, Reasoning |
| halueval | HaluEval | Hallucination, Knowledge, Yes/No |
| harvey_ner | HarveyNER | Knowledge, NER |
| health_bench | HealthBench | Knowledge, Medical, QA |
| hellaswag | HellaSwag | Commonsense, Knowledge, MCQ |
| hle | Humanity's-Last-Exam | Knowledge, QA |
| hmmt25 | HMMT25 | Math, Reasoning |
| humaneval | HumanEval | Coding |
| humaneval_plus | HumanEvalPlus | Coding |
| ifbench | IFBench | InstructionFollowing |
| ifeval | IFEval | InstructionFollowing |
| iquiz | IQuiz | Chinese, Knowledge, MCQ |
| jnlpba | JNLPBA | Knowledge, NER |
| jnlpba_rare | JNLPBA-Rare | Knowledge, NER |
| live_code_bench | Live-Code-Bench | Coding |
| logi_qa | LogiQA | MCQ, Reasoning |
| longbench_v2 | LongBench-v2 | LongContext, MCQ, ReadingComprehension |
| maritime_bench | MaritimeBench | Chinese, Knowledge, MCQ |
| math_500 | MATH-500 | Math, Reasoning |
| math_qa | MathQA | MCQ, Math, Reasoning |
| mbpp | MBPP | Coding |
| mbpp_plus | MBPP-Plus | Coding |
| med_mcqa | Med-MCQA | Knowledge, MCQ |
| mgsm | MGSM | Math, MultiLingual, Reasoning |
| minerva_math | Minerva-Math | Math, Reasoning |
| mit_movie_trivia | MIT-Movie-Trivia | Knowledge, NER |
| mit_restaurant | MIT-Restaurant | Knowledge, NER |
| mmlu | MMLU | Knowledge, MCQ |
| mmlu_pro | MMLU-Pro | Knowledge, MCQ |
| mmlu_redux | MMLU-Redux | Knowledge, MCQ |
| mmmlu | MMMLU | Knowledge, MCQ, MultiLingual |
| mri_mcqa | MRI-MCQA | Knowledge, MCQ, Medical |
| multi_if | Multi-IF | InstructionFollowing, MultiLingual, MultiTurn |
| multi_nerd | MultiNERD | Knowledge, NER |
| multiple_humaneval | MultiPL-E HumanEval | Coding |
| multiple_mbpp | MultiPL-E MBPP | Coding |
| music_trivia | MusicTrivia | Knowledge, MCQ |
| musr | MuSR | MCQ, Reasoning |
| ncbi | NCBI | Knowledge, NER |
| needle_haystack | Needle-in-a-Haystack | LongContext, Retrieval |
| ontonotes5 | OntoNotes5 | Knowledge, NER |
| openai_mrcr | OpenAI MRCR | LongContext, Retrieval |
| piqa | PIQA | Commonsense, MCQ, Reasoning |
| poly_math | PolyMath | Math, MultiLingual, Reasoning |
| process_bench | ProcessBench | Math, Reasoning |
| pubmedqa | PubMedQA | Knowledge, Yes/No |
| qasc | QASC | Knowledge, MCQ |
| race | RACE | MCQ, Reasoning |
| refcoco | RefCOCO | Grounding, ImageCaptioning, Knowledge, MultiModal |
| scicode | SciCode | Coding |
| sciq | SciQ | Knowledge, MCQ, ReadingComprehension |
| simple_qa | SimpleQA | Knowledge, QA |
| siqa | SIQA | Commonsense, MCQ, Reasoning |
| super_gpqa | SuperGPQA | Knowledge, MCQ |
| swe_bench_lite | SWE-bench_Lite | Coding |
| swe_bench_verified | SWE-bench_Verified | Coding |
| swe_bench_verified_mini | SWE-bench_Verified_mini | Coding |
| terminal_bench_v2 | Terminal-Bench-2.0 | Coding |
| tool_bench | ToolBench-Static | FunctionCalling, Reasoning |
| trivia_qa | TriviaQA | QA, ReadingComprehension |
| truthful_qa | TruthfulQA | Knowledge |
| tweebank_ner | TweeBankNER | Knowledge, NER |
| tweet_ner_7 | TweetNER7 | Knowledge, NER |
| winogrande | Winogrande | MCQ, Reasoning |
| wmt24pp | WMT2024++ | MachineTranslation, MultiLingual |
| wnut2017 | WNUT2017 | Knowledge, NER |
| zebralogicbench | ZebraLogicBench | Reasoning |