LLM Benchmarks#

Below is the list of supported LLM benchmarks. Click on a benchmark name to jump to details.

| Benchmark Name | Pretty Name | Task Categories |
| --- | --- | --- |
| aa_lcr | AA-LCR | Knowledge, LongContext, Reasoning |
| aime24 | AIME-2024 | Math, Reasoning |
| aime25 | AIME-2025 | Math, Reasoning |
| alpaca_eval | AlpacaEval2.0 | Arena, InstructionFollowing |
| amc | AMC | Math, Reasoning |
| arc | ARC | MCQ, Reasoning |
| arena_hard | ArenaHard | Arena, InstructionFollowing |
| bbh | BBH | Reasoning |
| biomix_qa | BioMixQA | Knowledge, MCQ, Medical |
| broad_twitter_corpus | BroadTwitterCorpus | Knowledge, NER |
| ceval | C-Eval | Chinese, Knowledge, MCQ |
| chinese_simpleqa | Chinese-SimpleQA | Chinese, Knowledge, QA |
| cmmlu | C-MMLU | Chinese, Knowledge, MCQ |
| coin_flip | CoinFlip | Reasoning, Yes/No |
| commonsense_qa | CommonsenseQA | Commonsense, MCQ, Reasoning |
| competition_math | MATH | Math, Reasoning |
| conll2003 | CoNLL2003 | Knowledge, NER |
| copious | Copious | Knowledge, NER |
| cross_ner | CrossNER | Knowledge, NER |
| data_collection | Data-Collection | Custom |
| docmath | DocMath | LongContext, Math, Reasoning |
| drivel_binary | DrivelologyBinaryClassification | Yes/No |
| drivel_multilabel | DrivelologyMultilabelClassification | MCQ |
| drivel_selection | DrivelologyNarrativeSelection | MCQ |
| drivel_writing | DrivelologyNarrativeWriting | Knowledge, Reasoning |
| drop | DROP | Reasoning |
| frames | FRAMES | LongContext, Reasoning |
| general_arena | GeneralArena | Arena, Custom |
| general_mcq | General-MCQ | Custom, MCQ |
| general_qa | General-QA | Custom, QA |
| genia_ner | GeniaNER | Knowledge, NER |
| gpqa_diamond | GPQA-Diamond | Knowledge, MCQ |
| gsm8k | GSM8K | Math, Reasoning |
| halueval | HaluEval | Hallucination, Knowledge, Yes/No |
| harvey_ner | HarveyNER | Knowledge, NER |
| health_bench | HealthBench | Knowledge, Medical, QA |
| hellaswag | HellaSwag | Commonsense, Knowledge, MCQ |
| hle | Humanity’s-Last-Exam | Knowledge, QA |
| humaneval | HumanEval | Coding |
| ifeval | IFEval | InstructionFollowing |
| iquiz | IQuiz | Chinese, Knowledge, MCQ |
| live_code_bench | Live-Code-Bench | Coding |
| logi_qa | LogiQA | MCQ, Reasoning |
| maritime_bench | MaritimeBench | Chinese, Knowledge, MCQ |
| math_500 | MATH-500 | Math, Reasoning |
| math_qa | MathQA | MCQ, Math, Reasoning |
| med_mcqa | Med-MCQA | Knowledge, MCQ |
| minerva_math | Minerva-Math | Math, Reasoning |
| mit_movie_trivia | MIT-Movie-Trivia | Knowledge, NER |
| mit_restaurant | MIT-Restaurant | Knowledge, NER |
| mmlu | MMLU | Knowledge, MCQ |
| mmlu_pro | MMLU-Pro | Knowledge, MCQ |
| mmlu_redux | MMLU-Redux | Knowledge, MCQ |
| mri_mcqa | MRI-MCQA | Knowledge, MCQ, Medical |
| multi_if | Multi-IF | InstructionFollowing, MultiLingual, MultiTurn |
| music_trivia | MusicTrivia | Knowledge, MCQ |
| musr | MuSR | MCQ, Reasoning |
| needle_haystack | Needle-in-a-Haystack | LongContext, Retrieval |
| ontonotes5 | OntoNotes5 | Knowledge, NER |
| piqa | PIQA | Commonsense, MCQ, Reasoning |
| poly_math | PolyMath | Math, MultiLingual, Reasoning |
| process_bench | ProcessBench | Math, Reasoning |
| pubmedqa | PubMedQA | Knowledge, Yes/No |
| qasc | QASC | Knowledge, MCQ |
| race | RACE | MCQ, Reasoning |
| sciq | SciQ | Knowledge, MCQ, ReadingComprehension |
| simple_qa | SimpleQA | Knowledge, QA |
| siqa | SIQA | Commonsense, MCQ, Reasoning |
| super_gpqa | SuperGPQA | Knowledge, MCQ |
| trivia_qa | TriviaQA | QA, ReadingComprehension |
| truthful_qa | TruthfulQA | Knowledge |
| winogrande | Winogrande | MCQ, Reasoning |
| wmt24pp | WMT2024++ | MachineTranslation, MultiLingual |
| wnut2017 | WNUT2017 | Knowledge, NER |


Benchmark Details#

AA-LCR#

  • Dataset Name: aa_lcr

  • Dataset ID: evalscope/AA-LCR

  • Description:

    AA-LCR (Artificial Analysis Long Context Retrieval) is a benchmark for evaluating long-context retrieval and reasoning capabilities of language models across multiple documents.

  • Task Categories: Knowledge, LongContext, Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: default

  • Extra Parameters:

{
    "text_dir": null
}
  • Prompt Template:

BEGIN INPUT DOCUMENTS

{documents_text}

END INPUT DOCUMENTS

Answer the following question using the input documents provided above.

START QUESTION

{question}

END QUESTION

AIME-2024#

  • Dataset Name: aime24

  • Dataset ID: HuggingFaceH4/aime_2024

  • Description:

    The AIME 2024 benchmark is based on problems from the American Invitational Mathematics Examination, a prestigious high school mathematics competition. This benchmark tests a model’s ability to solve challenging mathematics problems by generating step-by-step solutions and providing the correct final answer.

  • Task Categories: Math, Reasoning

  • Evaluation Metrics: {'acc': {'numeric': True}}

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

{question}
Please reason step by step, and put your final answer within \boxed{{}}.
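
The doubled braces in `\boxed{{}}` suggest that the template is filled with Python-style `str.format`, where `{{}}` renders as a literal `{}`. The following is a minimal sketch under that assumption (this page does not state the rendering mechanism); the `render` helper is a hypothetical name used only for illustration.

```python
# Assumption: the template is rendered with str.format.
# `render` is a hypothetical helper, not part of any library.
TEMPLATE = (
    "{question}\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}."
)


def render(template: str, **fields: str) -> str:
    """Fill named placeholders; doubled braces collapse to single literal braces."""
    return template.format(**fields)


prompt = render(TEMPLATE, question="What is 1 + 1?")
print(prompt)
```

If the template were instead substituted by plain string replacement, the doubled braces would reach the model unchanged, so it may be worth inspecting a rendered prompt before a long run.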

AIME-2025#

  • Dataset Name: aime25

  • Dataset ID: opencompass/AIME2025

  • Description:

    The AIME 2025 benchmark is based on problems from the American Invitational Mathematics Examination, a prestigious high school mathematics competition. This benchmark tests a model’s ability to solve challenging mathematics problems by generating step-by-step solutions and providing the correct final answer.

  • Task Categories: Math, Reasoning

  • Evaluation Metrics: {'acc': {'numeric': True}}

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: AIME2025-II, AIME2025-I

  • Prompt Template:

Solve the following math problem step by step. Put your answer inside \boxed{{}}.

{question}

Remember to put your answer inside \boxed{{}}.

AlpacaEval2.0#

  • Dataset Name: alpaca_eval

  • Dataset ID: AI-ModelScope/alpaca_eval

  • Description:

    Alpaca Eval 2.0 is an enhanced framework for evaluating instruction-following language models, featuring an improved auto-annotator, updated baselines, and continuous preference calculation to provide more accurate and cost-effective model assessments. Length-controlled win rate is not currently supported. The official judge model is gpt-4-1106-preview, and the baseline model is gpt-4-turbo.

  • Task Categories: Arena, InstructionFollowing

  • Evaluation Metrics: winrate

  • Aggregation Methods: mean

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: alpaca_eval_gpt4_baseline

  • Prompt Template:

{question}

AMC#

  • Dataset Name: amc

  • Dataset ID: evalscope/amc_22-24

  • Description:

    AMC (American Mathematics Competitions) is a series of mathematics competitions for high school students.

  • Task Categories: Math, Reasoning

  • Evaluation Metrics: {'acc': {'numeric': True}}

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: amc22, amc23, amc24

  • Prompt Template:

{question}
Please reason step by step, and put your final answer within \boxed{{}}.

ARC#

  • Dataset Name: arc

  • Dataset ID: allenai/ai2_arc

  • Description:

    The ARC (AI2 Reasoning Challenge) benchmark is designed to evaluate the reasoning capabilities of AI models through multiple-choice questions derived from science exams. It includes two subsets: ARC-Easy and ARC-Challenge, which vary in difficulty.

  • Task Categories: MCQ, Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: ARC-Challenge, ARC-Easy

  • Prompt Template:

Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}.

{question}

{choices}
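
To make the multiple-choice template above concrete, here is a rough sketch of how its three placeholders could be filled and how an `ANSWER: $LETTER` reply could be read back. The option layout (`A) text`), the comma-joined letter list, and the regular expression are assumptions for illustration; the framework's own formatting and answer extraction may differ.

```python
import re
from string import ascii_uppercase
from typing import List, Optional

TEMPLATE = (
    "Answer the following multiple choice question. The entire content of your "
    "response should be of the following format: 'ANSWER: $LETTER' (without quotes) "
    "where LETTER is one of {letters}.\n\n{question}\n\n{choices}"
)


def build_prompt(question: str, options: List[str]) -> str:
    """Fill {letters}, {question} and {choices} for one sample."""
    letters = ascii_uppercase[: len(options)]
    # Assumed layout: one "A) text" line per option.
    choices = "\n".join(f"{letter}) {text}" for letter, text in zip(letters, options))
    return TEMPLATE.format(letters=",".join(letters), question=question, choices=choices)


def parse_answer(response: str, n_options: int) -> Optional[str]:
    """Return the chosen letter, or None if no valid 'ANSWER: X' is present."""
    valid = ascii_uppercase[:n_options]
    match = re.search(rf"ANSWER:\s*([{valid}])\b", response)
    return match.group(1) if match else None


prompt = build_prompt(
    "Which gas do plants absorb during photosynthesis?",
    ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"],
)
print(prompt)
print(parse_answer("ANSWER: B", 4))
```

Restricting the character class to the letters actually offered means an out-of-range reply such as `ANSWER: E` on a four-option question is treated as unparseable rather than silently scored.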

ArenaHard#

  • Dataset Name: arena_hard

  • Dataset ID: AI-ModelScope/arena-hard-auto-v0.1

  • Description:

    ArenaHard is a benchmark designed to evaluate the performance of large language models in a competitive setting, where models are pitted against each other in a series of tasks to determine their relative strengths and weaknesses. It includes a set of challenging tasks that require reasoning, understanding, and generation capabilities. Style-controlled win rate is not currently supported. The official judge model is gpt-4-1106-preview, and the baseline model is gpt-4-0314.

  • Task Categories: Arena, InstructionFollowing

  • Evaluation Metrics: winrate

  • Aggregation Methods: elo

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

{question}

BBH#

  • Dataset Name: bbh

  • Dataset ID: evalscope/bbh

  • Description:

    The BBH (Big Bench Hard) benchmark is a collection of challenging tasks designed to evaluate the reasoning capabilities of AI models. It includes both free-form and multiple-choice tasks, covering a wide range of reasoning skills.

  • Task Categories: Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 3-shot

  • Subsets: boolean_expressions, causal_judgement, date_understanding, disambiguation_qa, dyck_languages, formal_fallacies, geometric_shapes, hyperbaton, logical_deduction_five_objects, logical_deduction_seven_objects, logical_deduction_three_objects, movie_recommendation, multistep_arithmetic_two, navigate, object_counting, penguins_in_a_table, reasoning_about_colored_objects, ruin_names, salient_translation_error_detection, snarks, sports_understanding, temporal_sequences, tracking_shuffled_objects_five_objects, tracking_shuffled_objects_seven_objects, tracking_shuffled_objects_three_objects, web_of_lies, word_sorting

  • Prompt Template:

Q: {question}
A: Let's think step by step. Put your final answer in the format of "So the answer is $ANSWER" (without quotes and markdown) where $ANSWER is the answer to the problem.

BioMixQA#

  • Dataset Name: biomix_qa

  • Dataset ID: extraordinarylab/biomix-qa

  • Description:

    BiomixQA is a curated biomedical question-answering dataset. BiomixQA has been utilized to validate the Knowledge Graph based Retrieval-Augmented Generation (KG-RAG) framework across different LLMs.

  • Task Categories: Knowledge, MCQ, Medical

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}.

{question}

{choices}

BroadTwitterCorpus#

  • Dataset Name: broad_twitter_corpus

  • Dataset ID: extraordinarylab/broad-twitter-corpus

  • Description:

    BroadTwitterCorpus is a dataset of tweets collected over stratified times, places and social uses. The goal is to represent a broad range of activities, giving a dataset more representative of the language used in this hardest of social media formats to process.

  • Task Categories: Knowledge, NER

  • Evaluation Metrics: accuracy, f1_score, precision, recall

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 5-shot

  • Subsets: default

  • Prompt Template:

You are a named entity recognition system that identifies the following entity types:
{entities}

Process the provided text and mark all named entities with XML-style tags.

For example:
<person>John Smith</person> works at <organization>Google</organization> in <location>Mountain View</location>.

Available entity tags: {entity_list}

INSTRUCTIONS:
1. Wrap your entire response in <response>...</response> tags.
2. Inside these tags, include the original text with entity tags inserted.
3. Do not change the original text in any way (preserve spacing, punctuation, case, etc.).
4. Tag ALL entities you can identify using the exact tag names provided.
5. Do not include explanations, just the tagged text.
6. If entity spans overlap, choose the most specific entity type.
7. Ensure every opening tag has a matching closing tag.

Text to process:
{text}
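
The NER template above asks for XML-style tags inside a `<response>` wrapper. As an illustration of what a downstream scorer has to do with such output, the sketch below pulls `(tag, text)` pairs out of a reply. It is a simplified, regex-based approach written for this page, not the framework's parser; the function name and the fallback to the whole reply when the wrapper is missing are assumptions.

```python
import re
from typing import List, Tuple


def extract_entities(response: str, entity_tags: List[str]) -> List[Tuple[str, str]]:
    """Return (tag, text) pairs found inside the <response>...</response> wrapper."""
    wrapper = re.search(r"<response>(.*?)</response>", response, flags=re.DOTALL)
    # Assumed fallback: if the wrapper is missing, scan the whole reply.
    body = wrapper.group(1) if wrapper else response
    tag_pattern = "|".join(re.escape(tag) for tag in entity_tags)
    # The backreference \1 forces each closing tag to match its opening tag.
    pairs = re.findall(rf"<({tag_pattern})>(.*?)</\1>", body, flags=re.DOTALL)
    return [(tag, text.strip()) for tag, text in pairs]


reply = (
    "<response><person>John Smith</person> works at "
    "<organization>Google</organization> in "
    "<location>Mountain View</location>.</response>"
)
entities = extract_entities(reply, ["person", "organization", "location"])
print(entities)
```

Only tags from the supplied list are recognised, so entity types the model invents are ignored. Nested tags are not handled by this sketch.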

C-Eval#

  • Dataset Name: ceval

  • Dataset ID: evalscope/ceval

  • Description:

    C-Eval is a benchmark designed to evaluate the performance of AI models on Chinese exams across various subjects, including STEM, social sciences, and humanities. It consists of multiple-choice questions that test knowledge and reasoning abilities in these areas.

  • Task Categories: Chinese, Knowledge, MCQ

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 5-shot

  • Subsets: accountant, advanced_mathematics, art_studies, basic_medicine, business_administration, chinese_language_and_literature, civil_servant, clinical_medicine, college_chemistry, college_economics, college_physics, college_programming, computer_architecture, computer_network, discrete_mathematics, education_science, electrical_engineer, environmental_impact_assessment_engineer, fire_engineer, high_school_biology, high_school_chemistry, high_school_chinese, high_school_geography, high_school_history, high_school_mathematics, high_school_physics, high_school_politics, ideological_and_moral_cultivation, law, legal_professional, logic, mao_zedong_thought, marxism, metrology_engineer, middle_school_biology, middle_school_chemistry, middle_school_geography, middle_school_history, middle_school_mathematics, middle_school_physics, middle_school_politics, modern_chinese_history, operating_system, physician, plant_protection, probability_and_statistics, professional_tour_guide, sports_science, tax_accountant, teacher_qualification, urban_and_rural_planner, veterinary_medicine

  • Prompt Template:

以下是中国关于{subject}的单项选择题,请选出其中的正确答案。你的回答的最后一行应该是这样的格式:"答案:LETTER"(不带引号),其中 LETTER 是 A、B、C、D 中的一个。

问题:{question}
选项:
{choices}

Chinese-SimpleQA#

  • Dataset Name: chinese_simpleqa

  • Dataset ID: AI-ModelScope/Chinese-SimpleQA

  • Description:

    Chinese SimpleQA is a Chinese question-answering dataset designed to evaluate the performance of language models on simple factual questions. It includes a variety of topics and is structured to test the model’s ability to understand and generate correct answers in Chinese.

  • Task Categories: Chinese, Knowledge, QA

  • Evaluation Metrics: is_correct, is_incorrect, is_not_attempted

  • Aggregation Methods: mean

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: 中华文化, 人文与社会科学, 工程、技术与应用科学, 生活、艺术与文化, 社会, 自然与自然科学

  • Prompt Template:

请回答问题:

{question}

C-MMLU#

  • Dataset Name: cmmlu

  • Dataset ID: evalscope/cmmlu

  • Description:

    C-MMLU is a benchmark designed to evaluate the performance of AI models on Chinese language tasks, including reading comprehension, text classification, and more.

  • Task Categories: Chinese, Knowledge, MCQ

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: agronomy, anatomy, ancient_chinese, arts, astronomy, business_ethics, chinese_civil_service_exam, chinese_driving_rule, chinese_food_culture, chinese_foreign_policy, chinese_history, chinese_literature, chinese_teacher_qualification, clinical_knowledge, college_actuarial_science, college_education, college_engineering_hydrology, college_law, college_mathematics, college_medical_statistics, college_medicine, computer_science, computer_security, conceptual_physics, construction_project_management, economics, education, electrical_engineering, elementary_chinese, elementary_commonsense, elementary_information_and_technology, elementary_mathematics, ethnology, food_science, genetics, global_facts, high_school_biology, high_school_chemistry, high_school_geography, high_school_mathematics, high_school_physics, high_school_politics, human_sexuality, international_law, journalism, jurisprudence, legal_and_moral_basis, logical, machine_learning, management, marketing, marxist_theory, modern_chinese, nutrition, philosophy, professional_accounting, professional_law, professional_medicine, professional_psychology, public_relations, security_study, sociology, sports_science, traditional_chinese_medicine, virology, world_history, world_religions

  • Prompt Template:

回答下面的单项选择题,请选出其中的正确答案。你的回答的最后一行应该是这样的格式:"答案:LETTER"(不带引号),其中 LETTER 是 {letters} 中的一个。请在回答前进行一步步思考。

问题:{question}
选项:
{choices}

CoinFlip#

  • Dataset Name: coin_flip

  • Dataset ID: extraordinarylab/coin-flip

  • Description:

    CoinFlip is a symbolic reasoning dataset that tests an LLM’s ability to track binary state changes through a sequence of actions. Each example describes whether a coin is flipped or not by different people, requiring logical inference to determine the final state (heads or tails).

  • Task Categories: Reasoning, Yes/No

  • Evaluation Metrics: accuracy, f1_score, precision, recall, yes_ratio

  • Aggregation Methods: f1

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

Solve the following coin flip problem step by step. The last line of your response should be of the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer to the problem.

{question}

Remember to put your answer on its own line at the end in the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer YES or NO to the problem.

Reasoning:
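
The Yes/No metrics listed for this benchmark can be illustrated with a small self-contained computation. The sketch below treats `YES` as the positive class and assumes `yes_ratio` means the share of predictions that are `YES`; both are assumptions, since this page names the metrics without defining them. The function name is hypothetical.

```python
from typing import Dict, List


def yes_no_metrics(predictions: List[str], references: List[str]) -> Dict[str, float]:
    """Binary classification metrics with "YES" as the positive class."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    preds = [p.strip().upper() == "YES" for p in predictions]
    golds = [r.strip().upper() == "YES" for r in references]
    tp = sum(p and g for p, g in zip(preds, golds))
    fp = sum(p and not g for p, g in zip(preds, golds))
    fn = sum(g and not p for p, g in zip(preds, golds))
    correct = sum(p == g for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(preds),
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        # Assumed meaning: fraction of predictions that answered YES.
        "yes_ratio": sum(preds) / len(preds),
    }


metrics = yes_no_metrics(["YES", "NO", "YES", "YES"], ["YES", "NO", "NO", "YES"])
print(metrics)
```

A `yes_ratio` far from the true share of positive labels is a quick signal that a model is biased toward one answer regardless of the question.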

CommonsenseQA#

  • Dataset Name: commonsense_qa

  • Dataset ID: extraordinarylab/commonsense-qa

  • Description:

    CommonsenseQA requires different types of commonsense knowledge to predict the correct answers.

  • Task Categories: Commonsense, MCQ, Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}.

{question}

{choices}

MATH#

  • Dataset Name: competition_math

  • Dataset ID: evalscope/competition_math

  • Description:

    The MATH (Mathematics) benchmark is designed to evaluate the mathematical reasoning abilities of AI models through a variety of problem types, including arithmetic, algebra, geometry, and more.

  • Task Categories: Math, Reasoning

  • Evaluation Metrics: {'acc': {'numeric': True}}

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 4-shot

  • Subsets: Level 1, Level 2, Level 3, Level 4, Level 5

  • Prompt Template:

Problem:
{question}

Please reason step by step, and put your final answer within \boxed{{}}.

CoNLL2003#

  • Dataset Name: conll2003

  • Dataset ID: evalscope/conll2003

  • Description:

    The CoNLL-2003 dataset is for the Named Entity Recognition (NER) task. It was introduced as part of the CoNLL-2003 shared task and contains texts annotated with entities such as people, organizations, places, and miscellaneous names.

  • Task Categories: Knowledge, NER

  • Evaluation Metrics: accuracy, f1_score, precision, recall

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 5-shot

  • Subsets: default

  • Prompt Template:

You are a named entity recognition system that identifies the following entity types:
{entities}

Process the provided text and mark all named entities with XML-style tags.

For example:
<person>John Smith</person> works at <organization>Google</organization> in <location>Mountain View</location>.

Available entity tags: {entity_list}

INSTRUCTIONS:
1. Wrap your entire response in <response>...</response> tags.
2. Inside these tags, include the original text with entity tags inserted.
3. Do not change the original text in any way (preserve spacing, punctuation, case, etc.).
4. Tag ALL entities you can identify using the exact tag names provided.
5. Do not include explanations, just the tagged text.
6. If entity spans overlap, choose the most specific entity type.
7. Ensure every opening tag has a matching closing tag.

Text to process:
{text}

Copious#

  • Dataset Name: copious

  • Dataset ID: extraordinarylab/copious

  • Description:

    The Copious corpus is a gold-standard corpus covering a wide range of biodiversity entities. It consists of 668 documents downloaded from the Biodiversity Heritage Library, with over 26K sentences and more than 28K entities.

  • Task Categories: Knowledge, NER

  • Evaluation Metrics: accuracy, f1_score, precision, recall

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 5-shot

  • Subsets: default

  • Prompt Template:

You are a named entity recognition system that identifies the following entity types:
{entities}

Process the provided text and mark all named entities with XML-style tags.

For example:
<person>John Smith</person> works at <organization>Google</organization> in <location>Mountain View</location>.

Available entity tags: {entity_list}

INSTRUCTIONS:
1. Wrap your entire response in <response>...</response> tags.
2. Inside these tags, include the original text with entity tags inserted.
3. Do not change the original text in any way (preserve spacing, punctuation, case, etc.).
4. Tag ALL entities you can identify using the exact tag names provided.
5. Do not include explanations, just the tagged text.
6. If entity spans overlap, choose the most specific entity type.
7. Ensure every opening tag has a matching closing tag.

Text to process:
{text}

CrossNER#

  • Dataset Name: cross_ner

  • Dataset ID: extraordinarylab/cross-ner

  • Description:

    CrossNER is a fully labelled collection of named entity recognition (NER) data spanning five diverse domains (AI, Literature, Music, Politics, Science).

  • Task Categories: Knowledge, NER

  • Evaluation Metrics: accuracy, f1_score, precision, recall

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 5-shot

  • Subsets: ai, literature, music, politics, science

  • Prompt Template:

You are a named entity recognition system that identifies the following entity types:
{entities}

Process the provided text and mark all named entities with XML-style tags.

For example:
<person>John Smith</person> works at <organization>Google</organization> in <location>Mountain View</location>.

Available entity tags: {entity_list}

INSTRUCTIONS:
1. Wrap your entire response in <response>...</response> tags.
2. Inside these tags, include the original text with entity tags inserted.
3. Do not change the original text in any way (preserve spacing, punctuation, case, etc.).
4. Tag ALL entities you can identify using the exact tag names provided.
5. Do not include explanations, just the tagged text.
6. If entity spans overlap, choose the most specific entity type.
7. Ensure every opening tag has a matching closing tag.

Text to process:
{text}

Data-Collection#

  • Dataset Name: data_collection

  • Dataset ID:

  • Description:

    A custom data collection that mixes multiple evaluation datasets into a unified evaluation, aiming to assess the model’s capabilities more comprehensively with less data. For usage, see the Usage Reference.

  • Task Categories: Custom

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default


DocMath#

  • Dataset Name: docmath

  • Dataset ID: yale-nlp/DocMath-Eval

  • Description:

    DocMath-Eval is a comprehensive benchmark focused on numerical reasoning within specialized domains. It requires the model to comprehend long and specialized documents and perform numerical reasoning to answer the given question.

  • Task Categories: LongContext, Math, Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: complong_testmini, compshort_testmini, simplong_testmini, simpshort_testmini

  • Prompt Template:

Please read the following text and answer the question below.

<text>
{context}
</text>

{question}

Format your response as follows: "Therefore, the answer is (insert answer here)".

DrivelologyBinaryClassification#

  • Dataset Name: drivel_binary

  • Dataset ID: extraordinarylab/drivel-hub

  • Description:

    Drivelology is a unique linguistic phenomenon characterised as “nonsense with depth”: utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive.

  • Task Categories: Yes/No

  • Evaluation Metrics: accuracy, f1_score, precision, recall, yes_ratio

  • Aggregation Methods: f1

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: binary-classification

  • Prompt Template:

{question}

DrivelologyMultilabelClassification#

  • Dataset Name: drivel_multilabel

  • Dataset ID: extraordinarylab/drivel-hub

  • Description:

    Drivelology is a unique linguistic phenomenon characterised as “nonsense with depth”: utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive.

  • Task Categories: MCQ

  • Evaluation Metrics: exact_match, f1_macro, f1_micro, f1_weighted

  • Aggregation Methods: f1_weighted

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: multi-label-classification

  • Prompt Template:

{question}

DrivelologyNarrativeSelection#

  • Dataset Name: drivel_selection

  • Dataset ID: extraordinarylab/drivel-hub

  • Description:

    Drivelology is a unique linguistic phenomenon characterised as “nonsense with depth”: utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive.

  • Task Categories: MCQ

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: multiple-choice-english-easy, multiple-choice-english-hard

  • Prompt Template:

Tell me the best option in the following options which represents the underlying narrative of the text?
The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}.

{question}

{choices}

DrivelologyNarrativeWriting#

  • Dataset Name: drivel_writing

  • Dataset ID: extraordinarylab/drivel-hub

  • Description:

    Drivelology is a unique linguistic phenomenon characterised as “nonsense with depth”: utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive.

  • Task Categories: Knowledge, Reasoning

  • Evaluation Metrics: bert_score, gpt_score

  • Aggregation Methods: mean

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: narrative-writing-english

  • Prompt Template:

You need to first read and understand the text given. Generate a detailed description to illustrate the implicit narrative of the text.

Please provide your response in English, with a clear and comprehensive explanation of the narrative.

Text: {text}

DROP#

  • Dataset Name: drop

  • Dataset ID: AI-ModelScope/DROP

  • Description:

    The DROP (Discrete Reasoning Over Paragraphs) benchmark is designed to evaluate the reading comprehension and reasoning capabilities of AI models. It includes a variety of tasks that require models to read passages and answer questions based on the content.

  • Task Categories: Reasoning

  • Evaluation Metrics: em, f1

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 3-shot

  • Subsets: default

  • Prompt Template:

You will be asked to read a passage and answer a question. {drop_examples}
# Your Task

---
{query}

Think step by step, then write a line of the form "Answer: $ANSWER" at the end of your response.
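
To give a feel for the `em` and `f1` metrics in combination with the `Answer: $ANSWER` line requested above, here is a simplified sketch. It uses a case-insensitive exact match and a plain token-overlap F1; the official DROP scorer applies additional normalisation and handles multi-span answers, so this is an approximation under stated assumptions, and all function names are hypothetical.

```python
import re
from collections import Counter


def extract_answer(response: str) -> str:
    """Return the text after the last line starting with 'Answer:', or '' if none."""
    matches = re.findall(r"(?im)^Answer:\s*(.+)$", response)
    return matches[-1].strip() if matches else ""


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Whitespace-token overlap F1 (simplified; not the official DROP metric)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reply = "The passage lists three scoring plays.\nAnswer: 3 touchdowns"
answer = extract_answer(reply)
print(answer, exact_match(answer, "3 Touchdowns"), token_f1(answer, "3"))
```

Taking the last `Answer:` line rather than the first avoids picking up an intermediate guess when the model restates its answer after further reasoning.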

FRAMES#

  • Dataset Name: frames

  • Dataset ID: iic/frames

  • Description:

    FRAMES is a comprehensive evaluation dataset designed to test the capabilities of Retrieval-Augmented Generation (RAG) systems across factuality, retrieval accuracy, and reasoning.

  • Task Categories: LongContext, Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

Please read the following text and answer the question below.

<text>
{context}
</text>

{question}

Format your response as follows: "Therefore, the answer is (insert answer here)".

GeneralArena#

  • Dataset Name: general_arena

  • Dataset ID: general_arena

  • Description:

    GeneralArena is a custom benchmark designed to evaluate the performance of large language models in a competitive setting, where models are pitted against each other in custom tasks to determine their relative strengths and weaknesses. You should provide the model outputs in the format of a list of dictionaries, where each dictionary contains the model name and its report path. For detailed instructions on how to use this benchmark, please refer to the Arena User Guide.

  • Task Categories: Arena, Custom

  • Evaluation Metrics: winrate

  • Aggregation Methods: elo

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: default

  • Extra Parameters:

{
    "models": [
        {
            "name": "qwen-plus",
            "report_path": "outputs/20250627_172550/reports/qwen-plus"
        },
        {
            "name": "qwen2.5-7b",
            "report_path": "outputs/20250627_172817/reports/qwen2.5-7b-instruct"
        }
    ],
    "baseline": "qwen2.5-7b"
}
  • System Prompt:

Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user prompt displayed below. You will be given assistant A's answer and assistant B's answer. Your job is to evaluate which assistant's answer is better.

Begin your evaluation by generating your own answer to the prompt. You must provide your answers before judging any answers.

When evaluating the assistants' answers, compare both assistants' answers with your answer. You must identify and correct any mistakes or inaccurate information.

Then consider if the assistant's answers are helpful, relevant, and concise. Helpful means the answer correctly responds to the prompt or follows the instructions. Note when user prompt has any ambiguity or more than one interpretation, it is more helpful and appropriate to ask for clarifications or more information from the user than providing an answer based on assumptions. Relevant means all parts of the response closely connect or are appropriate to what is being asked. Concise means the response is clear and not verbose or excessive.

Then consider the creativity and novelty of the assistant's answers when needed. Finally, identify any missing important information in the assistants' answers that would be beneficial to include when responding to the user prompt.

After providing your explanation, you must output only one of the following choices as your final verdict with a label:

1. Assistant A is significantly better: [[A>>B]]
2. Assistant A is slightly better: [[A>B]]
3. Tie, relatively the same: [[A=B]]
4. Assistant B is slightly better: [[B>A]]
5. Assistant B is significantly better: [[B>>A]]

Example output: "My final verdict is tie: [[A=B]]".

  • Prompt Template:

<|User Prompt|>
{question}

<|The Start of Assistant A's Answer|>
{answer_1}
<|The End of Assistant A's Answer|>

<|The Start of Assistant B's Answer|>
{answer_2}
<|The End of Assistant B's Answer|>
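
The system prompt for this benchmark asks the judge to finish with one of five bracketed labels. A minimal sketch of reading that label back from the judge's reply follows; the regular expression, the function name, and the choice to take the last label when several appear are assumptions for illustration, not the framework's implementation.

```python
import re
from typing import Optional

# The five labels listed in the system prompt, longest alternatives first.
VERDICT_PATTERN = re.compile(r"\[\[(A>>B|A>B|A=B|B>>A|B>A)\]\]")


def parse_verdict(judge_output: str) -> Optional[str]:
    """Return the last verdict label in the judge output, or None if absent."""
    labels = VERDICT_PATTERN.findall(judge_output)
    return labels[-1] if labels else None


print(parse_verdict('My final verdict is tie: [[A=B]].'))
print(parse_verdict("Assistant A is far more complete. [[A>>B]]"))
print(parse_verdict("Both answers have merits."))
```

Returning `None` for a missing label lets the caller decide whether to retry the judge or discard the comparison, instead of silently counting it as a tie.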

General-MCQ#

  • Dataset Name: general_mcq

  • Dataset ID: general_mcq

  • Description:

    A general multiple-choice question answering dataset for custom evaluation. For detailed instructions on how to use this benchmark, please refer to the User Guide.

  • Task Categories: Custom, MCQ

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
回答下面的单项选择题,请选出其中的正确答案。你的回答的最后一行应该是这样的格式:"答案:LETTER"(不带引号),其中 LETTER 是 {letters} 中的一个。

问题:{question}
选项:
{choices}

General-QA#

Back to Top

  • Dataset Name: general_qa

  • Dataset ID: general_qa

  • Description:

    A general question answering dataset for custom evaluation. For detailed instructions on how to use this benchmark, please refer to the User Guide.

  • Task Categories: Custom, QA

  • Evaluation Metrics: BLEU, Rouge

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
请回答问题
{question}

GeniaNER#

Back to Top

  • Dataset Name: genia_ner

  • Dataset ID: extraordinarylab/genia-ner

  • Description:

    GeniaNER consists of 2,000 MEDLINE abstracts containing more than 400,000 words and almost 100,000 annotations for biological terms.

  • Task Categories: Knowledge, NER

  • Evaluation Metrics: accuracy, f1_score, precision, recall

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 5-shot

  • Subsets: default

  • Prompt Template:

View
You are a named entity recognition system that identifies the following entity types:
{entities}

Process the provided text and mark all named entities with XML-style tags.

For example:
<person>John Smith</person> works at <organization>Google</organization> in <location>Mountain View</location>.

Available entity tags: {entity_list}

INSTRUCTIONS:
1. Wrap your entire response in <response>...</response> tags.
2. Inside these tags, include the original text with entity tags inserted.
3. Do not change the original text in any way (preserve spacing, punctuation, case, etc.).
4. Tag ALL entities you can identify using the exact tag names provided.
5. Do not include explanations, just the tagged text.
6. If entity spans overlap, choose the most specific entity type.
7. Ensure every opening tag has a matching closing tag.

Text to process:
{text}
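
A response in the tagged format requested by this template can be turned back into entity pairs and scored. The sketch below is illustrative only: the helper names are hypothetical, and the exact-match, set-based scoring is an assumption that may differ from how the benchmark computes precision, recall, and F1.

```python
import re


def extract_entities(response):
    """Return (entity_type, text) pairs from an XML-tagged NER response."""
    # Keep only the content inside <response>...</response>, if present.
    wrapped = re.search(r"<response>(.*?)</response>", response, re.DOTALL)
    body = wrapped.group(1) if wrapped else response
    # Match <tag>text</tag> where the closing tag repeats the opening name.
    pattern = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)
    return [(m.group(1), m.group(2)) for m in pattern.finditer(body)]


def precision_recall_f1(predicted, gold):
    """Exact-match scores over sets of (entity_type, text) pairs (assumed scheme)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    reply = (
        "<response><person>John Smith</person> works at "
        "<organization>Google</organization>.</response>"
    )
    print(extract_entities(reply))
```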

GPQA-Diamond#

Back to Top

  • Dataset Name: gpqa_diamond

  • Dataset ID: AI-ModelScope/gpqa_diamond

  • Description:

    GPQA (Graduate-Level Google-Proof Q&A) is a dataset of challenging multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions require step-by-step reasoning to arrive at the correct answer.

  • Task Categories: Knowledge, MCQ

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}. Think step by step before answering.

{question}

{choices}
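
The placeholders `{letters}`, `{question}`, and `{choices}` in this template can be filled with Python's `str.format`. The sketch below shows one way to do that; the comma-separated letter list and the `A) option` layout of the choices are assumptions, and `build_prompt` is a hypothetical helper rather than part of the library.

```python
TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'ANSWER: $LETTER' (without "
    "quotes) where LETTER is one of {letters}. Think step by step before "
    "answering.\n\n{question}\n\n{choices}"
)


def build_prompt(question, options):
    """Fill the template for one question with a list of option strings."""
    # One capital letter per option: A, B, C, ...
    letters = [chr(ord("A") + i) for i in range(len(options))]
    # Assumed layout: one "LETTER) option" line per choice.
    choices = "\n".join(f"{letter}) {option}" for letter, option in zip(letters, options))
    return TEMPLATE.format(
        letters=",".join(letters), question=question, choices=choices
    )


if __name__ == "__main__":
    print(build_prompt("What is 2 + 2?", ["3", "4"]))
```

Because `$LETTER` contains no braces, `str.format` leaves it untouched and only the three named placeholders are substituted.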

GSM8K#

Back to Top

  • Dataset Name: gsm8k

  • Dataset ID: AI-ModelScope/gsm8k

  • Description:

    GSM8K (Grade School Math 8K) is a dataset of grade school math problems, designed to evaluate the mathematical reasoning abilities of AI models.

  • Task Categories: Math, Reasoning

  • Evaluation Metrics: {'acc': {'numeric': True}}

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 4-shot

  • Subsets: main

  • Prompt Template:

View
Solve the following math problem step by step. The last line of your response should display the answer enclosed within \boxed{{\text{{$ANSWER}}}}.

Example:

Let's solve the problem step by step.

Problem: Eliza's rate per hour for the first 40 hours she works each week is $10. She also receives an overtime pay of 1.2 times her regular hourly rate. If Eliza worked for 45 hours this week, how much are her earnings for this week?

Step 1: Calculate Eliza's earnings for the first 40 hours. Eliza's hourly rate is $10, so her earnings for the first 40 hours are $10/hour x 40 hours = $400.
Step 2: Calculate Eliza's overtime pay rate. Eliza's overtime pay rate is 1.2 times her regular hourly rate, so her overtime pay rate is $10/hour x 1.2 = $12/hour.
Step 3: Calculate Eliza's earnings for the overtime hours. Eliza worked for 45 hours, so her overtime hours are 45 hours - 40 hours = 5 hours. Her earnings for the overtime hours are $12/hour x 5 hours = $60.
Step 4: Calculate Eliza's total earnings for the week. Eliza's total earnings for the week are her earnings for the first 40 hours plus her earnings for the overtime hours, which is $400 + $60 = $460.

Answer:
\boxed{{\text{{460}}}}

question:
{question}

Remember to put your answer on its own line at the end in the form "\boxed{{\text{{$ANSWER}}}}" (without quotes), where $ANSWER is replaced by the actual answer to the problem.
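
The template asks for the final answer inside `\boxed{...}`, and the metric is configured with `numeric: True`. A minimal sketch of extracting the boxed value and comparing it numerically follows; the helper names, the stripping of `$` and thousands separators, and the tolerance are all assumptions, not the benchmark's actual implementation.

```python
import re

# Matches \boxed{...} while allowing one level of nested braces,
# which is enough for \boxed{\text{460}}.
BOXED_PATTERN = re.compile(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}")


def extract_boxed(text):
    r"""Return the content of the last \boxed{...}, unwrapping \text{...}."""
    matches = BOXED_PATTERN.findall(text)
    if not matches:
        return None
    content = matches[-1].strip()
    inner = re.fullmatch(r"\\text\{(.*)\}", content)
    return inner.group(1).strip() if inner else content


def numeric_equal(pred, gold, tol=1e-6):
    """Compare two answer strings as numbers; non-numeric input is a mismatch."""
    try:
        p = float(pred.replace(",", "").replace("$", ""))
        g = float(gold.replace(",", "").replace("$", ""))
    except (AttributeError, ValueError):
        return False
    return abs(p - g) <= tol


if __name__ == "__main__":
    reply = "Answer:\n\\boxed{\\text{460}}"
    print(numeric_equal(extract_boxed(reply), "460"))
```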

HaluEval#

Back to Top

  • Dataset Name: halueval

  • Dataset ID: evalscope/HaluEval

  • Description:

    HaluEval is a large collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognizing hallucination.

  • Task Categories: Hallucination, Knowledge, Yes/No

  • Evaluation Metrics: accuracy, f1_score, precision, recall, yes_ratio

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: dialogue_samples, qa_samples, summarization_samples

  • Prompt Template:

View
{question}

HarveyNER#

Back to Top

  • Dataset Name: harvey_ner

  • Dataset ID: extraordinarylab/harvey-ner

  • Description:

    HarveyNER is a dataset with fine-grained locations annotated in tweets. This dataset presents unique challenges and characterizes many complex and long location mentions in informal descriptions.

  • Task Categories: Knowledge, NER

  • Evaluation Metrics: accuracy, f1_score, precision, recall

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 5-shot

  • Subsets: default

  • Prompt Template:

View
You are a named entity recognition system that identifies the following entity types:
{entities}

Process the provided text and mark all named entities with XML-style tags.

For example:
<person>John Smith</person> works at <organization>Google</organization> in <location>Mountain View</location>.

Available entity tags: {entity_list}

INSTRUCTIONS:
1. Wrap your entire response in <response>...</response> tags.
2. Inside these tags, include the original text with entity tags inserted.
3. Do not change the original text in any way (preserve spacing, punctuation, case, etc.).
4. Tag ALL entities you can identify using the exact tag names provided.
5. Do not include explanations, just the tagged text.
6. If entity spans overlap, choose the most specific entity type.
7. Ensure every opening tag has a matching closing tag.

Text to process:
{text}

HealthBench#

Back to Top

  • Dataset Name: health_bench

  • Dataset ID: openai-mirror/healthbench

  • Description:

    HealthBench is a benchmark designed to better measure the capabilities of AI systems for health. Built in partnership with 262 physicians who have practiced in 60 countries, HealthBench includes 5,000 realistic health conversations, each with a custom physician-created rubric to grade model responses.

  • Task Categories: Knowledge, Medical, QA

  • Evaluation Metrics: accuracy, communication_quality, completeness, context_awareness, instruction_following

  • Aggregation Methods: clipped_mean

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: communication, complex_responses, context_seeking, emergency_referrals, global_health, health_data_tasks, hedging

  • Extra Parameters:

{
    "version": "# File version, choose from ['Consensus', 'Hard', 'All'], default to Consensus"
}
  • Prompt Template:

View
Answer the question:

{question}

HellaSwag#

Back to Top

  • Dataset Name: hellaswag

  • Dataset ID: evalscope/hellaswag

  • Description:

    HellaSwag is a benchmark for commonsense reasoning in natural language understanding tasks. It consists of multiple-choice questions where the model must select the most plausible continuation of a given context.

  • Task Categories: Commonsense, Knowledge, MCQ

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}.

{question}

{choices}
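
Templates of this kind expect a reply of the form `ANSWER: $LETTER`. The following sketch shows how such a reply might be parsed; `extract_letter` is a hypothetical helper, and taking the last match is an assumption chosen so that the same function also works for templates that allow reasoning before the final line.

```python
import re


def extract_letter(response, letters="ABCD"):
    """Return the last 'ANSWER: X' letter in the response, or None.

    Only letters listed in `letters` are accepted, and the letter must
    stand alone (so 'ANSWER: Because ...' is not read as 'B').
    """
    pattern = rf"ANSWER:\s*([{re.escape(letters)}])\b"
    matches = re.findall(pattern, response)
    return matches[-1] if matches else None


if __name__ == "__main__":
    print(extract_letter("ANSWER: B"))
```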

Humanity’s-Last-Exam#

Back to Top

  • Dataset Name: hle

  • Dataset ID: cais/hle

  • Description:

    Humanity’s Last Exam (HLE) is a language model benchmark consisting of 2,500 questions across a broad range of subjects. It was created jointly by the Center for AI Safety and Scale AI. The benchmark classifies the questions into the following broad subjects: mathematics (41%), physics (9%), biology/medicine (11%), humanities/social science (9%), computer science/artificial intelligence (10%), engineering (4%), chemistry (7%), and other (9%). Around 14% of the questions require the ability to understand both text and images, i.e., multi-modality. 24% of the questions are multiple-choice; the rest are short-answer, exact-match questions. To evaluate a model without multi-modal capabilities, set extra_params["include_multi_modal"] to False.

  • Task Categories: Knowledge, QA

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: Biology/Medicine, Chemistry, Computer Science/AI, Engineering, Humanities/Social Science, Math, Other, Physics

  • Extra Parameters:

{
    "include_multi_modal": true
}
  • Prompt Template:

View
{question}
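
The effect of the `include_multi_modal` flag can be pictured as a filter over the samples. The sketch below is an illustration under stated assumptions: the `image` field name and the rule "a non-empty `image` marks a multi-modal question" are hypothetical, as is the helper `select_samples`; only the flag name comes from the parameters above.

```python
def select_samples(samples, include_multi_modal=True):
    """Keep all samples, or only text-only ones when the flag is False."""
    if include_multi_modal:
        return list(samples)
    # Assumption: a sample is multi-modal when its "image" field is non-empty.
    return [s for s in samples if not s.get("image")]


if __name__ == "__main__":
    demo = [
        {"id": 1, "question": "text only", "image": ""},
        {"id": 2, "question": "needs a figure", "image": "<image data>"},
    ]
    print([s["id"] for s in select_samples(demo, include_multi_modal=False)])
```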

HumanEval#

Back to Top

  • Dataset Name: humaneval

  • Dataset ID: opencompass/humaneval

  • Description:

    HumanEval is a benchmark for evaluating the ability of code generation models to write Python functions based on given specifications. It consists of programming tasks with a defined input-output behavior. By default, the code is executed in the local environment. We recommend using sandbox execution to run and evaluate the generated code safely; please refer to the documentation for more details.

  • Task Categories: Coding

  • Evaluation Metrics:

  • Aggregation Methods: mean_and_pass_at_k

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: openai_humaneval

  • Review Timeout (seconds): 4

  • Prompt Template:

View
Read the following function signature and docstring, and fully implement the function described. Your response should only contain the code for this function.
{question}
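
The `mean_and_pass_at_k` aggregation refers to the pass@k metric. A common way to compute it is the unbiased estimator introduced with HumanEval, 1 - C(n-c, k) / C(n, k), where n is the number of generated samples and c the number that pass the tests. The sketch below implements that estimator; whether this benchmark uses exactly this form is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are incorrect, any k-subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples, 3 of them correct, estimate for k = 1 and k = 5.
    print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```

For k = 1 the estimate reduces to the fraction of correct samples, c / n.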

IFEval#

Back to Top

  • Dataset Name: ifeval

  • Dataset ID: opencompass/ifeval

  • Description:

    IFEval is a benchmark for evaluating instruction-following language models, focusing on their ability to understand and respond to various prompts. It includes a diverse set of tasks and metrics to assess model performance comprehensively.

  • Task Categories: InstructionFollowing

  • Evaluation Metrics: inst_level_loose, inst_level_strict, prompt_level_loose, prompt_level_strict

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default
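
The four metrics combine two granularities (prompt level and instruction level) with two checking modes (strict and loose). The sketch below illustrates only the granularity: at prompt level a prompt counts when every one of its instructions is followed, while at instruction level each instruction counts on its own. The input layout and the helper name are assumptions; how strict and loose checking differ is not shown.

```python
def ifeval_scores(results):
    """Return (prompt_level, inst_level) accuracy.

    `results` holds one list per prompt, with one boolean per instruction
    saying whether the response followed that instruction.
    """
    if not results or any(len(r) == 0 for r in results):
        raise ValueError("every prompt needs at least one instruction result")
    # Prompt level: all instructions of a prompt must be followed.
    prompt_level = sum(all(r) for r in results) / len(results)
    # Instruction level: pool all instructions across prompts.
    total_instructions = sum(len(r) for r in results)
    inst_level = sum(sum(r) for r in results) / total_instructions
    return prompt_level, inst_level


if __name__ == "__main__":
    print(ifeval_scores([[True, True], [True, False], [False]]))
```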


IQuiz#

Back to Top

  • Dataset Name: iquiz

  • Dataset ID: AI-ModelScope/IQuiz

  • Description:

    IQuiz is a benchmark for evaluating AI models on IQ and EQ questions. It consists of multiple-choice questions where the model must select the correct answer and provide an explanation.

  • Task Categories: Chinese, Knowledge, MCQ

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: EQ, IQ

  • Prompt Template:

View
回答下面的单项选择题,请选出其中的正确答案。你的回答的最后一行应该是这样的格式:"答案:LETTER"(不带引号),其中 LETTER 是 {letters} 中的一个。请在回答前进行一步步思考。

问题:{question}
选项:
{choices}

Live-Code-Bench#

Back to Top

  • Dataset Name: live_code_bench

  • Dataset ID: AI-ModelScope/code_generation_lite

  • Description:

    Live Code Bench is a benchmark for evaluating code generation models on real-world coding tasks. It includes a variety of programming problems with test cases to assess the model’s ability to generate correct and efficient code solutions. By default, the code is executed in the local environment. We recommend using sandbox execution to run and evaluate the generated code safely; please refer to the documentation for more details.

  • Task Categories: Coding

  • Evaluation Metrics:

  • Aggregation Methods: mean_and_pass_at_k

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: release_latest

  • Review Timeout (seconds): 6

  • Extra Parameters:

{
    "start_date": null,
    "end_date": null,
    "debug": false
}
  • Prompt Template:

View
### Question:
{question_content}

{format_prompt} ### Answer: (use the provided format with backticks)
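
The `start_date` and `end_date` parameters suggest restricting problems to a date window, with `null` meaning no bound. The sketch below shows one possible reading; the ISO `YYYY-MM-DD` string format, the inclusive bounds, and the helper name `in_window` are assumptions, not documented behavior.

```python
from datetime import date


def in_window(problem_date, start_date=None, end_date=None):
    """Return True if problem_date lies within the optional bounds (inclusive)."""
    if start_date is not None and problem_date < date.fromisoformat(start_date):
        return False
    if end_date is not None and problem_date > date.fromisoformat(end_date):
        return False
    return True


if __name__ == "__main__":
    problems = [date(2024, 6, 1), date(2024, 8, 15), date(2025, 1, 10)]
    kept = [d for d in problems if in_window(d, "2024-07-01", "2024-12-31")]
    print(kept)
```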

LogiQA#

Back to Top

  • Dataset Name: logi_qa

  • Dataset ID: extraordinarylab/logiqa

  • Description:

    LogiQA is a dataset sourced from expert-written questions for testing human logical reasoning.

  • Task Categories: MCQ, Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}.

{question}

{choices}

MaritimeBench#

Back to Top

  • Dataset Name: maritime_bench

  • Dataset ID: HiDolphin/MaritimeBench

  • Description:

    MaritimeBench is a benchmark for evaluating AI models on maritime-related multiple-choice questions. It consists of questions related to maritime knowledge, where the model must select the correct answer from given options.

  • Task Categories: Chinese, Knowledge, MCQ

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
请回答单选题。要求只输出选项,不输出解释,将选项放在[]里,直接输出答案。示例:

题目:在船舶主推进动力装置中,传动轴系在运转中承受以下复杂的应力和负荷,但不包括______。
选项:
A. 电磁力
B. 压拉应力
C. 弯曲应力
D. 扭应力
答:[A]
 当前题目
 {question}
选项:
{choices}

MATH-500#

Back to Top

  • Dataset Name: math_500

  • Dataset ID: AI-ModelScope/MATH-500

  • Description:

    MATH-500 is a benchmark for evaluating mathematical reasoning capabilities of AI models. It consists of 500 diverse math problems across five levels of difficulty, designed to test a model’s ability to solve complex mathematical problems by generating step-by-step solutions and providing the correct final answer.

  • Task Categories: Math, Reasoning

  • Evaluation Metrics: {'acc': {'numeric': True}}

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: Level 1, Level 2, Level 3, Level 4, Level 5

  • Prompt Template:

View
{question}
Please reason step by step, and put your final answer within \boxed{{}}.

MathQA#

Back to Top

  • Dataset Name: math_qa

  • Dataset ID: extraordinarylab/math-qa

  • Description:

    The MathQA dataset was built by annotating the AQuA-RAT dataset with fully specified operational programs, using a new representation language.

  • Task Categories: MCQ, Math, Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}. Think step by step before answering.

{question}

{choices}

Med-MCQA#

Back to Top

  • Dataset Name: med_mcqa

  • Dataset ID: extraordinarylab/medmcqa

  • Description:

    MedMCQA is a large-scale MCQA dataset designed to address real-world medical entrance exam questions.

  • Task Categories: Knowledge, MCQ

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}.

{question}

{choices}

Minerva-Math#

Back to Top

  • Dataset Name: minerva_math

  • Dataset ID: knoveleng/Minerva-Math

  • Description:

    Minerva-math is a benchmark designed to evaluate the mathematical and quantitative reasoning capabilities of LLMs. It consists of 272 problems sourced primarily from MIT OpenCourseWare courses, covering advanced STEM subjects such as solid-state chemistry, astronomy, differential equations, and special relativity at the university and graduate level.

  • Task Categories: Math, Reasoning

  • Evaluation Metrics: {'acc': {'numeric': True}}

  • Aggregation Methods: mean

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
{question}
Please reason step by step, and put your final answer within \boxed{{}}.

MIT-Movie-Trivia#

Back to Top

  • Dataset Name: mit_movie_trivia

  • Dataset ID: extraordinarylab/mit-movie-trivia

  • Description:

    The MIT-Movie-Trivia dataset, originally created for slot filling, is modified by ignoring some slot types (e.g. genre, rating) and merging others (e.g. director and actor in person, and song and movie title in title) in order to keep consistent named entity types across all datasets.

  • Task Categories: Knowledge, NER

  • Evaluation Metrics: accuracy, f1_score, precision, recall

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 5-shot

  • Subsets: default

  • Prompt Template:

View
You are a named entity recognition system that identifies the following entity types:
{entities}

Process the provided text and mark all named entities with XML-style tags.

For example:
<person>John Smith</person> works at <organization>Google</organization> in <location>Mountain View</location>.

Available entity tags: {entity_list}

INSTRUCTIONS:
1. Wrap your entire response in <response>...</response> tags.
2. Inside these tags, include the original text with entity tags inserted.
3. Do not change the original text in any way (preserve spacing, punctuation, case, etc.).
4. Tag ALL entities you can identify using the exact tag names provided.
5. Do not include explanations, just the tagged text.
6. If entity spans overlap, choose the most specific entity type.
7. Ensure every opening tag has a matching closing tag.

Text to process:
{text}

MIT-Restaurant#

Back to Top

  • Dataset Name: mit_restaurant

  • Dataset ID: extraordinarylab/mit-restaurant

  • Description:

    The MIT-Restaurant dataset is a collection of restaurant review text specifically curated for training and testing Natural Language Processing (NLP) models, particularly for Named Entity Recognition (NER). It contains sentences from real reviews, along with corresponding labels in the BIO format.

  • Task Categories: Knowledge, NER

  • Evaluation Metrics: accuracy, f1_score, precision, recall

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 5-shot

  • Subsets: default

  • Prompt Template:

View
You are a named entity recognition system that identifies the following entity types:
{entities}

Process the provided text and mark all named entities with XML-style tags.

For example:
<person>John Smith</person> works at <organization>Google</organization> in <location>Mountain View</location>.

Available entity tags: {entity_list}

INSTRUCTIONS:
1. Wrap your entire response in <response>...</response> tags.
2. Inside these tags, include the original text with entity tags inserted.
3. Do not change the original text in any way (preserve spacing, punctuation, case, etc.).
4. Tag ALL entities you can identify using the exact tag names provided.
5. Do not include explanations, just the tagged text.
6. If entity spans overlap, choose the most specific entity type.
7. Ensure every opening tag has a matching closing tag.

Text to process:
{text}

MMLU#

Back to Top

  • Dataset Name: mmlu

  • Dataset ID: cais/mmlu

  • Description:

    The MMLU (Massive Multitask Language Understanding) benchmark is a comprehensive evaluation suite designed to assess the performance of language models across a wide range of subjects and tasks. It includes multiple-choice questions from various domains, such as history, science, mathematics, and more, providing a robust measure of a model’s understanding and reasoning capabilities.

  • Task Categories: Knowledge, MCQ

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 5-shot

  • Subsets: abstract_algebra, anatomy, astronomy, business_ethics, clinical_knowledge, college_biology, college_chemistry, college_computer_science, college_mathematics, college_medicine, college_physics, computer_security, conceptual_physics, econometrics, electrical_engineering, elementary_mathematics, formal_logic, global_facts, high_school_biology, high_school_chemistry, high_school_computer_science, high_school_european_history, high_school_geography, high_school_government_and_politics, high_school_macroeconomics, high_school_mathematics, high_school_microeconomics, high_school_physics, high_school_psychology, high_school_statistics, high_school_us_history, high_school_world_history, human_aging, human_sexuality, international_law, jurisprudence, logical_fallacies, machine_learning, management, marketing, medical_genetics, miscellaneous, moral_disputes, moral_scenarios, nutrition, philosophy, prehistory, professional_accounting, professional_law, professional_medicine, professional_psychology, public_relations, security_studies, sociology, us_foreign_policy, virology, world_religions

  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}. Think step by step before answering.

{question}

{choices}

MMLU-Pro#

Back to Top

  • Dataset Name: mmlu_pro

  • Dataset ID: TIGER-Lab/MMLU-Pro

  • Description:

    MMLU-Pro is a benchmark for evaluating language models on multiple-choice questions across various subjects. It includes questions from different domains, where the model must select the correct answer from given options.

  • Task Categories: Knowledge, MCQ

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 5-shot

  • Subsets: biology, business, chemistry, computer science, economics, engineering, health, history, law, math, other, philosophy, physics, psychology

  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}. Think step by step before answering.

Question:
{question}
Options:
{choices}

MMLU-Redux#

Back to Top

  • Dataset Name: mmlu_redux

  • Dataset ID: AI-ModelScope/mmlu-redux-2.0

  • Description:

    MMLU-Redux is a benchmark for evaluating language models on multiple-choice questions across various subjects. It includes questions from different domains, where the model must select the correct answer from given options. Incorrect answers in the original data have been corrected.

  • Task Categories: Knowledge, MCQ

  • Evaluation Metrics: {'acc': {'allow_inclusion': True}}

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: abstract_algebra, anatomy, astronomy, business_ethics, clinical_knowledge, college_biology, college_chemistry, college_computer_science, college_mathematics, college_medicine, college_physics, computer_security, conceptual_physics, econometrics, electrical_engineering, elementary_mathematics, formal_logic, global_facts, high_school_biology, high_school_chemistry, high_school_computer_science, high_school_european_history, high_school_geography, high_school_government_and_politics, high_school_macroeconomics, high_school_mathematics, high_school_microeconomics, high_school_physics, high_school_psychology, high_school_statistics, high_school_us_history, high_school_world_history, human_aging, human_sexuality, international_law, jurisprudence, logical_fallacies, machine_learning, management, marketing, medical_genetics, miscellaneous, moral_disputes, moral_scenarios, nutrition, philosophy, prehistory, professional_accounting, professional_law, professional_medicine, professional_psychology, public_relations, security_studies, sociology, us_foreign_policy, virology, world_religions

  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}. Think step by step before answering.

{question}

{choices}

MRI-MCQA#

Back to Top

  • Dataset Name: mri_mcqa

  • Dataset ID: extraordinarylab/mri-mcqa

  • Description:

    MRI-MCQA is a benchmark composed of multiple-choice questions related to Magnetic Resonance Imaging (MRI).

  • Task Categories: Knowledge, MCQ, Medical

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}.

{question}

{choices}

Multi-IF#

Back to Top

  • Dataset Name: multi_if

  • Dataset ID: facebook/Multi-IF

  • Description:

    Multi-IF is a benchmark designed to evaluate the multi-turn instruction-following capabilities of LLMs in a multilingual environment.

  • Task Categories: InstructionFollowing, MultiLingual, MultiTurn

  • Evaluation Metrics: inst_level_loose, inst_level_strict, prompt_level_loose, prompt_level_strict

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: Chinese, English, French, German, Hindi, Italian, Portuguese, Russian, Spanish, Thai, Vietnamese

  • Extra Parameters:

{
    "max_turns": 3
}

MusicTrivia#

Back to Top

  • Dataset Name: music_trivia

  • Dataset ID: extraordinarylab/music-trivia

  • Description:

    MusicTrivia is a curated dataset of multiple-choice questions covering both classical and modern music topics. It includes questions about composers, musical periods, and popular artists, designed for evaluating factual recall and domain-specific music knowledge.

  • Task Categories: Knowledge, MCQ

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}.

{question}

{choices}

MuSR#

Back to Top

  • Dataset Name: musr

  • Dataset ID: AI-ModelScope/MuSR

  • Description:

    MuSR is a benchmark for evaluating AI models on multiple-choice questions related to murder mysteries, object placements, and team allocation.

  • Task Categories: MCQ, Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: murder_mysteries, object_placements, team_allocation

  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}. Think step by step before answering.

{question}

{choices}

Needle-in-a-Haystack#

Back to Top

  • Dataset Name: needle_haystack

  • Dataset ID: AI-ModelScope/Needle-in-a-Haystack-Corpus

  • Description:

    Needle in a Haystack is a benchmark focused on information retrieval tasks. It requires the model to find specific information within a large corpus of text. For details, see the Usage Example.

  • Task Categories: LongContext, Retrieval

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: chinese, english

  • Extra Parameters:

{
    "retrieval_question": "What is the best thing to do in San Francisco?",
    "needles": [
        "\nThe best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.\n"
    ],
    "context_lengths_min": 1000,
    "context_lengths_max": 32000,
    "context_lengths_num_intervals": 10,
    "document_depth_percent_min": 0,
    "document_depth_percent_max": 100,
    "document_depth_percent_intervals": 10,
    "tokenizer_path": "Qwen/Qwen3-0.6B",
    "show_score": false
}
  • System Prompt:

View
You are a helpful AI bot that answers questions for a user. Keep your response short and direct
  • Prompt Template:
View
Please read the following text and answer the question below.

<text>
{context}
</text>

<question>
{question}
</question>

Don't give information outside the document or repeat your findings.
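
The min, max, and interval parameters above describe a grid of context lengths and needle depths. The sketch below builds such a grid from the default values; evenly spaced points with both endpoints included and rounding to integers are assumptions about how the intervals are interpreted, and `evenly_spaced` is a hypothetical helper.

```python
def evenly_spaced(low, high, num):
    """Return num integers spread evenly from low to high, endpoints included."""
    if num < 1:
        raise ValueError("num must be at least 1")
    if num == 1:
        return [low]
    step = (high - low) / (num - 1)
    return [round(low + i * step) for i in range(num)]


# Default values from the extra parameters listed above.
context_lengths = evenly_spaced(1000, 32000, 10)
depth_percents = evenly_spaced(0, 100, 10)

# One test case per (context length, needle depth) combination.
grid = [(length, depth) for length in context_lengths for depth in depth_percents]

if __name__ == "__main__":
    print(len(grid), context_lengths, depth_percents)
```

Under these assumptions the defaults give 10 x 10 = 100 combinations per subset.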

OntoNotes5#

Back to Top

  • Dataset Name: ontonotes5

  • Dataset ID: extraordinarylab/ontonotes5

  • Description:

    OntoNotes Release 5.0 is a large, multilingual corpus containing text in English, Chinese, and Arabic across various genres like news, weblogs, and broadcast conversations. It is richly annotated with multiple layers of linguistic information, including syntax, predicate-argument structure, word sense, named entities, and coreference to support research and development in natural language processing.

  • Task Categories: Knowledge, NER

  • Evaluation Metrics: accuracy, f1_score, precision, recall

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 5-shot

  • Subsets: default

  • Prompt Template:

View
You are a named entity recognition system that identifies the following entity types:
{entities}

Process the provided text and mark all named entities with XML-style tags.

For example:
<person>John Smith</person> works at <organization>Google</organization> in <location>Mountain View</location>.

Available entity tags: {entity_list}

INSTRUCTIONS:
1. Wrap your entire response in <response>...</response> tags.
2. Inside these tags, include the original text with entity tags inserted.
3. Do not change the original text in any way (preserve spacing, punctuation, case, etc.).
4. Tag ALL entities you can identify using the exact tag names provided.
5. Do not include explanations, just the tagged text.
6. If entity spans overlap, choose the most specific entity type.
7. Ensure every opening tag has a matching closing tag.

Text to process:
{text}

PIQA#

Back to Top

  • Dataset Name: piqa

  • Dataset ID: extraordinarylab/piqa

  • Description:

    PIQA addresses the challenging task of reasoning about physical commonsense in natural language.

  • Task Categories: Commonsense, MCQ, Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}.

{question}

{choices}

PolyMath#

Back to Top

  • Dataset Name: poly_math

  • Dataset ID: evalscope/PolyMath

  • Description:

    PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels, with 9,000 high-quality problem samples. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs.

  • Task Categories: Math, MultiLingual, Reasoning

  • Evaluation Metrics: {'acc': {'numeric': True}}

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: ar, bn, de, en, es, fr, id, it, ja, ko, ms, pt, ru, sw, te, th, vi, zh

  • Prompt Template:

{question}

ProcessBench#

  • Dataset Name: process_bench

  • Dataset ID: Qwen/ProcessBench

  • Description:

    ProcessBench is a benchmark for evaluating AI models on mathematical reasoning tasks. It includes various subsets such as GSM8K, Math, OlympiadBench, and OmniMath, each with its own set of problems that require step-by-step reasoning to arrive at the correct answer.

  • Task Categories: Math, Reasoning

  • Evaluation Metrics: correct_acc, error_acc, simple_f1_score

  • Aggregation Methods: f1

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: gsm8k, math, olympiadbench, omnimath

  • Prompt Template:

The following is a math problem and a solution (split into paragraphs, enclosed with tags and indexed from 0):

[Math Problem]

{problem}

[Solution]

{tagged_response}

Your task is to review and critique the solution paragraph by paragraph. Once you identify an error in a paragraph, return the index of the paragraph where the earliest error occurs. Otherwise, return the index of -1 (which typically denotes "not found").

Please put your final answer (i.e., the index) in \boxed{{}}.
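
The three metrics listed for this benchmark can be sketched as below, assuming the model wraps the predicted index in LaTeX `\boxed{...}`. The function names are hypothetical, and the definitions are an assumption based on how ProcessBench results are commonly reported: `error_acc` is accuracy on samples that contain an error (label >= 0), `correct_acc` is accuracy on fully correct samples (label -1), and `simple_f1_score` is their harmonic mean.

```python
import re
from typing import Dict, List, Optional


def extract_boxed_index(response: str) -> Optional[int]:
    """Return the integer inside the last \\boxed{...}, or None if absent."""
    matches = re.findall(r"\\boxed\{\s*(-?\d+)\s*\}", response)
    return int(matches[-1]) if matches else None


def process_bench_scores(
    preds: List[Optional[int]], labels: List[int]
) -> Dict[str, float]:
    """Accuracy on erroneous samples, on correct samples, and their harmonic mean."""
    err = [p == l for p, l in zip(preds, labels) if l != -1]
    cor = [p == l for p, l in zip(preds, labels) if l == -1]
    error_acc = sum(err) / len(err) if err else 0.0
    correct_acc = sum(cor) / len(cor) if cor else 0.0
    total = error_acc + correct_acc
    f1 = 2 * error_acc * correct_acc / total if total else 0.0
    return {"error_acc": error_acc, "correct_acc": correct_acc, "simple_f1_score": f1}
```

A response with no parseable index yields `None`, which never equals a label and is therefore scored as wrong.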

PubMedQA#

  • Dataset Name: pubmedqa

  • Dataset ID: extraordinarylab/pubmed-qa

  • Description:

    PubMedQA tests reasoning over biomedical research texts to answer yes/no/maybe questions.

  • Task Categories: Knowledge, Yes/No

  • Evaluation Metrics: accuracy, f1_score, maybe_ratio, precision, recall, yes_ratio

  • Aggregation Methods: f1

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

{question}
Please answer YES or NO or MAYBE without an explanation.
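
Replies to this prompt need to be mapped onto the three labels before scoring. The sketch below shows one way to do that and to compute `accuracy`, `yes_ratio`, and `maybe_ratio`. The helper names and the `UNKNOWN` fallback label are assumptions, and the precision/recall/F1 metrics are left out because their averaging scheme is not stated here.

```python
from collections import Counter
from typing import Dict, List

LABELS = ("YES", "NO", "MAYBE")


def normalize(response: str) -> str:
    """Map a free-form reply to YES / NO / MAYBE; anything else is 'UNKNOWN'."""
    words = response.strip().upper().replace(".", " ").split()
    first = words[0] if words else ""
    return first if first in LABELS else "UNKNOWN"


def summarize(responses: List[str], golds: List[str]) -> Dict[str, float]:
    """Accuracy plus the share of YES and MAYBE predictions."""
    preds = [normalize(r) for r in responses]
    counts = Counter(preds)
    n = len(preds)
    return {
        "accuracy": sum(p == g.upper() for p, g in zip(preds, golds)) / n,
        "yes_ratio": counts["YES"] / n,
        "maybe_ratio": counts["MAYBE"] / n,
    }
```

The ratio metrics help reveal a model that collapses onto a single label regardless of the question.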

QASC#

  • Dataset Name: qasc

  • Dataset ID: extraordinarylab/qasc

  • Description:

    QASC is a question-answering dataset with a focus on sentence composition. It consists of 9,980 8-way multiple-choice questions about grade school science.

  • Task Categories: Knowledge, MCQ

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}.

{question}

{choices}

RACE#

  • Dataset Name: race

  • Dataset ID: evalscope/race

  • Description:

    RACE is a benchmark for testing reading comprehension and reasoning abilities of neural models. It is constructed from English examinations for Chinese middle and high school students.

  • Task Categories: MCQ, Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 3-shot

  • Subsets: high, middle

  • Prompt Template:

Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}. Think step by step before answering.

{question}

{choices}

SciQ#

  • Dataset Name: sciq

  • Dataset ID: extraordinarylab/sciq

  • Description:

    The SciQ dataset contains crowdsourced science exam questions about Physics, Chemistry and Biology, among others. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.

  • Task Categories: Knowledge, MCQ, ReadingComprehension

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}.

{question}

{choices}

SimpleQA#

  • Dataset Name: simple_qa

  • Dataset ID: evalscope/SimpleQA

  • Description:

    SimpleQA is a benchmark designed to evaluate the performance of language models on simple question-answering tasks. It includes a set of straightforward questions that require basic reasoning and understanding capabilities.

  • Task Categories: Knowledge, QA

  • Evaluation Metrics: is_correct, is_incorrect, is_not_attempted

  • Aggregation Methods: mean

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

Answer the question:

{question}

SIQA#

  • Dataset Name: siqa

  • Dataset ID: extraordinarylab/siqa

  • Description:

    Social Interaction QA (SIQA) is a question-answering benchmark for testing social commonsense intelligence. Contrary to many prior benchmarks that focus on physical or taxonomic knowledge, Social IQa focuses on reasoning about people’s actions and their social implications.

  • Task Categories: Commonsense, MCQ, Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}.

{question}

{choices}

SuperGPQA#

  • Dataset Name: super_gpqa

  • Dataset ID: m-a-p/SuperGPQA

  • Description:

    SuperGPQA is a large-scale multiple-choice question answering dataset, designed to evaluate the generalization ability of models across different fields. It contains 100,000+ questions from 50+ fields, with each question having 10 options.

  • Task Categories: Knowledge, MCQ

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: Aeronautical and Astronautical Science and Technology, Agricultural Engineering, Animal Husbandry, Applied Economics, Aquaculture, Architecture, Art Studies, Astronomy, Atmospheric Science, Basic Medicine, Biology, Business Administration, Chemical Engineering and Technology, Chemistry, Civil Engineering, Clinical Medicine, Computer Science and Technology, Control Science and Engineering, Crop Science, Education, Electrical Engineering, Electronic Science and Technology, Environmental Science and Engineering, Food Science and Engineering, Forestry Engineering, Forestry, Geography, Geological Resources and Geological Engineering, Geology, Geophysics, History, Hydraulic Engineering, Information and Communication Engineering, Instrument Science and Technology, Journalism and Communication, Language and Literature, Law, Library, Information and Archival Management, Management Science and Engineering, Materials Science and Engineering, Mathematics, Mechanical Engineering, Mechanics, Metallurgical Engineering, Military Science, Mining Engineering, Musicology, Naval Architecture and Ocean Engineering, Nuclear Science and Technology, Oceanography, Optical Engineering, Petroleum and Natural Gas Engineering, Pharmacy, Philosophy, Physical Education, Physical Oceanography, Physics, Political Science, Power Engineering and Engineering Thermophysics, Psychology, Public Administration, Public Health and Preventive Medicine, Sociology, Stomatology, Surveying and Mapping Science and Technology, Systems Science, Textile Science and Engineering, Theoretical Economics, Traditional Chinese Medicine, Transportation Engineering, Veterinary Medicine, Weapon Science and Technology

  • Prompt Template:

Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}. Think step by step before answering.

{question}

{choices}

TriviaQA#

  • Dataset Name: trivia_qa

  • Dataset ID: evalscope/trivia_qa

  • Description:

    TriviaQA is a large-scale reading comprehension dataset consisting of question-answer pairs collected from trivia websites. It includes questions with multiple possible answers, making it suitable for evaluating the ability of models to understand and generate answers based on context.

  • Task Categories: QA, ReadingComprehension

  • Evaluation Metrics: {'acc': {'allow_inclusion': True}}

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: rc.wikipedia

  • Prompt Template:

Read the content and answer the following question.

Content: {content}

Question: {question}

The last line of your response should be of the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer to the problem.
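
The `allow_inclusion` option in the metric configuration suggests that a prediction does not have to equal a reference answer exactly. The sketch below shows one plausible reading, stated as an assumption rather than the evaluator's confirmed behavior: a prediction counts as correct when any reference alias appears inside it, ignoring case. Both function names are hypothetical.

```python
import re
from typing import List


def extract_answer(response: str) -> str:
    """Text after the last 'ANSWER:' marker; the whole reply if no marker is present."""
    parts = re.split(r"ANSWER:", response)
    return parts[-1].strip()


def inclusion_match(prediction: str, references: List[str]) -> bool:
    """True if any reference alias occurs in the prediction (case-insensitive)."""
    pred = prediction.lower()
    return any(ref.lower() in pred for ref in references)
```

For example, `inclusion_match(extract_answer("ANSWER: Paris, France"), ["Paris"])` is `True` under this reading, whereas strict equality would reject it.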

TruthfulQA#

  • Dataset Name: truthful_qa

  • Dataset ID: evalscope/truthful_qa

  • Description:

    TruthfulQA is a benchmark designed to evaluate the ability of AI models to answer questions truthfully and accurately. It includes multiple-choice tasks, focusing on the model’s understanding of factual information.

  • Task Categories: Knowledge

  • Evaluation Metrics: multi_choice_acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: multiple_choice

  • Extra Parameters:

{
    "multiple_correct": false
}

  • Prompt Template:

Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}.

{question}

{choices}

Winogrande#

  • Dataset Name: winogrande

  • Dataset ID: AI-ModelScope/winogrande_val

  • Description:

    Winogrande is a benchmark for evaluating AI models on commonsense reasoning tasks, specifically designed to test the ability to resolve ambiguous pronouns in sentences.

  • Task Categories: MCQ, Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}.

{question}

{choices}

WMT2024++#

  • Dataset Name: wmt24pp

  • Dataset ID: extraordinarylab/wmt24pp

  • Description:

    WMT2024 news translation benchmark supporting multiple language pairs. Each subset represents a specific translation direction.

  • Task Categories: MachineTranslation, MultiLingual

  • Evaluation Metrics: bert_score, bleu, comet

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: en-ar_eg, en-ar_sa, en-bg_bg, en-bn_in, en-ca_es, en-cs_cz, en-da_dk, en-de_de, en-el_gr, en-es_mx, en-et_ee, en-fa_ir, en-fi_fi, en-fil_ph, en-fr_ca, en-fr_fr, en-gu_in, en-he_il, en-hi_in, en-hr_hr, en-hu_hu, en-id_id, en-is_is, en-it_it, en-ja_jp, en-kn_in, en-ko_kr, en-lt_lt, en-lv_lv, en-ml_in, en-mr_in, en-nl_nl, en-no_no, en-pa_in, en-pl_pl, en-pt_br, en-pt_pt, en-ro_ro, en-ru_ru, en-sk_sk, en-sl_si, en-sr_rs, en-sv_se, en-sw_ke, en-sw_tz, en-ta_in, en-te_in, en-th_th, en-tr_tr, en-uk_ua, en-ur_pk, en-vi_vn, en-zh_cn, en-zh_tw, en-zu_za

  • Prompt Template:

Translate the following {source_language} sentence into {target_language}:

{source_language}: {source_text}
{target_language}:
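
Each subset name encodes a translation direction, such as `en-de_de` for English to German (Germany locale). The sketch below shows how such a name could be turned into a filled prompt. The `LANGUAGE_NAMES` table and the function name are illustrative assumptions; whether the real template receives language names or codes is not stated here.

```python
# Template text copied from the entry above.
TEMPLATE = (
    "Translate the following {source_language} sentence into {target_language}:\n\n"
    "{source_language}: {source_text}\n"
    "{target_language}:"
)

# Small illustrative lookup; a full table would cover every listed subset.
LANGUAGE_NAMES = {"en": "English", "de": "German", "ja": "Japanese", "zh": "Chinese"}


def build_translation_prompt(subset: str, source_text: str) -> str:
    """Split a subset such as 'en-de_de' into language codes and fill the template."""
    source_code, target_locale = subset.split("-", 1)  # "en", "de_de"
    target_code = target_locale.split("_", 1)[0]       # "de"
    return TEMPLATE.format(
        source_language=LANGUAGE_NAMES[source_code],
        target_language=LANGUAGE_NAMES[target_code],
        source_text=source_text,
    )
```

Ending the prompt with the bare target-language label invites the model to continue with the translation only.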

WNUT2017#

  • Dataset Name: wnut2017

  • Dataset ID: extraordinarylab/wnut2017

  • Description:

    The WNUT2017 dataset is a collection of user-generated text from various social media platforms, like Twitter and YouTube, specifically designed for a named-entity recognition task.

  • Task Categories: Knowledge, NER

  • Evaluation Metrics: accuracy, f1_score, precision, recall

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 5-shot

  • Subsets: default

  • Prompt Template:

You are a named entity recognition system that identifies the following entity types:
{entities}

Process the provided text and mark all named entities with XML-style tags.

For example:
<person>John Smith</person> works at <organization>Google</organization> in <location>Mountain View</location>.

Available entity tags: {entity_list}

INSTRUCTIONS:
1. Wrap your entire response in <response>...</response> tags.
2. Inside these tags, include the original text with entity tags inserted.
3. Do not change the original text in any way (preserve spacing, punctuation, case, etc.).
4. Tag ALL entities you can identify using the exact tag names provided.
5. Do not include explanations, just the tagged text.
6. If entity spans overlap, choose the most specific entity type.
7. Ensure every opening tag has a matching closing tag.

Text to process:
{text}
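
A response that follows the instructions above can be converted back into entity pairs and scored. The sketch below is a minimal illustration with hypothetical helper names. It assumes non-nested tags and compares entities as sets of (type, text) pairs, which may differ from the matching rule the evaluator actually applies.

```python
import re
from typing import List, Tuple

Entity = Tuple[str, str]  # (entity_type, surface_text)


def extract_entities(response: str) -> List[Entity]:
    """Return (type, text) pairs from the tagged text inside <response>...</response>."""
    body = re.search(r"<response>(.*?)</response>", response, flags=re.DOTALL)
    tagged = body.group(1) if body else response
    pattern = re.compile(r"<(\w+)>(.*?)</\1>", flags=re.DOTALL)
    return [(m.group(1), m.group(2)) for m in pattern.finditer(tagged)]


def precision_recall_f1(pred: List[Entity], gold: List[Entity]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 over (type, text) pairs."""
    pred_set, gold_set = set(pred), set(gold)
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

The backreference `\1` in the pattern enforces instruction 7: an opening tag is only accepted together with its matching closing tag.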