VLM Benchmarks#

Below is the list of supported VLM benchmarks. Click on a benchmark name to jump to details.

Benchmark Name

Pretty Name

Task Categories

a_okvqa

A-OKVQA

Knowledge, MCQ, MultiModal

ai2d

AI2D

Knowledge, MultiModal, QA

blink

BLINK

Knowledge, MCQ, MultiModal

cc_bench

CCBench

Knowledge, MCQ, MultiModal

chartqa

ChartQA

Knowledge, MultiModal, QA

cmmmu

CMMMU

Chinese, Knowledge, MultiModal, QA

cmmu

CMMU

Knowledge, MCQ, MultiModal, QA

docvqa

DocVQA

Knowledge, MultiModal, QA

general_vmcq

General-VMCQ

Custom, MCQ, MultiModal

general_vqa

General-VQA

Custom, MultiModal, QA

gsm8k_v

GSM8K-V

Math, MultiModal, Reasoning

hallusion_bench

HallusionBench

Hallucination, MultiModal, Yes/No

infovqa

InfoVQA

Knowledge, MultiModal, QA

math_verse

MathVerse

MCQ, Math, MultiModal, Reasoning

math_vision

MathVision

MCQ, Math, MultiModal, Reasoning

math_vista

MathVista

MCQ, Math, MultiModal, Reasoning

micro_vqa

MicroVQA

Knowledge, MCQ, Medical, MultiModal

mm_bench

MMBench

Knowledge, MultiModal, QA

mm_star

MMStar

Knowledge, MCQ, MultiModal

mmmu

MMMU

Knowledge, MultiModal, QA

mmmu_pro

MMMU-PRO

Knowledge, MCQ, MultiModal

ocr_bench

OCRBench

Knowledge, MultiModal, QA

ocr_bench_v2

OCRBench-v2

Knowledge, MultiModal, QA

olympiad_bench

OlympiadBench

Math, Reasoning

omni_bench

OmniBench

Knowledge, MCQ, MultiModal

omni_doc_bench

OmniDocBench

Knowledge, MultiModal, QA

pope

POPE

Hallucination, MultiModal, Yes/No

real_world_qa

RealWorldQA

Knowledge, MultiModal, QA

science_qa

ScienceQA

Knowledge, MCQ, MultiModal

seed_bench_2_plus

SEED-Bench-2-Plus

Knowledge, MCQ, MultiModal, Reasoning

simple_vqa

SimpleVQA

MultiModal, QA, Reasoning

visulogic

VisuLogic

MCQ, Math, MultiModal, Reasoning

vstar_bench

V*Bench

Grounding, MCQ, MultiModal

zerobench

ZeroBench

Knowledge, MultiModal, QA


Benchmark Details#

A-OKVQA#

Back to Top

  • Dataset Name: a_okvqa

  • Dataset ID: HuggingFaceM4/A-OKVQA

  • Description:

    A-OKVQA is a benchmark designed to probe commonsense reasoning and outside knowledge in visual question answering. Unlike basic VQA tasks that rely solely on the image content, A-OKVQA requires models to utilize a broad spectrum of commonsense and factual knowledge about the world to answer its questions. It includes both multiple-choice and open-ended questions, making it a particularly challenging test for assessing the reasoning capabilities of AI systems.

  • Task Categories: Knowledge, MCQ, MultiModal

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}

AI2D#

Back to Top

  • Dataset Name: ai2d

  • Dataset ID: lmms-lab/ai2d

  • Description:

    AI2D is a benchmark dataset for researching the understanding of diagrams by AI. It contains over 5,000 diverse diagrams from science textbooks (e.g., the water cycle, food webs). Each diagram is accompanied by multiple-choice questions that test an AI’s ability to interpret visual elements, text labels, and their relationships. The benchmark is challenging because it requires jointly understanding the layout, symbols, and text to answer questions correctly.

  • Task Categories: Knowledge, MultiModal, QA

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}


CCBench#

Back to Top

  • Dataset Name: cc_bench

  • Dataset ID: lmms-lab/MMBench

  • Description:

    CCBench is an extension of MMBench with newly design questions about Chinese traditional culture, including Calligraphy Painting, Cultural Relic, Food & Clothes, Historical Figures, Scenery & Building, Sketch Reasoning and Traditional Show.

  • Task Categories: Knowledge, MCQ, MultiModal

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: cc

  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}

ChartQA#

Back to Top

  • Dataset Name: chartqa

  • Dataset ID: lmms-lab/ChartQA

  • Description:

    ChartQA is a benchmark designed to evaluate question-answering capabilities about charts (e.g., bar charts, line graphs, pie charts), focusing on both visual and logical reasoning.

  • Task Categories: Knowledge, MultiModal, QA

  • Evaluation Metrics: relaxed_acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: augmented_test, human_test

  • Prompt Template:

View
{question}

The last line of your response should be of the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the a single word answer to the problem.

CMMMU#

Back to Top

  • Dataset Name: cmmmu

  • Dataset ID: lmms-lab/CMMMU

  • Description:

    CMMMU includes manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering, like its companion, MMMU. These questions span 30 subjects and comprise 39 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures.

  • Task Categories: Chinese, Knowledge, MultiModal, QA

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: 临床医学, 会计, 公共卫生, 农业, 制药, 化学, 历史, 地理, 基础医学, 建筑学, 心理学, 数学, 文献学, 机械工程, 材料, 物理, 生物, 电子学, 社会学, 管理, 经济, 能源和电力, 艺术, 艺术理论, 营销, 计算机科学, 设计, 诊断学与实验室医学, 金融, 音乐


CMMU#

Back to Top

  • Dataset Name: cmmu

  • Dataset ID: evalscope/CMMU

  • Description:

    CMMU is a novel multi-modal benchmark designed to evaluate domain-specific knowledge across seven foundational subjects: math, biology, physics, chemistry, geography, politics, and history.

  • Task Categories: Knowledge, MCQ, MultiModal, QA

  • Evaluation Metrics: {'acc': {'numeric': True}}

  • Aggregation Methods: mean

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: biology, chemistry, geography, history, math, physics, politics

  • Prompt Template:

View
回答下面的单项选择题,请选出其中的正确答案。你的回答的最后一行应该是这样的格式:"答案:[LETTER]"(不带引号),其中 [LETTER] 是 {letters} 中的一个。请在回答前进行一步步思考。

问题:{question}
选项:
{choices}

DocVQA#

Back to Top

  • Dataset Name: docvqa

  • Dataset ID: lmms-lab/DocVQA

  • Description:

    DocVQA (Document Visual Question Answering) is a benchmark designed to evaluate AI systems on their ability to answer questions based on the content of document images, such as scanned pages, forms, or invoices. Unlike general visual question answering, it requires understanding not just the text extracted by OCR, but also the complex layout, structure, and visual elements of a document.

  • Task Categories: Knowledge, MultiModal, QA

  • Evaluation Metrics: anls

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: DocVQA

  • Prompt Template:

View
Answer the question according to the image using a single word or phrase.
{question}
The last line of your response should be of the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer to the question.

General-VMCQ#

Back to Top

  • Dataset Name: general_vmcq

  • Dataset ID: general_vmcq

  • Description:

    A general visual multiple-choice question answering dataset for custom multimodal evaluation. Format similar to MMMU, not OpenAI message format. Images are plain strings (local/remote path or base64 data URL). For detailed instructions on how to use this benchmark, please refer to the User Guide.

  • Task Categories: Custom, MCQ, MultiModal

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}

General-VQA#

Back to Top

  • Dataset Name: general_vqa

  • Dataset ID: general_vqa

  • Description:

    A general visual question answering dataset for custom multimodal evaluation. Supports OpenAI-compatible message format with images (local paths or base64). For detailed instructions, please refer to the User Guide.

  • Task Categories: Custom, MultiModal, QA

  • Evaluation Metrics: BLEU, Rouge

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default


GSM8K-V#

Back to Top

  • Dataset Name: gsm8k_v

  • Dataset ID: evalscope/GSM8K-V

  • Description:

    GSM8K-V is a purely visual multi-image mathematical reasoning benchmark that systematically maps each GSM8K math word problem into its visual counterpart to enable a clean, within-item comparison across modalities.

  • Task Categories: Math, MultiModal, Reasoning

  • Evaluation Metrics: {'acc': {'numeric': True}}

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
You are an expert at solving mathematical word problems. Please solve the following problem step by step, showing your reasoning.

When providing your final answer:
- If the answer can be expressed as a whole number (integer), provide it as an integer
Problem: {question}

Please think step by step. After your reasoning, output your final answer on a new line starting with "FINAL ANSWER: " followed by the number only.

HallusionBench#

Back to Top

  • Dataset Name: hallusion_bench

  • Dataset ID: lmms-lab/HallusionBench

  • Description:

    HallusionBench is an advanced diagnostic benchmark designed to evaluate image-context reasoning, analyze models’ tendencies for language hallucination and visual illusion in large vision-language models (LVLMs).

  • Task Categories: Hallucination, MultiModal, Yes/No

  • Evaluation Metrics: aAcc, fAcc, qAcc

  • Aggregation Methods: f1

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
{question}
Please answer YES or NO without an explanation.

InfoVQA#

Back to Top

  • Dataset Name: infovqa

  • Dataset ID: lmms-lab/DocVQA

  • Description:

    InfoVQA (Information Visual Question Answering) is a benchmark designed to evaluate how well AI models can answer questions based on information-dense images, such as charts, graphs, diagrams, maps, and infographics.

  • Task Categories: Knowledge, MultiModal, QA

  • Evaluation Metrics: anls

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: InfographicVQA

  • Prompt Template:

View
Answer the question according to the image using a single word or phrase.
{question}
The last line of your response should be of the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer to the question.

MathVerse#

Back to Top

  • Dataset Name: math_verse

  • Dataset ID: evalscope/MathVerse

  • Description:

    MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering varying degrees of information content in multi-modality, contributing to 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning.

  • Task Categories: MCQ, Math, MultiModal, Reasoning

  • Evaluation Metrics: {'acc': {'numeric': True}}

  • Aggregation Methods: mean

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: Text Dominant, Text Lite, Vision Dominant, Vision Intensive, Vision Only

  • Prompt Template:

View
{question}
Please reason step by step, and put your final answer within \boxed{{}}.

MathVision#

Back to Top

  • Dataset Name: math_vision

  • Dataset ID: evalscope/MathVision

  • Description:

    The MATH-Vision (MATH-V) dataset, a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions.

  • Task Categories: MCQ, Math, MultiModal, Reasoning

  • Evaluation Metrics: {'acc': {'numeric': True}}

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: level 1, level 2, level 3, level 4, level 5

  • Prompt Template:

View
{question}
Please reason step by step, and put your final answer within \boxed{{}} without units.

MathVista#

Back to Top

  • Dataset Name: math_vista

  • Dataset ID: evalscope/MathVista

  • Description:

    MathVista is a consolidated Mathematical reasoning benchmark within Visual contexts. It consists of three newly created datasets, IQTest, FunctionQA, and PaperQA, which address the missing visual domains and are tailored to evaluate logical reasoning on puzzle test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper figures, respectively. It also incorporates 9 MathQA datasets and 19 VQA datasets from the literature, which significantly enrich the diversity and complexity of visual perception and mathematical reasoning challenges within our benchmark. In total, MathVista includes 6,141 examples collected from 31 different datasets.

  • Task Categories: MCQ, Math, MultiModal, Reasoning

  • Evaluation Metrics: {'acc': {'numeric': True}}

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
{question}
Please reason step by step, and put your final answer within \boxed{{}} without units.

MicroVQA#

Back to Top

  • Dataset Name: micro_vqa

  • Dataset ID: evalscope/MicroVQA

  • Description:

    MicroVQA is expert-curated benchmark for multimodal reasoning for microscopy-based scientific research

  • Task Categories: Knowledge, MCQ, Medical, MultiModal

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}

MMBench#

Back to Top

  • Dataset Name: mm_bench

  • Dataset ID: lmms-lab/MMBench

  • Description:

    MMBench is a comprehensive evaluation pipeline comprised of meticulously curated multimodal dataset and a novel circulareval strategy using ChatGPT. It is comprised of 20 ability dimensions defined by MMBench. It also contains chinese version with translated question.

  • Task Categories: Knowledge, MultiModal, QA

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: cn, en

  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}

MMStar#

Back to Top

  • Dataset Name: mm_star

  • Dataset ID: evalscope/MMStar

  • Description:

    MMStar: an elite vision-indispensible multi-modal benchmark, aiming to ensure each curated sample exhibits visual dependency, minimal data leakage, and requires advanced multi-modal capabilities.

  • Task Categories: Knowledge, MCQ, MultiModal

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: coarse perception, fine-grained perception, instance reasoning, logical reasoning, math, science & technology

  • Prompt Template:

View
Answer the following multiple choice question.
The last line of your response should be of the following format:
'ANSWER: [LETTER]' (without quotes)
where [LETTER] is one of A,B,C,D. Think step by step before answering.

{question}

MMMU#

Back to Top

  • Dataset Name: mmmu

  • Dataset ID: AI-ModelScope/MMMU

  • Description:

    MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI) benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures.

  • Task Categories: Knowledge, MultiModal, QA

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: Accounting, Agriculture, Architecture_and_Engineering, Art_Theory, Art, Basic_Medical_Science, Biology, Chemistry, Clinical_Medicine, Computer_Science, Design, Diagnostics_and_Laboratory_Medicine, Economics, Electronics, Energy_and_Power, Finance, Geography, History, Literature, Manage, Marketing, Materials, Math, Mechanical_Engineering, Music, Pharmacy, Physics, Psychology, Public_Health, Sociology

  • Prompt Template:

View
Solve the following problem step by step. The last line of your response should be of the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer to the problem.

{question}

Remember to put your answer on its own line at the end in the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer to the problem, and you do not need to use a \boxed command.

MMMU-PRO#

Back to Top

  • Dataset Name: mmmu_pro

  • Dataset ID: AI-ModelScope/MMMU_Pro

  • Description:

    MMMU-Pro is an enhanced multimodal benchmark designed to rigorously assess the true understanding capabilities of advanced AI models across multiple modalities. It builds upon the original MMMU benchmark by introducing several key improvements that make it more challenging and realistic, ensuring that models are evaluated on their genuine ability to integrate and comprehend both visual and textual information.

  • Task Categories: Knowledge, MCQ, MultiModal

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: Accounting, Agriculture, Architecture_and_Engineering, Art_Theory, Art, Basic_Medical_Science, Biology, Chemistry, Clinical_Medicine, Computer_Science, Design, Diagnostics_and_Laboratory_Medicine, Economics, Electronics, Energy_and_Power, Finance, Geography, History, Literature, Manage, Marketing, Materials, Math, Mechanical_Engineering, Music, Pharmacy, Physics, Psychology, Public_Health, Sociology

  • Extra Parameters:

{
    "dataset_format": {
        "type": "str",
        "description": "Dataset format variant. Choices: ['standard (4 options)', 'standard (10 options)', 'vision'].",
        "value": "standard (4 options)",
        "choices": [
            "standard (4 options)",
            "standard (10 options)",
            "vision"
        ]
    }
}
  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}

OCRBench#

Back to Top

  • Dataset Name: ocr_bench

  • Dataset ID: evalscope/OCRBench

  • Description:

    OCRBench is a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models. It comprises five components: Text Recognition, SceneText-Centric VQA, Document-Oriented VQA, Key Information Extraction, and Handwritten Mathematical Expression Recognition. The benchmark includes 1000 question-answer pairs, and all the answers undergo manual verification and correction to ensure a more precise evaluation.

  • Task Categories: Knowledge, MultiModal, QA

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: Artistic Text Recognition, Digit String Recognition, Doc-oriented VQA, Handwriting Recognition, Handwritten Mathematical Expression Recognition, Irregular Text Recognition, Key Information Extraction, Non-Semantic Text Recognition, Regular Text Recognition, Scene Text-centric VQA

  • Prompt Template:

View
{question}

OCRBench-v2#

Back to Top

  • Dataset Name: ocr_bench_v2

  • Dataset ID: evalscope/OCRBench_v2

  • Description:

    OCRBench v2 is a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios including street scene, receipt, formula, diagram, and so on), and thorough evaluation metrics, with a total of 10, 000 human-verified question-answering pairs and a high proportion of difficult samples.

  • Task Categories: Knowledge, MultiModal, QA

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: APP agent en, ASCII art classification en, VQA with position en, chart parsing en, cognition VQA cn, cognition VQA en, diagram QA en, document classification en, document parsing cn, document parsing en, fine-grained text recognition en, formula recognition cn, formula recognition en, full-page OCR cn, full-page OCR en, handwritten answer extraction cn, key information extraction cn, key information extraction en, key information mapping en, math QA en, reasoning VQA cn, reasoning VQA en, science QA en, table parsing cn, table parsing en, text counting en, text grounding en, text recognition en, text spotting en, text translation cn

  • Prompt Template:

View
{question}

OlympiadBench#

Back to Top

  • Dataset Name: olympiad_bench

  • Dataset ID: AI-ModelScope/OlympiadBench

  • Description:

    OlympiadBench is an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions, including the Chinese college entrance exam. In the subsets: OE stands for Open-Ended, TP stands for Theorem Proving, MM stands for Multimodal, TO stands for Text-Only, CEE stands for Chinese Entrance Exam, COMP stands for Comprehensive. Note: The TP subsets can’t be evaluated with auto-judge for now.

  • Task Categories: Math, Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: OE_MM_maths_en_COMP, OE_MM_maths_zh_CEE, OE_MM_maths_zh_COMP, OE_MM_physics_en_COMP, OE_MM_physics_zh_CEE, OE_TO_maths_en_COMP, OE_TO_maths_zh_CEE, OE_TO_maths_zh_COMP, OE_TO_physics_en_COMP, OE_TO_physics_zh_CEE, TP_MM_maths_en_COMP, TP_MM_maths_zh_CEE, TP_MM_maths_zh_COMP, TP_MM_physics_en_COMP, TP_TO_maths_en_COMP, TP_TO_maths_zh_CEE, TP_TO_maths_zh_COMP, TP_TO_physics_en_COMP

  • Prompt Template:

View
{question}
Please reason step by step, and put your final answer within \boxed{{}}.

OmniBench#

Back to Top

  • Dataset Name: omni_bench

  • Dataset ID: m-a-p/OmniBench

  • Description:

    OmniBench, a pioneering universal multimodal benchmark designed to rigorously evaluate MLLMs’ capability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.

  • Task Categories: Knowledge, MCQ, MultiModal

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Extra Parameters:

{
    "use_image": {
        "type": "bool",
        "description": "Whether to provide the raw image. False uses textual alternative.",
        "value": true
    },
    "use_audio": {
        "type": "bool",
        "description": "Whether to provide the raw audio. False uses textual alternative.",
        "value": true
    }
}
  • Prompt Template:

View
Answer the following multiple choice question based on the image and audio content. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}

OmniDocBench#

Back to Top

  • Dataset Name: omni_doc_bench

  • Dataset ID: evalscope/OmniDocBench_tsv

  • Description:

    OmniDocBench is an evaluation dataset for diverse document parsing in real-world scenarios, with the following characteristics:

    • Diverse Document Types: The evaluation set contains 1355 PDF pages, covering 9 document types, 4 layout types and 3 language types. It has broad coverage including academic papers, financial reports, newspapers, textbooks, handwritten notes, etc.

    • Rich Annotations: Contains location information for 15 block-level (text paragraphs, titles, tables, etc., over 20k in total) and 4 span-level (text lines, inline formulas, superscripts/subscripts, etc., over 80k in total) document elements, as well as recognition results for each element region (text annotations, LaTeX formula annotations, tables with both LaTeX and HTML annotations). OmniDocBench also provides reading order annotations for document components. Additionally, it includes various attribute labels at page and block levels, with 5 page attribute labels, 3 text attribute labels and 6 table attribute labels. The evaluation in EvalScope implements the end2end and quick_match methods from the official OmniDocBench-v1.5 repository.

  • Task Categories: Knowledge, MultiModal, QA

  • Evaluation Metrics: display_formula, reading_order, table, text_block

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Extra Parameters:

{
    "match_method": {
        "type": "str",
        "description": "Scoring match method used for evaluation.",
        "value": "quick_match",
        "choices": [
            "quick_match",
            "simple_match",
            "no_split"
        ]
    }
}
  • Prompt Template:

View
 You are an AI assistant specialized in converting PDF images to Markdown format. Please follow these instructions for the conversion:

    1. Text Processing:
    - Accurately recognize all text content in the PDF image without guessing or inferring.
    - Convert the recognized text into Markdown format.
    - Maintain the original document structure, including headings, paragraphs, lists, etc.

    2. Mathematical Formula Processing:
    - Convert all mathematical formulas to LaTeX format.
    - Enclose inline formulas with \( \). For example: This is an inline formula \( E = mc^2 \)
    - Enclose block formulas with \\[ \\]. For example: \[ \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]

    3. Table Processing:
    - Convert tables to HTML format.
    - Wrap the entire table with <table> and </table>.

    4. Figure Handling:
    - Ignore figures content in the PDF image. Do not attempt to describe or convert images.

    5. Output Format:
    - Ensure the output Markdown document has a clear structure with appropriate line breaks between elements.
    - For complex layouts, try to maintain the original document's structure and format as closely as possible.

    Please strictly follow these guidelines to ensure accuracy and consistency in the conversion. Your task is to accurately convert the content of the PDF image into Markdown format without adding any extra explanations or comments.

POPE#

Back to Top

  • Dataset Name: pope

  • Dataset ID: lmms-lab/POPE

  • Description:

    POPE (Polling-based Object Probing Evaluation) is a benchmark designed to evaluate object hallucination in large vision-language models (LVLMs). It tests models by having them answer simple yes/no questions about the presence of specific objects in an image. This method helps measure how accurately a model’s responses align with the visual content, with a focus on identifying instances where models claim objects exist that are not actually present. The benchmark employs various sampling strategies, including random, popular, and adversarial sampling, to create a robust set of questions for assessment.

  • Task Categories: Hallucination, MultiModal, Yes/No

  • Evaluation Metrics: accuracy, f1_score, precision, recall, yes_ratio

  • Aggregation Methods: f1

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: adversarial, popular, random

  • Prompt Template:

View
{question}
Please answer YES or NO without an explanation.

RealWorldQA#

Back to Top

  • Dataset Name: real_world_qa

  • Dataset ID: lmms-lab/RealWorldQA

  • Description:

    RealWorldQA is a benchmark designed to evaluate the real-world spatial understanding capabilities of multimodal AI models, contributed by XAI. It assesses how well these models comprehend physical environments. The benchmark consists of 700+ images, each accompanied by a question and a verifiable answer. These images are drawn from real-world scenarios, including those captured from vehicles. The goal is to advance AI models’ understanding of our physical world.

  • Task Categories: Knowledge, MultiModal, QA

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
Read the picture and solve the following problem step by step.The last line of your response should be of the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer to the problem.

{question}

Remember to put your answer on its own line at the end in the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer to the problem, and you do not need to use a \boxed command.

ScienceQA#

Back to Top

  • Dataset Name: science_qa

  • Dataset ID: AI-ModelScope/ScienceQA

  • Description:

    ScienceQA is a multimodal benchmark consisting of multiple-choice science questions derived from elementary and high school curricula. It covers a diverse range of subjects, including natural science, social science, and language science. A key feature of this benchmark is that most questions are accompanied by both image and text contexts, and are annotated with detailed lectures and explanations that support the correct answer, facilitating research into models that can generate chains of thought.

  • Task Categories: Knowledge, MCQ, MultiModal

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}

SEED-Bench-2-Plus#

Back to Top

  • Dataset Name: seed_bench_2_plus

  • Dataset ID: evalscope/SEED-Bench-2-Plus

  • Description:

    SEED-Bench-2-Plus is a large-scale benchmark to evaluate Multimodal Large Language Models (MLLMs). It consists of 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs, each of which covers a wide spectrum of text-rich scenarios in the real world.

  • Task Categories: Knowledge, MCQ, MultiModal, Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: chart, map, web

  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}

SimpleVQA#

Back to Top

  • Dataset Name: simple_vqa

  • Dataset ID: m-a-p/SimpleVQA

  • Description:

    SimpleVQA, the first comprehensive multi-modal benchmark to evaluate the factuality ability of MLLMs to answer natural language short questions. SimpleVQA is characterized by six key features: it covers multiple tasks and multiple scenarios, ensures high quality and challenging queries, maintains static and timeless reference answers, and is straightforward to evaluate.

  • Task Categories: MultiModal, QA, Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
Answer the question:

{question}

VisuLogic#

Back to Top

  • Dataset Name: visulogic

  • Dataset ID: evalscope/VisuLogic

  • Description:

    VisuLogic is a benchmark aimed at evaluating the visual reasoning capabilities of Multi-modal Large Language Models (MLLMs), independent of textual reasoning processes. It features carefully constructed visual reasoning tasks spanning multiple categories, divided into six types based on required reasoning skills (e.g., Quantitative Reasoning, which involves understanding and deducing changes in the quantity of elements in images). Unlike existing benchmarks, VisuLogic is a challenging visual reasoning benchmark that is inherently difficult to articulate using language, providing a more rigorous evaluation of the visual reasoning capabilities of MLLMs.

  • Task Categories: MCQ, Math, MultiModal, Reasoning

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: Attribute Reasoning, Other, Positional Reasoning, Quantitative Reasoning, Spatial Reasoning, Stylistic Reasoning

  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A, B, C, D. Think step by step before answering.

{question}

V*Bench#

Back to Top

  • Dataset Name: vstar_bench

  • Dataset ID: lmms-lab/vstar-bench

  • Description:

    V*Bench is a benchmark designed for evaluating visual search capabilities within multimodal reasoning systems. It focuses on the ability to actively locate and identify specific visual information in high-resolution images, which is crucial for tasks requiring fine-grained visual understanding. This benchmark helps assess how well models can perform targeted visual queries, often guided by natural language instructions, to find and reason about specific elements in complex visual scenes .

  • Task Categories: Grounding, MCQ, MultiModal

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: No

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A, B, C, D. Think step by step before answering.

{question}

ZeroBench#

Back to Top

  • Dataset Name: zerobench

  • Dataset ID: evalscope/zerobench

  • Description:

    ZeroBench is a challenging visual reasoning benchmark for Large Multimodal Models (LMMs). It consists of a main set of 100 high-quality, manually curated questions covering numerous domains, reasoning types and image type. Questions in ZeroBench have been designed and calibrated to be beyond the capabilities of current frontier models. As such, none of the evaluated models achieves a non-zero pass@1 (with greedy decoding) or 5/5 reliability score.

  • Task Categories: Knowledge, MultiModal, QA

  • Evaluation Metrics: acc

  • Aggregation Methods: mean

  • Requires LLM Judge: Yes

  • Default Shots: 0-shot

  • Subsets: default

  • Prompt Template:

View
{question}



Let's think step by step and give the final answer in curly braces,
like this: {{final answer}}"