ScienceQA

Overview

ScienceQA is a multimodal benchmark of multiple-choice science questions drawn from elementary through high school curricula. It covers natural science, social science, and language science, and a question may be accompanied by an image context, a text context, or both.

Task Description

Task Type: Multimodal Science Question Answering
Input: Question, optional image context, and a list of answer choices
Output: Letter of the correct answer choice
Domains: Natural science, social science, language science

Key Features

Questions sourced from real K-12 science curricula
Roughly half of the test samples include an image context (2,017 images across 4,241 samples); the rest are text-only
Annotated with detailed lectures and explanations
Supports research into chain-of-thought reasoning
Covers multiple grade levels and difficulty ranges
Rich metadata including topic, skill, and category information

Evaluation Notes

Default evaluation uses the test split
Primary metric: Accuracy on multiple-choice questions
Uses Chain-of-Thought (CoT) prompting: the model is asked to think step by step and to end its response with an 'ANSWER: [LETTER]' line
Metadata includes solution explanations for analysis
Questions span grades from elementary to high school
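
The accuracy metric noted above depends on reading the final letter out of a free-form CoT response. The following is a minimal sketch of that idea, not evalscope's actual parser: the regular expression and the helper names `extract_answer` and `accuracy` are assumptions made for illustration.

```python
import re

# Assumed pattern: "ANSWER: X" at the end of the final non-empty line.
ANSWER_RE = re.compile(r"ANSWER:\s*([A-Z])\s*$")


def extract_answer(response):
    """Return the letter from the last non-empty line, or None if absent."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    if not lines:
        return None
    match = ANSWER_RE.search(lines[-1])
    return match.group(1) if match else None


def accuracy(responses, targets):
    """Fraction of responses whose extracted letter equals the target letter."""
    if not targets:
        return 0.0
    correct = sum(extract_answer(r) == t for r, t in zip(responses, targets))
    return correct / len(targets)


# Illustrative responses, not real model output.
responses = [
    "O goddess is a direct address to a nonhuman entity.\nANSWER: B",
    "The word order looks reversed.\nANSWER: A",
    "I am not sure.",
]
targets = ["B", "B", "B"]
score = accuracy(responses, targets)
```

In this sketch a response without a parsable final line counts as incorrect, which is one reasonable convention; the real scorer may treat such cases differently.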

Properties

Property	Value
Benchmark Name	`science_qa`
Dataset ID	AI-ModelScope/ScienceQA
Paper	N/A
Tags	`Knowledge`, `MCQ`, `MultiModal`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics

Metric	Value
Total Samples	4,241
Prompt Length (Mean)	370.49 chars
Prompt Length (Min/Max)	250 / 1037 chars

Image Statistics:

Metric	Value
Total Images	2,017
Images per Sample	min: 1, max: 1, mean: 1 (over samples that include an image)
Resolution Range	170x77 - 750x625 px
Formats	png
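
The two tables can be combined to estimate how many test samples carry an image. This is a small worked calculation, assuming that each image belongs to exactly one sample and that the per-sample image counts describe only image-bearing samples.

```python
total_samples = 4241  # "Total Samples" from the statistics table
total_images = 2017   # "Total Images" from the image statistics table

# Assumption: one image per image-bearing sample, so the image count
# equals the number of samples that include an image.
samples_with_image = total_images
share_with_image = samples_with_image / total_samples

print(f"{share_with_image:.1%} of samples include an image")
# prints: 47.6% of samples include an image
```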

Sample Example

Subset: default

{
  "input": [
    {
      "id": "7586b8fe",
      "content": [
        {
          "text": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B. Think step by step before answering.\n\nWhich figure of speech is used in this text?\nSing, O goddess, the anger of Achilles son of Peleus, that brought countless ills upon the Achaeans.\n—Homer, The Iliad\n\nA) chiasmus\nB) apostrophe"
        }
      ]
    }
  ],
  "choices": [
    "chiasmus",
    "apostrophe"
  ],
  "target": "B",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "hint": "",
    "task": "closed choice",
    "grade": "grade11",
    "subject": "language science",
    "topic": "figurative-language",
    "category": "Literary devices",
    "skill": "Classify the figure of speech: anaphora, antithesis, apostrophe, assonance, chiasmus, understatement",
    "lecture": "Figures of speech are words or phrases that use language in a nonliteral or unusual way. They can make writing more expressive.\nAnaphora is the repetition of the same word or words at the beginning of several phrases or clauses.\nWe are united ... [TRUNCATED] ... but reverses the order of words.\nNever let a fool kiss you or a kiss fool you.\nUnderstatement involves deliberately representing something as less serious or important than it really is.\nAs you know, it can get a little cold in the Antarctic.",
    "solution": "The text uses apostrophe, a direct address to an absent person or a nonhuman entity.\nO goddess is a direct address to a goddess, a nonhuman entity."
  }
}

Note: Some content was truncated for display.
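
The `metadata` fields in the sample above (`subject`, `grade`, `topic`, and so on) make it possible to break accuracy down by group. The sketch below assumes each record has been given a hypothetical `prediction` field holding the model's extracted letter; that field and the helper `accuracy_by` are not part of the dataset or of evalscope, and the records are illustrative.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Group records by metadata[key] and return accuracy per group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec["metadata"][key]
        totals[group] += 1
        hits[group] += int(rec["prediction"] == rec["target"])
    return {group: hits[group] / totals[group] for group in totals}


# Illustrative records; only "target" and "metadata" mirror the sample format.
records = [
    {"target": "B", "prediction": "B", "metadata": {"subject": "language science"}},
    {"target": "A", "prediction": "C", "metadata": {"subject": "natural science"}},
    {"target": "C", "prediction": "C", "metadata": {"subject": "natural science"}},
]

per_subject = accuracy_by(records, "subject")
```

Passing `"grade"` or `"topic"` as the key gives the same breakdown along a different axis, provided every record carries that metadata field.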

Prompt Template

Prompt Template:

Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}
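
As a rough illustration of how the three placeholders might be filled, the sketch below reproduces the formatting visible in the sample prompt (`A,B` for the letters and `A) ...` for each choice). The helper `build_prompt` is hypothetical and is not the library's own implementation.

```python
# Template text copied from this section; the placeholders are its own.
PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'ANSWER: [LETTER]' (without "
    "quotes) where [LETTER] is one of {letters}. Think step by step before "
    "answering.\n\n{question}\n\n{choices}"
)


def build_prompt(question, choices):
    """Hypothetical helper: fill the template for one question."""
    labels = [chr(ord("A") + i) for i in range(len(choices))]
    letters = ",".join(labels)  # "A,B" for two choices
    choice_block = "\n".join(
        f"{label}) {text}" for label, text in zip(labels, choices)
    )
    return PROMPT_TEMPLATE.format(
        letters=letters, question=question, choices=choice_block
    )


prompt = build_prompt(
    "Which figure of speech is used in this text?",
    ["chiasmus", "apostrophe"],
)
print(prompt)
```

Deriving the letters from the number of choices keeps the instruction line consistent with the options actually shown, which matters because ScienceQA questions do not all have the same number of choices.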

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets science_qa \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['science_qa'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)