CharXiv#

Overview#

CharXiv is a chart understanding benchmark, published at NeurIPS 2024, that evaluates multimodal large language models on realistic scientific charts from arXiv papers. It tests both low-level perception of chart elements (descriptive questions) and high-level reasoning about chart data (reasoning questions).

Task Description#

- Task Type: Chart Understanding (Descriptive + Reasoning)
- Input: Scientific chart image + question
- Output: Free-form text answer
- Domains: cs, physics, math, eess, q-bio, q-fin, stat, econ

Key Features#

- 2,323 real scientific charts from arXiv papers across 8 disciplines
- Two question types:
  - Descriptive (4 per chart): Basic element identification (titles, axes, legends, trends, etc.)
  - Reasoning (1 per chart): Higher-order reasoning requiring data synthesis
- 19 descriptive question templates covering information extraction, enumeration, pattern recognition, counting, and compositionality
- 4 reasoning answer types: text-in-chart, text-in-general, number-in-chart, number-in-general
- Validation set (1,000 charts) and test set (1,323 charts)
- Evaluation via LLM judge following the official CharXiv grading protocol
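
The counts above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes every chart carries exactly 4 descriptive questions and 1 reasoning question, as the list states; the variable and function names are illustrative only.

```python
# Split sizes and per-chart question counts as listed under Key Features.
VALIDATION_CHARTS = 1_000
TEST_CHARTS = 1_323
DESCRIPTIVE_PER_CHART = 4
REASONING_PER_CHART = 1

questions_per_chart = DESCRIPTIVE_PER_CHART + REASONING_PER_CHART
total_charts = VALIDATION_CHARTS + TEST_CHARTS


def question_count(charts: int) -> int:
    """Number of questions in a split, assuming a fixed count per chart."""
    return charts * questions_per_chart


print(total_charts)                       # 2323
print(question_count(VALIDATION_CHARTS))  # 5000
print(question_count(TEST_CHARTS))        # 6615
```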

Evaluation Notes#

- Default evaluation uses the validation split (1,000 charts, 5,000 questions)
- Each chart yields 5 samples: 4 descriptive + 1 reasoning
- Primary metric: Accuracy via LLM-as-judge
- Subsets: `descriptive` and `reasoning` (also by category)
- Requires `judge_model_args` to be configured for the LLM judge
Paper | GitHub
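
Because grading relies on an LLM judge, a run needs judge settings in addition to the model under test. The sketch below only assembles such a settings dictionary. The key names (`model_id`, `api_url`, `api_key`), the helper `build_judge_model_args`, and the environment variable `JUDGE_API_KEY` are assumptions for illustration, so check them against the evalscope documentation for your installed version before passing the result as `judge_model_args`.

```python
import os


def build_judge_model_args(model_id: str, api_url: str) -> dict:
    """Assemble judge settings (hypothetical helper; key names are assumed)."""
    return {
        'model_id': model_id,  # name of the judge model served at api_url
        'api_url': api_url,    # OpenAI-compatible endpoint of the judge
        # Read the key from the environment rather than hard-coding it.
        'api_key': os.environ.get('JUDGE_API_KEY', 'EMPTY_TOKEN'),
    }


# Placeholder values in the same style as the usage examples in this page.
judge_model_args = build_judge_model_args('YOUR_JUDGE_MODEL', 'OPENAI_API_COMPAT_URL')
```

Keeping the judge settings in one small function makes it easy to reuse the same judge across several benchmark runs.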

Properties#

| Property | Value |
| --- | --- |
| Benchmark Name | `charxiv` |
| Dataset ID | princeton-nlp/CharXiv |
| Paper | Paper |
| Tags | `MultiModal`, `QA`, `Reasoning` |
| Metrics | `acc` |
| Default Shots | 0-shot |
| Evaluation Split | `validation` |
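
The `acc` metric is the fraction of samples the judge marks correct. A minimal sketch of how such verdicts could be aggregated per subset follows. The record layout, the made-up verdicts, and the `accuracy_by_subset` helper are hypothetical and do not reflect evalscope's internal result format; the `overall` entry is a plain average over all samples, which is also an assumption.

```python
from collections import defaultdict


def accuracy_by_subset(records):
    """Return {subset: accuracy} plus an 'overall' entry from 0/1 judge verdicts."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        # Count each record once for its subset and once for the overall score.
        for key in (rec['subset_key'], 'overall'):
            total[key] += 1
            correct[key] += rec['correct']
    return {key: correct[key] / total[key] for key in total}


# Made-up verdicts for illustration only (1 = judged correct, 0 = judged wrong).
records = [
    {'subset_key': 'descriptive', 'correct': 1},
    {'subset_key': 'descriptive', 'correct': 1},
    {'subset_key': 'descriptive', 'correct': 0},
    {'subset_key': 'descriptive', 'correct': 1},
    {'subset_key': 'reasoning', 'correct': 0},
]
scores = accuracy_by_subset(records)
print(scores)
```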

Data Statistics#

| Metric | Value |
| --- | --- |
| Total Samples | 5,000 |
| Prompt Length (Mean) | 276.24 chars |
| Prompt Length (Min/Max) | 80 / 687 chars |

Per-Subset Statistics:

| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
| --- | --- | --- | --- | --- |
| `descriptive` | 4,000 | 261.51 | 156 | 432 |
| `reasoning` | 1,000 | 335.14 | 80 | 687 |
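
The overall prompt statistics follow from the per-subset rows: the overall mean is the sample-weighted average of the subset means, and the overall minimum and maximum are the extremes across subsets. A short check using only the numbers in the tables (the subset means are themselves rounded, so the result is approximate):

```python
# (samples, prompt mean, prompt min, prompt max) copied from the per-subset table.
subsets = {
    'descriptive': (4000, 261.51, 156, 432),
    'reasoning': (1000, 335.14, 80, 687),
}

total_samples = sum(n for n, _, _, _ in subsets.values())
# Weight each subset mean by its sample count.
weighted_mean = sum(n * mean for n, mean, _, _ in subsets.values()) / total_samples
overall_min = min(lo for _, _, lo, _ in subsets.values())
overall_max = max(hi for _, _, _, hi in subsets.values())

print(total_samples)             # 5000
print(round(weighted_mean, 2))   # 276.24
print(overall_min, overall_max)  # 80 687
```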

Image Statistics:

| Metric | Value |
| --- | --- |
| Total Images | 5,000 |
| Images per Sample | min: 1, max: 1, mean: 1 |
| Resolution Range | 1023x139 - 1024x1024 |
| Formats | jpeg |

Sample Example#

Subset: descriptive

{
  "input": [
    {
      "id": "44ff5b8a",
      "content": [
        {
          "image": "[BASE64_IMAGE: jpeg, ~70.0KB]"
        },
        {
          "text": "For the current plot, what is the spatially highest labeled tick on the y-axis?\n* Your final answer should be the tick value on the y-axis that is explicitly written. Ignore units or scales that are written separately from the tick."
        }
      ]
    }
  ],
  "target": "60",
  "id": 0,
  "group_id": 0,
  "subset_key": "descriptive",
  "metadata": {
    "question_type": "descriptive",
    "question_id": 7,
    "category": "cs",
    "original_id": "2004.10956"
  }
}
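
A record like the one above can be unpacked with plain dictionary access. The sketch below uses a trimmed copy of the sample (question text shortened, some fields omitted) and a hypothetical helper, `split_content`, to separate the question text from the image parts; the helper is not part of evalscope.

```python
def split_content(sample: dict):
    """Return (question_texts, image_payloads) from a sample's first message."""
    parts = sample['input'][0]['content']
    texts = [p['text'] for p in parts if 'text' in p]
    images = [p['image'] for p in parts if 'image' in p]
    return texts, images


# Trimmed copy of the descriptive sample shown above.
sample = {
    'input': [{
        'id': '44ff5b8a',
        'content': [
            {'image': '[BASE64_IMAGE: jpeg, ~70.0KB]'},
            {'text': 'For the current plot, what is the spatially highest labeled tick on the y-axis?'},
        ],
    }],
    'target': '60',
    'subset_key': 'descriptive',
    'metadata': {'question_type': 'descriptive', 'question_id': 7, 'category': 'cs'},
}

texts, images = split_content(sample)
print(len(images), sample['target'], sample['metadata']['category'])  # 1 60 cs
```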

Prompt Template#

No prompt template defined.

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets charxiv \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['charxiv'],
    dataset_args={
        'charxiv': {
            # 'subset_list': ['descriptive', 'reasoning'],  # optional: evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)