Chinese-SimpleQA#

Overview#

Chinese SimpleQA is a Chinese question-answering dataset for evaluating how accurately language models answer short factual questions. It tests whether a model can understand a question posed in Chinese and produce the correct answer in Chinese, across six top-level knowledge domains.

Task Description#

Task Type: Chinese Factual Question Answering
Input: Simple factual question in Chinese
Output: Factual answer in Chinese
Language: Chinese

Key Features#

Diverse topics covering various knowledge domains
Simple factual questions testing world knowledge
Chinese language evaluation
LLM-as-judge evaluation for answer correctness
Multiple category subsets available

Evaluation Notes#

Default configuration uses 0-shot evaluation
Answers are graded by an LLM judge for factual accuracy; an exact match with the reference answer is not required
Metrics: `is_correct`, `is_incorrect`, `is_not_attempted`
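
One way to read the three metrics is as per-sample flags averaged over all evaluated samples. The following is a minimal sketch of that aggregation, assuming each judge verdict has already been reduced to exactly one of three labels. The label strings and the `aggregate` helper are hypothetical and are not part of EvalScope.

```python
from collections import Counter

# Hypothetical verdict labels; the strings actually emitted by the judge may differ.
METRICS = ("is_correct", "is_incorrect", "is_not_attempted")


def aggregate(verdicts):
    """Return the fraction of samples that fall under each metric."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    unknown = set(verdicts) - set(METRICS)
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    counts = Counter(verdicts)
    total = len(verdicts)
    # Counter returns 0 for labels that never occurred.
    return {name: counts[name] / total for name in METRICS}


scores = aggregate(["is_correct", "is_correct", "is_incorrect", "is_not_attempted"])
print(scores)
```

Under this assumption the three scores sum to 1, because every sample receives exactly one verdict.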

Properties#

Property	Value
Benchmark Name	`chinese_simpleqa`
Dataset ID	AI-ModelScope/Chinese-SimpleQA
Paper	N/A
Tags	`Chinese`, `Knowledge`, `QA`
Metrics	`is_correct`, `is_incorrect`, `is_not_attempted`
Default Shots	0-shot
Evaluation Split	`train`

Data Statistics#

Metric	Value
Total Samples	3,000
Prompt Length (Mean)	32.45 chars
Prompt Length (Min/Max)	16 / 129 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean	Prompt Min	Prompt Max
`中华文化`	326	32.09	18	86
`人文与社会科学`	609	33.94	18	87
`工程、技术与应用科学`	481	33.13	18	91
`生活、艺术与文化`	601	32.40	17	76
`社会`	453	32.33	18	129
`自然与自然科学`	530	30.49	16	83
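
As a quick consistency check, the per-subset rows can be recombined into the overall figures. This sketch copies the sample counts and mean prompt lengths from the table above and recomputes the total sample count and the sample-weighted mean prompt length. The variable names are illustrative only.

```python
# Per-subset (sample count, mean prompt length in chars), copied from the table above.
SUBSETS = {
    "中华文化": (326, 32.09),
    "人文与社会科学": (609, 33.94),
    "工程、技术与应用科学": (481, 33.13),
    "生活、艺术与文化": (601, 32.40),
    "社会": (453, 32.33),
    "自然与自然科学": (530, 30.49),
}

# Total number of samples across all subsets.
total = sum(count for count, _ in SUBSETS.values())

# Overall mean prompt length is the sample-weighted average of the subset means.
weighted_mean = sum(count * mean for count, mean in SUBSETS.values()) / total

print(total)
print(round(weighted_mean, 2))
```

The results agree with the overall statistics reported above: 3,000 samples and a mean prompt length of 32.45 characters.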

Sample Example#

Subset: 中华文化

{
  "input": [
    {
      "id": "b6b48177",
      "content": "请回答问题：\n\n伏兔穴所属的经脉是什么？"
    }
  ],
  "target": "足阳明胃经",
  "id": 0,
  "group_id": 0,
  "subset_key": "中华文化",
  "metadata": {
    "id": "97e7f58a3b154facaa3a5c64d678c7bf",
    "primary_category": "中华文化",
    "secondary_category": "中医"
  }
}

Prompt Template#

Prompt Template:

请回答问题：

{question}
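
The template can be filled with plain string formatting. This is a minimal sketch, assuming the template is the prefix line, one blank line, and the question, which matches the `content` field in the sample above. The `build_prompt` helper is hypothetical and is not an EvalScope function.

```python
# Template text as shown above: prefix line, one blank line, then the question.
PROMPT_TEMPLATE = "请回答问题：\n\n{question}"


def build_prompt(question: str) -> str:
    """Insert a question into the prompt template."""
    return PROMPT_TEMPLATE.format(question=question)


prompt = build_prompt("伏兔穴所属的经脉是什么？")
print(prompt)
```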

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets chinese_simpleqa \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['chinese_simpleqa'],
    dataset_args={
        'chinese_simpleqa': {
            # 'subset_list': ['中华文化', '人文与社会科学', '工程、技术与应用科学'],  # optional: evaluate only these subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
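
To evaluate only some subsets, the `subset_list` entry shown commented out in the configuration can be filled in. The sketch below builds such a `dataset_args` dictionary and checks the requested names against the six subsets listed under Per-Subset Statistics. The `make_dataset_args` helper is hypothetical and not part of EvalScope; it only constructs the dictionary and does not run an evaluation.

```python
# Subset names as listed under Per-Subset Statistics.
KNOWN_SUBSETS = [
    "中华文化",
    "人文与社会科学",
    "工程、技术与应用科学",
    "生活、艺术与文化",
    "社会",
    "自然与自然科学",
]


def make_dataset_args(subsets):
    """Build a dataset_args dict restricted to the given subsets."""
    unknown = [name for name in subsets if name not in KNOWN_SUBSETS]
    if unknown:
        raise ValueError(f"unknown subsets: {unknown}")
    return {"chinese_simpleqa": {"subset_list": list(subsets)}}


dataset_args = make_dataset_args(["中华文化", "社会"])
print(dataset_args)
```

The resulting dictionary can then be passed as the `dataset_args` argument of `TaskConfig` in place of the empty entry used in the example.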