PolyMath#

Overview#

PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 difficulty levels with 9,000 high-quality problem samples. It ensures difficulty comprehensiveness, language diversity, and high-quality translation for discriminative multilingual evaluation.

Task Description#

Task Type: Multilingual Mathematical Reasoning
Input: Math problem in one of 18 languages
Output: Numerical answer in \boxed{} format
Domains: Mathematics across multiple difficulty levels and languages

Key Features#

18 supported languages: en, zh, ar, bn, de, es, fr, id, it, ja, ko, ms, pt, ru, sw, te, th, vi
4 difficulty levels: low, medium, high, top
9,000 high-quality problems total
Language-specific instructions for each problem
High-quality human translations ensuring accuracy

Evaluation Notes#

Default evaluation uses the test split
Primary metric: Accuracy with numeric comparison
Additional metric: DW-ACC (Difficulty-Weighted Accuracy)
- Weights: low=1, medium=2, high=4, top=8
- Provides balanced scoring across difficulty levels
Results reported per language and overall

Properties#

Property	Value
Benchmark Name	`poly_math`
Dataset ID	evalscope/PolyMath
Paper	N/A
Tags	`Math`, `MultiLingual`, `Reasoning`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics#

Metric	Value
Total Samples	9,000
Prompt Length (Mean)	342.15 chars
Prompt Length (Min/Max)	52 / 1536 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean	Prompt Min	Prompt Max
`en-low`	125	292	142	600
`zh-low`	125	111.96	63	206
`ar-low`	125	259.15	138	536
`bn-low`	125	304.42	160	650
`de-low`	125	333.42	165	698
`es-low`	125	315.7	159	643
`fr-low`	125	331.06	178	634
`id-low`	125	332.51	175	691
`it-low`	125	315.19	164	661
`ja-low`	125	145.06	82	268
`ko-low`	125	163.17	89	342
`ms-low`	125	330.82	165	603
`pt-low`	125	306.37	160	655
`ru-low`	125	312.67	161	628
`sw-low`	125	324.54	169	638
`te-low`	125	311.38	161	575
`th-low`	125	256.28	124	519
`vi-low`	125	302.78	159	583
`en-medium`	125	304.88	107	823
`zh-medium`	125	182.79	52	503
`ar-medium`	125	282.52	98	794
`bn-medium`	125	323.46	110	761
`de-medium`	125	338.46	113	941
`es-medium`	125	322.59	120	785
`fr-medium`	125	330.45	116	766
`id-medium`	125	328.14	114	852
`it-medium`	125	315.01	110	772
`ja-medium`	125	210.79	68	548
`ko-medium`	125	219.33	64	547
`ms-medium`	125	314.84	95	829
`pt-medium`	125	314	111	767
`ru-medium`	125	334.75	120	828
`sw-medium`	125	335	110	899
`te-medium`	125	316.54	102	867
`th-medium`	125	276.01	84	658
`vi-medium`	125	307.78	108	820
`en-high`	125	391.3	120	1434
`zh-high`	125	212.87	70	1155
`ar-high`	125	356.49	115	1313
`bn-high`	125	414.23	132	1464
`de-high`	125	440.82	138	1483
`es-high`	125	422.2	134	1469
`fr-high`	125	428.81	133	1488
`id-high`	125	437.18	128	1536
`it-high`	125	408.41	128	1445
`ja-high`	125	246.59	84	1206
`ko-high`	125	261.16	98	1195
`ms-high`	125	412.78	55	1454
`pt-high`	125	408.39	127	1414
`ru-high`	125	426.44	144	1476
`sw-high`	125	438.1	125	1476
`te-high`	125	405.18	126	1430
`th-high`	125	351.18	108	1345
`vi-high`	125	383.09	124	1442
`en-top`	125	420.59	141	1346
`zh-top`	125	220.16	73	876
`ar-top`	125	378.14	136	1238
`bn-top`	125	443.98	160	1392
`de-top`	125	470.34	169	1432
`es-top`	125	456.15	150	1432
`fr-top`	125	464.7	153	1457
`id-top`	125	469.23	151	1478
`it-top`	125	445.74	146	1400
`ja-top`	125	259.17	85	925
`ko-top`	125	277.8	89	968
`ms-top`	125	458.26	144	1521
`pt-top`	125	444.11	144	1407
`ru-top`	125	466.7	159	1440
`sw-top`	125	469.38	147	1452
`te-top`	125	431.14	147	1323
`th-top`	125	384.58	137	1154
`vi-top`	125	423.55	154	1352

Sample Example#

Subset: en-low

{
  "input": [
    {
      "id": "8ac6f5ab",
      "content": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\nNote: Please put the final answer in the $\\boxed\\{\\}$."
    }
  ],
  "target": "18",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "level": "low",
    "language": "en",
    "index": "0"
  }
}

Prompt Template#

Prompt Template:

{question}

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets poly_math \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['poly_math'],
    dataset_args={
        'poly_math': {
            # subset_list: ['en-low', 'zh-low', 'ar-low']  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)