MGSM#

Overview#

MGSM (Multilingual Grade School Math) is a benchmark designed to evaluate multilingual mathematical reasoning capabilities of language models. It extends GSM8K to 11 typologically diverse languages, testing whether models can perform chain-of-thought reasoning across different languages.

Task Description#

Task Type: Multilingual Mathematical Word Problem Solving
Input: Grade school math word problem in one of 11 languages
Output: Step-by-step reasoning with numerical answer
Languages: English, Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu

Key Features#

250 problems per language (translated from GSM8K)
11 typologically diverse languages covering different language families
Tests multilingual chain-of-thought reasoning capabilities
Same problem content across languages for cross-lingual comparison
Designed to evaluate language-agnostic mathematical reasoning

Evaluation Notes#

Default configuration uses 4-shot examples
Answers should be formatted within \boxed{} for proper extraction
Use subset_list to evaluate specific languages (e.g., ['en', 'zh', 'ja'])
Cross-lingual performance comparison supported
Few-shot examples are drawn from the train split in the same language

Properties#

Property	Value
Benchmark Name	`mgsm`
Dataset ID	evalscope/mgsm
Paper	N/A
Tags	`Math`, `MultiLingual`, `Reasoning`
Metrics	`acc`
Default Shots	4-shot
Evaluation Split	`test`
Train Split	`train`

Data Statistics#

Metric	Value
Total Samples	2,750
Prompt Length (Mean)	1742.98 chars
Prompt Length (Min/Max)	791 / 2464 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean	Prompt Min	Prompt Max
`en`	250	1790.71	1637	2165
`es`	250	1940.02	1773	2371
`fr`	250	2047	1878	2440
`de`	250	1963.9	1792	2386
`ru`	250	1831.66	1667	2214
`zh`	250	842.16	791	946
`ja`	250	1102.33	1035	1248
`th`	250	1835.53	1699	2135
`sw`	250	1953.48	1780	2354
`bn`	250	1759.28	1601	2106
`te`	250	2106.77	1939	2464

Sample Example#

Subset: en

{
  "input": [
    {
      "id": "d67cb3cf",
      "content": "Here are some examples of how to solve similar problems:\n\nQuestion: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?\n\nReasoning:\nStep-by-Step Answer: Roger sta ... [TRUNCATED] ...  every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\nPlease reason step by step, and put your final answer within \\boxed{}.\n\n"
    }
  ],
  "target": "18",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "reasoning": null,
    "equation_solution": null
  }
}

Note: Some content was truncated for display.

Prompt Template#

Prompt Template:

{question}
Please reason step by step, and put your final answer within \boxed{{}}.

Few-shot Template

Here are some examples of how to solve similar problems:

{fewshot}

{question}
Please reason step by step, and put your final answer within \boxed{{}}.

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets mgsm \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mgsm'],
    dataset_args={
        'mgsm': {
            # subset_list: ['en', 'es', 'fr']  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)