MGSM#
Overview#
MGSM (Multilingual Grade School Math) is a benchmark designed to evaluate multilingual mathematical reasoning capabilities of language models. It extends GSM8K to 11 typologically diverse languages, testing whether models can perform chain-of-thought reasoning across different languages.
Task Description#
Task Type: Multilingual Mathematical Word Problem Solving
Input: Grade school math word problem in one of 11 languages
Output: Step-by-step reasoning with numerical answer
Languages: English, Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu
Key Features#
250 problems per language (translated from GSM8K)
11 typologically diverse languages covering different language families
Tests multilingual chain-of-thought reasoning capabilities
Same problem content across languages for cross-lingual comparison
Designed to evaluate language-agnostic mathematical reasoning
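Because every language contains the same problems, results can be compared both per language and per problem. The following is a minimal sketch of that comparison, not part of EvalScope: the record layout `(language, problem_id, correct)` and the sample values are hypothetical and only illustrate the aggregation.

```python
from collections import defaultdict

# Hypothetical per-sample results: (language, problem_id, correct).
# The values are made up for illustration and are not real benchmark results.
records = [
    ("en", 0, True), ("en", 1, True),
    ("zh", 0, True), ("zh", 1, False),
    ("ja", 0, True), ("ja", 1, False),
]


def accuracy_by_language(rows):
    """Return the fraction of correctly solved problems for each language."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, _problem_id, ok in rows:
        total[lang] += 1
        correct[lang] += int(ok)
    return {lang: correct[lang] / total[lang] for lang in total}


def solved_in_every_language(rows):
    """Return the ids of problems answered correctly in all languages seen."""
    outcomes = defaultdict(list)
    for _lang, problem_id, ok in rows:
        outcomes[problem_id].append(ok)
    return {pid for pid, oks in outcomes.items() if all(oks)}


per_language = accuracy_by_language(records)
consistent = solved_in_every_language(records)
```

A large gap between per-language accuracies on identical problems suggests that language understanding, rather than the arithmetic itself, is the limiting factor.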
Evaluation Notes#
Default configuration uses 4-shot examples
Answers should be formatted within \boxed{} for proper extraction
Use subset_list to evaluate specific languages (e.g., ['en', 'zh', 'ja'])
Cross-lingual performance comparison is supported
Few-shot examples are drawn from the train split in the same language
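To illustrate why the `\boxed{}` format matters, here is a minimal sketch of how a boxed answer could be pulled out of a model response and compared with the target. It is an assumption about the general approach, not EvalScope's actual extraction logic; the helper names `extract_boxed_answer` and `is_correct` are hypothetical.

```python
import re
from typing import Optional


def extract_boxed_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a response, or None."""
    # Handles non-nested braces only, which is enough for plain numeric answers.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return None
    # Drop thousands separators, dollar signs and surrounding whitespace.
    return matches[-1].replace(",", "").replace("$", "").strip()


def is_correct(response: str, target: str) -> bool:
    """Compare the extracted answer with the target, numerically if possible."""
    answer = extract_boxed_answer(response)
    if answer is None:
        return False
    try:
        return float(answer) == float(target)
    except ValueError:
        # Fall back to exact string comparison for non-numeric content.
        return answer == target.strip()
```

A response without a boxed answer is simply scored as incorrect in this sketch, which is why the prompt asks the model to box its final result.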
Properties#
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 4-shot |
| Evaluation Split | |
| Train Split | |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 2,750 |
| Prompt Length (Mean) | 1742.98 chars |
| Prompt Length (Min/Max) | 791 / 2464 chars |
Per-Subset Statistics:
| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|---|---|---|---|---|
| | 250 | 1790.71 | 1637 | 2165 |
| | 250 | 1940.02 | 1773 | 2371 |
| | 250 | 2047 | 1878 | 2440 |
| | 250 | 1963.9 | 1792 | 2386 |
| | 250 | 1831.66 | 1667 | 2214 |
| | 250 | 842.16 | 791 | 946 |
| | 250 | 1102.33 | 1035 | 1248 |
| | 250 | 1835.53 | 1699 | 2135 |
| | 250 | 1953.48 | 1780 | 2354 |
| | 250 | 1759.28 | 1601 | 2106 |
| | 250 | 2106.77 | 1939 | 2464 |
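The overall figures follow directly from the per-subset ones. As a quick consistency check, the sketch below recomputes the totals from the per-subset values listed above, in the order given. Since the subset means are already rounded, the recomputed overall mean is only expected to agree with the reported 1742.98 to within about 0.01.

```python
# Per-subset values copied from the statistics above, in the order listed.
subset_means = [1790.71, 1940.02, 2047, 1963.9, 1831.66, 842.16,
                1102.33, 1835.53, 1953.48, 1759.28, 2106.77]
subset_mins = [1637, 1773, 1878, 1792, 1667, 791, 1035, 1699, 1780, 1601, 1939]
subset_maxs = [2165, 2371, 2440, 2386, 2214, 946, 1248, 2135, 2354, 2106, 2464]
samples_per_subset = 250

# Every subset has the same size, so the sample-weighted mean
# reduces to the plain average of the subset means.
total_samples = samples_per_subset * len(subset_means)
overall_mean = sum(m * samples_per_subset for m in subset_means) / total_samples
overall_min = min(subset_mins)
overall_max = max(subset_maxs)

print(total_samples, round(overall_mean, 2), overall_min, overall_max)
```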
Sample Example#
Subset: en
{
  "input": [
    {
      "id": "d67cb3cf",
      "content": "Here are some examples of how to solve similar problems:\n\nQuestion: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?\n\nReasoning:\nStep-by-Step Answer: Roger sta ... [TRUNCATED] ... every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\nPlease reason step by step, and put your final answer within \\boxed{}.\n\n"
    }
  ],
  "target": "18",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "reasoning": null,
    "equation_solution": null
  }
}
Note: Some content was truncated for display.
Prompt Template#
Prompt Template:
{question}
Please reason step by step, and put your final answer within \boxed{{}}.
Few-shot Template
Here are some examples of how to solve similar problems:
{fewshot}
{question}
Please reason step by step, and put your final answer within \boxed{{}}.
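The doubled braces in `\boxed{{}}` suggest that the templates are filled with Python's `str.format`, where `{{}}` is the escape for a literal `{}`. The sketch below assumes exactly that; the constant names, the `build_prompt` helper, and the single-newline layout between parts are hypothetical simplifications, and the real prompts may use different whitespace.

```python
# Templates transcribed from the text above; line breaks are an assumption.
PROMPT_TEMPLATE = (
    "{question}\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}."
)
FEWSHOT_TEMPLATE = (
    "Here are some examples of how to solve similar problems:\n"
    "{fewshot}\n"
    "{question}\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}."
)


def build_prompt(question, fewshot=None):
    """Fill the zero-shot or few-shot template, depending on `fewshot`."""
    if fewshot:
        return FEWSHOT_TEMPLATE.format(fewshot=fewshot, question=question)
    return PROMPT_TEMPLATE.format(question=question)


zero_shot = build_prompt("What is 2 + 3?")
few_shot = build_prompt("What is 2 + 3?", fewshot="Question: ...\nAnswer: ...")
```

After formatting, the model sees a single pair of braces (`\boxed{}`), which matches the tail of the sample prompt shown earlier.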
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets mgsm \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mgsm'],
    dataset_args={
        'mgsm': {
            # 'subset_list': ['en', 'es', 'fr'],  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)