WMT2024++#
Overview#
WMT2024++ is a comprehensive machine translation benchmark based on the WMT 2024 news translation task. It supports 55 language pairs with English as the source language, enabling evaluation of translation quality across diverse target languages.
Task Description#
Task Type: Machine Translation
Input: English source text wrapped in a translation prompt
Output: Translated text in the target language
Language Pairs: 55 pairs (English to 55 target languages and regional variants)
Key Features#
Extensive multilingual coverage (55 target languages and regional variants)
News domain text for real-world applicability
Multiple evaluation metrics (BLEU, BERTScore, COMET)
Standardized prompt template for consistent evaluation
Supports batch scoring for efficiency
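To make the BLEU entry in the list above concrete, here is a minimal, self-contained sketch of corpus-level BLEU: clipped n-gram precision up to 4-grams, combined with a brevity penalty. It assumes plain whitespace tokenization, one reference per hypothesis, and no smoothing, so its numbers will not match the benchmark's own scorer. The function names are illustrative and are not part of evalscope.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus BLEU (0-100) with one reference per hypothesis.

    Assumption: whitespace tokenization; real scorers tokenize per language.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_ngrams, ref_ngrams = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Counter intersection clips each count by its count in the reference.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

An exact match scores 100.0, and a hypothesis that shares no 4-gram with its reference scores 0.0 because no smoothing is applied.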
Evaluation Notes#
Default configuration uses 0-shot evaluation
Metrics: BLEU, BERTScore (XLM-RoBERTa), COMET (wmt22-comet-da)
Evaluates on test split
Language-specific normalization applied
COMET metric requires the unbabel-comet package
Subsets represent individual language pairs (e.g., en-zh_cn, en-de_de)
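Subset names appear to follow a `<source>-<target>` pattern, where the target combines a language code and a region code (`zh_cn`, `de_de`). The sketch below splits such a name into its parts. It assumes every subset follows this pattern, and `parse_subset` is a hypothetical helper, not an evalscope function.

```python
from typing import NamedTuple


class LanguagePair(NamedTuple):
    source: str    # source language code, e.g. "en"
    language: str  # target language code, e.g. "zh"
    region: str    # target region code, e.g. "cn"


def parse_subset(name: str) -> LanguagePair:
    """Split a subset name such as 'en-zh_cn' into its parts."""
    source, dash, target = name.partition("-")
    language, underscore, region = target.partition("_")
    if not (source and dash and language and underscore and region):
        raise ValueError(f"unexpected subset name: {name!r}")
    return LanguagePair(source, language, region)


print(parse_subset("en-zh_cn"))
```

Keeping the region separate makes it easy to group variants of one language, such as `en-ar_eg` and `en-ar_sa`.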
Properties#
| Property | Value |
|---|---|
| Benchmark Name | wmt24pp |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | BLEU, BERTScore, COMET |
| Default Shots | 0-shot |
| Evaluation Split | test |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 52,800 |
| Prompt Length (Mean) | 265.45 chars |
| Prompt Length (Min/Max) | 71 / 1047 chars |
Per-Subset Statistics:
| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|---|---|---|---|---|
| | 960 | 263.26 | 75 | 1039 |
| | 960 | 263.26 | 75 | 1039 |
| | 960 | 269.26 | 81 | 1045 |
| | 960 | 265.26 | 77 | 1041 |
| | 960 | 265.26 | 77 | 1041 |
| | 960 | 261.26 | 73 | 1037 |
| | 960 | 263.26 | 75 | 1039 |
| | 960 | 263.26 | 75 | 1039 |
| | 960 | 261.26 | 73 | 1037 |
| | 960 | 265.26 | 77 | 1041 |
| | 960 | 267.26 | 79 | 1043 |
| | 960 | 261.26 | 73 | 1037 |
| | 960 | 265.26 | 77 | 1041 |
| | 960 | 267.26 | 79 | 1043 |
| | 960 | 263.26 | 75 | 1039 |
| | 960 | 263.26 | 75 | 1039 |
| | 960 | 267.26 | 79 | 1043 |
| | 960 | 263.26 | 75 | 1039 |
| | 960 | 261.26 | 73 | 1037 |
| | 960 | 267.26 | 79 | 1043 |
| | 960 | 269.26 | 81 | 1045 |
| | 960 | 271.26 | 83 | 1047 |
| | 960 | 269.26 | 81 | 1045 |
| | 960 | 265.26 | 77 | 1041 |
| | 960 | 267.26 | 79 | 1043 |
| | 960 | 265.26 | 77 | 1041 |
| | 960 | 263.26 | 75 | 1039 |
| | 960 | 271.26 | 83 | 1047 |
| | 960 | 265.26 | 77 | 1041 |
| | 960 | 269.26 | 81 | 1045 |
| | 960 | 265.26 | 77 | 1041 |
| | 960 | 261.26 | 73 | 1037 |
| | 960 | 269.26 | 81 | 1045 |
| | 960 | 265.26 | 77 | 1041 |
| | 960 | 263.26 | 75 | 1039 |
| | 960 | 271.26 | 83 | 1047 |
| | 960 | 271.26 | 83 | 1047 |
| | 960 | 267.26 | 79 | 1043 |
| | 960 | 265.26 | 77 | 1041 |
| | 960 | 263.26 | 75 | 1039 |
| | 960 | 269.26 | 81 | 1045 |
| | 960 | 265.26 | 77 | 1041 |
| | 960 | 265.26 | 77 | 1041 |
| | 960 | 265.26 | 77 | 1041 |
| | 960 | 265.26 | 77 | 1041 |
| | 960 | 261.26 | 73 | 1037 |
| | 960 | 263.26 | 75 | 1039 |
| | 960 | 259.26 | 71 | 1035 |
| | 960 | 265.26 | 77 | 1041 |
| | 960 | 269.26 | 81 | 1045 |
| | 960 | 259.26 | 71 | 1035 |
| | 960 | 271.26 | 83 | 1047 |
| | 960 | 267.26 | 79 | 1043 |
| | 960 | 267.26 | 79 | 1043 |
| | 960 | 259.26 | 71 | 1035 |
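The aggregate figures can be cross-checked against the per-subset rows. The sketch below hard-codes values read from the statistics above: 55 rows of 960 samples each, plus the distinct minimum and maximum prompt lengths that occur across the rows. The variable names are illustrative.

```python
# Values copied from the statistics tables above.
samples_per_subset = 960
num_subsets = 55  # number of rows in the per-subset table

# Distinct "Prompt Min" and "Prompt Max" values that occur across the rows.
subset_min_lengths = [71, 73, 75, 77, 79, 81, 83]
subset_max_lengths = [1035, 1037, 1039, 1041, 1043, 1045, 1047]

total_samples = samples_per_subset * num_subsets
overall_min = min(subset_min_lengths)
overall_max = max(subset_max_lengths)

print(total_samples)             # 52800
print(overall_min, overall_max)  # 71 1047
```

The per-subset lengths differ in steps of 2 characters, which is consistent with the target language name appearing twice in each prompt.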
Sample Example#
Subset: en-ar_eg
{
  "input": [
    {
      "id": "557f3aa1",
      "content": [
        {
          "text": "Translate the following english sentence into arabic:\n\nenglish: Siso's depictions of land, water center new gallery exhibition\narabic:"
        }
      ]
    }
  ],
  "target": "رسومات سيسو عن الأرض والمية في معرضه الجديد",
  "id": 0,
  "group_id": 0,
  "subset_key": "en-ar_eg",
  "metadata": {
    "source_text": "Siso's depictions of land, water center new gallery exhibition",
    "target_text": "رسومات سيسو عن الأرض والمية في معرضه الجديد",
    "source_language": "en",
    "target_language": "ar_eg"
  }
}
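A short sketch of reading such a record: it pulls out the prompt text, the reference translation, and the language pair. The record is trimmed to the fields the code touches, and the accessor paths are inferred from the sample above rather than from a documented schema.

```python
import json

# Trimmed copy of the sample record above.
record_json = """
{
  "input": [{"id": "557f3aa1", "content": [{"text": "Translate the following english sentence into arabic:\\n\\nenglish: Siso's depictions of land, water center new gallery exhibition\\narabic:"}]}],
  "target": "رسومات سيسو عن الأرض والمية في معرضه الجديد",
  "subset_key": "en-ar_eg",
  "metadata": {"source_language": "en", "target_language": "ar_eg"}
}
"""

record = json.loads(record_json)
prompt = record["input"][0]["content"][0]["text"]  # text sent to the model
reference = record["target"]                        # reference translation
pair = (
    record["metadata"]["source_language"],
    record["metadata"]["target_language"],
)

print(pair)                      # ('en', 'ar_eg')
print(prompt.splitlines()[-1])   # arabic:
```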
Prompt Template#
Prompt Template:
Translate the following {source_language} sentence into {target_language}:

{source_language}: {source_text}
{target_language}:
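Filling the template is plain string substitution. The sketch below assumes, based on the sample example earlier on this page, that the language fields receive lowercase language names (`english`, `arabic`) and that a blank line follows the instruction line. `build_prompt` is a hypothetical helper, not evalscope's implementation.

```python
PROMPT_TEMPLATE = (
    "Translate the following {source_language} sentence into {target_language}:\n\n"
    "{source_language}: {source_text}\n"
    "{target_language}:"
)


def build_prompt(source_text: str, source_language: str, target_language: str) -> str:
    """Fill the translation prompt template for one source sentence."""
    return PROMPT_TEMPLATE.format(
        source_language=source_language,
        target_language=target_language,
        source_text=source_text,
    )


prompt = build_prompt(
    "Siso's depictions of land, water center new gallery exhibition",
    source_language="english",
    target_language="arabic",
)
print(prompt)
```

The prompt ends with the bare target-language label, so the model's completion is expected to be the translation itself.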
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets wmt24pp \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['wmt24pp'],
    dataset_args={
        'wmt24pp': {
            # Optional: uncomment to evaluate only specific subsets
            # 'subset_list': ['en-ar_eg', 'en-ar_sa', 'en-bg_bg'],
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)