MMMLU#
Overview#
MMMLU (Multilingual Massive Multitask Language Understanding) is a multilingual extension of the MMLU benchmark. It evaluates the multilingual knowledge and reasoning capabilities of language models across 14 languages, covering 57 subjects from the original MMLU benchmark.
Task Description#
Task Type: Multilingual Multiple-Choice Question Answering
Input: Question with four answer choices (A, B, C, D) in one of 14 languages
Output: Single correct answer letter
Languages: Arabic, Bengali, German, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Swahili, Yoruba, Chinese
Subjects: 57 subjects from MMLU (STEM, Humanities, Social Sciences, Other)
Key Features#
Multilingual translation of the full MMLU benchmark
14 typologically diverse languages covering major language families
Tests cross-lingual knowledge transfer and multilingual reasoning
Same subject coverage as original MMLU (57 subjects)
Includes low-resource languages (e.g., Swahili, Yoruba)
Evaluation Notes#
Default configuration uses 0-shot evaluation (test split only)
Use subset_list to evaluate specific languages (e.g., ['ZH_CN', 'JA_JP', 'FR_FR'])
Results are grouped by language subset
Cross-lingual performance comparison supported
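The per-language grouping described in these notes can be sketched as follows. This is a minimal illustration and not evalscope's own implementation; the record keys `language`, `target`, and `prediction` are assumptions made for the example.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: accuracy} for an iterable of scored records.

    Each record is assumed (hypothetically) to be a dict with the keys
    'language', 'target' and 'prediction'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        lang = rec["language"]
        total[lang] += 1
        if rec["prediction"] == rec["target"]:
            correct[lang] += 1
    # Accuracy is computed independently for each language subset.
    return {lang: correct[lang] / total[lang] for lang in total}


records = [
    {"language": "ZH_CN", "target": "B", "prediction": "B"},
    {"language": "ZH_CN", "target": "A", "prediction": "C"},
    {"language": "FR_FR", "target": "D", "prediction": "D"},
]
print(accuracy_by_language(records))
```

Keeping one score per language, rather than a single pooled number, is what makes the cross-lingual comparison possible.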
Properties#
| Property | Value |
|---|---|
| Benchmark Name | |
| Dataset ID | |
| Paper | N/A |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 196,588 |
| Prompt Length (Mean) | 624.75 chars |
| Prompt Length (Min/Max) | 136 / 5975 chars |
Per-Subset Statistics:
| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|---|---|---|---|---|
| | 14,042 | 584.94 | 231 | 4735 |
| | 14,042 | 654.99 | 247 | 4914 |
| | 14,042 | 791.64 | 294 | 5657 |
| | 14,042 | 753.18 | 271 | 5791 |
| | 14,042 | 777.82 | 278 | 5952 |
| | 14,042 | 675.02 | 256 | 5379 |
| | 14,042 | 726.51 | 270 | 5539 |
| | 14,042 | 761.19 | 277 | 5975 |
| | 14,042 | 322.79 | 149 | 2064 |
| | 14,042 | 354.35 | 153 | 2345 |
| | 14,042 | 706.79 | 258 | 5635 |
| | 14,042 | 699.08 | 259 | 5566 |
| | 14,042 | 681.01 | 248 | 5644 |
| | 14,042 | 257.15 | 136 | 1495 |
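As a quick consistency check, the aggregate figures can be recomputed from the per-subset rows. The sketch below uses only the numbers listed in the tables; the variable names are illustrative.

```python
# Per-subset values copied from the table above, in row order.
subset_sizes = [14_042] * 14
subset_means = [
    584.94, 654.99, 791.64, 753.18, 777.82, 675.02, 726.51,
    761.19, 322.79, 354.35, 706.79, 699.08, 681.01, 257.15,
]
subset_mins = [231, 247, 294, 271, 278, 256, 270, 277, 149, 153, 258, 259, 248, 136]
subset_maxs = [4735, 4914, 5657, 5791, 5952, 5379, 5539, 5975, 2064, 2345, 5635, 5566, 5644, 1495]

total = sum(subset_sizes)
# Size-weighted mean of the per-subset mean prompt lengths.
overall_mean = sum(m * n for m, n in zip(subset_means, subset_sizes)) / total

print(total)                    # 196588
print(round(overall_mean, 2))   # 624.75
print(min(subset_mins), max(subset_maxs))  # 136 5975
```

Because every subset has the same number of samples, the overall mean equals the plain average of the 14 subset means.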
Sample Example#
Subset: AR_XY
{
  "input": [
    {
      "id": "e43faf14",
      "content": "أجب على سؤال الاختيار من متعدد التالي. يجب أن يكون السطر الأخير من إجابتك بالتنسيق التالي: 'ANSWER: [LETTER]' (بدون علامات اقتباس) حيث [LETTER] هو أحد الحروف A,B,C,D. فكّر خطوة بخطوة قبل الإجابة.\n\nأوجد درجة امتداد الحقل المحدد Q(sqrt(2)، sqrt(3)، sqrt(18)) على Q.\n\nA) 0\nB) 4\nC) 2\nD) 6"
    }
  ],
  "choices": [
    "0",
    "4",
    "2",
    "6"
  ],
  "target": "B",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "subject": "abstract_algebra",
    "language": "AR_XY"
  }
}
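The `target` field holds a letter, while `choices` holds the answer texts in order. A minimal sketch of how the two relate, assuming the letter indexes into `choices` starting from `A` (as the sample suggests); the helper name `target_text` is hypothetical:

```python
def target_text(sample):
    """Return the text of the correct choice for a sample record."""
    index = ord(sample["target"]) - ord("A")  # 'A' -> 0, 'B' -> 1, ...
    return sample["choices"][index]


# Reduced copy of the record above (the prompt text is omitted).
sample = {
    "choices": ["0", "4", "2", "6"],
    "target": "B",
    "metadata": {"subject": "abstract_algebra", "language": "AR_XY"},
}
print(target_text(sample))  # → 4
```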
Prompt Template#
Prompt Template:
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.
{question}
{choices}
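A rough sketch of how the template might be filled in and how the final `ANSWER:` line might be parsed. The choice layout (`A) ...`), the comma-separated letter list, and the blank lines between sections follow the sample record; the function names and the regular expression are assumptions for illustration and may differ from what evalscope does internally.

```python
import re

TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'ANSWER: [LETTER]' "
    "(without quotes) where [LETTER] is one of {letters}. "
    "Think step by step before answering.\n\n{question}\n\n{choices}"
)


def build_prompt(question, choices):
    """Fill the template with a question and its answer choices."""
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    choice_lines = "\n".join(f"{l}) {c}" for l, c in zip(letters, choices))
    return TEMPLATE.format(
        letters=",".join(letters), question=question, choices=choice_lines
    )


def extract_answer(response):
    """Return the last 'ANSWER: X' letter in a response, or None if absent."""
    matches = re.findall(r"ANSWER:\s*([A-D])\b", response)
    return matches[-1] if matches else None


prompt = build_prompt("2 + 2 = ?", ["3", "4", "5", "6"])
print(prompt)
print(extract_answer("2 + 2 is 4, which is option B.\nANSWER: B"))
```

Taking the last match rather than the first tolerates responses that mention the answer format while reasoning step by step.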
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets mmmlu \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mmmlu'],
    dataset_args={
        'mmmlu': {
            # Optional: uncomment to evaluate specific subsets only
            # 'subset_list': ['AR_XY', 'BN_BD', 'DE_DE'],
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)