MMMLU

Overview

MMMLU (Multilingual Massive Multitask Language Understanding) is a multilingual extension of the MMLU benchmark. It evaluates the multilingual knowledge and reasoning capabilities of language models across 14 languages, covering 57 subjects from the original MMLU benchmark.

Task Description

Task Type: Multilingual Multiple-Choice Question Answering
Input: Question with four answer choices (A, B, C, D) in one of 14 languages
Output: Single correct answer letter
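Languages: Arabic (`AR_XY`), Bengali (`BN_BD`), German (`DE_DE`), Spanish (`ES_LA`), French (`FR_FR`), Hindi (`HI_IN`), Indonesian (`ID_ID`), Italian (`IT_IT`), Japanese (`JA_JP`), Korean (`KO_KR`), Portuguese (`PT_BR`), Swahili (`SW_KE`), Yoruba (`YO_NG`), Chinese (`ZH_CN`)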
Subjects: 57 subjects from MMLU (STEM, Humanities, Social Sciences, Other)

Key Features

Translation of the MMLU test set (14,042 questions) into each of the 14 languages
14 typologically diverse languages covering major language families
Tests cross-lingual knowledge transfer and multilingual reasoning
Same subject coverage as original MMLU (57 subjects)
Includes low-resource languages (e.g., Swahili, Yoruba)

Evaluation Notes

Default configuration uses 0-shot evaluation (test split only)
Use `subset_list` to evaluate specific languages (e.g., `['ZH_CN', 'JA_JP', 'FR_FR']`)
Results are grouped by language subset
Per-language scores can be compared directly, since every subset contains the same number of questions (14,042)
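
Restricting a run to a few languages comes down to passing the right subset codes. The sketch below builds a `dataset_args` dictionary with a `subset_list` entry and rejects codes that do not appear in the per-subset statistics on this page. `build_dataset_args` and `MMMLU_SUBSETS` are hypothetical names introduced for illustration; they are not part of evalscope.

```python
# The 14 subset codes listed in the per-subset statistics on this page.
MMMLU_SUBSETS = [
    "AR_XY", "BN_BD", "DE_DE", "ES_LA", "FR_FR", "HI_IN", "ID_ID",
    "IT_IT", "JA_JP", "KO_KR", "PT_BR", "SW_KE", "YO_NG", "ZH_CN",
]


def build_dataset_args(subsets):
    """Hypothetical helper: build dataset_args limited to the given subsets."""
    unknown = [code for code in subsets if code not in MMMLU_SUBSETS]
    if unknown:
        raise ValueError(f"Unknown MMMLU subsets: {unknown}")
    return {"mmmlu": {"subset_list": list(subsets)}}


dataset_args = build_dataset_args(["ZH_CN", "JA_JP", "FR_FR"])
print(dataset_args)
```

The resulting dictionary has the same shape as the `dataset_args` argument shown in the Python usage section, so a typo in a language code is caught before any evaluation starts.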

Properties

Property	Value
Benchmark Name	`mmmlu`
Dataset ID	openai-mirror/MMMLU
Paper	N/A
Tags	`Knowledge`, `MCQ`, `MultiLingual`
Metrics	`acc`
Default Shots	0-shot
Evaluation Split	`test`
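
The `acc` metric is the fraction of questions answered correctly, and results are reported per language subset. As a minimal sketch of that grouping (not evalscope's own implementation), the hypothetical helper `accuracy_by_subset` below aggregates made-up `(subset, is_correct)` records into one accuracy value per subset.

```python
from collections import defaultdict


def accuracy_by_subset(records):
    """Map each subset code to its accuracy, given (subset, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subset, is_correct in records:
        totals[subset] += 1
        hits[subset] += int(is_correct)
    return {subset: hits[subset] / totals[subset] for subset in totals}


# Made-up outcomes, for illustration only.
records = [
    ("FR_FR", True),
    ("FR_FR", False),
    ("ZH_CN", True),
    ("ZH_CN", True),
]
per_subset = accuracy_by_subset(records)
print(per_subset)
```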

Data Statistics

Metric	Value
Total Samples	196,588
Prompt Length (Mean)	624.75 chars
Prompt Length (Min/Max)	136 / 5975 chars

Per-Subset Statistics:

Subset	Samples	Prompt Mean (chars)	Prompt Min (chars)	Prompt Max (chars)
`AR_XY`	14,042	584.94	231	4735
`BN_BD`	14,042	654.99	247	4914
`DE_DE`	14,042	791.64	294	5657
`ES_LA`	14,042	753.18	271	5791
`FR_FR`	14,042	777.82	278	5952
`HI_IN`	14,042	675.02	256	5379
`ID_ID`	14,042	726.51	270	5539
`IT_IT`	14,042	761.19	277	5975
`JA_JP`	14,042	322.79	149	2064
`KO_KR`	14,042	354.35	153	2345
`PT_BR`	14,042	706.79	258	5635
`SW_KE`	14,042	699.08	259	5566
`YO_NG`	14,042	681.01	248	5644
`ZH_CN`	14,042	257.15	136	1495
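
The overall figures can be reproduced from the per-subset rows. The short check below copies the sample counts and mean prompt lengths from the table, then recomputes the total sample count and the sample-weighted mean prompt length. The variable names are illustrative.

```python
# (samples, mean prompt length in chars) per subset, copied from the table.
subset_stats = {
    "AR_XY": (14042, 584.94),
    "BN_BD": (14042, 654.99),
    "DE_DE": (14042, 791.64),
    "ES_LA": (14042, 753.18),
    "FR_FR": (14042, 777.82),
    "HI_IN": (14042, 675.02),
    "ID_ID": (14042, 726.51),
    "IT_IT": (14042, 761.19),
    "JA_JP": (14042, 322.79),
    "KO_KR": (14042, 354.35),
    "PT_BR": (14042, 706.79),
    "SW_KE": (14042, 699.08),
    "YO_NG": (14042, 681.01),
    "ZH_CN": (14042, 257.15),
}

total_samples = sum(count for count, _ in subset_stats.values())
weighted_mean = (
    sum(count * mean for count, mean in subset_stats.values()) / total_samples
)

print(total_samples)            # 196588
print(round(weighted_mean, 2))  # 624.75
```

Because every subset has the same number of samples, the weighted mean equals the plain average of the 14 per-subset means.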

Sample Example

Subset: AR_XY

{
  "input": [
    {
      "id": "e43faf14",
      "content": "أجب على سؤال الاختيار من متعدد التالي. يجب أن يكون السطر الأخير من إجابتك بالتنسيق التالي: 'ANSWER: [LETTER]' (بدون علامات اقتباس) حيث [LETTER] هو أحد الحروف A,B,C,D. فكّر خطوة بخطوة قبل الإجابة.\n\nأوجد درجة امتداد الحقل المحدد Q(sqrt(2)، sqrt(3)، sqrt(18)) على Q.\n\nA) 0\nB) 4\nC) 2\nD) 6"
    }
  ],
  "choices": [
    "0",
    "4",
    "2",
    "6"
  ],
  "target": "B",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "subject": "abstract_algebra",
    "language": "AR_XY"
  }
}
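
Scoring a sample like this one means reading the letter from the final `ANSWER: [LETTER]` line of the model response and comparing it with `target`. The following is a minimal sketch of that step under the stated output format; `extract_answer` and its regular expression are assumptions made for illustration, not evalscope's actual parser, and the response text is invented.

```python
import re

# Hypothetical pattern: matches a final line such as "ANSWER: B".
ANSWER_RE = re.compile(r"ANSWER:\s*([A-D])\s*$")


def extract_answer(response):
    """Return the answer letter on the last line of the response, or None."""
    lines = response.strip().splitlines()
    if not lines:
        return None
    match = ANSWER_RE.search(lines[-1])
    return match.group(1) if match else None


sample = {"target": "B"}

# Invented model response, for illustration only.
response = (
    "sqrt(18) = 3*sqrt(2), so the field is Q(sqrt(2), sqrt(3)), "
    "which has degree 4 over Q.\n"
    "ANSWER: B"
)

is_correct = extract_answer(response) == sample["target"]
print(is_correct)
```

Returning `None` for a missing or malformed final line lets the caller count such responses as incorrect rather than raising an error.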

Prompt Template

Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}
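
To make the placeholders concrete, the sketch below fills the template with Python's `str.format`. The `A) ...` choice layout and the comma-separated letter list are inferred from the sample example on this page; `render_prompt` is a hypothetical helper, and evalscope's own rendering may differ in detail.

```python
PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'ANSWER: [LETTER]' (without "
    "quotes) where [LETTER] is one of {letters}. Think step by step before "
    "answering.\n\n{question}\n\n{choices}"
)


def render_prompt(question, choices):
    """Hypothetical helper: fill the template for one question."""
    # One letter per choice: A, B, C, ...
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    # Assumed layout "A) text", one choice per line, as in the sample example.
    choice_lines = "\n".join(
        f"{letter}) {text}" for letter, text in zip(letters, choices)
    )
    return PROMPT_TEMPLATE.format(
        letters=",".join(letters),
        question=question,
        choices=choice_lines,
    )


prompt = render_prompt(
    "Find the degree for the given field extension "
    "Q(sqrt(2), sqrt(3), sqrt(18)) over Q.",
    ["0", "4", "2", "6"],
)
print(prompt)
```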

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets mmmlu \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mmmlu'],
    dataset_args={
        'mmmlu': {
            # 'subset_list': ['AR_XY', 'BN_BD', 'DE_DE'],  # optional, evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)