MTEB Text Embedding Evaluation#

This framework supports MTEB 2.x (Massive Text Embedding Benchmark) for measuring text embedding model performance across retrieval, reranking, classification, clustering, semantic textual similarity (STS), and more. It covers 35+ Chinese datasets and 112+ languages.

Warning

Breaking Changes (v1.9.0)

If you are upgrading from an older version, please note:

Config schema: model -> models, eval.tasks -> eval.task_names, eval.topk -> eval.top_k
Dependencies: Requires mteb>=2.7.0 (was 1.x) and ragas>=0.4.0 (was 0.2.x)
Dataset migration: All Chinese datasets migrated from C-MTEB/ to mteb/ organization
Custom dataset format changes: custom datasets are unified to the Retrieval format (JSONL). TSV qrels and the Reranking/STS custom task types are deprecated; qrels moved from qrels/test.tsv to a flat qrels.jsonl; column names are unified to _id (corpus/queries) and query-id + corpus-id + score (qrels)
Removed imports: from evalscope.backend.rag_eval import EmbeddingModel -> use load_model(); from evalscope.backend.rag_eval.cmteb import ... -> use evalscope.backend.rag_eval.mteb
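The config schema renames listed above can be applied mechanically to an old eval_config dictionary. The following is a minimal sketch: `migrate_eval_config` is a hypothetical helper and not part of evalscope, and wrapping a single `model` dict in a list is an assumption about how older configs were written.

```python
from copy import deepcopy

# Old -> new key names inside "eval", taken from the breaking-changes list.
EVAL_KEY_RENAMES = {"tasks": "task_names", "topk": "top_k"}


def migrate_eval_config(old_cfg: dict) -> dict:
    """Return a copy of a pre-1.9.0 eval_config using the new key names.

    Hypothetical helper for illustration; not an evalscope API.
    """
    cfg = deepcopy(old_cfg)

    # model -> models (assumption: a single dict is wrapped in a list)
    if "model" in cfg and "models" not in cfg:
        model = cfg.pop("model")
        cfg["models"] = model if isinstance(model, list) else [model]

    # eval.tasks -> eval.task_names, eval.topk -> eval.top_k
    eval_cfg = cfg.get("eval", {})
    for old_key, new_key in EVAL_KEY_RENAMES.items():
        if old_key in eval_cfg and new_key not in eval_cfg:
            eval_cfg[new_key] = eval_cfg.pop(old_key)
    return cfg


old = {
    "tool": "MTEB",
    "model": [{"model_name_or_path": "AI-ModelScope/m3e-base"}],
    "eval": {"tasks": ["TNews"], "topk": 10},
}
new = migrate_eval_config(old)
print(new["eval"])
```

The helper works on a deep copy, so the original dictionary stays available for comparison while you verify the migrated config.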

Environment Setup#

pip install evalscope[rag] -U

Scenario 1: Evaluate Local Embedding Models (Quick Start)#

For: locally deployed sentence-transformer models

Configuration and Running#

from evalscope.run import run_task

task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "models": [
            {
                "model_name_or_path": "AI-ModelScope/m3e-base",
                "max_seq_length": 512,
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {"batch_size": 128},
            }
        ],
        "eval": {
            "task_names": ["TNews", "CLSClusteringS2S", "T2Reranking", "T2Retrieval", "ATEC"],
            "verbosity": 2,
            "overwrite_results": True,
            "top_k": 10,
            "limits": 500,
        },
    },
}

run_task(task_cfg=task_cfg)

Understanding Results#

Results are saved to outputs/<model_name>/<revision>/<task_name>.json. Key fields:

{
  "task_name": "TNews",
  "scores": {
    "validation": [
      {
        "main_score": 0.4744,
        "accuracy": 0.4744,
        "f1": 0.4456,
        "f1_weighted": 0.4754
      }
    ]
  }
}

The main_score is the primary metric for each task (accuracy for classification, ndcg_at_10 for retrieval).
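To collect the main score per split from a result file, something like the following may help. It is a sketch that assumes only the fields shown in the sample above (`task_name`, `scores`, and a list of score entries per split); `summarize_scores` is a hypothetical helper name.

```python
import json


def summarize_scores(result: dict):
    """Return (task_name, {split: mean main_score}) for one result dict."""
    summary = {}
    for split, entries in result["scores"].items():
        # each split holds a list of score dicts; average their main_score
        values = [entry["main_score"] for entry in entries]
        summary[split] = sum(values) / len(values)
    return result["task_name"], summary


# Sample mirroring the TNews result shown above.
sample = json.loads("""
{
  "task_name": "TNews",
  "scores": {
    "validation": [
      {"main_score": 0.4744, "accuracy": 0.4744, "f1": 0.4456, "f1_weighted": 0.4754}
    ]
  }
}
""")
task, summary = summarize_scores(sample)
print(task, summary)
```

For a file on disk, load it with `json.load(open(path, encoding="utf-8"))` and pass the resulting dict to the same function.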

Key Parameters for This Scenario#

Parameter	Type	Default	Description
`model_name_or_path`	`str`	`""`	Model name or path, supports auto-download from ModelScope
`max_seq_length`	`int`	`512`	Maximum sequence length
`model_kwargs`	`dict`	`{}`	Model loading args, e.g. `{"torch_dtype": "auto"}`
`encode_kwargs`	`dict`	`{"batch_size": 32}`	Encoding args
`task_names`	`List[str]`	`None`	List of tasks to evaluate
`limits`	`Optional[int]`	`None`	Limit sample count (recommended for debugging)

Scenario 2: Evaluate API Models (DashScope/OpenAI, etc.)#

For: Embedding/Reranker services accessed via API

Prerequisites#

Ensure the model service is deployed and accessible. Obtain the API endpoint and key.

Configuration and Running#

import os
from evalscope.run import run_task
from evalscope import TaskConfig

task_cfg = TaskConfig(
    eval_backend='RAGEval',
    eval_config={
        'tool': 'MTEB',
        'models': [
            {
                'model_name': 'text-embedding-v3',
                'api_base': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
                'api_key': os.environ.get('DASHSCOPE_API_KEY', 'EMPTY'),
                'dimensions': 1024,
                'encode_kwargs': {
                    'batch_size': 10,
                },
            }
        ],
        'eval': {
            'task_names': ['T2Retrieval'],
            'verbosity': 2,
            'overwrite_results': True,
            'limits': 30,
        },
    },
)

run_task(task_cfg=task_cfg)

Understanding Results#

{
  "task_name": "T2Retrieval",
  "scores": {
    "dev": [
      {
        "main_score": 0.73143,
        "ndcg_at_10": 0.73143,
        "recall_at_10": 0.73318,
        "precision_at_1": 0.78989
      }
    ]
  }
}

Key Parameters for This Scenario#

Parameter	Type	Default	Description
`model_name`	`str`	-	API model name
`api_base`	`str`	-	API service URL
`api_key`	`str`	-	API key
`dimensions`	`Optional[int]`	`None`	Model output dimensions
`is_cross_encoder`	`bool`	`False`	Set to `True` for API rerankers

Scenario 3: Two-Stage Evaluation (Retrieval + Reranking)#

For: pipelines that first retrieve with an Embedding model, then re-rank with a Reranker

When to Choose Two-Stage#

Single-stage: Directly evaluate embedding models on retrieval/classification/clustering/STS tasks — suitable for most scenarios
Two-stage: When you need to evaluate a reranker model’s effect in a real retrieval pipeline; first retrieves top-K candidates with an embedding model, then re-ranks them with a reranker
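The two-stage idea can be illustrated with a toy pipeline in plain Python. This is only a conceptual sketch, not how evalscope or MTEB implement it: the vectors, the `retrieve` and `rerank` helpers, and the reranker scores are all made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, doc_vecs, top_k):
    """Stage 1: rank all documents by embedding similarity, keep the top K."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:top_k]


def rerank(candidates, score_fn):
    """Stage 2: reorder only the stage-1 candidates with a (cross-encoder) score."""
    return sorted(candidates, key=score_fn, reverse=True)


# Toy embeddings and toy reranker scores (illustrative values only).
doc_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
query_vec = [1.0, 0.05]
reranker_scores = {"a": 0.2, "b": 0.9, "c": 0.99}

candidates = retrieve(query_vec, doc_vecs, top_k=2)
final = rerank(candidates, lambda d: reranker_scores[d])
print(candidates, final)
```

Note that document "c" never reaches the reranker even though its toy reranker score is the highest: stage 2 can only reorder what stage 1 returned, which is why top_k matters in a two-stage evaluation.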

Configuration and Running#

Pass two models in models: the first is the embedding model for retrieval, the second is the reranking model (set is_cross_encoder: True).

from evalscope.run import run_task

task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "models": [
            {
                "model_name_or_path": "AI-ModelScope/m3e-base",
                "is_cross_encoder": False,
                "max_seq_length": 512,
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {"batch_size": 64},
            },
            {
                "model_name_or_path": "OpenBMB/MiniCPM-Reranker",
                "is_cross_encoder": True,
                "max_seq_length": 512,
                "prompt": "Generate a retrieval representation for this question",
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {"batch_size": 32},
            },
        ],
        "eval": {
            "task_names": ["T2Retrieval"],
            "verbosity": 2,
            "overwrite_results": True,
            "top_k": 5,
            "limits": 100,
        },
    },
}

run_task(task_cfg=task_cfg)

Understanding Results#

Two-stage evaluation outputs results for both stage 1 (retrieval) and stage 2 (reranking):

Stage 1 (outputs/stage1/<embedding_model>/...):

{
  "task_name": "T2Retrieval",
  "scores": {"dev": [{"main_score": 0.73143, "ndcg_at_10": 0.73143}]}
}

Stage 2 (outputs/stage2/<reranker_model>/...):

{
  "task_name": "T2Retrieval",
  "scores": {"dev": [{"main_score": 0.661, "ndcg_at_10": 0.661}]}
}
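To see what the reranker changed, compare the main score of the two stages on the same split. The sketch below hard-codes the two sample results shown above; `main_score` is a hypothetical helper, and in practice you would load the two JSON files from their stage1 and stage2 output directories.

```python
stage1 = {
    "task_name": "T2Retrieval",
    "scores": {"dev": [{"main_score": 0.73143, "ndcg_at_10": 0.73143}]},
}
stage2 = {
    "task_name": "T2Retrieval",
    "scores": {"dev": [{"main_score": 0.661, "ndcg_at_10": 0.661}]},
}


def main_score(result, split="dev"):
    """Read main_score from the first entry of the given split."""
    return result["scores"][split][0]["main_score"]


# Positive delta: reranking improved the metric; negative: it lowered it.
delta = main_score(stage2) - main_score(stage1)
print(f"{stage1['task_name']} reranking delta: {delta:+.5f}")
```

In this sample the delta is negative, meaning the reranked list scored lower than the stage-1 retrieval on the evaluated subset.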

Two-Stage Specific Parameters#

Parameter	Type	Default	Description
`is_cross_encoder`	`bool`	`False`	Must be `True` for the second model (reranker)
`top_k`	`int`	`10`	Number of candidates retrieved in stage 1, passed to stage 2 for reranking
`prompt`	`Optional[str]`	`None`	Query prefix prompt for retrieval tasks

Scenario 4: Custom Dataset Evaluation#

For: evaluating models with private datasets

Data Preparation#

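Custom datasets must follow the Retrieval format (JSONL). The column names are summarized in the Breaking Changes list above: _id for corpus and queries, and query-id, corpus-id, score for the flat qrels.jsonl.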
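As a rough illustration, a tiny dataset in this layout could be generated as follows. Only `qrels.jsonl` and the column names `_id`, `query-id`, `corpus-id`, and `score` come from this page; the file names `corpus.jsonl` and `queries.jsonl` and the `text` field are assumptions, so check the format specification before relying on them.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def build_toy_dataset(root):
    """Create a three-file retrieval dataset under `root` (assumed layout)."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    # corpus and queries are keyed by `_id`; the `text` field is an assumption
    write_jsonl(root / "corpus.jsonl", [
        {"_id": "doc1", "text": "EvalScope supports MTEB evaluation."},
        {"_id": "doc2", "text": "Rerankers reorder retrieved candidates."},
    ])
    write_jsonl(root / "queries.jsonl", [
        {"_id": "q1", "text": "Which framework supports MTEB?"},
    ])
    # flat qrels: one relevance judgment per line
    write_jsonl(root / "qrels.jsonl", [
        {"query-id": "q1", "corpus-id": "doc1", "score": 1},
    ])
    return root


dataset_dir = build_toy_dataset(Path(tempfile.mkdtemp()) / "custom_dataset")
print(sorted(p.name for p in dataset_dir.iterdir()))
```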

Configuration and Running#

Use custom_tasks in eval to specify your custom dataset path:

from evalscope.run import run_task

task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "models": [
            {
                "model_name_or_path": "AI-ModelScope/m3e-base",
                "max_seq_length": 512,
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {"batch_size": 128},
            }
        ],
        "eval": {
            "custom_tasks": ["path/to/your/custom_dataset"],
            "verbosity": 2,
            "overwrite_results": True,
        },
    },
}

run_task(task_cfg=task_cfg)

Supported Datasets#


Name	Hub Link	Description	Type	Category	Number of Test Samples
T2Retrieval	mteb/T2Retrieval	T2Ranking: A large-scale Chinese paragraph ranking benchmark	Retrieval	s2p	24,832
MMarcoRetrieval	mteb/MMarcoRetrieval	mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset	Retrieval	s2p	7,437
DuRetrieval	mteb/DuRetrieval	A large-scale Chinese web search engine paragraph retrieval benchmark	Retrieval	s2p	4,000
CovidRetrieval	mteb/CovidRetrieval	COVID-19 news articles	Retrieval	s2p	949
CmedqaRetrieval	mteb/CmedqaRetrieval	Online medical consultation texts	Retrieval	s2p	3,999
EcomRetrieval	mteb/EcomRetrieval	Paragraph retrieval dataset collected from Alibaba e-commerce search engine systems	Retrieval	s2p	1,000
MedicalRetrieval	mteb/MedicalRetrieval	Paragraph retrieval dataset collected from Alibaba medical search engine systems	Retrieval	s2p	1,000
VideoRetrieval	mteb/VideoRetrieval	Paragraph retrieval dataset collected from Alibaba video search engine systems	Retrieval	s2p	1,000
T2Reranking	mteb/T2Reranking	T2Ranking: A large-scale Chinese paragraph ranking benchmark	Re-ranking	s2p	24,382
MMarcoReranking	mteb/MMarco-reranking	mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset	Re-ranking	s2p	7,437
CMedQAv1	mteb/CMedQAv1-reranking	Chinese community medical Q&A	Re-ranking	s2p	2,000
CMedQAv2	mteb/CMedQAv2-reranking	Chinese community medical Q&A	Re-ranking	s2p	4,000
Ocnli	mteb/OCNLI	Original Chinese natural language inference dataset	Pair Classification	s2s	3,000
Cmnli	mteb/CMNLI	Chinese multi-class natural language inference	Pair Classification	s2s	139,000
CLSClusteringS2S	mteb/CLSClusteringS2S	Clustering titles from the CLS dataset. Clustering based on 13 sets of main categories.	Clustering	s2s	10,000
CLSClusteringP2P	mteb/CLSClusteringP2P	Clustering titles + abstracts from the CLS dataset. Clustering based on 13 sets of main categories.	Clustering	p2p	10,000
ThuNewsClusteringS2S	mteb/ThuNewsClusteringS2S	Clustering titles from the THUCNews dataset	Clustering	s2s	10,000
ThuNewsClusteringP2P	mteb/ThuNewsClusteringP2P	Clustering titles + abstracts from the THUCNews dataset	Clustering	p2p	10,000
ATEC	mteb/ATEC	ATEC NLP Sentence Pair Similarity Competition	STS	s2s	20,000
BQ	mteb/BQ	Banking Question Semantic Similarity	STS	s2s	10,000
LCQMC	mteb/LCQMC	Large-scale Chinese Question Matching Corpus	STS	s2s	12,500
PAWSX	mteb/PAWSX	Translated PAWS evaluation pairs	STS	s2s	2,000
STSB	mteb/STSB	Translated STS-B into Chinese	STS	s2s	1,360
AFQMC	mteb/AFQMC	Ant Financial Question Matching Corpus	STS	s2s	3,861
QBQTC	mteb/QBQTC	QQ Browser Query Title Corpus	STS	s2s	5,000
TNews	mteb/TNews-classification	News Short Text Classification	Classification	s2s	10,000
IFlyTek	mteb/IFlyTek-classification	Long Text Classification of Application Descriptions	Classification	s2s	2,600
Waimai	mteb/waimai-classification	Sentiment Analysis of User Reviews on Food Delivery Platforms	Classification	s2s	1,000
OnlineShopping	mteb/OnlineShopping-classification	Sentiment Analysis of User Reviews on Online Shopping Websites	Classification	s2s	1,000
MultilingualSentiment	mteb/MultilingualSentiment-classification	A set of multilingual sentiment datasets grouped into three categories: positive, neutral, negative	Classification	s2s	3,000
JDReview	mteb/JDReview-classification	Reviews of iPhone	Classification	s2s	533

For retrieval tasks, a sample of 100,000 candidates (including the ground truth) is drawn from the entire corpus to reduce inference costs.
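The sampling described above can be pictured as "keep every ground-truth document, then fill up with random distractors". The following sketch shows that idea under stated assumptions: `sample_corpus`, the fixed seed, and the fill strategy are illustrative and not taken from the dataset construction code.

```python
import random


def sample_corpus(corpus_ids, relevant_ids, size=100_000, seed=42):
    """Return up to `size` document IDs that always include the relevant ones."""
    relevant = set(relevant_ids)
    keep = [doc_id for doc_id in corpus_ids if doc_id in relevant]
    others = [doc_id for doc_id in corpus_ids if doc_id not in relevant]
    # fill the remaining slots with randomly chosen non-relevant documents
    n_extra = max(0, min(size - len(keep), len(others)))
    rng = random.Random(seed)
    return keep + rng.sample(others, n_extra)


corpus_ids = [f"d{i}" for i in range(1000)]
relevant_ids = {"d3", "d999"}
sampled = sample_corpus(corpus_ids, relevant_ids, size=100)
print(len(sampled), relevant_ids <= set(sampled))
```

Keeping the ground truth in the sample means recall-style metrics remain attainable, while the smaller candidate pool reduces encoding cost.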

Full Parameter Reference#

Local Model Parameters#

Parameter	Type	Default	Description
`model_name_or_path`	`str`	`""`	Model name or path, supports automatic download from ModelScope
`is_cross_encoder`	`bool`	`False`	Whether the model is a cross encoder; set to `True` for reranking models
`pooling_mode`	`Optional[str]`	`None`	Pooling mode. Options: cls / lasttoken / max / mean / mean_sqrt_len_tokens / weightedmean. Set to cls for `bge` series models
`max_seq_length`	`int`	`512`	Maximum sequence length
`prompt`	`Optional[str]`	`None`	Query prefix prompt for retrieval tasks
`prompts`	`Optional[Dict[str, str]]`	`None`	Per-task prompts (key=task name, value=prompt). Only effective when `prompt` is not set
`model_kwargs`	`dict`	`{}`	Model loading arguments, e.g. `{"torch_dtype": "auto"}`
`encode_kwargs`	`dict`	`{"batch_size": 32}`	Encoding arguments
`hub`	`str`	`"modelscope"`	Model source: `modelscope` or `huggingface`

Remote API Model Parameters#

Parameter	Type	Default	Description
`model_name`	`str`	-	API model name
`is_cross_encoder`	`bool`	`False`	Whether the model is a reranker; set to `True` for API rerankers
`api_base`	`str`	-	API base URL; rerankers call the rerank endpoint — if the URL does not end with `/rerank` or `/reranks`, `/rerank` is appended by default (`/reranks` for DashScope domains)
`api_key`	`str`	-	API key
`dimensions`	`Optional[int]`	`None`	Model output dimensions
`max_seq_length`	`int`	`512`	Maximum sequence length; texts exceeding this limit are automatically truncated before sending to the API
`encode_kwargs`	`dict`	`{"batch_size": 10}`	Encoding arguments
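The reranker URL rule described for api_base can be sketched as a small function. This mirrors the documented behavior only as far as the table states it; `resolve_rerank_url` is a hypothetical name, and both the trailing-slash handling and detecting DashScope by a "dashscope" substring in the hostname are assumptions.

```python
from urllib.parse import urlparse


def resolve_rerank_url(api_base: str) -> str:
    """Append the rerank endpoint to api_base unless it is already present."""
    base = api_base.rstrip("/")  # assumption: a trailing slash is ignored
    if base.endswith(("/rerank", "/reranks")):
        return base
    host = urlparse(base).hostname or ""
    # assumption: a DashScope domain is one whose hostname contains "dashscope"
    suffix = "/reranks" if "dashscope" in host else "/rerank"
    return base + suffix


for url in (
    "http://localhost:8000/v1",
    "http://localhost:8000/v1/rerank",
    "https://dashscope.aliyuncs.com/compatible-mode/v1",
):
    print(url, "->", resolve_rerank_url(url))
```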

Evaluation Configuration (eval)#

Parameter	Type	Default	Description
`task_names`	`Optional[List[str]]`	`None`	Task name list, see Supported Datasets
`task_types`	`Optional[List[str]]`	`None`	Filter by task type, e.g. `["Retrieval", "STS"]`
`languages`	`Optional[List[str]]`	`None`	Filter by language, e.g. `["cmn-Hans"]`
`custom_tasks`	`Optional[List]`	`None`	Custom task configuration list, see Scenario 4
`top_k`	`int`	`10`	Select top K results for retrieval tasks
`overwrite_results`	`bool`	`True`	Whether to overwrite existing results
`limits`	`Optional[int]`	`None`	Limit on number of samples; for retrieval tasks, limits both queries and corpus (only keeps documents referenced by qrels)
`output_folder`	`str`	`"outputs"`	Output directory for results
`hub`	`str`	`"modelscope"`	Dataset source: `modelscope` or `huggingface`
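For retrieval tasks, the effect of limits on queries and corpus can be pictured as follows. This is a sketch of the described behavior, not the actual implementation: `apply_limit` is hypothetical, and taking the first N queries is an assumption about how the subset is chosen.

```python
def apply_limit(queries, corpus, qrels, limit):
    """Keep `limit` queries and only the corpus documents their qrels reference."""
    # assumption: the first `limit` queries are kept, in insertion order
    kept_queries = dict(list(queries.items())[:limit])
    kept_qrels = {qid: qrels.get(qid, {}) for qid in kept_queries}
    referenced = {doc_id for docs in kept_qrels.values() for doc_id in docs}
    kept_corpus = {doc_id: text for doc_id, text in corpus.items() if doc_id in referenced}
    return kept_queries, kept_corpus, kept_qrels


queries = {"q1": "first query", "q2": "second query", "q3": "third query"}
corpus = {"d1": "doc one", "d2": "doc two", "d3": "doc three", "d4": "doc four"}
qrels = {"q1": {"d1": 1}, "q2": {"d2": 1, "d3": 1}, "q3": {"d4": 1}}

small_queries, small_corpus, small_qrels = apply_limit(queries, corpus, qrels, limit=2)
print(sorted(small_queries), sorted(small_corpus))
```

Because the reduced corpus contains only referenced documents, scores obtained with limits are much easier than full-corpus scores and are best used for debugging rather than reporting.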

Top-Level Configuration#

eval_backend: Fixed to RAGEval, indicating the RAGEval evaluation backend
eval_config.tool: Fixed to MTEB
eval_config.models: Model configuration list. Single-stage takes one model; two-stage takes two models (first for retrieval, second for reranking)
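These top-level rules can be checked before launching a run. The sketch below is a hypothetical pre-flight helper, not an evalscope function; it compares the tool name case-insensitively as an assumption and only encodes the constraints stated on this page.

```python
def check_task_cfg(task_cfg: dict) -> list:
    """Return a list of human-readable problems found in a dict-style task config."""
    problems = []
    if task_cfg.get("eval_backend") != "RAGEval":
        problems.append("eval_backend must be 'RAGEval'")

    eval_config = task_cfg.get("eval_config", {})
    # assumption: the tool name is compared case-insensitively here
    if str(eval_config.get("tool", "")).upper() != "MTEB":
        problems.append("eval_config.tool must be 'MTEB'")

    models = eval_config.get("models", [])
    if len(models) not in (1, 2):
        problems.append("eval_config.models must contain one or two models")
    elif len(models) == 2 and not models[1].get("is_cross_encoder", False):
        problems.append("the second model must set is_cross_encoder=True")
    return problems


good = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "models": [
            {"model_name_or_path": "AI-ModelScope/m3e-base"},
            {"model_name_or_path": "OpenBMB/MiniCPM-Reranker", "is_cross_encoder": True},
        ],
    },
}
bad = {"eval_backend": "RAGEval", "eval_config": {"tool": "MTEB", "models": []}}
print(check_task_cfg(good), check_task_cfg(bad))
```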

FAQ#

Model loading failure / OOM

Ensure sufficient GPU memory; reduce encode_kwargs.batch_size if needed
Use model_kwargs: {"torch_dtype": "float16"} to reduce memory usage
Verify the model_name_or_path is correct

Dataset download failure

Default source is ModelScope; ensure modelscope.cn is reachable
Set hub: "huggingface" to switch to HuggingFace source
For locally available datasets, specify a local path via model_name_or_path

Understanding result metrics

main_score: The primary evaluation metric for the task
Retrieval: ndcg_at_10 (Normalized Discounted Cumulative Gain)
Reranking: map (Mean Average Precision)
Classification: accuracy
STS: cosine_spearman (Spearman correlation of cosine similarity)
Clustering: v_measure
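To make ndcg_at_10 more concrete, here is a compact reference computation for a single query. It uses the common formulation with linear gain and a log2 rank discount; the scores reported by the framework come from MTEB's own evaluators, so treat this as an explanatory sketch rather than the exact implementation.

```python
import math


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: document IDs in the order the model returned them.
    relevance: {doc_id: graded relevance} from the qrels for this query.
    """
    # DCG: gain of each returned document, discounted by log2(position + 1)
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
    )
    # ideal DCG: the same sum for the best possible ordering of the judged documents
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


relevance = {"d1": 1}
perfect = ndcg_at_k(["d1", "x"], relevance)
second_place = ndcg_at_k(["x", "d1"], relevance)
print(perfect, second_place)
```

A relevant document at rank 1 yields the maximum score, while the same document at rank 2 is discounted by 1/log2(3); the task-level score is the mean over all queries.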