MTEB Text Embedding Evaluation#
This framework supports MTEB 2.x (Massive Text Embedding Benchmark) for measuring text embedding model performance across retrieval, reranking, classification, clustering, semantic textual similarity (STS), and more. It covers 35+ Chinese datasets and 112+ languages.
Warning
Breaking Changes (v1.9.0)
If you are upgrading from an older version, please note:
Config schema: model -> models, eval.tasks -> eval.task_names, eval.topk -> eval.top_k
Dependencies: Requires mteb>=2.7.0 (was 1.x) and ragas>=0.4.0 (was 0.2.x)
Dataset migration: All Chinese datasets migrated from C-MTEB/ to mteb/ organization
Custom Dataset format changes: Unified to Retrieval format (JSONL), deprecated TSV qrels and Reranking/STS custom task types; qrels changed from qrels/test.tsv to flat qrels.jsonl; column names unified: _id (corpus/queries), query-id + corpus-id + score (qrels)
Removed imports: from evalscope.backend.rag_eval import EmbeddingModel -> use load_model(); from evalscope.backend.rag_eval.cmteb import ... -> use evalscope.backend.rag_eval.mteb
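The config schema renames above can be applied mechanically to an old config dictionary. The following is a minimal sketch, not part of the framework: the helper name migrate_eval_config is hypothetical, and wrapping a single model dict into a list is an assumption about how the old model key was used.

```python
import copy

# Old -> new key names inside eval_config["eval"], as listed in the notes above.
EVAL_KEY_RENAMES = {"tasks": "task_names", "topk": "top_k"}


def migrate_eval_config(old_eval_config):
    """Return a copy of a pre-1.9.0 eval_config that uses the renamed keys.

    Hypothetical helper for illustration; it is not shipped with the framework.
    """
    new_cfg = copy.deepcopy(old_eval_config)

    # model -> models (assumption: a single dict is wrapped into a list)
    if "model" in new_cfg:
        model_value = new_cfg.pop("model")
        new_cfg["models"] = model_value if isinstance(model_value, list) else [model_value]

    # eval.tasks -> eval.task_names, eval.topk -> eval.top_k
    eval_section = new_cfg.get("eval", {})
    for old_key, new_key in EVAL_KEY_RENAMES.items():
        if old_key in eval_section:
            eval_section[new_key] = eval_section.pop(old_key)
    return new_cfg


old_cfg = {
    "tool": "MTEB",
    "model": [{"model_name_or_path": "AI-ModelScope/m3e-base"}],
    "eval": {"tasks": ["TNews"], "topk": 10},
}
new_cfg = migrate_eval_config(old_cfg)
print(new_cfg)
```

The helper works on a deep copy, so the original dictionary is left untouched and the two versions can be compared side by side.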
Environment Setup#
pip install evalscope[rag] -U
Scenario 1: Evaluate Local Embedding Models (Quick Start)#
For: locally deployed sentence-transformer models
Configuration and Running#
from evalscope.run import run_task

task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "models": [
            {
                "model_name_or_path": "AI-ModelScope/m3e-base",
                "max_seq_length": 512,
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {"batch_size": 128},
            }
        ],
        "eval": {
            "task_names": ["TNews", "CLSClusteringS2S", "T2Reranking", "T2Retrieval", "ATEC"],
            "verbosity": 2,
            "overwrite_results": True,
            "top_k": 10,
            "limits": 500,  # limit sample count (recommended for debugging)
        },
    },
}

run_task(task_cfg=task_cfg)
Understanding Results#
Results are saved to outputs/<model_name>/<revision>/<task_name>.json. Key fields:
{
    "task_name": "TNews",
    "scores": {
        "validation": [
            {
                "main_score": 0.4744,
                "accuracy": 0.4744,
                "f1": 0.4456,
                "f1_weighted": 0.4754
            }
        ]
    }
}
The main_score is the primary metric for each task (accuracy for classification, ndcg@10 for retrieval).
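To pull the primary metric out of a result file programmatically, the structure shown above can be walked with the standard json module. This is a minimal sketch that parses an inline copy of the TNews example; in practice the text would be read from the saved result file, and the helper name collect_main_scores is hypothetical.

```python
import json

# Same structure as the TNews example above; in practice this text would be
# read from the saved <task_name>.json file.
result_text = """
{
  "task_name": "TNews",
  "scores": {
    "validation": [
      {"main_score": 0.4744, "accuracy": 0.4744, "f1": 0.4456, "f1_weighted": 0.4754}
    ]
  }
}
"""


def collect_main_scores(result):
    """Map each split name to the list of main_score values found in it."""
    return {
        split: [entry["main_score"] for entry in entries]
        for split, entries in result["scores"].items()
    }


result = json.loads(result_text)
main_scores = collect_main_scores(result)
print(result["task_name"], main_scores)
```

Because the split name differs between tasks (validation here, dev for T2Retrieval below), the sketch iterates over whatever splits are present instead of hard-coding one.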
Key Parameters for This Scenario#
| Parameter | Type | Default | Description |
|---|---|---|---|
|  |  |  | Model name or path, supports auto-download from ModelScope |
|  |  |  | Maximum sequence length |
|  |  |  | Model loading args, e.g. |
|  |  |  | Encoding args |
|  |  |  | List of tasks to evaluate |
|  |  |  | Limit sample count (recommended for debugging) |
Scenario 2: Evaluate API Models (DashScope/OpenAI, etc.)#
For: Embedding/Reranker services accessed via API
Prerequisites#
Ensure the model service is deployed and accessible. Obtain the API endpoint and key.
Configuration and Running#
import os

from evalscope import TaskConfig
from evalscope.run import run_task

task_cfg = TaskConfig(
    eval_backend='RAGEval',
    eval_config={
        'tool': 'MTEB',
        'models': [
            {
                'model_name': 'text-embedding-v3',
                'api_base': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
                'api_key': os.environ.get('DASHSCOPE_API_KEY', 'EMPTY'),
                'dimensions': 1024,
                'encode_kwargs': {
                    'batch_size': 10,
                },
            }
        ],
        'eval': {
            'task_names': ['T2Retrieval'],
            'verbosity': 2,
            'overwrite_results': True,
            'limits': 30,
        },
    },
)

run_task(task_cfg=task_cfg)
Understanding Results#
{
    "task_name": "T2Retrieval",
    "scores": {
        "dev": [
            {
                "main_score": 0.73143,
                "ndcg_at_10": 0.73143,
                "recall_at_10": 0.73318,
                "precision_at_1": 0.78989
            }
        ]
    }
}
Key Parameters for This Scenario#
| Parameter | Type | Default | Description |
|---|---|---|---|
|  |  | - | API model name |
|  |  | - | API service URL |
|  |  | - | API key |
|  |  |  | Model output dimensions |
|  |  |  | Set to |
Scenario 3: Two-Stage Evaluation (Retrieval + Reranking)#
For: pipelines that first retrieve with an Embedding model, then re-rank with a Reranker
When to Choose Two-Stage#
Single-stage: Directly evaluate embedding models on retrieval/classification/clustering/STS tasks — suitable for most scenarios
Two-stage: When you need to evaluate a reranker model’s effect in a real retrieval pipeline; first retrieves top-K candidates with an embedding model, then re-ranks them with a reranker
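The two-stage flow described above can be illustrated with a toy pipeline. This is a conceptual sketch only: the vectors, documents, and the word-overlap scorer are made-up stand-ins for a real embedding model and a real cross-encoder reranker, and none of the function names belong to the framework.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, doc_vecs, top_k):
    """Stage 1: rank all documents by cosine similarity, keep the top_k ids."""
    ranked = sorted(doc_vecs, key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]), reverse=True)
    return ranked[:top_k]


def rerank(query, candidates, docs, score_pair):
    """Stage 2: re-score only the stage-1 candidates with a pairwise scorer."""
    return sorted(candidates, key=lambda doc_id: score_pair(query, docs[doc_id]), reverse=True)


def word_overlap(query, doc):
    """Toy stand-in for a cross-encoder: number of shared words."""
    return len(set(query.split()) & set(doc.split()))


docs = {
    "d1": "how to reset a password",
    "d2": "password reset email not received",
    "d3": "weather forecast for tomorrow",
}
# Hypothetical 2-dimensional embeddings, chosen by hand for the example.
doc_vecs = {"d1": [0.9, 0.1], "d2": [0.8, 0.3], "d3": [0.0, 1.0]}
query, query_vec = "reset email not received", [1.0, 0.2]

candidates = retrieve(query_vec, doc_vecs, top_k=2)
final = rerank(query, candidates, docs, word_overlap)
print(candidates, final)
```

The reranker only sees what stage 1 hands over, so a relevant document that misses the top-K cut can never be recovered in stage 2; this is why top_k matters in the two-stage setup.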
Configuration and Running#
Pass two models in models: the first is the embedding model for retrieval, the second is the reranking model (set is_cross_encoder: True).
from evalscope.run import run_task

task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "models": [
            # Stage 1: embedding model used for retrieval
            {
                "model_name_or_path": "AI-ModelScope/m3e-base",
                "is_cross_encoder": False,
                "max_seq_length": 512,
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {"batch_size": 64},
            },
            # Stage 2: reranking model
            {
                "model_name_or_path": "OpenBMB/MiniCPM-Reranker",
                "is_cross_encoder": True,
                "max_seq_length": 512,
                "prompt": "Generate a retrieval representation for this question",
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {"batch_size": 32},
            },
        ],
        "eval": {
            "task_names": ["T2Retrieval"],
            "verbosity": 2,
            "overwrite_results": True,
            "top_k": 5,
            "limits": 100,
        },
    },
}

run_task(task_cfg=task_cfg)
Understanding Results#
Two-stage evaluation outputs results for both stage 1 (retrieval) and stage 2 (reranking):
Stage 1 (outputs/stage1/<embedding_model>/...):
{
    "task_name": "T2Retrieval",
    "scores": {"dev": [{"main_score": 0.73143, "ndcg_at_10": 0.73143}]}
}
Stage 2 (outputs/stage2/<reranker_model>/...):
{
    "task_name": "T2Retrieval",
    "scores": {"dev": [{"main_score": 0.661, "ndcg_at_10": 0.661}]}
}
Two-Stage Specific Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
|  |  |  | Must be |
|  |  |  | Number of candidates retrieved in stage 1, passed to stage 2 for reranking |
|  |  |  | Query prefix prompt for retrieval tasks |
Scenario 4: Custom Dataset Evaluation#
For: evaluating models with private datasets
Data Preparation#
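Custom datasets must follow the Retrieval format (JSONL): corpus and query records keyed by _id, plus a flat qrels.jsonl whose records carry query-id, corpus-id, and score, as listed under Breaking Changes above.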
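A minimal sketch of preparing such a dataset directory is shown here. The column names _id, query-id, corpus-id, and score and the file name qrels.jsonl come from the breaking-changes notes; the file names corpus.jsonl and queries.jsonl, the text field, and the sample records are assumptions made for illustration and should be checked against the format specification.

```python
import json
import tempfile
from pathlib import Path

# Sample records; the "text" field name is an assumption.
corpus = [
    {"_id": "doc1", "text": "Reset your password from the account settings page."},
    {"_id": "doc2", "text": "Invoices are emailed on the first day of each month."},
]
queries = [{"_id": "q1", "text": "How do I change my password?"}]
qrels = [{"query-id": "q1", "corpus-id": "doc1", "score": 1}]


def write_jsonl(path, rows):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


dataset_dir = Path(tempfile.mkdtemp()) / "custom_dataset"
dataset_dir.mkdir()
write_jsonl(dataset_dir / "corpus.jsonl", corpus)    # assumed file name
write_jsonl(dataset_dir / "queries.jsonl", queries)  # assumed file name
write_jsonl(dataset_dir / "qrels.jsonl", qrels)

written = sorted(p.name for p in dataset_dir.iterdir())
print(dataset_dir, written)
```

The resulting directory path is what would be listed under custom_tasks in the configuration.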
Configuration and Running#
Use custom_tasks in eval to specify your custom dataset path:
from evalscope.run import run_task

task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "models": [
            {
                "model_name_or_path": "AI-ModelScope/m3e-base",
                "max_seq_length": 512,
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {"batch_size": 128},
            }
        ],
        "eval": {
            "custom_tasks": ["path/to/your/custom_dataset"],
            "verbosity": 2,
            "overwrite_results": True,
        },
    },
}

run_task(task_cfg=task_cfg)
Supported Datasets#
| Name | Hub Link | Description | Type | Category | Number of Test Samples |
|---|---|---|---|---|---|
|  |  | T2Ranking: A large-scale Chinese paragraph ranking benchmark | Retrieval | s2p | 24,832 |
|  |  | mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset | Retrieval | s2p | 7,437 |
|  |  | A large-scale Chinese web search engine paragraph retrieval benchmark | Retrieval | s2p | 4,000 |
|  |  | COVID-19 news articles | Retrieval | s2p | 949 |
|  |  | Online medical consultation texts | Retrieval | s2p | 3,999 |
|  |  | Paragraph retrieval dataset collected from Alibaba e-commerce search engine systems | Retrieval | s2p | 1,000 |
|  |  | Paragraph retrieval dataset collected from Alibaba medical search engine systems | Retrieval | s2p | 1,000 |
|  |  | Paragraph retrieval dataset collected from Alibaba video search engine systems | Retrieval | s2p | 1,000 |
|  |  | T2Ranking: A large-scale Chinese paragraph ranking benchmark | Re-ranking | s2p | 24,382 |
|  |  | mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset | Re-ranking | s2p | 7,437 |
|  |  | Chinese community medical Q&A | Re-ranking | s2p | 2,000 |
|  |  | Chinese community medical Q&A | Re-ranking | s2p | 4,000 |
|  |  | Original Chinese natural language inference dataset | Pair Classification | s2s | 3,000 |
|  |  | Chinese multi-class natural language inference | Pair Classification | s2s | 139,000 |
|  |  | Clustering titles from the CLS dataset. Clustering based on 13 sets of main categories. | Clustering | s2s | 10,000 |
|  |  | Clustering titles + abstracts from the CLS dataset. Clustering based on 13 sets of main categories. | Clustering | p2p | 10,000 |
|  |  | Clustering titles from the THUCNews dataset | Clustering | s2s | 10,000 |
|  |  | Clustering titles + abstracts from the THUCNews dataset | Clustering | p2p | 10,000 |
|  |  | ATEC NLP Sentence Pair Similarity Competition | STS | s2s | 20,000 |
|  |  | Banking Question Semantic Similarity | STS | s2s | 10,000 |
|  |  | Large-scale Chinese Question Matching Corpus | STS | s2s | 12,500 |
|  |  | Translated PAWS evaluation pairs | STS | s2s | 2,000 |
|  |  | Translated STS-B into Chinese | STS | s2s | 1,360 |
|  |  | Ant Financial Question Matching Corpus | STS | s2s | 3,861 |
|  |  | QQ Browser Query Title Corpus | STS | s2s | 5,000 |
|  |  | News Short Text Classification | Classification | s2s | 10,000 |
|  |  | Long Text Classification of Application Descriptions | Classification | s2s | 2,600 |
|  |  | Sentiment Analysis of User Reviews on Food Delivery Platforms | Classification | s2s | 1,000 |
|  |  | Sentiment Analysis of User Reviews on Online Shopping Websites | Classification | s2s | 1,000 |
|  |  | A set of multilingual sentiment datasets grouped into three categories: positive, neutral, negative | Classification | s2s | 3,000 |
|  |  | Reviews of iPhone | Classification | s2s | 533 |
For retrieval tasks, a sample of 100,000 candidates (including the ground truth) is drawn from the entire corpus to reduce inference costs.
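The idea of sampling candidates while always keeping the ground truth can be sketched as follows. This only illustrates the general technique: the function name sample_candidates, the fixed seed, and the uniform sampling of the remaining documents are assumptions, not a description of how the benchmark data was actually built.

```python
import random


def sample_candidates(corpus_ids, relevant_ids, sample_size, seed=0):
    """Keep every ground-truth document, then fill up with random other documents."""
    relevant = [doc_id for doc_id in corpus_ids if doc_id in relevant_ids]
    others = [doc_id for doc_id in corpus_ids if doc_id not in relevant_ids]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    n_extra = max(0, min(len(others), sample_size - len(relevant)))
    return relevant + rng.sample(others, n_extra)


corpus_ids = [f"doc{i}" for i in range(1000)]
relevant_ids = {"doc3", "doc500"}
sampled = sample_candidates(corpus_ids, relevant_ids, sample_size=100)
print(len(sampled), relevant_ids <= set(sampled))
```

Keeping the relevant documents in the sample ensures that every query can still be answered correctly; dropping them would push the retrieval scores down for reasons unrelated to model quality.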
Full Parameter Reference#
Local Model Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
|  |  |  | Model name or path, supports automatic download from ModelScope |
|  |  |  | Whether the model is a cross encoder; set to |
|  |  |  | Pooling mode. Options: cls / lasttoken / max / mean / mean_sqrt_len_tokens / weightedmean. Set to cls for |
|  |  |  | Maximum sequence length |
|  |  |  | Query prefix prompt for retrieval tasks |
|  |  |  | Per-task prompts (key=task name, value=prompt). Only effective when |
|  |  |  | Model loading arguments, e.g. |
|  |  |  | Encoding arguments |
|  |  |  | Model source: |
Remote API Model Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
|  |  | - | API model name |
|  |  |  | Whether the model is a reranker; set to |
|  |  | - | API base URL. Rerankers call the rerank endpoint; if the URL does not end with |
|  |  | - | API key |
|  |  |  | Model output dimensions |
|  |  |  | Maximum sequence length; texts exceeding this limit are automatically truncated before sending to the API |
|  |  |  | Encoding arguments |
Evaluation Configuration (eval)#
| Parameter | Type | Default | Description |
|---|---|---|---|
|  |  |  | Task name list, see Supported Datasets |
|  |  |  | Filter by task type, e.g. |
|  |  |  | Filter by language, e.g. |
|  |  |  | Custom task configuration list, see Scenario 4 |
|  |  |  | Select top K results for retrieval tasks |
|  |  |  | Whether to overwrite existing results |
|  |  |  | Limit on number of samples; for retrieval tasks, limits both queries and corpus (only keeps documents referenced by qrels) |
|  |  |  | Output directory for results |
|  |  |  | Dataset source: |
Top-Level Configuration#
eval_backend: Fixed to RAGEval, indicating the RAGEval evaluation backend
eval_config.tool: Fixed to MTEB
eval_config.models: Model configuration list. Single-stage takes one model; two-stage takes two models (first for retrieval, second for reranking)
FAQ#
Model loading failure / OOM
Ensure sufficient GPU memory; reduce encode_kwargs.batch_size if needed
Use model_kwargs: {"torch_dtype": "float16"} to reduce memory usage
Verify the model_name_or_path is correct
Dataset download failure
Default source is ModelScope; ensure modelscope.cn is reachable
Set hub: "huggingface" to switch to HuggingFace source
For locally available datasets, specify a local path via model_name_or_path
Understanding result metrics
main_score: The primary evaluation metric for the task
Retrieval: ndcg_at_10 (Normalized Discounted Cumulative Gain)
Reranking: map (Mean Average Precision)
Classification: accuracy
STS: cosine_spearman (Spearman correlation of cosine similarity)
Clustering: v_measure
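As a worked illustration of the retrieval metric, nDCG@k can be computed by hand for a single query. This sketch uses the linear-gain form (relevance divided by log2 of rank + 1); some implementations use an exponential gain instead, so treat it as an illustration of the idea and not as the exact code path used by the benchmark.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (ranks start at 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the returned ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Relevance of each returned document, in the order the model ranked them.
ranked = [0, 1, 0, 1]
# Relevance labels of all documents judged relevant for this query.
judged = [1, 1]

score = ndcg_at_k(ranked, judged, k=10)
perfect = ndcg_at_k([1, 1, 0, 0], judged, k=10)
print(round(score, 4), perfect)
```

The two relevant documents sit at ranks 2 and 4 in the example, so the score falls below the value obtained when both appear at the top of the list.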