MTEB

This framework supports both MTEB and C-MTEB, described below:

MTEB (Massive Text Embedding Benchmark) is a large-scale benchmark that measures how well text embedding models perform across diverse embedding tasks. MTEB includes 56 datasets covering 8 tasks and supports over 112 languages. Its goal is to help developers find the best text embedding model for a given task.
C-MTEB (Chinese Massive Text Embedding Benchmark) is a dedicated evaluation benchmark for Chinese text embeddings. It is built on MTEB and assesses the performance of Chinese text embedding models. C-MTEB collects 35 public datasets, grouped into 6 task categories: retrieval, re-ranking, semantic textual similarity (STS), classification, pair classification, and clustering.

Supported Datasets

Here is an overview of the available tasks and datasets in C-MTEB:

Name	Hub Link	Description	Type	Category	Number of Test Samples
T2Retrieval	C-MTEB/T2Retrieval	T2Ranking: A large-scale Chinese paragraph ranking benchmark	Retrieval	s2p	24,832
MMarcoRetrieval	C-MTEB/MMarcoRetrieval	mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset	Retrieval	s2p	7,437
DuRetrieval	C-MTEB/DuRetrieval	A large-scale Chinese web search engine paragraph retrieval benchmark	Retrieval	s2p	4,000
CovidRetrieval	C-MTEB/CovidRetrieval	COVID-19 news articles	Retrieval	s2p	949
CmedqaRetrieval	C-MTEB/CmedqaRetrieval	Online medical consultation texts	Retrieval	s2p	3,999
EcomRetrieval	C-MTEB/EcomRetrieval	Paragraph retrieval dataset collected from Alibaba e-commerce search engine systems	Retrieval	s2p	1,000
MedicalRetrieval	C-MTEB/MedicalRetrieval	Paragraph retrieval dataset collected from Alibaba medical search engine systems	Retrieval	s2p	1,000
VideoRetrieval	C-MTEB/VideoRetrieval	Paragraph retrieval dataset collected from Alibaba video search engine systems	Retrieval	s2p	1,000
T2Reranking	C-MTEB/T2Reranking	T2Ranking: A large-scale Chinese paragraph ranking benchmark	Re-ranking	s2p	24,382
MMarcoReranking	C-MTEB/MMarco-reranking	mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset	Re-ranking	s2p	7,437
CMedQAv1	C-MTEB/CMedQAv1-reranking	Chinese community medical Q&A	Re-ranking	s2p	2,000
CMedQAv2	C-MTEB/CMedQAv2-reranking	Chinese community medical Q&A	Re-ranking	s2p	4,000
Ocnli	C-MTEB/OCNLI	Original Chinese natural language inference dataset	Pair Classification	s2s	3,000
Cmnli	C-MTEB/CMNLI	Chinese multi-class natural language inference	Pair Classification	s2s	139,000
CLSClusteringS2S	C-MTEB/CLSClusteringS2S	Clustering titles from the CLS dataset. Clustering based on 13 sets of main categories.	Clustering	s2s	10,000
CLSClusteringP2P	C-MTEB/CLSClusteringP2P	Clustering titles + abstracts from the CLS dataset. Clustering based on 13 sets of main categories.	Clustering	p2p	10,000
ThuNewsClusteringS2S	C-MTEB/ThuNewsClusteringS2S	Clustering titles from the THUCNews dataset	Clustering	s2s	10,000
ThuNewsClusteringP2P	C-MTEB/ThuNewsClusteringP2P	Clustering titles + abstracts from the THUCNews dataset	Clustering	p2p	10,000
ATEC	C-MTEB/ATEC	ATEC NLP Sentence Pair Similarity Competition	STS	s2s	20,000
BQ	C-MTEB/BQ	Banking Question Semantic Similarity	STS	s2s	10,000
LCQMC	C-MTEB/LCQMC	Large-scale Chinese Question Matching Corpus	STS	s2s	12,500
PAWSX	C-MTEB/PAWSX	Translated PAWS evaluation pairs	STS	s2s	2,000
STSB	C-MTEB/STSB	Translated STS-B into Chinese	STS	s2s	1,360
AFQMC	C-MTEB/AFQMC	Ant Financial Question Matching Corpus	STS	s2s	3,861
QBQTC	C-MTEB/QBQTC	QQ Browser Query Title Corpus	STS	s2s	5,000
TNews	C-MTEB/TNews-classification	News Short Text Classification	Classification	s2s	10,000
IFlyTek	C-MTEB/IFlyTek-classification	Long Text Classification of Application Descriptions	Classification	s2s	2,600
Waimai	C-MTEB/waimai-classification	Sentiment Analysis of User Reviews on Food Delivery Platforms	Classification	s2s	1,000
OnlineShopping	C-MTEB/OnlineShopping-classification	Sentiment Analysis of User Reviews on Online Shopping Websites	Classification	s2s	1,000
MultilingualSentiment	C-MTEB/MultilingualSentiment-classification	A set of multilingual sentiment datasets grouped into three categories: positive, neutral, negative	Classification	s2s	3,000
JDReview	C-MTEB/JDReview-classification	Reviews of iPhone	Classification	s2s	533

For retrieval tasks, a sample of 100,000 candidates (including the ground truth) is drawn from the entire corpus to reduce inference costs.
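The sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual code: the helper name `sample_candidates`, its parameters, and the fixed seed are hypothetical.

```python
import random


def sample_candidates(corpus_ids, relevant_ids, pool_size=100_000, seed=0):
    """Return a candidate pool that always contains every relevant document.

    Hypothetical helper: the name, parameters and seeding are assumptions.
    """
    relevant = set(relevant_ids)
    # Documents that are not ground truth for any query.
    others = [doc_id for doc_id in corpus_ids if doc_id not in relevant]
    # Fill the remaining slots with randomly drawn non-relevant documents.
    n_extra = max(0, min(pool_size - len(relevant), len(others)))
    rng = random.Random(seed)
    return relevant | set(rng.sample(others, n_extra))


corpus = [f"doc{i}" for i in range(1, 1001)]
pool = sample_candidates(corpus, relevant_ids=["doc7", "doc42"], pool_size=100)
print(len(pool), "doc7" in pool, "doc42" in pool)
```

Keeping the ground-truth documents in the pool means a perfect model can still reach full recall on the reduced corpus.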

Environment Setup

Install dependencies

pip install mteb

Configure Evaluation Parameters

The framework supports two evaluation modes, single-stage and two-stage:

Single-Stage Evaluation: Use the model directly for prediction and compute metrics. Supports tasks such as retrieval, re-ranking, and classification for embedding models.
Two-Stage Evaluation: Use an embedding model for retrieval first, then use a re-ranking model to reorder the retrieved results, and compute metrics. Supports re-ranking (cross-encoder) models.
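The difference between the two modes can be sketched with toy scoring functions. The functions `retrieve`, `rerank`, `word_overlap`, and `overlap_ratio` below are hypothetical stand-ins for real models and are not part of the framework. Single-stage evaluation corresponds to scoring the output of `retrieve` alone, and two-stage evaluation scores the output of `rerank`.

```python
def retrieve(query, corpus, embed_score, top_k):
    """Stage 1: rank the whole corpus with a cheap score, keep the top_k ids."""
    ranked = sorted(
        corpus, key=lambda doc_id: embed_score(query, corpus[doc_id]), reverse=True
    )
    return ranked[:top_k]


def rerank(query, corpus, candidates, cross_score):
    """Stage 2: reorder only the retrieved candidates with a costlier score."""
    return sorted(
        candidates, key=lambda doc_id: cross_score(query, corpus[doc_id]), reverse=True
    )


# Toy stand-ins for an embedding model and a cross-encoder (assumptions).
def word_overlap(query, text):
    return len(set(query.lower().split()) & set(text.lower().split()))


def overlap_ratio(query, text):
    words = set(text.lower().split())
    return len(set(query.lower().split()) & words) / len(words)


corpus = {
    "doc1": "the stock market surged today led by tech stocks",
    "doc2": "remote work offers flexibility",
    "doc3": "stock market today",
}
query = "stock market today"
candidates = retrieve(query, corpus, word_overlap, top_k=2)
final = rerank(query, corpus, candidates, overlap_ratio)
print(candidates, final)
```

The second stage never sees documents the first stage dropped, so it can improve the order of the candidates but not add missing ones.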

Single-stage Evaluation

Example configuration file:

one_stage_task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "model": [
            {
                "model_name_or_path": "AI-ModelScope/m3e-base",
                "pooling_mode": None,
                "max_seq_length": 512,
                "prompt": "",
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {
                    "batch_size": 128,
                },
            }
        ],
        "eval": {
            "tasks": [
                "TNews",
                "CLSClusteringS2S",
                "T2Reranking",
                "T2Retrieval",
                "ATEC",
            ],
            "verbosity": 2,
            "output_folder": "outputs",
            "overwrite_results": True,
            "top_k": 10,
            "limits": 500,
        },
    },
}

Parameter Explanation

eval_backend: Default value is RAGEval, indicating the use of the RAGEval evaluation backend.
eval_config: A dictionary containing the following fields:
- tool: The evaluation tool, using MTEB.
- model: A list of model configurations. Single-stage evaluation accepts exactly one model, with the following fields:
  - model_name_or_path: str The model name or path. Supports automatic downloading from the ModelScope repository.
  - is_cross_encoder: bool Whether the model is a cross-encoder. Default is False.
  - pooling_mode: Optional[str] Pooling mode. Default is mean. Possible values are: “cls”, “lasttoken”, “max”, “mean”, “mean_sqrt_len_tokens”, or “weightedmean”. For the bge series models, please set it to “cls”.
  - max_seq_length: int Maximum sequence length. Default is 512.
  - prompt: str Prompt prepended to the input for retrieval tasks. Default is an empty string.
  - model_kwargs: dict Keyword arguments for the model. Default value is {"torch_dtype": "auto"}.
  - config_kwargs: Dict[str, Any] Keyword arguments for the configuration. Default is an empty dictionary.
  - encode_kwargs: dict Keyword arguments for encoding. Default value is:
    { "show_progress_bar": True, "batch_size": 32 }
  - hub: str Source of the model, which can be “modelscope” or “huggingface”.
- eval: A dictionary containing the following fields:
  - tasks: List[str] Task names. See Task List.
  - top_k: int Number of top results to select for retrieval tasks.
  - verbosity: int Level of verbosity. Range is 0-3.
  - output_folder: str Output folder. Default is “outputs”.
  - overwrite_results: bool Whether to overwrite results. Default is True.
  - limits: Optional[int] Limit the number of samples. Default is None; it is not recommended to set this for retrieval tasks.
  - hub: str Source of the dataset, which can be “modelscope” or “huggingface”.
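To make the pooling_mode values listed above concrete, the sketch below shows how these modes are commonly defined, for example in sentence-transformers. That the framework follows exactly these definitions is an assumption. The helper `pool` is hypothetical and ignores padding and attention masks.

```python
import math


def pool(token_vectors, mode="mean"):
    """Collapse per-token vectors (no padding) into one sentence vector."""
    n = len(token_vectors)
    columns = list(zip(*token_vectors))  # one tuple per embedding dimension
    if mode == "cls":
        return list(token_vectors[0])  # first token only
    if mode == "lasttoken":
        return list(token_vectors[-1])  # last token only
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "mean":
        return [sum(col) / n for col in columns]
    if mode == "mean_sqrt_len_tokens":
        return [sum(col) / math.sqrt(n) for col in columns]
    if mode == "weightedmean":
        weights = range(1, n + 1)  # later tokens weigh more
        total = sum(weights)
        return [sum(w * x for w, x in zip(weights, col)) / total for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


tokens = [[1.0, 0.0], [3.0, 4.0], [2.0, 2.0], [2.0, 6.0]]
for mode in ("cls", "lasttoken", "max", "mean", "mean_sqrt_len_tokens", "weightedmean"):
    print(mode, pool(tokens, mode))
```

A model trained with one pooling mode generally needs the same mode at evaluation time, which is why the list above recommends "cls" for the bge series.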

Two-stage Evaluation

Example configuration that performs retrieval first and then reranking:

two_stage_task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "model": [
            {
                "model_name_or_path": "AI-ModelScope/m3e-base",
                "is_cross_encoder": False,
                "max_seq_length": 512,
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {
                    "batch_size": 64,
                },
            },
            {
                "model_name_or_path": "OpenBMB/MiniCPM-Reranker",
                "is_cross_encoder": True,
                "max_seq_length": 512,
                "prompt": "Generate a retrieval representation for this question",
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {
                    "batch_size": 32,
                },
            },
        ],
        "eval": {
            "tasks": ["T2Retrieval"],
            "verbosity": 2,
            "output_folder": "outputs",
            "overwrite_results": True,
            "top_k": 5,
            "limits": 100,
        },
    },
}

Parameter Explanation

The basic parameters are the same as those for single-stage evaluation. The difference lies in the model field, where two models need to be provided. The first model is used for retrieval, and the second model is used for reranking. The reranking model needs to be a cross-encoder, i.e., is_cross_encoder=True.
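The rules above can be checked before launching a run. The following is a minimal sketch. `check_two_stage_models` is a hypothetical helper and not part of the framework, and it only encodes the two constraints stated in this section.

```python
def check_two_stage_models(task_cfg):
    """Raise ValueError unless the config follows the two-stage rules above."""
    models = task_cfg["eval_config"]["model"]
    if len(models) != 2:
        raise ValueError(f"two-stage evaluation needs 2 models, got {len(models)}")
    retriever, reranker = models
    # is_cross_encoder defaults to False when omitted.
    if not reranker.get("is_cross_encoder", False):
        raise ValueError("the second (reranking) model must set is_cross_encoder=True")
    return retriever["model_name_or_path"], reranker["model_name_or_path"]


cfg = {
    "eval_config": {
        "model": [
            {"model_name_or_path": "AI-ModelScope/m3e-base"},
            {"model_name_or_path": "OpenBMB/MiniCPM-Reranker", "is_cross_encoder": True},
        ]
    }
}
print(check_two_stage_models(cfg))
```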

Model Evaluation

from evalscope.run import run_task
from evalscope.utils.logger import get_logger

logger = get_logger()

# one_stage_task_cfg and two_stage_task_cfg are the dictionaries defined above.

# Run the single-stage evaluation
run_task(task_cfg=one_stage_task_cfg)

# or run the two-stage evaluation
# run_task(task_cfg=two_stage_task_cfg)

The following is an example of the output:

Single-stage Evaluation

Outputs

outputs/m3e-base/master/TNews.json

{
  "dataset_revision": "317f262bf1e6126357bbe89e875451e4b0938fe4",
  "evaluation_time": 16.50650382041931,
  "kg_co2_emissions": null,
  "mteb_version": "1.14.15",
  "scores": {
    "validation": [
      {
        "accuracy": 0.4744,
        "f1": 0.44562489526640825,
        "f1_weighted": 0.47540307398330806,
        "hf_subset": "default",
        "languages": [
          "cmn-Hans"
        ],
        "main_score": 0.4744,
        "scores_per_experiment": [
          {
            "accuracy": 0.48,
            "f1": 0.4536376605217497,
            "f1_weighted": 0.47800277926811163
          },
          {
            "accuracy": 0.48,
            "f1": 0.44713633954639176,
            "f1_weighted": 0.4826984434763292
          },
          {
            "accuracy": 0.462,
            "f1": 0.433365706955334,
            "f1_weighted": 0.4640970055245127
          },
          {
            "accuracy": 0.484,
            "f1": 0.4586732839614161,
            "f1_weighted": 0.4857359110392786
          },
          {
            "accuracy": 0.462,
            "f1": 0.4293797541165097,
            "f1_weighted": 0.4632657330831137
          },
          {
            "accuracy": 0.474,
            "f1": 0.44775120246296396,
            "f1_weighted": 0.4737182842092953
          },
          {
            "accuracy": 0.47,
            "f1": 0.4431197566080463,
            "f1_weighted": 0.4714830140231783
          },
          {
            "accuracy": 0.472,
            "f1": 0.44322381694059326,
            "f1_weighted": 0.47100005556357255
          },
          {
            "accuracy": 0.484,
            "f1": 0.45454749692062835,
            "f1_weighted": 0.4856239367465818
          },
          {
            "accuracy": 0.476,
            "f1": 0.44541393463044954,
            "f1_weighted": 0.47840557689910646
          }
        ]
      }
    ]
  },
  "task_name": "TNews"
}
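In the result above, main_score equals the mean accuracy over the ten entries in scores_per_experiment. The sketch below checks this on a trimmed copy of the file. Only the fields it uses are reproduced, and the variable names are illustrative.

```python
import json
from statistics import mean

# Trimmed copy of the TNews result shown above (only the fields used here).
result_text = """
{
  "task_name": "TNews",
  "scores": {
    "validation": [
      {
        "main_score": 0.4744,
        "scores_per_experiment": [
          {"accuracy": 0.48}, {"accuracy": 0.48}, {"accuracy": 0.462},
          {"accuracy": 0.484}, {"accuracy": 0.462}, {"accuracy": 0.474},
          {"accuracy": 0.47}, {"accuracy": 0.472}, {"accuracy": 0.484},
          {"accuracy": 0.476}
        ]
      }
    ]
  }
}
"""

result = json.loads(result_text)
split_scores = result["scores"]["validation"][0]
mean_accuracy = mean(run["accuracy"] for run in split_scores["scores_per_experiment"])
print(result["task_name"], round(mean_accuracy, 4), split_scores["main_score"])
```

To inspect a real run, load the file from the output folder with `json.load` instead of parsing an inline string.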

Two-stage Evaluation

First stage

outputs/stage1/m3e-base/v1/T2Retrieval.json

{
  "dataset_revision": "8731a845f1bf500a4f111cf1070785c793d10e64",
  "evaluation_time": 599.5170171260834,
  "kg_co2_emissions": null,
  "mteb_version": "1.14.15",
  "scores": {
    "dev": [
      {
        "hf_subset": "default",
        "languages": [
          "cmn-Hans"
        ],
        "main_score": 0.73143,
        "map_at_1": 0.22347,
        "map_at_10": 0.63237,
        "map_at_100": 0.67533,
        "map_at_1000": 0.67651,
        "map_at_20": 0.66282,
        "map_at_3": 0.43874,
        "map_at_5": 0.54049,
        "mrr_at_1": 0.7898912852884447,
        "mrr_at_10": 0.8402654617870331,
        "mrr_at_100": 0.8421827758769684,
        "mrr_at_1000": 0.8422583001072272,
        "mrr_at_20": 0.8415411456315557,
        "mrr_at_3": 0.8307469752761716,
        "mrr_at_5": 0.8368029984218875,
        "nauc_map_at_1000_diff1": 0.17749400860890877,
        "nauc_map_at_1000_max": 0.42844516520725967,
        "nauc_map_at_1000_std": 0.18789871694419072,
        "nauc_map_at_100_diff1": 0.17747467084779375,
        "nauc_map_at_100_max": 0.42732291785494575,
        "nauc_map_at_100_std": 0.18694287087286737,
        "nauc_map_at_10_diff1": 0.19976199493034202,
        "nauc_map_at_10_max": 0.3374436217668296,
        "nauc_map_at_10_std": 0.07951451707732717,
        "nauc_map_at_1_diff1": 0.41727578149080663,
        "nauc_map_at_1_max": -0.1402656422184478,
        "nauc_map_at_1_std": -0.26168722519030313,
        "nauc_map_at_20_diff1": 0.1811898211371171,
        "nauc_map_at_20_max": 0.40563441466210043,
        "nauc_map_at_20_std": 0.15927727170010608,
        "nauc_map_at_3_diff1": 0.31255422845809033,
        "nauc_map_at_3_max": 0.007523677231905161,
        "nauc_map_at_3_std": -0.19578481884353466,
        "nauc_map_at_5_diff1": 0.26073699217160473,
        "nauc_map_at_5_max": 0.14665611579604088,
        "nauc_map_at_5_std": -0.09600383298672226,
        "nauc_mrr_at_1000_diff1": 0.3819666309367981,
        "nauc_mrr_at_1000_max": 0.6285393024619401,
        "nauc_mrr_at_1000_std": 0.3294970299417527,
        "nauc_mrr_at_100_diff1": 0.3819436006743644,
        "nauc_mrr_at_100_max": 0.6286346262471935,
        "nauc_mrr_at_100_std": 0.32963045935037844,
        "nauc_mrr_at_10_diff1": 0.3819124721154632,
        "nauc_mrr_at_10_max": 0.6292778905762176,
        "nauc_mrr_at_10_std": 0.3298187966196067,
        "nauc_mrr_at_1_diff1": 0.3862589251033909,
        "nauc_mrr_at_1_max": 0.589976680174432,
        "nauc_mrr_at_1_std": 0.2780515387897469,
        "nauc_mrr_at_20_diff1": 0.38198959771391816,
        "nauc_mrr_at_20_max": 0.6290569436652999,
        "nauc_mrr_at_20_std": 0.3301570340189363,
        "nauc_mrr_at_3_diff1": 0.3825046940733129,
        "nauc_mrr_at_3_max": 0.6282507269128365,
        "nauc_mrr_at_3_std": 0.3260807934869131,
        "nauc_mrr_at_5_diff1": 0.3816317396711923,
        "nauc_mrr_at_5_max": 0.6288655177904692,
        "nauc_mrr_at_5_std": 0.3298854062538469,
        "nauc_ndcg_at_1000_diff1": 0.21319598381916555,
        "nauc_ndcg_at_1000_max": 0.5328295949130256,
        "nauc_ndcg_at_1000_std": 0.2946773445135694,
        "nauc_ndcg_at_100_diff1": 0.2089807772703975,
        "nauc_ndcg_at_100_max": 0.5239397690321543,
        "nauc_ndcg_at_100_std": 0.29123456982125717,
        "nauc_ndcg_at_10_diff1": 0.20555333230027603,
        "nauc_ndcg_at_10_max": 0.44316027023003046,
        "nauc_ndcg_at_10_std": 0.1921835220940756,
        "nauc_ndcg_at_1_diff1": 0.3862589251033909,
        "nauc_ndcg_at_1_max": 0.589976680174432,
        "nauc_ndcg_at_1_std": 0.2780515387897469,
        "nauc_ndcg_at_20_diff1": 0.20754208582741446,
        "nauc_ndcg_at_20_max": 0.4786092392092643,
        "nauc_ndcg_at_20_std": 0.23536973680564616,
        "nauc_ndcg_at_3_diff1": 0.1902823773882388,
        "nauc_ndcg_at_3_max": 0.5400466380622567,
        "nauc_ndcg_at_3_std": 0.2713874990424778,
        "nauc_ndcg_at_5_diff1": 0.18279298790691637,
        "nauc_ndcg_at_5_max": 0.4916119327522918,
        "nauc_ndcg_at_5_std": 0.2375397192963552,
        "nauc_precision_at_1000_diff1": -0.20510380600112582,
        "nauc_precision_at_1000_max": 0.4958820760698651,
        "nauc_precision_at_1000_std": 0.5402465580496146,
        "nauc_precision_at_100_diff1": -0.1994322347949809,
        "nauc_precision_at_100_max": 0.5206762748551254,
        "nauc_precision_at_100_std": 0.5568154081333078,
        "nauc_precision_at_10_diff1": -0.16707155441197413,
        "nauc_precision_at_10_max": 0.5600612846655972,
        "nauc_precision_at_10_std": 0.49419688804691536,
        "nauc_precision_at_1_diff1": 0.3862589251033909,
        "nauc_precision_at_1_max": 0.589976680174432,
        "nauc_precision_at_1_std": 0.2780515387897469,
        "nauc_precision_at_20_diff1": -0.18471041949530417,
        "nauc_precision_at_20_max": 0.5458950955439645,
        "nauc_precision_at_20_std": 0.5355982267058214,
        "nauc_precision_at_3_diff1": -0.03826790088047189,
        "nauc_precision_at_3_max": 0.5833083970750171,
        "nauc_precision_at_3_std": 0.380196662597275,
        "nauc_precision_at_5_diff1": -0.11789367842600275,
        "nauc_precision_at_5_max": 0.5708494593335263,
        "nauc_precision_at_5_std": 0.42860609671688105,
        "nauc_recall_at_1000_diff1": 0.1341309660059583,
        "nauc_recall_at_1000_max": 0.5923755841077135,
        "nauc_recall_at_1000_std": 0.5980459502693942,
        "nauc_recall_at_100_diff1": 0.12181394285840096,
        "nauc_recall_at_100_max": 0.47090136790318127,
        "nauc_recall_at_100_std": 0.3959369184297595,
        "nauc_recall_at_10_diff1": 0.17356300971546512,
        "nauc_recall_at_10_max": 0.25475707245853674,
        "nauc_recall_at_10_std": 0.041819982320384745,
        "nauc_recall_at_1_diff1": 0.41727578149080663,
        "nauc_recall_at_1_max": -0.1402656422184478,
        "nauc_recall_at_1_std": -0.26168722519030313,
        "nauc_recall_at_20_diff1": 0.14273713155999543,
        "nauc_recall_at_20_max": 0.36251116771924663,
        "nauc_recall_at_20_std": 0.1912123941692314,
        "nauc_recall_at_3_diff1": 0.2873719855400218,
        "nauc_recall_at_3_max": -0.041198403561830285,
        "nauc_recall_at_3_std": -0.21921947922872737,
        "nauc_recall_at_5_diff1": 0.23680082643694844,
        "nauc_recall_at_5_max": 0.06580524171324151,
        "nauc_recall_at_5_std": -0.14104561361502632,
        "ndcg_at_1": 0.78989,
        "ndcg_at_10": 0.73143,
        "ndcg_at_100": 0.78829,
        "ndcg_at_1000": 0.80026,
        "ndcg_at_20": 0.75787,
        "ndcg_at_3": 0.7417,
        "ndcg_at_5": 0.72641,
        "precision_at_1": 0.78989,
        "precision_at_10": 0.37304,
        "precision_at_100": 0.04828,
        "precision_at_1000": 0.00511,
        "precision_at_20": 0.21403,
        "precision_at_3": 0.65461,
        "precision_at_5": 0.54942,
        "recall_at_1": 0.22347,
        "recall_at_10": 0.73318,
        "recall_at_100": 0.91093,
        "recall_at_1000": 0.97197,
        "recall_at_20": 0.81286,
        "recall_at_3": 0.46573,
        "recall_at_5": 0.59383
      }
    ]
  },
  "task_name": "T2Retrieval"
}

Second stage

outputs/stage2/jina-reranker-v2-base-multilingual/master/T2Retrieval.json

{
  "dataset_revision": "8731a845f1bf500a4f111cf1070785c793d10e64",
  "evaluation_time": 332.15709686279297,
  "kg_co2_emissions": null,
  "mteb_version": "1.14.15",
  "scores": {
    "dev": [
      {
        "hf_subset": "default",
        "languages": [
          "cmn-Hans"
        ],
        "main_score": 0.661,
        "map_at_1": 0.24264,
        "map_at_10": 0.56291,
        "map_at_100": 0.56291,
        "map_at_1000": 0.56291,
        "map_at_20": 0.56291,
        "map_at_3": 0.4714,
        "map_at_5": 0.56291,
        "mrr_at_1": 0.841969139049623,
        "mrr_at_10": 0.8689147524694633,
        "mrr_at_100": 0.8689147524694633,
        "mrr_at_1000": 0.8689147524694633,
        "mrr_at_20": 0.8689147524694633,
        "mrr_at_3": 0.8664883979192248,
        "mrr_at_5": 0.8689147524694633,
        "nauc_map_at_1000_diff1": 0.12071580301051653,
        "nauc_map_at_1000_max": 0.2536691069727338,
        "nauc_map_at_1000_std": 0.343624832364704,
        "nauc_map_at_100_diff1": 0.12071580301051653,
        "nauc_map_at_100_max": 0.2536691069727338,
        "nauc_map_at_100_std": 0.343624832364704,
        "nauc_map_at_10_diff1": 0.12071580301051653,
        "nauc_map_at_10_max": 0.2536691069727338,
        "nauc_map_at_10_std": 0.343624832364704,
        "nauc_map_at_1_diff1": 0.47964980727810325,
        "nauc_map_at_1_max": -0.08015044571696166,
        "nauc_map_at_1_std": 0.3507257834956417,
        "nauc_map_at_20_diff1": 0.12071580301051653,
        "nauc_map_at_20_max": 0.2536691069727338,
        "nauc_map_at_20_std": 0.343624832364704,
        "nauc_map_at_3_diff1": 0.23481937699306626,
        "nauc_map_at_3_max": 0.10372745264123306,
        "nauc_map_at_3_std": 0.45345158923063256,
        "nauc_map_at_5_diff1": 0.12071580301051653,
        "nauc_map_at_5_max": 0.2536691069727338,
        "nauc_map_at_5_std": 0.343624832364704,
        "nauc_mrr_at_1000_diff1": 0.23393918304502795,
        "nauc_mrr_at_1000_max": 0.8703379129725659,
        "nauc_mrr_at_1000_std": 0.5785333616122065,
        "nauc_mrr_at_100_diff1": 0.23393918304502795,
        "nauc_mrr_at_100_max": 0.8703379129725659,
        "nauc_mrr_at_100_std": 0.5785333616122065,
        "nauc_mrr_at_10_diff1": 0.23393918304502795,
        "nauc_mrr_at_10_max": 0.8703379129725659,
        "nauc_mrr_at_10_std": 0.5785333616122065,
        "nauc_mrr_at_1_diff1": 0.2520016067648708,
        "nauc_mrr_at_1_max": 0.8560897633767299,
        "nauc_mrr_at_1_std": 0.5642467684745208,
        "nauc_mrr_at_20_diff1": 0.23393918304502795,
        "nauc_mrr_at_20_max": 0.8703379129725659,
        "nauc_mrr_at_20_std": 0.5785333616122065,
        "nauc_mrr_at_3_diff1": 0.2343988881957151,
        "nauc_mrr_at_3_max": 0.8695482778251757,
        "nauc_mrr_at_3_std": 0.5799167198804328,
        "nauc_mrr_at_5_diff1": 0.23393918304502795,
        "nauc_mrr_at_5_max": 0.8703379129725659,
        "nauc_mrr_at_5_std": 0.5785333616122065,
        "nauc_ndcg_at_1000_diff1": 0.11252208055013257,
        "nauc_ndcg_at_1000_max": 0.3417865079349515,
        "nauc_ndcg_at_1000_std": 0.3623961771041499,
        "nauc_ndcg_at_100_diff1": 0.11252208055013257,
        "nauc_ndcg_at_100_max": 0.3417865079349515,
        "nauc_ndcg_at_100_std": 0.3623961771041499,
        "nauc_ndcg_at_10_diff1": 0.10015448775533999,
        "nauc_ndcg_at_10_max": 0.3761759074862075,
        "nauc_ndcg_at_10_std": 0.35152523471339914,
        "nauc_ndcg_at_1_diff1": 0.2524564785684737,
        "nauc_ndcg_at_1_max": 0.8566368743831702,
        "nauc_ndcg_at_1_std": 0.5635391925059349,
        "nauc_ndcg_at_20_diff1": 0.11228113618796766,
        "nauc_ndcg_at_20_max": 0.34274993051851965,
        "nauc_ndcg_at_20_std": 0.36216437469674284,
        "nauc_ndcg_at_3_diff1": -0.062134030685870506,
        "nauc_ndcg_at_3_max": 0.7183183844837573,
        "nauc_ndcg_at_3_std": 0.3352626268658533,
        "nauc_ndcg_at_5_diff1": -0.04476981761624879,
        "nauc_ndcg_at_5_max": 0.6272060974309411,
        "nauc_ndcg_at_5_std": 0.21341258393783158,
        "nauc_precision_at_1000_diff1": -0.3554940965683014,
        "nauc_precision_at_1000_max": 0.605443274008298,
        "nauc_precision_at_1000_std": -0.13073611213585504,
        "nauc_precision_at_100_diff1": -0.35549409656830133,
        "nauc_precision_at_100_max": 0.6054432740082977,
        "nauc_precision_at_100_std": -0.13073611213585531,
        "nauc_precision_at_10_diff1": -0.3554940965683011,
        "nauc_precision_at_10_max": 0.6054432740082981,
        "nauc_precision_at_10_std": -0.1307361121358551,
        "nauc_precision_at_1_diff1": 0.2524564785684737,
        "nauc_precision_at_1_max": 0.8566368743831702,
        "nauc_precision_at_1_std": 0.5635391925059349,
        "nauc_precision_at_20_diff1": -0.3554940965683011,
        "nauc_precision_at_20_max": 0.6054432740082981,
        "nauc_precision_at_20_std": -0.1307361121358551,
        "nauc_precision_at_3_diff1": -0.3377658698816377,
        "nauc_precision_at_3_max": 0.6780151277792397,
        "nauc_precision_at_3_std": 0.12291559606586676,
        "nauc_precision_at_5_diff1": -0.3554940965683011,
        "nauc_precision_at_5_max": 0.6054432740082981,
        "nauc_precision_at_5_std": -0.1307361121358551,
        "nauc_recall_at_1000_diff1": 0.1091970342988605,
        "nauc_recall_at_1000_max": 0.18339955163544436,
        "nauc_recall_at_1000_std": 0.30756376767627086,
        "nauc_recall_at_100_diff1": 0.1091970342988605,
        "nauc_recall_at_100_max": 0.18339955163544436,
        "nauc_recall_at_100_std": 0.30756376767627086,
        "nauc_recall_at_10_diff1": 0.1091970342988605,
        "nauc_recall_at_10_max": 0.18339955163544436,
        "nauc_recall_at_10_std": 0.30756376767627086,
        "nauc_recall_at_1_diff1": 0.47964980727810325,
        "nauc_recall_at_1_max": -0.08015044571696166,
        "nauc_recall_at_1_std": 0.3507257834956417,
        "nauc_recall_at_20_diff1": 0.1091970342988605,
        "nauc_recall_at_20_max": 0.18339955163544436,
        "nauc_recall_at_20_std": 0.30756376767627086,
        "nauc_recall_at_3_diff1": 0.22013063499116758,
        "nauc_recall_at_3_max": 0.054749114965246065,
        "nauc_recall_at_3_std": 0.4258163949018153,
        "nauc_recall_at_5_diff1": 0.1091970342988605,
        "nauc_recall_at_5_max": 0.18339955163544436,
        "nauc_recall_at_5_std": 0.30756376767627086,
        "ndcg_at_1": 0.84188,
        "ndcg_at_10": 0.661,
        "ndcg_at_100": 0.6534,
        "ndcg_at_1000": 0.6534,
        "ndcg_at_20": 0.65358,
        "ndcg_at_3": 0.7826,
        "ndcg_at_5": 0.74517,
        "precision_at_1": 0.84188,
        "precision_at_10": 0.27471,
        "precision_at_100": 0.02747,
        "precision_at_1000": 0.00275,
        "precision_at_20": 0.13736,
        "precision_at_3": 0.68633,
        "precision_at_5": 0.54942,
        "recall_at_1": 0.24264,
        "recall_at_10": 0.59383,
        "recall_at_100": 0.59383,
        "recall_at_1000": 0.59383,
        "recall_at_20": 0.59383,
        "recall_at_3": 0.48934,
        "recall_at_5": 0.59383
      }
    ]
  },
  "task_name": "T2Retrieval"
}
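A quick way to read the two result files side by side is to compare a few metrics. The sketch below uses values copied from the outputs above. In these examples, recall_at_5 is identical in both stages, and the second-stage recall stays at that value for every larger cutoff. This is what one would expect if the reranker only reorders the five candidates requested in the two-stage configuration, though that reading is an interpretation and is not stated by the outputs themselves.

```python
# Selected values copied from the two T2Retrieval result files above.
stage1 = {
    "ndcg_at_1": 0.78989, "ndcg_at_5": 0.72641, "ndcg_at_10": 0.73143,
    "recall_at_5": 0.59383, "recall_at_10": 0.73318, "recall_at_100": 0.91093,
}
stage2 = {
    "ndcg_at_1": 0.84188, "ndcg_at_5": 0.74517, "ndcg_at_10": 0.661,
    "recall_at_5": 0.59383, "recall_at_10": 0.59383, "recall_at_100": 0.59383,
}

# Print each metric for both stages together with the signed change.
for name in stage1:
    delta = stage2[name] - stage1[name]
    print(f"{name:<14} {stage1[name]:.5f} -> {stage2[name]:.5f} ({delta:+.5f})")
```

When comparing stages, metrics at cutoffs up to the retrieval depth (here ndcg_at_1 and ndcg_at_5) are the most informative, since deeper cutoffs are limited by the size of the candidate set.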

Custom Dataset Evaluation

Custom Text Retrieval Evaluation

Construct the Dataset with the following format:

retrieval_data
├── corpus.jsonl
├── queries.jsonl
└── qrels
    └── test.tsv

Where:

corpus.jsonl: Corpus file, each line is a JSON object formatted as {"_id": "xxx", "text": "xxx"}. _id is the corpus ID, and text is the corpus text. For example:

{"_id": "doc1", "text": "Climate change is leading to more extreme weather patterns."}
{"_id": "doc2", "text": "The stock market surged today, led by tech stocks."}
{"_id": "doc3", "text": "AI is transforming various industries by automating tasks and providing insights."}
{"_id": "doc4", "text": "With technological advancements, renewable energy sources like wind and solar are becoming more widespread."}
{"_id": "doc5", "text": "Recent studies show that a balanced diet and regular exercise can significantly improve mental health."}
{"_id": "doc6", "text": "Virtual reality is creating new opportunities in education, entertainment, and training."}
{"_id": "doc7", "text": "Electric vehicles are becoming increasingly popular due to environmental benefits and advances in battery technology."}
{"_id": "doc8", "text": "Space exploration missions are revealing new information about our solar system and beyond."}
{"_id": "doc9", "text": "Blockchain technology has potential applications beyond cryptocurrency, including supply chain management and secure voting systems."}
{"_id": "doc10", "text": "The benefits of remote work include greater flexibility and reduced commuting time."}

queries.jsonl: Query file, each line is a JSON object formatted as {"_id": "xxx", "text": "xxx"}. _id is the query ID, and text is the query text. For example:

{"_id": "query1", "text": "What are the impacts of climate change?"}
{"_id": "query2", "text": "What caused the stock market to rise today?"}
{"_id": "query3", "text": "How is AI changing industries?"}
{"_id": "query4", "text": "What are the advancements in renewable energy?"}
{"_id": "query5", "text": "How does a balanced diet improve mental health?"}
{"_id": "query6", "text": "What new opportunities does virtual reality create?"}
{"_id": "query7", "text": "Why are electric vehicles becoming more popular?"}
{"_id": "query8", "text": "What new information has space exploration revealed?"}
{"_id": "query9", "text": "What are the applications of blockchain technology beyond cryptocurrency?"}
{"_id": "query10", "text": "What are the benefits of remote work?"}

qrels: Evaluation file(s) in TSV format, with columns query-id, corpus-id, and score. query-id is the query ID, corpus-id is the corpus ID, and score is the relevance score. For example:

query-id  corpus-id  score
query1    doc1       1
query2    doc2       1
query3    doc3       1
query4    doc4       1
query5    doc5       1
query6    doc6       1
query7    doc7       1
query8    doc8       1
query9    doc9       1
query10   doc10      1
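The three files can be generated from in-memory data with the standard library. This is a minimal sketch: `write_retrieval_dataset` is a hypothetical helper, and it assumes the qrels file is tab-separated with the header row shown above.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_retrieval_dataset(root, corpus, queries, qrels):
    """Write corpus.jsonl, queries.jsonl and qrels/test.tsv under root."""
    root = Path(root)
    (root / "qrels").mkdir(parents=True, exist_ok=True)
    for filename, records in (("corpus.jsonl", corpus), ("queries.jsonl", queries)):
        with open(root / filename, "w", encoding="utf-8") as f:
            for _id, text in records.items():
                # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable.
                f.write(json.dumps({"_id": _id, "text": text}, ensure_ascii=False) + "\n")
    with open(root / "qrels" / "test.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        writer.writerows(qrels)
    return root


corpus = {"doc1": "Climate change is leading to more extreme weather patterns."}
queries = {"query1": "What are the impacts of climate change?"}
qrels = [("query1", "doc1", 1)]
dataset_dir = write_retrieval_dataset(tempfile.mkdtemp(), corpus, queries, qrels)
print(sorted(p.relative_to(dataset_dir).as_posix() for p in dataset_dir.rglob("*")))
```

The returned directory is the value to pass as the dataset path in the evaluation configuration.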

Construct the Configuration File

task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "model": [
            {
                "model_name_or_path": "AI-ModelScope/m3e-base",
                "pooling_mode": None,  # load from model config
                "max_seq_length": 512,
                "prompt": "",
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {
                    "batch_size": 128,
                },
            }
        ],
        "eval": {
            "tasks": ["CustomRetrieval"],
            "dataset_path": "custom_eval/text/retrieval",
            "verbosity": 2,
            "output_folder": "outputs",
            "overwrite_results": True,
            "limits": 500,
        },
    },
}

Parameter description. Most parameters are the same as in the default configuration; the following must be changed:

eval:
- tasks: Evaluation task, must be CustomRetrieval.
- dataset_path: Path to the custom dataset.

Run the Evaluation

from evalscope.run import run_task

run_task(task_cfg=task_cfg)